DefiledAI Research
BENCHMARK MATRIX
Real-world inference benchmarks measured on consumer and prosumer hardware. All results report sustained generation throughput in tokens per second (tok/s), excluding the time to first token, at default sampling settings.
Methodology
| Parameter | Value |
|---|---|
| Metric | Sustained tok/s, excluding time to first token (TTFT) |
| Prompt | 512-token fixed input, 256-token output |
| Runs | 5 iterations, median reported |
| Driver | CUDA 12.4 / ROCm 6.1 |
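The methodology above can be sketched in a few lines. This is a minimal illustration, not the harness used for these results. It assumes each run is recorded as a list of arrival timestamps (in seconds) for the output tokens; the function names `sustained_tok_per_s` and `median_throughput` are hypothetical.

```python
from statistics import median


def sustained_tok_per_s(token_timestamps):
    """Throughput for one run, excluding time to first token (TTFT).

    token_timestamps: arrival time in seconds of each output token, in order.
    The clock effectively starts when the first token arrives, so prompt
    processing and first-token latency are left out of the measurement.
    """
    if len(token_timestamps) < 2:
        raise ValueError("need at least two tokens to measure throughput")
    elapsed = token_timestamps[-1] - token_timestamps[0]
    if elapsed <= 0:
        raise ValueError("timestamps must be strictly increasing")
    # The first token marks the start, so only the tokens after it are counted.
    return (len(token_timestamps) - 1) / elapsed


def median_throughput(runs):
    """Median tok/s across runs (the methodology uses 5 iterations)."""
    return median(sustained_tok_per_s(ts) for ts in runs)
```

Reporting the median rather than the mean keeps a single slow iteration, for example one affected by thermal throttling, from skewing the result.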
Inference Results
Last updated: 2026-05-28
| Model | Quant | VRAM | Backend | GPU | Tok/s | Date |
|---|---|---|---|---|---|---|
| Llama 3.1 70B | Q4_K_M | 48GB | ExLlamaV2 | 2× RTX 3090 | 21 | 2026-05-28 |
| Llama 3.1 70B | Q5_K_M | 56GB | ExLlamaV2 | 2× RTX 3090 | 16 | 2026-05-28 |
| Qwen 3 72B | Q5_K_M | 64GB | llama.cpp | 2× RTX 3090 | 18 | 2026-05-27 |
| DeepSeek V3 | MoE Q4 | Multi-GPU | TensorRT-LLM | 4× A100 | 39 | 2026-05-26 |
| Mixtral 8x22B | Q4 | 48GB | ExLlamaV2 | 2× RTX 3090 | 27 | 2026-05-25 |
| Phi-3 Medium | Q6_K | 14GB | llama.cpp | RTX 4090 | 68 | 2026-05-24 |
| Gemma 2 27B | Q4_K_M | 18GB | llama.cpp | RTX 4090 | 44 | 2026-05-23 |
| Mistral 7B | Q8_0 | 8GB | llama.cpp | RTX 3080 | 112 | 2026-05-22 |
GPU Comparison — Llama Family (tok/s)
| GPU | VRAM | 7B Q4 | 13B Q4 | 70B Q4 | Street Price |
|---|---|---|---|---|---|
| RTX 4090 | 24GB | 112 | 72 | OOM | $1,600 |
| RTX 3090 | 24GB | 89 | 58 | OOM | $700 |
| 2× RTX 3090 | 48GB | 94 | 61 | 21 | $1,400 |
| RTX 4080 | 16GB | 98 | 61 | OOM | $1,000 |
| RX 7900 XTX | 24GB | 71 | 44 | OOM | $800 |
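Since the comparison lists street prices next to throughput, one way to read it is as throughput per dollar. The sketch below does that for the 7B Q4 column, with values copied from the table; the metric (tok/s per $100) and the function names are illustrative assumptions, not part of the published methodology.

```python
# (GPU, 7B Q4 tok/s, street price in USD), copied from the comparison table.
ROWS = [
    ("RTX 4090", 112, 1600),
    ("RTX 3090", 89, 700),
    ("2× RTX 3090", 94, 1400),
    ("RTX 4080", 98, 1000),
    ("RX 7900 XTX", 71, 800),
]


def tok_per_s_per_100_usd(tok_s, price_usd):
    """Throughput bought per $100 of street price (hypothetical metric)."""
    return tok_s * 100 / price_usd


def rank_by_value(rows):
    """Sort rows from best to worst throughput per dollar."""
    return sorted(
        rows,
        key=lambda row: tok_per_s_per_100_usd(row[1], row[2]),
        reverse=True,
    )


if __name__ == "__main__":
    for gpu, tok_s, price in rank_by_value(ROWS):
        print(f"{gpu}: {tok_per_s_per_100_usd(tok_s, price):.1f} tok/s per $100")
```

This ranking covers only 7B models. It ignores the main reason to buy a second card, which is fitting models such as 70B Q4 that run out of memory on a single 24GB GPU.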
Submit your benchmark results to the forum.