DefiledAI Research

BENCHMARK MATRIX

Real-world inference benchmarks measured on consumer and prosumer hardware. All results are sustained throughput in tokens per second (tok/s), measured after the first token and at default sampling settings.

Methodology

Metric: Sustained tok/s, excluding first token (TTFT)
Prompt: 512-token fixed input, 256-token output
Runs: 5 iterations, median reported
Driver: CUDA 12.4 / ROCm 6.1
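
A minimal sketch of how the metric and run settings above could be computed. It assumes each run is recorded as a list of arrival timestamps (in seconds, measured from request start), one per output token. The function names and the timestamp format are hypothetical illustrations, not the harness used for these results.

```python
from statistics import median


def sustained_tok_per_s(token_times):
    """Throughput after the first token.

    token_times: arrival time of each output token, in seconds from
    request start (assumed format). The clock starts when the first
    token arrives, so time to first token (TTFT) is excluded.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure sustained rate")
    elapsed = token_times[-1] - token_times[0]
    if elapsed <= 0:
        raise ValueError("timestamps must be strictly increasing")
    # The first token is excluded from the count as well as from the clock.
    return (len(token_times) - 1) / elapsed


def median_throughput(runs):
    """Median sustained tok/s across several runs (5 in the methodology)."""
    return median(sustained_tok_per_s(r) for r in runs)


def synthetic_run(ttft, interval, n_tokens=256):
    """Evenly spaced fake timestamps, for demonstration only."""
    return [ttft + i * interval for i in range(n_tokens)]


if __name__ == "__main__":
    # Five synthetic runs with different per-token intervals.
    runs = [synthetic_run(0.5, dt) for dt in (0.04, 0.05, 0.0625, 0.05, 0.1)]
    print(f"median sustained throughput: {median_throughput(runs):.1f} tok/s")
```

Reporting the median rather than the mean keeps a single slow run (for example, one hit by thermal throttling) from dragging the figure down.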

Inference Results

Last updated: 2026-05-28

Model	Quant	VRAM	Backend	GPU	Tok/s	Date
Llama 3.1 70B	Q4_K_M	48GB	ExLlamaV2	2× RTX 3090	21	2026-05-28
Llama 3.1 70B	Q5_K_M	56GB	ExLlamaV2	2× RTX 3090	16	2026-05-28
Qwen 3 72B	Q5_K_M	64GB	llama.cpp	2× RTX 3090	18	2026-05-27
DeepSeek V3	MoE Q4	Multi-GPU	TensorRT-LLM	4× A100	39	2026-05-26
Mixtral 8x22B	Q4	48GB	ExLlamaV2	2× RTX 3090	27	2026-05-25
Phi-3 Medium	Q6_K	14GB	llama.cpp	RTX 4090	68	2026-05-24
Gemma 2 27B	Q4_K_M	18GB	llama.cpp	RTX 4090	44	2026-05-23
Mistral 7B	Q8_0	8GB	llama.cpp	RTX 3080	112	2026-05-22

GPU Comparison — Llama Family (tok/s)

GPU	VRAM	7B Q4	13B Q4	70B Q4	Street Price
RTX 4090	24GB	112	72	OOM	$1,600
RTX 3090	24GB	89	58	OOM	$700
2× RTX 3090	48GB	94	61	21	$1,400
RTX 4080	16GB	98	61	OOM	$1,000
RX 7900 XTX	24GB	71	44	OOM	$800
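
Since the table lists street price next to throughput, one way to read it is as throughput per dollar. The sketch below uses the 7B Q4 column and prices copied from the table; the "tok/s per $1,000" metric and the function name are illustrative assumptions, not figures published with these benchmarks.

```python
# (GPU, 7B Q4 tok/s, street price in USD), copied from the table above.
ROWS = [
    ("RTX 4090", 112, 1600),
    ("RTX 3090", 89, 700),
    ("2× RTX 3090", 94, 1400),
    ("RTX 4080", 98, 1000),
    ("RX 7900 XTX", 71, 800),
]


def tok_per_s_per_1000_usd(tok_s, price_usd):
    """Hypothetical value metric: throughput bought per $1,000 spent."""
    return tok_s / price_usd * 1000


# Sort from best to worst value on the 7B Q4 workload.
ranked = sorted(
    ROWS,
    key=lambda row: tok_per_s_per_1000_usd(row[1], row[2]),
    reverse=True,
)

if __name__ == "__main__":
    for gpu, tok_s, price in ranked:
        value = tok_per_s_per_1000_usd(tok_s, price)
        print(f"{gpu:<12} {value:6.1f} tok/s per $1,000")
```

On this column a single RTX 3090 ranks first. The ranking ignores capacity, though: of the configurations listed, only the 2× RTX 3090 setup runs 70B Q4 at all.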

Submit your benchmark results to the forum.