DefiledAI Research

QUANTIZATION GUIDE

GGUF quantization lets you run large models on consumer hardware by reducing weight precision. This guide covers the major formats, their quality tradeoffs, and how to choose the right one for your hardware.

WHAT IS QUANTIZATION?

Neural network weights are typically stored as 16-bit or 32-bit floating-point numbers. Quantization reduces each weight to fewer bits, trading some model quality (very little at 8 bits, a lot at 2 bits) for much lower VRAM usage and faster inference.

The GGUF format (used by llama.cpp and tools built on it, such as Ollama and LM Studio) supports a wide range of quantization levels. K-quants (Q4_K_M, Q5_K_M, etc.) use a mixed-precision approach that spends more bits on the tensors that matter most for quality.
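To make the idea of "fewer bits per weight" concrete, here is a minimal sketch of block-wise 8-bit quantization, loosely in the spirit of Q8_0. It is an illustration under assumptions, not llama.cpp's implementation: the function names and the pure-Python list representation are hypothetical, and the block size of 32 is used here only as a typical value.

```python
BLOCK_SIZE = 32  # weights per block; each block stores its own scale (assumed value)

def quantize_block(block):
    """Map one block of floats to integers in [-127, 127] plus one float scale."""
    amax = max(abs(w) for w in block)
    scale = (amax / 127) or 1.0  # fall back to 1.0 for an all-zero block
    return scale, [round(w / scale) for w in block]

def quantize(weights, block_size=BLOCK_SIZE):
    """Split a flat list of weights into blocks and quantize each one."""
    return [quantize_block(weights[i:i + block_size])
            for i in range(0, len(weights), block_size)]

def dequantize(blocks):
    """Reconstruct approximate float weights from (scale, ints) blocks."""
    return [scale * q for scale, ints in blocks for q in ints]
```

Each restored weight differs from the original by at most half of its block's scale, which is why 8-bit formats are close to lossless. Lower-bit formats shrink the integer range, so the rounding step gets coarser and the error grows.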

Format Comparison

| Format | Bits | VRAM vs F16 | Quality | Speed | Notes |
|--------|------|-------------|---------|-------|-------|
| F16 | 16 | 1.0× | 100 | 60 | Highest fidelity. Only viable for small models on high-VRAM cards. |
| Q8_0 | 8 | 0.5× | 99 | 75 | Near-lossless. Best quality/size tradeoff for models that fit in VRAM. |
| Q6_K | 6 | 0.38× | 98 | 82 | Excellent quality with meaningful VRAM savings. Recommended for 13B. |
| Q5_K_M | 5 | 0.31× | 96 | 88 | Strong quality. Good default for 7-13B models when VRAM is limited. |
| Q4_K_M | 4 | 0.25× | 92 | 95 | Most popular. Best balance of quality, speed, and VRAM for 70B class. |
| Q3_K_M | 3 | 0.19× | 83 | 100 | Noticeable quality degradation. Use only when VRAM is severely constrained. |
| IQ3_M | ~3.5 | 0.22× | 87 | 92 | Importance-matrix quantization. Better quality than Q3_K_M at similar size. |
| Q2_K | 2 | 0.13× | 65 | 100 | Severe quality loss. Last resort for fitting very large models on limited VRAM. |
| IQ1_M | ~1.5 | 0.09× | 45 | 100 | Extreme compression. Only useful for 405B/671B models on consumer hardware. |

VRAM Requirements by Model Size

| Model | F16 | Q8_0 | Q5_K_M | Q4_K_M | Q2_K |
|-------|-----|------|--------|--------|------|
| 7B | 14GB | 7GB | 5GB | 4GB | 2.5GB |
| 13B | 26GB | 13GB | 9GB | 7GB | 4GB |
| 30B | 60GB | 30GB | 22GB | 17GB | 9GB |
| 70B | 140GB | 70GB | 48GB | 40GB | 22GB |
| 405B | 810GB | 405GB | 280GB | 220GB | 110GB |
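The F16 and Q8_0 columns follow directly from parameter count times bits per weight. The sketch below shows that arithmetic; the function name is hypothetical, and it uses the nominal bit widths from the format comparison, so treat the result as a lower bound rather than an exact figure.

```python
def weight_size_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB: parameters x bits, divided by 8 bits per byte."""
    return params_billion * bits_per_weight / 8

# Nominal bit widths taken from the format comparison in this guide.
for name, bits in [("F16", 16), ("Q8_0", 8), ("Q4_K_M", 4)]:
    print(f"70B at {name}: about {weight_size_gb(70, bits):.0f}GB")
```

For 70B this gives 140GB and 70GB for F16 and Q8_0, matching the table, but only 35GB for Q4_K_M against the table's 40GB. The table's K-quant figures run higher than the nominal estimate, presumably because those formats spend more than their nominal bit width on average.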

QUICK RECOMMENDATION

Consumer GPU (≤24GB)
Use Q4_K_M for 7-13B models. 70B at Q4_K_M (40GB) will not fit on a single card; you'll need two GPUs. NVLink is not required.
Dual GPU (48GB)
Run 70B at Q4_K_M (40GB) comfortably. Q5_K_M gives better quality but needs about 48GB for 70B, which leaves no headroom on a 48GB setup.
Quality Priority
Always use the highest-precision quant that fits. Choose Q6_K or Q8_0 for smaller models if VRAM allows.
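The "highest-precision quant that fits" rule can be sketched as a small lookup over the VRAM table in this guide. The function name and the `headroom_gb` parameter are assumptions added for illustration; the 2GB default is an arbitrary allowance for context and runtime overhead, not a figure from this guide.

```python
# VRAM figures (GB) copied from this guide's table, highest precision first.
VRAM_GB = {
    "7B":   {"F16": 14,  "Q8_0": 7,   "Q5_K_M": 5,   "Q4_K_M": 4,   "Q2_K": 2.5},
    "13B":  {"F16": 26,  "Q8_0": 13,  "Q5_K_M": 9,   "Q4_K_M": 7,   "Q2_K": 4},
    "30B":  {"F16": 60,  "Q8_0": 30,  "Q5_K_M": 22,  "Q4_K_M": 17,  "Q2_K": 9},
    "70B":  {"F16": 140, "Q8_0": 70,  "Q5_K_M": 48,  "Q4_K_M": 40,  "Q2_K": 22},
    "405B": {"F16": 810, "Q8_0": 405, "Q5_K_M": 280, "Q4_K_M": 220, "Q2_K": 110},
}

def pick_quant(model, vram_gb, headroom_gb=2.0):
    """Return the highest-precision format that fits with headroom, or None."""
    # Dicts preserve insertion order, so formats are tried from F16 downward.
    for fmt, need_gb in VRAM_GB[model].items():
        if need_gb + headroom_gb <= vram_gb:
            return fmt
    return None

print(pick_quant("70B", 48))   # two 24GB cards
print(pick_quant("13B", 24))   # one 24GB card
```

With the default headroom, a 48GB setup lands on Q4_K_M for 70B and a 24GB card lands on Q8_0 for 13B. Setting `headroom_gb=0` makes the check a bare size comparison, which is more optimistic than most real setups allow.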