Definition
Quantization is the process of reducing the numerical precision of a model's weights (and sometimes activations) from higher-bit formats (float32, float16) to lower-bit formats (int8, int4). This shrinks memory usage and increases inference speed, with controlled tradeoffs in model quality.
Why Quantization Is Essential
A 70B-parameter model in float16 needs ~140 GB of GPU memory for its weights alone, which means two A100 80GB GPUs just to load it. Approximate weight footprint at each precision:
| Precision | Bits/param | 70B Model Size | Runs On |
|-----------|-----------|---------------|---------|
| float32 | 32 | ~280 GB | 4× A100 |
| float16/bfloat16 | 16 | ~140 GB | 2× A100 |
| int8 | 8 | ~70 GB | 1× A100 |
| int4 | 4 | ~35 GB | 2× RTX 4090 |
| int3 | 3 | ~26 GB | 1× RTX 3090 (24 GB) with partial CPU offload |
| int2 | 2 | ~17 GB | Consumer GPU |
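The sizes in the table follow from simple arithmetic: parameters × bits per parameter ÷ 8 gives bytes. A minimal sketch, assuming decimal gigabytes and counting weights only; the helper name `weight_memory_gb` is illustrative, and real quantized files tend to run slightly larger because they also store scales and other metadata.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Reproduce the 70B column of the table above.
FORMATS = [("float32", 32), ("float16", 16), ("int8", 8),
           ("int4", 4), ("int3", 3), ("int2", 2)]

sizes = {name: weight_memory_gb(70e9, bits) for name, bits in FORMATS}
for name, gb in sizes.items():
    print(f"{name:8s} {gb:7.2f} GB")
```

Activations, the KV cache, and framework overhead come on top of these figures, so a model whose weights barely fit may still fail to load or run.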
Number Format Basics
| Format | Range | Precision | Use |
|--------|-------|-----------|-----|
| float32 (FP32) | ±3.4×10^38 | ~7 decimal digits | Full training |
| float16 (FP16) | ±65504 | ~3 decimal digits | Training/inference |
| bfloat16 (BF16) | ±3.4×10^38 | ~2 decimal digits | Training preferred |
| int8 | -128 to 127 | 256 discrete values | Efficient inference |
| int4 | -8 to 7 | 16 discrete values | Aggressive inference |
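To make the int8 row concrete, here is a minimal sketch of symmetric (absmax) quantization, one simple scheme among many. The function names are illustrative and do not come from any library. This sketch clamps to [-127, 127] so the range is symmetric around zero, which leaves the -128 code unused.

```python
def quantize_absmax_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]


weights = [0.12, -0.50, 0.03, 0.25, -0.31]
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is at most half a quantization step (scale / 2).
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print("codes:", q)
print("scale:", scale)
print("max abs error:", max_err)
```

Because a single outlier sets the scale for every value, practical methods usually quantize in small groups or per channel rather than per tensor.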
Quantization Methods
Post-Training Quantization (PTQ)
Apply after training is complete — no additional training required:
GPTQ (Generative Pre-trained Transformer Quantization)
- Uses sample calibration data to minimize quantization error
- Layer-by-layer quantization using second-order information
- INT4 quality close to FP16
- Used by: AutoGPTQ, HuggingFace Transformers
AWQ (Activation-Aware Weight Quantization)
- Finds the most important weights (high activation magnitude) and protects them
- Better quality than naive INT4
- Used widely in production
GGUF / llama.cpp quantization
- Format used by llama.cpp for CPU+GPU inference
- Multiple quantization levels: Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_0
- Q4_K_M (4-bit with mixed precision) is a popular sweet spot for quality/size
BitsAndBytes (bitsandbytes library)
- Dynamic INT8 quantization with LLM.int8()
- NF4 (Normal Float 4) for QLoRA fine-tuning
- Easy integration with HuggingFace Transformers
Quantization-Aware Training (QAT)
- Simulates quantization during training so the model learns to be robust to it
- Better quality than PTQ but requires training
- Less common for LLMs due to training cost
Weight-Only vs. Weight + Activation Quantization
Weight-Only (W4A16, W8A16)
- Weights stored in INT4/INT8
- Activations remain in float16 at inference
- Simpler to implement, good quality
- Most common in practice
Full Quantization (W8A8, W4A8)
- Both weights AND activations quantized
- More hardware-efficient (INT8 matrix multiply is fast on modern hardware)
- Harder to implement without quality loss
- Used by NVIDIA TensorRT-LLM, Intel Neural Compressor
Quality vs. Size Tradeoff
Rule of thumb for language models:
| Quantization | Quality Loss | Use When |
|-------------|-------------|---------|
| INT8 (8-bit) | < 1% perplexity degradation | Safe default, minimal loss |
| INT4 (4-bit) | ~1–5% perplexity degradation | Good balance, widely used |
| INT3 (3-bit) | ~5–15% perplexity degradation | Only when size is critical |
| INT2 (2-bit) | Severe quality loss | Experimental |
Larger models tolerate quantization better: a 70B model at INT4 often beats a 13B model at FP16.
Quantization Formats in the Wild
| Format | Tool | Description |
|--------|------|-------------|
| GGUF | llama.cpp, Ollama | CPU/GPU, many quant levels, widely used |
| GPTQ | AutoGPTQ, HuggingFace | GPU inference, INT4/INT8 |
| AWQ | AutoAWQ | GPU, activation-aware INT4 |
| EXL2 | ExLlamaV2 | Flexible mixed-precision, high quality |
| MLX (4-bit) | Apple MLX | Apple Silicon optimized |
Related terms: Parameters, Inference, Latency, KV Cache, LoRA/PEFT, Memory, Model Deployment
Flash Attention and Quantization Combined
Modern inference stacks combine:
- Quantized weights (INT4 or INT8) for memory efficiency
- Flash Attention for compute efficiency
- KV cache in fp16 (often the memory bottleneck for long contexts)
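The KV cache's footprint can be estimated with a short calculation, which shows why it becomes the bottleneck at long context lengths even when the weights are quantized. This is a rough sketch: the function name and the layer and head shapes are assumed values for a 70B-class model with grouped-query attention, not figures taken from this article.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size; bytes_per_elem=2 corresponds to fp16."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed, illustrative shapes: 80 layers, 8 KV heads, head dimension 128.
per_token = kv_cache_bytes(80, 8, 128, seq_len=1)
at_32k = kv_cache_bytes(80, 8, 128, seq_len=32_768)

print(f"per token:     {per_token / 1024:.0f} KiB")
print(f"32k context:   {at_32k / 2**30:.0f} GiB per sequence")
```

The cache grows linearly with both sequence length and batch size, so serving many long requests at once can cost more memory than the INT4 weights themselves.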
When to Quantize
| Scenario | Recommendation |
|----------|---------------|
| Running locally (consumer GPU) | INT4 (GGUF Q4_K_M or AWQ) |
| Cloud inference at scale | INT8 or FP16 depending on GPU |
| Fine-tuning with QLoRA | NF4 (4-bit) base + bf16 adapters |
| Production API serving | FP16 with optional INT8 for large models |
| Research / accuracy-critical | FP16 or BF16 |
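As a rough first pass alongside the table above, one can check which precision even fits on a given GPU. The sketch below is a simplified heuristic, not a rule from any tool: the function `highest_precision_that_fits` and its `headroom` fraction (memory reserved for activations and the KV cache) are hypothetical, and the 20% default is an assumption.

```python
# Candidate formats, widest first; bit widths as used throughout this article.
PRECISIONS = [("float16", 16), ("int8", 8), ("int4", 4), ("int3", 3), ("int2", 2)]


def highest_precision_that_fits(n_params, vram_gb, headroom=0.2):
    """Return the widest format whose weights fit in vram_gb, or None."""
    # Hypothetical heuristic: keep a fraction of VRAM free for runtime buffers.
    usable_gb = vram_gb * (1 - headroom)
    for name, bits in PRECISIONS:
        weights_gb = n_params * bits / 8 / 1e9
        if weights_gb <= usable_gb:
            return name
    return None


print(highest_precision_that_fits(7e9, 24))    # 7B model on one 24 GB card
print(highest_precision_that_fits(70e9, 48))   # 70B model across 48 GB total
print(highest_precision_that_fits(70e9, 16))   # too small for any listed format
```

Fitting in memory is only a lower bound; the quality guidance in the tradeoff table still decides whether the resulting precision is acceptable for the task.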
Practical Tools
| Tool | Use Case |
|------|---------|
| bitsandbytes | HuggingFace integration, NF4/INT8 |
| AutoGPTQ | GPTQ quantization |
| AutoAWQ | AWQ quantization |
| llama.cpp | GGUF format, cross-platform |
| Ollama | GGUF-based local inference, user-friendly |
| TensorRT-LLM | NVIDIA production quantization |