Quantization — FDE@ProdAI Blog

Definition

Quantization is the process of reducing the numerical precision of a model's weights (and sometimes activations) from higher-bit formats (float32, float16) to lower-bit formats (int8, int4). This shrinks memory usage and increases inference speed, with controlled tradeoffs in model quality.
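
To make the definition concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. The function names and the sample weights are illustrative assumptions, not any particular library's API; real toolkits work on tensors and often use per-channel or per-group scales.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # float value of one integer step
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integers and the stored scale."""
    return [qi * scale for qi in q]


weights = [0.42, -1.27, 0.05, 0.88]  # hypothetical weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per weight is at most half a step (`scale / 2`), which is the "controlled tradeoff" in practice: fewer bits means a larger step.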

Why Quantization Is Essential

A 70B-parameter model in float16 needs about 140 GB of GPU memory for its weights alone, which means two A100 80GB GPUs just to load it. Approximate weight memory at each precision:

| Precision | Bits per weight | Weight memory (70B) | Example hardware |
|-----------|-----------|---------------|---------|
| float32 | 32 | ~280 GB | 4× A100 |
| float16/bfloat16 | 16 | ~140 GB | 2× A100 |
| int8 | 8 | ~70 GB | 1× A100 |
| int4 | 4 | ~35 GB | 2× RTX 4090 |
| int3 | 3 | ~26 GB | 2× RTX 3090 |
| int2 | 2 | ~17 GB | Consumer GPU |
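
The memory figures follow from simple arithmetic: parameters times bits per weight, divided by 8 to get bytes. A small sketch, assuming weights only (no KV cache, activations, or per-group scale overhead) and 1 GB = 10^9 bytes; the function name is made up for illustration.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Weights-only memory in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


formats = [("float32", 32), ("float16", 16), ("int8", 8),
           ("int4", 4), ("int3", 3), ("int2", 2)]
for name, bits in formats:
    print(f"{name:>8}: {weight_memory_gb(70e9, bits):6.1f} GB")
```

Real quantized checkpoints are usually a little larger than this estimate because they also store scales and sometimes zero points.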

Number Format Basics

| Format | Range | Precision | Use |
|--------|-------|-----------|-----|
| int8 | -128 to 127 | 256 discrete values | Efficient inference |
| int4 | -8 to 7 | 16 discrete values | Aggressive inference |
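
The ranges above are the standard two's-complement limits for signed integers. A short sketch (the helper name is hypothetical) that derives them for any bit width:

```python
def signed_int_range(bits):
    """Return (min, max, number of distinct values) for a signed integer of `bits` bits."""
    lo = -(2 ** (bits - 1))
    hi = 2 ** (bits - 1) - 1
    return lo, hi, 2 ** bits


for bits in (8, 4, 3, 2):
    print(bits, signed_int_range(bits))
```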

Quantization Methods

Post-Training Quantization (PTQ)

Apply after training is complete — no additional training required:

GPTQ (Generative Pre-trained Transformer Quantization)

Uses sample calibration data to minimize quantization error
Layer-by-layer quantization using second-order information
INT4 quality close to FP16
Used in: AutoGPTQ, HuggingFace Transformers

AWQ (Activation-Aware Weight Quantization)

Finds the most important weights (high activation magnitude) and protects them
Better quality than naive INT4
Used widely in production
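
The "protect the important weights" idea can be illustrated with a toy example. This is a heavily simplified sketch, not the AWQ algorithm itself: it uses one hand-picked scale on one channel, whereas the real method searches for per-channel scales using calibration data. All numbers and names here are assumptions chosen for illustration.

```python
def quantize_group_int4(weights):
    """Symmetric INT4 with a single step size shared by the whole group."""
    step = max(abs(w) for w in weights) / 7
    return [max(-8, min(7, round(w / step))) * step for w in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


weights = [0.9, 0.31, -0.52, 0.13]
activations = [0.1, 8.0, 0.2, 0.1]  # channel 1 sees large activations, so it is "salient"
exact = dot(weights, activations)

# Naive INT4: every weight gets the same treatment.
naive = quantize_group_int4(weights)
naive_error = abs(dot(naive, activations) - exact)

# Activation-aware: scale the salient weight up before quantizing, then divide
# the scale back out (in practice the inverse scale is folded into the activations).
scales = [1.0, 2.0, 1.0, 1.0]  # hypothetical hand-picked scale for channel 1
scaled = quantize_group_int4([w * s for w, s in zip(weights, scales)])
protected = [wq / s for wq, s in zip(scaled, scales)]
protected_error = abs(dot(protected, activations) - exact)

print(naive_error, protected_error)
```

Because the scaled weight still sits below the group maximum, the step size is unchanged and the salient weight's rounding error shrinks by roughly the scale factor.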

GGUF / llama.cpp quantization

Format used by llama.cpp for CPU+GPU inference
Multiple quantization levels: Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_0
Q4_K_M (4-bit with mixed precision) is a popular sweet spot for quality/size
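
GGUF quantization levels store weights in small blocks, each with its own scale. The sketch below shows that general idea in the spirit of Q8_0 (blocks of 32 int8 values plus one 16-bit scale); it is not the llama.cpp implementation, and the K-quant levels use a more elaborate layout that is not reproduced here. Function names and sample data are assumptions.

```python
def quantize_blocks(values, block_size=32, qmax=127):
    """Quantize in fixed-size blocks, storing one scale per block."""
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        blocks.append((scale, [round(v / scale) for v in block]))
    return blocks


def dequantize_blocks(blocks):
    return [scale * qi for scale, q in blocks for qi in q]


def bits_per_weight(block_size=32, weight_bits=8, scale_bits=16):
    """Effective storage cost once the per-block scale is included."""
    return weight_bits + scale_bits / block_size


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Small weights everywhere, plus one outlier in the second block.
values = [0.01 * ((i % 7) - 3) for i in range(64)]
values[40] = 5.0

per_tensor = dequantize_blocks(quantize_blocks(values, block_size=64))
per_block = dequantize_blocks(quantize_blocks(values, block_size=32))
print(mean_abs_error(values, per_tensor), mean_abs_error(values, per_block))
print(bits_per_weight())
```

With one scale for all 64 values, the outlier inflates the step size for everything. With per-block scales, only the block containing the outlier pays for it, at the cost of half a bit per weight for the stored scales.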

BitsAndBytes (bitsandbytes library)

Dynamic INT8 quantization with LLM.int8()
NF4 (Normal Float 4) for QLoRA fine-tuning
Easy integration with HuggingFace Transformers
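
As a rough sketch of what the HuggingFace integration looks like, the helper below loads a causal LM with NF4 weights through `BitsAndBytesConfig`. The wrapper function and its `model_id` argument are assumptions for illustration; it needs `torch`, `transformers`, `bitsandbytes`, and a supported GPU, and option names may differ across library versions.

```python
def load_nf4_model(model_id):
    """Load a causal LM with 4-bit NF4 weights via bitsandbytes (sketch)."""
    # Imports are local so this file can be read without the libraries installed.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # store weights in 4 bits
        bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
        bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
    )
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",
    )
```

For LLM.int8() the same pattern applies with `load_in_8bit=True` in place of the 4-bit options.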

Quantization-Aware Training (QAT)

Simulate quantization during training so the model learns to be robust to it
Better quality than PTQ but requires training
Less common for LLMs due to training cost
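
One common way to "simulate quantization during training" is fake quantization with a straight-through estimator (STE): the forward pass uses the rounded weight, while the backward pass pretends rounding is the identity. That technique is an assumption here, since the post does not name one. The toy below trains a single weight in plain Python; the step size, target, and learning rate are all made up.

```python
def fake_quant(w, step=0.25):
    """Quantize then dequantize: snap w to the nearest multiple of `step`."""
    return round(w / step) * step


def train(target_w=0.6, lr=0.01, epochs=200):
    xs = [0.5, 1.0, 1.5, 2.0]
    ys = [target_w * x for x in xs]
    w = 0.0  # latent full-precision weight kept during training
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            wq = fake_quant(w)    # forward pass sees the quantized weight
            error = wq * x - y
            grad = 2 * error * x  # STE: treat d(wq)/d(w) as 1
            w -= lr * grad        # the update is applied to the latent weight
    return w, fake_quant(w)


latent, deployed = train()
print(latent, deployed)
```

The latent weight settles near the boundary between the two grid points that bracket the target, so the deployed weight ends up on one of the representable values closest to it.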

Weight-Only vs. Weight + Activation Quantization

Weight-Only (W4A16, W8A16)

Weights stored in INT4/INT8
Activations remain in float16 at inference
Simpler to implement, good quality
Most common in practice

Full Quantization (W8A8, W4A8)

Both weights AND activations quantized
More hardware-efficient (INT8 matrix multiply is fast on modern hardware)
Harder to implement without quality loss
Used by NVIDIA TensorRT-LLM, Intel Neural Compressor
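
The difference between the two schemes is easiest to see on a single dot product. In this sketch (plain Python, hypothetical values, per-tensor symmetric scales), the weight-only path dequantizes and multiplies in floating point, while the W8A8 path accumulates in integers and rescales once at the end.

```python
def quantize_sym(values, qmax=127):
    """Symmetric quantization to integers in [-qmax, qmax] with one scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) for v in values], scale


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


weights = [0.8, -0.3, 0.5, 0.1]
activations = [1.2, 0.7, -0.4, 2.0]
reference = dot(weights, activations)

qw, sw = quantize_sym(weights)
qa, sa = quantize_sym(activations)

# Weight-only (W8A16 style): dequantize the weights, then do float math.
weight_only = dot([q * sw for q in qw], activations)

# Full quantization (W8A8 style): integer multiply-accumulate, one rescale at the end.
int_accumulator = dot(qw, qa)
full_quant = int_accumulator * sw * sa

print(reference, weight_only, full_quant)
```

The integer accumulator is where the hardware speedup comes from. The extra error in the W8A8 result comes from rounding the activations, which is why that scheme is harder to get right.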

Quality vs. Size Tradeoff

Rule of thumb for language models:

| Quantization | Quality Loss | Use When |
|-------------|-------------|---------|
| INT8 (8-bit) | < 1% perplexity degradation | Safe default, minimal loss |
| INT4 (4-bit) | ~1–5% perplexity degradation | Good balance, widely used |
| INT3 (3-bit) | ~5–15% degradation | Only when size is critical |
| INT2 (2-bit) | Severe quality loss | Experimental |

Larger models tolerate quantization better — a 70B INT4 often beats a 13B FP16.

Quantization Formats in the Wild

| Format | Tool | Description |
|--------|------|-------------|
| GGUF | llama.cpp, Ollama | CPU/GPU, many quant levels, widely used |
| GPTQ | AutoGPTQ, HuggingFace | GPU inference, INT4/INT8 |
| AWQ | AutoAWQ | GPU, activation-aware INT4 |
| EXL2 | ExLlamaV2 | Flexible mixed-precision, high quality |
| MLX (4-bit) | Apple MLX | Apple Silicon optimized |

Flash Attention and Quantization Combined

Modern inference stacks combine:

Quantized weights (INT4 or INT8) for memory efficiency
Flash Attention for compute efficiency
KV cache in fp16 (often the memory bottleneck for long contexts)
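
To see why the KV cache becomes the bottleneck, it helps to compute its size. The sketch below uses the usual formula (keys plus values, per layer, per KV head, per token). The model shape is an assumption that does not come from this post: 80 layers, 8 KV heads with grouped-query attention, and head dimension 128, roughly a Llama-2-70B-like configuration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Size of the KV cache in bytes; the leading 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


GIB = 1024 ** 3
per_token = kv_cache_bytes(80, 8, 128, seq_len=1)
one_sequence = kv_cache_bytes(80, 8, 128, seq_len=4096)
batch_of_32 = kv_cache_bytes(80, 8, 128, seq_len=4096, batch_size=32)

print(per_token, one_sequence / GIB, batch_of_32 / GIB)
```

Under these assumptions, a batch of 32 sequences at 4096 tokens needs more memory for the fp16 cache than the ~35 GB of INT4 weights for a 70B model.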

When to Quantize

| Scenario | Recommendation |
|----------|---------------|
| Running locally (consumer GPU) | INT4 (GGUF Q4_K_M or AWQ) |
| Cloud inference at scale | INT8 or FP16 depending on GPU |
| Fine-tuning with QLoRA | NF4 (4-bit) base + bf16 adapters |
| Production API serving | FP16 with optional INT8 for large models |
| Research / accuracy-critical | FP16 or BF16 |
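
When the deciding factor is simply whether the model fits, a first-pass check can be automated. This is a hypothetical heuristic, not a rule from this post: it picks the widest format whose weights fit in the available VRAM after reserving some headroom (20% here, an arbitrary assumption) for the KV cache and activations.

```python
def widest_precision_that_fits(num_params, vram_gb, headroom=0.2):
    """Return the widest format whose weights fit in VRAM, or None if none do."""
    budget_gb = vram_gb * (1 - headroom)  # leave room for KV cache and activations
    for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
        if num_params * bits / 8 / 1e9 <= budget_gb:
            return name
    return None


print(widest_precision_that_fits(7e9, 24))   # 7B model on a 24 GB card
print(widest_precision_that_fits(70e9, 48))  # 70B model on 2x 24 GB
print(widest_precision_that_fits(70e9, 24))  # 70B model on a single 24 GB card
```

Treat the result as a starting point; quality requirements, throughput targets, and kernel support on the target GPU still decide the final choice.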

Practical Tools

| Tool | Use Case |
|------|---------|
| bitsandbytes | HuggingFace integration, NF4/INT8 |
| AutoGPTQ | GPTQ quantization |
| AutoAWQ | AWQ quantization |
| llama.cpp | GGUF format, cross-platform |
| Ollama | GGUF-based local inference, user-friendly |
| TensorRT-LLM | NVIDIA production quantization |

Related Concepts

Parameters, Inference, Latency, KV Cache, LoRA/PEFT, Memory, Model Deployment