Scaling Laws

Definition

Scaling laws are empirical relationships that describe how LLM performance (measured by loss) improves predictably and smoothly as a function of three resources: model size (parameters), training data (tokens), and compute (FLOPs). They allow researchers to forecast model capability before training, and to optimally allocate a compute budget.

Why Scaling Laws Matter

Before scaling laws, building better models was trial-and-error. Scaling laws revealed:

  • Bigger models with more data = predictably better, following a power law
  • You can extrapolate small-model training runs to predict large-model performance
  • There are optimal ratios between model size and data for a given compute budget
  • This predictability enabled the "just scale it" approach that produced GPT-3, GPT-4, and beyond

The Two Landmark Papers

    1. Kaplan et al. (OpenAI, 2020) — "Scaling Laws for Neural Language Models"

    Key findings:

  • Loss scales as a power law in N (parameters), D (data tokens), and C (compute)
  • Performance improves smoothly and predictably with scale — no abrupt phase changes (mostly)
  • Larger models are more sample-efficient — they learn more from each token
  • Implication: given a fixed compute budget, put most of it into model size, even if that means training on relatively few tokens

```
L(N) ∝ N^(-0.076)   (model size scaling)
L(D) ∝ D^(-0.095)   (data size scaling)
L(C) ∝ C^(-0.050)   (compute scaling)
```
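To make the extrapolation idea concrete, here is a minimal sketch of fitting a power law to a handful of small runs and projecting it to a larger model. The data points are synthetic, generated from the parameter exponent quoted above with an arbitrary prefactor; `fit_power_law` and every number except the exponent are illustrative assumptions, not values from either paper.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - alpha * log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope  # (a, alpha) in loss = a * size^(-alpha)

# Synthetic "small runs": losses generated from L = a * N^(-0.076), not real data.
TRUE_A, TRUE_ALPHA = 10.0, 0.076
small_sizes = [1e7, 3e7, 1e8, 3e8]
small_losses = [TRUE_A * n ** (-TRUE_ALPHA) for n in small_sizes]

a, alpha = fit_power_law(small_sizes, small_losses)

# Extrapolate the fitted curve to a model far larger than any fitted run.
predicted_loss_10b = a * (1e10) ** (-alpha)

# Under a pure power law, every 10x increase in size multiplies loss by 10^(-alpha).
ratio_per_10x = 10 ** (-alpha)

print(f"fitted alpha = {alpha:.3f}")
print(f"predicted loss at 10B params = {predicted_loss_10b:.3f}")
print(f"loss multiplier per 10x params = {ratio_per_10x:.3f}")
```

With an exponent of 0.076, each 10× increase in parameters multiplies the loss by roughly 0.84. Real runs are noisy and the fit is only trustworthy within the regime where the power law holds.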

    2. Chinchilla (Hoffmann et al., DeepMind, 2022) — "Training Compute-Optimal Large Language Models"

    Corrected Kaplan's recommendation:

    Key finding: For a fixed compute budget, parameters and tokens should scale equally.

  • Rule of thumb: ~20 tokens per parameter for compute-optimal training
  • GPT-3 (175B params, 300B tokens) was significantly undertrained
  • Chinchilla (70B params, 1.4T tokens) outperformed Gopher (280B params) using same compute
  • Chinchilla scaling (both fitted exponents are close to 0.5; the prefactors depend on the fit and are omitted here):

```
N_optimal ∝ C^0.49
D_optimal ∝ C^0.51
Approximately: D_optimal ≈ 20 × N_optimal
```
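The rule of thumb can be checked against the model sizes quoted above. The sketch below computes tokens per parameter and an estimate of training compute; the relation C ≈ 6 × N × D is a common approximation assumed here, and the helper names are illustrative.

```python
# (parameters, training tokens) as quoted in this section.
models = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

def tokens_per_param(n_params, n_tokens):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params

def train_flops(n_params, n_tokens):
    """Approximate training compute, assuming C ~= 6 * N * D."""
    return 6 * n_params * n_tokens

for name, (n, d) in models.items():
    print(
        f"{name}: {tokens_per_param(n, d):.1f} tokens/param, "
        f"~{train_flops(n, d):.2e} training FLOPs"
    )
```

GPT-3 comes out at under 2 tokens per parameter, against 20 for Chinchilla, which is the sense in which it was undertrained.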

Practical Implications of Scaling Laws

For Model Training

| Decision | Guidance from Scaling Laws |
|----------|---------------------------|
| Model size | Larger is better, but only when matched with enough data |
| Data quantity | ~20 tokens per parameter is compute-optimal; frontier labs often train on 10–100× more |
| Training duration | Loss keeps falling with more training steps, so stopping early leaves performance unused |
| Compute budget | Scale model size and training tokens in roughly equal proportion |

For Inference Efficiency

After Chinchilla, the field shifted to overtrained smaller models:

  • Train a smaller model on far more tokens than compute-optimal
  • Result: the model is "overtrained" for its size, but more efficient at inference
  • Examples: LLaMA 7B (trained on 1–2T tokens, far beyond the ~140B that 20 tokens per parameter would suggest) and Mistral 7B
  • Inference compute matters too: a smaller overtrained model may outperform a compute-optimal larger model while being cheaper to serve

Emergent Abilities and Scaling Laws

Some capabilities don't follow smooth power laws and instead appear abruptly:

  • Phase transitions: a capability is absent at small scale, then appears sharply at a threshold
  • Examples: in-context learning, chain-of-thought reasoning, multi-step arithmetic
  • Debated: some argue these are measurement artifacts of benchmark thresholds
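
The "measurement artifact" argument can be illustrated with a toy model: if per-token accuracy improves smoothly but the benchmark only awards credit for a fully correct multi-token answer, the benchmark score can look like a sudden jump. The sketch below assumes tokens are correct independently and uses made-up accuracy values and a 10-token answer; it illustrates the argument and is not a reproduction of any published result.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Probability that every token of the answer is correct,
    assuming each token is correct independently (a simplification)."""
    return per_token_accuracy ** answer_length

# Hypothetical per-token accuracies for models of increasing scale.
per_token_accuracies = [0.50, 0.70, 0.80, 0.90, 0.95, 0.99]
ANSWER_LENGTH = 10  # assumed number of tokens in a benchmark answer

for p in per_token_accuracies:
    score = exact_match_rate(p, ANSWER_LENGTH)
    print(f"per-token accuracy {p:.2f} -> exact-match score {score:.3f}")
```

The per-token metric rises steadily, while the exact-match score stays near zero and then climbs steeply, even though nothing discontinuous happened in the underlying model.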

Beyond Loss: Downstream Task Scaling

Scaling laws on perplexity (language modeling loss) correlate with downstream task performance, but not perfectly:

  • Some tasks improve smoothly with scale (MMLU)
  • Others show emergent threshold behavior (GSM8K, BIG-Bench Hard)
  • Quality of training data can shift the scaling curve significantly

Data Scaling Laws

  • Quality > Quantity: high-quality data (books, Wikipedia) is worth more per token than web crawl
  • Diversity: model capabilities track data diversity
  • Repetition has diminishing returns: a few passes over the same data cost little, but heavy repetition adds little and can degrade performance
  • Data mixture matters: the proportion of code, math, multilingual data in training shapes capabilities

Compute Scaling (Chinchilla Formula in Practice)

For a training budget of C FLOPs, using the approximation C ≈ 6 × N × D and the rule of thumb D ≈ 20 × N (so that C ≈ 120 × N^2):

```
Optimal N (parameters) ≈ (C / 120)^0.5
Optimal D (tokens) ≈ 20 × N ≈ (C × 10 / 3)^0.5
```

Example: 10^23 FLOPs budget

  • Optimal model size: ~29B parameters
  • Optimal training tokens: ~580B tokens
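
As a sanity check on this kind of budgeting, the sketch below solves for a compute-optimal model size and token count under two stated assumptions: training compute C ≈ 6 × N × D, and 20 tokens per parameter. The function name and constants are illustrative. Feeding in the compute implied by Chinchilla's own configuration (70B parameters, 1.4T tokens) should return that configuration.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6   # assumption: C ~= 6 * N * D
TOKENS_PER_PARAM = 20       # Chinchilla rule of thumb

def compute_optimal(c_flops):
    """Return (parameters, tokens) for a FLOP budget, given C = 6*N*D and D = 20*N."""
    # Substituting D = 20*N into C = 6*N*D gives C = 120 * N^2.
    n_params = math.sqrt(c_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

# Round trip: the budget implied by Chinchilla (70B params, 1.4T tokens).
chinchilla_budget = FLOPS_PER_PARAM_TOKEN * 70e9 * 1.4e12
n, d = compute_optimal(chinchilla_budget)
print(f"budget {chinchilla_budget:.2e} FLOPs -> {n / 1e9:.0f}B params, {d / 1e12:.1f}T tokens")
```

Calling `compute_optimal(1e23)` gives the allocation for a 10^23 FLOP budget under the same assumptions.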

The "Compute-Optimal" vs. "Inference-Optimal" Distinction

| Approach | Model Size | Tokens | Result |
|----------|-----------|--------|--------|
| Compute-optimal | Large | Fewer | Best performance per training FLOP |
| Inference-optimal | Small | Many (overtrained) | Best performance per inference FLOP |

Frontier labs are shifting from compute-optimal to inference-optimal training as deployment costs dominate.
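
The trade-off can be sketched as a break-even calculation. The example below assumes training costs about 6 × N × D FLOPs and inference costs about 2 × N FLOPs per generated token (both common approximations), and it assumes, purely for illustration, that a hypothetical 30B model trained on 4T tokens reaches quality comparable to a 70B model trained on 1.4T tokens. None of these pairings are measured results.

```python
def train_flops(n_params, n_tokens):
    """Approximate training compute, assuming C ~= 6 * N * D."""
    return 6 * n_params * n_tokens

def inference_flops_per_token(n_params):
    """Approximate forward-pass cost per token, assuming ~2 FLOPs per parameter."""
    return 2 * n_params

def break_even_tokens(large, small):
    """Served tokens after which the smaller, overtrained model is cheaper overall.

    `large` and `small` are (parameters, training tokens) tuples.
    """
    extra_training = train_flops(*small) - train_flops(*large)
    saving_per_token = inference_flops_per_token(large[0]) - inference_flops_per_token(small[0])
    return extra_training / saving_per_token

# Hypothetical pair, assumed (not measured) to reach similar quality.
large_model = (70e9, 1.4e12)   # compute-optimal style
small_model = (30e9, 4e12)     # overtrained for its size

tokens_to_break_even = break_even_tokens(large_model, small_model)
print(f"break-even after ~{tokens_to_break_even:.2e} served tokens")
```

Past the break-even point, every additional served token favors the smaller model, which is why heavy deployment pushes toward overtraining.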

Scaling Laws Limitations

| Limitation | Notes |
|------------|-------|
| Architecture changes reset the curve | New architectures (MoE, Mamba) shift the power law |
| Data quality is not accounted for | Laws assume corpora of uniform quality |
| Benchmark saturation | At some scale, benchmarks max out |
| Emergent abilities are discontinuous | Not everything follows a smooth power law |
| Loss does not directly predict reasoning ability | GPT-4's quality jump reportedly required more than just scale |

Related Concepts

  • Pre-training, Parameters, Compute, Chinchilla, Emergent Abilities, Model Selection, Loss Function
