Definition
Next-token prediction (also called autoregressive language modeling or causal language modeling) is the core training objective of most modern LLMs: given a sequence of tokens, predict the probability distribution over all possible next tokens. This deceptively simple objective, applied at massive scale, is the foundation of GPT, Claude, LLaMA, and nearly all other frontier LLMs.
The Formal Definition
Given a sequence of tokens [t₁, t₂, ..., tₙ], the model learns to predict:
```
P(tₙ₊₁ | t₁, t₂, ..., tₙ)
```
For a document of n tokens, the chain rule lets the model assign a probability to the whole document by predicting every token given all previous ones:
```
P(document) = P(t₁) × P(t₂|t₁) × P(t₃|t₁,t₂) × ... × P(tₙ|t₁,...,tₙ₋₁)
```
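To make the factorization concrete, here is a minimal Python sketch that scores a short sequence by summing log-probabilities, which is the usual way to avoid multiplying many small numbers. The `toy_model` function and its lookup table are hypothetical stand-ins for a real language model, and the probabilities are made up for illustration.

```python
import math

def sequence_log_prob(tokens, next_token_probs):
    """Sum log P(t_i | t_1..t_{i-1}) over every position (chain rule)."""
    total = 0.0
    for i, token in enumerate(tokens):
        context = tuple(tokens[:i])        # empty tuple for the first token
        dist = next_token_probs(context)   # dict: candidate token -> probability
        total += math.log(dist[token])
    return total

def toy_model(context):
    """Hypothetical stand-in for an LLM: a hand-written lookup table."""
    table = {
        (): {"the": 0.5, "a": 0.5},
        ("the",): {"cat": 0.25, "dog": 0.75},
        ("the", "cat"): {"sat": 0.5, "ran": 0.5},
    }
    return table[context]

log_p = sequence_log_prob(["the", "cat", "sat"], toy_model)
prob = math.exp(log_p)  # approximately 0.5 * 0.25 * 0.5 = 0.0625
```

Working in log space is a design choice rather than a requirement: the product form and the sum-of-logs form describe the same quantity.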
Training Process
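1. Take a document: "The capital of France is Paris." (each word is treated as one token here for simplicity)
2. Create training pairs:
```
Input: "The" → Target: "capital"
Input: "The capital" → Target: "of"
Input: "The capital of" → Target: "France"
Input: "The capital of France" → Target: "is"
Input: "The capital of France is" → Target: "Paris"
Input: "The capital of France is Paris" → Target: "."
```
3. The model outputs a probability distribution over the vocabulary for each input
4. Loss = cross-entropy between that distribution and the true next token, i.e. the negative log of the probability assigned to the target
5. Backpropagate and update weights
A document of N tokens yields N − 1 training pairs (N if a beginning-of-sequence token is prepended so that the first token is also predicted), and a causally masked Transformer scores all of them in a single forward pass, which makes the objective very efficient.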
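The pair construction and the loss for a single pair can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: words stand in for tokens, and the `predicted` distribution is a hypothetical model output, not the result of running any real model.

```python
import math

def make_training_pairs(tokens):
    """Return (context, target) pairs: every prefix predicts the token after it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def cross_entropy(predicted_probs, target):
    """Loss for one pair: negative log of the probability given to the true token."""
    return -math.log(predicted_probs[target])

tokens = ["The", "capital", "of", "France", "is", "Paris", "."]
pairs = make_training_pairs(tokens)  # six pairs from seven tokens

# Hypothetical model output for the context "The capital of France is".
predicted = {"Paris": 0.8, "Lyon": 0.15, "London": 0.05}
loss = cross_entropy(predicted, "Paris")  # -ln(0.8), roughly 0.223
```

The loss shrinks toward zero as the probability assigned to the true token approaches 1, and grows without bound as that probability approaches 0.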
Why This Objective is So Powerful
Next-token prediction seems narrow, but predicting the next token well requires many different kinds of knowledge:
| To predict next token in... | Model must learn... |
|-----------------------------|---------------------|
| "The capital of France is ___" | World knowledge (geography) |
| "def fibonacci(n): if n <= 1: return ___" | Code logic |
| "She took the umbrella because it was ___" | Common sense reasoning |
| "The defendant argued that the law ___" | Legal language patterns |
| "2 + 2 = ___" | Arithmetic |
| "if condition: \n ___" | Syntax rules |
Any knowledge that makes text more predictable lowers the loss, so the objective rewards acquiring it.
Self-Supervised Learning
A key property: no human labels are needed. The supervision signal comes directly from the data:
- Input: the first N tokens of any document
- Label: the (N+1)th token
- This label exists for every document ever written
This is why training data can be sourced from the internet at essentially unlimited scale.
The "Blessing of Scale" in Next-Token Prediction
As training scales:
- Perplexity (the exponential of the average cross-entropy loss) decreases smoothly
- Each decrease in perplexity means the model predicts held-out text better, which tends to track better language understanding
- At sufficient scale, this produces models that can reason, code, and converse
The link from next-token prediction to general capability is an empirical observation rather than a theoretical guarantee, but a robust one: models trained on this objective develop remarkably broad capabilities.
Autoregressive Generation (The Inference Connection)
The conditional distribution learned in training is the same one sampled from at inference:
```
Training: Predict P(t_{n+1} | t_1...t_n) for all n in the document
Inference: Sample t_{n+1} from P(t_{n+1} | t_1...t_n), append, repeat
```
This is why inference is autoregressive: generation reuses the next-token distribution learned in training, feeding each sampled token back in as input for the next step.
Variants of the Core Objective
Masked Language Modeling (MLM): BERT
Instead of predicting the next token, predict randomly masked tokens:
```
"The [MASK] of France is [MASK]." → "capital", "Paris"
```
- Bidirectional: the model sees context on both sides of each mask
- Better suited to classification and understanding tasks
- Not suited to open-ended generation (there is no left-to-right order to sample in)
Prefix Language Modeling
The model attends bidirectionally to a prefix and predicts the continuation left to right, a hybrid of MLM and causal language modeling (CLM):
- Explored in T5 alongside its "span corruption" objective
- More efficient use of the training signal
Fill-in-the-Middle (FIM)
The model learns to fill in a middle section given the prefix and suffix:
```
Input: "def add(a, [FIM] return a + b" → Predict: "b):\n "
```
- Used in code generation models (Code Llama, StarCoder)
- Enables completing code at a cursor position
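One way to build a fill-in-the-middle training example is to cut a document into prefix, middle, and suffix, then move the middle to the end so that an ordinary left-to-right model can learn to produce it. The Python sketch below assumes a prefix-suffix-middle ordering and uses placeholder sentinel strings; real models define their own special tokens and may order the pieces differently, so `make_fim_example` and its default arguments are illustrative only.

```python
def make_fim_example(code, start, end,
                     prefix_tok="<fim_prefix>",
                     suffix_tok="<fim_suffix>",
                     middle_tok="<fim_middle>"):
    """Split code into prefix/middle/suffix and move the middle to the end.

    The sentinel strings are placeholders (an assumption for this sketch).
    """
    prefix, middle, suffix = code[:start], code[start:end], code[end:]
    # The model reads prefix and suffix, then is trained to emit the middle.
    model_input = f"{prefix_tok}{prefix}{suffix_tok}{suffix}{middle_tok}"
    return model_input, middle

code = "def add(a, b):\n    return a + b"
start = code.index("b):")     # the missing span starts at the second parameter
end = code.index("return")    # and runs up to the return statement
model_input, target = make_fim_example(code, start, end)
```

Here `target` is the span `"b):\n    "`, matching the example above, and `model_input` holds the surrounding code wrapped in the sentinels. Because the rearranged sequence is still trained with next-token prediction, no change to the model architecture is needed.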
Limitations of Next-Token Prediction
No Explicit Planning
The model doesn't "plan ahead": each token is generated left to right without a global outline. This is one reason chain-of-thought (CoT) prompting helps: the intermediate tokens act as a written-out plan that later tokens can condition on.
Frequency Bias
The model learns to predict frequent patterns well. Rare facts, tail knowledge, and low-frequency patterns are predicted less reliably, which is one source of hallucination.
No Ground Truth for Open-Ended Questions
"What is the best solution to this problem?" has no unique next token. The model learns from corpus patterns, which may not match the "best" answer.
Related terms: LLM, Token, Pre-training, Inference, Loss Function, Autoregressive, Perplexity