Next-Token Prediction — FDE@ProdAI Blog

Definition

Next-token prediction (also called autoregressive language modeling or causal language modeling) is the core training objective of most modern LLMs: given a sequence of tokens, predict the probability distribution over all possible next tokens. This deceptively simple objective, applied at massive scale, is the foundation of GPT, Claude, LLaMA, and nearly all other frontier LLMs.

The Formal Definition

Given a sequence of tokens [t₁, t₂, ..., tₙ], the model learns to predict:

P(tₙ₊₁ | t₁, t₂, ..., tₙ)

For a document of n tokens, the probability of the whole document factorizes by the chain rule, so the model predicts every token given all previous ones:

P(document) = P(t₁) × P(t₂|t₁) × P(t₃|t₁,t₂) × ... × P(tₙ|t₁,...,tₙ₋₁)
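The factorization above can be checked numerically. The following is a minimal sketch with made-up conditional probabilities (the numbers are assumptions for illustration, not output from any real model); it shows that multiplying the conditionals and summing their logarithms give the same document probability.

```python
import math

# Hypothetical conditional probabilities P(t_i | t_1..t_{i-1}) that a model
# might assign to each token of a 4-token document (made-up numbers).
conditional_probs = [0.2, 0.5, 0.9, 0.6]

# Chain rule: the joint probability is the product of the conditionals.
joint_prob = math.prod(conditional_probs)

# In practice the sum of log-probabilities is used instead, because the
# product of many small numbers underflows for long documents.
log_prob = sum(math.log(p) for p in conditional_probs)

print(joint_prob)
print(math.exp(log_prob))  # same value up to floating-point error
```

Working in log space is the usual choice because training losses are defined on log-probabilities anyway.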

Training Process

1. Take a document: "The capital of France is Paris."

2. Create training pairs:

Input: "The" → Target: "capital"

Input: "The capital" → Target: "of"

Input: "The capital of" → Target: "France"

Input: "The capital of France" → Target: "is"

Input: "The capital of France is" → Target: "Paris"

Input: "The capital of France is Paris" → Target: "."

3. The model outputs a probability distribution over the vocabulary for each input

4. Loss = cross-entropy between prediction and true next token

5. Backpropagate and update weights

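One document with N tokens produces N - 1 training examples (N if a start-of-sequence token is prepended), and a Transformer with causal masking scores all of them in a single forward pass, which makes training very efficient.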
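The steps above can be sketched in a few lines. This is an illustrative toy, not a training loop: `make_training_pairs` and `cross_entropy` are hypothetical helper names, the word-level split stands in for a real subword tokenizer, and the predicted distribution is invented.

```python
import math

def make_training_pairs(tokens):
    """Return (context, target) pairs: each target is the token after its context."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def cross_entropy(predicted_probs, target):
    """Negative log-probability the model assigned to the true next token."""
    return -math.log(predicted_probs[target])

# Word-level split for readability; real tokenizers produce subword tokens.
tokens = ["The", "capital", "of", "France", "is", "Paris", "."]
pairs = make_training_pairs(tokens)

for context, target in pairs:
    print(" ".join(context), "->", target)

# Hypothetical model output for the context "The capital of France is".
predicted = {"Paris": 0.8, "Lyon": 0.1, "a": 0.1}
loss = cross_entropy(predicted, "Paris")
print(len(pairs), round(loss, 4))
```

The loss is small when the model puts most of its probability on the true next token and grows without bound as that probability approaches zero.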

Why This Objective is So Powerful

Next-token prediction seems narrow, but to predict the next word well, the model must understand:

| To predict next token in... | Model must learn... |
|-----------------------------|---------------------|
| "The capital of France is ___" | World knowledge (geography) |
| "def fibonacci(n): if n <= 1: return ___" | Code logic |
| "She took the umbrella because it was ___" | Common sense reasoning |
| "The defendant argued that the law ___" | Legal language patterns |
| "2 + 2 = ___" | Arithmetic |
| "if condition: \n ___" | Syntax rules |

Predicting text well across all of these domains therefore draws on many kinds of knowledge at once.

Self-Supervised Learning

A key property: no human labels are needed. The supervision signal comes directly from the data:

Input: the first N tokens of any document
Label: the (N+1)th token
This label exists for every document ever written

This is why training data can be sourced from the internet at very large scale without any manual annotation.
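In code, "the label comes from the data" usually amounts to shifting the token sequence by one position. The sketch below assumes a document that has already been converted to integer token ids; the ids are made up and `shift_labels` is a hypothetical helper name.

```python
def shift_labels(token_ids):
    """Split one sequence into model inputs and next-token labels.

    The label at position i is simply the token at position i + 1,
    so no human annotation is involved.
    """
    inputs = token_ids[:-1]
    labels = token_ids[1:]
    return inputs, labels

# Made-up token ids standing in for a tokenized document.
token_ids = [464, 3139, 286, 4881, 318, 6342, 13]
inputs, labels = shift_labels(token_ids)

for x, y in zip(inputs, labels):
    print(f"input ends with {x:>5} -> label {y}")
```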

The "Blessing of Scale" in Next-Token Prediction

As training scales:

Perplexity (the exponential of the average cross-entropy loss) decreases smoothly
Lower perplexity means the model assigns higher probability to held-out text, which tends to track better language understanding
At sufficient scale, this produces models that can reason, code, and converse
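Perplexity is computed from the probabilities a model assigns to the true tokens: it is the exponential of the average negative log-probability. The sketch below uses invented probabilities to show the calculation; `perplexity` is a hypothetical helper name.

```python
import math

def perplexity(token_probs):
    """Exponential of the average negative log-probability of the true tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that spreads probability evenly over 4 choices at every step
# has a perplexity of about 4: it is "as confused as" a 4-way guess.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# Made-up probabilities for a more confident model on the same tokens.
confident = perplexity([0.9, 0.8, 0.95, 0.7])

print(uniform, confident)
```

A perplexity of k can be read as the model being, on average, as uncertain as a uniform choice among k tokens.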

The link from next-token prediction to general intelligence is empirical but robust — models trained on this objective develop remarkably broad capabilities.

Autoregressive Generation (The Inference Connection)

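The same conditional distribution learned in training is used at inference:

Training: Predict P(t_{n+1} | t_1...t_n) for all n in the document, conditioning on the true previous tokens (teacher forcing)

Inference: Sample t_{n+1} from P(t_{n+1} | t_1...t_n), append, repeat

This is why inference is autoregressive: generation reuses the distribution learned in training, but conditions on the model's own previous outputs instead of ground-truth text.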
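The sample, append, repeat loop can be sketched with a toy stand-in for the model. Everything here is an assumption for illustration: `TOY_MODEL` is a hand-written lookup table that conditions only on the last token, whereas a real LLM conditions on the entire prefix.

```python
import random

# Hypothetical toy "model": maps the previous token to a next-token distribution.
TOY_MODEL = {
    "<bos>": {"The": 1.0},
    "The": {"capital": 0.7, "city": 0.3},
    "capital": {"of": 1.0},
    "city": {"of": 1.0},
    "of": {"France": 0.6, "Spain": 0.4},
    "France": {"<eos>": 1.0},
    "Spain": {"<eos>": 1.0},
}

def next_token_distribution(prefix):
    """Stand-in for a model forward pass over the prefix."""
    return TOY_MODEL[prefix[-1]]

def generate(max_new_tokens=10, seed=0):
    rng = random.Random(seed)
    sequence = ["<bos>"]
    for _ in range(max_new_tokens):
        dist = next_token_distribution(sequence)
        candidates = list(dist.keys())
        weights = list(dist.values())
        # Sample one token from the predicted distribution.
        token = rng.choices(candidates, weights=weights, k=1)[0]
        if token == "<eos>":
            break
        sequence.append(token)  # the sampled token becomes part of the next input
    return sequence[1:]  # drop the <bos> marker

print(" ".join(generate()))
```

Replacing the sampling step with a pick of the highest-probability token would give greedy decoding; the loop structure stays the same.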

Variants of the Core Objective

Masked Language Modeling (MLM) — BERT

Instead of predicting the next token, predict randomly masked tokens:

"The [MASK] of France is [MASK]." → "capital", "Paris"

Bidirectional: model sees context from both sides
Better for classification/understanding
Not suited to open-ended text generation (no left-to-right generation order)
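The masking step can be sketched as follows. This is a simplified illustration: it always substitutes `[MASK]`, whereas BERT's actual recipe also sometimes keeps the original token or swaps in a random one. `mask_tokens` is a hypothetical helper name, and the word-level tokens stand in for real subword tokens.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK] and record the originals.

    Returns the masked sequence and a dict {position: original_token};
    the model is trained to predict those originals from both-side context.
    """
    rng = random.Random(seed)
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels[i] = tok
        else:
            masked.append(tok)
    return masked, labels

tokens = ["The", "capital", "of", "France", "is", "Paris", "."]
# A high mask_prob is used here only so that a short example shows some masks.
masked, labels = mask_tokens(tokens, mask_prob=0.5, seed=1)
print(masked)
print(labels)
```

Only the masked positions contribute to the loss, which is one difference from next-token prediction, where every position is a training target.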

Prefix Language Modeling

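Predict the continuation of a sequence given a full prefix. The prefix is read bidirectionally and the continuation is generated left-to-right, making this a hybrid of bidirectional (MLM-style) context and CLM:

Explored in the T5 paper as an alternative to its span-corruption objective, and used by UniLM
Gives prefix tokens full context, though the loss is computed only on the continuation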
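One common way to realize prefix language modeling is through the attention mask: positions inside the prefix can attend to the whole prefix, while positions after it attend only to earlier positions. The sketch below builds such a mask as a nested list of booleans; `prefix_lm_mask` is a hypothetical helper name, and real implementations would build this as a tensor.

```python
def prefix_lm_mask(seq_len, prefix_len):
    """mask[i][j] is True if position i may attend to position j.

    Any position may attend to the whole prefix (j < prefix_len);
    beyond the prefix, attention is causal (j <= i).
    """
    return [
        [j < prefix_len or j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]

# 4 tokens, of which the first 2 form the prefix.
for row in prefix_lm_mask(seq_len=4, prefix_len=2):
    print("".join("x" if allowed else "." for allowed in row))
```

With `prefix_len=0` the same function reduces to the ordinary causal mask used for next-token prediction.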

Fill-in-the-Middle (FIM)

Model learns to fill in a middle section given prefix and suffix:

Input: "def add(a, [FIM] return a + b" → Predict: "b):\n "

Used in code generation models (Code Llama, StarCoder)
Enables completing code at a cursor position
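FIM is typically implemented as a data transformation rather than a new loss: the document is cut into prefix, middle, and suffix, then rearranged so the middle comes last, and the model is trained with ordinary next-token prediction on the rearranged string. The sketch below assumes placeholder sentinel strings (`<PRE>`, `<SUF>`, `<MID>`); real models define their own special tokens, and `to_fim_example` is a hypothetical helper name.

```python
# Placeholder sentinel strings; real models define their own special tokens.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<PRE>", "<SUF>", "<MID>"

def to_fim_example(document, start, end):
    """Rearrange a document into prefix-suffix-middle order.

    Everything after the middle sentinel is what the model learns to
    generate, so filling the gap becomes plain left-to-right prediction.
    """
    prefix, middle, suffix = document[:start], document[start:end], document[end:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

code = "def add(a, b):\n    return a + b"
start = code.index("b)")      # the gap begins at the second parameter
end = code.index("return")    # and ends just before the return statement
example = to_fim_example(code, start, end)
print(repr(example))
```

At inference time the editor supplies the text before and after the cursor as prefix and suffix, and the model generates the middle.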

Limitations of Next-Token Prediction

No Explicit Planning

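The model has no explicit planning step: each token is generated left-to-right without a global outline. This is one reason chain-of-thought (CoT) prompting helps: the intermediate tokens act as a scratchpad in which the model can lay out a plan before committing to an answer.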

Frequency Bias

The model learns to predict frequent patterns well. Rare facts, tail knowledge, and low-frequency patterns are predicted less reliably, which is one source of hallucination.

No Ground Truth for Open-Ended Questions

"What is the best solution to this problem?" has no unique next token. The model learns from corpus patterns — which may not match the "best" answer.

Related Concepts

LLM, Token, Pre-training, Inference, Loss Function, Autoregressive, Perplexity