Next-Token Prediction

Definition

Next-token prediction (also called autoregressive language modeling or causal language modeling) is the core training objective of most modern LLMs: given a sequence of tokens, predict the probability distribution over all possible next tokens. This deceptively simple objective, applied at massive scale, is the foundation of GPT, Claude, LLaMA, and nearly all other frontier LLMs.

The Formal Definition

Given a sequence of tokens [t₁, t₂, ..., tₙ], the model learns to predict:

```
P(tₙ₊₁ | t₁, t₂, ..., tₙ)
```

For a document of n tokens, the chain rule factorizes the document's probability into one conditional per token, each conditioned on all previous tokens:

```
P(document) = P(t₁) × P(t₂ | t₁) × P(t₃ | t₁, t₂) × ... × P(tₙ | t₁, ..., tₙ₋₁)
```
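As a small worked illustration of this factorization, the sketch below multiplies per-token conditional probabilities and compares the result with the sum of their logarithms, the form usually preferred in practice because long products underflow. The probability values are invented for illustration and do not come from any real model.

```python
import math

# Hypothetical conditional probabilities P(t_i | t_1..t_{i-1}) for a
# four-token document; the numbers are illustrative only.
conditional_probs = [0.20, 0.50, 0.90, 0.60]


def sequence_probability(probs):
    """Multiply the per-token conditionals (chain rule)."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def sequence_log_probability(probs):
    """Sum of log-probabilities; equals log(sequence_probability)."""
    return sum(math.log(p) for p in probs)


p_doc = sequence_probability(conditional_probs)
log_p_doc = sequence_log_probability(conditional_probs)
print(p_doc, math.exp(log_p_doc))
```

Both printed values agree up to floating-point error, since exponentiating the summed log-probabilities recovers the product.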

Training Process

1. Take a document: "The capital of France is Paris."

2. Create training pairs:

```
Input: "The" → Target: "capital"
Input: "The capital" → Target: "of"
Input: "The capital of" → Target: "France"
Input: "The capital of France" → Target: "is"
Input: "The capital of France is" → Target: "Paris"
Input: "The capital of France is Paris" → Target: "."
```

3. The model outputs a probability distribution over the vocabulary for each input

4. Loss = cross-entropy between the predicted distribution and the true next token

5. Backpropagate and update weights

A document with N tokens yields N − 1 training examples (N if a start-of-sequence token is prepended), so nearly every token in the corpus contributes a training signal.
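The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a training loop: the sentence is split into words rather than real subword tokens, and the `predicted` distribution is a made-up stand-in for a model's output. The function names are hypothetical.

```python
import math


def make_training_pairs(tokens):
    """Return (context, target) pairs: every prefix predicts the next token."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def cross_entropy(predicted, target):
    """Negative log of the probability the model assigned to the true token."""
    return -math.log(predicted[target])


# Word-level split of the example sentence (real tokenizers differ).
tokens = ["The", "capital", "of", "France", "is", "Paris", "."]
pairs = make_training_pairs(tokens)

# Hypothetical model output for the context "The capital of France is".
predicted = {"Paris": 0.7, "Lyon": 0.2, "London": 0.1}
context, target = pairs[4]
loss = cross_entropy(predicted, target)
print(len(pairs), context, target, round(loss, 4))
```

The seven word-level tokens produce six pairs, matching the list in step 2. The loss shrinks toward zero as the model puts more probability on the true next token.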

Why This Objective is So Powerful

Next-token prediction seems narrow, but to predict the next word well, the model must understand:

| To predict next token in... | Model must learn... |
|-----------------------------|---------------------|
| "The capital of France is ___" | World knowledge (geography) |
| "def fibonacci(n): if n <= 1: return ___" | Code logic |
| "She took the umbrella because it was ___" | Common sense reasoning |
| "The defendant argued that the law ___" | Legal language patterns |
| "2 + 2 = ___" | Arithmetic |
| "if condition: \n ___" | Syntax rules |

Predicting text well across domains draws on nearly every kind of knowledge that text encodes.

Self-Supervised Learning

A key property: no human labels are needed. The supervision signal comes directly from the data:

  • Input: the first N tokens of any document
  • Label: the (N+1)th token
  • This label exists for every document ever written

This is why training data can be sourced from the internet at essentially unlimited scale.

The "Blessing of Scale" in Next-Token Prediction

As training scales:

  • Perplexity (the exponential of the average cross-entropy loss) decreases smoothly
  • Lower perplexity generally reflects a better model of language
  • At sufficient scale, this produces models that can reason, code, and converse

The link from next-token prediction to general intelligence is empirical but robust: models trained on this objective develop remarkably broad capabilities.
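To make the perplexity metric concrete, here is a minimal sketch that computes it as the exponential of the mean negative log-probability assigned to the true tokens. The two probability lists are invented for illustration and stand in for a weaker and a stronger model scored on the same text.

```python
import math


def perplexity(true_token_probs):
    """exp of the mean negative log-probability assigned to the true tokens."""
    nll = [-math.log(p) for p in true_token_probs]
    return math.exp(sum(nll) / len(nll))


# Hypothetical probabilities two models assign to the same four true tokens.
weaker_model = [0.10, 0.20, 0.05, 0.10]
stronger_model = [0.40, 0.60, 0.30, 0.50]

print(perplexity(weaker_model), perplexity(stronger_model))

# A model that spreads probability uniformly over k tokens has perplexity k.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

One way to read the number: a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k tokens.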

Autoregressive Generation (The Inference Connection)

The same conditional distribution learned in training drives generation at inference:

```
Training:  Predict P(t_{n+1} | t_1...t_n) for all n in the document
Inference: Sample t_{n+1} from P(t_{n+1} | t_1...t_n), append, repeat
```

This is why inference is autoregressive: generation reuses the distribution learned in training, but conditions on the model's own previously sampled tokens rather than on ground-truth text.
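The sample, append, repeat loop can be sketched with a toy stand-in for the model. The lookup table below is hypothetical and conditions only on the last token, whereas a real LLM conditions on the entire prefix. The `<s>` and `</s>` markers are illustrative start and end symbols.

```python
import random

# Hypothetical next-token table standing in for a trained model. A real LLM
# conditions on the whole prefix; this toy looks only at the last token.
NEXT_TOKEN_PROBS = {
    "<s>": {"The": 1.0},
    "The": {"capital": 0.6, "city": 0.4},
    "capital": {"of": 1.0},
    "city": {"of": 1.0},
    "of": {"France": 0.7, "Spain": 0.3},
    "France": {"</s>": 1.0},
    "Spain": {"</s>": 1.0},
}


def generate(rng, max_tokens=10):
    """Sample a token, append it to the sequence, and repeat until </s>."""
    sequence = ["<s>"]
    for _ in range(max_tokens):
        dist = NEXT_TOKEN_PROBS[sequence[-1]]
        choices, weights = zip(*dist.items())
        next_token = rng.choices(choices, weights=weights, k=1)[0]
        if next_token == "</s>":
            break
        sequence.append(next_token)
    return sequence[1:]  # drop the start marker


print(generate(random.Random(0)))
```

Each sampled token becomes part of the context for the next step, so an early choice such as "capital" versus "city" shapes everything that follows.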

Variants of the Core Objective

Masked Language Modeling (MLM), as in BERT

Instead of predicting the next token, predict randomly masked tokens:

```
"The [MASK] of France is [MASK]." → "capital", "Paris"
```

  • Bidirectional: the model sees context from both sides
  • Better for classification/understanding
  • Poorly suited to open-ended generation (no left-to-right factorization to sample from)
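A minimal sketch of how masked training examples might be built is shown below. The `mask_tokens` helper and the 0.3 masking rate are assumptions chosen for illustration; BERT's actual masking recipe is more involved than replacing every selected token with `[MASK]`.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob, rng):
    """Return (masked_input, labels); labels are None where nothing was masked."""
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)   # the model must recover this token
        else:
            masked.append(token)
            labels.append(None)    # no loss is computed at this position
    return masked, labels


tokens = ["The", "capital", "of", "France", "is", "Paris", "."]
masked, labels = mask_tokens(tokens, mask_prob=0.3, rng=random.Random(0))
print(masked)
print(labels)
```

Unlike next-token prediction, only the masked positions carry a label, and the model may use tokens on both sides of each mask to recover it.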

Prefix Language Modeling

The model attends bidirectionally over a given prefix and predicts the remaining tokens left to right, a hybrid of MLM and causal language modeling (CLM):

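  • Evaluated in the T5 paper as an alternative to its default "span corruption" objective, and used by models such as UniLM
  • Can use context more effectively than a purely causal model, since prefix tokens attend to each other in both directions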
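One common way to realize prefix language modeling is through the attention mask: positions inside the prefix are visible to every token, while the rest of the sequence stays causal. The sketch below builds such a mask as a plain list of lists. The function names and the sequence and prefix lengths are illustrative assumptions.

```python
def prefix_lm_mask(seq_len, prefix_len):
    """mask[i][j] is True when position i may attend to position j."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            in_prefix = j < prefix_len   # prefix tokens are visible to everyone
            causal = j <= i              # otherwise only look at the past
            row.append(in_prefix or causal)
        mask.append(row)
    return mask


def causal_mask(seq_len):
    """Standard causal mask: a prefix LM mask with an empty prefix."""
    return prefix_lm_mask(seq_len, prefix_len=0)


for row in prefix_lm_mask(seq_len=5, prefix_len=2):
    print("".join("x" if allowed else "." for allowed in row))
```

With a prefix length of zero the mask reduces to the ordinary lower-triangular causal mask, which shows how the two objectives are related.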

Fill-in-the-Middle (FIM)

The model learns to fill in a middle section given a prefix and a suffix:

```
Input: "def add(a, [FIM] return a + b" → Predict: "b):\n "
```

  • Used in code generation models (Code Llama, StarCoder)
  • Enables completing code at a cursor position
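A FIM training example can be built by cutting a document into three pieces and moving the middle to the end, so that ordinary left-to-right prediction still applies. The sketch below assumes a prefix, suffix, middle ordering and uses made-up sentinel strings; real models define their own special tokens and formats.

```python
# Hypothetical sentinel strings; real models define their own special tokens.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def make_fim_example(document, start, end):
    """Split a document at [start, end) and move the middle to the end."""
    prefix = document[:start]
    middle = document[start:end]
    suffix = document[end:]
    # The model reads prefix and suffix, then predicts the middle left to right,
    # so ordinary next-token prediction still applies to the rearranged text.
    model_input = f"{PRE}{prefix}{SUF}{suffix}{MID}"
    return model_input, middle


code = "def add(a, b):\n    return a + b"
start = code.index("b)")
end = code.index("return")
model_input, target = make_fim_example(code, start, end)
print(repr(model_input))
print(repr(target))
```

At inference time, an editor integration would place the text before the cursor in the prefix slot and the text after it in the suffix slot, then let the model generate the middle.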

Limitations of Next-Token Prediction

No Explicit Planning

The model has no explicit planning step; each token is generated left to right without a global outline. This is one reason chain-of-thought (CoT) prompting helps: the intermediate tokens give the model room to plan.

Frequency Bias

The model learns to predict frequent patterns well. Rare facts, tail knowledge, and low-frequency patterns are predicted less reliably, which is one source of hallucination.

No Ground Truth for Open-Ended Questions

"What is the best solution to this problem?" has no unique next token. The model learns from corpus patterns, which may not match the "best" answer.

Related Concepts

  • LLM, Token, Pre-training, Inference, Loss Function, Autoregressive, Perplexity
