
Logits and Softmax

Definition

Logits are the raw, unnormalized output scores the LLM produces for every token in its vocabulary at each generation step. Softmax is the mathematical function that converts these logits into a proper probability distribution over the vocabulary — the numbers you actually sample from to pick the next token.

The Full Generation Pipeline

`
Transformer forward pass
        ↓
Hidden state vector (d_model dimensions)
        ↓
[LM head: linear layer W_vocab] (d_model → vocab_size)
        ↓
Logits: [z_1, z_2, ..., z_V] (one raw score per vocab token, V ≈ 32K–128K)
        ↓
[Temperature scaling: z_i / T]
        ↓
[Softmax]
        ↓
Probabilities: [p_1, p_2, ..., p_V] (sum to 1.0)
        ↓
[Top-K / Top-P filtering]
        ↓
Sample → next token
`
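
The LM head step in the pipeline is a single matrix-vector product. The following is a minimal sketch in plain Python, assuming a toy d_model of 3 and a vocabulary of 4 tokens. The weights and the helper name `lm_head` are made up for illustration; real models use far larger matrices.

```python
def lm_head(hidden, w_vocab):
    """Project a hidden state (length d_model) to one logit per vocab token.

    w_vocab has shape vocab_size x d_model (one row of weights per token).
    """
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_vocab]

# Toy values, purely illustrative: d_model = 3, vocab_size = 4.
hidden = [1.0, 0.0, -1.0]
w_vocab = [
    [2.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 3.0],
    [1.0, 1.0, 1.0],
]
logits = lm_head(hidden, w_vocab)
print(logits)  # [2.0, 0.0, -3.0, 0.0]
```

The output has one entry per vocabulary row, which is why the logit vector always has length vocab_size.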

Logits Explained

  • A logit for token i represents how "confident" the model is that token i should come next
  • Higher logit = the model thinks this token is more likely
  • Logits can be any real number: negative, zero, or positive
  • They have NO direct probabilistic interpretation until passed through softmax
  • Example (conceptual, for "The capital of France is ___"):

    `
    Token      | Logit
    -----------|-------
    "Paris"    | 12.4
    "France"   | 8.1
    "Lyon"     | 6.2
    "a"        | 2.1
    "the"      | 1.8
    "dog"      | -5.2
    ...
    `

Softmax Explained

Softmax converts the vector of logits into probabilities:

`
softmax(z_i) = exp(z_i) / Σ_j exp(z_j)
`

Properties:

  • All output values are in (0, 1)
  • All output values sum to exactly 1.0
  • Preserves relative ordering (highest logit → highest probability)
  • The exponential amplifies differences: at T = 1, a logit gap of 1.0 between two tokens makes the higher one e ≈ 2.7× more probable
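
The softmax formula and its properties can be checked directly. The sketch below is one common way to write it, subtracting the maximum logit before exponentiating to avoid overflow (this leaves the result unchanged). It uses only the six tokens listed in the conceptual example, so the probabilities are relative to that truncated vocabulary rather than a real one.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits from the conceptual "capital of France" example (listed tokens only).
tokens = ["Paris", "France", "Lyon", "a", "the", "dog"]
logits = [12.4, 8.1, 6.2, 2.1, 1.8, -5.2]
probs = softmax(logits)

for tok, p in zip(tokens, probs):
    print(f"{tok:>8}: {p:.6f}")

# A logit gap g between two tokens gives a probability ratio of exp(g).
ratio = probs[0] / probs[1]
print(ratio, math.exp(12.4 - 8.1))
```

The two printed ratios agree up to floating-point rounding, which is the "gap of 1.0 means about 2.7×" property generalized to any gap.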
Temperature's Effect on Logits

Temperature T scales the logits before softmax:

`
p_i = exp(z_i / T) / Σ_j exp(z_j / T)
`

| T | Effect on Logits | Effect on Distribution |
|---|------------------|------------------------|
| T → 0 | Logits → ±∞ | Argmax (one token gets all probability) |
| T = 0.5 | Logits doubled in magnitude | Sharper: top tokens dominate more |
| T = 1.0 | No change | Default model distribution |
| T = 2.0 | Logits halved in magnitude | Flatter: more uniform, more random |
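
To see the rows of the table numerically, here is a small sketch that divides the logits by T before a stable softmax. The function name `softmax_with_temperature` and the three toy logits are illustrative assumptions, not part of any particular library.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature; temperature must be > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for the T -> 0 limit")
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # toy values
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The top token's share should shrink as T grows, while the ranking of tokens stays the same at every temperature.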

From Logits to Sampling: The Full Sequence

`
1. Compute logits for position n+1
2. Divide by temperature T
3. Apply softmax → probabilities
4. Apply Top-K (keep the K most probable tokens, zero the rest) or Top-P (keep the smallest set of most probable tokens whose cumulative probability reaches P)
5. Re-normalize remaining probabilities
6. Sample one token from the distribution
7. Append to sequence, repeat
`
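
Steps 2 through 6 above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than how any specific inference engine does it: the function name `sample_next_token` is hypothetical, both filters are optional, and when both are set the Top-P cutoff is measured on the unrenormalized probabilities left after Top-K (implementations differ on this detail).

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Return the index of a sampled token, following steps 2-6."""
    # Steps 2-3: temperature scaling, then a numerically stable softmax.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Step 4: rank tokens by probability and decide which ones to keep.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break  # smallest prefix whose mass reaches top_p
        order = kept

    # Steps 5-6: random.choices treats weights as relative, so it
    # effectively renormalizes before drawing one sample.
    weights = [probs[i] for i in order]
    return rng.choices(order, weights=weights, k=1)[0]

logits = [12.4, 8.1, 6.2, 2.1, 1.8, -5.2]
rng = random.Random(0)  # seeded only to make the demo repeatable
print(sample_next_token(logits, temperature=0.8, top_k=3, top_p=0.95, rng=rng))
```

Setting `top_k=1` reduces this to greedy decoding, since only the highest-probability token survives the filter.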

Why Logits Matter for Practitioners

Log Probabilities

Many APIs expose log probabilities (logprobs), the natural log of the softmax probabilities:

`
logprob(token_i) = log(softmax(z_i))
`

  • Always ≤ 0 (since probabilities ≤ 1)
  • Close to 0 → high probability (e.g., -0.01 → 99% probability)
  • Very negative → low probability (e.g., -10 → ~0.005% probability)
  • Used for: confidence estimation, reranking, output scoring
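
Log probabilities are usually computed directly from logits as z_i minus the log-sum-exp of all logits, which avoids taking the log of a very small number. A minimal sketch, reusing the six example logits and some hypothetical per-token logprobs for the scoring step:

```python
import math

def log_softmax(logits):
    """log(softmax(z_i)) = z_i - logsumexp(z), computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

logits = [12.4, 8.1, 6.2, 2.1, 1.8, -5.2]  # listed tokens from the example
logprobs = log_softmax(logits)

# Converting back: probability = exp(logprob).
probs = [math.exp(lp) for lp in logprobs]

# Scoring a sequence: sum per-token logprobs instead of multiplying probabilities.
sequence_logprob = sum([-0.01, -0.5, -2.0])  # hypothetical per-token values
print(logprobs[0], probs[0], sequence_logprob)
```

Summing logprobs is the usual choice for reranking and output scoring because products of many small probabilities underflow quickly.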
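Greedy Decoding

`python
logits = [12.4, 8.1, 6.2]  # example scores, one per token

# Softmax is monotonic, so argmax over logits and over probabilities agree.
next_token = max(range(len(logits)), key=lambda i: logits[i])
`

Greedy decoding always picks the single highest-logit token. It is the T → 0 limit of temperature sampling; T = 0 itself cannot be plugged into z_i / T, so implementations typically special-case it as a plain argmax.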

Calibration

A well-calibrated model's probabilities reflect actual accuracy:

  • If the model assigns 80% probability to an answer, it should be correct ~80% of the time
  • Poorly calibrated models are overconfident or underconfident
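
One common way to quantify the gap between stated confidence and actual accuracy is expected calibration error (ECE): bucket predictions by confidence and average the per-bucket difference. The sketch below is a simplified version with equal-width bins, and the confidences and outcomes are hypothetical data invented for the example.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_conf - accuracy)
    return ece

# Hypothetical predictions: the model's probability for its answer, and whether it was right.
confidences = [0.9, 0.9, 0.9, 0.9, 0.9, 0.6, 0.6, 0.6, 0.6, 0.6]
correct = [True, True, True, True, False, True, True, True, False, False]
ece = expected_calibration_error(confidences, correct)
print(round(ece, 4))
```

In this toy data the 0.9 group is right 80% of the time (overconfident by 0.1) and the 0.6 group is right 60% of the time, so the score lands near 0.05. A perfectly calibrated model would score 0.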
Logit Lens (Interpretability Technique)

At each Transformer layer, you can project the intermediate hidden state through the LM head to see what the model "thinks" the next token will be at that layer. This technique, called the logit lens, reveals how the model's prediction evolves through the layers.
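
The idea can be shown with a toy example: take one hidden state per layer, push each through the same LM head, and record the top token. Everything below (the vocabulary, weights, hidden states, and the helper names `project` and `logit_lens`) is made up for illustration. Practical implementations typically also apply the model's final normalization layer before the LM head, which this sketch omits.

```python
def project(hidden, w_vocab):
    """Apply the LM head: one dot product per vocabulary row."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_vocab]

def logit_lens(hidden_states_per_layer, w_vocab, vocab):
    """Return the top-scoring token at each layer's hidden state."""
    top_tokens = []
    for hidden in hidden_states_per_layer:
        logits = project(hidden, w_vocab)
        best = max(range(len(logits)), key=lambda i: logits[i])
        top_tokens.append(vocab[best])
    return top_tokens

# Entirely made-up toy data: 3 layers, d_model = 2, vocabulary of 3 tokens.
vocab = ["the", "Lyon", "Paris"]
w_vocab = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden_states_per_layer = [[1.0, -1.0], [-0.5, 1.0], [1.0, 1.0]]
print(logit_lens(hidden_states_per_layer, w_vocab, vocab))  # ['the', 'Lyon', 'Paris']
```

Reading the list from first to last layer gives the "evolving prediction" view that the technique is used for.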

Vocabulary and Logit Dimensions

| Model | Vocabulary Size | Logit Vector Size |
|-------|-----------------|-------------------|
| GPT-2 | 50,257 | 50,257 |
| LLaMA 3 | 128,256 | 128,256 |
| Mistral | 32,000 | 32,000 |
| Claude | ~100K (estimated) | ~100K |

The LM head matrix (vocab_size × d_model) is the largest single weight matrix in most models.
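
The size of the LM head follows directly from its shape. A quick arithmetic sketch, where the d_model values are assumptions not given in the table (768 for the smallest GPT-2, 4096 for the 8B LLaMA 3 model):

```python
def lm_head_params(vocab_size, d_model):
    """Number of weights in a vocab_size x d_model LM head (no bias)."""
    return vocab_size * d_model

# d_model values are assumptions, not taken from the table:
# 768 for GPT-2 small, 4096 for the 8B LLaMA 3 model.
gpt2_params = lm_head_params(50_257, 768)
llama3_params = lm_head_params(128_256, 4096)
print(gpt2_params)    # 38597376
print(llama3_params)  # 525336576
```

With a 128K vocabulary the LM head alone holds roughly half a billion weights, which is why vocabulary size has a visible effect on total parameter count.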

Related Concepts

  • Inference, Temperature, Top-P, Top-K, Sampling, Next-Token Prediction, Transformer, Calibration
