Definition
Logits are the raw, unnormalized output scores the LLM produces for every token in its vocabulary at each generation step. Softmax is the mathematical function that converts these logits into a proper probability distribution over the vocabulary — the numbers you actually sample from to pick the next token.
The Full Generation Pipeline
```
Transformer forward pass
↓
Hidden state vector (d_model dimensions)
↓
[LM Head: Linear layer W_vocab] (d_model → vocab_size)
↓
Logits: [z_1, z_2, ..., z_V] (one raw score per vocab token, V ≈ 32K–128K)
↓
[Temperature scaling: z_i / T]
↓
[Softmax]
↓
Probabilities: [p_1, p_2, ..., p_V] (sum to 1.0)
↓
[Top-K / Top-P filtering]
↓
Sample → next token
```
Logits Explained
- A logit for token i is the model's raw score for how strongly it favors token i as the next token
- Higher logit = the model considers this token more likely
- Logits can be any real number: negative, zero, or positive
- They have no direct probabilistic interpretation until passed through softmax
Example (conceptual, for "The capital of France is ___"):
```
Token | Logit
-----------|-------
"Paris" | 12.4
"France" | 8.1
"Lyon" | 6.2
"a" | 2.1
"the" | 1.8
"dog" | -5.2
...
```
Two quantities derived from logits, covered under Log Probabilities and Calibration below:
- Log probabilities are always ≤ 0 (since probabilities ≤ 1)
- A log probability close to 0 → high probability (e.g., -0.01 → ~99% probability)
- A very negative log probability → low probability (e.g., -10 → ~0.005% probability)
- Log probabilities are used for: confidence estimation, reranking, output scoring
- Calibration: if a well-calibrated model assigns 80% probability to an answer, it should be correct ~80% of the time
- Poorly calibrated models are overconfident or underconfident
Related terms: Inference, Temperature, Top-P, Top-K, Sampling, Next-Token Prediction, Transformer, Calibration
Softmax Explained
Softmax converts the vector of logits into probabilities:
```
softmax(z_i) = exp(z_i) / Σ_j exp(z_j)
```
Properties:
- All output values are in (0, 1)
- All output values sum to 1.0
- Preserves relative ordering (highest logit → highest probability)
- The exponential function amplifies differences: a logit gap of 1.0 between two tokens means a probability ratio of e ≈ 2.7×
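The softmax formula can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular library's implementation: the function name `softmax` and the six-token toy vocabulary are assumptions, and subtracting the maximum logit is a standard numerical-stability trick that leaves the result unchanged.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the max cancels in the ratio but keeps exp() from
    # overflowing when logits are large.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits from the conceptual table above. Only the six listed tokens are
# used, so these are probabilities over a toy vocabulary, not a real one.
tokens = ["Paris", "France", "Lyon", "a", "the", "dog"]
logits = [12.4, 8.1, 6.2, 2.1, 1.8, -5.2]
probs = softmax(logits)

for tok, p in zip(tokens, probs):
    print(f"{tok:>7}: {p:.6f}")

# A logit gap of 1.0 corresponds to a probability ratio of e.
p_a, p_b = softmax([3.0, 2.0])
print(p_a / p_b)
```

Because the exponential grows so quickly, the 4.3-point lead of "Paris" over "France" leaves very little probability for everything else in this toy example.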
Temperature's Effect on Logits
Temperature T scales the logits before softmax:
```
p_i = exp(z_i / T) / Σ_j exp(z_j / T)
```
| T | Effect on Logits | Effect on Distribution |
|---|-----------------|----------------------|
| T → 0 | Logits → ±∞ | Argmax (one token gets all probability) |
| T = 0.5 | Logits doubled in magnitude | Sharper — top tokens dominate more |
| T = 1.0 | No change | Default model distribution |
| T = 2.0 | Logits halved in magnitude | Flatter — more uniform, more random |
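To see the sharpening and flattening in numbers, here is a small sketch that applies three temperatures to the same logits. The function name `softmax_with_temperature` and the three-token scores are hypothetical, chosen only for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Compute softmax(z / T); temperature must be strictly positive."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat T = 0 as argmax")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for a three-token vocabulary
top_prob = {}
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    top_prob[t] = probs[0]  # probability of the highest-logit token
    print(t, [round(p, 3) for p in probs])
```

The probability of the top token shrinks as T grows, which is the "flatter" behavior described in the table.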
From Logits to Sampling: The Full Sequence
```
1. Compute logits for position n+1
2. Divide by temperature T
3. Apply softmax → probabilities
4. Apply Top-K (keep the K most probable tokens, zero the rest) or Top-P (keep the smallest set of top tokens whose cumulative probability reaches P)
5. Re-normalize remaining probabilities
6. Sample one token from the distribution
7. Append to sequence, repeat
```
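Steps 2 to 6 above can be sketched as a single function. This is an illustrative sketch under stated assumptions, not how any specific inference library does it: the name `sample_next_token` and its parameters are hypothetical, `top_k` is assumed to be at least 1, and the function allows Top-K and Top-P to be combined even though the steps above present them as alternatives.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Return the index of one sampled token, following steps 2-6."""
    # Steps 2-3: temperature-scaled softmax (max subtracted for stability).
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Step 4: rank token indices by probability, then filter.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break  # smallest prefix whose mass reaches top_p
        ranked = kept

    # Step 5: re-normalize over the surviving tokens.
    mass = sum(probs[i] for i in ranked)
    weights = [probs[i] / mass for i in ranked]

    # Step 6: draw one token index from the filtered distribution.
    return rng.choices(ranked, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded generator so runs are repeatable
logits = [12.4, 8.1, 6.2, 2.1, 1.8, -5.2]
print(sample_next_token(logits, temperature=0.7, top_k=3, rng=rng))
```

Step 7 would append the returned index to the sequence and call the model again for the next position.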
Why Logits Matter for Practitioners
Log Probabilities
Many APIs expose log probabilities (logprobs) — the log of softmax probabilities:
```
logprob(token_i) = log(softmax(z_i))
```
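A minimal sketch of computing logprobs directly from logits, assuming a toy six-token vocabulary. The function name `log_softmax` and the per-token values in the sequence example are hypothetical. Computing `z_i - logsumexp(z)` rather than `log(softmax(z_i))` avoids taking the log of a probability that has underflowed to zero.

```python
import math

def log_softmax(logits):
    """log(softmax(z_i)) computed as z_i - logsumexp(z)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

logits = [12.4, 8.1, 6.2, 2.1, 1.8, -5.2]  # toy six-token vocabulary
logprobs = log_softmax(logits)

# Exponentiating a logprob recovers the probability.
probs = [math.exp(lp) for lp in logprobs]

# Scoring a sequence: per-token logprobs add, so probabilities multiply.
sequence_logprob = sum([-0.01, -0.5, -2.3])  # hypothetical per-token logprobs
print(sequence_logprob, math.exp(sequence_logprob))
```

Summing logprobs is why they are convenient for reranking: a product of many small probabilities would underflow, while a sum of negative numbers does not.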
Greedy Decoding
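```python
import numpy as np
next_token = int(np.argmax(logits))  # logits: 1-D scores; argmax of softmax(logits) gives the same index
```
Greedy decoding always picks the single highest-logit token. It is the limit of temperature sampling as T → 0; T = 0 itself would divide by zero, so implementations typically handle it as a separate argmax path.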
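The sketch below checks both claims on hypothetical scores: that argmax is the same before and after softmax, and that a very small temperature pushes almost all probability onto the greedy choice. The helper names `greedy_pick` and `softmax` are assumptions for illustration.

```python
import math

def greedy_pick(logits):
    """Index of the highest logit; no softmax or sampling needed."""
    return max(range(len(logits)), key=lambda i: logits[i])

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.2, 3.4, 3.1, -0.5]  # hypothetical scores
choice = greedy_pick(logits)

# Softmax is monotonic, so the argmax is the same before and after it.
probs = softmax(logits)
assert choice == max(range(len(probs)), key=lambda i: probs[i])

# As T shrinks, nearly all probability mass moves onto that same token.
cold = softmax(logits, temperature=0.01)
print(choice, round(cold[choice], 6))
```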
Calibration
A well-calibrated model's probabilities reflect actual accuracy.
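One common way to put a number on calibration is expected calibration error (ECE). The text here does not name a metric, so treat ECE as an illustrative choice. The sketch groups predictions into confidence bins and averages the gap between stated confidence and observed accuracy; the function name, the bin count, and the data are all hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 goes into the last bin rather than out of range.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece

# Hypothetical data: ten answers, each given 80% confidence.
well_calibrated = expected_calibration_error([0.8] * 10, [True] * 8 + [False] * 2)
overconfident = expected_calibration_error([0.8] * 10, [True] * 5 + [False] * 5)
print(well_calibrated, overconfident)
```

An ECE near zero means stated confidence matches accuracy. In the second case the model claims 80% but is right only half the time, giving a gap of about 0.3.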
Logit Lens (Interpretability Technique)
At each Transformer layer, you can project the intermediate hidden state through the LM head (typically after applying the final layer norm) to see which token the model would predict if it stopped at that layer. Comparing these projections across layers shows how the prediction evolves with depth. This technique is called the logit lens.
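The projection step can be sketched with toy data. Everything here is a hypothetical stand-in: the LM head and the per-layer hidden states are random numbers rather than trained weights, the sizes are tiny, and the final layer norm that real implementations usually apply before the LM head is omitted for brevity.

```python
import random

rng = random.Random(0)
d_model, vocab_size, n_layers = 4, 6, 3  # toy sizes, far smaller than a real model

# Hypothetical stand-ins: a random LM head and one hidden state per layer.
# In a real model these come from trained weights and a forward pass.
w_vocab = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(vocab_size)]
hidden_states = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_layers)]

def project_to_logits(hidden, lm_head):
    """Apply the LM head: one dot product per vocabulary row."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in lm_head]

top_token_per_layer = []
for layer, hidden in enumerate(hidden_states):
    logits = project_to_logits(hidden, w_vocab)
    top = max(range(vocab_size), key=lambda i: logits[i])
    top_token_per_layer.append(top)
    print(f"layer {layer}: top token id = {top}")
```

With a real model, the interesting part is watching the top token change across layers until it settles on the final prediction.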
Vocabulary and Logit Dimensions
| Model | Vocabulary Size | Logit Vector Size |
|-------|----------------|-------------------|
| GPT-2 | 50,257 | 50,257 |
| LLaMA 3 | 128,256 | 128,256 |
| Mistral | 32,000 | 32,000 |
| Claude | ~100K (estimated) | ~100K |
The LM head matrix (vocab_size × d_model) is the largest single weight matrix in most models.
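To get a feel for the size of the LM head, the sketch below multiplies vocabulary size by hidden width. The `d_model` values are assumptions for specific variants (GPT-2 small, LLaMA 3 8B, Mistral 7B) and do not come from the table, which lists only vocabulary sizes. The memory figure assumes 16-bit weights.

```python
# d_model values are assumptions for specific model variants; the table
# above lists only vocabulary sizes.
models = {
    "GPT-2 (small)": {"vocab_size": 50_257, "d_model": 768},
    "LLaMA 3 (8B)": {"vocab_size": 128_256, "d_model": 4_096},
    "Mistral (7B)": {"vocab_size": 32_000, "d_model": 4_096},
}

def lm_head_params(vocab_size, d_model):
    """Number of weights in a vocab_size x d_model projection (no bias)."""
    return vocab_size * d_model

sizes = {name: lm_head_params(**cfg) for name, cfg in models.items()}
for name, n in sizes.items():
    # 2 bytes per weight assumes 16-bit storage.
    print(f"{name}: {n:,} weights, about {n * 2 / 1e6:.0f} MB at 16-bit")
```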