Logits and Softmax — FDE@ProdAI Blog

Definition

Logits are the raw, unnormalized output scores the LLM produces for every token in its vocabulary at each generation step. Softmax is the mathematical function that converts these logits into a proper probability distribution over the vocabulary — the numbers you actually sample from to pick the next token.

The Full Generation Pipeline

Transformer forward pass

↓

Hidden state vector (d_model dimensions)

↓

[LM Head: Linear layer W_vocab] (d_model → vocab_size)

↓

Logits: [z_1, z_2, ..., z_V] (one raw score per vocab token, V ≈ 32K–128K)

↓

[Temperature scaling: z_i / T]

↓

[Softmax]

↓

Probabilities: [p_1, p_2, ..., p_V] (sum to 1.0)

↓

[Top-K / Top-P filtering]

↓

Sample → next token
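The LM head step in the diagram is a plain matrix-vector product. The following is a minimal pure-Python sketch, assuming a toy d_model of 3 and a vocabulary of 4 tokens. The function name `lm_head`, the weights, and the hidden state are invented for illustration, and the bias term is left out as a simplifying assumption.

```python
def lm_head(hidden, w_vocab):
    """Project a hidden state (length d_model) to logits (length vocab_size).

    w_vocab holds one row per vocabulary token, each of length d_model.
    """
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_vocab]


# Toy values, invented for illustration only.
hidden = [1.0, -2.0, 0.5]
w_vocab = [
    [0.5, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, -1.0, 2.0],
    [-1.0, 0.5, 0.0],
]
logits = lm_head(hidden, w_vocab)  # one raw score per vocabulary token
print(logits)
```

In a real model the same operation runs as a single batched matrix multiply on the accelerator; the list comprehension here only makes the shapes explicit.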

Logits Explained

A logit for token i represents how "confident" the model is that token i should come next
Higher logit = the model thinks this token is more likely
Logits can be any real number: negative, zero, or positive
They have NO direct probabilistic interpretation until passed through softmax

Example (conceptual, for "The capital of France is ___"):

| Token | Logit |
|-------|-------|
| "Paris" | 12.4 |
| "France" | 8.1 |
| "Lyon" | 6.2 |
| "a" | 2.1 |
| "the" | 1.8 |
| "dog" | -5.2 |
| ... | ... |

Softmax Explained

Softmax converts the vector of logits into probabilities:

softmax(z_i) = exp(z_i) / Σ_j exp(z_j)

Properties:

All output values are in (0, 1)
All output values sum to exactly 1.0
Preserves relative ordering (highest logit → highest probability)
The exponential amplifies differences: a logit gap of 1.0 means an e ≈ 2.72× higher probability, and a gap of 2.0 means about 7.4×
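To make the formula and its properties concrete, here is a small sketch that applies softmax to the six example logits from the "capital of France" table. Subtracting the maximum logit before exponentiating is a standard stability trick that is not mentioned above; it leaves the result unchanged but avoids overflow. Only the six listed tokens are used, so the numbers are illustrative and not what a full vocabulary would give.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits from the conceptual example (six tokens only, not a full vocabulary).
tokens = ["Paris", "France", "Lyon", "a", "the", "dog"]
logits = [12.4, 8.1, 6.2, 2.1, 1.8, -5.2]
probs = softmax(logits)

for token, p in zip(tokens, probs):
    print(f"{token:>7}: {p:.6f}")

# A logit gap of 1.0 corresponds to a probability ratio of e.
pair = softmax([1.0, 0.0])
ratio = pair[0] / pair[1]
print(f"ratio for a gap of 1.0: {ratio:.3f}")
```

With these six logits, "Paris" ends up with roughly 98% of the probability mass, which shows how strongly the exponential favors the top score.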

Temperature's Effect on Logits

Temperature T scales the logits before softmax:

p_i = softmax(z_i / T)

| T | Effect on scaled logits (z_i / T) | Effect on distribution |
|---|-----------------|----------------------|
| T → 0 | Scaled logits → ±∞ | Argmax (one token gets all probability) |
| T = 0.5 | Logits doubled in magnitude | Sharper: top tokens dominate more |
| T = 1.0 | No change | Default model distribution |
| T = 2.0 | Logits halved in magnitude | Flatter: more uniform, more random |
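The rows of the table can be checked numerically. This is a minimal sketch, assuming three toy logits chosen for illustration; the function name `softmax_with_temperature` is hypothetical, and it rejects T ≤ 0 because dividing by zero is undefined.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature; temperature must be > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # toy values, invented for illustration
for t in (0.1, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As T shrinks, the first probability approaches 1.0; as T grows, the three values move toward 1/3 each.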

From Logits to Sampling: The Full Sequence

1. Compute logits for position n+1

2. Divide by temperature T

3. Apply softmax → probabilities

4. Apply Top-K (keep the K highest-probability tokens, zero out the rest) or Top-P (keep the smallest set of top tokens whose cumulative probability reaches P)

5. Re-normalize remaining probabilities

6. Sample one token from the distribution

7. Append to sequence, repeat
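Steps 2 through 6 can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not how any particular library implements it: the function name `sample_next_token` is hypothetical, Top-K is applied before Top-P when both are given, and real inference stacks differ in ordering and edge-case handling.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Return the index of one sampled token, following steps 2-6 above."""
    # Steps 2-3: temperature scaling, then a numerically stable softmax.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Step 4: rank token indices by probability, then filter.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        ranked = kept

    # Step 5: re-normalize over the surviving tokens.
    mass = sum(probs[i] for i in ranked)
    weights = [probs[i] / mass for i in ranked]

    # Step 6: sample one token index from the filtered distribution.
    return rng.choices(ranked, weights=weights, k=1)[0]


logits = [12.4, 8.1, 6.2, 2.1, 1.8, -5.2]  # example logits from earlier
token_id = sample_next_token(
    logits, temperature=2.0, top_k=3, top_p=0.95, rng=random.Random(0)
)
print(token_id)
```

Passing a seeded `random.Random` instance makes the draw reproducible, which helps when debugging sampling behavior.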

Why Logits Matter for Practitioners

Log Probabilities

Many APIs expose log probabilities (logprobs) — the log of softmax probabilities:

logprob(token_i) = log(softmax(z_i))

Always ≤ 0 (since probabilities ≤ 1)
Close to 0 → high probability (e.g., -0.01 → 99% probability)
Very negative → low probability (e.g., -10 → ~0.005% probability)
Used for: confidence estimation, reranking, output scoring
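A short sketch of how logprobs relate to logits and probabilities. Computing log-softmax as z_i minus log-sum-exp is a common stable formulation and an addition here, not something the text above prescribes. The per-token values in `token_logprobs` are hypothetical stand-ins for what an API might return.

```python
import math


def log_softmax(logits):
    """log(softmax(z)) computed as z_i - logsumexp(z), which avoids log(0)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]


logits = [12.4, 8.1, 6.2, 2.1, 1.8, -5.2]  # example logits from earlier
logprobs = log_softmax(logits)

# Converting back: probability = exp(logprob).
probs = [math.exp(lp) for lp in logprobs]

# Scoring a sequence: sum the per-token logprobs (hypothetical values).
token_logprobs = [-0.01, -0.25, -1.2]
sequence_logprob = sum(token_logprobs)
avg_logprob = sequence_logprob / len(token_logprobs)  # length-normalized score

print(f"exp(-0.01) = {math.exp(-0.01):.4f}")
print(f"exp(-10)   = {math.exp(-10):.6f}")
```

Summing logprobs is equivalent to multiplying probabilities but stays numerically well-behaved for long sequences; dividing by length is one simple way to compare outputs of different lengths when reranking.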

Greedy Decoding

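```python
import numpy as np

logits = np.array([12.4, 8.1, 6.2, 2.1, 1.8, -5.2])  # example scores from the table above
next_token = int(np.argmax(logits))  # argmax of softmax(logits) gives the same index
```

Greedy decoding always picks the token with the single highest logit. It is the limit of temperature sampling as T → 0 (T = 0 itself would divide by zero, so implementations typically special-case it).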

Calibration

A well-calibrated model's probabilities reflect actual accuracy:

If the model assigns 80% probability to an answer, it should be correct ~80% of the time
Poorly calibrated models are overconfident or underconfident
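One common way to put a number on calibration is expected calibration error (ECE), which is not named above and is introduced here only as an illustration. The sketch groups predictions into equal-width confidence bins and averages the gap between confidence and accuracy, weighted by bin size. The data is hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical data: five answers at 80% confidence, four of them correct.
well_calibrated = expected_calibration_error([0.8] * 5, [True, True, True, True, False])

# Hypothetical overconfident model: 90% confidence, only half correct.
overconfident = expected_calibration_error([0.9] * 4, [True, True, False, False])

print(well_calibrated, overconfident)
```

An ECE near 0 means stated confidence matches observed accuracy; the overconfident case here yields a gap of about 0.4.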

Logit Lens (Interpretability Technique)

At each Transformer layer, you can project the intermediate hidden state through the LM head to see what the model "thinks" the next token will be at that layer. This reveals how the model's prediction evolves through the layers — called the logit lens.
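The idea can be sketched with toy numbers. Everything below is invented for illustration: a 2-dimensional hidden state, a 3-token vocabulary, and three "layers" of hidden states. Real implementations typically also apply the model's final normalization layer before the LM head, which this sketch omits.

```python
def project_to_logits(hidden, w_vocab):
    """Apply the LM head: one dot product per vocabulary row."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_vocab]


def logit_lens(layer_hidden_states, w_vocab, vocab):
    """For each layer, return the token whose logit is highest under the LM head."""
    predictions = []
    for hidden in layer_hidden_states:
        logits = project_to_logits(hidden, w_vocab)
        best = max(range(len(logits)), key=lambda i: logits[i])
        predictions.append(vocab[best])
    return predictions


# Toy setup, invented for illustration only.
vocab = ["the", "Paris", "Lyon"]
w_vocab = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
layer_hidden_states = [[1.0, 0.0], [0.5, 0.5], [0.1, 1.0]]

predictions = logit_lens(layer_hidden_states, w_vocab, vocab)
print(predictions)
```

Reading the per-layer predictions in order shows the evolution the paragraph describes: a generic token early on, then progressively closer guesses.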

Vocabulary and Logit Dimensions

| Model | Vocabulary Size | Logit Vector Size |
|-------|----------------|-------------------|
| GPT-2 | 50,257 | 50,257 |
| LLaMA 3 | 128,256 | 128,256 |
| Mistral | 32,000 | 32,000 |
| Claude | ~100K (estimated) | ~100K |

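The LM head matrix (vocab_size × d_model) is typically the largest single weight matrix in the model, matched only by the input embedding matrix, which has the same shape and is sometimes tied to it.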
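A quick back-of-the-envelope calculation shows the scale. The d_model values below are not from the table; they are assumptions taken from the published configurations of specific variants (GPT-2 small, LLaMA 3 8B, Mistral 7B), and larger variants of each family use larger values.

```python
# (vocab_size, d_model); d_model values are assumptions for the named variants.
models = {
    "GPT-2 (small)": (50_257, 768),
    "LLaMA 3 (8B)": (128_256, 4_096),
    "Mistral (7B)": (32_000, 4_096),
}


def lm_head_params(vocab_size, d_model):
    """Number of weights in a vocab_size x d_model LM head (no bias)."""
    return vocab_size * d_model


for name, (vocab_size, d_model) in models.items():
    params = lm_head_params(vocab_size, d_model)
    mib_16bit = params * 2 / 2**20  # 2 bytes per weight in a 16-bit format
    print(f"{name}: {params:,} parameters, about {mib_16bit:.0f} MiB in 16-bit")
```

The same count applies to the input embedding matrix, so models that do not tie the two pay this cost twice.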

Related Concepts

Inference, Temperature, Top-P, Top-K, Sampling, Next-Token Prediction, Transformer, Calibration