Top-P and Top-K Sampling — FDE@ProdAI Blog

Definition

Top-K and Top-P (nucleus sampling) are token-filtering strategies applied after temperature scaling. They restrict which tokens can be sampled as the next output, which keeps the model from selecting very low-probability, incoherent tokens while preserving diversity.

The Problem They Solve

After temperature scaling, the full vocabulary distribution may include thousands of tokens with small but non-zero probabilities. Sampling from the full distribution can produce random, incoherent words. Top-K and Top-P restrict sampling to the "reasonable" candidates.

Top-K Sampling

Keep the K tokens with the highest probability; discard all others.

1. Sort tokens by probability (descending)

2. Keep only the top K tokens

3. Set all other probabilities to 0

4. Re-normalize the remaining K probabilities to sum to 1

5. Sample from the truncated distribution

Example with K=3:

Before: Paris(0.50), France(0.25), Lyon(0.12), dog(0.08), car(0.03), ...

After: Paris(0.575), France(0.287), Lyon(0.138) [re-normalized: each value divided by 0.50 + 0.25 + 0.12 = 0.87]
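The five steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than any library's implementation: the function name `top_k_filter` and the `"<other>"` entry (a stand-in for the elided tail of the distribution) are assumptions made for the example.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and re-normalize them to sum to 1."""
    # Steps 1-2: sort by probability (descending) and keep the first k.
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    # Steps 3-4: dropped tokens are simply absent (probability 0);
    # the survivors are divided by their combined mass.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# The K=3 example; "<other>" is a hypothetical stand-in for the elided tail.
probs = {"Paris": 0.50, "France": 0.25, "Lyon": 0.12,
         "dog": 0.08, "car": 0.03, "<other>": 0.02}
filtered = top_k_filter(probs, k=3)
print({token: round(p, 3) for token, p in filtered.items()})
```

Step 5 (sampling) would then draw from `filtered`, for example with `random.choices`.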

Problem with Top-K: K is fixed, but the natural distribution width varies:

When the model is confident (one obvious answer), K=50 still lets in unlikely tokens
When the model is uncertain (many valid options), K=50 may cut off valid choices

Top-P (Nucleus Sampling)

Keep the smallest set of tokens whose cumulative probability is at least P.

1. Sort tokens by probability (descending)

2. Accumulate probabilities until sum ≥ P

3. Keep only those tokens

4. Discard the rest (zero them out)

5. Re-normalize and sample

Example with P=0.9:

Confident case:

Paris(0.85), France(0.06) → cumsum = 0.91 ≥ 0.9 → keep 2 tokens

[Tight nucleus: prevents unlikely tokens]

Uncertain case:

"she"(0.08), "it"(0.07), "he"(0.07), "they"(0.06)... → need 20 tokens to reach 0.9

[Wide nucleus: allows diversity]

Advantage: Top-P adapts dynamically to the model's confidence.
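To make the adaptive behavior concrete, here is a small sketch of the nucleus procedure in plain Python. The function name `top_p_filter`, the tail values in the confident case, and the flat 25-token distribution are all assumptions chosen for illustration; only Paris(0.85) and France(0.06) come from the example above.

```python
def top_p_filter(probs, p):
    """Keep the smallest high-probability prefix whose cumulative mass reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:  # the nucleus now covers at least p
            break
    # Re-normalize the kept tokens so they sum to 1.
    return {token: prob / cumulative for token, prob in kept}


# Confident case: the three tail tokens are hypothetical filler.
confident = {"Paris": 0.85, "France": 0.06, "Lyon": 0.04, "dog": 0.03, "car": 0.02}
# Uncertain case: a hypothetical flat distribution of 25 tokens at 0.04 each.
uncertain = {f"tok{i}": 0.04 for i in range(25)}

print(len(top_p_filter(confident, 0.9)))  # tight nucleus
print(len(top_p_filter(uncertain, 0.9)))  # wide nucleus
```

The same P yields a pool of 2 tokens in the first case and 23 in the second, which is the adaptivity that a fixed K cannot provide.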

Comparison

| Aspect | Top-K | Top-P |
|--------|-------|-------|
| Pool size | Fixed K tokens | Variable (depends on distribution) |
| Adapts to confidence | No | Yes |
| Behavior when certain | May include bad tokens | Tight nucleus |
| Behavior when uncertain | May exclude valid tokens | Wide nucleus |
| Default recommendation | Less preferred | Preferred (more adaptive) |
| Typical value | K=40–100 | P=0.9–0.95 |

Using Both Together

Top-K and Top-P can be combined:

1. Apply Top-K first → restricts to at most K tokens

2. Apply Top-P second → further restricts to cumulative P

3. Sample from the intersection

This prevents extremely wide nuclei while maintaining adaptability.
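A rough sketch of the combination follows. It assumes the Top-K survivors are re-normalized before the Top-P cutoff is measured, which the steps above do not spell out and which real implementations may handle differently. The function name and the flat 250-token distribution are hypothetical.

```python
def top_k_then_top_p(probs, k, p):
    """Top-K first, then Top-P on the re-normalized survivors."""
    # Stage 1: keep at most k tokens, highest probability first.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    k_total = sum(prob for _, prob in ranked)

    # Stage 2: grow the nucleus inside the Top-K pool until it covers p.
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        prob /= k_total  # assumption: re-normalize between the two stages
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return {token: prob / cumulative for token, prob in kept}


# Hypothetical flat distribution: 250 tokens at 0.004 each.
flat = {f"tok{i}": 0.004 for i in range(250)}
print(len(top_k_then_top_p(flat, k=250, p=0.95)))  # Top-P effectively alone
print(len(top_k_then_top_p(flat, k=50, p=0.95)))   # capped by Top-K first
```

On this flat distribution Top-P alone would admit well over 200 tokens, while the K=50 cap keeps the pool under 50.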

Parameter Interaction with Temperature

The full sampling pipeline:

logits → divide by T (temperature) → softmax → Top-K filter → Top-P filter → re-normalize → sample
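The pipeline can be sketched end to end in standard-library Python. Treat this as an illustration under stated assumptions, not a reference implementation: `sample_next_token` and its default values are hypothetical, T=0 is handled as greedy decoding, and the Top-P cutoff is measured on the re-normalized Top-K pool.

```python
import math
import random


def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.9, rng=random):
    """logits -> temperature -> softmax -> Top-K -> Top-P -> sample."""
    if temperature <= 0:
        # Assumption: treat T=0 as greedy decoding to avoid dividing by zero.
        return max(logits, key=logits.get)

    # Temperature scaling, then a numerically stable softmax.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())

    # Top-K filter: keep the top_k most probable tokens.
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda item: item[1], reverse=True)[:top_k]
    k_total = sum(prob for _, prob in ranked)

    # Top-P filter on the re-normalized Top-K pool.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        prob /= k_total
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # random.choices normalizes the weights itself (final re-normalize step).
    tokens, weights = zip(*kept)
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical logits for four candidate tokens.
logits = {"Paris": 5.0, "France": 3.0, "Lyon": 2.0, "dog": -1.0}
print(sample_next_token(logits, temperature=1.0, top_k=50, top_p=0.9))
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which helps when comparing parameter settings.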

Combined recommendations:

| Use case | Temperature | Top-P | Top-K |
|----------|-------------|-------|-------|
| Factual/code | 0.0–0.2 | 1.0 | 1 (greedy) |
| General chat | 0.7–1.0 | 0.9 | 50 |
| Creative writing | 1.0–1.2 | 0.95 | 100 |
| Brainstorming | 1.2–1.5 | 1.0 | 100 |

Min-P Sampling (Newer Alternative)

Filters out tokens whose probability is below min_p × max_probability:

threshold = min_p × p_max_token

keep only tokens where p_i ≥ threshold

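Relative threshold: the cutoff scales with the top token's probability, so the pool narrows when the model is confident and widens when it is uncertain, much like Top-P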
min_p = 0.05 is a common default
Gaining adoption in open-source inference (llama.cpp, Ollama)
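The Min-P rule is short enough to sketch directly. The function name `min_p_filter`, the final re-normalization step, and both example distributions are assumptions for illustration. This is not the llama.cpp or Ollama code.

```python
def min_p_filter(probs, min_p=0.05):
    """Drop tokens whose probability is below min_p times the top probability."""
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    # Re-normalize the survivors so they can be sampled from.
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


# Hypothetical confident distribution: threshold = 0.05 * 0.85 = 0.0425.
confident = {"Paris": 0.85, "France": 0.06, "Lyon": 0.04, "dog": 0.03, "car": 0.02}
# Hypothetical flat distribution: threshold = 0.05 * 0.04 = 0.002.
flat = {f"tok{i}": 0.04 for i in range(25)}

print(list(min_p_filter(confident)))  # only tokens at or above 0.0425 survive
print(len(min_p_filter(flat)))        # every token clears the low threshold
```

A dominant top token raises the threshold and prunes the tail aggressively, while a flat distribution lowers it and lets most candidates through.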

Greedy vs. Sampling vs. Beam Search

| Strategy | Description | Deterministic? |
|----------|-------------|----------------|
| Greedy (T=0) | Always pick highest-prob token | Yes |
| Top-K/Top-P sampling | Sample from filtered distribution | No |
| Beam search | Maintain B candidates, pick best sequence | Yes (for B=1, same as greedy) |
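The determinism column can be checked with a toy comparison of greedy selection and sampling. The distribution and helper names below are made up for the example, and beam search is left out to keep the sketch short.

```python
import random

# Illustrative, already-filtered distribution over three tokens.
probs = {"Paris": 0.575, "France": 0.287, "Lyon": 0.138}


def greedy(probs):
    """Always return the single most probable token."""
    return max(probs, key=probs.get)


def sample(probs, rng):
    """Draw one token at random, weighted by probability."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]


print([greedy(probs) for _ in range(5)])       # the same token every time
rng = random.Random()                          # unseeded: varies between runs
print([sample(probs, rng) for _ in range(5)])  # a mix, mostly the top token
```

Seeding the generator (for example `random.Random(42)`) makes sampled output repeatable, which is how many tools offer reproducible runs without switching to greedy decoding.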

API Defaults (2024)

| Provider | Default temperature | Default top_p | Default top_k |
|----------|---------------------|---------------|---------------|
| Ollama | 0.8 | 0.9 | 40 |

Note: setting both temperature=1 and top_p=1 means sampling from the full, unscaled distribution (no restriction), provided Top-K is also disabled.

Related Concepts

Temperature, Logits and Softmax, Inference, Greedy Decoding, Sampling, Beam Search