Top-P and Top-K Sampling

Definition

Top-K and Top-P (nucleus sampling) are token filtering strategies applied after temperature scaling that restrict which tokens can be sampled as the next output — preventing the model from selecting very low-probability, incoherent tokens while preserving diversity.

The Problem They Solve

After temperature scaling, the full vocabulary distribution may include thousands of tokens with small but non-zero probabilities. Sampling from the full distribution can produce random, incoherent words. Top-K and Top-P restrict sampling to the "reasonable" candidates.

Top-K Sampling

Keep the K tokens with the highest probability; discard all others.

```
1. Sort tokens by probability (descending)
2. Keep only the top K tokens
3. Set all other probabilities to 0
4. Re-normalize the remaining K probabilities to sum to 1
5. Sample from the truncated distribution
```
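The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production sampler: the function names `top_k_filter` and `sample` are hypothetical, the distribution is a toy dictionary, and the `"<other>"` entry is an assumed stand-in for the rest of the vocabulary.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and re-normalize them to sum to 1."""
    # Steps 1-2: sort by probability (highest first) and keep the first k.
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    # Steps 3-4: dropped tokens are simply absent (probability 0);
    # the survivors are divided by their total so they sum to 1 again.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

def sample(probs, rng=random):
    """Step 5: draw one token from a {token: probability} mapping."""
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

# Toy distribution; "<other>" is an assumed placeholder for all remaining tokens.
probs = {"Paris": 0.50, "France": 0.25, "Lyon": 0.12,
         "dog": 0.08, "car": 0.03, "<other>": 0.02}

filtered = top_k_filter(probs, k=3)
next_token = sample(filtered)  # always one of Paris, France, Lyon
```

With `k=3`, only `Paris`, `France`, and `Lyon` can ever be sampled, and each kept probability is scaled by 1/0.87.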

Example with K=3:

```
Before: Paris(0.50), France(0.25), Lyon(0.12), dog(0.08), car(0.03), ...
After:  Paris(0.575), France(0.287), Lyon(0.138) [each divided by 0.87, the sum of the kept probabilities]
```

Problem with Top-K: K is fixed, but the natural distribution width varies:

  • When the model is confident (one obvious answer), K=50 still lets in unlikely tokens
  • When the model is uncertain (many valid options), K=50 may cut off valid choices

Top-P (Nucleus Sampling)

Keep the smallest set of tokens whose cumulative probability ≥ P.

```
1. Sort tokens by probability (descending)
2. Accumulate probabilities until sum ≥ P
3. Keep only those tokens
4. Discard the rest (zero them out)
5. Re-normalize and sample
```
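A minimal Python sketch of these steps follows. The function name `top_p_filter` and the toy distribution are assumptions made for illustration, and real inference engines operate on tensors rather than dictionaries.

```python
def top_p_filter(probs, p):
    """Keep the smallest top-ranked set whose cumulative probability reaches p."""
    # Step 1: sort tokens by probability, highest first.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    # Steps 2-4: accumulate until the running sum reaches p; the rest is discarded.
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Step 5 (first half): re-normalize the nucleus so it sums to 1.
    return {token: prob / cumulative for token, prob in kept}

# Toy "confident" distribution (assumed values for illustration).
probs = {"Paris": 0.85, "France": 0.06, "Lyon": 0.04, "dog": 0.03, "car": 0.02}
nucleus = top_p_filter(probs, p=0.9)  # keeps Paris and France (0.85 + 0.06 = 0.91)
```

Note that the token that crosses the threshold is included, so the kept mass is at least P rather than at most P.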

Example with P=0.9:

```
Confident case:
Paris(0.85), France(0.06) → cumsum = 0.91 ≥ 0.9 → keep 2 tokens
[Tight nucleus: prevents unlikely tokens]

Uncertain case:
"she"(0.08), "it"(0.07), "he"(0.07), "they"(0.06)... → need 20 tokens to reach 0.9
[Wide nucleus: allows diversity]
```
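To see the nucleus width change with confidence, the short sketch below counts how many tokens are needed to reach P for two toy distributions. The helper `nucleus_size` is hypothetical, and the "uncertain" distribution is an assumed flat one (25 tokens at 0.04 each), not the pronoun example above.

```python
def nucleus_size(probabilities, p):
    """Count how many top-ranked tokens are needed to reach cumulative mass p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probabilities, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probabilities)  # p was never reached: keep everything

confident = [0.85, 0.06, 0.04, 0.03, 0.02]  # one dominant token
uncertain = [0.04] * 25                      # assumed flat distribution

tight = nucleus_size(confident, p=0.9)  # 2
wide = nucleus_size(uncertain, p=0.9)   # 23
```

The same P yields a pool of 2 tokens in one case and 23 in the other, which is the adaptivity a fixed K cannot provide.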

Advantage: Top-P adapts dynamically to the model's confidence.

Comparison

| Aspect | Top-K | Top-P |
|--------|-------|-------|
| Pool size | Fixed K tokens | Variable (depends on the distribution) |
| Adapts to confidence | No | Yes |
| Behavior when certain | May include bad tokens | Tight nucleus |
| Behavior when uncertain | May exclude valid tokens | Wide nucleus |
| Default recommendation | Less preferred | Preferred (more adaptive) |
| Typical value | K=40–100 | P=0.9–0.95 |

Using Both Together

Top-K and Top-P can be combined:

1. Apply Top-K first → restricts to at most K tokens
2. Apply Top-P second → further restricts to cumulative P
3. Sample from the intersection

This prevents extremely wide nuclei while maintaining adaptability.
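The combination can be sketched as below. The function name `top_k_then_top_p` is hypothetical, and one detail is an assumption: here the cumulative mass for Top-P is measured on the original probabilities, whereas some implementations re-normalize after the Top-K step before applying Top-P.

```python
def top_k_then_top_p(probs, k, p):
    """Apply Top-K, then Top-P, and re-normalize the surviving tokens."""
    # Step 1: Top-K caps the pool at k tokens.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    # Step 2: Top-P trims that pool further if mass p is reached early.
    # Assumption: mass is measured on the original (un-renormalized) probabilities.
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return {token: prob / cumulative for token, prob in kept}

# Very flat toy distribution: 100 tokens at 0.01 each (assumed for illustration).
flat = {f"tok{i}": 0.01 for i in range(100)}
capped = top_k_then_top_p(flat, k=50, p=0.9)          # Top-K caps the pool at 50
uncapped = top_k_then_top_p(flat, k=len(flat), p=0.9)  # Top-P alone: roughly 90 tokens

# Peaked toy distribution: Top-P does the trimming instead.
peaked = {"Paris": 0.85, "France": 0.06, "Lyon": 0.04, "dog": 0.03, "car": 0.02}
tight = top_k_then_top_p(peaked, k=50, p=0.9)          # 2 tokens
```

On the flat distribution the Top-K cap is what limits the pool; on the peaked one Top-P is the binding constraint.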

Parameter Interaction with Temperature

The full sampling pipeline:

```
logits → divide by T (temperature) → softmax → Top-K filter → Top-P filter → re-normalize → sample
```
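The pipeline can be sketched end to end in plain Python. This is an illustrative sketch under stated assumptions, not any particular engine's implementation: the function names are hypothetical, the default parameter values are placeholders, and `temperature <= 0` is treated as greedy decoding to avoid dividing by zero.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, temperature=0.8, top_k=40, top_p=0.9, rng=random):
    """Return the index of the next token: logits -> T -> softmax -> Top-K -> Top-P -> sample."""
    if temperature <= 0:
        # Assumption: treat T=0 as greedy decoding (pick the highest logit).
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    # Top-K filter: indices of the top_k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Top-P filter: smallest prefix of that ranking reaching mass top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # random.choices treats weights as relative, so it re-normalizes implicitly.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

logits = [5.0, 2.0, 1.0, -3.0]  # toy logits for a 4-token vocabulary
token = sample_next_token(logits, rng=random.Random(0))
```

With these toy logits and T=0.8, the first token already holds about 97% of the mass, so the Top-P filter leaves it as the only candidate.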

Combined recommendations:

| Use Case | Temperature | Top-P | Top-K |
|----------|-------------|-------|-------|
| Factual/code | 0.0–0.2 | 1.0 | 1 (greedy) |
| General chat | 0.7–1.0 | 0.9 | 50 |
| Creative writing | 1.0–1.2 | 0.95 | 100 |
| Brainstorming | 1.2–1.5 | 1.0 | 100 |

Min-P Sampling (Newer Alternative)

Filters out tokens whose probability is below min_p × max_probability:

```
threshold = min_p × p_max_token
keep only tokens where p_i ≥ threshold
```

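  • Relative threshold: the cutoff scales with the top token's probability, so the pool narrows when the model is confident and widens when it is uncertain (adaptive like Top-P, but measured against the most likely token rather than cumulative mass)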
  • min_p = 0.05 is a common default
  • Gaining adoption in open-source inference (llama.cpp, Ollama)
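The Min-P rule can be sketched as follows. The function name `min_p_filter` is hypothetical, and the final re-normalization is an assumption carried over from the Top-K and Top-P steps rather than something stated in the rule itself.

```python
def min_p_filter(probs, min_p=0.05):
    """Drop tokens whose probability is below min_p times the top probability."""
    threshold = min_p * max(probs.values())
    kept = {token: p for token, p in probs.items() if p >= threshold}
    # Assumption: re-normalize the survivors before sampling.
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}

# Confident toy distribution: threshold = 0.05 * 0.85 = 0.0425.
confident = {"Paris": 0.85, "France": 0.06, "Lyon": 0.04, "dog": 0.03, "car": 0.02}
tight = min_p_filter(confident)  # keeps Paris and France

# Flat toy distribution: threshold = 0.05 * 0.04 = 0.002, so every token survives.
uncertain = {f"tok{i}": 0.04 for i in range(25)}
wide = min_p_filter(uncertain)   # keeps all 25 tokens
```

The top token always survives, since its probability is never below a fraction of itself.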

Greedy vs. Sampling vs. Beam Search

| Strategy | Description | Deterministic? |
|----------|-------------|----------------|
| Greedy (T=0) | Always pick highest-prob token | Yes |
| Top-K/Top-P sampling | Sample from filtered distribution | No |
| Beam search | Maintain B candidates, pick best sequence | Yes (for B=1, same as greedy) |
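The determinism column can be illustrated for the first two rows with a toy distribution. The names `greedy` and `sample` are hypothetical helpers, and the probabilities are assumed values. Beam search is omitted because it needs a model to score whole sequences.

```python
import random

probs = {"Paris": 0.6, "Lyon": 0.3, "Nice": 0.1}  # assumed toy distribution

def greedy(probs):
    """Always return the single most probable token."""
    return max(probs, key=probs.get)

def sample(probs, rng):
    """Draw a token in proportion to its probability."""
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

rng = random.Random(0)  # seeded only so this demo is repeatable
greedy_runs = {greedy(probs) for _ in range(100)}
sampled_runs = {sample(probs, rng) for _ in range(100)}
# greedy_runs contains only "Paris"; sampled_runs contains several different tokens.
```

Fixing the random seed makes a sampled run reproducible, but the output still varies from draw to draw, unlike greedy decoding.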

API Defaults (2024)

| Platform | Default Temperature | Default Top-P | Default Top-K |
|----------|---------------------|---------------|---------------|
| OpenAI | 1.0 | 1.0 (disabled) | Not exposed |
| Anthropic | 1.0 | Not set by default | Not set by default |
| AWS Bedrock | Model-dependent | Model-dependent | Model-dependent |
| Ollama | 0.8 | 0.9 | 40 |

Note: using both temperature=1 and top_p=1 means full-distribution sampling (no restriction).

Related Concepts

  • Temperature, Logits and Softmax, Inference, Greedy Decoding, Sampling, Beam Search
