Definition
Temperature is a hyperparameter that controls the randomness (or "creativity") of an LLM's output during token sampling. It scales the model's raw output scores (logits) before converting them to probabilities, thereby controlling how peaked or flat the probability distribution over the vocabulary is.
The Mathematics
After the final layer, the model produces logits — raw unnormalized scores for each vocabulary token.
Temperature is applied before the softmax:
```
probability(token_i) = softmax(logits / T)[i]
                     = exp(logit_i / T) / Σ_j exp(logit_j / T)
```
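As a minimal sketch of the formula above, the snippet below divides a list of logits by T and applies a softmax using only the Python standard library. The function name `softmax_with_temperature` and the example logits are illustrative assumptions, not taken from any particular model or library.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / T) as a list of probabilities."""
    if temperature <= 0:
        # Division by T = 0 is undefined; real systems special-case it as argmax.
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    # Subtracting the max improves numerical stability without changing the result.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-token vocabulary.
example_logits = [2.0, 1.0, 0.5, -1.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(example_logits, t)
    print(f"T={t}:", [round(p, 3) for p in probs])
```

Running it with a low, neutral, and high T should show the top token's share shrinking as T grows, while the probabilities still sum to 1.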
Effect of Temperature
T = 0 (Greedy)
- All probability mass goes to the single highest-scoring token
- Deterministic in principle: the same prompt should produce the same output (see "Temperature = 0 is NOT True Zero" for caveats)
- Very focused, but can be repetitive and "safe"
T < 1.0 (e.g., 0.2–0.7)
- Distribution becomes more peaked (high-probability tokens get even more probability)
- More deterministic, consistent, conservative outputs
- Good for: factual Q&A, code generation, structured data extraction
T = 1.0
- Uses the model's raw probability distribution as-is
- Balanced creativity and coherence
T > 1.0 (e.g., 1.2–2.0)
- Distribution becomes flatter (low-probability tokens become more likely)
- More random, creative, and surprising, but also less coherent
- Good for: creative writing, brainstorming, diverse outputs
- High temperatures can produce gibberish or off-topic content
Visualization
```
Vocab token probabilities at different temperatures (illustrative values):
Token:  "Paris"  "France"  "London"  "dog"  "purple"
T=0.1:  0.97     0.02      0.01      0.00   0.00
T=0.7:  0.65     0.20      0.12      0.02   0.01
T=1.0:  0.50     0.25      0.15      0.07   0.03
T=1.5:  0.35     0.28      0.20      0.12   0.05
T=2.0:  0.22     0.22      0.21      0.18   0.17
```
Recommended Temperature by Use Case
| Use Case | Recommended T | Reason |
|----------|--------------|--------|
| Code generation | 0.0–0.2 | Deterministic, correct syntax |
| Factual Q&A | 0.0–0.3 | Accurate, consistent facts |
| Structured data extraction | 0.0–0.2 | Consistent format |
| Summarization | 0.3–0.5 | Coherent, slight variation ok |
| Conversational chat | 0.7–1.0 | Natural, varied responses |
| Creative writing | 0.8–1.2 | Expressive, imaginative |
| Brainstorming / ideation | 1.0–1.4 | Diverse ideas |
| Poetry / experimental | 1.0–2.0 | Maximum creativity |
Temperature + Other Sampling Parameters
Temperature works alongside other sampling methods:
- Top-P: applied after temperature scaling; samples from the smallest set of top tokens whose cumulative probability reaches P
- Top-K: also applied after temperature; restricts sampling to the K most probable tokens
- Common production combination: temperature=0.7, top_p=0.9
For deterministic output: temperature=0, or equivalently top_k=1 (top_p=1 only disables nucleus filtering and does not by itself make sampling deterministic)
Temperature = 0 is NOT True Zero
Dividing the logits by T = 0 is undefined, so in practice temperature=0 is implemented as argmax (always pick the top token). Some slight non-determinism may still occur due to:
- GPU floating-point non-associativity
- Parallel computation ordering
- Different hardware backends
Temperature in System Design
For production applications:
- Set temperature per use case; don't use one global temperature
- Evaluation prompts: T=0 for reproducible eval results
- User-facing generation: T=0.7–1.0 for natural variation
- Multiple-completion APIs (OpenAI n parameter): high T to get diverse options
Relationship to Entropy
Temperature directly controls the entropy of the output distribution:
- Low T → low entropy (predictable)
- High T → high entropy (uncertain/random)
- "Creative" output corresponds to high entropy in information-theoretic terms
Related terms: Inference, Sampling, Top-P, Top-K, Greedy Decoding, Logits, Randomness
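The link between temperature and entropy can be checked numerically. The sketch below computes the Shannon entropy (in bits) of a temperature-scaled softmax at several values of T. The helper names `scaled_softmax` and `entropy_bits` and the logits are hypothetical, chosen only for illustration.

```python
import math

def scaled_softmax(logits, temperature):
    """Softmax over logits / T, with max-subtraction for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical logits for a 5-token vocabulary.
logits = [3.0, 2.0, 1.0, 0.0, -1.0]
temperatures = (0.2, 0.7, 1.0, 1.5, 2.0)
entropies = [entropy_bits(scaled_softmax(logits, t)) for t in temperatures]

for t, h in zip(temperatures, entropies):
    print(f"T={t}: entropy = {h:.3f} bits")
```

For these logits the entropy should rise with each increase in T while staying below log2(5), the entropy of a uniform distribution over five tokens.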
API Parameter Names
| Platform | Parameter Name |
|----------|---------------|
| OpenAI | temperature |
| Anthropic Claude | temperature |
| AWS Bedrock | temperature |
| HuggingFace | temperature |
| Ollama | temperature |
Range: 0.0–2.0 (OpenAI); 0.0–1.0 (Anthropic)
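Because the accepted range differs between platforms, an application that targets more than one provider may want to clamp the value before sending a request. The helper below is a hypothetical sketch, not part of any SDK. It encodes only the two ranges stated above and makes no API calls.

```python
# Ranges as stated in the text; other platforms are omitted because
# their limits are not given here.
TEMPERATURE_RANGES = {
    "openai": (0.0, 2.0),
    "anthropic": (0.0, 1.0),
}

def clamp_temperature(platform, value):
    """Clamp a requested temperature into the platform's stated range."""
    low, high = TEMPERATURE_RANGES[platform]
    return max(low, min(high, value))

print(clamp_temperature("openai", 1.4))
print(clamp_temperature("anthropic", 1.4))
```

Silent clamping is one design choice. Raising an error for out-of-range values may be preferable when the caller should be told that the requested setting is unsupported.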