Temperature — FDE@ProdAI Blog

Definition

Temperature is a hyperparameter that controls the randomness (or "creativity") of an LLM's output during token sampling. It scales the model's raw output scores (logits) before converting them to probabilities, thereby controlling how peaked or flat the probability distribution over the vocabulary is.

The Mathematics

After the final layer, the model produces logits — raw unnormalized scores for each vocabulary token.

Temperature is applied before the softmax:

probability(token_i) = softmax(logits / T)[i]

= exp(logit_i / T) / Σ exp(logit_j / T)
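
The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular library's implementation; the function name `softmax_with_temperature` and the example logits are made up for the demo.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits into probabilities after dividing by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat T=0 as argmax instead")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    # Softmax is shift-invariant, so the result is unchanged.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [2.0, 1.0, 0.5]  # made-up scores for three tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(example_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it should show the top token's share shrinking as T grows, while each row still sums to 1.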

Effect of Temperature

T = 0 (Greedy)

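All probability mass on the single highest-scoring token (the limit of the formula as T → 0; the formula itself is undefined at T = 0)
Deterministic in principle: the same prompt should produce the same output, apart from small hardware-level numerical effects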
Very focused, but can be repetitive and "safe"

T < 1.0 (e.g., 0.2–0.7)

Distribution becomes more peaked (high-prob tokens get even more probability)
More deterministic, consistent, conservative outputs
Good for: factual Q&A, code generation, structured data extraction

T = 1.0

Uses the model's probability distribution as-is (dividing the logits by 1 changes nothing)
Balanced creativity and coherence

T > 1.0 (e.g., 1.2–2.0)

Distribution becomes flatter (low-prob tokens become more likely)
More random, creative, surprising, but also more incoherent
Good for: creative writing, brainstorming, diverse outputs
High temperatures can produce gibberish or off-topic content

Visualization

Vocab token probabilities at different temperatures (illustrative values showing the trend, not computed from a single set of logits):

Token:  "Paris"  "France"  "London"  "dog"  "purple"
T=0.1:  0.97     0.02      0.01      0.00   0.00
T=0.7:  0.65     0.20      0.12      0.02   0.01
T=1.0:  0.50     0.25      0.15      0.07   0.03
T=1.5:  0.35     0.28      0.20      0.12   0.05
T=2.0:  0.22     0.22      0.21      0.18   0.17
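
To see what the formula actually gives, one option is to take the T=1.0 row as the starting point and rescale it. Since log(p_i) equals the logit up to a constant that softmax ignores, the distribution at temperature T is proportional to p_i^(1/T). The sketch below assumes that starting row; the helper name `rescale` is hypothetical.

```python
import math

tokens = ["Paris", "France", "London", "dog", "purple"]
base_probs = [0.50, 0.25, 0.15, 0.07, 0.03]  # the T=1.0 row from the table

def rescale(probs, temperature):
    """Return the distribution at `temperature`, given the distribution at T=1."""
    # log(p_i) stands in for the logit; the missing constant cancels in softmax.
    weights = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(weights)
    return [w / total for w in weights]

print("Token: " + "  ".join(tokens))
for t in (0.1, 0.7, 1.0, 1.5, 2.0):
    row = rescale(base_probs, t)
    print(f"T={t}: " + "  ".join(f"{p:.2f}" for p in row))
```

The computed rows follow the same trend as the table but will not match it digit for digit, because the table's numbers were chosen by hand for illustration.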

Recommended Temperature by Use Case

| Use Case | Recommended T | Reason |
|----------|--------------|--------|
| Code generation | 0.0–0.2 | Deterministic, correct syntax |
| Factual Q&A | 0.0–0.3 | Accurate, consistent facts |
| Structured data extraction | 0.0–0.2 | Consistent format |
| Summarization | 0.3–0.5 | Coherent, slight variation ok |
| Conversational chat | 0.7–1.0 | Natural, varied responses |
| Creative writing | 0.8–1.2 | Expressive, imaginative |
| Brainstorming / ideation | 1.0–1.4 | Diverse ideas |
| Poetry / experimental | 1.0–2.0 | Maximum creativity |

Temperature + Other Sampling Parameters

Temperature works alongside other sampling methods:

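Top-P (nucleus sampling): typically applied after temperature scaling; samples only from the smallest set of tokens whose cumulative probability reaches P
Top-K: typically also applied after temperature; restricts sampling to the K most probable tokens
The exact order of these steps depends on the inference stack, so check your framework's documentation
Common production combination: temperature=0.7, top_p=0.9

For deterministic output: temperature=0 (or, equivalently, top_k=1; note that top_p=1 only disables nucleus filtering and does not make sampling deterministic)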
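
Here is a rough sketch of how the three parameters can fit together, assuming the order temperature, then top-k, then top-p. Real inference stacks may order or implement these steps differently; `sample_token` and its defaults are hypothetical names chosen for this example.

```python
import math
import random

def sample_token(logits, temperature=0.7, top_k=None, top_p=0.9, rng=random):
    """Sample a token index: temperature scaling, then top-k, then top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most probable tokens.
    if top_k is not None:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices normalizes the weights of the surviving tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

print(sample_token([3.0, 1.0, 0.2, -1.0]))
```

With `top_k=1` only the top-ranked token survives the filter, so the result is the greedy choice regardless of temperature.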

Temperature = 0 is NOT True Zero

Dividing the logits by T = 0 is undefined, so in practice temperature=0 is implemented as argmax (always pick the top token) rather than as a literal division. Some slight non-determinism may still occur due to:

GPU floating-point non-associativity
Parallel computation ordering
Different hardware backends
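
A small sketch of both points, under the assumption that a sampler special-cases T=0 (the name `choose_token` is hypothetical). The last lines show floating-point non-associativity on ordinary Python floats: the same three numbers summed in a different order give a different result, which is the kind of effect that can flip a near-tie between two top tokens.

```python
import math
import random

def choose_token(logits, temperature, rng=random):
    """Pick a token index; temperature == 0 falls back to greedy argmax."""
    if temperature == 0:
        # Division by zero is undefined, so select the top-scoring token directly.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

print(choose_token([0.1, 2.5, 1.0], 0))  # greedy: index of the largest logit

# Floating-point addition is not associative: grouping changes the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right, left, right)
```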

Temperature in System Design

For production applications:

Set temperature per use case — don't use one global temperature
Evaluation prompts: T=0 for reproducible eval results
User-facing generation: T=0.7–1.0 for natural variation
Requesting several completions per prompt (e.g., OpenAI's n parameter): use a higher T so the returned options actually differ
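
One way to avoid a single global temperature is a small lookup keyed by task. The sketch below is only an illustration: the task names, the `temperature_for` helper, and the fallback value are hypothetical, and each number is one pick from the ranges suggested in this post.

```python
# Hypothetical task names; each value is one pick from the ranges in this post.
TASK_TEMPERATURES = {
    "evaluation": 0.0,
    "code_generation": 0.1,
    "structured_extraction": 0.1,
    "factual_qa": 0.2,
    "summarization": 0.4,
    "chat": 0.8,
    "creative_writing": 1.0,
    "brainstorming": 1.2,
}

def temperature_for(task, fallback=0.7):
    """Return the configured temperature for a task, or a fallback if unknown."""
    return TASK_TEMPERATURES.get(task, fallback)

print(temperature_for("code_generation"))
print(temperature_for("some_new_task"))
```

Keeping the mapping in one place makes it easy to tune a single use case without touching the others.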

Relationship to Entropy

For a fixed set of logits, raising the temperature never lowers the entropy of the output distribution:

Low T → low entropy (predictable)
High T → high entropy (uncertain/random)
Loosely speaking, "creative" corresponds to "high entropy" in information-theoretic terms

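The entropy relationship can be checked numerically. This is a minimal sketch with made-up logits; `softmax` and `entropy_bits` are illustrative helper names, and entropy is measured in bits (log base 2).

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of logits (temperature > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four tokens
for t in (0.2, 0.7, 1.0, 2.0):
    print(f"T={t}: entropy = {entropy_bits(softmax(logits, t)):.3f} bits")
```

With four tokens the entropy can never exceed log2(4) = 2 bits, which is the value a perfectly flat distribution would reach.
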
API Parameter Names

| Platform | Parameter Name |
|----------|---------------|
| OpenAI | temperature |
| Anthropic Claude | temperature |
| AWS Bedrock | temperature |
| HuggingFace | temperature |
| Ollama | temperature |

Range: 0.0–2.0 (OpenAI); 0.0–1.0 (Anthropic)
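
Because the accepted range differs by provider, code that targets more than one API may want to clamp the value before sending it. A minimal sketch, assuming the ranges quoted above still hold (check each provider's current documentation); `TEMPERATURE_RANGES` and `clamp_temperature` are hypothetical names, not part of any SDK.

```python
# Ranges as quoted in this post; verify against each provider's current docs.
TEMPERATURE_RANGES = {
    "openai": (0.0, 2.0),
    "anthropic": (0.0, 1.0),
}

def clamp_temperature(provider, temperature):
    """Clamp a requested temperature into the provider's accepted range."""
    low, high = TEMPERATURE_RANGES[provider]
    return min(max(temperature, low), high)

def sampling_params(provider, temperature):
    """Build a parameter dict using the shared `temperature` key."""
    return {"temperature": clamp_temperature(provider, temperature)}

print(sampling_params("openai", 1.4))
print(sampling_params("anthropic", 1.4))
```

Silent clamping is a design choice; raising an error instead may be preferable when an out-of-range value signals a configuration mistake.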

Related Concepts

Inference, Sampling, Top-P, Top-K, Greedy Decoding, Logits, Randomness