Definition
Chain of Thought (CoT) prompting is a technique that encourages an LLM to produce intermediate reasoning steps before arriving at a final answer. By "thinking out loud," the model decomposes complex problems into manageable steps, which can substantially improve accuracy on tasks that require multi-step reasoning.
The Core Discovery
CoT prompting was introduced in the paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (Wei et al., Google Brain, 2022).
Without CoT:
```
Q: A store sells 10 apples at $0.50 each and 5 oranges at $0.75 each. What's the total?
A: $5.75 (often wrong without reasoning)
```
With CoT:
```
Q: A store sells 10 apples at $0.50 each and 5 oranges at $0.75 each. What's the total?
A: Let me calculate step by step.
Apples: 10 × $0.50 = $5.00
Oranges: 5 × $0.75 = $3.75
Total: $5.00 + $3.75 = $8.75
Answer: $8.75
```
Types of CoT Prompting
1. Few-Shot CoT (Original)
Provide examples that include reasoning chains:
```
Q: [Example problem]
A: [Step-by-step reasoning] → [Answer]
Q: [Example problem 2]
A: [Step-by-step reasoning] → [Answer]
Q: [New problem]
A:
```
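As a rough illustration of the template above, the following Python sketch assembles a few-shot CoT prompt from worked examples. The `CoTExample` class, the `build_few_shot_cot_prompt` function, and the "The answer is ..." phrasing are hypothetical choices made for this sketch, not part of any library or of the original paper's code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CoTExample:
    """One worked demonstration (hypothetical helper type)."""
    question: str
    reasoning: str
    answer: str


def build_few_shot_cot_prompt(examples: List[CoTExample], new_question: str) -> str:
    """Format worked examples, then the new question with an open answer slot."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}."
        for ex in examples
    ]
    # The trailing "A:" is left open so the model continues with its own chain.
    blocks.append(f"Q: {new_question}\nA:")
    return "\n\n".join(blocks)


demo = CoTExample(
    question="A store sells 10 apples at $0.50 each and 5 oranges at $0.75 each. What's the total?",
    reasoning="Apples: 10 x $0.50 = $5.00. Oranges: 5 x $0.75 = $3.75. Total: $5.00 + $3.75 = $8.75.",
    answer="$8.75",
)
prompt = build_few_shot_cot_prompt(
    [demo], "A pack of 4 pens costs $6.00. How much do 10 pens cost?"
)
print(prompt)
```

The resulting string can be sent to whatever completion endpoint is in use; the demonstrations show the model the expected reasoning format before it sees the new question.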
2. Zero-Shot CoT
Simply append a trigger phrase — no examples needed:
```
Q: [Problem]
A: Let's think step by step.
```
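A minimal sketch of the zero-shot pattern in Python, assuming a plain-text completion interface; `build_zero_shot_cot_prompt` and `ZERO_SHOT_TRIGGER` are hypothetical names introduced here for illustration.

```python
ZERO_SHOT_TRIGGER = "Let's think step by step."


def build_zero_shot_cot_prompt(question: str, trigger: str = ZERO_SHOT_TRIGGER) -> str:
    """Append a reasoning trigger to the answer slot; no examples are included."""
    return f"Q: {question}\nA: {trigger}"


zero_shot_prompt = build_zero_shot_cot_prompt("What is 17 * 3?")
print(zero_shot_prompt)
```

Passing a different `trigger` makes it easy to compare alternative phrasings on the same set of questions.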
Effective trigger phrases:
- "Let's think step by step."
- "Think carefully before answering."
- "Let's work through this systematically."
- "First, let me break this down."
3. Auto-CoT
Automatically generate reasoning chains using the model itself, then use those as few-shot examples.
4. Self-Consistency CoT
1. Generate N independent reasoning chains (with temperature > 0)
2. Each chain may reach a different answer
3. Take the majority-vote answer
4. The result is typically more reliable than single-chain CoT
5. Tree of Thoughts (ToT)
- Extends CoT into a tree structure
- Model explores multiple reasoning branches simultaneously
- Backtracks from dead ends
- Best path is selected via search/evaluation
When CoT Helps Most
| Task Type | Benefit |
|-----------|---------|
| Multi-step arithmetic | High |
| Symbolic reasoning | High |
| Logic puzzles | High |
| Commonsense reasoning | Moderate-High |
| Code debugging | High |
| Multi-hop QA | High |
| Simple factual QA | Low (may hurt by adding noise) |
| Classification | Low-Moderate |
Mechanism: Why CoT Works
1. Scratchpad effect: intermediate steps serve as working memory the model can reference
2. Error decomposition: each small step is easier than the full problem
3. Self-checking: making steps explicit creates opportunities to catch errors
4. Attention shaping: reasoning tokens in the output condition subsequent token predictions
Key insight: the tokens in the reasoning chain are real computation. Because generation is autoregressive, each reasoning token directly influences what the model outputs next.
CoT and Model Scale
CoT is an emergent ability: it works well only on sufficiently large models. Approximate benefit by parameter count:
- < 10B params: minimal benefit
- ~10–50B params: moderate benefit
- > 100B params: strong benefit
- Smaller models trained with CoT data can learn this skill (distillation)
Modern LLMs and "Thinking" Tokens
Frontier models (OpenAI o1, Claude models with extended thinking, Gemini Thinking) implement extended CoT via dedicated thinking/reasoning tokens that are generated before the final answer. This reasoning is:
- Often hidden from the user (internal scratchpad)
- Potentially thousands of tokens of intermediate reasoning
- Associated with substantial gains on complex reasoning benchmarks
- Referred to as "extended thinking"; models that use it are called "reasoning models"
Related Terms
- Few-Shot, Zero-Shot, Prompt, Inference, Reasoning Models, Tree of Thoughts, Self-Consistency
CoT in Practice
Prompt Pattern
```
You are a math tutor. When solving problems, always show your reasoning step by step.
For each step, explain why you're taking that step.
Problem: [user's problem]
Let me work through this step by step:
```
With Output Parsing
For structured output, separate the chain from the answer:
```
Reasoning: [full chain of thought]
Final Answer: [just the answer]
```
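One way to consume this format is sketched below in Python: a regular expression pulls out the `Final Answer:` line, and the extracted answers from several sampled completions are combined by majority vote, as in Self-Consistency CoT. The function names and the sample completions are hypothetical and written only for this example; real model output may need more tolerant parsing.

```python
import re
from collections import Counter
from typing import List, Optional

# Matches a line of the form "Final Answer: <text>".
FINAL_ANSWER_RE = re.compile(r"^Final Answer:[ \t]*(.+)$", re.MULTILINE)


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Final Answer:' line, or None if absent."""
    matches = FINAL_ANSWER_RE.findall(completion)
    return matches[-1].strip() if matches else None


def majority_vote(completions: List[str]) -> Optional[str]:
    """Pick the most common parsed answer; ties go to the answer seen first."""
    answers = [a for a in map(extract_final_answer, completions) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Hand-written stand-ins for N sampled reasoning chains.
samples = [
    "Reasoning: 10 x 0.50 = 5.00; 5 x 0.75 = 3.75; sum is 8.75\nFinal Answer: $8.75",
    "Reasoning: 5.00 + 3.75 = 8.75\nFinal Answer: $8.75",
    "Reasoning: 10 x 0.50 = 5.00; oranges miscounted as 0.75\nFinal Answer: $5.75",
]
print(majority_vote(samples))  # prints $8.75
```

Keeping the reasoning and the answer on separate labeled lines means the chain can be logged or discarded independently of the value passed downstream.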
Limitations
| Limitation | Notes |
|------------|-------|
| Hallucinated reasoning | Model can produce fluent but wrong reasoning chains |
| Cost | Reasoning tokens consume context and increase latency |
| Not always better | Simple tasks → CoT adds noise, not signal |
| Scale dependency | Weak on small models |
| Faithfulness question | Reasoning may not reflect actual computation |