Chain of Thought (CoT) — FDE@ProdAI Blog

Definition

Chain of Thought (CoT) prompting is a technique that encourages an LLM to produce intermediate reasoning steps before arriving at a final answer. By "thinking out loud," the model decomposes complex problems into manageable steps, dramatically improving accuracy on tasks requiring multi-step reasoning.

The Core Discovery

Introduced in the paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (Wei et al., Google Brain, 2022).

Without CoT:

Q: A store sells 10 apples at $0.50 each and 5 oranges at $0.75 each. What's the total?

A: $5.75 (often wrong without reasoning)

With CoT:

Q: A store sells 10 apples at $0.50 each and 5 oranges at $0.75 each. What's the total?

A: Let me calculate step by step.

Apples: 10 × $0.50 = $5.00

Oranges: 5 × $0.75 = $3.75

Total: $5.00 + $3.75 = $8.75

Answer: $8.75

Types of CoT Prompting

1. Few-Shot CoT (Original)

Provide examples that include reasoning chains:

Q: [Example problem]

A: [Step-by-step reasoning] → [Answer]

Q: [Example problem 2]

A: [Step-by-step reasoning] → [Answer]

Q: [New problem]

2. Zero-Shot CoT

Simply append a trigger phrase — no examples needed:

Q: [Problem]

A: Let's think step by step.

Effective trigger phrases:

"Let's think step by step."
"Think carefully before answering."
"Let's work through this systematically."
"First, let me break this down."

3. Auto-CoT

Automatically generate reasoning chains using the model itself, then use those as few-shot examples.

4. Self-Consistency CoT

1. Generate N independent reasoning chains (with temperature > 0)

2. Each chain may reach a different answer

3. Take the majority vote answer

4. More reliable than single-chain CoT

5. Tree of Thoughts (ToT)

Extend CoT into a tree structure
Model explores multiple reasoning branches simultaneously
Backtrack from dead ends
Best path is selected via search/evaluation

When CoT Helps Most

| Task Type | Benefit |

|-----------|---------|

| Multi-step arithmetic | High |

| Symbolic reasoning | High |

| Logic puzzles | High |

| Commonsense reasoning | Moderate-High |

| Code debugging | High |

| Multi-hop QA | High |

| Simple factual QA | Low (may hurt by adding noise) |

| Classification | Low-Moderate |

Mechanism: Why CoT Works

1. Scratchpad effect: intermediate steps serve as working memory the model can reference

2. Error decomposition: each small step is easier than the full problem

3. Self-checking: making steps explicit creates opportunities to catch errors

4. Attention shaping: reasoning tokens in the output condition subsequent token predictions

Key insight: the tokens in the reasoning chain are real computation — they literally influence what the model outputs next.

CoT and Model Scale

CoT is an emergent ability — it only works well on sufficiently large models:

< 10B params: minimal benefit
~10–50B params: moderate benefit
> 100B params: strong benefit
Smaller models trained with CoT data can learn this skill (distillation)

Modern LLMs and "Thinking" Tokens

Frontier models (OpenAI o1, Claude 3.5+, Gemini Thinking) implement extended CoT via dedicated thinking/reasoning tokens that are generated before the final answer:

Hidden from the user (internal scratchpad)
Can be thousands of tokens of intermediate reasoning
Dramatically improves complex reasoning benchmarks
Referred to as "extended thinking" or "reasoning models"

CoT in Practice

Prompt Pattern

You are a math tutor. When solving problems, always show your reasoning step by step.

For each step, explain why you're taking that step.

Problem: [user's problem]

Let me work through this step by step:

With Output Parsing

For structured output, separate the chain from the answer:

Reasoning: [full chain of thought]

Final Answer: [just the answer]

Limitations

| Limitation | Notes |

|------------|-------|

| Hallucinated reasoning | Model can produce fluent but wrong reasoning chains |

| Cost | Reasoning tokens consume context and increase latency |

| Not always better | Simple tasks → CoT adds noise, not signal |

| Scale dependency | Weak on small models |

| Faithfulness question | Reasoning may not reflect actual computation |

Related Concepts

Few-Shot, Zero-Shot, Prompt, Inference, Reasoning Models, Tree of Thoughts, Self-Consistency