Definition
Emergent abilities are capabilities that show up in LLMs only at sufficient scale: they are absent or near-random in smaller models, then appear sharply and unexpectedly in larger ones, without being explicitly trained for. They are called "emergent" because they arise from scale alone and cannot be predicted by simply extrapolating small-model behavior.
The Key Characteristic
Unlike smooth scaling (e.g., perplexity decreases smoothly with more parameters), emergent abilities show a phase-transition-like pattern:
```
Accuracy on task:
Small model (7B): ~5% (random guessing)
Medium model (70B): ~8% (still near-random)
Large model (175B): ~65% (suddenly works!)
Frontier (500B+): ~92%
```
The jump from near-random (~5–8%) to ~65% is "emergence": it happens at a threshold, not gradually.
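One way to make the "threshold" idea concrete is to look for the largest accuracy gain between consecutive model sizes. The sketch below is a minimal illustration that reuses the example numbers from the table above; the `results` list and the `largest_jump` helper are hypothetical names introduced here, not part of any benchmark tooling.

```python
# Illustrative accuracy-by-scale values copied from the example table above.
results = [
    ("7B", 5.0),
    ("70B", 8.0),
    ("175B", 65.0),
    ("500B+", 92.0),
]


def largest_jump(points):
    """Return (smaller_model, larger_model, gain) for the biggest
    accuracy gain between two consecutive model sizes."""
    best = None
    for (prev_name, prev_acc), (next_name, next_acc) in zip(points, points[1:]):
        gain = next_acc - prev_acc
        if best is None or gain > best[2]:
            best = (prev_name, next_name, gain)
    return best


print(largest_jump(results))  # ('70B', '175B', 57.0)
```

A smoothly scaling metric would spread its gains across all size steps; here almost all of the improvement sits in a single step, which is the pattern the term "emergence" describes.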
Famous Examples of Emergent Abilities
In-Context Learning (Few-Shot)
- Small models cannot generalize from examples in the prompt
- At ~100B+ parameters, models suddenly "get" the pattern from 3–5 examples
- Enables zero-shot and few-shot prompting as practical techniques
Chain-of-Thought Reasoning
- Small models produce incoherent reasoning when asked to "think step by step"
- At ~100B+ parameters, step-by-step reasoning becomes coherent and useful
- Enables the entire CoT prompting paradigm
Multi-Step Arithmetic
- Smaller models fail at "if A=3 and B=A+2 and C=B×2, what is C?"
- Emerges reliably at larger scale
- Connects to the effectiveness of reasoning models
Instruction Following
- Base models don't naturally follow instructions (they complete text)
- After instruction fine-tuning at sufficient scale, models generalize to novel instructions
- Enables zero-shot task execution with natural language
Code Generation
- Smaller models produce syntactically invalid code
- At scale, models produce working code for complex specifications
Calibrated Uncertainty
- Small models confidently state wrong answers
- Large models more reliably say "I don't know" for uncertain questions
Multi-lingual Transfer
- Model trained mostly on English shows emergent understanding of other languages
- Better English reasoning improves reasoning in other languages
Why Emergence Happens
The "Threshold" Hypothesis
Tasks require a minimum level of capability to succeed at all:
- Sub-threshold: model fails completely (0% success)
- At threshold: all prerequisite skills combine to enable the task (~100% suddenly)
- Benchmarks that score 0 until all prerequisites are met create the appearance of abruptness
Multi-Skill Composition
Complex tasks require multiple sub-skills, each of which scales smoothly:
- Sub-skill A: 80% reliable
- Sub-skill B: 80% reliable
- Combined success: only ~64% (0.8 × 0.8) when a task needs both, assuming the skills are independent
- At sufficient scale, all sub-skills become reliable → composed task becomes reliable
Metric Discontinuities
Some metrics show sudden jumps even if the underlying capability is smooth:
- Exact-match metrics: 99% correct but wrong on last step = 0% score
- Benchmark design creates apparent discontinuities in otherwise smooth scaling
The Debate: Are Emergent Abilities Real?
Skeptical View (Schaeffer et al., 2023):
- "Emergent abilities" may be an artifact of discontinuous metrics
- When continuous metrics are used, abilities often scale smoothly
- Apparent emergence = metric threshold, not actual phase transition
Confirming View:
- Some capabilities genuinely appear to be absent then present
- Induction heads in small vs. large models appear structurally different
- In-context learning in particular has strong evidence of actual emergence
Practical conclusion: Regardless of mechanism, some capabilities empirically appear reliably only at large scale. This holds in practice even while the theoretical explanation is debated.
Implications for Practitioners
Model Selection
- Don't expect small models to perform well on tasks requiring complex reasoning
- Emergent abilities define the minimum viable model size for certain tasks
- Rule of thumb: complex multi-step reasoning → 70B+ or API-only models
Fine-Tuning Limits
- You cannot fine-tune emergent abilities into a model that lacks the scale for them
- Fine-tuning builds on existing capabilities; it cannot create capabilities the model lacks
- Solution: use a larger base model
The Surprising Generalization
Emergent abilities often transfer to tasks not in training:
- A model trained on text learns to solve analogies, do arithmetic, follow instructions
- These were not explicitly trained; they emerged from learning language patterns
Current Frontier: Reasoning Emergence
Modern reasoning models (o1, DeepSeek-R1) suggest a new type of emergence:
- Training with RL on reasoning tasks creates qualitatively new capabilities
- Chain-of-thought reasoning as an emergent strategy, not a trained skill
- "Aha moments" where models discover problem-solving shortcuts through RL
Related topics: Scaling Laws, Reasoning Models, Chain of Thought, In-Context Learning, Parameters, Benchmarks
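As a closing sketch, the multi-skill composition and metric-discontinuity arguments made in this document can be checked with a few lines of arithmetic. The example assumes every step of a task succeeds independently with the same probability, which is a simplification, and `exact_match_accuracy` is a hypothetical helper introduced for illustration. Under that assumption, an all-or-nothing score is the per-step accuracy raised to the number of steps.

```python
def exact_match_accuracy(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that every step is correct, assuming independent steps
    that each succeed with the same probability."""
    return per_step_accuracy ** num_steps


# Two sub-skills at 80% each: only about 64% of attempts get both right.
print(round(exact_match_accuracy(0.8, 2), 2))  # 0.64

# Hypothetical per-step accuracies that improve smoothly, scored with an
# all-or-nothing metric on a 10-step task.
for p in (0.5, 0.7, 0.9, 0.99):
    print(p, round(exact_match_accuracy(p, 10), 3))
# 0.5 0.001
# 0.7 0.028
# 0.9 0.349
# 0.99 0.904
```

The per-step accuracy rises steadily, yet the exact-match score stays near zero and then climbs steeply. This is the sense in which a discontinuous metric can make smooth underlying progress look like a sudden jump.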