Emergent Abilities — FDE@ProdAI Blog

Definition

Emergent abilities are capabilities that appear in LLMs only at sufficient scale: they are absent or near-random in smaller models, then appear sharply and unexpectedly in larger ones, without the model being explicitly trained for them. They are called "emergent" because they arise as scale increases and cannot be predicted by simply extrapolating small-model behavior.

The Key Characteristic

Unlike smooth scaling (e.g., perplexity decreases smoothly with more parameters), emergent abilities show a phase-transition-like pattern:

Illustrative accuracy on a hypothetical task:

Small model (7B): ~5% (random guessing)

Medium model (70B): ~8% (still near-random)

Large model (175B): ~65% (suddenly works!)

Frontier (500B+): ~92%

The jump from ~8% to ~65% between the medium and large models is the "emergence": it happens at a threshold, not gradually.
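To make the pattern concrete, here is a minimal Python sketch that finds the steepest segment in the illustrative numbers above, measured in accuracy points gained per 10x increase in parameters. The data points are the example's own figures, with 500 standing in for "500B+" as an assumption, and `steepest_segment` is a hypothetical helper written for this post.

```python
import math

# Illustrative (parameters in billions, accuracy in %) pairs from the example.
# 500 stands in for "500B+" and is an assumption.
points = [(7, 5.0), (70, 8.0), (175, 65.0), (500, 92.0)]

def steepest_segment(points):
    """Return (start_size, end_size, slope) for the steepest segment.

    Slope is accuracy points gained per 10x increase in parameter count.
    """
    best = None
    for (n0, a0), (n1, a1) in zip(points, points[1:]):
        slope = (a1 - a0) / math.log10(n1 / n0)
        if best is None or slope > best[2]:
            best = (n0, n1, slope)
    return best

start, end, slope = steepest_segment(points)
print(f"Steepest gain: {start}B -> {end}B, about {slope:.0f} points per 10x parameters")
```

On these numbers the 7B to 70B segment gains only 3 points per 10x, while the 70B to 175B segment is steeper by well over an order of magnitude. That ratio is what "phase-transition-like" means in practice.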

Famous Examples of Emergent Abilities

In-Context Learning (Few-Shot)

Small models cannot generalize from examples in the prompt
At ~100B+ parameters, models suddenly "get" the pattern from 3–5 examples
Enables few-shot prompting as a practical technique (zero-shot prompting, by contrast, supplies no examples)
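A few-shot prompt is just demonstrations followed by a new query. The sketch below shows one plausible way to assemble one. The `Input:`/`Output:` layout, the function name, and the translation examples are all assumptions for illustration, not a format any particular model requires.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Format (input, output) demonstration pairs followed by the new query."""
    blocks = [instruction] if instruction else []
    for source, target in examples:
        blocks.append(f"Input: {source}\nOutput: {target}")
    # Leave the final output empty so the model completes it.
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

# Hypothetical demonstrations: three examples, then one new query.
examples = [("cheese", "fromage"), ("house", "maison"), ("book", "livre")]
prompt = build_few_shot_prompt(examples, "water", "Translate English to French.")
print(prompt)
```

A model with in-context learning continues the pattern after the final `Output:`. A model without it tends to produce unrelated text.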

Chain-of-Thought Reasoning

Small models produce incoherent reasoning when asked to "think step by step"
At ~100B+ parameters, step-by-step reasoning becomes coherent and useful
Enables the entire CoT prompting paradigm
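In practice, CoT prompting usually means appending a trigger phrase and then parsing the final answer out of the model's reasoning. Here is a minimal sketch, assuming the common "Let's think step by step" phrasing and a hand-written completion in place of real model output. The helper names are hypothetical.

```python
import re

def direct_prompt(question):
    """Ask for the answer with no intermediate reasoning."""
    return f"Q: {question}\nA:"

def cot_prompt(question):
    """Append a step-by-step trigger phrase (an assumed, commonly used variant)."""
    return f"Q: {question}\nA: Let's think step by step."

def extract_final_number(completion):
    """Return the last number in a completion, or None if there is none."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return float(numbers[-1]) if numbers else None

question = "If A=3 and B=A+2 and C=B*2, what is C?"
# Hand-written stand-in for a model completion, not real model output.
completion = "A is 3. B = A + 2 = 5. C = B * 2 = 10. The answer is 10."
answer = extract_final_number(completion)
print(cot_prompt(question))
print(answer)
```

Taking the last number is a crude heuristic. It works when the completion ends with the answer and breaks otherwise, so a real evaluation harness would need something stricter.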

Multi-Step Arithmetic

Smaller models fail at "if A=3 and B=A+2 and C=B×2, what is C?"
Emerges reliably at larger scale
Helps explain why reasoning models, which write out intermediate steps, do well on such problems

Instruction Following

Base models don't naturally follow instructions (they complete text)
After instruction fine-tuning at sufficient scale, models generalize to novel instructions
Enables zero-shot task execution with natural language

Code Generation

Smaller models produce syntactically invalid code
At scale, models produce working code for complex specifications

Calibrated Uncertainty

Small models confidently state wrong answers
Large models more reliably say "I don't know" for uncertain questions
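"Calibrated" has a measurable meaning: stated confidence should match observed accuracy. One common way to quantify the gap is a binned expected calibration error (ECE). The sketch below is a simplified version with equal-width bins. The confidence values and outcomes are made-up data, not measurements from any model.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(mean_conf - accuracy)
    return ece

# Hypothetical data: both models are right half the time.
outcomes = [True, False, False, True]
overconfident = expected_calibration_error([0.95] * 4, outcomes)
calibrated = expected_calibration_error([0.5] * 4, outcomes)
print(f"overconfident ECE: {overconfident:.2f}, calibrated ECE: {calibrated:.2f}")
```

Both hypothetical models have the same accuracy. Only the one whose confidence matches that accuracy scores near zero.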

Multi-lingual Transfer

A model trained mostly on English text shows emergent understanding of other languages
Improvements in English reasoning tend to carry over to reasoning in other languages

Why Emergence Happens

The "Threshold" Hypothesis

Tasks require a minimum level of capability to succeed at all:

Sub-threshold: the model fails completely (0% success)
At threshold: all prerequisite skills combine to enable the task, and success jumps toward ~100%
Benchmarks that score 0 until all prerequisites are met create the appearance of abruptness

Multi-Skill Composition

Complex tasks require multiple sub-skills, each of which scales smoothly:

Sub-skill A: the model succeeds 80% of the time
Sub-skill B: the model succeeds 80% of the time
Combined success: if the two are independent, both succeed only 0.8 × 0.8 = 64% of the time
At sufficient scale, all sub-skills become reliable → composed task becomes reliable
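The composition argument is easy to check numerically. This sketch assumes every sub-skill succeeds independently with the same probability, which is a simplification, and shows how a task needing many sub-skills stays near zero until per-skill reliability is very high.

```python
def composed_success(per_skill_success, n_skills):
    """Probability that all n_skills independent sub-skills succeed."""
    return per_skill_success ** n_skills

# Per-skill reliability improves smoothly; the composed task does not.
for p in (0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
    print(
        f"per-skill {p:.2f} -> "
        f"2 skills {composed_success(p, 2):.2f}, "
        f"10 skills {composed_success(p, 10):.2f}"
    )
```

With two sub-skills at 0.8 each, the composed success rate is 0.64. With ten sub-skills it is about 0.11 at 0.8 per skill and about 0.90 at 0.99 per skill. Smooth gains in each sub-skill therefore look like a sudden jump on the composed task.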

Metric Discontinuities

Some metrics show sudden jumps even if the underlying capability is smooth:

Exact-match metrics: an answer that is 99% correct but wrong on the last step scores 0%
Benchmark design creates apparent discontinuities in otherwise smooth scaling
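The effect of metric choice shows up on a single prediction. The sketch below scores the same near-miss answer with an all-or-nothing metric and with a partial-credit one. Both scoring functions are simplified illustrations written for this post, not any benchmark's official scorer.

```python
def exact_match(prediction, target):
    """Return 1.0 only if every token matches, else 0.0."""
    return 1.0 if prediction == target else 0.0

def token_accuracy(prediction, target):
    """Fraction of positions that match, penalizing length mismatches."""
    if not prediction and not target:
        return 1.0
    matches = sum(p == t for p, t in zip(prediction, target))
    return matches / max(len(prediction), len(target))

target = list("1234567890")
prediction = list("1234567891")  # wrong only on the final digit

print(exact_match(prediction, target))     # all-or-nothing score
print(token_accuracy(prediction, target))  # partial-credit score
```

The near-miss scores 0.0 on exact match and 0.9 on token accuracy. A model family that improves steadily on the second metric can look flat on the first, then appear to jump.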

The Debate: Are Emergent Abilities Real?

Skeptical View (Schaeffer et al., 2023):

"Emergent abilities" may be an artifact of discontinuous metrics
When continuous metrics are used, abilities often scale smoothly
Apparent emergence = metric threshold, not actual phase transition

Confirming View:

Some capabilities genuinely appear to be absent then present
Induction heads, attention circuits linked to in-context learning, have been reported to form abruptly during training rather than gradually
In-context learning in particular has strong evidence of actual emergence

Practical conclusion: whatever the mechanism, some capabilities are observed reliably only at large scale. That holds in practice even while the theoretical explanation is debated.

Implications for Practitioners

Model Selection

Don't expect small models to perform well on tasks requiring complex reasoning
Emergent abilities define the minimum viable model size for certain tasks
Rule of thumb: complex multi-step reasoning → 70B+ or API-only models

Fine-Tuning Limits

You generally cannot fine-tune an emergent ability into a model that lacks the scale for it
Fine-tuning builds on existing capabilities; it rarely creates capabilities the base model lacks
Solution: use a larger base model

The Surprising Generalization

Emergent abilities often transfer to tasks not in training:

A model trained only on text learns to solve analogies, do arithmetic, and follow instructions
These skills were not explicitly trained; they emerged from learning language patterns

Current Frontier: Reasoning Emergence

Modern reasoning models (o1, DeepSeek-R1) suggest a new type of emergence:

Training with RL on reasoning tasks creates qualitatively new capabilities
Chain-of-thought reasoning appears as a strategy the model discovers during RL, not a skill it is explicitly taught
"Aha moments" where a model learns during RL to pause and re-evaluate its initial approach

Related Concepts

Scaling Laws, Reasoning Models, Chain of Thought, In-Context Learning, Parameters, Benchmarks