
Emergent Abilities

Definition

Emergent abilities are capabilities that appear in LLMs only at sufficient scale: they are absent or near-random in smaller models, then appear sharply and unexpectedly in larger ones without the model being explicitly trained for them. They are called "emergent" because they arise from scale alone and cannot be predicted by simply extrapolating small-model behavior.

The Key Characteristic

Unlike smoothly scaling metrics (e.g., perplexity, which falls steadily as parameters increase), emergent abilities show a phase-transition-like pattern, illustrated here with hypothetical numbers:

```
Accuracy on task:

Small model (7B):    ~5%  (random guessing)
Medium model (70B):  ~8%  (still near-random)
Large model (175B):  ~65% (suddenly works!)
Frontier (500B+):    ~92%
```

The jump from ~8% at 70B to ~65% at 175B is the "emergence": accuracy stays near random across smaller sizes, then rises at a threshold rather than gradually.
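One way to make the threshold concrete is to look for the largest accuracy gain between adjacent model sizes. The sketch below reuses the illustrative figures from the example above (they are not real benchmark data); the function name `largest_jump` is hypothetical, and 500 stands in for "500B+".

```python
# Illustrative accuracy figures from the example above (not real benchmark data).
# Sizes are in billions of parameters; 500 stands in for "500B+".
results = [(7, 0.05), (70, 0.08), (175, 0.65), (500, 0.92)]


def largest_jump(points):
    """Return (smaller_size, larger_size, gain) for the biggest accuracy gain
    between two adjacent model sizes."""
    ordered = sorted(points)
    best = None
    for (size_a, acc_a), (size_b, acc_b) in zip(ordered, ordered[1:]):
        gain = acc_b - acc_a
        if best is None or gain > best[2]:
            best = (size_a, size_b, gain)
    return best


small, large, gain = largest_jump(results)
print(f"Largest jump: {small}B -> {large}B (+{gain:.0%})")
```

With these numbers the largest gain falls between the 70B and 175B rows, which is where the text places the threshold.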

Famous Examples of Emergent Abilities

In-Context Learning (Few-Shot)

  • Small models cannot generalize from examples in the prompt
  • At ~100B+ parameters, models suddenly "get" the pattern from 3–5 examples
  • Enables zero-shot and few-shot prompting as practical techniques
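Few-shot prompting, as described above, amounts to placing a handful of worked examples ahead of the real query. A minimal sketch follows, assuming an arbitrary "Input:/Output:" text layout; the function name and the sentiment examples are hypothetical, and no model is called.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Format (input, output) pairs followed by the unanswered query."""
    blocks = [instruction] if instruction else []
    for text, label in examples:
        blocks.append(f"Input: {text}\nOutput: {label}")
    # The final block leaves "Output:" empty so the model completes it.
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


# Three hypothetical demonstrations, in the 3-5 example range mentioned above.
examples = [
    ("cheerful", "positive"),
    ("dreadful", "negative"),
    ("delightful", "positive"),
]
prompt = build_few_shot_prompt(examples, "miserable", "Classify the sentiment.")
print(prompt)
```

The resulting string would be sent to whichever model is under test; whether the model picks up the pattern is exactly what the emergence claim is about.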

Chain-of-Thought Reasoning

  • Small models produce incoherent reasoning when asked to "think step by step"
  • At ~100B+ parameters, step-by-step reasoning becomes coherent and useful
  • Enables the entire CoT prompting paradigm
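The "think step by step" cue mentioned above can be attached to a prompt mechanically. This is a minimal sketch assuming a plain Q/A text layout; both helper names are hypothetical, the question is only an example, and no model is called.

```python
def direct_prompt(question):
    """Ask for the answer with no reasoning cue."""
    return f"Q: {question}\nA:"


def chain_of_thought_prompt(question):
    """Append a reasoning cue so the model writes intermediate steps first."""
    return f"Q: {question}\nA: Let's think step by step."


question = "If A=3 and B=A+2 and C=B×2, what is C?"
print(direct_prompt(question))
print()
print(chain_of_thought_prompt(question))
```

Comparing a model's outputs on the two prompt variants is a simple way to check whether step-by-step reasoning actually helps at a given model size.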

Multi-Step Arithmetic

  • Smaller models fail at "if A=3 and B=A+2 and C=B×2, what is C?"
  • Emerges reliably at larger scale
  • Connects to the effectiveness of reasoning models

Instruction Following

  • Base models don't naturally follow instructions (they complete text)
  • After instruction fine-tuning at sufficient scale, models generalize to novel instructions
  • Enables zero-shot task execution with natural language

Code Generation

  • Smaller models produce syntactically invalid code
  • At scale, models produce working code for complex specifications

Calibrated Uncertainty

  • Small models confidently state wrong answers
  • Large models more reliably say "I don't know" for uncertain questions

Multi-lingual Transfer

  • A model trained mostly on English text shows emergent understanding of other languages
  • Better English reasoning improves reasoning in other languages

Why Emergence Happens

The "Threshold" Hypothesis

Tasks require a minimum level of capability to succeed at all:

  • Sub-threshold: model fails completely (0% success)
  • At threshold: all prerequisite skills combine to enable the task (~100% suddenly)
  • Benchmarks that score 0 until all prerequisites are met create the appearance of abruptness

Multi-Skill Composition

Complex tasks require multiple sub-skills, each of which scales smoothly:

  • Sub-skill A: the model succeeds 80% of the time
  • Sub-skill B: the model succeeds 80% of the time
  • Combined success: only ~64% (0.8 × 0.8) when the task needs both and the failures are independent
  • At sufficient scale, all sub-skills become reliable → composed task becomes reliable
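The composition argument above can be checked numerically. The sketch below assumes sub-skill failures are independent, so the composed success rate is the product of the individual rates; the function name and the 10-sub-skill task are hypothetical.

```python
import math


def composed_success(sub_skill_rates):
    """Probability that every sub-skill succeeds, assuming independence."""
    return math.prod(sub_skill_rates)


# Two sub-skills at 80% each, as in the example above.
print(round(composed_success([0.8, 0.8]), 2))

# Hypothetical task needing 10 sub-skills: per-skill reliability rises smoothly,
# but the composed rate stays low for a long time and then climbs steeply.
for p in (0.5, 0.7, 0.9, 0.99):
    print(f"per-skill {p:.2f} -> composed {composed_success([p] * 10):.3f}")
```

Under these assumptions, per-skill rates of 0.5, 0.7, 0.9, and 0.99 give composed rates of roughly 0.1%, 3%, 35%, and 90%: a smooth input produces an abrupt-looking output.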

Metric Discontinuities

Some metrics show sudden jumps even if the underlying capability is smooth:

  • Exact-match metrics: an answer that is 99% correct but wrong on the last step scores 0%
  • Benchmark design creates apparent discontinuities in otherwise smooth scaling
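The difference between an all-or-nothing metric and a partial-credit metric can be seen on a single answer. The sketch below is an assumption-laden illustration: both scoring functions are simplified stand-ins for real benchmark metrics, and the prediction is a made-up output that goes wrong only on the final step.

```python
def exact_match(prediction, reference):
    """1.0 only if every step matches; otherwise 0.0."""
    return 1.0 if prediction == reference else 0.0


def step_accuracy(prediction, reference):
    """Fraction of reference steps that the prediction gets right.

    Simplification: extra predicted steps beyond the reference are ignored.
    """
    if not reference:
        return 1.0 if not prediction else 0.0
    matches = sum(p == r for p, r in zip(prediction, reference))
    return matches / len(reference)


# Worked steps for "A=3, B=A+2, C=B×2": the correct final value is C=10.
reference = ["A=3", "B=5", "C=10"]
prediction = ["A=3", "B=5", "C=7"]  # hypothetical output, wrong on the last step

print(exact_match(prediction, reference))    # all-or-nothing score
print(step_accuracy(prediction, reference))  # partial-credit score
```

Exact match scores this answer 0, while the partial-credit metric gives it two thirds. Averaged over many examples and model sizes, the first curve can look like a cliff where the second looks like a slope.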

The Debate: Are Emergent Abilities Real?

Skeptical View (Schaeffer et al., 2023):

  • "Emergent abilities" may be an artifact of discontinuous metrics
  • When continuous metrics are used, abilities often scale smoothly
  • Apparent emergence = metric threshold, not actual phase transition

Confirming View:

  • Some capabilities genuinely appear to be absent then present
  • Induction heads in small vs. large models appear structurally different
  • In-context learning in particular has strong evidence of actual emergence

Practical conclusion: Regardless of mechanism, some capabilities empirically appear reliably only at large scale. This holds in practice even if the theoretical explanation is debated.

Implications for Practitioners

Model Selection

  • Don't expect small models to perform well on tasks requiring complex reasoning
  • Emergent abilities define the minimum viable model size for certain tasks
  • Rule of thumb: complex multi-step reasoning → 70B+ or API-only models
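In practice, "minimum viable model size" can be read off your own evaluation results rather than taken from a rule of thumb. The sketch below assumes you already have accuracy per model size on your task; the function name is hypothetical and the numbers are the illustrative ones from earlier in this article, not measurements.

```python
def smallest_viable_model(eval_results, target_accuracy):
    """Return the smallest model size meeting the target, or None if none does.

    eval_results maps model size (billions of parameters) to measured accuracy.
    """
    for size in sorted(eval_results):
        if eval_results[size] >= target_accuracy:
            return size
    return None


# Placeholder numbers reused from the illustrative example earlier in the article.
eval_results = {7: 0.05, 70: 0.08, 175: 0.65, 500: 0.92}

print(smallest_viable_model(eval_results, 0.60))
print(smallest_viable_model(eval_results, 0.95))
```

A `None` result means no evaluated model reaches the target, which is the signal to try a larger model or relax the requirement.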

Fine-Tuning Limits

  • You cannot fine-tune emergent abilities into a model that lacks the scale for them
  • Fine-tuning builds on existing capabilities; it cannot create capabilities the model lacks
  • Solution: use a larger base model

The Surprising Generalization

Emergent abilities often transfer to tasks not in training:

  • A model trained on text learns to solve analogies, do arithmetic, and follow instructions
  • These skills were not explicitly trained for; they emerged from learning language patterns

Current Frontier: Reasoning Emergence

Modern reasoning models (o1, DeepSeek-R1) suggest a new type of emergence:

  • Training with RL on reasoning tasks creates qualitatively new capabilities
  • Chain-of-thought reasoning appears as an emergent strategy rather than an explicitly trained skill
  • "Aha moments" in which models discover new problem-solving strategies through RL

Related Concepts

  • Scaling Laws, Reasoning Models, Chain of Thought, In-Context Learning, Parameters, Benchmarks
