In-Context Learning (ICL)

Definition

In-Context Learning (ICL) is the emergent ability of LLMs to learn a new task or adapt to new patterns by reading examples provided in the prompt — without any gradient updates or weight changes to the model. The model "learns" from demonstrations purely through the forward pass of inference.

Key Distinction

| Learning Type | Weight Update? | When? | Examples Needed |
|--------------|---------------|-------|----------------|
| Pre-training | Yes | Before deployment | Trillions of tokens |
| Fine-tuning | Yes | Before deployment | Thousands of examples |
| In-Context Learning | No | At inference time | 1–100 examples in prompt |

ICL is remarkable because it involves zero parameter updates: the model conditions on the examples in its context window and adapts its outputs using only the forward passes of ordinary inference.

Why ICL Works: The Mechanistic View

Research (Olsson et al., 2022) identified induction heads — attention heads that:

1. Match a pattern: "previous token A was followed by B"

2. When they see token A again, copy B as a high-probability next token

3. Chain together to implement more complex pattern completion

Olsson et al. argue that, at sufficient scale, induction heads generalize from copying exact token pairs to completing more abstract input-output patterns, which may account for much of full ICL.
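The match-and-copy behaviour in steps 1 and 2 can be sketched as a plain lookup over a token list. This is a toy illustration only: real induction heads do this softly, through learned attention weights over embeddings, and `induction_predict` is a hypothetical helper name rather than anything from the cited paper.

```python
def induction_predict(tokens):
    """Predict the next token by copying whatever followed the most
    recent earlier occurrence of the current (last) token.

    Returns None when the current token has not been seen before.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the last position itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # copy the token that followed the match
    return None


# "A" was followed by "B" earlier, so seeing "A" again predicts "B".
print(induction_predict(["A", "B", "C", "D", "A"]))  # B
```

The hard equality test is the simplification here; the claim about scale is that the matching becomes fuzzy enough to work on abstract patterns rather than identical tokens.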

ICL vs. Few-Shot Prompting

These terms are often used interchangeably but have a distinction:

  • Few-shot prompting: the practical technique (providing examples in prompt)
  • In-context learning: the theoretical phenomenon (how the model adapts from examples)
  • Few-shot prompting leverages in-context learning

What Models Actually Do During ICL

Current research suggests that models do not fully "learn" the task from the examples. Instead, they:

1. Locate similar patterns from pre-training that match the examples

2. Infer the task format from the example structure

3. Adapt output format to match the demonstrated pattern

This means ICL is most effective for tasks that were represented in some form during pre-training.

ICL Performance Factors

What Makes ICL Work Well

| Factor | Effect |
|--------|--------|
| More examples | Generally better, up to context limit |
| High-quality examples | Critical: bad examples hurt performance |
| Diverse examples | Better generalization to test inputs |
| Consistent format | Clear pattern → better imitation |
| Representative examples | Match the distribution of test inputs |
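One simple way to act on the "representative examples" row is to pick, for each test input, the demonstrations that look most like it. The sketch below uses word-overlap (Jaccard) similarity as a stand-in; that choice, and the helper names `jaccard` and `select_examples`, are assumptions made for illustration, not a method described above.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two strings (0.0 to 1.0)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def select_examples(pool, query, k=2):
    """Return the k (input, label) pairs from pool most similar to query."""
    ranked = sorted(pool, key=lambda ex: jaccard(ex[0], query), reverse=True)
    return ranked[:k]


pool = [
    ("great battery life", "positive"),
    ("the screen cracked quickly", "negative"),
    ("battery died in a day", "negative"),
]
print(select_examples(pool, "battery life is great", k=2))
```

Picking only the nearest examples trades away some diversity, so in practice the two rows of the table pull against each other.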

What Makes ICL Fail

| Factor | Effect |
|--------|--------|
| Wrong labels in examples | Less harmful than expected: randomly assigned labels barely hurt accuracy in Min et al. (2022), where format mattered more than label correctness |
| Inconsistent format | Model can't identify the pattern |
| Novel task type | Not seen in pre-training → ICL limited |
| Small model | ICL is an emergent ability requiring scale |
| Very long examples | Token budget exceeded before test input |
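The "very long examples" failure can be avoided by reserving room for the test input first and only then adding demonstrations. A minimal sketch follows; it counts whitespace-separated words as a crude proxy for tokens (a real tokenizer will give different counts), and `count_tokens` and `pack_examples` are hypothetical helper names.

```python
def count_tokens(text):
    # Crude proxy: whitespace-separated words. Real tokenizers differ.
    return len(text.split())


def pack_examples(examples, test_input, budget):
    """Keep examples, in order, while they fit in the budget that is
    left after reserving space for the test input."""
    remaining = budget - count_tokens(test_input)
    if remaining < 0:
        raise ValueError("test input alone exceeds the token budget")
    kept = []
    for example in examples:
        cost = count_tokens(example)
        if cost > remaining:
            break  # stop at the first example that does not fit
        kept.append(example)
        remaining -= cost
    return kept


print(pack_examples(["a b c", "d e", "f g h i"], "x y", budget=8))
# ['a b c', 'd e']
```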

The Surprising Robustness of ICL

A counterintuitive finding: wrong labels barely matter

```
Standard few-shot:
Input: "I love this movie" → Label: Positive
Input: "Terrible experience" → Label: Negative

Random label few-shot:
Input: "I love this movie" → Label: Negative ← WRONG
Input: "Terrible experience" → Label: Positive ← WRONG

Performance: nearly identical (Min et al., 2022)
```

This suggests that the model uses the examples mainly to learn the format and task structure rather than the input-output mapping itself; for the mapping, it draws on pre-trained knowledge.
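A comparison of this kind can be set up with a small harness: replace each demonstration label with one drawn uniformly from the label set, then measure accuracy under both sets of demonstrations. The sketch below is an assumption-laden outline, not the original experimental code. In particular, `predict` is a hypothetical callable that wraps whatever model is being tested and returns a label string.

```python
import random


def randomize_labels(demos, label_set, seed=0):
    """Replace every label with one sampled uniformly from label_set."""
    rng = random.Random(seed)
    return [(text, rng.choice(label_set)) for text, _ in demos]


def accuracy(predict, demos, test_set):
    """Fraction of test inputs for which predict(demos, text) is correct.

    predict is a hypothetical callable: (demos, text) -> label string.
    """
    correct = sum(1 for text, gold in test_set if predict(demos, text) == gold)
    return correct / len(test_set)


gold_demos = [("I love this movie", "Positive"), ("Terrible experience", "Negative")]
random_demos = randomize_labels(gold_demos, ["Positive", "Negative"])
# With a real model behind predict, the quantity of interest is:
#   accuracy(predict, gold_demos, test_set) - accuracy(predict, random_demos, test_set)
```

Keeping the inputs and format fixed while changing only the labels is what isolates the effect of label correctness.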

ICL in Practice

Classification Template

```
Sentiment analysis:
"Great product!" → positive
"Would not recommend" → negative
"Average, nothing special" → neutral
"Works exactly as described" →
```
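A template like this is usually assembled in code so that every demonstration follows exactly the same format. Here is a minimal sketch; `build_few_shot_prompt` is a hypothetical helper, and the quoting and arrow separator simply mirror the template above.

```python
def build_few_shot_prompt(instruction, examples, query, arrow=" → "):
    """Format an instruction, (input, label) demonstrations, and an
    unanswered query into a single few-shot prompt string."""
    lines = [instruction]
    for text, label in examples:
        lines.append(f'"{text}"{arrow}{label}')
    # Leave the final line open so the model completes the label.
    lines.append(f'"{query}"{arrow}'.rstrip())
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Sentiment analysis:",
    [
        ("Great product!", "positive"),
        ("Would not recommend", "negative"),
        ("Average, nothing special", "neutral"),
    ],
    "Works exactly as described",
)
print(prompt)
```

Generating every line from one format string is a cheap way to get the "consistent format" property.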

Extraction Template

```
Extract the company and amount from financial news.
"Google acquired DeepMind for $400M" → {"company": "DeepMind", "amount": "$400M"}
"Microsoft paid $26B to acquire LinkedIn" → {"company": "LinkedIn", "amount": "$26B"}
"Amazon bought Whole Foods for $13.7B" → {"company": "Whole Foods", "amount": "$13.7B"}
"Salesforce completed its $27.7B purchase of Slack" →
```
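Because the demonstrations show JSON output, the completion can be checked mechanically before it is used. The sketch below parses the first line of a completion and verifies the two keys used in the template. `parse_extraction` is a hypothetical helper, and the completion strings in the usage lines are made-up inputs, not real model output.

```python
import json

REQUIRED_KEYS = {"company", "amount"}


def parse_extraction(completion):
    """Parse the first line of a completion as JSON.

    Returns the dict if it has exactly the required keys, else None.
    """
    stripped = completion.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    try:
        data = json.loads(first_line)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        return None
    return data


print(parse_extraction('{"company": "Slack", "amount": "$27.7B"}'))
print(parse_extraction("Slack, $27.7B"))  # None: not valid JSON
```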

Format Teaching

```
Convert to military time:
3:30 PM → 15:30
10:15 AM → 10:15
11:45 PM → 23:45
6:00 AM →
```
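For a format task with a deterministic answer, a reference implementation is a convenient way to check model outputs. The sketch below converts 12-hour times to 24-hour times. `to_24_hour` is a hypothetical helper, and its zero-padded hours are an assumption: the demonstrations above never show a single-digit hour, so they do not settle whether the final answer should be 06:00 or 6:00.

```python
import re


def to_24_hour(text):
    """Convert a time such as '3:30 PM' to 24-hour 'HH:MM' form."""
    match = re.fullmatch(r"\s*(\d{1,2}):(\d{2})\s*([AaPp][Mm])\s*", text)
    if not match:
        raise ValueError(f"unrecognized time: {text!r}")
    hour, minute = int(match.group(1)), int(match.group(2))
    period = match.group(3).upper()
    if not (1 <= hour <= 12 and 0 <= minute <= 59):
        raise ValueError(f"out-of-range time: {text!r}")
    hour %= 12  # 12 AM becomes 0; 12 PM becomes 0 and then 12 below
    if period == "PM":
        hour += 12
    return f"{hour:02d}:{minute:02d}"


for sample in ["3:30 PM", "10:15 AM", "11:45 PM"]:
    print(sample, "→", to_24_hour(sample))
```

Adding a demonstration with a single-digit morning hour would remove the padding ambiguity from the prompt itself.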

ICL vs. Fine-Tuning: When to Use Which

| Use ICL | Use Fine-Tuning |
|---------|----------------|
| Task requirements change frequently | Consistent, stable task |
| Few examples (<100) | Large dataset available (1K+) |
| Prototyping and experimentation | Production, high-volume |
| Token budget not critical | Token efficiency matters |
| Don't want training overhead | Can afford training compute |
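The token-efficiency row comes down to simple arithmetic: demonstrations are re-sent with every request, so their cost scales with traffic. The sketch below makes that explicit. The function name and the figures in the usage line are illustrative assumptions, not measurements.

```python
def demo_token_overhead(num_examples, tokens_per_example, num_requests):
    """Return (tokens per request, total tokens) spent on demonstrations."""
    per_request = num_examples * tokens_per_example
    return per_request, per_request * num_requests


# Illustrative numbers only: 20 examples of about 50 tokens, 10,000 requests.
per_request, total = demo_token_overhead(20, 50, 10_000)
print(per_request, total)  # 1000 10000000
```

A fine-tuned model pays for its examples once, during training, which is why it tends to win at high request volumes.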

ICL Scaling: Many-Shot ICL

A recent trend is many-shot ICL, which exploits very long context windows (for example, Gemini's 1M-token context):

  • Provide hundreds or thousands of examples in the context
  • Can approach fine-tuning quality on some tasks without any training
  • Especially powerful for tasks with highly consistent format

Related Concepts

  • Few-Shot, Zero-Shot, Chain of Thought, Emergent Abilities, Context Window, Attention
