Definition
Pre-training is the initial, large-scale training phase where a model learns general language understanding and generation capabilities by training on massive text corpora. It is called "pre-training" because it precedes more focused training stages (fine-tuning, RLHF). The result is a base model or foundation model.
Core Objective: Next-Token Prediction
The model is trained to predict the next token in a sequence:
```
Input: "The capital of France is"
Target: "Paris"
Loss = CrossEntropy(model_output, "Paris")
```
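As a minimal sketch of the loss in the example above, the snippet below computes cross-entropy for a single prediction in plain Python. The four-word vocabulary and the logit values are invented for illustration; a real model scores every token in a vocabulary of tens of thousands of entries and averages this loss over every position in the batch.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_id):
    """Cross-entropy for one position: -log p(correct next token)."""
    probs = softmax(logits)
    return -math.log(probs[target_id])

# Hypothetical toy vocabulary and model scores for the prompt
# "The capital of France is".
vocab = ["Paris", "London", "Berlin", "banana"]
logits = [4.0, 2.0, 1.5, -1.0]

loss = next_token_loss(logits, vocab.index("Paris"))
print(f"loss = {loss:.3f}")
```

The loss shrinks toward zero as the model puts more probability on the correct token, and equals `log(vocab_size)` when the model is completely undecided.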
This simple objective, applied at scale over trillions of tokens, forces the model to implicitly learn:
- Grammar and syntax
- World knowledge and facts
- Reasoning patterns
- Coding conventions
- Multiple languages
Training at this scale requires substantial infrastructure:
- Hundreds to thousands of GPUs/TPUs (H100, A100, TPU v4/v5)
- Distributed training: Data Parallelism, Tensor Parallelism, Pipeline Parallelism
- Mixed-precision training (bf16) to reduce memory use and speed up computation
- Gradient checkpointing to trade compute for memory
- Efficient data loaders to prevent I/O bottlenecks
How much data to train on is guided by compute-optimal scaling:
- Rule of thumb: roughly 20 training tokens per parameter for compute-optimal training
- 7B model → ~140B tokens (the compute-optimal minimum); frontier models train on 10–100× more, over-training a smaller model so that it is cheaper to run at inference time
The resulting base model:
- Can complete text coherently
- Has broad world knowledge
- Does NOT reliably follow instructions (may continue a question rather than answer it)
- Requires further alignment work to be useful as an assistant
Related terms: Base Model, Fine-Tuning, RLHF, Parameters, Token, Scaling Laws, Chinchilla
Training Data
| Source Type | Examples |
|-------------|----------|
| Web crawls | CommonCrawl, C4, RefinedWeb |
| Books | Books1, Books2, Project Gutenberg |
| Code | GitHub repositories, Stack Overflow |
| Wikipedia | All language editions |
| Scientific papers | ArXiv, PubMed |
| Curated datasets | The Pile, Dolma, RedPajama |
Typical scale: 1–15 trillion tokens for frontier models
The Training Loop
1. Sample a batch of token sequences from the dataset
2. Run forward pass: model predicts next token at every position
3. Compute cross-entropy loss between predictions and actual next tokens
4. Run backward pass: compute gradients via backpropagation
5. Update all parameters using an optimizer (typically AdamW)
6. Repeat for hundreds of thousands to millions of optimizer steps, until the token budget is exhausted
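The six steps above can be sketched end to end on a toy problem. This is an illustration under strong assumptions, not how production training is implemented: the "model" is a bigram table (the next token depends only on the current one), the gradient is written out by hand instead of using automatic differentiation, and the AdamW update is a plain-Python transcription of the usual formula. The name `train_bigram` and all hyperparameter values are hypothetical.

```python
import math
import random

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def zeros(n):
    return [[0.0] * n for _ in range(n)]

def train_bigram(token_ids, vocab_size, steps=200, batch_size=8,
                 lr=0.1, weight_decay=0.01, seed=0):
    rng = random.Random(seed)
    # W[i][j] is the logit for "token j follows token i".
    W, m, v = zeros(vocab_size), zeros(vocab_size), zeros(vocab_size)
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    pairs = list(zip(token_ids[:-1], token_ids[1:]))
    losses = []
    for t in range(1, steps + 1):
        # 1. Sample a batch of (current token, next token) pairs.
        batch = [rng.choice(pairs) for _ in range(batch_size)]
        grad = zeros(vocab_size)
        loss = 0.0
        for cur, nxt in batch:
            # 2. Forward pass: predicted distribution over the next token.
            probs = softmax(W[cur])
            # 3. Cross-entropy against the actual next token.
            loss -= math.log(probs[nxt]) / batch_size
            # 4. Backward pass: d(loss)/d(logit_j) = p_j - onehot_j.
            for j in range(vocab_size):
                target = 1.0 if j == nxt else 0.0
                grad[cur][j] += (probs[j] - target) / batch_size
        losses.append(loss)
        # 5. AdamW update with decoupled weight decay.
        for i in range(vocab_size):
            for j in range(vocab_size):
                g = grad[i][j]
                m[i][j] = beta1 * m[i][j] + (1 - beta1) * g
                v[i][j] = beta2 * v[i][j] + (1 - beta2) * g * g
                m_hat = m[i][j] / (1 - beta1 ** t)
                v_hat = v[i][j] / (1 - beta2 ** t)
                step = m_hat / (math.sqrt(v_hat) + eps)
                W[i][j] -= lr * (step + weight_decay * W[i][j])
        # 6. The enclosing loop repeats until the step budget is used up.
    return W, losses

text = "abababababababab"
vocab = sorted(set(text))
token_ids = [vocab.index(ch) for ch in text]
W, losses = train_bigram(token_ids, len(vocab))
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

Because the toy text is perfectly predictable, the loss starts at `log(2)` (an undecided model over two tokens) and falls as the table learns that `a` is followed by `b` and vice versa. A real run differs in scale and in the model, not in the shape of the loop.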
Infrastructure Requirements
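As a rough illustration of why training is spread across many accelerators, the sketch below estimates the memory needed just to hold model and optimizer state. It assumes a common mixed-precision AdamW accounting of about 16 bytes per parameter; the exact breakdown varies by framework, and activations, buffers, and fragmentation are ignored entirely. The function name is hypothetical.

```python
# Assumed per-parameter storage for mixed-precision training with AdamW.
# This is one common accounting, not a fixed rule.
BYTES_PER_PARAM = {
    "bf16 weights": 2,
    "bf16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}

def training_state_gb(n_params):
    """Memory for weights, gradients, and optimizer state, in GB (1e9 bytes)."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9

state_gb = training_state_gb(7e9)
print(f"7B model: about {state_gb:.0f} GB of training state before activations")
```

Under these assumptions a 7B model already exceeds the 80 GB available on a single A100 or H100, which is why the state is sharded across devices and why memory-saving techniques such as gradient checkpointing matter.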
Compute Scale (Chinchilla Scaling Laws)
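The Chinchilla paper (DeepMind, 2022) found that, for a fixed compute budget, parameters and training tokens should be scaled in roughly equal proportion, which gives a compute-optimal ratio of about 20 training tokens per parameter.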
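To make the ratio concrete, the sketch below turns a compute budget into a compute-optimal model size and token count. It assumes the commonly quoted rule of thumb of about 20 tokens per parameter together with the FLOPs ≈ 6 × N × D approximation used in this entry's cost estimate; the function name is hypothetical, and the result is a rough guide rather than a fitted scaling law.

```python
import math

TOKENS_PER_PARAM = 20  # rule-of-thumb ratio, treated here as an assumption

def compute_optimal_split(flops_budget):
    """Split a FLOPs budget C into (parameters N, tokens D).

    With C = 6 * N * D and D = 20 * N, it follows that C = 120 * N**2,
    so N = sqrt(C / 120).
    """
    n_params = math.sqrt(flops_budget / (6 * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

# Budget equal to training a 7B model on 140B tokens.
budget = 6 * 7e9 * 140e9
n_params, n_tokens = compute_optimal_split(budget)
print(f"N = {n_params:.2e} parameters, D = {n_tokens:.2e} tokens")
```

Quadrupling the budget doubles both the suggested parameter count and the suggested token count, which is the "scale both together" conclusion in arithmetic form.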
Compute Cost Estimate (Rough Formula)
```
FLOPs ≈ 6 × N × D
where:
N = number of parameters
D = number of training tokens
```
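Example: a 7B model trained on 2T tokens needs about 6 × (7 × 10^9) × (2 × 10^12) ≈ 8.4 × 10^22 FLOPs, on the order of $1–5M in GPU cost depending on hardware, utilization, and pricing.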
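The formula and example above can be checked with a few lines of Python. The conversion to GPU-hours is a sketch under stated assumptions: the peak throughput is the A100's published dense bf16 figure of 312 TFLOP/s, and the 40% sustained utilization is a hypothetical value chosen only to make the arithmetic concrete. Both function names are illustrative.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

def gpu_hours(flops, peak_flops_per_gpu, utilization):
    """GPU-hours needed at a given sustained fraction of peak throughput."""
    seconds = flops / (peak_flops_per_gpu * utilization)
    return seconds / 3600

flops = training_flops(7e9, 2e12)

ASSUMED_PEAK_FLOPS = 312e12   # A100 dense bf16 peak, FLOP/s
ASSUMED_UTILIZATION = 0.4     # hypothetical sustained fraction of peak

hours = gpu_hours(flops, ASSUMED_PEAK_FLOPS, ASSUMED_UTILIZATION)
print(f"{flops:.2e} FLOPs, roughly {hours:,.0f} GPU-hours under these assumptions")
```

Multiplying the GPU-hours by a per-hour price gives a cost estimate; the result is very sensitive to the utilization and price assumed, which is why published cost figures span a wide range.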
Pre-training Phases
Some models use a multi-phase curriculum:
1. Phase 1: Broad web data for general language understanding
2. Phase 2: High-quality curated data (books, code, math) to boost specific capabilities
3. Phase 3 (optional): Domain-specific data for specialized models
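One way to picture such a curriculum is as a set of sampling weights that changes between phases. The sketch below is purely illustrative: the phase names, source categories, and weights are all hypothetical, since real data mixtures are model-specific and often unpublished.

```python
import random

# Hypothetical mixture weights per phase (each row sums to 1).
PHASES = {
    "phase_1_broad":   {"web": 0.85, "books": 0.05, "code": 0.05, "math": 0.05},
    "phase_2_quality": {"web": 0.40, "books": 0.20, "code": 0.25, "math": 0.15},
}

def sample_sources(phase, n, seed=0):
    """Pick the data source for each of n training documents in a phase."""
    rng = random.Random(seed)
    weights = PHASES[phase]
    return rng.choices(list(weights), weights=list(weights.values()), k=n)

for phase in PHASES:
    picks = sample_sources(phase, 10_000)
    share = picks.count("web") / len(picks)
    print(f"{phase}: web share of sampled documents = {share:.2f}")
```

The later phase draws far less from the web crawl and more from curated sources, which is the shift the list above describes.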
What Pre-training Produces
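Pre-training produces a base (foundation) model: a strong text completer with broad knowledge that is not yet an instruction-following assistant.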
Notable Pre-trained Models
| Model | Organization | Parameters | Training tokens |
|-------|--------------|------------|-----------------|
| GPT-3 | OpenAI | 175B | 300B |
| LLaMA 3 | Meta | 8B–70B | 15T |
| Mistral 7B | Mistral AI | 7B | ~1T |
| Gemma 2 | Google | 2B–27B | 2T–13T (varies by size) |
| Claude (base) | Anthropic | undisclosed | undisclosed |
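A quick way to read the table is to compute tokens per parameter and compare it with the 20-tokens-per-parameter rule of thumb. The sketch below does this for the GPT-3 and LLaMA 3 rows, using the figures from the table; treating the two ends of the LLaMA 3 size range as separate entries is a simplification made here.

```python
CHINCHILLA_RATIO = 20  # rule-of-thumb tokens per parameter

# (model, parameters, training tokens), taken from the table above.
MODELS = [
    ("GPT-3", 175e9, 300e9),
    ("LLaMA 3 8B", 8e9, 15e12),
    ("LLaMA 3 70B", 70e9, 15e12),
]

def tokens_per_param(n_params, n_tokens):
    return n_tokens / n_params

for name, n_params, n_tokens in MODELS:
    ratio = tokens_per_param(n_params, n_tokens)
    side = "below" if ratio < CHINCHILLA_RATIO else "above"
    print(f"{name}: {ratio:,.1f} tokens per parameter ({side} the rule of thumb)")
```

GPT-3 comes out well under 20 tokens per parameter, while both LLaMA 3 sizes come out far above it, which illustrates the shift toward over-training smaller models for cheaper inference.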