Pre-training — FDE@ProdAI Blog

Definition

Pre-training is the initial, large-scale training phase in which a model acquires general language understanding and generation capabilities from massive text corpora. It is called "pre-training" because it precedes more focused training stages such as fine-tuning and RLHF. The result is a base model, also called a foundation model.

Core Objective: Next-Token Prediction

The model is trained to predict the next token in a sequence:

Input: "The capital of France is"

Target: "Paris"

Loss = CrossEntropy(model_output, "Paris")

This simple objective, applied at scale over trillions of tokens, forces the model to implicitly learn:

Grammar and syntax
World knowledge and facts
Reasoning patterns
Coding conventions
Multiple languages
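To make the next-token objective concrete, here is a minimal sketch in plain Python. The four-word vocabulary and the logit values are made up for illustration; a real model scores tens of thousands of tokens at every position.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability that the model assigns to the correct next token."""
    return -math.log(softmax(logits)[target_index])

# Hypothetical toy vocabulary and model scores for the token that follows
# "The capital of France is". These numbers are illustrative assumptions.
vocab = ["Paris", "London", "Berlin", "banana"]
logits = [4.0, 2.0, 1.5, -3.0]

loss = cross_entropy(logits, vocab.index("Paris"))

# Baseline: a model that knows nothing spreads probability evenly,
# which gives a loss of log(vocabulary size).
uniform_loss = cross_entropy([0.0] * len(vocab), vocab.index("Paris"))

print(f"loss with informed logits: {loss:.3f}")
print(f"loss with uniform logits:  {uniform_loss:.3f}")
```

The loss falls as the model puts more probability on the token that actually comes next, which is the only signal pre-training uses.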

Training Data

| Source Type | Examples |
|-------------|----------|
| Web crawls | CommonCrawl, C4, RefinedWeb |
| Books | Books1, Books2, Project Gutenberg |
| Code | GitHub repositories, Stack Overflow |
| Wikipedia | All language editions |
| Scientific papers | ArXiv, PubMed |
| Curated datasets | The Pile, Dolma, RedPajama |

Typical scale: 1–15 trillion tokens for frontier models

The Training Loop

1. Sample a batch of token sequences from the dataset

2. Run forward pass: model predicts next token at every position

3. Compute cross-entropy loss between predictions and actual next tokens

4. Run backward pass: compute gradients via backpropagation

5. Update all parameters using an optimizer (typically AdamW)

6. Repeat for hundreds of thousands to millions of steps (each step processes a batch that can contain millions of tokens)
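The steps above can be sketched end to end on a toy problem. This is an illustrative stand-in, not how production training is written: the "model" is a bigram lookup table, the gradient is derived by hand instead of by automatic differentiation, and the update is plain SGD rather than AdamW. The function name `train_bigram` and all hyperparameters are assumptions chosen for the example.

```python
import math
import random

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def train_bigram(tokens, vocab_size, steps=200, batch_size=8, lr=0.5, seed=0):
    rng = random.Random(seed)
    # Parameters: logits[prev][next], about the simplest possible language model.
    logits = [[0.0] * vocab_size for _ in range(vocab_size)]
    losses = []
    for _ in range(steps):
        # 1. Sample a batch of positions (current token, next token).
        positions = [rng.randrange(len(tokens) - 1) for _ in range(batch_size)]
        grads = [[0.0] * vocab_size for _ in range(vocab_size)]
        loss = 0.0
        for p in positions:
            prev, target = tokens[p], tokens[p + 1]
            # 2. Forward pass: predicted distribution over the next token.
            probs = softmax(logits[prev])
            # 3. Cross-entropy loss against the actual next token.
            loss -= math.log(probs[target]) / batch_size
            # 4. Backward pass: d(loss)/d(logits) = probs - one_hot(target).
            for j in range(vocab_size):
                one_hot = 1.0 if j == target else 0.0
                grads[prev][j] += (probs[j] - one_hot) / batch_size
        # 5. Update all parameters (plain SGD here, swapped in for AdamW).
        for i in range(vocab_size):
            for j in range(vocab_size):
                logits[i][j] -= lr * grads[i][j]
        losses.append(loss)
    # 6. The loop above repeats for the requested number of steps.
    return logits, losses

# Toy corpus: a repeating cycle 0, 1, 2, 3, so the next token is fully predictable.
tokens = [0, 1, 2, 3] * 50
logits, losses = train_bigram(tokens, vocab_size=4)
print(f"first-step loss: {losses[0]:.3f}, last-step loss: {losses[-1]:.3f}")
```

A real run differs in scale and machinery (a transformer, autograd, AdamW, distributed execution), but the sample, forward, loss, backward, update cycle is the same.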

Infrastructure Requirements

Hundreds to thousands of GPUs/TPUs (H100, A100, TPU v4/v5)
Distributed training: Data Parallelism, Tensor Parallelism, Pipeline Parallelism
Mixed-precision training (bf16) to reduce memory use and increase throughput
Gradient checkpointing to trade compute for memory
Efficient data loaders to prevent I/O bottlenecks
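A rough memory estimate shows why a single GPU is not enough even for modest model sizes. The sketch below uses the commonly cited accounting for mixed-precision training with an Adam-style optimizer (16 bytes per parameter); the 80 GiB device size is an assumption, and activations, buffers, and fragmentation are deliberately left out, so real requirements are higher.

```python
import math

# Assumed per-parameter training state for bf16 mixed precision with AdamW.
BYTES_PER_PARAM = {
    "bf16 weights": 2,
    "bf16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}

def training_state_gib(n_params):
    """Memory for weights, gradients, and optimizer state, in GiB."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 2**30

def min_gpus_for_state(n_params, gpu_memory_gib=80):
    """Lower bound on GPUs needed just to hold the training state."""
    return math.ceil(training_state_gib(n_params) / gpu_memory_gib)

for n_params in (7e9, 70e9):
    print(
        f"{n_params / 1e9:.0f}B params: "
        f"{training_state_gib(n_params):.0f} GiB of state, "
        f"at least {min_gpus_for_state(n_params)} GPUs before activations"
    )
```

This is the motivation for sharding state across devices with the parallelism strategies listed above, and for gradient checkpointing to shrink the activation memory that this estimate ignores.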

Compute Scale (Chinchilla Scaling Laws)

The Chinchilla paper (DeepMind, 2022) established optimal token-to-parameter ratios:

Rule of thumb: 20 tokens per parameter for compute-optimal training
7B model → ~140B tokens (compute-optimal); many recent models train on 10–100× more, deliberately over-training a smaller model so that it is cheaper to serve at inference
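The rule of thumb is simple enough to check with a few lines of arithmetic. The helper names below are hypothetical, and the 20:1 ratio is the approximation quoted above, not an exact constant.

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Approximate compute-optimal token budget under the 20:1 rule of thumb."""
    return n_params * tokens_per_param

def overtraining_factor(n_params, n_tokens, tokens_per_param=20):
    """How many times past the compute-optimal budget a training run goes."""
    return n_tokens / chinchilla_optimal_tokens(n_params, tokens_per_param)

# 7B parameters -> 140B tokens under the rule of thumb.
optimal_7b = chinchilla_optimal_tokens(7e9)

# An 8B model trained on 15T tokens is far past the compute-optimal point.
factor_8b = overtraining_factor(8e9, 15e12)

print(f"compute-optimal tokens for 7B: {optimal_7b:.2e}")
print(f"8B model on 15T tokens: {factor_8b:.2f}x the compute-optimal budget")
```

Training this far past the optimum spends extra compute once, in exchange for a smaller model that is cheaper every time it is served.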

Compute Cost Estimate (Rough Formula)

FLOPs ≈ 6 × N × D

where:

N = number of parameters

D = number of training tokens

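Example: 7B model × 2T tokens ≈ 6 × (7 × 10^9) × (2 × 10^12) = 8.4 × 10^22 FLOPs. GPU cost ranges from hundreds of thousands to low millions of dollars, depending on hardware, utilization, and pricing.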
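The formula translates directly into code, and dividing by sustained throughput turns FLOPs into GPU-hours. The 312 TFLOP/s figure is the A100's peak dense bf16 rate; the 40% utilization is an assumed value, since the fraction of peak a real run achieves varies widely.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute: 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

def gpu_hours(total_flops, peak_flops_per_gpu, utilization):
    """GPU-hours needed at a given fraction of peak throughput."""
    sustained = peak_flops_per_gpu * utilization
    return total_flops / sustained / 3600

flops = training_flops(7e9, 2e12)

# Assumptions: A100 peak of 312e12 FLOP/s in bf16, 40% of peak sustained.
hours = gpu_hours(flops, peak_flops_per_gpu=312e12, utilization=0.4)

print(f"total compute: {flops:.2e} FLOPs")
print(f"about {hours:,.0f} GPU-hours under the stated assumptions")
```

Multiplying the GPU-hours by an hourly price gives a dollar figure; because price and utilization both vary by a large factor, any cost estimate is only good to an order of magnitude.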

Pre-training Phases

Some models use a multi-phase curriculum:

1. Phase 1: Broad web data for general language understanding

2. Phase 2: High-quality curated data (books, code, math) to boost specific capabilities

3. Phase 3 (optional): Domain-specific data for specialized models
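One way to picture a multi-phase curriculum is as a data-mixture schedule keyed on training progress. Everything in this sketch is hypothetical: the phase boundaries, source names, and sampling weights are invented for illustration and do not describe any particular model's recipe.

```python
import random

# Hypothetical schedule: (end of phase as a fraction of the token budget,
# {data source: sampling weight}). All values are illustrative assumptions.
PHASES = [
    (0.80, {"web": 0.85, "code": 0.10, "books": 0.05}),                 # Phase 1
    (0.95, {"web": 0.40, "code": 0.30, "books": 0.15, "math": 0.15}),   # Phase 2
    (1.00, {"domain": 0.70, "web": 0.30}),                              # Phase 3
]

def mixture_at(progress):
    """Return the sampling weights in effect at a progress value in [0, 1]."""
    for end, weights in PHASES:
        if progress <= end:
            return weights
    return PHASES[-1][1]

def sample_source(progress, rng):
    """Pick the data source for the next training sequence."""
    weights = mixture_at(progress)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

rng = random.Random(0)
for progress in (0.10, 0.90, 0.99):
    picks = [sample_source(progress, rng) for _ in range(1000)]
    counts = {name: picks.count(name) for name in mixture_at(progress)}
    print(f"progress {progress:.2f}: {counts}")
```

The data loader consults the schedule each time it draws a sequence, so the shift from broad web text to curated or domain-specific data happens without changing the training loop itself.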

What Pre-training Produces

A base/foundation model that:

Can complete text coherently
Has broad world knowledge
Does NOT reliably follow instructions (may continue a question rather than answer it)
Requires further alignment work to be useful as an assistant

Notable Pre-trained Models

| Model | Organization | Parameters | Training Tokens |
|-------|-------------|--------|--------|
| GPT-3 | OpenAI | 175B | 300B |
| LLaMA 3 | Meta | 8B–70B | 15T |
| Mistral 7B | Mistral AI | 7B | Not disclosed |
| Gemma 2 | Google | 2B–27B | 2T–13T (varies by size) |

Related Concepts

Base Model, Fine-Tuning, RLHF, Parameters, Token, Scaling Laws, Chinchilla