Definition
A token is the atomic unit of text that an LLM processes: a word, sub-word, character, or punctuation symbol. The model reads its input as a sequence of tokens and generates its output one token at a time.
Why Tokens (Not Characters or Words)?
- Characters → too granular, very long sequences, poor semantic grouping
- Words → vocabulary explodes (millions of rare/compound words), can't handle unknown words
- Sub-word tokens → best balance: compact vocabulary (~32K–128K tokens), handles rare words by splitting them, retains common words whole
Key Properties
- Vocabulary size: typically 32K–128K unique tokens
- Token ≠ word: one word can be 1–4 tokens; one token can span multiple characters
- Special tokens such as <|endoftext|>, [CLS], [SEP], and [PAD] control model behavior
- Whitespace matters: " hello" and "hello" are often different tokens
Rules of Thumb
- 1 token ≈ 4 characters (English)
- 1 token ≈ 0.75 words (English)
- Non-English languages are typically less efficient (more tokens per word)
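- Code efficiency depends on the tokenizer: vocabularies trained on code merge common keywords and runs of whitespace, while others can spend many tokens on indentation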
Why Tokens Matter
- Cost: APIs charge per token (input + output)
- Context limits: models have a maximum token count they can process at once (context window)
- Latency: output tokens are generated one at a time, so longer outputs take longer to produce
- Prompt design: being concise saves tokens and cost
- Related concepts: Tokenization, Embeddings, Context Window, Vocabulary, LLM
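The rules of thumb and the cost and context-limit points above can be combined into a rough estimator. The sketch below is only an approximation for English text: the function names, the prices, and the context window size are hypothetical placeholders, and a real tokenizer should be used whenever exact counts matter.

```python
import math

CHARS_PER_TOKEN = 4.0    # rule of thumb for English text
WORDS_PER_TOKEN = 0.75   # rule of thumb for English text


def estimate_tokens(text: str) -> int:
    """Rough token estimate: average of the character and word heuristics."""
    by_chars = len(text) / CHARS_PER_TOKEN
    by_words = len(text.split()) / WORDS_PER_TOKEN
    return math.ceil((by_chars + by_words) / 2)


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Cost of one request when input and output tokens are billed per 1,000."""
    return (input_tokens / 1000 * input_price_per_1k
            + output_tokens / 1000 * output_price_per_1k)


def fits_context(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """Prompt and requested output must fit in the context window together."""
    return input_tokens + max_output_tokens <= context_window


# Hypothetical prices and window size, chosen only for illustration.
prompt = "Tokens are the atomic units of text that a model processes."
n_in = estimate_tokens(prompt)
print("estimated input tokens:", n_in)
print("fits:", fits_context(n_in, max_output_tokens=256, context_window=8192))
print("cost:", estimate_cost(n_in, 256, input_price_per_1k=0.01, output_price_per_1k=0.03))
```

Because the heuristics are calibrated on English prose, the estimate will usually be too low for non-English text.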
Common Tokenization Schemes
| Scheme | Description | Used By |
|--------|-------------|---------|
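| BPE (Byte Pair Encoding) | Iteratively merges the most frequent adjacent pair of symbols (characters or bytes) | GPT-2, GPT-4, LLaMA |
| WordPiece | Similar to BPE, but chooses merges that most increase training-data likelihood rather than raw pair frequency | BERT |
| SentencePiece | Language-agnostic library that trains on raw text without pre-tokenization; implements BPE and unigram models | T5, Gemini |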
| Tiktoken | OpenAI's fast BPE implementation | GPT-3.5, GPT-4 |
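To make the BPE row concrete, here is a minimal sketch of BPE training on a toy corpus. It works on characters within whitespace-separated words, whereas production tokenizers typically operate on bytes and apply extra pre-tokenization rules. The function names and the tie-breaking rule are assumptions made for this illustration.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    words = {tuple(w): f for w, f in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        # Tie-break deterministically: highest count first, then lexicographic order.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


merges, words = train_bpe("low low low lower lowest", num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
print(words)   # {('low',): 3, ('lowe', 'r'): 1, ('lowe', 's', 't'): 1}
```

Each learned merge adds one entry to the vocabulary, so the number of merges directly controls vocabulary size.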
Token Examples (GPT-4 tokenizer)
| Text | Tokens | Count |
|------|--------|-------|
| "Hello, world!" | ["Hello", ",", " world", "!"] | 4 |
| "tokenization" | ["token", "ization"] | 2 |
| "LLM" | ["L", "LM"] or ["LLM"] | varies |
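The splits in the table can be imitated with a toy greedy longest-match tokenizer. This is a sketch under stated assumptions, not the GPT-4 tokenizer: the vocabulary and token IDs below are hand-picked and hypothetical, and real BPE tokenizers apply learned merge rules rather than longest-match lookup. It does show how a word is split into known sub-word pieces and why a leading space produces a different token.

```python
# Hypothetical vocabulary mapping token text to an arbitrary ID.
VOCAB = {
    "Hello": 0, ",": 1, " world": 2, "!": 3,
    "token": 4, "ization": 5,
    "hello": 6, " hello": 7,
}


def greedy_tokenize(text, vocab):
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    max_len = max(len(piece) for piece in vocab)
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            # Unknown single characters fall through as their own token.
            if piece in vocab or length == 1:
                tokens.append(piece)
                i += length
                break
    return tokens


print(greedy_tokenize("Hello, world!", VOCAB))  # ['Hello', ',', ' world', '!']
print(greedy_tokenize("tokenization", VOCAB))   # ['token', 'ization']
print(greedy_tokenize(" hello", VOCAB))         # [' hello']
print(greedy_tokenize("hello", VOCAB))          # ['hello']
```

Text that is not covered by the vocabulary falls back to single characters here, which mirrors how sub-word tokenizers avoid having truly unknown words.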