Token — FDE@ProdAI Blog

Definition

A token is the atomic unit of text that an LLM processes. It is a piece of text — such as a word, sub-word, character, or punctuation symbol — that the model reads and generates one unit at a time.

Why Tokens (Not Characters or Words)?

Characters → too granular, very long sequences, poor semantic grouping
Words → vocabulary explodes (millions of rare/compound words), can't handle unknown words
Sub-word tokens → best balance: compact vocabulary (~32K–128K tokens), handles rare words by splitting them, retains common words whole

Common Tokenization Schemes

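| Scheme | Description | Used By |
|--------|-------------|---------|
| BPE (Byte Pair Encoding) | Iteratively merges the most frequent adjacent pair of symbols (bytes, in byte-level BPE) | GPT-2, GPT-4, LLaMA |
| WordPiece | Similar to BPE, but picks the merges that most increase training-data likelihood | BERT |
| SentencePiece | Language-agnostic library that trains BPE or unigram models directly on raw text, treating whitespace as an ordinary symbol | T5, Gemini |
| Tiktoken | OpenAI's fast BPE implementation (a library, not a separate algorithm) | GPT-3.5, GPT-4 |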
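
To make the BPE row concrete, here is a minimal sketch of how merge rules could be learned and then applied. It is a toy, character-level version: production tokenizers work on bytes, pre-split text with regular expressions, and use far larger corpora. The function names (`train_bpe`, `merge_pair`, `encode`) and the tiny corpus are illustrative assumptions, not any library's API.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Start from single characters; real byte-level BPE starts from bytes.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself for repeatability.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus, chosen only to keep the pair counts easy to check by hand.
corpus = ["hug"] * 10 + ["pug"] * 5 + ["pun"] * 12 + ["bun"] * 4 + ["hugs"] * 5
merges = train_bpe(corpus, num_merges=3)
print(merges)                  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(encode("hugs", merges))  # ['hug', 's']
print(encode("bug", merges))   # ['b', 'ug']
```

Note that "bug" never appears in the training corpus, yet it still encodes into known pieces. That is the property that lets sub-word vocabularies handle rare or unseen words.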

Token Examples (GPT-4 tokenizer)

| Text | Tokens | Count |
|------|--------|-------|
| "Hello, world!" | ["Hello", ",", " world", "!"] | 4 |
| "tokenization" | ["token", "ization"] | 2 |
| "LLM" | ["L", "LM"] or ["LLM"] | varies |

Key Properties

Vocabulary size: typically 32K–128K unique tokens, with some recent tokenizers exceeding 200K
Token ≠ word: one word can be 1–4 tokens; one token can span multiple characters
Special tokens: reserved markers such as <|endoftext|>, [CLS], [SEP], and [PAD] that control model behavior rather than encode text
Whitespace matters: " hello" and "hello" are often different tokens
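
The "token ≠ word" and "whitespace matters" points can be illustrated with a small sketch. It uses greedy longest-match lookup against a hand-written vocabulary, which is closer in spirit to WordPiece inference than to BPE, and it is not how any production tokenizer is implemented. `TOY_VOCAB` and `greedy_tokenize` are hypothetical names made up for this example.

```python
# Hypothetical vocabulary; note that "hello" and " hello" are separate entries.
TOY_VOCAB = {"Hello", "hello", " hello", " world", "token", "ization", ",", "!"}


def greedy_tokenize(text, vocab):
    """Repeatedly take the longest vocabulary entry matching at the current
    position, falling back to a single character when nothing matches."""
    max_len = max(len(entry) for entry in vocab)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab or length == 1:
                tokens.append(piece)
                i += length
                break
    return tokens


print(greedy_tokenize("Hello, world!", TOY_VOCAB))  # ['Hello', ',', ' world', '!']
print(greedy_tokenize("tokenization", TOY_VOCAB))   # ['token', 'ization']
print(greedy_tokenize("hello hello", TOY_VOCAB))    # ['hello', ' hello']
```

In the last call the same word maps to two different tokens, depending only on whether a space precedes it.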

Token Counting Rules of Thumb

1 token ≈ 4 characters (English)
1 token ≈ 0.75 words (English)
Non-English languages are typically less efficient (more tokens per word)
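Code efficiency varies by tokenizer: vocabularies trained on code compress it well, while older ones such as GPT-2's spend many tokens on indentation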
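
As a rough illustration of the rules of thumb above, the two ratios can be turned into quick estimators. This is a sketch for English text only. The function names are made up for this example, and the results are approximations rather than real tokenizer counts.

```python
import math


def estimate_tokens_from_chars(text, chars_per_token=4.0):
    """Estimate token count from the '1 token is about 4 characters' rule."""
    return math.ceil(len(text) / chars_per_token)


def estimate_tokens_from_words(text, words_per_token=0.75):
    """Estimate token count from the '1 token is about 0.75 words' rule."""
    return math.ceil(len(text.split()) / words_per_token)


sentence = "The quick brown fox jumps over the lazy dog."
print(estimate_tokens_from_chars(sentence))  # 11 (44 characters / 4)
print(estimate_tokens_from_words(sentence))  # 12 (9 words / 0.75)
```

The two estimates rarely agree exactly. For billing or context-limit checks, count with the model's actual tokenizer instead.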

Practical Implications

Cost: APIs charge per token (input + output)
Context limits: models have a maximum token count they can process at once (context window)
Latency: output is generated one token at a time, so response time grows with output length; longer prompts also delay the first token
Prompt design: being concise saves tokens and cost
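
The cost and context-limit points reduce to simple arithmetic, sketched below. The prices and the 128K window are placeholder assumptions, not any provider's actual rates or limits. The sketch also assumes that input and output tokens share one context window, which is common but worth checking for the model in use.

```python
def estimate_cost(input_tokens, output_tokens, input_price_per_1m, output_price_per_1m):
    """Cost of one request, given separate per-million-token prices for input and output."""
    total = input_tokens * input_price_per_1m + output_tokens * output_price_per_1m
    return total / 1_000_000


def fits_context(input_tokens, max_output_tokens, context_window):
    """True if the prompt plus the reserved output budget fits in the window."""
    return input_tokens + max_output_tokens <= context_window


# Hypothetical prices: $2.00 per 1M input tokens, $8.00 per 1M output tokens.
cost = estimate_cost(1_500, 500, input_price_per_1m=2.00, output_price_per_1m=8.00)
print(f"${cost:.4f}")                            # $0.0070

# Hypothetical 128K-token context window.
print(fits_context(120_000, 8_000, 128_000))     # True
print(fits_context(125_000, 8_000, 128_000))     # False
```

With these placeholder prices the 500 output tokens cost more than the 1,500 input tokens, which is why trimming verbose outputs can matter as much as trimming prompts.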

Related Concepts

Tokenization, Embeddings, Context Window, Vocabulary, LLM