Tokenization — FDE@ProdAI Blog

Definition

Tokenization is the process of converting raw text into a sequence of tokens: discrete units (words, subwords, characters, or bytes), each mapped to an integer ID that an LLM can process. It brackets every LLM pipeline: text → token IDs (encoding) on the way in, and token IDs → text (decoding) on the way out.

The Full Pipeline

Raw Text → [Tokenizer] → Token IDs → [Model] → Token IDs → [Detokenizer] → Output Text

"Hello!" → [15496, 0] → ...model... → [2159] → " World" (the output token carries its own leading space)

Steps in Tokenization

1. Normalization — Unicode normalization, lowercasing (model-dependent), whitespace handling

2. Pre-tokenization — split on spaces/punctuation as initial boundary hints

3. Subword splitting — apply BPE/WordPiece/Unigram rules to produce final token units

4. Vocabulary lookup — map each token string → integer ID

5. Special token injection — add [BOS], [EOS], [PAD] as required by the model
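
The five steps above can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not any particular model's tokenizer: the `VOCAB` and `MERGES` tables are hypothetical toy data, lowercasing is switched on, and step 3 applies BPE-style merge rules in priority order.

```python
import re
import unicodedata

# Hypothetical toy tables; real tokenizers ship tens of thousands of entries.
VOCAB = {"[BOS]": 0, "[EOS]": 1, "[UNK]": 2, "low": 3, "er": 4, "!": 5}
MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]  # merge rules in priority order

def normalize(text):
    # Step 1: Unicode normalization plus lowercasing (lowercasing is model-dependent).
    return unicodedata.normalize("NFKC", text).lower()

def pre_tokenize(text):
    # Step 2: split into runs of word characters and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def split_subwords(word):
    # Step 3: start from characters and apply each merge rule in order.
    symbols = list(word)
    for a, b in MERGES:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

def tokenize(text):
    pieces = [p for word in pre_tokenize(normalize(text)) for p in split_subwords(word)]
    ids = [VOCAB.get(p, VOCAB["[UNK]"]) for p in pieces]  # Step 4: vocabulary lookup
    return [VOCAB["[BOS]"]] + ids + [VOCAB["[EOS]"]]      # Step 5: special tokens

print(tokenize("Lower!"))  # [0, 3, 4, 5, 1]
```

Anything the toy vocabulary does not cover, such as "?", falls back to the `[UNK]` ID here; byte-level tokenizers avoid that fallback by always being able to emit raw bytes.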

Major Algorithms

Byte Pair Encoding (BPE)

Start with individual characters (or raw bytes, in byte-level BPE) as the vocabulary
Repeatedly merge the most frequent adjacent pair
Stop when vocabulary reaches target size
Used by: GPT family, LLaMA, Mistral
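
The merge loop described above can be sketched as a toy trainer. It is a simplified illustration, assuming a character-level start and a fixed number of merges in place of a target vocabulary size (the two are equivalent, since each merge adds one vocabulary entry); the function name `train_bpe` is made up for this example.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy, character-level)."""
    # Represent each distinct word as a tuple of symbols with its frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

print(train_bpe(["ab", "ab", "ab", "abc"], 5))  # [('a', 'b'), ('ab', 'c')]
```

The learned merges are an ordered list; at encoding time the same rules are replayed in that order on new text.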

WordPiece

Similar to BPE but merges based on maximizing likelihood of training data
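Words missing from the vocabulary are split into subword pieces; continuation pieces carry a ## prefix ("playing" → "play", "##ing"), and a word that cannot be covered maps to [UNK]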
Used by: BERT, DistilBERT
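
At encoding time, WordPiece tokenizers in the BERT family split each word by greedy longest-match-first lookup. The sketch below illustrates that lookup only (not the likelihood-based training); the vocabulary is a hypothetical toy set and `wordpiece_split` is a name chosen for this example.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark continuation pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

vocab = {"un", "##aff", "##able", "play", "##ing"}  # toy vocabulary
print(wordpiece_split("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_split("playing", vocab))    # ['play', '##ing']
print(wordpiece_split("xyz", vocab))        # ['[UNK]']
```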

SentencePiece

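A library rather than a single algorithm: trains directly on raw Unicode text, whitespace included, with no language-specific pre-tokenization (language-agnostic); byte fallback for unseen characters is optional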
Supports both BPE and unigram language model modes
Handles spaces explicitly with ▁ symbol
Used by: T5, XLNet, Gemini
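
The explicit space handling is what makes decoding lossless: spaces become part of the tokens, so joining the pieces recovers the original text. The sketch below illustrates only that convention, assuming the common default of a marker prepended to the start of the text; the helper names and the example pieces are hypothetical, and no real SentencePiece model is involved.

```python
MARK = "\u2581"  # the visible space marker "▁"

def mark_spaces(text):
    # Prepend a marker and turn every space into one, so whitespace is ordinary text.
    return MARK + text.replace(" ", MARK)

def restore_text(pieces):
    # Join the pieces, turn markers back into spaces, and drop the prepended one.
    text = "".join(pieces).replace(MARK, " ")
    return text[1:] if text.startswith(" ") else text

print(mark_spaces("Hello world"))                 # ▁Hello▁world
print(restore_text(["▁Hello", "▁wor", "ld"]))     # Hello world
```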

Unigram Language Model

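Starts with a large candidate vocabulary and repeatedly prunes the tokens whose removal increases the training loss the least
Probabilistic: each token has a probability, so a word can have several valid segmentations; the most likely one is used, or one is sampled during training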
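
Choosing the most likely segmentation under a unigram model is a small dynamic program (Viterbi search). The sketch below assumes the token probabilities are already known; the numbers are invented toy values, and the pruning-based training step is not shown.

```python
import math

def best_segmentation(text, log_probs):
    """Return the highest-probability split of `text` into known tokens, or None."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of segmenting text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last token in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        return None  # the vocabulary cannot cover the text
    # Walk the back-pointers from the end to recover the tokens.
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

# Toy token probabilities (invented for illustration).
probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.1,
         "hu": 0.15, "ug": 0.2, "hug": 0.3, "gs": 0.1}
log_probs = {tok: math.log(p) for tok, p in probs.items()}
print(best_segmentation("hugs", log_probs))  # ['hug', 's']
```

Here "hug" + "s" (0.3 × 0.1) beats "hu" + "gs" (0.15 × 0.1) and "h" + "ug" + "s" (0.05 × 0.2 × 0.1).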

Encoding vs. Decoding

| Direction | Term | Description |
|-----------|------|-------------|
| Text → IDs | Encoding / Tokenizing | Used at input time |
| IDs → Text | Decoding / Detokenizing | Used at output time |
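
Decoding is essentially the inverse lookup followed by concatenation, usually with special tokens dropped. A minimal sketch, assuming a hypothetical toy vocabulary in which the leading space is stored inside the token:

```python
# Hypothetical toy vocabulary; the IDs are made up for this example.
VOCAB = {"[BOS]": 0, "[EOS]": 1, "Hello": 2, ",": 3, " world": 4}
ID_TO_TOKEN = {idx: tok for tok, idx in VOCAB.items()}
SPECIAL = {"[BOS]", "[EOS]"}

def encode(tokens):
    # Text → IDs: look each token string up in the vocabulary.
    return [VOCAB[t] for t in tokens]

def decode(ids, skip_special=True):
    # IDs → text: inverse lookup, optionally drop special tokens, then join.
    tokens = [ID_TO_TOKEN[i] for i in ids]
    if skip_special:
        tokens = [t for t in tokens if t not in SPECIAL]
    return "".join(tokens)

ids = encode(["[BOS]", "Hello", ",", " world", "[EOS]"])
print(ids)          # [0, 2, 3, 4, 1]
print(decode(ids))  # Hello, world
```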

Why Tokenization Matters

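Model performance: rare words, numbers, and code that fragment into many small tokens are harder for the model to represent and use up more of the context window
Multilingual support: byte-level tokenizers can encode any text without unknown tokens, though underrepresented scripts often cost more tokens per word; word-level tokenizers struggle with languages that do not separate words with spaces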
Arithmetic tasks: models struggle with math partly because numbers tokenize inconsistently ("123" may be 1 or 3 tokens)
Prompt engineering: knowing how text tokenizes helps design efficient, precise prompts

Common Gotchas

Leading spaces: " word" ≠ "word" as tokens
Numbers: "1000000" may tokenize as ["100", "0000"] — not ["1000000"]
Code symbols: ->, =>, // may each be a single token or several, depending on the tokenizer
Emoji/Unicode: may tokenize into many byte-level tokens
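
The emoji case follows from UTF-8 itself: a byte-level tokenizer that has no learned merge for a character falls back to one token per byte. The snippet below only counts UTF-8 bytes, which is the worst case under that assumption; actual token counts depend on the tokenizer's merges.

```python
def utf8_bytes(text):
    # The raw byte values a byte-level tokenizer starts from.
    return list(text.encode("utf-8"))

print(len(utf8_bytes("a")))   # 1
print(len(utf8_bytes("é")))   # 2
print(utf8_bytes("🙂"))       # [240, 159, 153, 130]
print(len(utf8_bytes("👍🏽")))  # 8 (base emoji plus a skin-tone modifier)
```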

Tools

tiktoken (OpenAI) — fast Python tokenizer for GPT models
transformers.AutoTokenizer (HuggingFace) — universal tokenizer loader
Tokenizer (OpenAI web tool) — visual token inspection
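
As a quick starting point with tiktoken, the sketch below counts tokens and inspects the individual pieces. It assumes the third-party `tiktoken` package is installed and that the `cl100k_base` encoding suits the model in question; the encoding data is fetched on first use, so the first call needs network access. The helper names are made up for this example.

```python
def count_tokens(text, encoding_name="cl100k_base"):
    """Count tokens with tiktoken (assumed installed: pip install tiktoken)."""
    import tiktoken  # imported lazily so this file loads without the package
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(text))

def show_pieces(text, encoding_name="cl100k_base"):
    """Return the text of each token, useful for spotting leading-space tokens."""
    import tiktoken
    enc = tiktoken.get_encoding(encoding_name)
    return [enc.decode([token_id]) for token_id in enc.encode(text)]
```

Calling `show_pieces("word")` and `show_pieces(" word")` side by side is an easy way to see the leading-space gotcha on a real vocabulary.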

Related Concepts

Token, Embeddings, Vocabulary, Context Window, BPE