Beginner·3 min read

Tokenization

Tokenization is the process of converting raw text into a sequence of tokens — the discrete numeric IDs that an LLM can process. It is the first and l

Definition

Tokenization is the process of converting raw text into a sequence of tokens — the discrete numeric IDs that an LLM can process. It is the first and last step in every LLM pipeline: text → token IDs (encoding) and token IDs → text (decoding).

The Full Pipeline

`

Raw Text → [Tokenizer] → Token IDs → [Model] → Token IDs → [Detokenizer] → Output Text

"Hello!" → [15496, 0] → ...model... → [2159] → "World"

`

Steps in Tokenization

1. Normalization — Unicode normalization, lowercasing (model-dependent), whitespace handling

2. Pre-tokenization — split on spaces/punctuation as initial boundary hints

3. Subword splitting — apply BPE/WordPiece/SentencePiece rules to produce final token units

4. Vocabulary lookup — map each token string → integer ID

5. Special token injection — add [BOS], [EOS], [PAD] as required by the model

Major Algorithms

Byte Pair Encoding (BPE)

  • Start with individual characters as vocabulary
  • Repeatedly merge the most frequent adjacent pair
  • Stop when vocabulary reaches target size
  • Used by: GPT family, LLaMA, Mistral
  • WordPiece

  • Similar to BPE but merges based on maximizing likelihood of training data
  • Unknown tokens split with ## prefix for continuations
  • Used by: BERT, DistilBERT
  • SentencePiece

  • Treats input as a raw byte stream (language-agnostic)
  • Supports both BPE and unigram language model modes
  • Handles spaces explicitly with symbol
  • Used by: T5, XLNet, Gemini
  • Unigram Language Model

  • Starts with a large vocabulary and prunes tokens that minimize loss increase
  • More probabilistic approach
  • Encoding vs. Decoding

    | Direction | Term | Description |

    |-----------|------|-------------|

    | Text → IDs | Encoding / Tokenizing | Used at input time |

    | IDs → Text | Decoding / Detokenizing | Used at output time |

    Why Tokenization Matters

  • Model performance: poor tokenization = poor understanding of rare words, numbers, code
  • Multilingual support: byte-level tokenizers handle all languages; word-level struggles with non-Latin scripts
  • Arithmetic tasks: models struggle with math partly because numbers tokenize inconsistently ("123" may be 1 or 3 tokens)
  • Prompt engineering: knowing how text tokenizes helps design efficient, precise prompts
  • Common Gotchas

  • Leading spaces: " word""word" as tokens
  • Numbers: "1000000" may tokenize as ["100", "0000"] — not ["1000000"]
  • Code symbols: ->, =>, // each tokenize differently per tokenizer
  • Emoji/Unicode: may tokenize into many byte-level tokens
  • Tools

  • tiktoken (OpenAI) — fast Python tokenizer for GPT models
  • transformers.AutoTokenizer (HuggingFace) — universal tokenizer loader
  • Tokenizer Playground (OpenAI) — visual token inspection
  • Related Concepts

  • Token, Embeddings, Vocabulary, Context Window, BPE

Go Deeper With Live Instruction

This topic is covered in depth in our llm engineering program (Session 1).