Subword Tokenization (BPE, WordPiece, SentencePiece)

Also known as: BPE, byte-pair encoding, WordPiece, SentencePiece, Unigram tokenization, tiktoken

TL;DR

Subword tokenization is the family of algorithms that learn a vocabulary of subword units from a corpus. BPE (byte-pair encoding) merges the most frequent adjacent pairs; WordPiece and Unigram are alternative algorithms in the same family. tiktoken, SentencePiece, and Hugging Face tokenizers are the standard libraries.

The dominant approach is byte-pair encoding (BPE): start with raw bytes, iteratively merge the most frequent adjacent pair into a new token, and stop at a target vocabulary size (typically 32K to 200K). Most modern large language models ship with a tokenizer trained this way, and that choice fixes most of the per-token cost and coverage tradeoffs the model will ever have.

The training procedure

BPE training is a greedy, frequency-driven loop. Initialize the vocabulary as the 256 raw bytes. Sweep the corpus, count every adjacent token pair, add the most frequent pair as a new token, replace its occurrences, and repeat until the vocabulary reaches the target size. The output is an ordered merge list. At inference the merges are applied in the order they were learned, so once trained, tokenization is deterministic.
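
The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any production tokenizer is implemented: it works on a single byte string, does no pre-tokenization, recounts all pairs on every iteration, and breaks ties by first occurrence. The name `train_bpe` and the early stop when no pair repeats are choices made for this sketch.

```python
from collections import Counter

def train_bpe(corpus: bytes, num_merges: int):
    """Learn up to `num_merges` byte-level BPE merges from `corpus`.

    Returns (merges, ids): the ordered merge list and the corpus
    re-encoded with those merges applied.
    """
    ids = list(corpus)        # base vocabulary: the 256 raw byte values
    merges = []               # ordered list of ((left, right), new_id)
    next_id = 256             # first id available for a learned token
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))   # count every adjacent pair
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:
            break             # assumption of this sketch: stop when nothing repeats
        # Replace every occurrence of the best pair, scanning left to right.
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        merges.append((best, next_id))
        next_id += 1
    return merges, ids

merges, ids = train_bpe(b"aaabdaaabac", num_merges=10)
```

On this toy input the first learned merge is the byte pair `(97, 97)`, that is "aa", because it is the most frequent adjacent pair. Real trainers avoid the full recount per merge by updating pair counts incrementally.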

The variants

BPE (Sennrich et al., 2015, originally for neural MT): the dominant algorithm. tiktoken’s cl100k_base (GPT-4), o200k_base (GPT-4o), and Llama-3’s tokenizer all use BPE over raw bytes.
WordPiece (BERT, DistilBERT): picks each merge to maximize corpus likelihood under a unigram LM rather than raw pair frequency. The practical difference from BPE is small.
Unigram language model (Kudo, 2018): starts with an overcomplete vocabulary and uses EM to prune the least likely subwords down to the target size. Used by T5 and mBART.
SentencePiece: the library, not an algorithm. It supports BPE or Unigram and operates on raw text, whitespace included. A common default for non-English or whitespace-irregular targets.
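
To make the BPE versus WordPiece distinction concrete, the sketch below scores the same candidate pairs both ways. The WordPiece score used here, pair count divided by the product of the two unigram counts, is the formulation commonly given in tutorials and is an assumption of this example rather than a confirmed detail of any specific implementation. `pair_scores` is a hypothetical helper name.

```python
from collections import Counter

def pair_scores(tokens):
    """Score every adjacent pair by BPE and WordPiece-style criteria."""
    unigrams = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    # BPE: raw frequency of the pair.
    bpe = dict(pairs)
    # WordPiece-style (assumed formulation): frequency normalized by how
    # common each half is on its own.
    wordpiece = {
        (a, b): count / (unigrams[a] * unigrams[b])
        for (a, b), count in pairs.items()
    }
    return bpe, wordpiece

bpe, wordpiece = pair_scores(list("abababxy"))
best_bpe = max(bpe, key=bpe.get)
best_wordpiece = max(wordpiece, key=wordpiece.get)
```

On this toy sequence BPE picks `('a', 'b')`, the most frequent pair, while the normalized score picks `('x', 'y')`, a rare pair whose halves never occur apart.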

Why this matters in production

Vocab size trades cost for coverage. A bigger vocabulary means fewer tokens per input, which gives cheaper inference and a longer effective context window in characters. The price is a bigger embedding matrix and softmax, which can dominate parameter count in small models. The trajectory: 30K BERT, 50K GPT-2, 100K cl100k_base, 128K Llama 3, 200K o200k_base. Much of that growth was motivated by multilingual coverage.
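
The parameter side of this tradeoff is simple arithmetic: each vocabulary entry costs one row of width `d_model` in the embedding matrix, and a second one in the output projection if the two are not tied. The sketch below assumes a hypothetical model width of 768 and a hypothetical 85M non-embedding parameters; both numbers are illustrative only.

```python
def embedding_params(vocab_size: int, d_model: int, tied: bool = True) -> int:
    """Parameters in the input embedding plus the output projection."""
    matrices = 1 if tied else 2   # tied weights share a single matrix
    return matrices * vocab_size * d_model

def embedding_share(vocab_size: int, d_model: int, body_params: int,
                    tied: bool = True) -> float:
    """Fraction of all parameters that sit in the vocabulary matrices."""
    emb = embedding_params(vocab_size, d_model, tied)
    return emb / (emb + body_params)

# Hypothetical small model: d_model=768, 85M parameters outside the embeddings.
small_vocab = embedding_share(30_000, 768, 85_000_000)    # about 0.21
large_vocab = embedding_share(200_000, 768, 85_000_000)   # about 0.64
```

Under these assumptions, moving from a 30K to a 200K vocabulary raises the embedding share from roughly a fifth of the model to nearly two thirds, which is why large vocabularies are easier to afford in large models.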

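A trained model is bound to the exact tokenizer it was trained against; swapping the tokenizer changes which token each ID refers to, so every embedding row ends up looked up for the wrong input. Post-hoc multilingual extensions add script-specific tokens to the existing vocabulary rather than retraining the tokenizer, because retraining would invalidate the model.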

A token is not a word. It is a frequency-driven subword unit: common English words are usually one token, but, depending on the tokenizer, “perplexity” may be two, “Llama-3” three, and “안녕하세요” often eight or more.

Llama-3’s tokenizer includes all 256 raw bytes as fallback tokens. If the encoder ever sees a string with no learned merge (a rare emoji, an obscure Unicode codepoint, a script it never saw), it falls back to encoding the UTF-8 bytes directly. The model has embeddings for every byte, so the input is always representable.

This is what makes a tokenizer “complete” in the strict sense: no string can crash the encoder with an out-of-vocabulary (OOV) error. Older WordPiece tokenizers (BERT) used a [UNK] token instead, which silently destroyed information. Byte-fallback fixed this at the cost of long sequences on scripts the tokenizer rarely saw in training.
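
The contrast between [UNK] and byte-fallback can be shown with a toy character-level vocabulary. This is a sketch under assumptions: real tokenizers match multi-character subwords rather than single characters, and the `<0xNN>` spelling for byte tokens is a notation chosen for this example. All three function names are hypothetical.

```python
def encode_with_unk(text, vocab):
    """Closed vocabulary: every unknown character collapses to [UNK]."""
    return [ch if ch in vocab else "[UNK]" for ch in text]

def encode_with_byte_fallback(text, vocab):
    """Unknown characters are spelled out as their UTF-8 bytes."""
    out = []
    for ch in text:
        if ch in vocab:
            out.append(ch)
        else:
            out.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return out

def decode_byte_fallback(tokens):
    """Invert encode_with_byte_fallback by reassembling the byte stream."""
    buf = bytearray()
    for t in tokens:
        if len(t) == 6 and t.startswith("<0x") and t.endswith(">"):
            buf.append(int(t[3:5], 16))
        else:
            buf.extend(t.encode("utf-8"))
    return buf.decode("utf-8")

vocab = set("helo ")
text = "hello 🦙"
lossy = encode_with_unk(text, vocab)               # the emoji becomes [UNK]
lossless = encode_with_byte_fallback(text, vocab)  # the emoji becomes 4 byte tokens
```

The [UNK] encoding cannot be decoded back to the original string, while the byte-fallback encoding round-trips exactly. The cost is visible in the length: one unknown character becomes four tokens.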

Subword statistics differ wildly across corpora. A tokenizer trained on English web text and used to encode Python can lose 30-50% of its compression efficiency, because punctuation, indentation, and identifier conventions don’t match its merges. Modern recipes train the tokenizer on a representative sample of the exact pretraining corpus mixture.

This is the deeper reason “tokenizer surgery” rarely works. Patching in new tokens without retraining leaves the model with embedding rows it has never trained: the new tokens encode the input, but their internal representation is whatever random initialization they got. The fix is continued pretraining so the embeddings can adapt.

Go further

How is BPE actually trained?

Start with byte-level (or character-level) tokens; count adjacent token pairs in the corpus; merge the most frequent pair into a new token; repeat until reaching the target vocab size. The ordered merge list, applied at inference in the order the merges were learned, is the trained tokenizer.
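
Applying a trained merge list at inference can be sketched as follows. The merge list here is a hypothetical three-entry example written by hand, in the form `((left, right), new_id)`; `bpe_encode` and `bpe_decode` are names chosen for this sketch. At each step the encoder picks, among the pairs present, the one that was learned earliest.

```python
def bpe_encode(data: bytes, merges):
    """Encode `data` with an ordered list of ((left, right), new_id) merges."""
    rank = {pair: (i, new_id) for i, (pair, new_id) in enumerate(merges)}
    ids = list(data)
    while len(ids) >= 2:
        candidates = [p for p in zip(ids, ids[1:]) if p in rank]
        if not candidates:
            break                                  # no learned merge applies
        best = min(candidates, key=lambda p: rank[p][0])  # earliest-learned pair
        new_id = rank[best][1]
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

def bpe_decode(ids, merges):
    """Expand learned tokens back into the original bytes."""
    expand = {new_id: pair for pair, new_id in merges}
    stack, out = list(reversed(ids)), bytearray()
    while stack:
        t = stack.pop()
        if t in expand:
            left, right = expand[t]
            stack.append(right)    # pushed first so `left` is popped first
            stack.append(left)
        else:
            out.append(t)
    return bytes(out)

# Hypothetical merge list: "aa" -> 256, then 256+"a" -> 257, then 257+"b" -> 258.
merges = [((97, 97), 256), ((256, 97), 257), ((257, 98), 258)]
ids = bpe_encode(b"aaab", merges)
```

Because the merge order is fixed, the same input always yields the same ids, and decoding is a pure table expansion. Bytes with no applicable merge, such as `b"abc"` under this list, pass through as single-byte tokens.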

Why does the same prompt cost more in Korean or Hindi?

Most modern tokenizers (cl100k_base, Llama-3's tiktoken-based tokenizer) were trained on English-dominant corpora, so non-Latin scripts get split into more, sometimes many more, tokens. A Korean sentence can take 3-5× as many tokens as its English translation. Larger vocabularies such as o200k_base and multilingual tokenizers like XLM-R's SentencePiece model partially fix this.
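
Part of the gap is visible before any tokenizer is involved. A byte-level tokenizer with no learned merges for a script emits one token per UTF-8 byte, so the byte count is the worst case for the token count. The sketch below uses only the standard library; `utf8_profile` is a hypothetical helper, and actual token counts depend on the tokenizer's learned merges.

```python
def utf8_profile(text: str):
    """Return (characters, UTF-8 bytes, bytes per character) for `text`."""
    chars = len(text)
    nbytes = len(text.encode("utf-8"))
    return chars, nbytes, nbytes / chars

english = utf8_profile("hello")        # ASCII: 1 byte per character
korean = utf8_profile("안녕하세요")      # Hangul syllables: 3 bytes per character
hindi = utf8_profile("नमस्ते")          # Devanagari: 3 bytes per code point
```

An English-heavy merge list compresses ASCII text well below one token per byte, while Hangul or Devanagari text starts at three bytes per character and gets fewer merges to bring that down.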

What's the difference between BPE and SentencePiece?

Strictly, BPE is an algorithm and SentencePiece is a library that can run BPE or Unigram. The practical contrast is with conventional BPE pipelines, which merge by frequency on pre-tokenized text (usually whitespace-split). SentencePiece operates directly on the raw text, whitespace included, treating a space as just another symbol. That makes it the usual choice for languages without whitespace word boundaries (Chinese, Japanese, Thai).
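
The difference in whitespace handling can be illustrated without the library itself. The sketch below contrasts whitespace pre-tokenization with the whitespace-as-symbol idea, using "▁" (U+2581) as the space marker. It does not call SentencePiece and does not reproduce its normalization options; the function names are hypothetical, and the round trip assumes the input contains no "▁" of its own.

```python
WS = "\u2581"  # "▁", used here as the visible stand-in for a space

def whitespace_pretokenize(text: str):
    """Split on whitespace; the exact spacing is discarded."""
    return text.split()

def to_symbols(text: str):
    """Keep spaces as ordinary symbols so the text can be rebuilt exactly."""
    return list(text.replace(" ", WS))

def from_symbols(symbols):
    """Invert to_symbols."""
    return "".join(symbols).replace(WS, " ")

text = "hello  world"                       # note the double space
words = whitespace_pretokenize(text)        # ['hello', 'world']
rebuilt_from_words = " ".join(words)        # double space is lost
rebuilt_from_symbols = from_symbols(to_symbols(text))

# A language without spaces gives the pre-tokenizer nothing to split on.
unsplit = whitespace_pretokenize("東京都に住む")
```

With whitespace splitting, the original spacing cannot be recovered and an unsegmented sentence arrives as one giant "word". With spaces kept as symbols, the subword learner sees the whole string and decides the boundaries itself.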
