Chapter 1: Tokenization from Scratch¶
Learning Outcome¶
Understand how raw text is converted into integer token IDs, implement a basic BPE
tokenizer, and demystify what AutoTokenizer is doing behind the scenes.
Concepts¶
What a Tokenizer Does¶
A tokenizer is a two-way mapping between text and integer sequences:
"Hello, world!" → [15496, 11, 995, 0] (GPT-2 encoding)
[15496, 11, 995, 0] → "Hello, world!" (decoding)
Neural networks cannot process raw text — they need fixed-size numerical inputs. A tokenizer defines the vocabulary: a finite set of text fragments (tokens), each assigned a unique integer ID.
The key design tension is:
- Larger vocabulary → shorter sequences (better for the model), but more parameters in the embedding table.
- Smaller vocabulary → longer sequences and potentially many
[UNK]tokens for rare words.
Byte-Pair Encoding (BPE)¶
BPE is a bottom-up subword algorithm, used by GPT-2, GPT-4, LLaMA, and most modern LLMs.
Training algorithm:
- Start with a vocabulary of individual characters (or bytes).
- Count all adjacent symbol pairs in the corpus.
- Merge the most frequent pair into a new symbol.
- Repeat until the vocabulary reaches the desired size.
Example — training BPE on "low lower newest widest":
Initial tokens: l o w _ , l o w e r _ , n e w e s t _ , w i d e s t _
Most frequent pair: (e, s) → merge to "es"
Tokens: l o w _ , l o w e r _ , n e w es t _ , w i d es t _
Most frequent pair: (es, t) → merge to "est"
Tokens: l o w _ , l o w e r _ , n e w est _ , w i d est _
Most frequent pair: (l, o) → merge to "lo"
Tokens: lo w _ , lo w e r _ , n e w est _ , w i d est _
...
After training, you store the merge table (ordered list of merges). To encode new text, apply merges in order.
WordPiece vs. BPE vs. SentencePiece¶
| Tokenizer | Used by | Strategy |
|---|---|---|
| BPE | GPT-2, GPT-4, LLaMA | Merge most-frequent pairs |
| WordPiece | BERT | Merge pairs maximizing likelihood |
| SentencePiece | T5, LLaMA 2/3 | BPE/Unigram on raw bytes (language-agnostic) |
Special Tokens¶
Every tokenizer reserves IDs for control symbols:
| Token | Purpose |
|---|---|
[CLS] |
Start of sequence (BERT classification) |
[SEP] |
Sentence separator (BERT) |
[PAD] |
Padding to fixed length |
[MASK] |
Masked position (BERT MLM training) |
[UNK] |
Unknown token |
<s>, </s> |
Start/end of sequence (GPT, T5) |
<\|endoftext\|> |
GPT-2 sequence boundary |
The tokenizer.json Format¶
HuggingFace tokenizers serialize to a JSON with these fields:
{
"model": {
"type": "BPE",
"vocab": {"Ġhello": 15496, ...},
"merges": ["Ġ h", "e l", ...]
},
"added_tokens": [{"id": 50256, "content": "<|endoftext|>", "special": true}],
"post_processor": {
"type": "GPT2BPE"
}
}
Exercise 1 — Implement a Minimal BPE Tokenizer¶
Guided Exercise
Follow along and implement each step together. Run each code cell before moving on.
Step 1: Build the initial vocabulary from a corpus¶
from collections import defaultdict, Counter
def build_vocab(text: str) -> dict[str, int]:
"""
Split text into words, then into characters.
Returns word frequencies as {' '.join(chars): count}.
"""
word_freq = Counter(text.lower().split())
vocab = {}
for word, freq in word_freq.items():
# Represent each word as space-separated characters + end-of-word marker
key = ' '.join(list(word)) + ' </w>'
vocab[key] = freq
return vocab
corpus = """
the quick brown fox jumps over the lazy dog
the dog barked at the fox
the fox ran quickly away
"""
vocab = build_vocab(corpus)
print("Initial vocabulary (first 5 entries):")
for word, freq in list(vocab.items())[:5]:
print(f" {freq}x {word!r}")
Step 2: Count adjacent symbol pairs¶
def get_pair_stats(vocab: dict[str, int]) -> Counter:
"""Count all adjacent symbol pairs across all word entries."""
pairs = Counter()
for word, freq in vocab.items():
symbols = word.split()
for i in range(len(symbols) - 1):
pairs[(symbols[i], symbols[i + 1])] += freq
return pairs
pairs = get_pair_stats(vocab)
print("Top 5 most frequent pairs:")
for pair, count in pairs.most_common(5):
print(f" {count}x {pair}")
Step 3: Merge the most frequent pair¶
def merge_pair(pair: tuple[str, str], vocab: dict[str, int]) -> dict[str, int]:
"""Replace all occurrences of `pair` with the merged symbol."""
new_vocab = {}
bigram = ' '.join(pair)
replacement = ''.join(pair)
for word, freq in vocab.items():
new_word = word.replace(bigram, replacement)
new_vocab[new_word] = freq
return new_vocab
# Merge the most frequent pair
best_pair = pairs.most_common(1)[0][0]
print(f"Merging: {best_pair} → {''.join(best_pair)}")
vocab = merge_pair(best_pair, vocab)
Step 4: Train BPE for N merges¶
def train_bpe(text: str, num_merges: int) -> tuple[dict, list]:
"""Train BPE and return vocabulary + merge rules."""
vocab = build_vocab(text)
merges = []
for i in range(num_merges):
pairs = get_pair_stats(vocab)
if not pairs:
break
best = pairs.most_common(1)[0][0]
vocab = merge_pair(best, vocab)
merges.append(best)
if (i + 1) % 10 == 0:
print(f"Merge {i+1:3d}: {best[0]!r} + {best[1]!r} → {''.join(best)!r}")
return vocab, merges
vocab, merges = train_bpe(corpus, num_merges=30)
Step 5: Encode a new sentence¶
def encode(text: str, merges: list) -> list[str]:
"""Apply learned merge rules to encode text into subword tokens."""
tokens = []
for word in text.lower().split():
# Start with characters
word_tokens = list(word) + ['</w>']
# Apply merges in order
for pair in merges:
i = 0
new_tokens = []
while i < len(word_tokens):
if (i < len(word_tokens) - 1 and
word_tokens[i] == pair[0] and
word_tokens[i + 1] == pair[1]):
new_tokens.append(''.join(pair))
i += 2
else:
new_tokens.append(word_tokens[i])
i += 1
word_tokens = new_tokens
tokens.extend(word_tokens)
return tokens
sentence = "the fox jumped"
tokens = encode(sentence, merges)
print(f"Input: {sentence!r}")
print(f"Tokens: {tokens}")
Exercise 2 — Compare BERT and GPT-2 Tokenizers¶
Guided Exercise
Load both tokenizers and tokenize the same sentence to see how they differ.
from transformers import AutoTokenizer
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
sentence = "Transformers revolutionized natural language processing."
bert_ids = bert_tok.encode(sentence, add_special_tokens=True)
gpt2_ids = gpt2_tok.encode(sentence)
print("BERT tokens:")
print(" IDs:", bert_ids)
print(" Pieces:", bert_tok.convert_ids_to_tokens(bert_ids))
print("\nGPT-2 tokens:")
print(" IDs:", gpt2_ids)
print(" Pieces:", gpt2_tok.convert_ids_to_tokens(gpt2_ids))
Notice:
- BERT uses
[CLS]and[SEP]special tokens; GPT-2 does not. - BERT lowercases input (
bert-base-**uncased**). - GPT-2 uses the
Ġprefix (Unicode 0x0120) to mark word-initial characters. - Rare words may be split differently — compare "revolutionized".
# Inspect batch encoding with offset mapping
encoding = bert_tok(sentence, return_offsets_mapping=True)
print("\nBERT offset mapping (token → character span):")
pieces = bert_tok.convert_ids_to_tokens(encoding["input_ids"])
for tok, span in zip(pieces, encoding["offset_mapping"]):
print(f" {tok:20s} chars {span[0]:3d}–{span[1]:3d}")
Exercise 3 — Inspect the tokenizer.json File¶
import json
from transformers.utils import cached_file
# Locate the tokenizer.json on disk
path = cached_file("gpt2", "tokenizer.json")
print(f"Tokenizer file: {path}\n")
with open(path) as f:
data = json.load(f)
print("Top-level keys:", list(data.keys()))
print("\nModel type:", data["model"]["type"])
print("Vocab size:", len(data["model"]["vocab"]))
print("\nFirst 5 merge rules:")
for merge in data["model"]["merges"][:5]:
print(f" {merge!r}")
print("\nAdded tokens:")
for tok in data["added_tokens"]:
print(f" id={tok['id']:6d} content={tok['content']!r} special={tok['special']}")
Map each field to the BPE algorithm you implemented in Exercise 1:
tokenizer.json field |
Corresponds to |
|---|---|
model.vocab |
Final symbol → ID mapping |
model.merges |
Ordered merge rules |
added_tokens |
Special tokens inserted outside BPE |
post_processor |
Rules for adding [CLS]/[SEP] etc. |
Summary¶
- Tokenizers split text into subword units using learned merge rules (BPE) or likelihood maximization (WordPiece).
- The vocabulary size controls the trade-off between sequence length and embedding table size; modern LLMs typically use 32 k–128 k tokens.
- HuggingFace tokenizers serialize to
tokenizer.json; every field maps directly to an algorithm step. - Special tokens (
[CLS],[PAD],<|endoftext|>) are reserved IDs inserted by the post-processor.