Skip to content

Chapter 1: Tokenization from Scratch

Learning Outcome

Understand how raw text is converted into integer token IDs, implement a basic BPE tokenizer, and demystify what AutoTokenizer is doing behind the scenes.


Concepts

What a Tokenizer Does

A tokenizer is a two-way mapping between text and integer sequences:

"Hello, world!" → [15496, 11, 995, 0]  (GPT-2 encoding)
[15496, 11, 995, 0] → "Hello, world!"  (decoding)

Neural networks cannot process raw text — they need fixed-size numerical inputs. A tokenizer defines the vocabulary: a finite set of text fragments (tokens), each assigned a unique integer ID.

The key design tension is:

  • Larger vocabulary → shorter sequences (better for the model), but more parameters in the embedding table.
  • Smaller vocabulary → longer sequences and potentially many [UNK] tokens for rare words.

Byte-Pair Encoding (BPE)

BPE is a bottom-up subword algorithm, used by GPT-2, GPT-4, LLaMA, and most modern LLMs.

Training algorithm:

  1. Start with a vocabulary of individual characters (or bytes).
  2. Count all adjacent symbol pairs in the corpus.
  3. Merge the most frequent pair into a new symbol.
  4. Repeat until the vocabulary reaches the desired size.

Example — training BPE on "low lower newest widest":

Initial tokens: l o w _ , l o w e r _ , n e w e s t _ , w i d e s t _

Most frequent pair: (e, s) → merge to "es"
Tokens: l o w _ , l o w e r _ , n e w es t _ , w i d es t _

Most frequent pair: (es, t) → merge to "est"
Tokens: l o w _ , l o w e r _ , n e w est _ , w i d est _

Most frequent pair: (l, o) → merge to "lo"
Tokens: lo w _ , lo w e r _ , n e w est _ , w i d est _
...

After training, you store the merge table (ordered list of merges). To encode new text, apply merges in order.

WordPiece vs. BPE vs. SentencePiece

Tokenizer Used by Strategy
BPE GPT-2, GPT-4, LLaMA Merge most-frequent pairs
WordPiece BERT Merge pairs maximizing likelihood
SentencePiece T5, LLaMA 2/3 BPE/Unigram on raw bytes (language-agnostic)

Special Tokens

Every tokenizer reserves IDs for control symbols:

Token Purpose
[CLS] Start of sequence (BERT classification)
[SEP] Sentence separator (BERT)
[PAD] Padding to fixed length
[MASK] Masked position (BERT MLM training)
[UNK] Unknown token
<s>, </s> Start/end of sequence (GPT, T5)
<\|endoftext\|> GPT-2 sequence boundary

The tokenizer.json Format

HuggingFace tokenizers serialize to a JSON with these fields:

{
  "model": {
    "type": "BPE",
    "vocab": {"Ġhello": 15496, ...},
    "merges": ["Ġ h", "e l", ...]
  },
  "added_tokens": [{"id": 50256, "content": "<|endoftext|>", "special": true}],
  "post_processor": {
    "type": "GPT2BPE"
  }
}

Exercise 1 — Implement a Minimal BPE Tokenizer

Guided Exercise

Follow along and implement each step together. Run each code cell before moving on.

Step 1: Build the initial vocabulary from a corpus

from collections import defaultdict, Counter


def build_vocab(text: str) -> dict[str, int]:
    """
    Split text into words, then into characters.
    Returns word frequencies as {' '.join(chars): count}.
    """
    word_freq = Counter(text.lower().split())
    vocab = {}
    for word, freq in word_freq.items():
        # Represent each word as space-separated characters + end-of-word marker
        key = ' '.join(list(word)) + ' </w>'
        vocab[key] = freq
    return vocab


corpus = """
the quick brown fox jumps over the lazy dog
the dog barked at the fox
the fox ran quickly away
"""

vocab = build_vocab(corpus)
print("Initial vocabulary (first 5 entries):")
for word, freq in list(vocab.items())[:5]:
    print(f"  {freq}x  {word!r}")

Step 2: Count adjacent symbol pairs

def get_pair_stats(vocab: dict[str, int]) -> Counter:
    """Count all adjacent symbol pairs across all word entries."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs


pairs = get_pair_stats(vocab)
print("Top 5 most frequent pairs:")
for pair, count in pairs.most_common(5):
    print(f"  {count}x  {pair}")

Step 3: Merge the most frequent pair

def merge_pair(pair: tuple[str, str], vocab: dict[str, int]) -> dict[str, int]:
    """Replace all occurrences of `pair` with the merged symbol."""
    new_vocab = {}
    bigram = ' '.join(pair)
    replacement = ''.join(pair)
    for word, freq in vocab.items():
        new_word = word.replace(bigram, replacement)
        new_vocab[new_word] = freq
    return new_vocab


# Merge the most frequent pair
best_pair = pairs.most_common(1)[0][0]
print(f"Merging: {best_pair}{''.join(best_pair)}")
vocab = merge_pair(best_pair, vocab)

Step 4: Train BPE for N merges

def train_bpe(text: str, num_merges: int) -> tuple[dict, list]:
    """Train BPE and return vocabulary + merge rules."""
    vocab = build_vocab(text)
    merges = []

    for i in range(num_merges):
        pairs = get_pair_stats(vocab)
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        vocab = merge_pair(best, vocab)
        merges.append(best)
        if (i + 1) % 10 == 0:
            print(f"Merge {i+1:3d}: {best[0]!r} + {best[1]!r}{''.join(best)!r}")

    return vocab, merges


vocab, merges = train_bpe(corpus, num_merges=30)

Step 5: Encode a new sentence

def encode(text: str, merges: list) -> list[str]:
    """Apply learned merge rules to encode text into subword tokens."""
    tokens = []
    for word in text.lower().split():
        # Start with characters
        word_tokens = list(word) + ['</w>']
        # Apply merges in order
        for pair in merges:
            i = 0
            new_tokens = []
            while i < len(word_tokens):
                if (i < len(word_tokens) - 1 and
                        word_tokens[i] == pair[0] and
                        word_tokens[i + 1] == pair[1]):
                    new_tokens.append(''.join(pair))
                    i += 2
                else:
                    new_tokens.append(word_tokens[i])
                    i += 1
            word_tokens = new_tokens
        tokens.extend(word_tokens)
    return tokens


sentence = "the fox jumped"
tokens = encode(sentence, merges)
print(f"Input:  {sentence!r}")
print(f"Tokens: {tokens}")

Exercise 2 — Compare BERT and GPT-2 Tokenizers

Guided Exercise

Load both tokenizers and tokenize the same sentence to see how they differ.

from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")

sentence = "Transformers revolutionized natural language processing."

bert_ids = bert_tok.encode(sentence, add_special_tokens=True)
gpt2_ids = gpt2_tok.encode(sentence)

print("BERT tokens:")
print("  IDs:", bert_ids)
print("  Pieces:", bert_tok.convert_ids_to_tokens(bert_ids))

print("\nGPT-2 tokens:")
print("  IDs:", gpt2_ids)
print("  Pieces:", gpt2_tok.convert_ids_to_tokens(gpt2_ids))

Notice:

  • BERT uses [CLS] and [SEP] special tokens; GPT-2 does not.
  • BERT lowercases input (bert-base-**uncased**).
  • GPT-2 uses the Ġ prefix (Unicode 0x0120) to mark word-initial characters.
  • Rare words may be split differently — compare "revolutionized".
# Inspect batch encoding with offset mapping
encoding = bert_tok(sentence, return_offsets_mapping=True)
print("\nBERT offset mapping (token → character span):")
pieces = bert_tok.convert_ids_to_tokens(encoding["input_ids"])
for tok, span in zip(pieces, encoding["offset_mapping"]):
    print(f"  {tok:20s}  chars {span[0]:3d}{span[1]:3d}")

Exercise 3 — Inspect the tokenizer.json File

import json
from transformers.utils import cached_file

# Locate the tokenizer.json on disk
path = cached_file("gpt2", "tokenizer.json")
print(f"Tokenizer file: {path}\n")

with open(path) as f:
    data = json.load(f)

print("Top-level keys:", list(data.keys()))
print("\nModel type:", data["model"]["type"])
print("Vocab size:", len(data["model"]["vocab"]))
print("\nFirst 5 merge rules:")
for merge in data["model"]["merges"][:5]:
    print(f"  {merge!r}")

print("\nAdded tokens:")
for tok in data["added_tokens"]:
    print(f"  id={tok['id']:6d}  content={tok['content']!r}  special={tok['special']}")

Map each field to the BPE algorithm you implemented in Exercise 1:

tokenizer.json field Corresponds to
model.vocab Final symbol → ID mapping
model.merges Ordered merge rules
added_tokens Special tokens inserted outside BPE
post_processor Rules for adding [CLS]/[SEP] etc.

Summary

  • Tokenizers split text into subword units using learned merge rules (BPE) or likelihood maximization (WordPiece).
  • The vocabulary size controls the trade-off between sequence length and embedding table size; modern LLMs typically use 32 k–128 k tokens.
  • HuggingFace tokenizers serialize to tokenizer.json; every field maps directly to an algorithm step.
  • Special tokens ([CLS], [PAD], <|endoftext|>) are reserved IDs inserted by the post-processor.

← Chapter 0: PyTorch Basics Chapter 2: Embeddings →