Chapter 2: Embeddings — Turning Tokens into Vectors

Learning Outcome

Implement token embeddings and positional encodings in PyTorch, understand why position information must be injected explicitly, and survey the major positional encoding schemes used in modern models.

Concepts

`nn.Embedding`: A Trainable Lookup Table

nn.Embedding(num_embeddings, embedding_dim) wraps a trainable weight matrix of shape (num_embeddings, embedding_dim), which for a language model is (vocab_size, d_model). Each row is the embedding vector for one token ID.

import torch
import torch.nn as nn

vocab_size = 50257  # GPT-2
d_model = 768

embed = nn.Embedding(vocab_size, d_model)

# Forward: index the rows by token IDs
token_ids = torch.tensor([[15496, 11, 995]])  # shape (1, 3)
x = embed(token_ids)  # shape (1, 3, 768)
print(x.shape)  # torch.Size([1, 3, 768])

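During training, gradients flow back through the indexing operation, and only the rows for tokens that appeared in the batch receive a nonzero gradient. With plain SGD those are the only rows updated; weight decay or optimizer momentum can still change other rows.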
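A minimal sketch of both points, using a deliberately tiny table (the sizes and token IDs are illustrative, not taken from GPT-2): the forward pass is equivalent to indexing the weight matrix, and after a backward pass only the rows that were looked up carry a nonzero gradient.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny illustrative table: 10 tokens, 4 dimensions (assumed sizes)
embed = nn.Embedding(10, 4)
token_ids = torch.tensor([[2, 5, 2]])  # token 2 appears twice

# The forward pass is plain row indexing into the weight matrix
x = embed(token_ids)
same_as_indexing = torch.equal(x, embed.weight[token_ids])

# Backpropagate a dummy loss and find which rows received gradient
x.sum().backward()
rows_with_grad = embed.weight.grad.abs().sum(dim=1).nonzero().flatten().tolist()

print(same_as_indexing)  # True
print(rows_with_grad)    # [2, 5]
```

Row 2 accumulates gradient from both of its occurrences; every other row of the gradient stays at zero.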

Why Positional Encodings Are Necessary

Self-attention computes pairwise dot products between all tokens. Without position information the operation is permutation-equivariant: shuffling the input tokens shuffles the output vectors in the same way and changes nothing else, so the model cannot tell one ordering from another. We must explicitly inject order information.
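A small check of how self-attention treats token order when no positional signal is present. This is a sketch with a hand-written single-head attention function and random weights (the function name, sizes, and permutation are illustrative assumptions), not a full Transformer layer.

```python
import math
import torch

torch.manual_seed(0)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention, no positional signal."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v

seq_len, d_model = 5, 8  # illustrative sizes
x = torch.randn(seq_len, d_model, dtype=torch.float64)
w_q, w_k, w_v = (torch.randn(d_model, d_model, dtype=torch.float64) for _ in range(3))

perm = torch.tensor([3, 0, 4, 1, 2])  # an arbitrary reordering of the tokens
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)

# Shuffling the input only reorders the output rows in the same way
rows_just_reordered = torch.allclose(out_shuffled, out[perm])
print(rows_just_reordered)  # True
```

Each token ends up with the same output vector regardless of where it sits in the sequence, which is why order has to be supplied separately.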

Sinusoidal Positional Encodings (Original Paper)

Vaswani et al. (2017) used fixed sinusoidal functions:

\[ PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \]
\[ PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \]

where pos is the token position and i indexes the dimension pairs (0 ≤ i < d_model / 2): dimension 2i holds the sine and dimension 2i+1 holds the cosine.

Properties:

No learnable parameters.
Each position has a unique pattern.
The encoding is defined for any position, so in principle the model can be run on sequences longer than those seen during training.

Learned Positional Embeddings

BERT and GPT-2 use a second nn.Embedding(max_seq_len, d_model) whose rows are trained from data. This is simpler to implement, but the table has no rows for positions at or beyond max_seq_len, so the model cannot extrapolate past the trained length.
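A minimal sketch of a learned positional embedding and of its length limit. The class name LearnedPositionalEmbedding and the sizes are hypothetical, chosen for illustration; this is not the BERT or GPT-2 implementation.

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):  # hypothetical name
    def __init__(self, max_seq_len: int, d_model: int):
        super().__init__()
        self.pos_embed = nn.Embedding(max_seq_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) token embeddings
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.pos_embed(positions)  # broadcasts over the batch

layer = LearnedPositionalEmbedding(max_seq_len=16, d_model=8)
out = layer(torch.zeros(2, 16, 8))
print(out.shape)  # torch.Size([2, 16, 8])

# One position past the table has no row to look up.
# The exact exception type may vary by device and PyTorch version.
try:
    layer(torch.zeros(2, 17, 8))
    too_long_fails = False
except (IndexError, RuntimeError):
    too_long_fails = True
print(too_long_fails)  # True
```

The failure on the 17-token input is the practical form of "cannot extrapolate": there is simply no trained vector for position 16.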

Rotary Position Embeddings (RoPE)

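RoPE is used by LLaMA, Mistral, and many other recent LLMs. Instead of adding a positional vector to the input, RoPE rotates the query and key vectors by a position-dependent angle before computing attention. Treating each pair of dimensions as a complex number, with \(q_m\) the rotated query at position \(m\) and \(k_n\) the rotated key at position \(n\) (here \(i\) is the imaginary unit, not a dimension index):

\[ q_m \cdot k_n = \text{Re}\left[ (W_q x_m)\, \overline{(W_k x_n)}\, e^{i(m-n)\theta} \right] \]

Position enters the dot product only through the relative offset \(m - n\), never through the absolute positions themselves. This property is generally credited with better length generalization than absolute position embeddings.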
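The relative-position property can be checked by hand on a single two-dimensional slice, written as a complex number. The values of theta, q, and k below are arbitrary illustrative choices, and rope_score and closed_form are hypothetical helper names.

```python
import cmath

theta = 0.3              # illustrative rotation frequency (assumed)
q = complex(1.0, 2.0)    # stands in for one 2-D slice of W_q x_m
k = complex(0.5, -1.5)   # stands in for one 2-D slice of W_k x_n

def rope_score(m: int, n: int) -> float:
    """Inner product of q rotated to position m and k rotated to position n."""
    q_m = q * cmath.exp(1j * m * theta)
    k_n = k * cmath.exp(1j * n * theta)
    # For 2-D vectors written as complex numbers, <a, b> = Re(a * conj(b))
    return (q_m * k_n.conjugate()).real

def closed_form(m: int, n: int) -> float:
    """Re[q * conj(k) * e^{i(m-n)theta}], which uses only the offset m - n."""
    return (q * k.conjugate() * cmath.exp(1j * (m - n) * theta)).real

print(abs(rope_score(5, 8) - rope_score(10, 13)) < 1e-9)  # True: same offset
print(abs(rope_score(5, 8) - closed_form(5, 8)) < 1e-9)   # True
print(abs(rope_score(5, 8) - rope_score(5, 9)) > 1e-3)    # True: different offset
```

A full RoPE layer applies the same rotation to every pair of dimensions, each with its own frequency.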

Embedding Scaling

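In the original Transformer, the embedding output is multiplied by sqrt(d_model) before the positional encodings are added. The paper gives no explicit reason; a common explanation is that, when the embedding weights are initialized with small variance, the scaling keeps the token embeddings comparable in magnitude to the positional encodings as d_model grows. Without it, the token information could be drowned out by the positional signal.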
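A rough numerical look at the magnitudes involved. This sketch assumes token embeddings initialised with standard deviation d_model ** -0.5, which is an assumption made for illustration (nn.Embedding's default initialisation uses standard deviation 1 and gives different numbers). The sizes are also illustrative.

```python
import math
import torch

torch.manual_seed(0)
d_model, vocab_size, max_len = 512, 1000, 50  # illustrative sizes

# ASSUMPTION: token embeddings initialised with std = d_model ** -0.5.
# (nn.Embedding's default is std = 1, which gives different magnitudes.)
token_emb = torch.randn(vocab_size, d_model) * d_model ** -0.5

# Sinusoidal encodings built from the formula. Sines and cosines are
# concatenated rather than interleaved; the row norms are identical.
pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
inv_freq = torch.exp(
    torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model)
)
pe = torch.cat([torch.sin(pos * inv_freq), torch.cos(pos * inv_freq)], dim=1)

raw_norm = token_emb.norm(dim=1).mean().item()
scaled_norm = (token_emb * math.sqrt(d_model)).norm(dim=1).mean().item()
pe_norm = pe.norm(dim=1).mean().item()

print(f"token (unscaled): {raw_norm:.2f}")
print(f"token (scaled):   {scaled_norm:.2f}")
print(f"positional:       {pe_norm:.2f}")
```

Under that assumption the unscaled token vectors have norm close to 1, while every positional row has norm sqrt(d_model / 2) = 16, because each sine/cosine pair contributes exactly 1 to the squared norm. After scaling, the token vectors have norm around sqrt(512) ≈ 22.6, the same order of magnitude as the positional rows.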

Exercise 1 — Sinusoidal Positional Encoding

Guided Exercise

Implement the encoding and visualize it as a heatmap.

Step 1: Implement the encoding

import torch
import math
import matplotlib.pyplot as plt


def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """
    Returns a tensor of shape (max_len, d_model) with sinusoidal encodings.
    """
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)

    # Inverse division terms 1 / 10000^(2i/d_model), computed in log space
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model)
    )  # shape: (ceil(d_model / 2),)

    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    # Slice so that an odd d_model (one fewer cosine column) also works
    pe[:, 1::2] = torch.cos(position * div_term[: d_model // 2])  # odd dimensions

    return pe


pe = sinusoidal_positional_encoding(max_len=100, d_model=64)
print("PE shape:", pe.shape)  # (100, 64)

Step 2: Visualize the encoding matrix

plt.figure(figsize=(12, 5))
plt.pcolormesh(pe.numpy().T, cmap='RdBu')
plt.xlabel('Position')
plt.ylabel('Dimension')
plt.colorbar()
plt.title('Sinusoidal Positional Encoding')
plt.tight_layout()
plt.savefig('sinusoidal_pe.png', dpi=100)
plt.show()

Notice how:

Low-frequency oscillations (large i) encode coarse position.
High-frequency oscillations (small i) encode fine-grained position.
The pattern is unique for every position.
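To put numbers on "coarse" and "fine-grained", the wavelength of each dimension pair follows directly from the formula: pair i completes one cycle every 2π · 10000^(2i/d_model) positions. The helper name wavelength is hypothetical, and the sampled values of i are arbitrary.

```python
import math

d_model = 64  # same size as the heatmap example

def wavelength(i: int) -> float:
    """Positions per full cycle for the dimension pair (2i, 2i+1)."""
    return 2 * math.pi * 10000 ** (2 * i / d_model)

for i in (0, 8, 16, 31):
    print(f"i={i:2d}  wavelength ~ {wavelength(i):10.1f} positions")
```

The first pair repeats about every 6.3 positions, so it separates neighbouring tokens; pair i = 16 repeats about every 628 positions; the last pair barely moves across the 100 positions plotted in the heatmap.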

Step 3: Build a combined embedding module

import torch.nn as nn  # needed if this step is run without the earlier snippets


class TokenAndPositionEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, max_len: int, dropout: float = 0.1):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.d_model = d_model

        # Register as buffer (not a parameter — not trained)
        pe = sinusoidal_positional_encoding(max_len, d_model)
        self.register_buffer('pe', pe)

        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, seq_len) token IDs
        seq_len = x.size(1)
        # Scale token embeddings, then add positional encoding
        token_emb = self.token_embed(x) * math.sqrt(self.d_model)
        return self.dropout(token_emb + self.pe[:seq_len])


# Test
vocab_size, d_model = 1000, 64
model = TokenAndPositionEmbedding(vocab_size, d_model, max_len=512)
x = torch.randint(0, vocab_size, (2, 20))  # batch=2, seq_len=20
out = model(x)
print("Output shape:", out.shape)  # (2, 20, 64)

Exercise 2 — Inspect HuggingFace Embedding Layers

from transformers import BertModel, GPT2Model

bert = BertModel.from_pretrained("bert-base-uncased")
gpt2 = GPT2Model.from_pretrained("gpt2")

# BERT embeddings
print("=== BERT ===")
print("Token embeddings:", bert.embeddings.word_embeddings.weight.shape)
print("Position embeddings:", bert.embeddings.position_embeddings.weight.shape)
print("Token type embeddings:", bert.embeddings.token_type_embeddings.weight.shape)

# GPT-2 embeddings
print("\n=== GPT-2 ===")
print("Token embeddings:", gpt2.wte.weight.shape)    # wte = word token embedding
print("Position embeddings:", gpt2.wpe.weight.shape) # wpe = word position embedding

Key differences:

BERT has token_type_embeddings (segment IDs for sentence pairs) — GPT-2 does not.
BERT positions are learned (position_embeddings) and go up to 512.
GPT-2 positions are also learned (wpe) and go up to 1024.
Neither BERT nor GPT-2 uses sinusoidal encodings.

# Compare embedding magnitudes
import torch

bert_emb = bert.embeddings.word_embeddings.weight.detach()
gpt2_emb = gpt2.wte.weight.detach()

print(f"\nBERT token embedding norms — mean: {bert_emb.norm(dim=1).mean():.3f}")
print(f"GPT-2 token embedding norms — mean: {gpt2_emb.norm(dim=1).mean():.3f}")

Exercise 3 — Implement RoPE

Rotary embeddings rotate query/key vectors rather than adding to them.

def precompute_freqs(d_head: int, max_len: int, base: float = 10000.0) -> tuple:
    """
    Precompute rotation frequencies for RoPE.
    Returns cos and sin tensors of shape (max_len, d_head).
    """
    # Frequencies: theta_i = 1 / base^(2i/d_head)
    theta = 1.0 / (base ** (torch.arange(0, d_head, 2, dtype=torch.float) / d_head))
    # Positions
    positions = torch.arange(max_len, dtype=torch.float)
    # Outer product: (max_len, d_head//2)
    freqs = torch.outer(positions, theta)
    # Duplicate to cover all dimensions: (max_len, d_head)
    freqs = torch.cat([freqs, freqs], dim=-1)
    return freqs.cos(), freqs.sin()


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Map (x1, x2) to (-x2, x1), where x1 and x2 are the two halves of the last dimension."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat([-x2, x1], dim=-1)


def apply_rope(q: torch.Tensor, k: torch.Tensor,
               cos: torch.Tensor, sin: torch.Tensor) -> tuple:
    """Apply RoPE to query and key tensors."""
    # q, k: (batch, n_heads, seq_len, d_head)
    # cos, sin: (seq_len, d_head)
    seq_len = q.size(2)
    cos = cos[:seq_len].unsqueeze(0).unsqueeze(0)  # (1, 1, seq_len, d_head)
    sin = sin[:seq_len].unsqueeze(0).unsqueeze(0)

    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot


# Verify: with the content held fixed, the score depends only on relative position
d_head, max_len = 64, 128
cos_freq, sin_freq = precompute_freqs(d_head, max_len)

batch, n_heads = 1, 1
# Use the same query vector and the same key vector at every position,
# so any difference between scores can only come from the rotation.
q = torch.randn(batch, n_heads, 1, d_head).expand(-1, -1, max_len, -1)
k = torch.randn(batch, n_heads, 1, d_head).expand(-1, -1, max_len, -1)

q_rot, k_rot = apply_rope(q, k, cos_freq, sin_freq)

# Dot product between position 5 and position 8 (relative distance = 3)
pos5_pos8 = (q_rot[0, 0, 5] * k_rot[0, 0, 8]).sum().item()

# Dot product between position 10 and position 13 (same relative distance = 3)
pos10_pos13 = (q_rot[0, 0, 10] * k_rot[0, 0, 13]).sum().item()

print(f"Dot product (pos 5→8):   {pos5_pos8:.6f}")
print(f"Dot product (pos 10→13): {pos10_pos13:.6f}")
# These match up to floating-point error. With different content vectors at
# each position the values would differ, but only because the content differs.

Summary

Token embeddings are a lookup table (nn.Embedding) trained end-to-end.
Transformers require explicit positional encodings because attention on its own is permutation-equivariant and cannot distinguish one token ordering from another.
Sinusoidal encodings are fixed; learned embeddings are the norm in BERT/GPT-2.
RoPE (used in LLaMA, Mistral) encodes relative positions by rotating Q/K vectors.
In the original Transformer, token embeddings are scaled by sqrt(d_model) to keep their magnitude comparable to the positional encodings.
