Chapter 7: Decoder-Only Architecture — GPT and Autoregressive Generation

Learning Outcome

Implement a GPT-style decoder-only model, understand autoregressive text generation (greedy, sampling, beam search), and connect the architecture to modern LLMs like LLaMA.

Concepts

Decoder-Only Architecture

A decoder-only model uses only causal (unidirectional) self-attention — each token can only attend to itself and previous tokens. There is no encoder, no cross-attention.
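
The causal constraint can be illustrated with a lower-triangular mask applied to the attention scores before the softmax. This is a minimal sketch, not the chapter's TransformerBlock: the random `scores` tensor is a stand-in for the real query-key products, and `seq_len = 4` is an arbitrary toy size.

```python
import torch

seq_len = 4
# Lower-triangular boolean mask: True where attention is allowed
# (row i may look at columns 0..i, i.e. itself and earlier tokens).
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

scores = torch.randn(seq_len, seq_len)             # stand-in for Q @ K.T / sqrt(d_k)
scores = scores.masked_fill(~mask, float("-inf"))  # block attention to future positions
weights = torch.softmax(scores, dim=-1)            # each row sums to 1 over allowed positions

print(mask.int())
print(weights)
```

Every entry above the diagonal of `weights` is exactly zero, so no position receives information from tokens that come after it.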

Input:  The  quick  brown  fox
         ↓     ↓      ↓     ↓
        Stacked Causal Transformer Blocks
         ↓     ↓      ↓     ↓
        h_1   h_2    h_3   h_4
         ↓     ↓      ↓     ↓
LM Head: quick brown  fox  jumps   ← predicts next token for each position

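Because every position is trained to predict its successor, the same model that scores a sequence can also extend it one token at a time, which is why this architecture excels at generation tasks.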
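
The diagram's alignment, where position i predicts token i+1, is commonly expressed by shifting logits and targets by one position before computing cross-entropy. The sketch below uses random tensors as a stand-in for real model output; the toy sizes and variable names are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, vocab_size = 2, 5, 11             # toy sizes, chosen arbitrarily
input_ids = torch.randint(0, vocab_size, (batch, seq_len))
logits = torch.randn(batch, seq_len, vocab_size)  # stand-in for model(input_ids)

# Position i predicts token i+1, so drop the last logit (nothing to predict)
# and the first target (nothing predicts it).
shift_logits = logits[:, :-1, :]
shift_targets = input_ids[:, 1:]

loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size),  # (batch * (seq_len - 1), vocab_size)
    shift_targets.reshape(-1),             # (batch * (seq_len - 1),)
)
print(loss.item())
```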

Language Model Head

After the final transformer block, a linear projection maps the hidden state to vocabulary logits:

lm_head = nn.Linear(d_model, vocab_size, bias=False)

Often, lm_head.weight is tied to token_embed.weight (same matrix), which saves vocab_size × d_model parameters: about 38.6M for GPT-2's 50,257-token vocabulary with d_model=768.
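
The saving from weight tying is simply the size of one vocabulary-by-hidden matrix. A quick check of the arithmetic, using GPT-2 small's vocabulary size and hidden width:

```python
vocab_size = 50257   # GPT-2 vocabulary size
d_model = 768        # GPT-2 small hidden size

embedding_params = vocab_size * d_model   # one (vocab_size, d_model) matrix
untied_total = 2 * embedding_params       # separate token_embed and lm_head
tied_total = embedding_params             # a single shared matrix
saved = untied_total - tied_total

print(f"parameters saved by tying: {saved:,}")  # 38,597,376
```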

Autoregressive Generation

At inference time, generate one token at a time:

1. Run the model on the prompt [t_1, t_2, ..., t_n] and take the logits at the last position.
2. Select the next token t_{n+1} from those logits.
3. Append it to the input: [t_1, ..., t_n, t_{n+1}].
4. Repeat from step 1 until EOS or the maximum length is reached.

Decoding Strategies

Greedy decoding: Always pick the highest-probability token.

next_token = logits.argmax(dim=-1)  # deterministic

Temperature sampling: Scale logits before softmax to control randomness.

probs = torch.softmax(logits / temperature, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)

- T < 1: more peaked distribution → less random.
- T > 1: flatter distribution → more random.
- T → 0: approaches greedy decoding. T = 0 itself would divide by zero, so implementations special-case it as argmax.
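
To make the effect of temperature concrete, the sketch below applies three temperatures to the same toy logits (the values `[2.0, 1.0, 0.0]` are arbitrary) and prints the resulting distributions.

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.0])   # toy logits for a 3-token vocabulary
top_prob = {}
for temperature in (0.5, 1.0, 2.0):
    probs = torch.softmax(logits / temperature, dim=-1)
    top_prob[temperature] = probs[0].item()   # probability of the most likely token
    print(f"T={temperature}: {[round(p, 3) for p in probs.tolist()]}")
```

The probability of the most likely token falls from roughly 0.87 at T=0.5 to roughly 0.51 at T=2, while the ranking of the tokens never changes.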

Top-k sampling: Sample only from the k highest-probability tokens.

# Mask everything outside the top k to -inf (zero probability after softmax)
values, _ = logits.topk(k)
logits[logits < values[:, -1:]] = -float('inf')
probs = torch.softmax(logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)

Top-p (nucleus) sampling: Sample from the smallest set of tokens whose cumulative probability exceeds p.
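
One common way to implement nucleus filtering is to sort the logits, accumulate probabilities, and mask every token that lies outside the smallest prefix reaching p. The helper name `top_p_filter` is hypothetical, and this is a sketch rather than any particular library's implementation.

```python
import torch

def top_p_filter(logits: torch.Tensor, p: float) -> torch.Tensor:
    """Return a copy of logits with tokens outside the nucleus set to -inf."""
    sorted_logits, sorted_idx = logits.sort(dim=-1, descending=True)
    sorted_probs = torch.softmax(sorted_logits, dim=-1)
    cum_probs = sorted_probs.cumsum(dim=-1)
    # Drop a token if the mass *before* it already reaches p. The token that
    # crosses the threshold is kept, and the top token is never removed.
    remove = (cum_probs - sorted_probs) >= p
    sorted_logits = sorted_logits.masked_fill(remove, float("-inf"))
    # Undo the sort so the logits line up with vocabulary indices again.
    return torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)

# Toy distribution over 4 tokens: probabilities 0.05, 0.5, 0.15, 0.3
logits = torch.log(torch.tensor([[0.05, 0.5, 0.15, 0.3]]))
filtered = top_p_filter(logits, p=0.7)
probs = torch.softmax(filtered, dim=-1)   # renormalized over the nucleus
next_token = torch.multinomial(probs, num_samples=1)
print(probs)
```

With p = 0.7, the nucleus is tokens 1 and 3 (0.5 + 0.3 = 0.8), so sampling only ever returns one of those two indices.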

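Beam search: Maintain k candidate sequences (beams). At each step, extend every beam with its possible next tokens, score the extensions by cumulative log-probability, and keep only the top k.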
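
The extend-and-prune loop can be sketched in plain Python. The transition table below is a hypothetical toy "model" whose next-token distribution depends only on the last token; it is chosen so that greedy decoding and beam search disagree.

```python
import math

def beam_search(next_log_probs, start, beam_width, steps):
    """Keep the beam_width highest-scoring sequences after each extension step."""
    beams = [(0.0, [start])]                      # (cumulative log-prob, tokens)
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for token, logp in next_log_probs(seq).items():
                candidates.append((score + logp, seq + [token]))
        # Prune: keep only the best beam_width candidates.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams

# Hypothetical toy model: P(next token | last token)
TABLE = {
    "a": {"a": 0.1, "b": 0.5, "c": 0.4},
    "b": {"a": 0.4, "b": 0.3, "c": 0.3},
    "c": {"a": 0.05, "b": 0.05, "c": 0.9},
}

def toy_log_probs(seq):
    return {tok: math.log(p) for tok, p in TABLE[seq[-1]].items()}

greedy = beam_search(toy_log_probs, "a", beam_width=1, steps=2)[0]
beam = beam_search(toy_log_probs, "a", beam_width=2, steps=2)[0]
print("greedy:", greedy[1], round(math.exp(greedy[0]), 2))
print("beam:  ", beam[1], round(math.exp(beam[0]), 2))
```

With a beam width of 1 the search reduces to greedy decoding and commits to "b" (probability 0.5) at the first step, ending with a sequence of probability 0.2. A width of 2 keeps "c" alive and finds the sequence a, c, c with probability 0.36.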

Exercise 1 — Implement a GPT-Style Model with Generation

Guided Exercise

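Build a complete GPT model whose generate() method supports greedy decoding, temperature sampling, and top-k filtering. The code relies on a TransformerBlock class with a causal option, which is not defined in this chapter and is assumed to be in scope.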

import torch
import torch.nn as nn
import torch.nn.functional as F


class GPTModel(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        d_model: int,
        n_heads: int,
        n_layers: int,
        max_len: int = 1024,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        self.drop = nn.Dropout(dropout)
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads, dropout=dropout,
                             causal=True, max_len=max_len)
            for _ in range(n_layers)
        ])
        self.norm_f = nn.LayerNorm(d_model)
        # Weight-tied LM head: shares embedding matrix
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.token_embed.weight

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        batch, seq_len = input_ids.shape
        positions = torch.arange(seq_len, device=input_ids.device)
        x = self.drop(self.token_embed(input_ids) + self.pos_embed(positions))
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.norm_f(x))  # (batch, seq, vocab_size)

    @torch.no_grad()
    def generate(
        self,
        input_ids: torch.Tensor,
        max_new_tokens: int,
        temperature: float = 1.0,
        top_k: int | None = None,
    ) -> torch.Tensor:
        """Autoregressively generate tokens."""
        for _ in range(max_new_tokens):
            # Crop to the last max_len tokens (the size of the position table)
            context = input_ids[:, -self.pos_embed.num_embeddings:]
            logits = self(context)[:, -1, :]  # (batch, vocab_size), last position only

            if temperature == 0.0:
                # Greedy decoding: no scaling or sampling needed
                next_token = logits.argmax(dim=-1, keepdim=True)
            else:
                # Apply temperature
                logits = logits / temperature

                # Keep only the top_k largest logits; mask the rest to -inf
                if top_k is not None:
                    v, _ = logits.topk(min(top_k, logits.size(-1)))
                    logits[logits < v[:, [-1]]] = float('-inf')

                probs = F.softmax(logits, dim=-1)
                next_token = torch.multinomial(probs, num_samples=1)

            input_ids = torch.cat([input_ids, next_token], dim=1)

        return input_ids


# Test generation with a small, untrained model (the output tokens will be random)
small_gpt = GPTModel(vocab_size=50257, d_model=256, n_heads=8, n_layers=4)
small_gpt.eval()  # disable dropout during generation
ids = torch.tensor([[464, 2068, 318]])  # three GPT-2 token IDs used as a dummy prompt
generated = small_gpt.generate(ids, max_new_tokens=10, temperature=1.0, top_k=50)
print("Generated token IDs:", generated[0].tolist())

Exercise 2 — Load GPT-2 Weights and Verify Generation

from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
hf_gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
hf_gpt2.eval()

prompt = "The meaning of life is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# HuggingFace greedy generation
torch.manual_seed(42)
hf_output = hf_gpt2.generate(
    input_ids,
    max_new_tokens=20,
    do_sample=False,       # greedy
    pad_token_id=tokenizer.eos_token_id,
)
hf_text = tokenizer.decode(hf_output[0])
print("HuggingFace greedy output:")
print(f"  {hf_text!r}")

# HuggingFace sampling
torch.manual_seed(42)
hf_sample = hf_gpt2.generate(
    input_ids,
    max_new_tokens=20,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)
print("\nHuggingFace sampling (T=0.8, top_k=50):")
print(f"  {tokenizer.decode(hf_sample[0])!r}")

Exercise 3 — Profile the Generation Loop

import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = hf_gpt2.to(device)
model.eval()

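def benchmark_generation(prompt: str, n_tokens: int) -> float:
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
    if device.type == "cuda":
        torch.cuda.synchronize()  # finish pending GPU work before timing
    start = time.perf_counter()
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_new_tokens=n_tokens,
            do_sample=False,
            use_cache=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for the GPU before stopping the clock
    elapsed = time.perf_counter() - start
    # Count the tokens actually produced; generation can stop early at EOS
    n_generated = output.shape[1] - input_ids.shape[1]
    return n_generated / elapsed  # tokens/second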

prompt = "In a world where transformers rule"
print(f"Device: {device}")
print(f"\n{'Tokens':>8}  {'Tokens/sec':>12}")
print("-" * 25)
benchmark_generation(prompt, 5)  # warm-up run; result discarded
for n in [10, 50, 100, 200]:
    tps = benchmark_generation(prompt, n)
    print(f"{n:>8}  {tps:>12.1f}")

At small batch sizes, GPT-2 generation is memory-bandwidth bound: every decoding step must read all of the model's weights from GPU memory (HBM) to produce just one token per sequence, so the compute units sit mostly idle while they wait for data. Quantization and Flash Attention (Chapter 12) address this.
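
A back-of-envelope calculation shows why weight reads dominate. The numbers below are assumptions for illustration: roughly 124M parameters for GPT-2 small stored as float32, and a hypothetical GPU with 1 TB/s of memory bandwidth. The estimate ignores KV-cache reads, activations, and compute time, so it is only a loose upper bound.

```python
# Rough upper bound on decode speed when weight reads dominate.
n_params = 124e6               # approximate parameter count of GPT-2 small
bytes_per_param = 4            # ASSUMPTION: weights held in float32
bandwidth_bytes_per_s = 1e12   # ASSUMPTION: hypothetical GPU with 1 TB/s bandwidth

bytes_per_step = n_params * bytes_per_param            # weights read once per step
max_steps_per_s = bandwidth_bytes_per_s / bytes_per_step

for batch_size in (1, 8, 64):
    # Each step yields one token per sequence, so the same weight traffic
    # is amortized over the whole batch.
    print(f"batch={batch_size:>3}: <= {max_steps_per_s * batch_size:,.0f} tokens/s")
```

Under these assumptions a single sequence is capped at about 2,000 tokens per second no matter how fast the arithmetic units are. Larger batches raise the ceiling only until compute becomes the limiting factor instead.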

Summary

- Decoder-only models use causal self-attention and no encoder.
- The language model head projects the final hidden state to vocabulary logits.
- Weight tying between token_embed and lm_head saves vocab_size × d_model parameters (about 38.6M for GPT-2 small).
- Autoregressive generation appends one token at a time; efficiency depends heavily on the KV cache (Chapter 9).
- Temperature, top-k, and top-p sampling trade off diversity vs. coherence.
