Chapter 7: Decoder-Only Architecture — GPT and Autoregressive Generation
Learning Outcome
Implement a GPT-style decoder-only model, understand autoregressive text generation (greedy, sampling, beam search), and connect the architecture to modern LLMs like LLaMA.
Concepts
Decoder-Only Architecture
A decoder-only model uses only causal (unidirectional) self-attention: each token can attend only to itself and to earlier tokens. There is no encoder and no cross-attention.
Input:      The      quick    brown    fox
             ↓         ↓        ↓       ↓
        Stacked Causal Transformer Blocks
             ↓         ↓        ↓       ↓
            h_1       h_2      h_3     h_4
             ↓         ↓        ↓       ↓
LM Head:    quick    brown    fox      jumps   ← predicts next token for each position
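Because every position is trained to predict the token that follows it, the same network can be run repeatedly at inference time to extend a prompt, which makes this architecture a natural fit for generation tasks.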
Language Model Head
After the final transformer block, a linear projection maps each hidden state h_t (a vector of size d_model) to one logit per vocabulary entry: logits_t = W_lm · h_t, where W_lm has shape (vocab_size, d_model).
Often, lm_head.weight is tied to token_embed.weight (same matrix), reducing
parameter count by ~38M for a 50k vocabulary with d_model=768.
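The saving from weight tying is the size of one vocab_size × d_model matrix. A quick check of the arithmetic, assuming GPT-2's vocabulary of 50,257 tokens stands in for the "50k vocabulary" (the helper name `tied_params_saved` is illustrative, not part of the chapter's code):

```python
def tied_params_saved(vocab_size: int, d_model: int) -> int:
    """Parameters saved by sharing one matrix between embedding and LM head."""
    # An untied model stores two (vocab_size, d_model) matrices; tying keeps one.
    return vocab_size * d_model


saved = tied_params_saved(vocab_size=50257, d_model=768)
print(f"{saved:,} parameters saved (~{saved / 1e6:.1f}M)")
# prints: 38,597,376 parameters saved (~38.6M)
```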
Autoregressive Generation
At inference time, generate one token at a time:
- Encode the prompt: [t_1, t_2, ..., t_n] → logits over vocabulary.
- Select the next token from logits: t_{n+1}.
- Append to input: [t_1, ..., t_n, t_{n+1}].
- Repeat until EOS or max length.
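The four steps above can be sketched as a plain Python loop. The `toy_logits` function is a hypothetical stand-in for a real model's forward pass: it always favours the token ID after the last one, so the result is easy to predict. The vocabulary size and EOS ID are likewise assumptions for the example, and step 2 uses greedy selection.

```python
from typing import Callable, List

VOCAB_SIZE = 5
EOS_ID = 4  # assumed end-of-sequence token ID for this toy vocabulary


def toy_logits(tokens: List[int]) -> List[float]:
    """Hypothetical model: puts the highest logit on (last token + 1)."""
    target = min(tokens[-1] + 1, EOS_ID)
    return [1.0 if i == target else 0.0 for i in range(VOCAB_SIZE)]


def generate(logits_fn: Callable[[List[int]], List[float]],
             prompt: List[int], max_new_tokens: int) -> List[int]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = logits_fn(tokens)           # step 1: logits for the next position
        next_token = max(range(len(logits)), key=logits.__getitem__)  # step 2: greedy pick
        tokens.append(next_token)            # step 3: append to the input
        if next_token == EOS_ID:             # step 4: stop at EOS (or at max length)
            break
    return tokens


print(generate(toy_logits, prompt=[0], max_new_tokens=10))  # [0, 1, 2, 3, 4]
```

Note that the full, growing sequence is passed back to the model on every iteration; that repeated work is what a KV cache avoids.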
Decoding Strategies
Greedy decoding: Always pick the highest-probability token.
Temperature sampling: Scale logits before softmax to control randomness.
probs = torch.softmax(logits / temperature, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
- T < 1: more peaked distribution → less random.
- T > 1: flatter distribution → more random.
- T → 0: approaches greedy decoding (T = 0 itself would divide by zero, so implementations special-case it as argmax).
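A small numeric check makes the effect of temperature concrete. This is a pure-Python sketch with made-up logits; `softmax_with_temperature` is an illustrative helper, not a library function.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax of logits / temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # arbitrary example logits
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

For these logits, the top token's probability drops from about 0.87 at T = 0.5 to about 0.51 at T = 2, while the least likely token gains probability mass.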
Top-k sampling: Sample only from the k highest-probability tokens.
# Mask everything outside the top k: a logit of -inf gets zero probability after softmax
values, _ = logits.topk(k)
logits[logits < values[:, -1:]] = -float('inf')
probs = torch.softmax(logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
Top-p (nucleus) sampling: Sample from the smallest set of tokens whose cumulative probability exceeds p.
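One possible way to implement the nucleus filter in PyTorch is sketched below. The function name `top_p_filter` and the example probabilities are assumptions for illustration. A token is dropped when the probability mass of the tokens ranked above it already reaches p, which keeps the smallest set whose cumulative probability reaches or exceeds p.

```python
import torch


def top_p_filter(logits: torch.Tensor, p: float) -> torch.Tensor:
    """Return a copy of logits (batch, vocab) with tokens outside the nucleus set to -inf."""
    sorted_logits, sorted_idx = torch.sort(logits, dim=-1, descending=True)
    sorted_probs = torch.softmax(sorted_logits, dim=-1)
    cum_probs = sorted_probs.cumsum(dim=-1)
    # Mass accumulated *before* each token; the top token always has 0 and is kept.
    remove_sorted = (cum_probs - sorted_probs) >= p
    # Map the mask from sorted order back to vocabulary order.
    remove = torch.zeros_like(remove_sorted).scatter(-1, sorted_idx, remove_sorted)
    return logits.masked_fill(remove, float("-inf"))


# Example distribution over a 4-token vocabulary (made-up numbers)
logits = torch.log(torch.tensor([[0.05, 0.5, 0.15, 0.3]]))
filtered = top_p_filter(logits, p=0.7)
probs = torch.softmax(filtered, dim=-1)  # renormalized over the kept tokens (IDs 1 and 3)
next_token = torch.multinomial(probs, num_samples=1)
print(probs, next_token)
```

Unlike top-k, the number of surviving tokens adapts to the shape of the distribution: a confident model keeps few candidates, an uncertain one keeps many.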
Beam search: Maintain k candidate sequences, extend each, prune to top k.
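The extend-and-prune cycle can be sketched in a few lines of plain Python. Everything here is illustrative: `toy_model` is a hypothetical three-token model with hand-picked probabilities, chosen so that greedy decoding (a beam of width 1) misses the most probable sequence. Scores are summed log-probabilities, with no length normalization.

```python
import math
from typing import Callable, Dict, List, Tuple

Beam = Tuple[float, List[int]]  # (cumulative log-probability, token sequence)


def beam_search(next_log_probs: Callable[[List[int]], Dict[int, float]],
                prompt: List[int], beam_width: int,
                max_new_tokens: int, eos_id: int) -> List[Beam]:
    beams: List[Beam] = [(0.0, list(prompt))]
    for _ in range(max_new_tokens):
        candidates: List[Beam] = []
        for score, seq in beams:
            if seq[-1] == eos_id:
                candidates.append((score, seq))  # finished sequences carry over unchanged
                continue
            for token, log_p in next_log_probs(seq).items():
                candidates.append((score + log_p, seq + [token]))  # extend
        # Prune: keep only the beam_width highest-scoring candidates
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(seq[-1] == eos_id for _, seq in beams):
            break
    return beams


START, EOS = 3, 0  # assumed special token IDs for the toy vocabulary


def toy_model(seq: List[int]) -> Dict[int, float]:
    """Hypothetical next-token log-probabilities, keyed by token ID."""
    table = {
        START: {1: 0.6, 2: 0.4},
        1: {1: 0.5, 2: 0.5},
        2: {EOS: 0.9, 1: 0.1},
    }
    return {tok: math.log(p) for tok, p in table[seq[-1]].items()}


greedy = beam_search(toy_model, [START], beam_width=1, max_new_tokens=2, eos_id=EOS)[0]
best = beam_search(toy_model, [START], beam_width=2, max_new_tokens=2, eos_id=EOS)[0]
print(greedy[1], round(math.exp(greedy[0]), 2))  # [3, 1, 1] 0.3
print(best[1], round(math.exp(best[0]), 2))      # [3, 2, 0] 0.36
```

With a width of 1 the search commits to token 1 (probability 0.6) and ends with a sequence of probability 0.30; with a width of 2 it keeps the initially less likely token 2 alive and finds the sequence of probability 0.36.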
Exercise 1 — Implement a GPT-Style Model with Generation
Guided Exercise
Build a complete GPT model with greedy and temperature-based generate().
import torch
import torch.nn as nn
import torch.nn.functional as F

# NOTE: TransformerBlock is assumed to be defined already; this chapter does not define it.


class GPTModel(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        d_model: int,
        n_heads: int,
        n_layers: int,
        max_len: int = 1024,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.max_len = max_len  # stored so generate() can crop the context
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        self.drop = nn.Dropout(dropout)
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads, dropout=dropout,
                             causal=True, max_len=max_len)
            for _ in range(n_layers)
        ])
        self.norm_f = nn.LayerNorm(d_model)
        # Weight-tied LM head: shares the embedding matrix
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.token_embed.weight

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        batch, seq_len = input_ids.shape
        positions = torch.arange(seq_len, device=input_ids.device)
        x = self.drop(self.token_embed(input_ids) + self.pos_embed(positions))
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.norm_f(x))  # (batch, seq, vocab_size)

    @torch.no_grad()
    def generate(
        self,
        input_ids: torch.Tensor,
        max_new_tokens: int,
        temperature: float = 1.0,
        top_k: int | None = None,
    ) -> torch.Tensor:
        """Autoregressively generate tokens; temperature == 0 means greedy decoding."""
        for _ in range(max_new_tokens):
            # Crop to the last max_len tokens: pos_embed has no entries beyond that
            context = input_ids[:, -self.max_len:]
            logits = self(context)[:, -1, :]  # (batch, vocab_size), last position only
            if temperature == 0.0:
                # Greedy: take the argmax of the raw logits, no sampling
                next_token = logits.argmax(dim=-1, keepdim=True)
            else:
                logits = logits / temperature
                if top_k is not None:
                    # Keep the k largest logits; mask the rest to -inf
                    v, _ = logits.topk(min(top_k, logits.size(-1)))
                    logits[logits < v[:, [-1]]] = float("-inf")
                probs = F.softmax(logits, dim=-1)
                next_token = torch.multinomial(probs, num_samples=1)
            input_ids = torch.cat([input_ids, next_token], dim=1)
        return input_ids


# Test generation with a small, randomly initialized model
small_gpt = GPTModel(vocab_size=50257, d_model=256, n_heads=8, n_layers=4)
small_gpt.eval()  # disable dropout during generation
ids = torch.tensor([[464, 2068, 318]])  # arbitrary GPT-2 token IDs used as a prompt
generated = small_gpt.generate(ids, max_new_tokens=10, temperature=1.0, top_k=50)
print("Generated token IDs:", generated[0].tolist())
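GPTModel relies on a TransformerBlock class that this chapter does not define. If one is not at hand, the following hypothetical stand-in matches the constructor arguments used in the exercise (`d_model`, `n_heads`, `dropout`, `causal`, `max_len`). It is a minimal pre-norm sketch built on `nn.MultiheadAttention`, not necessarily the block the exercise was written against, and it must be defined before GPTModel is instantiated.

```python
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    """Hypothetical pre-norm block: causal self-attention followed by an MLP."""

    def __init__(self, d_model: int, n_heads: int, dropout: float = 0.1,
                 causal: bool = True, max_len: int = 1024):
        super().__init__()
        self.causal = causal
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
            nn.Dropout(dropout),
        )
        # True above the diagonal marks future positions that may NOT be attended to
        mask = torch.triu(torch.ones(max_len, max_len, dtype=torch.bool), diagonal=1)
        self.register_buffer("causal_mask", mask, persistent=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        mask = self.causal_mask[:seq_len, :seq_len] if self.causal else None
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out                    # residual connection around attention
        return x + self.mlp(self.norm2(x))  # residual connection around the MLP


# Causality check: changing the last token must not change earlier outputs
block = TransformerBlock(d_model=16, n_heads=4, max_len=8).eval()
x = torch.randn(1, 5, 16)
x_changed = x.clone()
x_changed[0, 4] += 1.0
with torch.no_grad():
    y, y_changed = block(x), block(x_changed)
print(torch.allclose(y[0, :4], y_changed[0, :4], atol=1e-6))  # True
```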
Exercise 2 — Load GPT-2 Weights and Verify Generation
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
hf_gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
hf_gpt2.eval()

prompt = "The meaning of life is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# HuggingFace greedy generation
torch.manual_seed(42)
hf_output = hf_gpt2.generate(
    input_ids,
    max_new_tokens=20,
    do_sample=False,  # greedy
    pad_token_id=tokenizer.eos_token_id,
)
hf_text = tokenizer.decode(hf_output[0])
print("HuggingFace greedy output:")
print(f" {hf_text!r}")

# HuggingFace sampling
torch.manual_seed(42)
hf_sample = hf_gpt2.generate(
    input_ids,
    max_new_tokens=20,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)
print("\nHuggingFace sampling (T=0.8, top_k=50):")
print(f" {tokenizer.decode(hf_sample[0])!r}")
Exercise 3 — Profile the Generation Loop
import time
import torch

# Reuses hf_gpt2 and tokenizer from Exercise 2
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = hf_gpt2.to(device)
model.eval()


def benchmark_generation(prompt: str, n_tokens: int) -> float:
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
    if device.type == "cuda":
        torch.cuda.synchronize()  # finish pending GPU work before starting the timer
    start = time.perf_counter()
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_new_tokens=n_tokens,
            do_sample=False,
            use_cache=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for generation to finish before stopping the timer
    elapsed = time.perf_counter() - start
    # Count the tokens actually produced: generation can stop early at EOS
    n_generated = output.shape[1] - input_ids.shape[1]
    return n_generated / elapsed  # tokens/second


prompt = "In a world where transformers rule"
print(f"Device: {device}")
print(f"\n{'Tokens':>8} {'Tokens/sec':>12}")
print("-" * 25)
for n in [10, 50, 100, 200]:
    tps = benchmark_generation(prompt, n)
    print(f"{n:>8} {tps:>12.1f}")
At small batch sizes, GPT-2 generation is memory-bandwidth bound — the GPU's compute units sit mostly idle, waiting for weight matrices to be read from HBM. Quantization and Flash Attention (Chapter 12) address this.
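A back-of-the-envelope bound shows why weight reads dominate: with a batch of one, every generated token requires reading all model weights once, so memory bandwidth divided by model size caps the tokens per second. The numbers below are assumptions for illustration only: roughly 124M parameters for the `gpt2` checkpoint and a hypothetical GPU with 1 TB/s of memory bandwidth. The bound ignores KV-cache reads, activations, and kernel launch overhead, so measured throughput will be lower.

```python
def max_tokens_per_second(n_params: float, bytes_per_param: int,
                          bandwidth_bytes_per_s: float) -> float:
    """Upper bound on batch-1 decoding speed if weight reads were the only cost."""
    bytes_per_token = n_params * bytes_per_param  # all weights are read once per token
    return bandwidth_bytes_per_s / bytes_per_token


N_PARAMS = 124e6   # assumed: approximate parameter count of the "gpt2" checkpoint
BANDWIDTH = 1e12   # assumed: hypothetical 1 TB/s of GPU memory bandwidth

for name, nbytes in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    bound = max_tokens_per_second(N_PARAMS, nbytes, BANDWIDTH)
    print(f"{name}: at most ~{bound:,.0f} tokens/s")
```

Halving the bytes per parameter doubles the bound, which is the intuition behind quantization as a remedy.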
Summary
- Decoder-only models use causal self-attention and no encoder.
- The language model head projects the final hidden state to vocabulary logits.
- Weight tying between token_embed and lm_head saves ~38M parameters (for a 50k vocabulary with d_model=768).
- Autoregressive generation appends one token at a time; efficiency depends heavily on the KV cache (Chapter 9).
- Temperature, top-k, and top-p sampling trade off diversity vs. coherence.