Transformer Models in PyTorch

From Internals to HuggingFace

A hands-on tutorial for practitioners who already know PyTorch and high-level transformer APIs (HuggingFace, vLLM) and want to understand how everything works under the hood.

Prerequisites

You are comfortable training neural networks in PyTorch.
You have used transformers, HuggingFace pipelines, or vLLM for inference.
You have GPU access for the exercises (required from chapter 7 onward).

What You Will Learn

By the end of this tutorial you will be able to:

Implement every component of a transformer from scratch in pure PyTorch — tokenizer, embeddings, attention, feed-forward, normalization, and full model stacks.
Match HuggingFace outputs by loading pretrained weights directly into your custom implementations and verifying that the results agree to within floating-point tolerance.
Understand the design decisions behind BERT, GPT-2, LLaMA, and T5 — not just that they work, but why specific choices (RoPE, RMSNorm, SwiGLU, causal masking, KV cache) were made.
Apply production techniques — LoRA, quantization, Flash Attention, KV caching — with a clear mental model of what each one does.
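The "implement from scratch, then verify against a reference" workflow described above can be sketched in a few lines. This is a minimal illustration, not the tutorial's own code: the helper name `naive_attention`, the tensor shapes, and the tolerance are assumptions, and the reference here is PyTorch's built-in `F.scaled_dot_product_attention` (PyTorch 2.0 or later) rather than a HuggingFace model, so that the example runs without downloading weights.

```python
import math

import torch
import torch.nn.functional as F


def naive_attention(q, k, v, causal=False):
    """Scaled dot-product attention written out step by step (hypothetical helper)."""
    d_k = q.size(-1)
    # Similarity scores between every query and every key, scaled by sqrt(d_k).
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if causal:
        # Lower-triangular mask: position i may only attend to positions <= i.
        q_len, k_len = scores.shape[-2:]
        allowed = torch.ones(q_len, k_len, dtype=torch.bool, device=q.device).tril()
        scores = scores.masked_fill(~allowed, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v


torch.manual_seed(0)
# Assumed shapes for illustration: (batch, heads, sequence length, head dimension).
q, k, v = (torch.randn(2, 4, 8, 16) for _ in range(3))

ours = naive_attention(q, k, v, causal=True)
reference = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Compare with a tolerance: two correct implementations rarely agree bit for bit.
max_abs_diff = (ours - reference).abs().max().item()
outputs_match = torch.allclose(ours, reference, atol=1e-5)
print(f"max abs diff: {max_abs_diff:.2e}, match: {outputs_match}")
```

The same pattern (fixed seed, identical inputs, `torch.allclose` with an explicit tolerance) applies when the reference is a pretrained model instead of a built-in function.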

Tutorial Structure

The tutorial consists of 13 chapters, numbered 0 through 12, that build on each other progressively. Each chapter includes:

Concepts — theory explained with code, diagrams, and worked examples
Practical Exercises — guided, instructor-led implementations you build step by step

Chapter	Topic
0	PyTorch Basics Refresher
1	Tokenization from Scratch
2	Embeddings — Turning Tokens into Vectors
3	Scaled Dot-Product Attention
4	Causal (Masked) Attention
5	The Transformer Block
6	Encoder-Only Architecture (BERT)
7	Decoder-Only Architecture (GPT)
8	Encoder-Decoder Architecture (T5)
9	The KV Cache
10	Training from Scratch
11	HuggingFace Internals
12	Efficient Training & Inference at Scale

How to Use This Tutorial

This is a guided tutorial — exercises are done together with an instructor, not solo. The code in each chapter is meant to be typed, run, and discussed as a group.

You will need:

A Python environment with PyTorch and HuggingFace transformers installed. See the Setup page for instructions.
A GPU (or Colab/Kaggle free tier) for chapters 7 onward.
Curiosity and a willingness to read error messages carefully.
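Before starting, it can help to confirm that the environment meets the requirements listed above. The following is a small sketch, not part of the Setup page: the function name `check_environment` and the report layout are assumptions made for illustration. It only checks whether the packages can be found and whether PyTorch reports a CUDA device.

```python
import importlib.util
import sys


def check_environment(packages=("torch", "transformers")):
    """Report which required packages are installed and whether a GPU is visible.

    Hypothetical helper for illustration; it does not install anything.
    """
    report = {"python": sys.version.split()[0]}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None

    report["cuda"] = False
    if report.get("torch"):
        import torch  # imported lazily so the check still runs without PyTorch

        report["cuda"] = torch.cuda.is_available()
    return report


report = check_environment()
for key, value in report.items():
    print(f"{key}: {value}")
```

If `cuda` is reported as `False`, the early chapters should still be workable on CPU, while the chapters that need a GPU can be run on Colab or Kaggle as noted above.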

Get Started → Chapter 0: PyTorch Basics