Transformer Models in PyTorch
From Internals to HuggingFace
A hands-on tutorial for practitioners who already know PyTorch and high-level transformer APIs (HuggingFace, vLLM) and want to understand how everything works under the hood.
Prerequisites
- Comfortable training neural networks in PyTorch
- Have used `transformers`, HuggingFace pipelines, or vLLM for inference
- GPU access assumed throughout the exercises
What You Will Learn
By the end of this tutorial you will be able to:
- Implement every component of a transformer from scratch in pure PyTorch — tokenizer, embeddings, attention, feed-forward, normalization, and full model stacks.
- Match HuggingFace outputs by loading pretrained weights directly into your custom implementations and verifying that the results agree to within floating-point tolerance.
- Understand the design decisions behind BERT, GPT-2, LLaMA, and T5 — not just that they work, but why specific choices (RoPE, RMSNorm, SwiGLU, causal masking, KV cache) were made.
- Apply production techniques — LoRA, quantization, Flash Attention, KV caching — with a clear mental model of what each one does.
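As a small taste of the "implement from scratch, then verify" workflow described above, the sketch below writes causal scaled dot-product attention by hand and compares it with PyTorch's built-in kernel. It assumes PyTorch 2.0 or later (for `torch.nn.functional.scaled_dot_product_attention`); the helper name `naive_attention`, the tensor shapes, and the tolerance are illustrative choices, not taken from the tutorial chapters.

```python
import math

import torch
import torch.nn.functional as F


def naive_attention(q, k, v, causal=False):
    """Scaled dot-product attention written out step by step.

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim).
    """
    # Similarity of every query with every key, scaled by sqrt(head_dim).
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if causal:
        seq_len = q.size(-2)
        # True above the diagonal marks future positions that must be hidden.
        future = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device),
            diagonal=1,
        )
        scores = scores.masked_fill(future, float("-inf"))
    weights = scores.softmax(dim=-1)
    return weights @ v


torch.manual_seed(0)
# Illustrative shapes: batch=2, heads=4, seq_len=8, head_dim=16.
q, k, v = (torch.randn(2, 4, 8, 16) for _ in range(3))

ours = naive_attention(q, k, v, causal=True)
reference = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Floating-point results are compared within a tolerance, not bit for bit.
max_abs_diff = (ours - reference).abs().max().item()
matches = torch.allclose(ours, reference, atol=1e-5)
print(f"max abs diff: {max_abs_diff:.2e}, matches: {matches}")
```

The same pattern (run both implementations on identical inputs, then compare with `torch.allclose`) is presumably how the later chapters check custom models against HuggingFace, though the exact tolerances used there may differ.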
Tutorial Structure
The tutorial consists of 12 chapters plus a PyTorch refresher (Chapter 0), each building on the previous ones. Each chapter includes:
- Concepts — theory explained with code, diagrams, and worked examples
- Practical Exercises — guided, instructor-led implementations you build step by step
| Chapter | Topic |
|---|---|
| 0 | PyTorch Basics Refresher |
| 1 | Tokenization from Scratch |
| 2 | Embeddings — Turning Tokens into Vectors |
| 3 | Scaled Dot-Product Attention |
| 4 | Causal (Masked) Attention |
| 5 | The Transformer Block |
| 6 | Encoder-Only Architecture (BERT) |
| 7 | Decoder-Only Architecture (GPT) |
| 8 | Encoder-Decoder Architecture (T5) |
| 9 | The KV Cache |
| 10 | Training from Scratch |
| 11 | HuggingFace Internals |
| 12 | Efficient Training & Inference at Scale |
How to Use This Tutorial
This is a guided tutorial — exercises are done together with an instructor, not solo. The code in each chapter is meant to be typed, run, and discussed as a group.
You will need:
- A Python environment with PyTorch and HuggingFace `transformers` installed. See the Setup page for instructions.
- A GPU (or Colab/Kaggle free tier) for chapters 7 onward.
- Curiosity and a willingness to read error messages carefully.
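Before the first session, a quick sanity check of the requirements listed above can save time. The script below is a minimal sketch, not a replacement for the Setup page: `check_environment` is a hypothetical helper that only tests whether the packages are importable, and it queries CUDA only if PyTorch is present.

```python
import importlib.util
import sys


def check_environment(packages=("torch", "transformers")):
    """Return a dict mapping each package name to True if it is importable."""
    return {
        name: importlib.util.find_spec(name) is not None for name in packages
    }


status = check_environment()
print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
for name, found in status.items():
    print(f"{name}: {'found' if found else 'MISSING'}")

# Only ask about CUDA when torch is actually installed.
if status["torch"]:
    import torch

    print("CUDA available:", torch.cuda.is_available())
```

If a package is reported as missing, follow the Setup page rather than installing ad hoc, so that everyone in the group ends up with the same versions.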