
Transformer Models in PyTorch

From Internals to HuggingFace

A hands-on tutorial for practitioners who already know PyTorch and high-level transformer APIs (HuggingFace, vLLM) and want to understand how everything works under the hood.


Prerequisites

  • Comfortable training neural networks in PyTorch
  • Experience using transformers, HuggingFace pipelines, or vLLM for inference
  • GPU access for the exercises (required from Chapter 7 onward; a free Colab or Kaggle tier works)

What You Will Learn

By the end of this tutorial you will be able to:

  • Implement every component of a transformer from scratch in pure PyTorch — tokenizer, embeddings, attention, feed-forward, normalization, and full model stacks.
  • Match HuggingFace outputs by loading pretrained weights directly into your custom implementations and verifying that the results agree to within floating-point tolerance.
  • Understand the design decisions behind BERT, GPT-2, LLaMA, and T5 — not just that they work, but why specific choices (RoPE, RMSNorm, SwiGLU, causal masking, KV cache) were made.
  • Apply production techniques — LoRA, quantization, Flash Attention, KV caching — with a clear mental model of what each one does.
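The "implement from scratch, then verify against a reference" workflow from the list above can be sketched in a few lines. This is only an illustration, not the tutorial's own code: it assumes PyTorch 2.0 or later and uses the built-in `torch.nn.functional.scaled_dot_product_attention` as a stand-in for a HuggingFace model so that nothing has to be downloaded. The function name `manual_attention`, the tensor shapes, and the `1e-5` tolerance are all choices made for this example.

```python
import math

import torch
import torch.nn.functional as F


def manual_attention(q, k, v):
    """Scaled dot-product attention written out step by step (no mask, no dropout)."""
    # Similarity of every query with every key, scaled by sqrt(head_dim).
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    # Normalize over the key dimension so each query's weights sum to 1.
    weights = torch.softmax(scores, dim=-1)
    # Weighted sum of the value vectors.
    return weights @ v


torch.manual_seed(0)
# Assumed shapes for this sketch: (batch, heads, sequence length, head_dim).
q, k, v = (torch.randn(2, 4, 8, 16) for _ in range(3))

ours = manual_attention(q, k, v)
reference = F.scaled_dot_product_attention(q, k, v)

# Floating-point results rarely match bit for bit, so compare with a tolerance.
max_abs_diff = (ours - reference).abs().max().item()
matches = torch.allclose(ours, reference, atol=1e-5)
print(f"max abs difference: {max_abs_diff:.2e}, within tolerance: {matches}")
```

The same pattern (run both implementations on identical inputs, then compare with `torch.allclose`) carries over to comparing a custom model against a pretrained HuggingFace checkpoint once the weights have been copied across.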

Tutorial Structure

The tutorial consists of 13 chapters, numbered 0 to 12, that build on each other progressively. Each chapter includes:

  • Concepts — theory explained with code, diagrams, and worked examples
  • Practical Exercises — guided, instructor-led implementations you build step by step

Chapter  Topic
0        PyTorch Basics Refresher
1        Tokenization from Scratch
2        Embeddings: Turning Tokens into Vectors
3        Scaled Dot-Product Attention
4        Causal (Masked) Attention
5        The Transformer Block
6        Encoder-Only Architecture (BERT)
7        Decoder-Only Architecture (GPT)
8        Encoder-Decoder Architecture (T5)
9        The KV Cache
10       Training from Scratch
11       HuggingFace Internals
12       Efficient Training & Inference at Scale

How to Use This Tutorial

This is a guided tutorial — exercises are done together with an instructor, not solo. The code in each chapter is meant to be typed, run, and discussed as a group.

You will need:

  1. A Python environment with PyTorch and HuggingFace transformers installed. See the Setup page for instructions.
  2. A GPU (or Colab/Kaggle free tier) for chapters 7 onward.
  3. Curiosity and a willingness to read error messages carefully.
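Before the first session, a quick check of the requirements listed above might look like the sketch below. It is not a replacement for the Setup page: the helper names `check_environment` and `gpu_available` are hypothetical, and the check only covers whether the `torch` and `transformers` packages are importable and whether PyTorch reports a CUDA device.

```python
import importlib.util


def check_environment(packages=("torch", "transformers")):
    """Return a dict mapping each package name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


def gpu_available():
    """Return True only if PyTorch is installed and reports a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch  # imported lazily so the script still runs without PyTorch

    return torch.cuda.is_available()


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'installed' if ok else 'MISSING'}")
    print("CUDA GPU available:", gpu_available())
```

If the last line prints `False`, the earlier chapters can still be followed on CPU; a Colab or Kaggle session is one way to get a GPU for chapters 7 onward.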
