Chapter 0: PyTorch Basics Refresher

Learning Outcome

Revisit the core PyTorch primitives that every chapter in this tutorial builds on: tensors, device management, nn.Module, and the training loop. By the end of this chapter you will have trained a small neural network from scratch and saved its weights to disk.


Concepts

Tensors

A torch.Tensor is the fundamental data structure in PyTorch — an n-dimensional array with automatic differentiation support.

Creating tensors:

import torch

# From Python lists
a = torch.tensor([1.0, 2.0, 3.0])

# Zeros, ones, and random values
zeros = torch.zeros(3, 4)          # shape (3, 4), all 0.0
ones  = torch.ones(3, 4)           # shape (3, 4), all 1.0
rand  = torch.rand(3, 4)           # uniform in [0, 1)
randn = torch.randn(3, 4)          # standard normal

# A range of integers
idx = torch.arange(10)             # tensor([0, 1, 2, ..., 9])

Shape, dtype, and device are the three key properties:

x = torch.randn(2, 3)
print(x.shape)    # torch.Size([2, 3])
print(x.dtype)    # torch.float32
print(x.device)   # device(type='cpu')
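
The dtype is inferred from the input data, which is why a later snippet calls .float() on an integer range. A small sketch of inference and explicit conversion, assuming the default float dtype has not been changed:

```python
import torch

ints = torch.tensor([1, 2, 3])            # Python ints   -> torch.int64
floats = torch.tensor([1.0, 2.0, 3.0])    # Python floats -> torch.float32

print(ints.dtype)       # torch.int64
print(floats.dtype)     # torch.float32

# Conversion returns a new tensor; the original keeps its dtype
as_float = ints.to(torch.float32)         # equivalent to ints.float()
print(as_float.dtype)   # torch.float32
print(ints.dtype)       # torch.int64
```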

Reshaping and slicing:

x = torch.arange(12).float()
y = x.reshape(3, 4)          # view the same data as 3×4
z = y[:, 1:3]                # columns 1 and 2 → shape (3, 2)
w = y[0]                     # first row → shape (4,)
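
Because reshape (on contiguous data) and basic slicing return views, writing through a view also changes the original tensor. A short sketch of that behaviour, with clone() shown as one way to get an independent copy:

```python
import torch

x = torch.arange(12).float()
y = x.reshape(3, 4)        # x is contiguous, so y shares its storage
z = y[:, 1:3]              # basic slicing also returns a view

z[0, 0] = -1.0             # z[0, 0] is the same element as y[0, 1] and x[1]
print(x[1])                # tensor(-1.)

# clone() copies the data, so later writes no longer reach x
z_copy = y[:, 1:3].clone()
z_copy[0, 0] = 99.0
print(x[1])                # tensor(-1.)
```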

Broadcasting follows NumPy rules: shapes are compared from the trailing dimension backwards, and any dimension of size 1 is expanded to match the other operand:

a = torch.ones(3, 1)
b = torch.ones(1, 4)
c = a + b                    # shape (3, 4); a and b are not copied, only the result is allocated
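
To check a broadcast before running it, torch.broadcast_shapes reports the resulting shape without allocating anything. The sketch below also shows what happens when shapes are incompatible; the (3, 2) and (3, 4) shapes are chosen only for illustration:

```python
import torch

a = torch.ones(3, 1)
b = torch.ones(1, 4)

# Resulting shape of a + b, computed from the shapes alone
result_shape = torch.broadcast_shapes(a.shape, b.shape)
print(result_shape)          # torch.Size([3, 4])

c = a + b
print(c[0, 0].item())        # 2.0

# Trailing dimensions 2 and 4 differ and neither is 1, so this fails
try:
    torch.ones(3, 2) + torch.ones(3, 4)
    broadcast_failed = False
except RuntimeError as err:
    broadcast_failed = True
    print("Cannot broadcast:", err)
```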

Automatic Differentiation

PyTorch tracks operations on tensors that have requires_grad=True and computes gradients automatically via .backward().

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2 + 2 * x + 1      # y = (x+1)²

y.backward()                 # dy/dx = 2x + 2 = 8
print(x.grad)                # tensor(8.)

The computation graph is built dynamically as you run operations — there is no separate "compile" step.
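
One consequence worth seeing early: gradients accumulate in .grad across backward() calls rather than being overwritten, which is why the training loop later clears them on every step. A minimal sketch reusing the function above:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

y = x ** 2 + 2 * x + 1
y.backward()
first = x.grad.clone()         # tensor(8.)

# backward() frees the graph, so rebuild y before calling it again
y = x ** 2 + 2 * x + 1
y.backward()                   # adds to x.grad instead of replacing it
second = x.grad.clone()        # tensor(16.)

x.grad.zero_()                 # reset in place before the next computation
print(first, second, x.grad)   # tensor(8.) tensor(16.) tensor(0.)
```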

Moving Data Between Devices

PyTorch computations run on a device — either CPU or a CUDA-enabled GPU. Moving a tensor to a different device creates a copy there; the original is unchanged.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

x = torch.randn(4, 4)            # lives on CPU
x_gpu = x.to(device)             # copy to GPU (or stays on CPU if no GPU)
print(x_gpu.device)

Best practice — define the device once and reuse it:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# All new tensors go directly to the target device
x = torch.randn(4, 4, device=device)

All tensors in an operation must be on the same device. A common error is forgetting to move labels or a new tensor to the GPU when the model is already there.

Building a Neural Network with nn.Module

Every learnable component in PyTorch is a subclass of nn.Module. The framework handles parameter registration, device movement, serialization, and gradient tracking.

Anatomy of an nn.Module:

import torch.nn as nn

class TwoLayerNet(nn.Module):
    def __init__(self, in_features: int, hidden: int, out_features: int):
        super().__init__()
        # nn.Linear registers weight and bias as parameters automatically
        self.fc1 = nn.Linear(in_features, hidden)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden, out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

Calling model(x) invokes forward(x) and also runs any registered hooks, so call the model directly instead of calling model.forward(x).
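
The difference is easy to observe with a forward hook. In this sketch, record_output_shape and seen_shapes are illustrative names, and a single nn.Linear stands in for a full model:

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
seen_shapes = []

def record_output_shape(module, inputs, output):
    # A forward hook receives the module, a tuple of inputs, and the output
    seen_shapes.append(tuple(output.shape))

handle = layer.register_forward_hook(record_output_shape)

x = torch.randn(3, 4)
layer(x)              # goes through __call__, so the hook fires
layer.forward(x)      # bypasses __call__, so the hook does not fire
handle.remove()       # detach the hook when it is no longer needed

print(seen_shapes)    # [(3, 2)]
```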

Moving the entire model to a device:

model = TwoLayerNet(in_features=16, hidden=64, out_features=4)
model = model.to(device)          # moves all parameters and buffers

Inspecting parameters:

total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total:,}")

for name, param in model.named_parameters():
    print(f"  {name:20s}  {tuple(param.shape)}")
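
The total can be checked by hand: an nn.Linear layer holds a weight of shape (out_features, in_features) plus a bias of shape (out_features,). The sketch below does the arithmetic for the 16 → 64 → 4 sizes used above; linear_param_count is a hypothetical helper written for this example, not a PyTorch function:

```python
import torch.nn as nn

def linear_param_count(in_features: int, out_features: int) -> int:
    # weight: out_features x in_features, bias: out_features
    return in_features * out_features + out_features

fc1 = nn.Linear(16, 64)
fc2 = nn.Linear(64, 4)

by_hand = linear_param_count(16, 64) + linear_param_count(64, 4)
counted = sum(p.numel() for layer in (fc1, fc2) for p in layer.parameters())

print(by_hand, counted)   # 1348 1348
```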

The Training Loop

A standard PyTorch training loop has five steps per batch:

  1. Forward pass: run the model to get predictions.
  2. Compute loss: compare predictions to targets.
  3. Zero gradients: clear accumulated gradients from the previous step.
  4. Backward pass: compute gradients via loss.backward().
  5. Optimizer step: update parameters using the computed gradients.

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn   = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    model.train()                           # enable dropout, batch norm, etc.
    for x_batch, y_batch in dataloader:
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)

        logits = model(x_batch)             # 1. forward
        loss   = loss_fn(logits, y_batch)   # 2. loss

        optimizer.zero_grad()               # 3. zero grads
        loss.backward()                     # 4. backward
        optimizer.step()                    # 5. update
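
The loop above leaves num_epochs and dataloader undefined. A self-contained sketch that runs as written is shown here; the random data, the nn.Sequential model, and all sizes are placeholders chosen for illustration, so the loss values themselves carry no meaning:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder data: 64 samples, 16 features, 4 classes with random labels
X = torch.randn(64, 16)
y = torch.randint(0, 4, (64,))
dataloader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)
num_epochs = 3

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

epoch_losses = []
for epoch in range(num_epochs):
    model.train()
    running = 0.0
    for x_batch, y_batch in dataloader:
        x_batch, y_batch = x_batch.to(device), y_batch.to(device)

        logits = model(x_batch)             # 1. forward
        loss = loss_fn(logits, y_batch)     # 2. loss

        optimizer.zero_grad()               # 3. zero grads
        loss.backward()                     # 4. backward
        optimizer.step()                    # 5. update

        running += loss.item() * len(y_batch)
    epoch_losses.append(running / len(X))   # sample-weighted mean loss
    print(f"epoch {epoch}: mean loss {epoch_losses[-1]:.4f}")
```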

model.train() vs model.eval() — some layers (Dropout, BatchNorm) behave differently during training and evaluation. Always set the mode explicitly.

model.eval()
with torch.no_grad():                # disables gradient tracking, saving memory and time
    predictions = model(x_test)
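
The effect of the mode switch is easiest to see on a single nn.Dropout layer. This is a small sketch with p=0.5 chosen for illustration; in training mode the surviving entries are scaled by 1 / (1 - p):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
out_train = drop(x)      # entries are either zeroed or scaled to 2.0

drop.eval()
out_eval = drop(x)       # dropout is a no-op in eval mode

print(out_train.unique())          # only the values 0 and 2 occur
print(torch.equal(out_eval, x))    # True
```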

Saving and Loading

Save model weights (state dict):

torch.save(model.state_dict(), "model.pt")

Load weights back:

model = TwoLayerNet(in_features=16, hidden=64, out_features=4)
# map_location lets a checkpoint saved on a GPU load on a CPU-only machine
model.load_state_dict(torch.load("model.pt", map_location=device, weights_only=True))
model.to(device)
model.eval()

Always pass weights_only=True to torch.load. It restricts unpickling to tensors, plain containers, and primitive types, which prevents a checkpoint file from an untrusted source from executing arbitrary Python code when it is loaded.
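
Plain containers still load under this restriction, so a checkpoint can bundle a state dict with simple metadata. A minimal sketch, where the in-memory buffer stands in for a file on disk and the "epoch" and "model" keys are illustrative choices:

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(4, 2)

# torch.save and torch.load accept file-like objects as well as paths
buffer = io.BytesIO()
torch.save({"epoch": 5, "model": model.state_dict()}, buffer)
buffer.seek(0)

# map_location="cpu" places the loaded tensors on the CPU
checkpoint = torch.load(buffer, map_location="cpu", weights_only=True)

restored = nn.Linear(4, 2)
restored.load_state_dict(checkpoint["model"])

print(checkpoint["epoch"])                          # 5
print(torch.equal(restored.weight, model.weight))   # True
```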

Save and load other tensor data, such as a dict of tensors, with torch.save / torch.load:

# Save a dataset split or intermediate tensors
torch.save({"X_train": X_train, "y_train": y_train}, "data.pt")

data = torch.load("data.pt", weights_only=True)
X_train = data["X_train"]
y_train  = data["y_train"]

Exercise 1 — Tensor Basics

Guided Exercise

Follow along and implement each step. Run each cell before moving on.

Step 1: Create and inspect tensors

import torch

# Create a 4×4 matrix of random normal values
x = torch.randn(4, 4)
print("Shape:", x.shape)
print("Dtype:", x.dtype)
print("Device:", x.device)
print(x)

Step 2: Reshape and slice

# Flatten to 1-D, then reshape to 2×8
flat = x.reshape(-1)          # -1 means "infer this dimension"
y    = flat.reshape(2, 8)
print("flat shape:", flat.shape)
print("y shape:", y.shape)

# Extract the top-left 2×2 sub-matrix
sub = x[:2, :2]
print("sub:\n", sub)

Step 3: Arithmetic and broadcasting

# Element-wise operations
a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.tensor([[10.0, 20.0], [30.0, 40.0]])

print("a + b:\n", a + b)
print("a * b:\n", a * b)          # element-wise, NOT matrix multiply
print("a @ b:\n", a @ b)          # matrix multiply

# Broadcasting: add a row vector to each row of a matrix
bias = torch.tensor([100.0, 200.0])   # shape (2,)
print("a + bias:\n", a + bias)         # bias broadcast over rows

Step 4: Gradients

# Simple scalar computation
x = torch.tensor(2.0, requires_grad=True)
y = 3 * x ** 2 - 4 * x + 1      # y = 3x² - 4x + 1

y.backward()                      # dy/dx = 6x - 4 = 8
print(f"x = {x.item()},  dy/dx = {x.grad.item()}")  # expect 8.0
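
A useful habit is to sanity-check an autograd result against a numerical estimate. The sketch below compares the gradient above with a central difference; the step size h and the use of float64 are choices made for this example:

```python
import torch

def f(x):
    return 3 * x ** 2 - 4 * x + 1

# Autograd gradient at x = 2
x = torch.tensor(2.0, dtype=torch.float64, requires_grad=True)
f(x).backward()
autograd_grad = x.grad.item()

# Central difference on plain Python floats: (f(x + h) - f(x - h)) / (2h)
h = 1e-6
x0 = 2.0
numeric_grad = (f(x0 + h) - f(x0 - h)) / (2 * h)

print(autograd_grad)                              # 8.0
print(abs(autograd_grad - numeric_grad) < 1e-6)   # True
```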

Exercise 2 — Moving Tensors Between Devices

Guided Exercise

Step 1: Check available hardware

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)

if torch.cuda.is_available():
    print("GPU name:", torch.cuda.get_device_name(0))
    print("Memory (GB):", torch.cuda.get_device_properties(0).total_memory / 1e9)

Step 2: Move tensors to the device

x = torch.randn(1000, 1000)          # on CPU
print("Before:", x.device)

x = x.to(device)
print("After:", x.device)

# Create directly on the target device
y = torch.zeros(1000, 1000, device=device)
print("y device:", y.device)

Step 3: Observe the device mismatch error

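# On a GPU system this raises a RuntimeError; read the error message carefully.
# On a CPU-only system both tensors share a device, so the addition succeeds.
a = torch.tensor([1.0, 2.0])           # CPU
b = torch.tensor([3.0, 4.0]).to(device)  # GPU (if available)

try:
    c = a + b
    print("No error: both tensors are on", c.device)
except RuntimeError as e:
    print("RuntimeError:", e)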

The fix is simple — move a to the same device as b before adding.
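
One way to write that fix is to align on the other operand's device instead of hard-coding a device name. A minimal sketch that also runs on CPU-only machines:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.tensor([1.0, 2.0])               # CPU
b = torch.tensor([3.0, 4.0]).to(device)    # GPU if available

c = a.to(b.device) + b                     # move a to wherever b lives, then add
print(c.device == b.device)                # True
print(c.cpu())                             # tensor([4., 6.])
```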


Exercise 3 — Build and Inspect a Two-Layer Network

Guided Exercise

Step 1: Define the network

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class TwoLayerNet(nn.Module):
    def __init__(self, in_features: int, hidden: int, out_features: int):
        super().__init__()
        self.fc1 = nn.Linear(in_features, hidden)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden, out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.relu(self.fc1(x)))


model = TwoLayerNet(in_features=8, hidden=32, out_features=2)
model = model.to(device)
print(model)

Step 2: Count and inspect parameters

total_params = sum(p.numel() for p in model.parameters())
print(f"Total trainable parameters: {total_params}")

for name, param in model.named_parameters():
    # Convert the shape tuple to str first: tuples do not accept a width format spec
    print(f"  {name:20s}  shape={str(tuple(param.shape)):20}  device={param.device}")

Expected output (on a GPU machine; on a CPU-only machine the device column shows cpu):

Total trainable parameters: 354
  fc1.weight            shape=(32, 8)               device=cuda:0
  fc1.bias              shape=(32,)                 device=cuda:0
  fc2.weight            shape=(2, 32)               device=cuda:0
  fc2.bias              shape=(2,)                  device=cuda:0

Step 3: Run a forward pass

# Create a batch of 16 random input vectors, each of length 8
x = torch.randn(16, 8, device=device)
logits = model(x)
print("Input shape:", x.shape)
print("Output shape:", logits.shape)   # should be (16, 2)

Exercise 4 — Implement the Training Loop

Guided Exercise

We will train the network on a small synthetic binary classification dataset.

Step 1: Generate synthetic data

import torch
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(42)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 500 samples, 8 features, 2 classes
# Features are standard normal; the label depends only on the first two features
N = 500
X = torch.randn(N, 8)
y = (X[:, 0] + X[:, 1] > 0).long()   # label = 1 if first two features sum > 0

# 80/20 train/test split
split = int(0.8 * N)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

train_ds = TensorDataset(X_train, y_train)
test_ds  = TensorDataset(X_test,  y_test)
train_dl = DataLoader(train_ds, batch_size=32, shuffle=True)
test_dl  = DataLoader(test_ds,  batch_size=64)

print(f"Train: {len(train_ds)} samples, Test: {len(test_ds)} samples")

Step 2: Instantiate model, loss, and optimizer

model     = TwoLayerNet(in_features=8, hidden=32, out_features=2).to(device)
loss_fn   = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
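
nn.CrossEntropyLoss takes raw logits and integer class indices, so the model needs no softmax layer and the labels need no one-hot encoding. The sketch below checks this against a step-by-step computation; the random logits and the target values are made up for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 2)                  # batch of 5, 2 classes, no softmax applied
targets = torch.tensor([0, 1, 1, 0, 1])     # class indices, dtype int64

loss = nn.CrossEntropyLoss()(logits, targets)

# Same quantity by hand: mean negative log-probability of the true class
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(5), targets].mean()

print(torch.allclose(loss, manual))         # True
```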

Step 3: Write the training and evaluation functions

def train_one_epoch(model, dataloader, loss_fn, optimizer, device):
    model.train()
    total_loss, correct, total = 0.0, 0, 0
    for x_batch, y_batch in dataloader:
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)

        logits = model(x_batch)
        loss   = loss_fn(logits, y_batch)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item() * len(y_batch)
        correct    += (logits.argmax(dim=1) == y_batch).sum().item()
        total      += len(y_batch)

    return total_loss / total, correct / total


def evaluate(model, dataloader, loss_fn, device):
    model.eval()
    total_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():
        for x_batch, y_batch in dataloader:
            x_batch = x_batch.to(device)
            y_batch = y_batch.to(device)

            logits = model(x_batch)
            loss   = loss_fn(logits, y_batch)

            total_loss += loss.item() * len(y_batch)
            correct    += (logits.argmax(dim=1) == y_batch).sum().item()
            total      += len(y_batch)

    return total_loss / total, correct / total

Step 4: Run the training loop

num_epochs = 20

for epoch in range(1, num_epochs + 1):
    train_loss, train_acc = train_one_epoch(model, train_dl, loss_fn, optimizer, device)
    test_loss,  test_acc  = evaluate(model, test_dl, loss_fn, device)
    if epoch % 5 == 0:
        print(
            f"Epoch {epoch:3d} | "
            f"train loss {train_loss:.4f}  acc {train_acc:.3f} | "
            f"test  loss {test_loss:.4f}  acc {test_acc:.3f}"
        )

You should see training accuracy climb above 90% within a few epochs. The task is deliberately easy so that you can focus on the mechanics of the training loop rather than on model design.


Exercise 5 — Save and Load Weights

Guided Exercise

Step 1: Save the trained model

import os

save_path = "two_layer_net.pt"
torch.save(model.state_dict(), save_path)
print(f"Saved model weights to {save_path}")
print(f"File size: {os.path.getsize(save_path)} bytes")

Step 2: Reload the weights and verify predictions match

# Build a fresh model with the same architecture
model_loaded = TwoLayerNet(in_features=8, hidden=32, out_features=2)
model_loaded.load_state_dict(torch.load(save_path, weights_only=True))
model_loaded = model_loaded.to(device)
model_loaded.eval()

# Run both models on the test set and compare
model.eval()
x_test_dev = X_test.to(device)

with torch.no_grad():
    out_original = model(x_test_dev)
    out_loaded   = model_loaded(x_test_dev)

max_diff = (out_original - out_loaded).abs().max().item()
print(f"Max absolute difference between original and loaded model: {max_diff:.2e}")
# Expect: 0.00e+00 — the outputs should be identical
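
Comparing outputs is one check; comparing the stored tensors directly is another. A minimal sketch, where state_dicts_equal is a hypothetical helper written for this example and two small nn.Linear layers stand in for the models above:

```python
import torch
import torch.nn as nn

def state_dicts_equal(sd_a, sd_b) -> bool:
    # Same keys, and every tensor matches element for element
    if sd_a.keys() != sd_b.keys():
        return False
    return all(torch.equal(sd_a[k], sd_b[k]) for k in sd_a)

torch.manual_seed(0)
source = nn.Linear(8, 2)
target = nn.Linear(8, 2)     # separately initialized, so weights differ

before = state_dicts_equal(source.state_dict(), target.state_dict())
target.load_state_dict(source.state_dict())
after = state_dicts_equal(source.state_dict(), target.state_dict())

print(before, after)         # False True
```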

Step 3: Save and load a data checkpoint

# Save training data tensors
torch.save({"X_train": X_train, "y_train": y_train,
            "X_test":  X_test,  "y_test":  y_test}, "data.pt")

# Reload and verify shapes
data = torch.load("data.pt", weights_only=True)
print("Loaded shapes:")
for key, tensor in data.items():
    print(f"  {key}: {tensor.shape}")

Summary

  • torch.Tensor is PyTorch's core data structure. Key properties: shape, dtype, and device.
  • Move tensors and models to the target device with .to(device). All operands in a computation must live on the same device.
  • Subclass nn.Module to build learnable components. Register sub-modules as attributes and implement forward().
  • The training loop: forward → loss → zero_grad → backward → step. Always call model.train() before training and model.eval() before evaluation.
  • Save weights with torch.save(model.state_dict(), path) and reload with model.load_state_dict(torch.load(path, weights_only=True)).
