Chapter 0: PyTorch Basics Refresher
Learning Outcome
Revisit the core PyTorch primitives that every chapter in this tutorial builds on:
tensors, device management, nn.Module, and the training loop. By the end of this
chapter you will have trained a small neural network from scratch and saved its weights
to disk.
Concepts
Tensors
A torch.Tensor is the fundamental data structure in PyTorch — an n-dimensional array
with automatic differentiation support.
Creating tensors:
import torch
# From Python lists
a = torch.tensor([1.0, 2.0, 3.0])
# Zeros, ones, and random values
zeros = torch.zeros(3, 4) # shape (3, 4), all 0.0
ones = torch.ones(3, 4) # shape (3, 4), all 1.0
rand = torch.rand(3, 4) # uniform in [0, 1)
randn = torch.randn(3, 4) # standard normal
# A range of integers
idx = torch.arange(10) # tensor([0, 1, 2, ..., 9])
Shape, dtype, and device are the three key properties:
x = torch.randn(2, 3)
print(x.shape) # torch.Size([2, 3])
print(x.dtype) # torch.float32
print(x.device) # device(type='cpu')
Reshaping and slicing:
x = torch.arange(12).float()
y = x.reshape(3, 4) # view the same data as 3×4
z = y[:, 1:3] # columns 1 and 2 → shape (3, 2)
w = y[0] # first row → shape (4,)
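The comment "view the same data" can be checked directly. The sketch below assumes a contiguous tensor, for which reshape returns a view without copying; for a non-contiguous tensor (here a transpose) reshape has to fall back to a copy. The variable names are illustrative only.

```python
import torch

x = torch.arange(12).float()
y = x.reshape(3, 4)            # contiguous input: reshape returns a view
y[0, 0] = 100.0                # write through the reshaped tensor
shares_memory = x.data_ptr() == y.data_ptr()
print(x[0].item(), shares_memory)   # 100.0 True

t = y.t()                      # transpose is a non-contiguous view of y
flat = t.reshape(-1)           # cannot be expressed as a view, so data is copied
copied = flat.data_ptr() != y.data_ptr()
print(copied)                  # True
```

If code relies on sharing (or on not sharing) memory, it is safer to check than to assume, since reshape decides between view and copy based on the memory layout.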
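Broadcasting follows NumPy rules: shapes are aligned from the trailing dimension, and any dimension of size 1 (or a missing leading dimension) is expanded automatically to match the other operand.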
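As a minimal sketch of these rules, the example below combines a (3, 4) matrix with a row vector and a column vector. The tensor names are chosen for illustration.

```python
import torch

m = torch.ones(3, 4)                           # shape (3, 4)
row = torch.tensor([1.0, 2.0, 3.0, 4.0])       # shape (4,), treated as (1, 4)
col = torch.tensor([[10.0], [20.0], [30.0]])   # shape (3, 1)

print((m + row).shape)     # torch.Size([3, 4]): row is repeated down the rows
print((m + col).shape)     # torch.Size([3, 4]): col is repeated across the columns
outer = row + col          # (4,) and (3, 1) broadcast to (3, 4)
print(outer[2, 3].item())  # 34.0, i.e. 4.0 + 30.0
```

Shapes that cannot be aligned this way, such as (3, 4) and (3,), raise a RuntimeError rather than broadcasting silently.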
Automatic Differentiation
PyTorch tracks operations on tensors that have requires_grad=True and computes
gradients automatically via .backward().
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2 + 2 * x + 1 # y = (x+1)²
y.backward() # dy/dx = 2x + 2 = 8
print(x.grad) # tensor(8.)
The computation graph is built dynamically as you run operations — there is no separate "compile" step.
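One consequence of the dynamic graph is that ordinary Python control flow decides which operations get recorded. The sketch below uses a hypothetical function f with a data-dependent branch; the gradient reflects whichever branch actually ran.

```python
import torch

def f(x: torch.Tensor) -> torch.Tensor:
    # Plain Python branching: only the executed branch is recorded in the graph
    if x.item() > 0:
        return x ** 2      # derivative 2x
    return -3 * x          # derivative -3

grads = []
for value in (2.0, -2.0):
    x = torch.tensor(value, requires_grad=True)
    f(x).backward()
    grads.append(x.grad.item())

print(grads)   # [4.0, -3.0]
```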
Moving Data Between Devices
PyTorch computations run on a device — either CPU or a CUDA-enabled GPU. Moving a tensor to a different device creates a copy there; the original is unchanged.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)
x = torch.randn(4, 4) # lives on CPU
x_gpu = x.to(device) # copy to GPU (or stays on CPU if no GPU)
print(x_gpu.device)
Best practice — define the device once and reuse it:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# All new tensors go directly to the target device
x = torch.randn(4, 4, device=device)
All tensors in an operation must be on the same device. A common error is forgetting to move labels or a new tensor to the GPU when the model is already there.
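One way to avoid the "new tensor on the wrong device" mistake is to derive the device from a tensor that is already in the right place. This is a small sketch, not the only approach; mask and offsets are illustrative names.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(4, 4, device=device)

# zeros_like inherits both device and dtype from x
mask = torch.zeros_like(x)
# Or pass the existing tensor's device explicitly
offsets = torch.arange(4, device=x.device)

# .to() returns the tensor itself when device and dtype already match
same = x.to(x.device)

print(mask.device == x.device, offsets.device == x.device)  # True True
print(same is x)                                            # True
total = x + mask + offsets   # all operands on one device, so this works
```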
Building a Neural Network with nn.Module
Every learnable component in PyTorch is a subclass of nn.Module. The framework
handles parameter registration, device movement, serialization, and gradient tracking.
Anatomy of an nn.Module:
import torch.nn as nn
class TwoLayerNet(nn.Module):
    def __init__(self, in_features: int, hidden: int, out_features: int):
        super().__init__()
        # nn.Linear registers weight and bias as parameters automatically
        self.fc1 = nn.Linear(in_features, hidden)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden, out_features)
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x
Calling model(x) invokes forward(x) and also runs any registered hooks.
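This is why the usual advice is to call model(x) rather than model.forward(x). The sketch below registers a forward hook on a single nn.Linear layer (record_shape is a hypothetical hook written for this example) and shows that calling forward directly skips it.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
calls = []

def record_shape(module, inputs, output):
    # Forward hooks receive the module, its positional inputs, and its output
    calls.append(tuple(output.shape))

handle = layer.register_forward_hook(record_shape)
x = torch.randn(3, 4)

layer(x)            # goes through __call__, so the hook runs
layer.forward(x)    # bypasses __call__, so the hook is skipped
handle.remove()     # detach the hook when it is no longer needed

print(calls)        # [(3, 2)]
```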
Moving the entire model to a device:
model = TwoLayerNet(in_features=16, hidden=64, out_features=4)
model = model.to(device) # moves all parameters and buffers
Inspecting parameters:
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total:,}")
for name, param in model.named_parameters():
    print(f" {name:20s} {tuple(param.shape)}")
The Training Loop
A standard PyTorch training loop has five steps per batch:
- Forward pass: run the model to get predictions.
- Compute loss: compare predictions to targets.
- Zero gradients: clear accumulated gradients from the previous step.
- Backward pass: compute gradients via loss.backward().
- Optimizer step: update parameters using the computed gradients.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(num_epochs):
    model.train()  # training mode: dropout is active, batch norm uses batch statistics
    for x_batch, y_batch in dataloader:
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)
        logits = model(x_batch)  # 1. forward
        loss = loss_fn(logits, y_batch)  # 2. loss
        optimizer.zero_grad()  # 3. zero grads
        loss.backward()  # 4. backward
        optimizer.step()  # 5. update
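The zero-gradients step exists because backward() adds to any gradient already stored in .grad instead of overwriting it. A minimal sketch with a single scalar parameter, clearing the gradient by hand where a training loop would call optimizer.zero_grad():

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
seen = []

(2 * w).backward()
seen.append(w.grad.item())   # 2.0

(2 * w).backward()           # no reset in between: gradients accumulate
seen.append(w.grad.item())   # 4.0

w.grad = None                # manual reset, standing in for optimizer.zero_grad()
(2 * w).backward()
seen.append(w.grad.item())   # 2.0 again

print(seen)   # [2.0, 4.0, 2.0]
```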
model.train() vs model.eval() — some layers (Dropout, BatchNorm) behave
differently during training and evaluation. Always set the mode explicitly.
model.eval()
with torch.no_grad():  # disables gradient tracking, saving memory and time
    predictions = model(x_test)
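The difference between the two modes is easy to see on a standalone nn.Dropout layer. The sketch below is illustrative: in training mode roughly half the entries are zeroed and the rest are scaled by 1 / (1 - p), while in evaluation mode the layer is the identity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)    # entries are either 0.0 (dropped) or 2.0 (kept and rescaled)
drop.eval()
eval_out = drop(x)     # dropout is a no-op in eval mode

zero_fraction = (train_out == 0).float().mean().item()
print(sorted(train_out.unique().tolist()))   # [0.0, 2.0]
print(torch.equal(eval_out, x))              # True
```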
Saving and Loading
Save model weights (state dict):
torch.save(model.state_dict(), "model.pt")
Load weights back:
model = TwoLayerNet(in_features=16, hidden=64, out_features=4)
model.load_state_dict(torch.load("model.pt", weights_only=True))
model.to(device)
model.eval()
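Pass weights_only=True to torch.load whenever a file holds only tensors and plain
containers. It restricts unpickling to tensors, primitive types, and standard containers,
so a checkpoint from an untrusted source cannot execute arbitrary Python code while it
loads. PyTorch 2.6 and later use this as the default; passing it explicitly keeps the
behavior the same on older versions.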
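The save and load round trip can be sketched without touching the file system by writing to an in-memory buffer. The io.BytesIO buffer and the small nn.Linear model are stand-ins chosen for this example, and map_location="cpu" is an optional extra that loads the tensors onto the CPU regardless of where they were saved.

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
buffer = io.BytesIO()                    # in-memory stand-in for "model.pt"
torch.save(model.state_dict(), buffer)
buffer.seek(0)                           # rewind before reading

state = torch.load(buffer, map_location="cpu", weights_only=True)
restored = nn.Linear(4, 2)               # same architecture, fresh random weights
restored.load_state_dict(state)

print(sorted(state.keys()))                          # ['bias', 'weight']
print(torch.equal(restored.weight, model.weight))    # True
```

A state dict is an ordinary mapping from parameter names to tensors, which is why it passes the weights_only restriction.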
Save and load tensors and plain containers (dicts, lists) with torch.save / torch.load:
# Save a dataset split or intermediate tensors
torch.save({"X_train": X_train, "y_train": y_train}, "data.pt")
data = torch.load("data.pt", weights_only=True)
X_train = data["X_train"]
y_train = data["y_train"]
Exercise 1 — Tensor Basics
Guided Exercise
Follow along and implement each step. Run each cell before moving on.
Step 1: Create and inspect tensors
import torch
# Create a 4×4 matrix of random normal values
x = torch.randn(4, 4)
print("Shape:", x.shape)
print("Dtype:", x.dtype)
print("Device:", x.device)
print(x)
Step 2: Reshape and slice
# Flatten to 1-D, then reshape to 2×8
flat = x.reshape(-1) # -1 means "infer this dimension"
y = flat.reshape(2, 8)
print("flat shape:", flat.shape)
print("y shape:", y.shape)
# Extract the top-left 2×2 sub-matrix
sub = x[:2, :2]
print("sub:\n", sub)
Step 3: Arithmetic and broadcasting
# Element-wise operations
a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.tensor([[10.0, 20.0], [30.0, 40.0]])
print("a + b:\n", a + b)
print("a * b:\n", a * b) # element-wise, NOT matrix multiply
print("a @ b:\n", a @ b) # matrix multiply
# Broadcasting: add a row vector to each row of a matrix
bias = torch.tensor([100.0, 200.0]) # shape (2,)
print("a + bias:\n", a + bias) # bias broadcast over rows
Step 4: Gradients
# Simple scalar computation
x = torch.tensor(2.0, requires_grad=True)
y = 3 * x ** 2 - 4 * x + 1 # y = 3x² - 4x + 1
y.backward() # dy/dx = 6x - 4 = 8
print(f"x = {x.item()}, dy/dx = {x.grad.item()}") # expect 8.0
Exercise 2 — Moving Tensors Between Devices
Guided Exercise
Step 1: Check available hardware
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)
if torch.cuda.is_available():
    print("GPU name:", torch.cuda.get_device_name(0))
    print("Memory (GB):", torch.cuda.get_device_properties(0).total_memory / 1e9)
Step 2: Move tensors to the device
x = torch.randn(1000, 1000) # on CPU
print("Before:", x.device)
x = x.to(device)
print("After:", x.device)
# Create directly on the target device
y = torch.zeros(1000, 1000, device=device)
print("y device:", y.device)
Step 3: Observe the device mismatch error
# This will raise a RuntimeError on GPU systems — read the error message carefully
a = torch.tensor([1.0, 2.0]) # CPU
b = torch.tensor([3.0, 4.0]).to(device) # GPU (if available)
try:
    c = a + b
except RuntimeError as e:
    print("RuntimeError:", e)
The fix is simple — move a to the same device as b before adding.
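A minimal sketch of that fix, reusing the tensors from this step. Reading the device from b instead of hard-coding it keeps the code correct on both GPU and CPU-only machines.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.tensor([1.0, 2.0])               # CPU
b = torch.tensor([3.0, 4.0]).to(device)    # GPU (if available)

c = a.to(b.device) + b      # move a to wherever b lives, then add
print(c.cpu().tolist())     # [4.0, 6.0]
```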
Exercise 3 — Build and Inspect a Two-Layer Network
Guided Exercise
Step 1: Define the network
import torch
import torch.nn as nn
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
class TwoLayerNet(nn.Module):
    def __init__(self, in_features: int, hidden: int, out_features: int):
        super().__init__()
        self.fc1 = nn.Linear(in_features, hidden)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden, out_features)
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.relu(self.fc1(x)))
model = TwoLayerNet(in_features=8, hidden=32, out_features=2)
model = model.to(device)
print(model)
Step 2: Count and inspect parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Total trainable parameters: {total_params}")
for name, param in model.named_parameters():
    # str() is required: a tuple does not accept a width format spec such as :20
    print(f" {name:20s} shape={str(tuple(param.shape)):20} device={param.device}")
Expected output (on a CUDA machine; a CPU-only machine prints device=cpu):
Total trainable parameters: 354
 fc1.weight           shape=(32, 8)              device=cuda:0
 fc1.bias             shape=(32,)                device=cuda:0
 fc2.weight           shape=(2, 32)              device=cuda:0
 fc2.bias             shape=(2,)                 device=cuda:0
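The total of 354 can be checked by hand: a linear layer with n_in inputs and n_out outputs has an (n_out, n_in) weight matrix plus n_out biases. The helper linear_params below is a hypothetical function written for this check, not part of PyTorch.

```python
def linear_params(n_in: int, n_out: int) -> int:
    # weight matrix (n_out x n_in) plus one bias per output
    return n_in * n_out + n_out

fc1 = linear_params(8, 32)    # 8 * 32 + 32 = 288
fc2 = linear_params(32, 2)    # 32 * 2 + 2 = 66
print(fc1, fc2, fc1 + fc2)    # 288 66 354
```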
Step 3: Run a forward pass
# Create a batch of 16 random input vectors, each of length 8
x = torch.randn(16, 8, device=device)
logits = model(x)
print("Input shape:", x.shape)
print("Output shape:", logits.shape) # should be (16, 2)
Exercise 4 — Implement the Training Loop
Guided Exercise
We will train the network on a small synthetic binary classification dataset.
Step 1: Generate synthetic data
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
torch.manual_seed(42)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# 500 samples, 8 features, 2 classes
# Features are standard normal; the label depends only on the first two features
N = 500
X = torch.randn(N, 8)
y = (X[:, 0] + X[:, 1] > 0).long()  # label = 1 if the first two features sum to > 0
# 80/20 train/test split
split = int(0.8 * N)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
train_ds = TensorDataset(X_train, y_train)
test_ds = TensorDataset(X_test, y_test)
train_dl = DataLoader(train_ds, batch_size=32, shuffle=True)
test_dl = DataLoader(test_ds, batch_size=64)
print(f"Train: {len(train_ds)} samples, Test: {len(test_ds)} samples")
Step 2: Instantiate model, loss, and optimizer
# TwoLayerNet is the class defined in Exercise 3
model = TwoLayerNet(in_features=8, hidden=32, out_features=2).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
Step 3: Write the training and evaluation functions
def train_one_epoch(model, dataloader, loss_fn, optimizer, device):
    model.train()
    total_loss, correct, total = 0.0, 0, 0
    for x_batch, y_batch in dataloader:
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)
        logits = model(x_batch)
        loss = loss_fn(logits, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * len(y_batch)
        correct += (logits.argmax(dim=1) == y_batch).sum().item()
        total += len(y_batch)
    return total_loss / total, correct / total
def evaluate(model, dataloader, loss_fn, device):
    model.eval()
    total_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():
        for x_batch, y_batch in dataloader:
            x_batch = x_batch.to(device)
            y_batch = y_batch.to(device)
            logits = model(x_batch)
            loss = loss_fn(logits, y_batch)
            total_loss += loss.item() * len(y_batch)
            correct += (logits.argmax(dim=1) == y_batch).sum().item()
            total += len(y_batch)
    return total_loss / total, correct / total
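Both functions multiply loss.item() by len(y_batch) because nn.CrossEntropyLoss returns the mean over the batch by default, and the last batch is usually smaller than the rest. The arithmetic sketch below uses made-up per-batch losses (an assumption for illustration) with the batch sizes this exercise produces: 400 training samples at batch_size=32 give 12 full batches and one batch of 16.

```python
batch_sizes = [32] * 12 + [16]      # 12 * 32 + 16 = 400 samples
batch_losses = [0.5] * 12 + [1.0]   # hypothetical mean loss of each batch

# Naive: average the batch means, which over-weights the small last batch
naive = sum(batch_losses) / len(batch_losses)

# Weighted: turn each mean back into a sum, then divide by the sample count
weighted = sum(l * n for l, n in zip(batch_losses, batch_sizes)) / sum(batch_sizes)

print(round(naive, 4), round(weighted, 4))   # 0.5385 0.52
```

The weighted value is the true per-sample average, which is what train_one_epoch and evaluate return.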
Step 4: Run the training loop
num_epochs = 20
for epoch in range(1, num_epochs + 1):
    train_loss, train_acc = train_one_epoch(model, train_dl, loss_fn, optimizer, device)
    test_loss, test_acc = evaluate(model, test_dl, loss_fn, device)
    if epoch % 5 == 0:
        print(
            f"Epoch {epoch:3d} | "
            f"train loss {train_loss:.4f} acc {train_acc:.3f} | "
            f"test loss {test_loss:.4f} acc {test_acc:.3f}"
        )
You should see training accuracy climb above 90% within a few epochs. The task is deliberately easy so that you can focus on the mechanics of the training loop rather than on model design.
Exercise 5 — Save and Load Weights
Guided Exercise
Step 1: Save the trained model
import os
save_path = "two_layer_net.pt"
torch.save(model.state_dict(), save_path)
print(f"Saved model weights to {save_path}")
print(f"File size: {os.path.getsize(save_path)} bytes")
Step 2: Reload the weights and verify predictions match
# Build a fresh model with the same architecture
model_loaded = TwoLayerNet(in_features=8, hidden=32, out_features=2)
model_loaded.load_state_dict(torch.load(save_path, weights_only=True))
model_loaded = model_loaded.to(device)
model_loaded.eval()
# Run both models on the test set and compare
model.eval()
x_test_dev = X_test.to(device)
with torch.no_grad():
    out_original = model(x_test_dev)
    out_loaded = model_loaded(x_test_dev)
max_diff = (out_original - out_loaded).abs().max().item()
print(f"Max absolute difference between original and loaded model: {max_diff:.2e}")
# Expect: 0.00e+00 — the outputs should be identical
Step 3: Save and load a data checkpoint
# Save training data tensors
torch.save({"X_train": X_train, "y_train": y_train,
"X_test": X_test, "y_test": y_test}, "data.pt")
# Reload and verify shapes
data = torch.load("data.pt", weights_only=True)
print("Loaded shapes:")
for key, tensor in data.items():
    print(f" {key}: {tensor.shape}")
Summary
- torch.Tensor is PyTorch's core data structure. Key properties: shape, dtype, and device.
- Move tensors and models to the target device with .to(device). All operands in a computation must live on the same device.
- Subclass nn.Module to build learnable components. Register sub-modules as attributes and implement forward().
- The training loop: forward → loss → zero_grad → backward → step. Always call model.train() before training and model.eval() before evaluation.
- Save weights with torch.save(model.state_dict(), path) and reload with model.load_state_dict(torch.load(path, weights_only=True)).