Learning Objectives
- Apply the chain rule of calculus to compute gradients through a computational graph.
- Describe the backpropagation algorithm and its role in computing the gradients that gradient descent needs.
- Distinguish between full-batch, stochastic, and mini-batch gradient descent and explain why the stochasticity can be beneficial.
- Explain how the learning rate affects training convergence and stability, and describe strategies to adapt it.
- Interpret training and validation loss curves to diagnose underfitting, overfitting, and good generalization.
- Load image data in MATLAB, train a shallow network with patternnet, and interpret the Performance plot.
Background & Motivation
Closing the loop
In Lesson 30 we introduced gradient descent: given the gradient of a loss function, we update parameters by stepping opposite the gradient. In Lesson 35 we built neural networks with multiple layers, each with its own weights \(\mathbf{W}^{[\ell]}\) and biases \(\mathbf{b}^{[\ell]}\). When we initialize a network, we often assign random weights and biases to each layer, which produce a poor output according to our loss function. The missing piece that we will address in this lesson is: how do we efficiently compute the gradient of the loss function with respect to every weight and bias in every layer, and how do we update those parameters to obtain better values that produce a lower loss?
Backpropagation is the answer. It is an efficient application of the chain rule from calculus, applied layer-by-layer from the output back to the input. Backpropagation is not a learning algorithm itself — it is a gradient computation algorithm. Gradient descent then uses those gradients to update parameters.
Historical note & the AI boom
The backpropagation algorithm was popularized by Rumelhart, Hinton, and Williams in their landmark 1986 Nature paper, which demonstrated that multi-layer networks could be trained by propagating error signals backward through the network. Before this, there was no practical way to efficiently compute gradients for hidden layers, so deep networks remained essentially untrainable.
Despite the 1986 breakthrough, true deep learning remained dormant for two decades due to computational and data limitations. The AI boom of the 2010s was ignited by three converging factors:
- Big data: The internet generated billions of labeled images, text, and audio samples — more training data than had ever existed.
- GPU acceleration: GPUs could perform the matrix multiplications required by backpropagation thousands of times faster than CPUs, making training of 100-layer networks practical.
- Backpropagation: Already known since 1986, but now scalable to large architectures. The same chain-rule algorithm you will learn today powers GPT-4, AlphaFold, and DALL-E.
The Chain Rule & Computational Graphs
Chain rule review
If our loss function \(L\) depends on an intermediate quantity \(z\), which in turn depends on a weight \(w\), the chain rule gives:
\[\frac{\partial L}{\partial w} = \frac{\partial L}{\partial z}\cdot\frac{\partial z}{\partial w}\]
For a chain of functions \(L = f_n(f_{n-1}(\cdots f_1(w) \cdots))\):
\[\frac{dL}{dw} = \frac{df_n}{df_{n-1}}\cdot\frac{df_{n-1}}{df_{n-2}}\cdots\frac{df_1}{dw}\]
Backpropagation applies this rule systematically, layer by layer from output to input, accumulating gradient contributions at each step.
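The chain rule is easy to verify numerically. The following Python sketch (functions and values chosen here purely for illustration) compares the analytic chain-rule derivative of a three-function composition with a finite-difference estimate:

```python
import math

# Chain of functions: L = f3(f2(f1(w))) with
# f1(w) = w^2, f2(u) = sin(u), f3(v) = exp(v)
# (illustrative choices, not from the lesson).

def loss(w):
    return math.exp(math.sin(w ** 2))

def dloss_chain(w):
    # Chain rule: dL/dw = f3'(f2(f1(w))) * f2'(f1(w)) * f1'(w)
    u = w ** 2                              # f1
    v = math.sin(u)                         # f2
    return math.exp(v) * math.cos(u) * 2 * w

w = 0.7
analytic = dloss_chain(w)

# Central finite-difference check: (L(w+h) - L(w-h)) / (2h)
h = 1e-6
numeric = (loss(w + h) - loss(w - h)) / (2 * h)

print(analytic, numeric)
assert abs(analytic - numeric) < 1e-6
```

The two values agree to several decimal places; backpropagation is this same bookkeeping applied to every parameter at once.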
Computational graphs
A computational graph represents a mathematical expression as a directed graph: each node is an operation, each edge carries a value forward and a gradient backward.
Consider the simplest possible network: a single neuron with two inputs, no bias, ReLU activation, and L2 (MSE) loss:
x₁ · w₁ → a₁ ╲
              [Σ] → z → [ReLU] → ŷ → [½(y−ŷ)²] → L
x₂ · w₂ → a₂ ╱
The forward pass evaluates the loss for a given input by evaluating the network left-to-right, computing and storing all intermediate values. The backward pass propagates the derivatives of the loss with respect to each node \(\frac{\partial L}{\partial \text{each node}}\) right-to-left using the chain rule. This is the backpropagation algorithm. It provides the gradient of the loss with respect to each parameter in the network. In other words, it tells us how much each parameter contributed to the loss, and we can use this to update the parameters to minimize the loss.
Simple numerical walkthrough
Let's trace both passes through the single-neuron graph with concrete numbers.
Given:
Forward pass:
Backward pass — apply chain rule right-to-left:
Gradient descent update with \(\alpha = 0.1\):
Verify loss decreased:
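The two passes and the update can be traced with any concrete numbers. As an illustration, the Python sketch below runs the forward pass, backward pass, and gradient step with assumed values (x₁ = 1, x₂ = 2, w₁ = 0.5, w₂ = 0.25, y = 2, α = 0.1; all chosen here, not taken from the lesson):

```python
# Single neuron: z = w1*x1 + w2*x2, yhat = ReLU(z), L = 0.5*(y - yhat)^2
# Assumed values (chosen for illustration only):
x1, x2 = 1.0, 2.0
w1, w2 = 0.5, 0.25
y = 2.0
alpha = 0.1

relu = lambda z: max(z, 0.0)

# Forward pass: compute and store intermediates
z = w1 * x1 + w2 * x2         # 1.0
yhat = relu(z)                # 1.0
L = 0.5 * (y - yhat) ** 2     # 0.5

# Backward pass (chain rule, right to left)
dL_dyhat = yhat - y                          # -1.0
dL_dz = dL_dyhat * (1.0 if z > 0 else 0.0)   # ReLU gates the gradient
dL_dw1 = dL_dz * x1                          # -1.0
dL_dw2 = dL_dz * x2                          # -2.0

# Gradient descent update
w1 -= alpha * dL_dw1          # 0.6
w2 -= alpha * dL_dw2          # 0.45

# Verify the loss decreased
z_new = w1 * x1 + w2 * x2
L_new = 0.5 * (y - relu(z_new)) ** 2
print(L, L_new)               # 0.5 → ≈ 0.125
assert L_new < L
```

One step opposite the gradient lowers the loss from 0.5 to about 0.125 in this toy setting.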
Backpropagation in a Multi-Layer Network
The single-neuron example above traced the chain rule through 4 operations. A multi-layer network is the same idea — just more links in the chain. The key advantage of backpropagation is that once you compute each error signal \(\delta^{[\ell]}\), you can reuse it to compute both the weight gradient and the error signal for the layer below.
Two-layer network with L2 loss
Consider: \(\mathbf{x} \xrightarrow{[\mathbf{W}^{[1]},\,\mathbf{b}^{[1]}]} \text{ReLU} \xrightarrow{[\mathbf{W}^{[2]},\,b^{[2]}]} \hat{y} \xrightarrow{} L\)
Forward pass:
\[\mathbf{z}^{[1]} = \mathbf{W}^{[1]}\mathbf{x} + \mathbf{b}^{[1]}, \qquad \mathbf{a}^{[1]} = \text{ReLU}(\mathbf{z}^{[1]}), \qquad \hat{y} = \mathbf{W}^{[2]}\mathbf{a}^{[1]} + b^{[2]}, \qquad L = \tfrac{1}{2}(y-\hat{y})^2\]
Backward pass — deriving each error signal step by step:
\[\delta^{[2]} = \hat{y} - y, \qquad \frac{\partial L}{\partial \mathbf{W}^{[2]}} = \delta^{[2]}(\mathbf{a}^{[1]})^\top, \qquad \delta^{[1]} = (\mathbf{W}^{[2]})^\top\delta^{[2]} \odot \mathbf{1}[\mathbf{z}^{[1]}>0], \qquad \frac{\partial L}{\partial \mathbf{W}^{[1]}} = \delta^{[1]}\mathbf{x}^\top\]
with bias gradients \(\frac{\partial L}{\partial \mathbf{b}^{[\ell]}} = \delta^{[\ell]}\).
In practice you will rarely derive these by hand; patternnet handles this automatically.
The ReLU derivative
\(\text{ReLU}'(z) = \mathbf{1}[z>0]\) — it is 1 where the neuron was active during the forward pass, and 0 where it was not. The term \(\odot\,\mathbf{1}[\mathbf{z}^{[1]}>0]\) in \(\delta^{[1]}\) "gates" the gradient: only neurons that fired receive gradient signal. Neurons that did not fire have zero gradient — they receive no update this iteration.
The general backpropagation algorithm
For a network with \(L\) layers and a mini-batch of \(B\) examples:
- Forward pass: Compute and store \(\mathbf{z}^{[\ell]}\) and \(\mathbf{a}^{[\ell]}\) for all \(\ell=1,\ldots,L\). Storage is necessary — the backward pass reuses these values.
- Compute output error signal: \(\delta^{[L]} = \nabla_{z^{[L]}} L\) (depends on loss and output activation choice).
- Backpropagate from \(\ell=L-1\) down to \(1\): \(\;\delta^{[\ell]} = (\mathbf{W}^{[\ell+1]})^\top\delta^{[\ell+1]} \odot \phi'(\mathbf{z}^{[\ell]})\).
- Compute weight gradients: \(\;\dfrac{\partial L}{\partial \mathbf{W}^{[\ell]}} = \dfrac{1}{B}\,\delta^{[\ell]}(\mathbf{A}^{[\ell-1]})^\top\).
- Update parameters with gradient descent: \(\;\mathbf{W}^{[\ell]} \leftarrow \mathbf{W}^{[\ell]} - \alpha\,\dfrac{\partial L}{\partial \mathbf{W}^{[\ell]}}\).
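The five steps above can be sketched directly in code. The following Python/NumPy implementation (a minimal illustration, assuming ReLU hidden layers, a linear output, and MSE loss; all sizes and data are made up) runs a forward pass, backpropagates the error signals, and checks one weight gradient against a finite-difference estimate:

```python
import numpy as np

# Shapes follow the lesson's convention: W[l] is (n_l, n_{l-1}),
# activations are (n_l, B) with the B examples in columns.
rng = np.random.default_rng(0)
sizes = [4, 5, 3, 1]                      # layer widths (illustrative)
Ws = [rng.standard_normal((m, n)) * 0.5 for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros((m, 1)) for m in sizes[1:]]

def forward(X):
    # Step 1: compute and STORE z and a for every layer
    A, cache = X, [(None, X)]
    for l, (W, b) in enumerate(zip(Ws, bs)):
        Z = W @ A + b
        A = Z if l == len(Ws) - 1 else np.maximum(Z, 0)   # linear output layer
        cache.append((Z, A))
    return A, cache

def backward(cache, Y):
    B = Y.shape[1]
    grads_W, grads_b = [None] * len(Ws), [None] * len(Ws)
    delta = cache[-1][1] - Y              # Step 2: output error (yhat - y)
    for l in reversed(range(len(Ws))):
        A_prev = cache[l][1]
        grads_W[l] = delta @ A_prev.T / B                 # Step 4
        grads_b[l] = delta.mean(axis=1, keepdims=True)
        if l > 0:                                         # Step 3
            Z_prev = cache[l][0]
            delta = (Ws[l].T @ delta) * (Z_prev > 0)      # ReLU gate
    return grads_W, grads_b

X = rng.standard_normal((4, 8))
Y = rng.standard_normal((1, 8))
Yhat, cache = forward(X)
gW, gb = backward(cache, Y)

# Gradient check on one weight via central differences
def batch_loss():
    return 0.5 * np.mean(np.sum((forward(X)[0] - Y) ** 2, axis=0))

i, j, eps = 1, 2, 1e-6
Ws[0][i, j] += eps; Lp = batch_loss()
Ws[0][i, j] -= 2 * eps; Lm = batch_loss()
Ws[0][i, j] += eps
numeric = (Lp - Lm) / (2 * eps)
assert abs(numeric - gW[0][i, j]) < 1e-4 * max(1.0, abs(numeric))
```

Step 5 (the parameter update) is then just `W -= alpha * gW[l]` for each layer.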
Stochastic Gradient Descent (SGD)
What does "stochastic" mean?
In Lesson 30, gradient descent computed the exact gradient using all \(N\) training samples. In machine learning, \(N\) can be millions — computing the exact gradient over the entire dataset for every update is prohibitively expensive.
Stochastic means random. Rather than using all \(N\) samples, we randomly select a small subset (\(B\) samples, called a mini-batch) and compute an approximate gradient from that subset. We then take a step using this noisy approximation and move on to the next random mini-batch.
What is the "noise" and how does it help?
The noise is the difference between the true gradient (computed over all \(N\) samples) and the estimated gradient (computed over \(B\) samples). Since the mini-batch is a random sample, the estimated gradient is a random variable — it points roughly in the right direction but not exactly.
Surprisingly, this noise is often beneficial in three ways:
- Speed: Each update is cheap — only \(B\) samples instead of \(N\). You can take many more steps per unit time.
- Escaping local minima: The noisy gradient occasionally pushes the optimizer over small barriers in the loss landscape that exact gradient descent would get stuck in.
- Implicit regularization: The random variation in gradient estimates prevents the model from perfectly memorizing any single batch, acting similarly to dropout regularization.
The three variants
| Variant | Gradient Computed Over | Updates per Epoch | Noise Level |
|---|---|---|---|
| GD (full batch) | All \(N\) training samples | 1 | None — exact gradient |
| SGD (pure stochastic) | 1 random sample | \(N\) | Very high |
| Mini-batch SGD | \(B\) random samples (\(B \approx 32\text{–}256\)) | \(\lfloor N/B\rfloor\) | Moderate — best trade-off |
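The trade-off in the table can be seen in a few lines of code. The Python sketch below (synthetic least-squares data; every value is illustrative) runs full-batch GD and mini-batch SGD for the same number of epochs and counts the updates each performs:

```python
import numpy as np

# Least-squares fit yhat = X w on synthetic data (illustrative only).
rng = np.random.default_rng(1)
N, d, B = 512, 5, 32
X = rng.standard_normal((N, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(N)

def grad(w, idx):
    # Gradient of 0.5*mean((X w - y)^2) over the chosen subset
    Xi, yi = X[idx], y[idx]
    return Xi.T @ (Xi @ w - yi) / len(idx)

def run(batch_size, epochs=20, alpha=0.1):
    w, updates = np.zeros(d), 0
    for _ in range(epochs):
        perm = rng.permutation(N)          # reshuffle each epoch
        for s in range(0, N, batch_size):
            w -= alpha * grad(w, perm[s:s + batch_size])
            updates += 1
    return w, updates

w_full, u_full = run(batch_size=N)   # full-batch GD: 1 update per epoch
w_mini, u_mini = run(batch_size=B)   # mini-batch SGD: N/B updates per epoch

print(u_full, u_mini)                # 20 vs 320
assert u_full == 20 and u_mini == 20 * (N // B)
assert np.linalg.norm(w_mini - w_true) < 0.1
```

With the same 20 epochs, mini-batch SGD takes 16× more (noisier, cheaper) steps and lands close to the true weights.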
patternnet: By default it trains in full-batch mode with a second-order optimizer (scaled conjugate gradient), which processes all training samples each iteration. This is appropriate for our small datasets (hundreds of samples). In the neural network literature and in large-scale deep learning, mini-batch SGD is standard.
The Learning Rate
The learning rate \(\alpha\) controls the step size at each parameter update:
\[\mathbf{W} \leftarrow \mathbf{W} - \alpha\,\frac{\partial L}{\partial \mathbf{W}}\]
This is exactly the gradient descent update rule from Lesson 30 (where we called it \(\gamma\)). The same trade-offs apply — but they are amplified in neural networks because the loss surface is high-dimensional, non-convex, and full of saddle points and narrow valleys.
Four regimes
- Too small: the loss decreases very slowly; training may stall before reaching the minimum.
- Good: the loss decreases steadily and settles near the minimum.
- Too large: the loss oscillates, bouncing across the valley rather than descending into it.
- Way too large: the loss diverges; each step overshoots by more than it corrects.
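These regimes can be demonstrated on the simplest possible loss, \(f(w)=\tfrac{1}{2}w^2\), whose gradient is \(w\), so the update is \(w \leftarrow (1-\alpha)w\). A minimal Python sketch (toy values chosen here for illustration):

```python
# Gradient descent on f(w) = 0.5 * w^2: update is w <- (1 - alpha) * w,
# so convergence requires |1 - alpha| < 1 (here, 0 < alpha < 2).
def descend(alpha, w0=1.0, steps=50):
    w = w0
    for _ in range(steps):
        w -= alpha * w            # gradient of 0.5*w^2 is w
    return abs(w)

too_small = descend(0.01)   # ~0.6: barely moved after 50 steps
good      = descend(0.5)    # ~1e-15: converged
too_large = descend(1.9)    # ~5e-3: oscillates in sign, converges slowly
diverging = descend(2.1)    # ~117: |1 - alpha| > 1, blows up

assert diverging > 1.0
assert good < too_large < too_small < diverging
```

In a real network the loss surface has many curvatures at once, so a single \(\alpha\) can be simultaneously too small for some directions and too large for others; that is the motivation for momentum and adaptive methods below.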
Momentum
Standard gradient descent treats each update independently. Gradient descent with momentum accumulates a moving average of past gradients — called the velocity — and adds it to the current update. This damps oscillations (the updates are smoother) and builds speed along consistent gradient directions (like a ball rolling downhill and gaining inertia):
\[\mathbf{v} \leftarrow \mu\,\mathbf{v} - \alpha\,\frac{\partial L}{\partial \mathbf{W}}, \qquad \mathbf{W} \leftarrow \mathbf{W} + \mathbf{v}\]
where \(\mathbf{v}\) is the velocity vector (same dimensions as \(\mathbf{W}\)), initialized to zero and updated each iteration. The momentum coefficient \(\mu \in [0,1)\), typically set to \(0.9\), controls how much of the previous velocity is retained. A higher \(\mu\) gives more "inertia." In MATLAB, traingdm implements gradient descent with momentum.
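A minimal Python sketch, assuming an ill-conditioned quadratic loss (curvatures 1 and 50, chosen here for illustration), shows the effect: with the same step count, momentum reaches a lower loss than plain gradient descent on the same surface:

```python
import numpy as np

# Plain GD vs. GD with momentum on f(w) = 0.5 * (w1^2 + 50 * w2^2).
H = np.array([1.0, 50.0])            # diagonal curvatures (illustrative)
alpha, mu, steps = 0.02, 0.9, 200
w0 = np.array([1.0, 1.0])

def loss(w):
    return 0.5 * np.sum(H * w ** 2)

# Plain gradient descent
w = w0.copy()
for _ in range(steps):
    w -= alpha * H * w               # gradient is H * w
loss_plain = loss(w)

# Momentum: v <- mu*v - alpha*grad;  w <- w + v
w, v = w0.copy(), np.zeros(2)
for _ in range(steps):
    v = mu * v - alpha * H * w
    w = w + v
loss_momentum = loss(w)

print(loss_plain, loss_momentum)
assert loss_momentum < loss_plain < loss(w0)
```

Plain GD is forced to use a small \(\alpha\) by the steep direction and therefore crawls along the shallow one; the velocity term lets momentum keep building speed in the shallow direction while its oscillations in the steep direction average out.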
Adaptive learning rates
A fixed learning rate requires careful tuning: what works for early training (when gradients are large) may be too large or too small later. Adaptive optimizers automatically adjust the learning rate for each parameter based on its gradient history:
| Optimizer | Key Idea | MATLAB |
|---|---|---|
| GD with momentum | Accumulate gradient history as velocity; damp oscillations | traingdm |
| RMSprop | Divide learning rate by a running average of squared gradients; parameters with large gradients get smaller steps | trainingOptions('rmsprop') |
| Adam | Combines momentum (first moment) and RMSprop (second moment); widely used default for deep learning | trainingOptions('adam') |
| SCG (default in patternnet) | Second-order method; uses curvature information to select step size automatically; no learning rate to tune | trainscg (default) |
trainscg (MATLAB's default) is often the best choice because it selects its step size automatically using second-order curvature information. For large datasets or deep architectures, Adam is the most common choice. Manual learning-rate tuning with traingdm is useful for understanding the effect of \(\alpha\), which is the focus of HW 36.
Training & Validation Curves
Monitoring the loss during training is how you diagnose whether your network is learning well. The key principle: split your data into training and validation sets, then track the loss on both throughout training. MATLAB's patternnet does this automatically.
How patternnet splits data
By default, patternnet randomly divides all input data into three subsets before training:
- Training (70%): Used to compute gradients and update weights each iteration.
- Validation (15%): Evaluated each epoch but never used for weight updates. Monitors for overfitting.
- Test (15%): Evaluated once, after training completes. Gives an unbiased final performance estimate.
Click Performance in the training GUI to see the loss on all three sets vs. epoch.
Four diagnostic patterns
Interpreting the patterns
| Panel | Pattern | Diagnosis | Remedy |
|---|---|---|---|
| 1 | Both curves high and flat | Underfitting — model too simple or learning rate too small | Add neurons/layers; increase learning rate; train longer |
| 2 | Both curves low and tracking closely | Good generalization — model fits well | No action needed |
| 3 | Val. loss decreases then rises (U-shape) | Overfitting onset — model has reached its generalization limit | Use early stopping; restore weights at the validation minimum |
| 4 | Training near zero, validation stays high throughout | Severe overfitting — large persistent gap; model memorizes training data | Reduce model size; add more data; add regularization |
Early stopping
Early stopping halts training when the validation loss has not improved for a specified number of consecutive epochs, then rolls back the weights to the best validation epoch. In MATLAB:
net.trainParam.max_fail = 6; % stop after 6 consecutive epochs without
% improvement in validation loss (default)
When max_fail triggers, MATLAB restores the weights from the best validation epoch — the dashed red line in Panel 3.
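The patience logic behind max_fail can be sketched in a few lines of Python (the validation-loss sequence below is synthetic, shaped like the U-curve of Panel 3):

```python
# Early stopping with patience: stop after `patience` consecutive epochs
# without validation improvement, keeping the best epoch's weights.
def early_stop(val_losses, patience=6):
    best_loss, best_epoch, fails = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, fails = loss, epoch, 0  # checkpoint here
        else:
            fails += 1
            if fails >= patience:
                break                                      # roll back to best
    return best_epoch, best_loss

# Validation loss decreases, then rises (overfitting begins at epoch 5)
vals = [1.0, 0.7, 0.5, 0.4, 0.35, 0.33, 0.36, 0.40,
        0.45, 0.50, 0.55, 0.60, 0.70]
epoch, loss = early_stop(vals, patience=6)
print(epoch, loss)        # 5 0.33
assert (epoch, loss) == (5, 0.33)
```

Training runs six epochs past the minimum before the patience counter fires, then the epoch-5 weights are the ones kept.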
Recommended Videos
3Blue1Brown's two-part backpropagation series. Watch Ch. 3 first for intuition, then Ch. 4 for the full calculus derivation:
Worked Example — Handwritten Digit Recognition
We train a shallow network on MATLAB's built-in DigitDataset (10,000 handwritten digit images, classes 0–9, 28×28 pixels). Rather than patternnet, we use the Deep Learning Toolbox layer API directly — the same approach you will use in HW 36 and beyond, giving you full control over architecture, activation functions, and optimizer settings.
Three parameters at the top of the script control the experiment: n_per_class (dataset size), n_hidden (network capacity), and learn_rate (optimizer step size). The defaults are chosen to produce poor results. Try adjusting them to see how each one affects the Training Progress curves before tackling HW 36.
%% Lesson 36 — Handwritten Digit Classification (Deep Learning Toolbox)
% Dataset: MATLAB's built-in DigitDataset (10,000 images, 10 classes, 28x28)
% Network: 784 → n_hidden (ReLU + Dropout) → 10 (softmax)
% HW 36: change n_per_class and n_hidden to explore underfitting/overfitting.
%% ── Parameters to experiment with ──────────────────────────────────────────
n_per_class = 100; % images loaded per digit class (max 1000; try 50–500)
n_hidden = 10; % neurons in the hidden layer (try 10, 64, 256, 512)
learn_rate = 0.05; % Adam learning rate (default 0.001; 0.05 is deliberately high)
%% ─────────────────────────────────────────────────────────────────────────
%% 1. Load dataset
digitPath = fullfile(matlabroot,'toolbox','nnet','nndemos', ...
'nndatasets','DigitDataset');
imds = imageDatastore(digitPath, ...
'IncludeSubfolders', true, 'LabelSource', 'foldernames');
rng(356);
imds_sub = splitEachLabel(imds, n_per_class, 'randomize');
fprintf('Loaded %d images (%d per class)\n', numel(imds_sub.Files), n_per_class);
%% 2. Split into training (80%) and test (20%)
[imds_train, imds_test] = splitEachLabel(imds_sub, 0.8, 'randomize');
%% 3. Flatten images: each 28x28 → 784-element row vector
% featureInputLayer expects N×features (rows = observations)
imgs_train = readall(imds_train);
v_train = cellfun(@(I) double(I(:))/255, imgs_train, 'UniformOutput', false);
X_train = [v_train{:}]'; % (0.8*n_per_class*10) × 784
imgs_test = readall(imds_test);
v_test = cellfun(@(I) double(I(:))/255, imgs_test, 'UniformOutput', false);
X_test = [v_test{:}]'; % (0.2*n_per_class*10) × 784
Y_train = imds_train.Labels; % categorical
Y_test = imds_test.Labels; % categorical
%% 4. Define the network architecture (layer-by-layer)
layers = [
featureInputLayer(784, 'Name', 'input')
fullyConnectedLayer(n_hidden, 'Name', 'fc1')
reluLayer( 'Name', 'relu1')
dropoutLayer(0.3, 'Name', 'drop1') % randomly zero 30% of
fullyConnectedLayer(10, 'Name', 'fc2') % activations each batch
softmaxLayer( 'Name', 'softmax')
classificationLayer( 'Name', 'output')
];
%% 5. Set training options
opts = trainingOptions('adam', ... % adaptive learning rate
'MaxEpochs', 50, ...
'MiniBatchSize', 64, ...
'InitialLearnRate', learn_rate, ...
'ValidationData', {X_test, Y_test}, ...
'ValidationFrequency', 5, ...
'ValidationPatience', 10, ... % early stopping
'Plots', 'training-progress', ...
'Verbose', false);
%% 6. Train
net = trainNetwork(X_train, Y_train, layers, opts);
% → Watch the Training Progress window.
% → With these defaults (lr=0.05, n_hidden=10, n_per_class=100) you should
% see overfitting: training accuracy climbs while validation lags or drops.
% Lower learn_rate to 0.001 and raise n_per_class / n_hidden to improve.
%% 7. Evaluate on test set
Y_pred = classify(net, X_test);
acc = 100 * mean(Y_pred == Y_test);
fprintf('n_per_class=%d n_hidden=%d lr=%.4f Test acc: %.1f%%\n', ...
n_per_class, n_hidden, learn_rate, acc);
%% 8. Visualize 25 random test examples in a 5×5 grid
% Note: test images are stored in class order, so taking the first 25
% would show mostly "0"s. Use randperm to get a representative mix.
idx = randperm(numel(imgs_test), 25);
figure('Name', 'Test Set Predictions', 'NumberTitle', 'off');
for k = 1:25
subplot(5,5,k);
imshow(imgs_test{idx(k)});
pred_str = char(Y_pred(idx(k)));
true_str = char(Y_test(idx(k)));
if strcmp(pred_str, true_str)
title(sprintf('Pred: %s', pred_str), 'Color', 'g', 'FontSize', 11);
else
title(sprintf('Pred: %s (T:%s)', pred_str, true_str), ...
'Color', 'r', 'FontSize', 11);
end
end
sgtitle('25 Random Test Predictions (green = correct | red = wrong)');
What does dropoutLayer(0.3) do? During each training mini-batch, dropout randomly sets 30% of the activations in that layer to zero — a different random subset each time. This forces the network to learn redundant, distributed representations rather than relying on a small number of neurons, which reduces overfitting. At test time dropout is automatically disabled and all activations are used (scaled to account for the dropped fraction during training). Dropout is one of the most widely used regularization techniques in deep learning.
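A common way to implement this is inverted dropout, which scales the surviving activations by 1/(1−p) during training so the expected activation is unchanged and no scaling is needed at test time. A minimal Python sketch (illustrative only; the toolbox handles this internally):

```python
import numpy as np

# Inverted dropout: zero a fraction p of activations during training and
# scale the survivors by 1/(1-p); at test time the layer is the identity.
rng = np.random.default_rng(0)

def dropout(a, p=0.3, training=True):
    if not training:
        return a
    mask = (rng.random(a.shape) >= p) / (1.0 - p)   # 0 or 1/(1-p)
    return a * mask

a = np.ones((1000, 64))
out_train = dropout(a, p=0.3, training=True)
out_test  = dropout(a, p=0.3, training=False)

frac_dropped = np.mean(out_train == 0)
print(frac_dropped)                       # ≈ 0.3
assert abs(frac_dropped - 0.3) < 0.02     # ~30% of units zeroed
assert np.allclose(out_test, a)           # identity at test time
assert abs(out_train.mean() - 1.0) < 0.02 # expected activation preserved
```

Because the survivors are rescaled, the layer's expected output matches the test-time output, so the rest of the network sees consistent statistics in both modes.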
Summary
| Concept | Key Idea |
|---|---|
| Backpropagation | Chain rule applied layer-by-layer backward; computes \(\partial L/\partial \mathbf{W}^{[\ell]}\) for every layer efficiently |
| L2 output delta | \(\delta^{[L]} = \hat{y} - y\) for linear output + MSE loss |
| Hidden delta | \(\delta^{[\ell]} = (\mathbf{W}^{[\ell+1]})^\top\delta^{[\ell+1]} \odot \phi'(\mathbf{z}^{[\ell]})\) |
| Mini-batch SGD | Use \(B\) random samples per update; noise improves speed, helps escape local minima, acts as regularization |
| Learning rate \(\alpha\) | Too small → slow; too large → oscillates or diverges; adaptive methods (Adam, SCG) remove the need to tune |
| Momentum | Velocity \(\mathbf{v}\) accumulates past gradients; damps oscillations and builds speed in consistent directions |
| Training & validation curves | Underfitting: both high; good fit: both low and close; overfitting: val. rises while train. falls |
| Early stopping | Halt when val. loss stops improving; restore best weights; prevents overfitting automatically |
References
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Ch. 6.5 (Backprop) & Ch. 8 (Optimization). MIT Press.
- 3Blue1Brown. (2017). What is backpropagation really doing? [Video]. YouTube.
- 3Blue1Brown. (2017). Backpropagation calculus [Video]. YouTube.
- Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.
- Prechelt, L. (1998). Early stopping — but when? In Neural Networks: Tricks of the Trade, Springer, pp. 55–69.
- Kingma, D. P. & Ba, J. (2015). Adam: A method for stochastic optimization. ICLR 2015.