Lesson 36 • Physics 356

Backpropagation & Training Neural Networks

⏱ ~50 min read

Learning Objectives

Notation used in this lesson

\(\delta^{[\ell]}\): Error signal at layer \(\ell\), the gradient of \(L\) w.r.t. the pre-activation \(z^{[\ell]}\)
\(\frac{\partial L}{\partial \mathbf{W}^{[\ell]}}\): Gradient of the loss w.r.t. the weight matrix at layer \(\ell\)
\(\phi'\): Derivative of the activation function
\(\odot\): Element-wise (Hadamard) product
\(\alpha\): Learning rate (controls the step size of each parameter update)
\(\mathbf{v}\): Velocity vector in momentum-based optimization
\(B\): Mini-batch size (number of training samples per gradient update)
epoch: One complete pass through the entire training dataset

🌍 Background & Motivation

Closing the loop

In Lesson 30 we introduced gradient descent: given the gradient of a loss function, we update parameters by stepping opposite the gradient. In Lesson 35 we built neural networks with multiple layers, each with its own weights \(\mathbf{W}^{[\ell]}\) and biases \(\mathbf{b}^{[\ell]}\). When we initialize a network, we often assign random weights and biases to each layer, which produce a poor output according to our loss function. The missing piece that we will address in this lesson is: how do we efficiently compute the gradient of the loss function with respect to every weight and bias in every layer, and how do we update those parameters to obtain better values that produce a lower loss?

Backpropagation is the answer. It is an efficient application of the chain rule from calculus, applied layer-by-layer from the output back to the input. Backpropagation is not a learning algorithm itself — it is a gradient computation algorithm. Gradient descent then uses those gradients to update parameters.

The full training loop for a neural network: (1) Forward pass → (2) Compute loss → (3) Backward pass (backpropagation) → (4) Gradient descent update → repeat until convergence.
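The four-step loop can be sketched on the simplest possible model, a single weight \(w\) fitting \(\hat{y} = wx\) with squared-error loss. This is an illustrative Python/NumPy sketch (the lesson's own examples use MATLAB); all names here (`w`, `x`, `y`, `alpha`) are made up for the demonstration.

```python
import numpy as np

# The four-step training loop on a one-parameter model: y_hat = w * x
# with squared-error loss L = 0.5 * (y - y_hat)^2.
x, y = 2.0, 6.0          # one training sample and its target
w, alpha = 0.5, 0.05     # initial weight, learning rate

losses = []
for step in range(20):
    y_hat = w * x                      # (1) forward pass
    loss = 0.5 * (y - y_hat) ** 2      # (2) compute loss
    grad = -(y - y_hat) * x            # (3) backward pass: dL/dw via chain rule
    w = w - alpha * grad               # (4) gradient descent update
    losses.append(loss)
```

Each iteration multiplies the remaining error by a constant factor less than one, so the loss shrinks toward zero and \(w\) approaches the exact solution \(w = y/x = 3\).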

Historical note & the AI boom

The backpropagation algorithm was popularized by Rumelhart, Hinton, and Williams in their landmark 1986 Nature paper, which demonstrated that multi-layer networks could be trained by propagating error signals backward through the network. Before this, there was no practical way to compute gradients for hidden layers efficiently, so deep networks could not be trained in practice.

Despite the 1986 breakthrough, true deep learning remained dormant for two decades due to computational and data limitations. The AI boom of the 2010s was ignited by three converging factors: massively parallel GPU hardware that made large-scale training feasible, internet-scale labeled datasets such as ImageNet, and practical algorithmic refinements such as ReLU activations and better weight initialization.

Key insight: Backpropagation did not change between 1986 and 2012. What changed was the hardware and data that unlocked its full potential.

🧠 The Chain Rule & Computational Graphs

Chain rule review

If our loss function \(L\) depends on an intermediate quantity \(z\), which in turn depends on a weight \(w\), the chain rule gives:

\[ \frac{dL}{dw} = \frac{dL}{dz} \cdot \frac{dz}{dw} \]

For a chain of functions \(L = f_n(f_{n-1}(\cdots f_1(w) \cdots))\):

\[ \frac{dL}{dw} = \frac{\partial f_n}{\partial f_{n-1}} \cdot \frac{\partial f_{n-1}}{\partial f_{n-2}} \cdots \frac{\partial f_1}{\partial w} \]

Backpropagation applies this rule systematically, layer by layer from output to input, accumulating gradient contributions at each step.
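The chain-of-functions formula is easy to verify numerically. The sketch below (illustrative Python; the functions `f1`, `f2`, `f3` are arbitrary choices, not from the lesson) multiplies the local derivatives and compares the product against a central finite difference.

```python
import numpy as np

# L = f3(f2(f1(w))): the chain rule says dL/dw is the product of the
# local derivatives, evaluated at the intermediate values.
f1, df1 = lambda w: w**2,      lambda w: 2*w
f2, df2 = lambda u: np.sin(u), lambda u: np.cos(u)
f3, df3 = lambda v: 3*v + 1,   lambda v: 3.0

def L(w):
    return f3(f2(f1(w)))

w = 0.7
# dL/dw = f3'(f2(f1(w))) * f2'(f1(w)) * f1'(w)
analytic = df3(f2(f1(w))) * df2(f1(w)) * df1(w)

eps = 1e-6
numeric = (L(w + eps) - L(w - eps)) / (2 * eps)   # central difference
```

Gradient checks like this one are the standard way to debug a hand-written backward pass.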

Computational graphs

A computational graph represents a mathematical expression as a directed graph: each node is an operation, each edge carries a value forward and a gradient backward.

Consider the simplest possible network: a single neuron with two inputs, no bias, ReLU activation, and L2 (MSE) loss:

x₁ · w₁ → a₁ ╲
        [Σ] → z → [ReLU] → ŷ → [½(y−ŷ)²] → L
x₂ · w₂ → a₂ ╱

The forward pass evaluates the network left-to-right for a given input, computing and storing all intermediate values, ending with the loss. The backward pass then propagates the derivative of the loss with respect to each node, \(\frac{\partial L}{\partial \text{node}}\), right-to-left using the chain rule. This is the backpropagation algorithm: it yields the gradient of the loss with respect to every parameter in the network. In other words, it tells us how much each parameter contributed to the loss, which is exactly what gradient descent needs to update the parameters.

Simple numerical walkthrough

Let's trace both passes through the single-neuron graph with concrete numbers.

Given:

\[ x_1 = 2,\quad x_2 = 1,\quad w_1 = 0.3,\quad w_2 = 0.1,\quad y = 2\;(\text{target}) \]

Forward pass:

\[ a_1 = w_1 x_1 = 0.6,\qquad a_2 = w_2 x_2 = 0.1 \] \[ z = a_1 + a_2 = 0.7,\qquad \hat{y} = \text{ReLU}(0.7) = 0.7 \] \[ L = \tfrac{1}{2}(y - \hat{y})^2 = \tfrac{1}{2}(2 - 0.7)^2 = \tfrac{1}{2}(1.69) = 0.845 \]

Backward pass — apply chain rule right-to-left:

\[ \frac{\partial L}{\partial \hat{y}} = -(y - \hat{y}) = -(2 - 0.7) = -1.3 \] \[ \frac{\partial \hat{y}}{\partial z} = 1\quad\text{(ReLU derivative, since } z=0.7 > 0\text{)} \] \[ \frac{\partial L}{\partial z} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = -1.3 \times 1 = -1.3 \] \[ \frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial z}\cdot\frac{\partial z}{\partial w_1} = -1.3\times x_1 = -1.3\times 2 = -2.6 \] \[ \frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial z}\cdot\frac{\partial z}{\partial w_2} = -1.3\times x_2 = -1.3\times 1 = -1.3 \]

Gradient descent update with \(\alpha = 0.1\):

\[ w_1^{\text{new}} = 0.3 - 0.1(-2.6) = 0.56,\qquad w_2^{\text{new}} = 0.1 - 0.1(-1.3) = 0.23 \]

Verify loss decreased:

\[ z^{\text{new}} = 0.56(2) + 0.23(1) = 1.35,\quad \hat{y}^{\text{new}} = 1.35,\quad L^{\text{new}} = \tfrac{1}{2}(2-1.35)^2 = 0.211 \]
It works! One gradient descent step reduced the loss from \(0.845\) to \(0.211\). The negative gradients indicated we should increase both weights to bring \(\hat{y}\) closer to \(y=2\), which is exactly what happened.
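The whole walkthrough fits in a few lines of code. This is an illustrative Python sketch (the lesson's tooling is MATLAB); the same numbers fall out of the same chain-rule steps.

```python
# Single neuron, two inputs, no bias, ReLU activation, L2 loss.
x1, x2, w1, w2, y = 2.0, 1.0, 0.3, 0.1, 2.0
alpha = 0.1

# Forward pass
z = w1 * x1 + w2 * x2             # 0.7
y_hat = max(z, 0.0)               # ReLU
L = 0.5 * (y - y_hat) ** 2        # 0.845

# Backward pass: chain rule, right to left
dL_dyhat = -(y - y_hat)           # -1.3
dyhat_dz = 1.0 if z > 0 else 0.0  # ReLU derivative (gate)
dL_dz = dL_dyhat * dyhat_dz       # -1.3
dL_dw1 = dL_dz * x1               # -2.6
dL_dw2 = dL_dz * x2               # -1.3

# Gradient descent step
w1 -= alpha * dL_dw1              # 0.56
w2 -= alpha * dL_dw2              # 0.23

# Verify the loss decreased
z_new = w1 * x1 + w2 * x2
L_new = 0.5 * (y - max(z_new, 0.0)) ** 2
```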

🔀 Backpropagation in a Multi-Layer Network

The single-neuron example above traced the chain rule through 4 operations. A multi-layer network is the same idea — just more links in the chain. The key advantage of backpropagation is that once you compute each error signal \(\delta^{[\ell]}\), you can reuse it to compute both the weight gradient and the error signal for the layer below.

Two-layer network with L2 loss

Consider: \(\mathbf{x} \xrightarrow{[\mathbf{W}^{[1]},\,\mathbf{b}^{[1]}]} \text{ReLU} \xrightarrow{[\mathbf{W}^{[2]},\,b^{[2]}]} \hat{y} \xrightarrow{} L\)

Forward pass:

\[ \mathbf{z}^{[1]} = \mathbf{W}^{[1]}\mathbf{x} + \mathbf{b}^{[1]},\qquad \mathbf{a}^{[1]} = \text{ReLU}\!\left(\mathbf{z}^{[1]}\right) \] \[ \hat{y} = z^{[2]} = \mathbf{W}^{[2]}\mathbf{a}^{[1]} + b^{[2]}\qquad\text{(linear output)} \] \[ L = \tfrac{1}{2}(y - \hat{y})^2\qquad\text{(L2 / MSE loss)} \]

Backward pass — deriving each error signal step by step:

\[ \delta^{[2]} = \frac{\partial L}{\partial z^{[2]}} = \frac{\partial L}{\partial \hat{y}}\cdot\frac{\partial \hat{y}}{\partial z^{[2]}} = -(y-\hat{y})\cdot 1 = \hat{y} - y \]
\[ \frac{\partial L}{\partial \mathbf{W}^{[2]}} = \delta^{[2]}\,(\mathbf{a}^{[1]})^\top,\qquad \frac{\partial L}{\partial b^{[2]}} = \delta^{[2]} \]
\[ \delta^{[1]} = \frac{\partial L}{\partial \mathbf{z}^{[1]}} = \underbrace{(\mathbf{W}^{[2]})^\top\delta^{[2]}}_{\text{backprop through }\mathbf{W}^{[2]}} \;\odot\; \underbrace{\mathbf{1}[\mathbf{z}^{[1]}>0]}_{\text{ReLU gate}} \]
\[ \frac{\partial L}{\partial \mathbf{W}^{[1]}} = \delta^{[1]}\,\mathbf{x}^\top,\qquad \frac{\partial L}{\partial \mathbf{b}^{[1]}} = \delta^{[1]} \]
Key pattern: Each layer's error signal \(\delta^{[\ell]}\) is computed from the layer above's signal \(\delta^{[\ell+1]}\) multiplied back through the weight matrix, then gated by the activation derivative. This is the "backpropagated error" interpretation — each layer receives its share of blame from the layer above.
Note on classification: For multi-class classification with softmax output and cross-entropy loss, the formula simplifies to \(\delta^{[L]} = \hat{\mathbf{p}} - \mathbf{y}\) (predicted probabilities minus one-hot target). The remainder of the backprop equations are identical. MATLAB's patternnet handles this automatically.
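The classification shortcut \(\delta^{[L]} = \hat{\mathbf{p}} - \mathbf{y}\) can be confirmed numerically. Below is an illustrative Python/NumPy sketch (random logits, a one-hot target) that compares the shortcut against finite differences of the cross-entropy loss.

```python
import numpy as np

# Check delta^[L] = softmax(z) - y for cross-entropy loss.
rng = np.random.default_rng(0)
z = rng.normal(size=5)               # pre-activations (logits)
y = np.zeros(5); y[2] = 1.0          # one-hot target, class 2

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

def loss(z):
    return -np.sum(y * np.log(softmax(z)))

delta = softmax(z) - y               # the shortcut formula

eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(5)[i]) - loss(z - eps * np.eye(5)[i])) / (2 * eps)
    for i in range(5)
])
```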

The ReLU derivative

\(\text{ReLU}'(z) = \mathbf{1}[z>0]\) — it is 1 where the neuron was active during the forward pass, and 0 where it was not. The term \(\odot\,\mathbf{1}[\mathbf{z}^{[1]}>0]\) in \(\delta^{[1]}\) "gates" the gradient: only neurons that fired receive gradient signal. Neurons that did not fire have zero gradient — they receive no update this iteration.

Dying ReLU: If a neuron's pre-activation is always negative across all training samples, its gradient is always 0 — it never updates and effectively dies. Careful weight initialization (e.g., Xavier or He) and moderate learning rates prevent this.

The general backpropagation algorithm

For a network with \(L\) layers and a mini-batch of \(B\) examples:

  1. Forward pass: Compute and store \(\mathbf{z}^{[\ell]}\) and \(\mathbf{a}^{[\ell]}\) for all \(\ell=1,\ldots,L\). Storage is necessary — the backward pass reuses these values.
  2. Compute output error signal: \(\delta^{[L]} = \nabla_{z^{[L]}} L\) (depends on loss and output activation choice).
  3. Backpropagate from \(\ell=L-1\) down to \(1\): \(\;\delta^{[\ell]} = (\mathbf{W}^{[\ell+1]})^\top\delta^{[\ell+1]} \odot \phi'(\mathbf{z}^{[\ell]})\).
  4. Compute parameter gradients: \(\;\dfrac{\partial L}{\partial \mathbf{W}^{[\ell]}} = \dfrac{1}{B}\,\delta^{[\ell]}(\mathbf{A}^{[\ell-1]})^\top\), and \(\dfrac{\partial L}{\partial \mathbf{b}^{[\ell]}}\) as the mean of \(\delta^{[\ell]}\) over the batch.
  5. Update parameters with gradient descent: \(\;\mathbf{W}^{[\ell]} \leftarrow \mathbf{W}^{[\ell]} - \alpha\,\dfrac{\partial L}{\partial \mathbf{W}^{[\ell]}}\).
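The steps above can be traced in a short NumPy sketch (illustrative Python rather than the lesson's MATLAB). A two-layer ReLU network with a batch of \(B\) examples stands in for the general case; one entry of \(\partial L/\partial \mathbf{W}^{[1]}\) is checked against a finite difference. All array names and sizes are made up.

```python
import numpy as np

# Two-layer ReLU network, mini-batch of B examples in the columns of X.
rng = np.random.default_rng(1)
B, n_in, n_hid = 8, 4, 5
X = rng.normal(size=(n_in, B))
Y = rng.normal(size=(1, B))                     # regression targets
W1, b1 = rng.normal(size=(n_hid, n_in)) * 0.5, np.zeros((n_hid, 1))
W2, b2 = rng.normal(size=(1, n_hid)) * 0.5, np.zeros((1, 1))

def forward(W1):
    Z1 = W1 @ X + b1                            # step 1: store pre-activations
    A1 = np.maximum(Z1, 0)
    Z2 = W2 @ A1 + b2
    L = 0.5 * np.mean((Z2 - Y) ** 2)            # MSE averaged over the batch
    return Z1, A1, Z2, L

Z1, A1, Z2, L = forward(W1)
delta2 = (Z2 - Y) / B                           # step 2: output error signal
delta1 = (W2.T @ delta2) * (Z1 > 0)             # step 3: backprop + ReLU gate
dW1 = delta1 @ X.T                              # step 4: weight gradient
# step 5 would be: W1 -= alpha * dW1

# Finite-difference check on one entry of W1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
numeric = (forward(W1p)[3] - forward(W1m)[3]) / (2 * eps)
```

Note that the \(1/B\) averaging is folded into `delta2`, so the later matrix products need no extra scaling.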

🎲 Stochastic Gradient Descent (SGD)

What does "stochastic" mean?

In Lesson 30, gradient descent computed the exact gradient using all \(N\) training samples. In machine learning, \(N\) can be millions — computing the exact gradient over the entire dataset for every update is prohibitively expensive.

Stochastic means random. Rather than using all \(N\) samples, we randomly select a small subset (\(B\) samples, called a mini-batch) and compute an approximate gradient from that subset. We then take a step using this noisy approximation and move on to the next random mini-batch.

What is the "noise" and how does it help?

The noise is the difference between the true gradient (computed over all \(N\) samples) and the estimated gradient (computed over \(B\) samples). Since the mini-batch is a random sample, the estimated gradient is a random variable — it points roughly in the right direction but not exactly.
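A quick experiment makes the "random variable" claim concrete: averaged over many random mini-batches, the mini-batch gradient matches the full-batch gradient, but any single estimate is noisy. This is an illustrative Python/NumPy sketch on a linear least-squares problem; the data and sizes are made up.

```python
import numpy as np

# Mini-batch gradients are noisy but unbiased estimates of the full gradient.
rng = np.random.default_rng(2)
N, B = 1000, 32
x = rng.normal(size=N)
y = 3.0 * x + rng.normal(scale=0.1, size=N)     # noisy linear data
w = 1.0                                         # current parameter value

def grad(idx):
    r = w * x[idx] - y[idx]                     # residuals on chosen samples
    return np.mean(r * x[idx])                  # d/dw of 0.5 * mean(residual^2)

full = grad(np.arange(N))                                     # exact gradient
estimates = [grad(rng.choice(N, size=B, replace=False))       # noisy estimates
             for _ in range(2000)]
```

The spread of `estimates` around `full` is exactly the "noise" of SGD; shrinking \(B\) widens it, growing \(B\) narrows it.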

Surprisingly, this noise is often beneficial in three ways: each update is far cheaper, so training makes much faster progress per unit of computation; the noise can kick the parameters out of shallow local minima and past saddle points; and it acts as an implicit regularizer, often improving generalization.

The three variants

Variant | Gradient computed over | Updates per epoch | Noise level
GD (full batch) | All \(N\) training samples | 1 | None (exact gradient)
SGD (pure stochastic) | 1 random sample | \(N\) | Very high
Mini-batch SGD | \(B\) random samples (\(B \approx 32\text{–}256\)) | \(\lfloor N/B\rfloor\) | Moderate (best trade-off)
Terminology note: In practice, "SGD" usually refers to mini-batch SGD. Pure single-sample SGD is rarely used in modern deep learning. One epoch = one full pass through the training data = \(\lfloor N/B \rfloor\) mini-batch updates.
MATLAB's patternnet: By default it uses full-batch gradient descent with a second-order optimizer (scaled conjugate gradient), which processes all training samples each iteration. This is appropriate for our small datasets (hundreds of samples). For the neural network literature and large-scale deep learning, mini-batch SGD is standard.

⚙️ The Learning Rate

The learning rate \(\alpha\) controls the step size at each parameter update:

\[ \mathbf{W} \leftarrow \mathbf{W} - \alpha\,\frac{\partial L}{\partial \mathbf{W}} \]

This is exactly the gradient descent update rule from Lesson 30 (where we called it \(\gamma\)). The same trade-offs apply — but they are amplified in neural networks because the loss surface is high-dimensional, non-convex, and full of saddle points and narrow valleys.

Four regimes

[Plot: training loss vs. epoch for four learning rates: too small (α ~ 10⁻³), good (α ~ 5×10⁻²), too large (α ~ 0.5), diverges (α ~ 2).]
Figure 1. Effect of learning rate on training loss. A good learning rate gives a smooth, rapid decrease (green). Too small converges but extremely slowly (blue). Too large causes oscillation (orange) or divergence (red).
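The four regimes are easy to reproduce on the 1-D quadratic \(L(w) = \tfrac{1}{2}cw^2\): each gradient descent step multiplies \(w\) by \((1 - \alpha c)\), so the loss diverges once \(\alpha > 2/c\). An illustrative Python sketch (the specific α values are chosen for the demonstration, not taken from Figure 1):

```python
# Gradient descent on L(w) = 0.5 * c * w^2; the update is w *= (1 - alpha*c).
def final_loss(alpha, steps=50, c=1.0, w0=2.0):
    w = w0
    for _ in range(steps):
        w -= alpha * c * w            # gradient of 0.5*c*w^2 is c*w
    return 0.5 * c * w ** 2

slow    = final_loss(0.001)   # too small: barely moves in 50 steps
good    = final_loss(0.5)     # good: rapid convergence
osc     = final_loss(1.9)     # too large: oscillates, converges slowly
diverge = final_loss(2.5)     # diverges: |1 - alpha*c| > 1
```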

Momentum

Standard gradient descent treats each update independently. Gradient descent with momentum accumulates a moving average of past gradients — called the velocity — and adds it to the current update. This damps oscillations (the updates are smoother) and builds speed along consistent gradient directions (like a ball rolling downhill and gaining inertia):

\[ \mathbf{v} \leftarrow \mu\,\mathbf{v} + \alpha\,\nabla_{\mathbf{W}} L, \qquad \mathbf{W} \leftarrow \mathbf{W} - \mathbf{v} \]

where \(\mathbf{v}\) is the velocity vector (same dimensions as \(\mathbf{W}\)), initialized to zero and updated each iteration. The momentum coefficient \(\mu \in [0,1)\), typically set to \(0.9\), controls how much of the previous velocity is retained. A higher \(\mu\) gives more "inertia." In MATLAB, traingdm implements gradient descent with momentum.
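The two-line momentum update from the equations above, traced on the 1-D quadratic \(L(w) = \tfrac{1}{2}w^2\) (an illustrative Python sketch; the values of \(\alpha\) and \(\mu\) are typical choices, not prescribed by the lesson):

```python
# Gradient descent with momentum on L(w) = 0.5 * w^2.
w, v = 2.0, 0.0                  # parameter and velocity (v starts at zero)
alpha, mu = 0.1, 0.9             # learning rate and momentum coefficient
history = []
for _ in range(100):
    g = w                        # gradient of 0.5*w^2
    v = mu * v + alpha * g       # velocity accumulates past gradients
    w = w - v                    # parameter update uses the velocity
    history.append(w)
```

With \(\mu = 0.9\) the iterate overshoots the minimum and oscillates (the "inertia"), but the envelope of the oscillation decays steadily toward zero.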

Adaptive learning rates

A fixed learning rate requires careful tuning: what works for early training (when gradients are large) may be too large or too small later. Adaptive optimizers automatically adjust the learning rate for each parameter based on its gradient history:

Optimizer | Key idea | MATLAB
GD with momentum | Accumulate gradient history as velocity; damp oscillations | traingdm
RMSprop | Divide the learning rate by a running average of squared gradients; parameters with large gradients get smaller steps | trainNetwork only
Adam | Combines momentum (first moment) and RMSprop (second moment); a widely used default for deep learning | trainNetwork with trainingOptions('adam')
SCG (default in patternnet) | Second-order method; uses curvature information to select the step size automatically; no learning rate to tune | trainscg (default)
Practical guideline: For small datasets (hundreds of samples), trainscg (MATLAB's default) is often the best choice because it selects its step size automatically using second-order curvature information. For large datasets or deep architectures, Adam is the most common choice. Manual learning-rate tuning with traingdm is useful for understanding the effect of \(\alpha\), which is the focus of HW 36.
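For reference, the core Adam update (Kingma & Ba, 2015) is only a few lines. Below is an illustrative Python sketch on the 1-D quadratic \(L(w) = \tfrac{1}{2}w^2\); the hyperparameter values are the paper's defaults except for the learning rate, which is chosen for this toy problem.

```python
import numpy as np

# Bare-bones Adam on L(w) = 0.5 * w^2.
w = 2.0
m, v = 0.0, 0.0                              # first and second moment estimates
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 201):                      # t starts at 1 for bias correction
    g = w                                    # gradient of 0.5*w^2
    m = beta1 * m + (1 - beta1) * g          # momentum-like first moment
    v = beta2 * v + (1 - beta2) * g ** 2     # RMSprop-like second moment
    m_hat = m / (1 - beta1 ** t)             # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    w -= alpha * m_hat / (np.sqrt(v_hat) + eps)
```

Because the step is normalized by \(\sqrt{\hat{v}}\), its size is roughly \(\alpha\) regardless of the raw gradient magnitude; this is what makes Adam robust to poorly scaled parameters.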

📈 Training & Validation Curves

Monitoring the loss during training is how you diagnose whether your network is learning well. The key principle: split your data into training and validation sets, then track the loss on both throughout training. MATLAB's patternnet does this automatically.

How patternnet splits data

By default, patternnet randomly divides the input data into three subsets before training: 70% for training (used to compute gradients and update weights), 15% for validation (used to monitor generalization and trigger early stopping), and 15% for testing (held out entirely until after training).

Click Performance in the training GUI to see the loss on all three sets vs. epoch.

Four diagnostic patterns

[Figure: four panels of training/validation loss curves: underfitting (both high and flat), good generalization (both low, tracking closely), overfitting onset (validation rises after the best epoch), severe overfitting (training low, validation stays high).]
Figure 2. Four diagnostic patterns in the performance plot. Underfitting: model too simple. Good generalization: both curves low and close. Overfitting onset: validation rises after a minimum — stop here. Severe overfitting: training near zero but validation stays high throughout.

Interpreting the patterns

Panel | Pattern | Diagnosis | Remedy
1 | Both curves high and flat | Underfitting: model too simple or learning rate too small | Add neurons/layers; increase learning rate; train longer
2 | Both curves low and tracking closely | Good generalization: model fits well | No action needed
3 | Val. loss decreases then rises (U-shape) | Overfitting onset: model has reached its generalization limit | Use early stopping; restore weights at the validation minimum
4 | Training near zero, validation stays high throughout | Severe overfitting: large persistent gap; model memorizes training data | Reduce model size; add more data; add regularization

Early stopping

Early stopping halts training when the validation loss has not improved for a specified number of consecutive epochs, then rolls back the weights to the best validation epoch. In MATLAB:

net.trainParam.max_fail = 6;   % stop after 6 consecutive epochs without
                               % improvement in validation loss (default)
Why early stopping is powerful: It is automatic (no extra hyperparameters to tune), free (no extra computation), and directly prevents the overfitting pattern shown in Figure 2. When max_fail triggers, MATLAB restores the weights from the best validation epoch — the dashed red line in Panel 3.
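The patience logic behind max_fail is simple enough to sketch directly. This illustrative Python snippet uses a made-up U-shaped validation-loss sequence; "restoring the weights" is reduced to remembering the best epoch.

```python
# Early stopping with patience, mirroring MATLAB's max_fail: stop after
# `patience` consecutive epochs with no new best validation loss, then
# restore the weights saved at the best epoch.
val_losses = [1.0, 0.7, 0.5, 0.42, 0.40, 0.41, 0.43, 0.45, 0.48, 0.52, 0.57]

patience = 3
best_loss, best_epoch, fails = float("inf"), -1, 0
stopped_at = None
for epoch, vl in enumerate(val_losses):
    if vl < best_loss:
        best_loss, best_epoch, fails = vl, epoch, 0   # new best: save weights
    else:
        fails += 1                                    # one more epoch w/o improvement
        if fails >= patience:
            stopped_at = epoch                        # halt training here
            break
```

Training halts three epochs past the validation minimum, and the weights from the minimum (epoch 4 in this made-up sequence) are the ones kept.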

Recommended Videos

3Blue1Brown's two-part backpropagation series. Watch Ch. 3 first for intuition, then Ch. 4 for the full calculus derivation:

💻 Worked Example — Handwritten Digit Recognition

We train a shallow network on MATLAB's built-in DigitDataset (10,000 handwritten digit images, classes 0–9, 28×28 pixels). Rather than patternnet, we use the Deep Learning Toolbox layer API directly — the same approach you will use in HW 36 and beyond, giving you full control over architecture, activation functions, and optimizer settings.

Three parameters at the top of the script control the experiment: n_per_class (dataset size), n_hidden (network capacity), and learn_rate (optimizer step size). The defaults are deliberately chosen to produce poor results. Try adjusting them to see how each one affects the Training Progress curves before tackling HW 36.

%% Lesson 36 — Handwritten Digit Classification (Deep Learning Toolbox)
% Dataset: MATLAB's built-in DigitDataset (10,000 images, 10 classes, 28x28)
% Network: 784 → n_hidden (ReLU + Dropout) → 10 (softmax)
% HW 36: change n_per_class and n_hidden to explore underfitting/overfitting.

%% ── Parameters to experiment with ──────────────────────────────────────────
n_per_class = 100;    % images loaded per digit class  (max 1000; try 50–500)
n_hidden    = 10;     % neurons in the hidden layer     (try 10, 64, 256, 512)
learn_rate  = 0.05;   % Adam learning rate (default 0.001; high → overfits)
%% ─────────────────────────────────────────────────────────────────────────

%% 1. Load dataset
digitPath = fullfile(matlabroot,'toolbox','nnet','nndemos', ...
                     'nndatasets','DigitDataset');
imds = imageDatastore(digitPath, ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
rng(356);
imds_sub = splitEachLabel(imds, n_per_class, 'randomize');
fprintf('Loaded %d images (%d per class)\n', numel(imds_sub.Files), n_per_class);

%% 2. Split into training (80%) and test (20%)
[imds_train, imds_test] = splitEachLabel(imds_sub, 0.8, 'randomize');

%% 3. Flatten images: each 28x28 → 784-element row vector
%  featureInputLayer expects N×features (rows = observations)
imgs_train = readall(imds_train);
v_train    = cellfun(@(I) double(I(:))/255, imgs_train, 'UniformOutput', false);
X_train    = [v_train{:}]';          % (0.8*n_per_class*10) × 784

imgs_test  = readall(imds_test);
v_test     = cellfun(@(I) double(I(:))/255, imgs_test,  'UniformOutput', false);
X_test     = [v_test{:}]';           % (0.2*n_per_class*10) × 784

Y_train = imds_train.Labels;         % categorical
Y_test  = imds_test.Labels;          % categorical

%% 4. Define the network architecture (layer-by-layer)
layers = [
    featureInputLayer(784,        'Name', 'input')
    fullyConnectedLayer(n_hidden, 'Name', 'fc1')
    reluLayer(                    'Name', 'relu1')
    dropoutLayer(0.3,             'Name', 'drop1')   % randomly zero 30% of
    fullyConnectedLayer(10,       'Name', 'fc2')     %   activations each batch
    softmaxLayer(                 'Name', 'softmax')
    classificationLayer(          'Name', 'output')
];

%% 5. Set training options
opts = trainingOptions('adam', ...           % adaptive learning rate
    'MaxEpochs',           50, ...
    'MiniBatchSize',       64, ...
    'InitialLearnRate',    learn_rate, ...
    'ValidationData',      {X_test, Y_test}, ...
    'ValidationFrequency', 5, ...
    'ValidationPatience',  10, ...           % early stopping
    'Plots',               'training-progress', ...
    'Verbose',             false);

%% 6. Train
net = trainNetwork(X_train, Y_train, layers, opts);
% → Watch the Training Progress window.
% → With these defaults (lr=0.05, n_hidden=10, n_per_class=100) you should
%   see overfitting: training accuracy climbs while validation lags or drops.
%   Lower learn_rate to 0.001 and raise n_per_class / n_hidden to improve.

%% 7. Evaluate on test set
Y_pred = classify(net, X_test);
acc    = 100 * mean(Y_pred == Y_test);
fprintf('n_per_class=%d  n_hidden=%d  lr=%.4f  Test acc: %.1f%%\n', ...
    n_per_class, n_hidden, learn_rate, acc);

%% 8. Visualize 25 random test examples in a 5×5 grid
%  Note: test images are stored in class order, so taking the first 25
%  would show mostly "0"s.  Use randperm to get a representative mix.
idx = randperm(numel(imgs_test), 25);
figure('Name', 'Test Set Predictions', 'NumberTitle', 'off');
for k = 1:25
    subplot(5,5,k);
    imshow(imgs_test{idx(k)});
    pred_str = char(Y_pred(idx(k)));
    true_str = char(Y_test(idx(k)));
    if strcmp(pred_str, true_str)
        title(sprintf('Pred: %s', pred_str), 'Color', 'g', 'FontSize', 11);
    else
        title(sprintf('Pred: %s (T:%s)', pred_str, true_str), ...
              'Color', 'r', 'FontSize', 11);
    end
end
sgtitle('25 Random Test Predictions  (green = correct  |  red = wrong)');
What does dropoutLayer(0.3) do? During each training mini-batch, dropout randomly sets 30% of the activations in that layer to zero — a different random subset each time. This forces the network to learn redundant, distributed representations rather than relying on a small number of neurons, which reduces overfitting. At test time dropout is automatically disabled and all activations are used (scaled to account for the dropped fraction during training). Dropout is one of the most widely used regularization techniques in deep learning.
Why flatten instead of using raw images? A fully-connected layer treats each pixel as an independent feature — it does not exploit the spatial structure of the image. Convolutional Neural Networks (Lesson 37) solve this by applying shared filters that detect local patterns regardless of position.

📝 Summary

Concept | Key idea
Backpropagation | Chain rule applied layer-by-layer backward; computes \(\partial L/\partial \mathbf{W}^{[\ell]}\) for every layer efficiently
L2 output delta | \(\delta^{[L]} = \hat{y} - y\) for linear output + MSE loss
Hidden delta | \(\delta^{[\ell]} = (\mathbf{W}^{[\ell+1]})^\top\delta^{[\ell+1]} \odot \phi'(\mathbf{z}^{[\ell]})\)
Mini-batch SGD | Use \(B\) random samples per update; noise improves speed, helps escape local minima, acts as regularization
Learning rate \(\alpha\) | Too small → slow; too large → oscillates or diverges; adaptive methods (Adam, SCG) remove the need to tune
Momentum | Velocity \(\mathbf{v}\) accumulates past gradients; damps oscillations and builds speed in consistent directions
Training & validation curves | Underfitting: both high; good fit: both low and close; overfitting: val. rises while train. falls
Early stopping | Halt when val. loss stops improving; restore best weights; prevents overfitting automatically
📋 HW 36 — Training Parameter Exploration (DigitDataset)

📚 References

  1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Ch. 6.5 (Backprop) & Ch. 8 (Optimization). MIT Press.
  2. 3Blue1Brown. (2017). What is backpropagation really doing? [Video]. YouTube. youtube.com
  3. 3Blue1Brown. (2017). Backpropagation calculus [Video]. YouTube. youtube.com
  4. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.
  5. Prechelt, L. (1998). Early stopping — but when? In Neural Networks: Tricks of the Trade, Springer, pp. 55–69.
  6. Kingma, D. P. & Ba, J. (2015). Adam: A method for stochastic optimization. ICLR 2015.