Lesson 37 • Physics 356

Convolutional Neural Networks (CNNs)

⏱ ~40 min read

Learning Objectives

Notation used in this lesson

\(\mathbf{X}\): input volume, size \(H \times W \times C\)
\(\mathbf{W}_k\): weight tensor of filter \(k\), size \(F \times F \times C\)
\(b_k\): bias for filter \(k\)
\(F,\;S,\;P\): filter size, stride, zero-padding
\(K\): number of filters (output depth)
\(\mathbf{A}[i,j,k]\): output activation at row \(i\), column \(j\), filter \(k\)

🌍 Background & Motivation

Why fully-connected networks struggle with images

In Lessons 35–36 we built fully-connected networks that treat every input feature identically. For image data this is a problem: a 28×28 grayscale image has 784 features, which is manageable, but a 256×256 RGB image has 196,608. A single hidden FC layer with 1,024 neurons would then require over 200 million weights — before even reaching the output layer.

Even ignoring the parameter count, there is a deeper issue: FC layers have no notion of spatial structure. A pixel at position (0, 0) is treated identically to one at (127, 127). But meaningful image features — edges, textures, corners — are local and they appear at many positions. There is no reason to learn a separate detector for an edge in the top-left versus the top-right of the image.

Convolutional Neural Networks (CNNs) encode two structural assumptions that fix both problems: local connectivity (each neuron looks at only a small spatial patch, so parameter count scales with filter size, not image size) and parameter sharing (the same filter weights are applied at every position, so a detector learned once works everywhere in the image).

The historical breakthrough

The idea of convolutional networks dates to Yann LeCun's LeNet (1998), trained on handwritten digit recognition — the same task you are tackling with the DigitDataset in this course. For over a decade, CNNs were considered too computationally expensive to scale.

In 2012, Krizhevsky, Sutskever, and Hinton trained AlexNet on GPUs and entered the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Their top-5 error rate was 15.3% — more than 10 percentage points better than the next competitor. The result was a shock to the computer vision community and launched the modern deep learning era. By 2015 the best models (ResNet, 152 layers) had surpassed human-level performance on the same benchmark.

Depth matters: The progression was rapid — 8 layers (AlexNet, 2012) → 19 layers (VGGNet, 2014) → 22 layers (GoogLeNet, 2014) → 152 layers (ResNet, 2015). Each jump brought major accuracy gains, but only became possible as researchers solved the vanishing gradient problem with better initialization, batch normalization, and residual connections.

Physics applications

CNNs are now standard tools across physics: galaxy morphology classification and gravitational lens detection in astrophysics; phase identification from diffraction patterns in materials science; SAR image interpretation and GEO object classification in space domain awareness; and particle track reconstruction in high-energy physics.

🧮 The Convolution Operation

Mathematical definition

A convolutional layer applies \(K\) filters in parallel. Each filter \(\mathbf{W}_k\) has size \(F \times F \times C\) (spatial extent \(F\), spanning all \(C\) input channels). The filter slides across the input, computing a dot product at each position. The output activation at spatial position \((i, j)\) for filter \(k\) is:

\[ \mathbf{A}[i,\,j,\,k] \;=\; \phi\!\left( \sum_{di=0}^{F-1}\sum_{dj=0}^{F-1}\sum_{dc=0}^{C-1} \mathbf{X}[\,i{\cdot}S+di,\;\;j{\cdot}S+dj,\;\;dc\,] \cdot \mathbf{W}_k[\,di,\;dj,\;dc\,] \;+\; b_k \right) \]

where \(\phi\) is the activation function (typically ReLU), \(S\) is the stride, and \(b_k\) is a scalar bias. The three nested sums simply compute a dot product between a local \(F \times F \times C\) patch of the input and the filter — then apply bias and activation.
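To make the triple sum concrete, here is a minimal sketch in plain Python (the lesson's own code is MATLAB; Python is used here only as a language-neutral illustration). The input and filter values are the 5×5 input and the gold diagonal+center filter from the sliding-filter example later in this section.

```python
def conv_activation(X, W, b, i, j, S=1):
    """Output activation A[i,j] for one filter W (F x F x C) at stride S, with ReLU.
    X is indexed [row][col][channel]; the three loops mirror the triple sum."""
    F = len(W)
    C = len(W[0][0])
    z = b
    for di in range(F):
        for dj in range(F):
            for dc in range(C):
                z += X[i * S + di][j * S + dj][dc] * W[di][dj][dc]
    return max(0.0, z)   # ReLU

# 5x5 single-channel input (the cross/diamond pattern from the figure)
img = [[0, 0, 1, 0, 0],
       [0, 1, 1, 1, 0],
       [1, 1, 1, 1, 1],
       [0, 1, 1, 1, 0],
       [0, 0, 1, 0, 0]]
X = [[[v] for v in row] for row in img]      # shape 5 x 5 x 1

# Filter k=1 from the figure: diagonal + center detector
Wk = [[[1], [0], [1]],
      [[0], [1], [0]],
      [[1], [0], [1]]]

A00 = conv_activation(X, Wk, b=0.0, i=0, j=0)   # 4.0, top-left of the gold map
```

The top-left activation comes out to 4 and the center activation (i=1, j=1) to 5, matching the gold output map [4,3,4 / 3,5,3 / 4,3,4] in the figure.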

What are channels?

The depth dimension \(C\) of the input volume is called the number of channels. Its meaning depends on where you are in the network:

Location | \(C\) | What each channel represents
First layer input, color image | 3 | Red, Green, Blue pixel intensities. A 480×640 photo is a \(480\times640\times3\) volume; each of the 3 planes holds one color channel.
First layer input, grayscale image | 1 | A single intensity value per pixel. The DigitDataset images are \(28\times28\times1\).
First layer input, multispectral / physics data | \(C\) | Any set of co-registered 2-D measurements: radar bands, spectral wavelengths, polarization states, time steps of a field, and so on. CNNs treat them identically to RGB.
Hidden layer output | \(K\) | One learned feature map per filter. Channel \(k\) encodes how strongly filter \(k\) activated at each spatial position. These are not human-interpretable colors — they are abstract learned features.

The key point is that every filter spans all \(C\) input channels simultaneously. When a 3×3 filter processes an RGB image, it sees a 3×3×3 = 27-element patch and learns a single scalar response. This lets a filter detect, say, "a red edge" as opposed to a "green edge" — color and spatial structure are learned jointly. The innermost sum \(\sum_{dc}\) in the convolution formula is what folds all channels into one response.

Visualizing the sliding filter

Two 3×3 filters scan the same 5×5 input. Filter k=1 (gold) is a diagonal+center detector; filter k=2 (teal) is a left-edge (Sobel-X) detector. As each filter slides across the input, every position fills in one cell of that filter's output map. Running both filters in parallel stacks the two maps into a 3×3×2 output volume. The numbers inside each filter are the weights — the values the network learns during backpropagation.

[Figure: two 3×3 filters sliding over a 5×5 binary input. Stride S=1, no padding, so each filter's output map is (5−3)/1 + 1 = 3×3; the K=2 maps stack into a 3×3×2 output volume.]
Gold (k=1): diagonal+center detector → map [4,3,4 / 3,5,3 / 4,3,4].  Teal (k=2): Sobel-X left-edge detector → map [3,0,0 / 2,0,0 / 3,0,0] after ReLU. Both maps stack into the 3×3×2 output volume on the right. The filter values shown are the learned weights — backpropagation adjusts them to minimize classification loss.
Depth column: At every spatial position \((i,j)\) in the output there are \(K\) values — one per filter — stacked along the depth axis. All \(K\) neurons in a depth column \(\mathbf{A}[i,j,:]\) look at exactly the same \(F\times F\) input patch but apply different filters, each detecting a different feature (edge orientation, color blob, texture, …).

⚙️ Key Hyperparameters & Output Size

Parameter | Symbol | Effect | Typical values
Filter size | \(F\) | Receptive field of each neuron; larger captures more context | 3, 5, 7
Stride | \(S\) | Step between filter positions; larger stride downsamples to a smaller output | 1, 2
Padding | \(P\) | Zeros added to the input border; \(P = \lfloor F/2\rfloor\) keeps spatial size unchanged ("same") | 0 (valid), 1 (same for F=3)
Filters | \(K\) | Number of distinct feature detectors; equals output depth | 32, 64, 128, 256

The output feature map size (height or width) is:

\[ H_\text{out} = \left\lfloor \frac{H_\text{in} - F + 2P}{S} \right\rfloor + 1 \]

For a 28×28 input, \(F=3\), \(P=1\) (same padding), \(S=1\): \(H_\text{out} = (28-3+2)/1+1 = 28\). Spatial size is preserved.

Same input, \(P=0\), \(S=1\): \(H_\text{out} = (28-3)/1+1 = 26\). Slightly smaller.

Parameter count per conv layer: \(F \times F \times C_\text{in} \times K + K\). Each of the \(K\) filters has \(F\times F \times C_\text{in}\) weights plus one bias. Note that \(C_\text{in}\) is the depth of the input volume — a filter must span the full input depth to compute a proper dot product.
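Both formulas are easy to sanity-check mechanically. A quick plain-Python sketch (the lesson's code is MATLAB; function names here are mine):

```python
def conv_out(H_in, F, P, S):
    """Output height (or width): floor((H_in - F + 2P) / S) + 1."""
    return (H_in - F + 2 * P) // S + 1

def conv_params(F, C_in, K):
    """Learnable parameters: F*F*C_in weights per filter plus one bias, for K filters."""
    return F * F * C_in * K + K

print(conv_out(28, 3, 1, 1))   # 28: 'same' padding (P=1 for F=3) preserves size
print(conv_out(28, 3, 0, 1))   # 26: 'valid' padding (P=0) shrinks the map
print(conv_params(3, 1, 8))    # 80: first conv layer of the worked example
```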

👁️ Feature Maps — What Filters Learn

The remarkable property of CNNs is that the filters are learned from data, not hand-designed. Given enough examples and a good loss function, the network discovers which local patterns are useful for the task.

Visualization studies (Zeiler & Fergus, 2014) revealed a consistent feature hierarchy across virtually every trained CNN:

Layer depth | What filters detect | Example
Layers 1–2 | Oriented edges, color blobs, Gabor-like patterns | Horizontal/vertical/diagonal edges
Layers 3–4 | Textures, corners, junctions of edges | Grids, checkerboards, curve endings
Layers 5+ | Object parts and high-level semantic features | Wheel shapes, eyes, digit strokes

This hierarchy is not programmed in — it emerges automatically from gradient descent on any sufficiently large image dataset. Lower layers are nearly identical across networks trained on completely different tasks; only the deepest layers specialize.

Transfer learning exploits this: take a network pre-trained on a large general dataset (e.g., ImageNet), freeze the early layers (which have learned universal edge and texture detectors), and fine-tune only the later layers on your specific task. This is why CNNs work well even with small domain-specific datasets.

📊 Pooling — Downsampling with Translation Invariance

After a conv+ReLU block, a pooling layer reduces spatial dimensions. The most common is max pooling: divide the feature map into non-overlapping windows and take the maximum in each.

\[ p_{ij} = \max_{(m,n)\,\in\,\text{window}_{ij}} a_{mn} \]

A 2×2 max pool with stride 2 halves both H and W, reducing the number of activations by 75%.

Input (4×4):        Output (2×2):
1 3 2 4             5 8
5 2 8 1             7 9
3 7 9 2
1 4 6 5

Each non-overlapping 2×2 window (stride 2) contributes its maximum to the corresponding output cell. Spatial size: 4×4 → 2×2; parameters added: zero.

Why max pooling helps: If a feature (e.g., a vertical edge) is detected slightly to the left of where the network expects it, max pooling still passes a strong activation. This translation invariance makes the network more robust to small shifts in the input. Pooling also aggressively reduces the number of activations fed to subsequent layers, cutting computation and acting as a form of regularization.
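The 4×4 example above can be reproduced in a few lines of plain Python (a sketch for intuition, not the lesson's MATLAB code):

```python
def max_pool_2x2(A):
    """2x2 max pooling with stride 2 on a 2-D feature map (H and W even)."""
    H, W = len(A), len(A[0])
    return [[max(A[2 * i][2 * j],     A[2 * i][2 * j + 1],
                 A[2 * i + 1][2 * j], A[2 * i + 1][2 * j + 1])
             for j in range(W // 2)]
            for i in range(H // 2)]

A = [[1, 3, 2, 4],
     [5, 2, 8, 1],
     [3, 7, 9, 2],
     [1, 4, 6, 5]]
print(max_pool_2x2(A))   # [[5, 8], [7, 9]]
```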

Average pooling takes the mean instead of the maximum. It is less common in classification networks but is used in global average pooling layers at the end of modern architectures (e.g., ResNet), which replace the large FC layers entirely.

🧱 Full CNN Architecture — Tracking Volume Dimensions

A CNN is a sequence of layers, each transforming a 3-D volume \([H \times W \times D]\) into another. Spatial dimensions \(H\) and \(W\) shrink at pooling layers; depth \(D\) grows at conv layers. The network trades spatial resolution for richer feature representations, then flattens and classifies. The architecture below is the one used in the Lesson 37 worked example and HW 37 for classifying the MATLAB DigitDataset (28×28 grayscale handwritten digits, 10 classes).

Input 28×28×1 → Conv+BN+ReLU → 28×28×8 → MaxPool (2×2, S=2) → 14×14×8 → Conv+BN+ReLU → 14×14×16 → MaxPool (2×2, S=2) → 7×7×16 → Flatten → 784 → FC → 10 classes
Volume flow for a CNN on 28×28 grayscale images. Spatial dimensions shrink at each pooling layer; depth grows at each conv layer (shown as stacked blue layers). After two Conv+Pool blocks, the 7×7×16 volume is flattened to 784 values and classified by a fully-connected layer.

Worked example — DigitDataset CNN

Layer | Output volume \([H \times W \times D]\) | Parameters
Input image | \([28 \times 28 \times 1]\) | 0
Conv (F=3, K=8, P=1, S=1) + BN + ReLU | \([28 \times 28 \times 8]\) | \(3{\times}3{\times}1{\times}8 + 8 = 80\)
MaxPool 2×2, S=2 | \([14 \times 14 \times 8]\) | 0
Conv (F=3, K=16, P=1, S=1) + BN + ReLU | \([14 \times 14 \times 16]\) | \(3{\times}3{\times}8{\times}16 + 16 = 1{,}168\)
MaxPool 2×2, S=2 | \([7 \times 7 \times 16]\) | 0
Flatten | \([784]\) | 0
FC → 10 + Softmax | \([10]\) | \(784{\times}10 + 10 = 7{,}850\)
Total | | ≈9,100
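The table's bookkeeping can be reproduced mechanically. A plain-Python sketch (batch-norm scale/shift parameters are omitted, as in the table; the lesson's code is MATLAB):

```python
def conv_out(H, F, P, S):
    """floor((H - F + 2P) / S) + 1"""
    return (H - F + 2 * P) // S + 1

H, D, total = 28, 1, 0              # input volume: 28 x 28 x 1
for K in (8, 16):                   # the two Conv+BN+ReLU blocks (F=3, P=1, S=1)
    total += 3 * 3 * D * K + K      # filter weights plus biases
    H = conv_out(H, 3, 1, 1) // 2   # 'same' conv, then 2x2 max pool, stride 2
    D = K
total += H * H * D * 10 + 10        # flatten (H*H*D values) -> FC to 10 classes
print(H, D, total)                  # 7 16 9098, i.e. the table's ~9,100
```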

Parameter Efficiency: CNN vs. Fully-Connected

Recall the L36 network: featureInputLayer(784) → FC(10) → FC(10) required flattening the image and had \(784 \times 10 + 10 + 10 \times 10 + 10 = 7{,}960\) parameters (weights plus biases) but still struggled because FC layers ignore spatial structure.

Layer type | Input | Parameters | Spatial awareness
FC (784 → 10) | 784 flattened pixels | 7,850 | None — every pixel treated independently
Conv (3×3, 8 filters, on 28×28) | 28×28×1 image | 80 | Yes — same filter applied at every position
Full CNN (two conv blocks) | 28×28×1 image | ≈9,100 | Yes — hierarchical feature extraction
Key insight: The CNN has slightly more total parameters than the L36 FC network, but those parameters are used far more efficiently — each filter weight is shared across all \(H \times W\) spatial positions, and the architecture explicitly encodes the spatial inductive bias of image data. The result is dramatically better accuracy on the same dataset.
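To put weight sharing in numbers, compare the first conv layer against a hypothetical FC layer producing the same 28×28×8 output directly from the 784 input pixels (plain-Python arithmetic; the FC layer is purely illustrative, not part of the lesson's network):

```python
# First conv layer of the worked example: 8 filters of size 3x3x1, reused at
# every spatial position
n_conv = 3 * 3 * 1 * 8 + 8                  # 80 parameters

# Hypothetical FC layer mapping 784 inputs to the same 28*28*8 = 6,272 outputs:
# every output unit gets its own 784 weights plus a bias
n_fc = 784 * (28 * 28 * 8) + 28 * 28 * 8    # 4,923,520 parameters

print(n_conv, n_fc)   # 80 vs 4923520
```

The conv layer achieves the same output shape with roughly 60,000× fewer parameters, precisely because one 3×3 filter is shared across all 28×28 positions.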

🎬 Recommended Videos

Watch these two videos before class. The first gives an excellent visual walkthrough of how CNNs work end-to-end; the second covers CNN architecture, training, and design choices in depth from a university course perspective.

🖥 Interactive tool: CNN Explainer lets you run a small CNN in your browser and watch activations, filters, and feature maps update in real time as you change the input image. Great for building intuition before the homework.

💻 Worked Example — CNN on DigitDataset

This example mirrors the Lesson 36 FC network but replaces every featureInputLayer + fullyConnectedLayer with proper CNN layers. The key difference: images are no longer flattened. The network sees the full 28×28 image and can exploit spatial structure.

Three parameters at the top control the experiment. The defaults are chosen to show good generalization. To make a fair parameter-count comparison against the L36 FC network, run the Lesson 36 example with n_hidden = 11, learn_rate = 0.001, and n_per_class = 200 (50 epochs). That FC network has \(784 \times 11 + 11 + 11 \times 10 + 10 = 8{,}755\) parameters — roughly equal to this CNN’s ≈9,100. Compare accuracy and Training Progress plots: the CNN should win decisively despite similar parameter counts, because it exploits the spatial structure of images.

%% Lesson 37 — CNN on MATLAB DigitDataset
% Parallel to L36 but uses convolutional layers: no flattening needed.
% Architecture: imageInputLayer → Conv+BN+ReLU → Pool → Conv+BN+ReLU → Pool → FC → Softmax
clear; clc;

%% ── Parameters to experiment with ──────────────────────────────────────────
n_per_class = 200;   % images per digit class     (max 1000; try 100–500)
n_filters   = 8;     % filters in first Conv layer (second gets 2×n_filters)
learn_rate  = 0.01;  % Adam learning rate          (try 0.001, 0.01, 0.05)
%% ─────────────────────────────────────────────────────────────────────────

%% 1. Load dataset — no flattening: images stay as 28×28 arrays
digitPath = fullfile(matlabroot,'toolbox','nnet','nndemos', ...
                     'nndatasets','DigitDataset');
imds = imageDatastore(digitPath, ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
rng(356);
imds_sub = splitEachLabel(imds, n_per_class, 'randomize');
[imds_train, imds_test] = splitEachLabel(imds_sub, 0.8, 'randomize');

fprintf('Train: %d  |  Test: %d images\n', ...
    numel(imds_train.Files), numel(imds_test.Files));

%% 2. Define CNN architecture
% imageInputLayer accepts full images — the network learns spatial features
layers = [
    imageInputLayer([28 28 1], 'Normalization', 'rescale-zero-one', ...
                               'Name', 'input')

    % Block 1: 28×28×1 → 28×28×n_filters → 14×14×n_filters
    convolution2dLayer(3, n_filters, 'Padding', 'same', 'Name', 'conv1')
    batchNormalizationLayer('Name', 'bn1')
    reluLayer('Name', 'relu1')
    maxPooling2dLayer(2, 'Stride', 2, 'Name', 'pool1')

    % Block 2: 14×14×n_filters → 14×14×(2*n_filters) → 7×7×(2*n_filters)
    convolution2dLayer(3, 2*n_filters, 'Padding', 'same', 'Name', 'conv2')
    batchNormalizationLayer('Name', 'bn2')
    reluLayer('Name', 'relu2')
    maxPooling2dLayer(2, 'Stride', 2, 'Name', 'pool2')

    % Classifier
    fullyConnectedLayer(10,  'Name', 'fc')
    softmaxLayer('Name', 'softmax')
    classificationLayer('Name', 'output')
];

%% 3. Set training options (same API as L36)
opts = trainingOptions('adam', ...
    'MaxEpochs',           20, ...
    'MiniBatchSize',       64, ...
    'InitialLearnRate',    learn_rate, ...
    'ValidationData',      imds_test, ...
    'ValidationFrequency', 5, ...
    'Plots',               'training-progress', ...
    'Verbose',             false);

%% 4. Train — pass imageDatastore directly (no manual flattening!)
net = trainNetwork(imds_train, layers, opts);

%% 5. Evaluate
Y_pred = classify(net, imds_test);
Y_true = imds_test.Labels;
acc    = 100 * mean(Y_pred == Y_true);
fprintf('n_per_class=%d  n_filters=%d  lr=%.4f  Test acc: %.1f%%\n', ...
    n_per_class, n_filters, learn_rate, acc);

%% 6. Visualize 25 random test examples in a 5×5 grid
%  Test images are stored in class order, so randperm gives a varied mix.
imgs_test = readall(imds_test);
idx = randperm(numel(imgs_test), 25);
figure('Name', 'CNN Test Predictions', 'NumberTitle', 'off');
for k = 1:25
    subplot(5,5,k);
    imshow(imgs_test{idx(k)});
    pred_str = char(Y_pred(idx(k)));
    true_str = char(Y_true(idx(k)));
    if strcmp(pred_str, true_str)
        title(sprintf('Pred: %s', pred_str), 'Color', 'g', 'FontSize', 11);
    else
        title(sprintf('Pred:%s (T:%s)', pred_str, true_str), ...
              'Color', 'r', 'FontSize', 11);
    end
end
sgtitle('CNN Test Predictions  (green = correct  |  red = wrong)');
What does batchNormalizationLayer do? Batch normalization normalizes the activations within each mini-batch to have zero mean and unit variance, then applies learned scale and shift parameters. This stabilizes training, allows higher learning rates, and reduces sensitivity to weight initialization. It has largely replaced dropout in modern CNN architectures.
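A one-feature sketch of what batch normalization computes (plain Python for illustration; gamma and beta stand in for the learned scale and shift, and the batch values are made up):

```python
import math

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch of scalar activations to zero mean and unit
    variance, then apply the learned scale (gamma) and shift (beta)."""
    m = sum(x) / len(x)                              # batch mean
    v = sum((xi - m) ** 2 for xi in x) / len(x)      # batch variance
    return [gamma * (xi - m) / math.sqrt(v + eps) + beta for xi in x]

batch = [2.0, 4.0, 6.0, 8.0]   # pre-activation values for one feature
y = batch_norm(batch)          # zero mean, unit variance (up to eps)
```

In a real conv layer the same normalization is applied per channel across the batch and all spatial positions, with one (gamma, beta) pair learned per channel.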
Why no flattening? imageInputLayer accepts full 2-D images. The CNN layers operate on the \(H\times W\times C\) volume directly. MATLAB flattens automatically before the fullyConnectedLayer. This is a major difference from the L36 FC approach, where you had to manually compute [v{:}]' to get an \(N\times784\) matrix.

📝 Summary

Concept | Key idea
Local connectivity | Each neuron connects to only a small spatial region; captures nearby correlations efficiently
Parameter sharing | Same filter weights used at every position; one edge detector works everywhere in the image
Convolution output size | \(\lfloor(H - F + 2P)/S\rfloor + 1\); same padding (\(P = \lfloor F/2\rfloor\)) preserves spatial size
Feature hierarchy | Early layers: edges; middle: textures; deep: object parts — all learned from data
Max pooling | Take regional maxima; halves spatial size; provides translation invariance; zero extra parameters
Volume notation | Track \([H\times W\times D]\) at each layer; \(H,W\) shrink at pool layers; \(D\) grows at conv layers
MATLAB CNN | imageInputLayer + convolution2dLayer + maxPooling2dLayer; pass imageDatastore directly to trainNetwork
vs. FC (L36) | CNN: same or fewer parameters, but spatially aware → higher accuracy on image tasks
📋 HW 37 — CNN on DigitDataset

📚 References

  1. Karpathy, A. et al. CS231n: Convolutional Neural Networks for Visual Recognition — Lecture 5. Stanford University. cs231n.github.io/convolutional-networks/
  2. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
  3. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. NeurIPS 2012.
  4. Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. ECCV 2014.