Learning Objectives
- Explain why fully-connected networks fail to scale to image data and how CNNs address this with local connectivity and parameter sharing.
- Describe the convolution operation mathematically and trace how a filter slides across an input volume.
- Compute the output spatial dimensions of a conv layer given filter size \(F\), stride \(S\), and padding \(P\).
- Explain max pooling and why it provides translation invariance.
- Track the volume shape \([H \times W \times D]\) at each layer through a complete CNN.
- Build and train a CNN in MATLAB using the Deep Learning Toolbox and compare its performance to a fully-connected network on the same dataset.
Background & Motivation
Why fully-connected networks struggle with images
In Lessons 35–36 we built fully-connected networks that treat every input feature identically. For image data this is a problem: a 28×28 grayscale image has a manageable 784 features, but a 256×256 RGB image has 196,608. A single hidden FC layer with 1,024 neurons would then require over 200 million weights — before even reaching the output layer.
Even ignoring the parameter count, there is a deeper issue: FC layers have no notion of spatial structure. A pixel at position (0, 0) is treated identically to one at (127, 127). But meaningful image features — edges, textures, corners — are local and they appear at many positions. There is no reason to learn a separate detector for an edge in the top-left versus the top-right of the image.
Convolutional Neural Networks (CNNs) encode two structural assumptions that fix both problems:
- Local connectivity: Each neuron connects to only a small spatial region of the input — its receptive field.
- Parameter sharing: The same filter weights are used at every spatial position. One edge detector learned anywhere applies everywhere.
The historical breakthrough
The idea of convolutional networks dates to Yann LeCun's LeNet (1998), trained on handwritten digits — essentially the same task as the DigitDataset classification you are doing in this course. For over a decade afterward, CNNs were considered too computationally expensive to scale.
In 2012, Krizhevsky, Sutskever, and Hinton trained AlexNet on GPUs and entered the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Their top-5 error rate was 15.3% — more than 10 percentage points better than the next competitor. The result was a shock to the computer vision community and launched the modern deep learning era. By 2015 the best models (ResNet, 152 layers) had surpassed human-level performance on the same benchmark.
Physics applications
CNNs are now standard tools across physics: galaxy morphology classification and gravitational lens detection in astrophysics; phase identification from diffraction patterns in materials science; SAR image interpretation and GEO object classification in space domain awareness; and particle track reconstruction in high-energy physics.
The Convolution Operation
Mathematical definition
A convolutional layer applies \(K\) filters in parallel. Each filter \(\mathbf{W}_k\) has size \(F \times F \times C\) (spatial extent \(F\), spanning all \(C\) input channels). The filter slides across the input, computing a dot product at each position. The output activation at spatial position \((i, j)\) for filter \(k\) is:

\[
a_{i,j,k} = \phi\!\left(\sum_{dx=1}^{F}\sum_{dy=1}^{F}\sum_{dc=1}^{C} \mathbf{W}_k[dx, dy, dc]\,\mathbf{X}[S\,i + dx,\; S\,j + dy,\; dc] \;+\; b_k\right)
\]

where \(\phi\) is the activation function (typically ReLU), \(S\) is the stride, and \(b_k\) is a scalar bias. The three nested sums simply compute a dot product between a local \(F \times F \times C\) patch of the input and the filter, then apply bias and activation.
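The triple sum above can be made concrete with a short sketch — in plain Python here rather than MATLAB, purely for illustration (all names and values are hypothetical, not from the toolbox):

```python
def conv_activation(X, W, b, i, j, stride=1):
    """One conv-layer output: ReLU of (F x F x C patch . filter) + bias.
    X is the input volume indexed X[y][x][c]; W is the filter W[dy][dx][dc]."""
    F = len(W)          # spatial extent of the filter
    C = len(X[0][0])    # number of input channels
    s = b
    for dy in range(F):
        for dx in range(F):
            for dc in range(C):
                s += W[dy][dx][dc] * X[stride * i + dy][stride * j + dx][dc]
    return max(0.0, s)  # ReLU activation

# Toy 3x3x1 input and a 2x2x1 averaging filter (hypothetical values):
X = [[[1.0], [2.0], [3.0]],
     [[4.0], [5.0], [6.0]],
     [[7.0], [8.0], [9.0]]]
W = [[[0.25], [0.25]],
     [[0.25], [0.25]]]
print(conv_activation(X, W, 0.0, 0, 0))  # patch {1,2,4,5}: 0.25*12 = 3.0
print(conv_activation(X, W, 0.0, 1, 1))  # patch {5,6,8,9}: 0.25*28 = 7.0
```

Note that sliding `(i, j)` over all valid positions and repeating for each of the \(K\) filters produces the full output volume.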
What are channels?
The depth dimension \(C\) of the input volume is called the number of channels. Its meaning depends on where you are in the network:
| Location | \(C\) | What each channel represents |
|---|---|---|
| First layer input — color image | 3 | Red, Green, Blue pixel intensities. A 480×640 photo is a \(480\times640\times3\) volume. Each of the 3 planes holds one color channel. |
| First layer input — grayscale image | 1 | A single intensity value per pixel. The DigitDataset images are \(28\times28\times1\). |
| First layer input — multispectral / physics data | \(C\) | Any set of co-registered 2-D measurements: radar bands, spectral wavelengths, polarization states, time steps of a field, … CNNs treat them identically to RGB. |
| Hidden layer output | \(K\) | One learned feature map per filter. Channel \(k\) encodes "how strongly filter \(k\) activated at each spatial position." These are not human-interpretable colors — they are abstract learned features. |
The key point is that every filter spans all \(C\) input channels simultaneously. When a 3×3 filter processes an RGB image, it sees a 3×3×3 = 27-element patch and learns a single scalar response. This lets a filter detect, say, "a red edge" as opposed to a "green edge" — colour and spatial structure are learned jointly. The innermost sum \(\sum_{dc}\) in the convolution formula is what folds all channels into one response.
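The cross-channel fold can be seen in a tiny numerical example — sketched in Python for illustration (hypothetical values, not part of the MATLAB example):

```python
def response(patch, filt):
    """Dot product of a 3x3xC patch with a 3x3xC filter: one scalar per filter."""
    return sum(p * w
               for prow, frow in zip(patch, filt)
               for pvec, fvec in zip(prow, frow)
               for p, w in zip(pvec, fvec))

red_patch = [[(1, 0, 0)] * 3 for _ in range(3)]  # 3x3 pure-red pixels, (R,G,B)
red_filt  = [[(1, 0, 0)] * 3 for _ in range(3)]  # weights only on the R channel
grn_filt  = [[(0, 1, 0)] * 3 for _ in range(3)]  # weights only on the G channel

print(response(red_patch, red_filt))  # 9  -- strong response to red structure
print(response(red_patch, grn_filt))  # 0  -- a "green" detector stays silent
```

Each filter sees all 27 values (3×3 spatial × 3 channels) at once, which is exactly why it can distinguish a red edge from a green one.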
Visualizing the sliding filter
Two 3×3 filters scan the same 5×5 input. Filter k=1 (gold) is a diagonal+center detector; Filter k=2 (teal) is a left-edge (Sobel-X) detector. The animation alternates between them — watch the active overlay on the input and the highlighted cell filling in each output map. Running both filters in parallel stacks the two maps into a 3×3×2 output volume (right). The numbers inside each filter are the weights — the values the network learns during backpropagation.
Key Hyperparameters & Output Size
| Parameter | Symbol | Effect | Typical values |
|---|---|---|---|
| Filter size | \(F\) | Receptive field of each neuron; larger = captures more context | 3, 5, 7 |
| Stride | \(S\) | Step between filter positions; larger = smaller output, downsampling | 1, 2 |
| Padding | \(P\) | Zeros added to input border; \(P{=}\lfloor F/2\rfloor\) keeps spatial size unchanged (same) | 0 (valid), 1 (same for F=3) |
| Filters | \(K\) | Number of distinct feature detectors; equals output depth | 32, 64, 128, 256 |
The output feature map size (height or width) is:

\[
H_\text{out} = \left\lfloor \frac{H_\text{in} - F + 2P}{S} \right\rfloor + 1
\]
For a 28×28 input, \(F=3\), \(P=1\) (same padding), \(S=1\): \(H_\text{out} = (28-3+2)/1+1 = 28\). Spatial size is preserved.
Same input, \(P=0\), \(S=1\): \(H_\text{out} = (28-3)/1+1 = 26\). Slightly smaller.
Feature Maps — What Filters Learn
The remarkable property of CNNs is that the filters are learned from data, not hand-designed. Given enough examples and a good loss function, the network discovers which local patterns are useful for the task.
Visualization studies (Zeiler & Fergus, 2014) revealed a consistent feature hierarchy across virtually every trained CNN:
| Layer depth | What filters detect | Example |
|---|---|---|
| Layer 1–2 | Oriented edges, color blobs, Gabor-like patterns | Horizontal/vertical/diagonal edges |
| Layer 3–4 | Textures, corners, junctions of edges | Grids, checkerboards, curve endings |
| Layer 5+ | Object parts & high-level semantic features | Wheel shapes, eyes, digit strokes |
This hierarchy is not programmed in — it emerges automatically from gradient descent on any sufficiently large image dataset. Lower layers are nearly identical across networks trained on completely different tasks; only the deepest layers specialize.
Pooling — Downsampling with Translation Invariance
After a conv+ReLU block, a pooling layer reduces spatial dimensions. The most common is max pooling: divide the feature map into non-overlapping windows and take the maximum in each.
A 2×2 max pool with stride 2 halves both H and W, reducing the number of activations by 75%.
Why max pooling helps: If a feature (e.g., a vertical edge) is detected slightly to the left of where the network expects it, max pooling still passes a strong activation. This translation invariance makes the network more robust to small shifts in the input. Pooling also aggressively reduces the number of activations fed to subsequent layers, cutting computation and acting as a form of regularization.
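A minimal sketch (in Python, for illustration only) makes the invariance concrete: the same strong activation, shifted by one pixel, produces an identical pooled map:

```python
def maxpool2x2(fmap):
    """2x2 max pool with stride 2 on a 2-D feature map (list of lists)."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]), 2)]
            for i in range(0, len(fmap), 2)]

# A strong edge response at position (0, 1) ...
a = [[0, 9, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
# ... and the same response shifted to (1, 0):
b = [[0, 0, 0, 0],
     [9, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]

print(maxpool2x2(a))  # [[9, 0], [0, 0]]
print(maxpool2x2(b))  # [[9, 0], [0, 0]]  -- identical: the shift is absorbed
```

Both inputs pool to the same 2×2 map, so downstream layers never see the one-pixel shift.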
Full CNN Architecture — Tracking Volume Dimensions
A CNN is a sequence of layers, each transforming a 3-D volume \([H \times W \times D]\) into another. Spatial dimensions \(H\) and \(W\) shrink at pooling layers; depth \(D\) grows at conv layers. The network trades spatial resolution for richer feature representations, then flattens and classifies. The architecture below is the one used in the Lesson 37 worked example and HW 37 for classifying the MATLAB DigitDataset (28×28 grayscale handwritten digits, 10 classes).
Worked example — DigitDataset CNN
| Layer | Output volume \([H \times W \times D]\) | Parameters |
|---|---|---|
| Input image | \([28 \times 28 \times 1]\) | — |
| Conv (F=3, K=8, P=1, S=1) + BN + ReLU | \([28 \times 28 \times 8]\) | \(3{\times}3{\times}1{\times}8 + 8 = 80\) |
| MaxPool 2×2, S=2 | \([14 \times 14 \times 8]\) | 0 |
| Conv (F=3, K=16, P=1, S=1) + BN + ReLU | \([14 \times 14 \times 16]\) | \(3{\times}3{\times}8{\times}16 + 16 = 1{,}168\) |
| MaxPool 2×2, S=2 | \([7 \times 7 \times 16]\) | 0 |
| Flatten | \([784]\) | 0 |
| FC → 10 + Softmax | \([10]\) | \(784{\times}10 + 10 = 7{,}850\) |
| Total | — | ≈9,100 |
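The parameter counts in the table follow two simple rules — \(F \times F \times C_\text{in} \times K + K\) for a conv layer (weights plus biases) and \(n_\text{in} \times n_\text{out} + n_\text{out}\) for an FC layer. A quick Python check (illustrative, outside the MATLAB workflow) reproduces them:

```python
def conv_params(F, C_in, K):
    """Conv layer: F*F*C_in weights per filter, K filters, plus K biases."""
    return F * F * C_in * K + K

def fc_params(n_in, n_out):
    """Fully-connected layer: weight matrix plus biases."""
    return n_in * n_out + n_out

c1 = conv_params(3, 1, 8)        # conv1 on a 1-channel input: 80
c2 = conv_params(3, 8, 16)       # conv2 on 8 feature maps:    1168
fc = fc_params(7 * 7 * 16, 10)   # flattened 784 -> 10 classes: 7850
print(c1, c2, fc, c1 + c2 + fc)  # 80 1168 7850 9098
```

The total of 9,098 matches the table's ≈9,100 (the two batch-norm layers add a further \(2 \times 8 + 2 \times 16 = 48\) learned scale/shift parameters).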
Parameter Efficiency: CNN vs. Fully-Connected
Recall the L36 network: featureInputLayer(784) → FC(10) → FC(10) required flattening the image and had \(784 \times 10 + 10 + 10 \times 10 + 10 = 7{,}960\) parameters (weights and biases) but still struggled because FC layers ignore spatial structure.
| Layer type | Input | Parameters | Spatial awareness |
|---|---|---|---|
| FC (784 → 10) | 784 flattened pixels | 7,850 | None — every pixel treated independently |
| Conv (3×3, 8 filters, on 28×28) | 28×28×1 image | 80 | Yes — same filter applied at every position |
| Full CNN (two conv blocks) | 28×28×1 image | ≈9,100 | Yes — hierarchical feature extraction |
Recommended Videos
Watch these two videos before class. The first gives an excellent visual walkthrough of how CNNs work end-to-end; the second covers CNN architecture, training, and design choices in depth from a university course perspective.
Worked Example — CNN on DigitDataset
This example mirrors the Lesson 36 FC network but replaces the featureInputLayer + fullyConnectedLayer stack with proper CNN layers. The key difference: images are no longer flattened. The network sees the full 28×28 image and can exploit spatial structure.
Three parameters at the top control the experiment. The defaults are chosen to show good generalization. To make a fair parameter-count comparison against the L36 FC network, run the Lesson 36 example with n_hidden = 11, learn_rate = 0.001, and n_per_class = 200 (50 epochs). That FC network has \(784 \times 11 + 11 + 11 \times 10 + 10 \approx 8{,}755\) parameters — roughly equal to this CNN’s ≈9,100. Compare accuracy and Training Progress plots: the CNN should win decisively despite similar parameter counts, because it exploits the spatial structure of images.
%% Lesson 37 — CNN on MATLAB DigitDataset
% Parallel to L36 but uses convolutional layers: no flattening needed.
% Architecture: imageInputLayer → Conv+BN+ReLU → Pool → Conv+BN+ReLU → Pool → FC → Softmax
clear; clc;
%% ── Parameters to experiment with ──────────────────────────────────────────
n_per_class = 200; % images per digit class (max 1000; try 100–500)
n_filters = 8; % filters in first Conv layer (second gets 2×n_filters)
learn_rate = 0.01; % Adam learning rate (try 0.001, 0.01, 0.05)
%% ─────────────────────────────────────────────────────────────────────────
%% 1. Load dataset — no flattening: images stay as 28×28 arrays
digitPath = fullfile(matlabroot,'toolbox','nnet','nndemos', ...
'nndatasets','DigitDataset');
imds = imageDatastore(digitPath, ...
'IncludeSubfolders', true, 'LabelSource', 'foldernames');
rng(356);
imds_sub = splitEachLabel(imds, n_per_class, 'randomize');
[imds_train, imds_test] = splitEachLabel(imds_sub, 0.8, 'randomize');
fprintf('Train: %d | Test: %d images\n', ...
numel(imds_train.Files), numel(imds_test.Files));
%% 2. Define CNN architecture
% imageInputLayer accepts full images — the network learns spatial features
layers = [
imageInputLayer([28 28 1], 'Normalization', 'rescale-zero-one', ...
'Name', 'input')
% Block 1: 28×28×1 → 28×28×n_filters → 14×14×n_filters
convolution2dLayer(3, n_filters, 'Padding', 'same', 'Name', 'conv1')
batchNormalizationLayer('Name', 'bn1')
reluLayer('Name', 'relu1')
maxPooling2dLayer(2, 'Stride', 2, 'Name', 'pool1')
% Block 2: 14×14×n_filters → 14×14×(2*n_filters) → 7×7×(2*n_filters)
convolution2dLayer(3, 2*n_filters, 'Padding', 'same', 'Name', 'conv2')
batchNormalizationLayer('Name', 'bn2')
reluLayer('Name', 'relu2')
maxPooling2dLayer(2, 'Stride', 2, 'Name', 'pool2')
% Classifier
fullyConnectedLayer(10, 'Name', 'fc')
softmaxLayer('Name', 'softmax')
classificationLayer('Name', 'output')
];
%% 3. Set training options (same API as L36)
opts = trainingOptions('adam', ...
'MaxEpochs', 20, ...
'MiniBatchSize', 64, ...
'InitialLearnRate', learn_rate, ...
'ValidationData', imds_test, ...
'ValidationFrequency', 5, ...
'Plots', 'training-progress', ...
'Verbose', false);
%% 4. Train — pass imageDatastore directly (no manual flattening!)
net = trainNetwork(imds_train, layers, opts);
%% 5. Evaluate
Y_pred = classify(net, imds_test);
Y_true = imds_test.Labels;
acc = 100 * mean(Y_pred == Y_true);
fprintf('n_per_class=%d n_filters=%d lr=%.4f Test acc: %.1f%%\n', ...
n_per_class, n_filters, learn_rate, acc);
%% 6. Visualize 25 random test examples in a 5×5 grid
% Test images are stored in class order, so randperm gives a varied mix.
imgs_test = readall(imds_test);
idx = randperm(numel(imgs_test), 25);
figure('Name', 'CNN Test Predictions', 'NumberTitle', 'off');
for k = 1:25
subplot(5,5,k);
imshow(imgs_test{idx(k)});
pred_str = char(Y_pred(idx(k)));
true_str = char(Y_true(idx(k)));
if strcmp(pred_str, true_str)
title(sprintf('Pred: %s', pred_str), 'Color', 'g', 'FontSize', 11);
else
title(sprintf('Pred:%s (T:%s)', pred_str, true_str), ...
'Color', 'r', 'FontSize', 11);
end
end
sgtitle('CNN Test Predictions (green = correct | red = wrong)');
What does batchNormalizationLayer do? Batch normalization normalizes the activations within each mini-batch to have zero mean and unit variance, then applies learned scale and shift parameters. This stabilizes training, allows higher learning rates, and reduces sensitivity to weight initialization. It has largely replaced dropout in modern CNN architectures.
Why is no flattening needed? imageInputLayer accepts full 2-D images, and the CNN layers operate on the \(H\times W\times C\) volume directly; MATLAB flattens automatically before the fullyConnectedLayer. This is a major difference from the L36 FC approach, where you had to manually compute [v{:}]' to get an \(N\times784\) matrix.
Summary
| Concept | Key Idea |
|---|---|
| Local connectivity | Each neuron connects to only a small spatial region; captures nearby correlations efficiently |
| Parameter sharing | Same filter weights used at every position; one edge detector works everywhere in the image |
| Convolution output size | \(\lfloor(H - F + 2P)/S\rfloor + 1\); same padding (\(P = \lfloor F/2\rfloor\)) preserves spatial size |
| Feature hierarchy | Early layers: edges; middle: textures; deep: object parts — all learned from data |
| Max pooling | Take regional maxima; halves spatial size; provides translation invariance; zero extra parameters |
| Volume notation | Track \([H\times W\times D]\) at each layer; \(H,W\) shrink at pool layers; \(D\) grows at conv layers |
| MATLAB CNN | imageInputLayer + convolution2dLayer + maxPooling2dLayer; pass imageDatastore directly to trainNetwork |
| vs. FC (L36) | CNN: same or fewer parameters, but spatially aware → higher accuracy on image tasks |
References
- Karpathy, A. et al. CS231n: Convolutional Neural Networks for Visual Recognition — Lecture 5. Stanford University. cs231n.github.io/convolutional-networks/
- LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
- Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. NeurIPS 2012.
- Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. ECCV 2014.