Lesson 37 • Physics 356

Convolutional Neural Networks (CNNs)

⏱ ~40 min read

Learning Objectives

Notation used in this lesson

\(\mathbf{X}\): input volume, size \(H \times W \times C\)
\(\mathbf{W}_k\): weight tensor of filter \(k\), size \(F \times F \times C\)
\(b_k\): bias for filter \(k\)
\(F,\;S,\;P\): filter size, stride, zero-padding
\(K\): number of filters (output depth)
\(\mathbf{A}[i,j,k]\): output activation at row \(i\), column \(j\), filter \(k\)

🌍 Background & Motivation

Why fully-connected networks struggle with images

In Lessons 35–36 we built fully-connected networks that treat every input feature identically. For image data this is a problem: a 28×28 grayscale image has 784 features, which is manageable, but a 256×256 RGB image has 196,608. A single hidden FC layer with 1,024 neurons would then require over 200 million weights — before even reaching the output layer.

Even ignoring the parameter count, there is a deeper issue: FC layers have no notion of spatial structure. A pixel at position (0, 0) is treated identically to one at (127, 127). But meaningful image features — edges, textures, corners — are local and they appear at many positions. There is no reason to learn a separate detector for an edge in the top-left versus the top-right of the image.

Convolutional Neural Networks (CNNs) encode two structural assumptions that fix both problems: local connectivity (each neuron looks at only a small spatial patch, so parameter count scales with filter size, not image size) and parameter sharing (the same filter weights are applied at every position, so a detector learned once works everywhere in the image).

The historical breakthrough

The idea of convolutional networks dates to Yann LeCun's LeNet (1998), trained on handwritten digit recognition — the same task you are tackling with the DigitDataset in this course. For over a decade, CNNs were considered too computationally expensive to scale.

In 2012, Krizhevsky, Sutskever, and Hinton trained AlexNet on GPUs and entered the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Their top-5 error rate was 15.3% — more than 10 percentage points better than the next competitor. The result was a shock to the computer vision community and launched the modern deep learning era. By 2015 the best models (ResNet, 152 layers) had surpassed human-level performance on the same benchmark.

Depth matters: The progression was rapid — 8 layers (AlexNet, 2012) → 19 layers (VGGNet, 2014) → 22 layers (GoogLeNet, 2014) → 152 layers (ResNet, 2015). Each jump brought major accuracy gains, but only became possible as researchers solved the vanishing gradient problem with better initialization, batch normalization, and residual connections.

Physics applications

CNNs are now standard tools across physics: galaxy morphology classification and gravitational lens detection in astrophysics; phase identification from diffraction patterns in materials science; SAR image interpretation and GEO object classification in space domain awareness; and particle track reconstruction in high-energy physics.

🧮 The Convolution Operation

Mathematical definition

A convolutional layer applies \(K\) filters in parallel. Each filter \(\mathbf{W}_k\) has size \(F \times F \times C\) (spatial extent \(F\), spanning all \(C\) input channels). The filter slides across the input, computing a dot product at each position. The output activation at spatial position \((i, j)\) for filter \(k\) is:

\[ \mathbf{A}[i,\,j,\,k] \;=\; \phi\!\left( \sum_{di=0}^{F-1}\sum_{dj=0}^{F-1}\sum_{dc=0}^{C-1} \mathbf{X}[\,i{\cdot}S+di,\;\;j{\cdot}S+dj,\;\;dc\,] \cdot \mathbf{W}_k[\,di,\;dj,\;dc\,] \;+\; b_k \right) \]

where \(\phi\) is the activation function (typically ReLU), \(S\) is the stride, and \(b_k\) is a scalar bias. The three nested sums simply compute a dot product between a local \(F \times F \times C\) patch of the input and the filter — then apply bias and activation.
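To make the triple sum concrete, here is a minimal sketch in plain Python (the lesson's own code is MATLAB; Python is used here only as a language-neutral illustration). The input and filter values are the 5×5 input and the gold diagonal+center filter from the sliding-filter example later in this section.

```python
def conv_activation(X, W, b, i, j, S=1):
    """Output activation A[i,j] for one filter W (F x F x C) at stride S, with ReLU.
    X is indexed [row][col][channel]; the three loops mirror the triple sum."""
    F = len(W)
    C = len(W[0][0])
    z = b
    for di in range(F):
        for dj in range(F):
            for dc in range(C):
                z += X[i * S + di][j * S + dj][dc] * W[di][dj][dc]
    return max(0.0, z)   # ReLU

# 5x5 single-channel input (the cross/diamond pattern from the figure)
img = [[0, 0, 1, 0, 0],
       [0, 1, 1, 1, 0],
       [1, 1, 1, 1, 1],
       [0, 1, 1, 1, 0],
       [0, 0, 1, 0, 0]]
X = [[[v] for v in row] for row in img]      # shape 5 x 5 x 1

# Filter k=1 from the figure: diagonal + center detector
Wk = [[[1], [0], [1]],
      [[0], [1], [0]],
      [[1], [0], [1]]]

A00 = conv_activation(X, Wk, b=0.0, i=0, j=0)   # 4.0, top-left of the gold map
```

The top-left activation comes out to 4 and the center activation (i=1, j=1) to 5, matching the gold output map [4,3,4 / 3,5,3 / 4,3,4] in the figure.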

What are channels?

The depth dimension \(C\) of the input volume is called the number of channels. Its meaning depends on where you are in the network:

Location | \(C\) | What each channel represents
First layer input, color image | 3 | Red, Green, Blue pixel intensities. A 480×640 photo is a \(480\times640\times3\) volume; each of the 3 planes holds one color channel.
First layer input, grayscale image | 1 | A single intensity value per pixel. The DigitDataset images are \(28\times28\times1\).
First layer input, multispectral / physics data | \(C\) | Any set of co-registered 2-D measurements: radar bands, spectral wavelengths, polarization states, time steps of a field, and so on. CNNs treat them identically to RGB.
Hidden layer output | \(K\) | One learned feature map per filter. Channel \(k\) encodes how strongly filter \(k\) activated at each spatial position. These are not human-interpretable colors — they are abstract learned features.

The key point is that every filter spans all \(C\) input channels simultaneously. When a 3×3 filter processes an RGB image, it sees a 3×3×3 = 27-element patch and learns a single scalar response. This lets a filter detect, say, "a red edge" as opposed to a "green edge" — color and spatial structure are learned jointly. The innermost sum \(\sum_{dc}\) in the convolution formula is what folds all channels into one response.

Visualizing the sliding filter

Two 3×3 filters scan the same 5×5 input. Filter k=1 (gold) is a diagonal+center detector; filter k=2 (teal) is a left-edge (Sobel-X) detector. As each filter slides across the input, every position fills in one cell of that filter's output map. Running both filters in parallel stacks the two maps into a 3×3×2 output volume. The numbers inside each filter are the weights — the values the network learns during backpropagation.

[Figure: two 3×3 filters sliding over a 5×5 binary input. Stride S=1, no padding, so each filter's output map is (5−3)/1 + 1 = 3×3; the K=2 maps stack into a 3×3×2 output volume.]
Gold (k=1): diagonal+center detector → map [4,3,4 / 3,5,3 / 4,3,4].  Teal (k=2): Sobel-X left-edge detector → map [3,0,0 / 2,0,0 / 3,0,0] after ReLU. Both maps stack into the 3×3×2 output volume on the right. The filter values shown are the learned weights — backpropagation adjusts them to minimize classification loss.
Depth column: At every spatial position \((i,j)\) in the output there are \(K\) values — one per filter — stacked along the depth axis. All \(K\) neurons in a depth column \(\mathbf{A}[i,j,:]\) look at exactly the same \(F\times F\) input patch but apply different filters, each detecting a different feature (edge orientation, color blob, texture, …).

⚙️ Key Hyperparameters & Output Size

Parameter | Symbol | Effect | Typical values
Filter size | \(F\) | Receptive field of each neuron; larger captures more context | 3, 5, 7
Stride | \(S\) | Step between filter positions; larger stride downsamples to a smaller output | 1, 2
Padding | \(P\) | Zeros added to the input border; \(P = \lfloor F/2\rfloor\) keeps spatial size unchanged ("same") | 0 (valid), 1 (same for F=3)
Filters | \(K\) | Number of distinct feature detectors; equals output depth | 32, 64, 128, 256

The output feature map size (height or width) is:

\[ H_\text{out} = \left\lfloor \frac{H_\text{in} - F + 2P}{S} \right\rfloor + 1 \]

For a 28×28 input, \(F=3\), \(P=1\) (same padding), \(S=1\): \(H_\text{out} = (28-3+2)/1+1 = 28\). Spatial size is preserved.

Same input, \(P=0\), \(S=1\): \(H_\text{out} = (28-3)/1+1 = 26\). Slightly smaller.

Parameter count per conv layer: \(F \times F \times C_\text{in} \times K + K\). Each of the \(K\) filters has \(F\times F \times C_\text{in}\) weights plus one bias. Note that \(C_\text{in}\) is the depth of the input volume — a filter must span the full input depth to compute a proper dot product.
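Both formulas are easy to sanity-check mechanically. A quick plain-Python sketch (the lesson's code is MATLAB; function names here are mine):

```python
def conv_out(H_in, F, P, S):
    """Output height (or width): floor((H_in - F + 2P) / S) + 1."""
    return (H_in - F + 2 * P) // S + 1

def conv_params(F, C_in, K):
    """Learnable parameters: F*F*C_in weights per filter plus one bias, for K filters."""
    return F * F * C_in * K + K

print(conv_out(28, 3, 1, 1))   # 28: 'same' padding (P=1 for F=3) preserves size
print(conv_out(28, 3, 0, 1))   # 26: 'valid' padding (P=0) shrinks the map
print(conv_params(3, 1, 8))    # 80: first conv layer of the worked example
```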

👁️ Feature Maps — What Filters Learn

The remarkable property of CNNs is that the filters are learned from data, not hand-designed. Given enough examples and a good loss function, the network discovers which local patterns are useful for the task.

Visualization studies (Zeiler & Fergus, 2014) revealed a consistent feature hierarchy across virtually every trained CNN:

Layer depth | What filters detect | Example
Layers 1–2 | Oriented edges, color blobs, Gabor-like patterns | Horizontal/vertical/diagonal edges
Layers 3–4 | Textures, corners, junctions of edges | Grids, checkerboards, curve endings
Layers 5+ | Object parts and high-level semantic features | Wheel shapes, eyes, digit strokes

This hierarchy is not programmed in — it emerges automatically from gradient descent on any sufficiently large image dataset. Lower layers are nearly identical across networks trained on completely different tasks; only the deepest layers specialize.

Transfer learning exploits this: take a network pre-trained on a large general dataset (e.g., ImageNet), freeze the early layers (which have learned universal edge and texture detectors), and fine-tune only the later layers on your specific task. This is why CNNs work well even with small domain-specific datasets.

📊 Pooling — Downsampling with Translation Invariance

After a conv+ReLU block, a pooling layer reduces spatial dimensions. The most common is max pooling: divide the feature map into non-overlapping windows and take the maximum in each.

\[ p_{ij} = \max_{(m,n)\,\in\,\text{window}_{ij}} a_{mn} \]

A 2×2 max pool with stride 2 halves both H and W, reducing the number of activations by 75%.

Input (4×4):        Output (2×2):
1 3 2 4             5 8
5 2 8 1             7 9
3 7 9 2
1 4 6 5

Each non-overlapping 2×2 window (stride 2) contributes its maximum to the corresponding output cell. Spatial size: 4×4 → 2×2; parameters added: zero.

Why max pooling helps: If a feature (e.g., a vertical edge) is detected slightly to the left of where the network expects it, max pooling still passes a strong activation. This translation invariance makes the network more robust to small shifts in the input. Pooling also aggressively reduces the number of activations fed to subsequent layers, cutting computation and acting as a form of regularization.
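The 4×4 example above can be reproduced in a few lines of plain Python (a sketch for intuition, not the lesson's MATLAB code):

```python
def max_pool_2x2(A):
    """2x2 max pooling with stride 2 on a 2-D feature map (H and W even)."""
    H, W = len(A), len(A[0])
    return [[max(A[2 * i][2 * j],     A[2 * i][2 * j + 1],
                 A[2 * i + 1][2 * j], A[2 * i + 1][2 * j + 1])
             for j in range(W // 2)]
            for i in range(H // 2)]

A = [[1, 3, 2, 4],
     [5, 2, 8, 1],
     [3, 7, 9, 2],
     [1, 4, 6, 5]]
print(max_pool_2x2(A))   # [[5, 8], [7, 9]]
```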

Average pooling takes the mean instead of the maximum. It is less common in classification networks but is used in global average pooling layers at the end of modern architectures (e.g., ResNet), which replace the large FC layers entirely.

🧱 Full CNN Architecture — Tracking Volume Dimensions

A CNN is a sequence of layers, each transforming a 3-D volume \([H \times W \times D]\) into another. Spatial dimensions \(H\) and \(W\) shrink at pooling layers; depth \(D\) grows at conv layers. The network trades spatial resolution for richer feature representations, then flattens and classifies. The architecture below is the one used in the Lesson 37 worked example and HW 37 for classifying the MATLAB DigitDataset (28×28 grayscale handwritten digits, 10 classes).

Input 28×28×1 → Conv+BN+ReLU → 28×28×8 → MaxPool (2×2, S=2) → 14×14×8 → Conv+BN+ReLU → 14×14×16 → MaxPool (2×2, S=2) → 7×7×16 → Flatten → 784 → FC → 10 classes
Volume flow for a CNN on 28×28 grayscale images. Spatial dimensions shrink at each pooling layer; depth grows at each conv layer (shown as stacked blue layers). After two Conv+Pool blocks, the 7×7×16 volume is flattened to 784 values and classified by a fully-connected layer.

Worked example — DigitDataset CNN

Layer | Output volume \([H \times W \times D]\) | Parameters
Input image | \([28 \times 28 \times 1]\) | 0
Conv (F=3, K=8, P=1, S=1) + BN + ReLU | \([28 \times 28 \times 8]\) | \(3{\times}3{\times}1{\times}8 + 8 = 80\)
MaxPool 2×2, S=2 | \([14 \times 14 \times 8]\) | 0
Conv (F=3, K=16, P=1, S=1) + BN + ReLU | \([14 \times 14 \times 16]\) | \(3{\times}3{\times}8{\times}16 + 16 = 1{,}168\)
MaxPool 2×2, S=2 | \([7 \times 7 \times 16]\) | 0
Flatten | \([784]\) | 0
FC → 10 + Softmax | \([10]\) | \(784{\times}10 + 10 = 7{,}850\)
Total | | ≈9,100
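The table's bookkeeping can be reproduced mechanically. A plain-Python sketch (batch-norm scale/shift parameters are omitted, as in the table; the lesson's code is MATLAB):

```python
def conv_out(H, F, P, S):
    """floor((H - F + 2P) / S) + 1"""
    return (H - F + 2 * P) // S + 1

H, D, total = 28, 1, 0              # input volume: 28 x 28 x 1
for K in (8, 16):                   # the two Conv+BN+ReLU blocks (F=3, P=1, S=1)
    total += 3 * 3 * D * K + K      # filter weights plus biases
    H = conv_out(H, 3, 1, 1) // 2   # 'same' conv, then 2x2 max pool, stride 2
    D = K
total += H * H * D * 10 + 10        # flatten (H*H*D values) -> FC to 10 classes
print(H, D, total)                  # 7 16 9098, i.e. the table's ~9,100
```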

Parameter Efficiency: CNN vs. Fully-Connected

Recall the L36 network: featureInputLayer(784) → FC(10) → FC(10) required flattening the image and had \(784 \times 10 + 10 + 10 \times 10 + 10 = 7{,}960\) parameters (weights plus biases) but still struggled because FC layers ignore spatial structure.

Layer type | Input | Parameters | Spatial awareness
FC (784 → 10) | 784 flattened pixels | 7,850 | None — every pixel treated independently
Conv (3×3, 8 filters, on 28×28) | 28×28×1 image | 80 | Yes — same filter applied at every position
Full CNN (two conv blocks) | 28×28×1 image | ≈9,100 | Yes — hierarchical feature extraction
Key insight: The CNN has slightly more total parameters than the L36 FC network, but those parameters are used far more efficiently — each filter weight is shared across all \(H \times W\) spatial positions, and the architecture explicitly encodes the spatial inductive bias of image data. The result is dramatically better accuracy on the same dataset.
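To put weight sharing in numbers, compare the first conv layer against a hypothetical FC layer producing the same 28×28×8 output directly from the 784 input pixels (plain-Python arithmetic; the FC layer is purely illustrative, not part of the lesson's network):

```python
# First conv layer of the worked example: 8 filters of size 3x3x1, reused at
# every spatial position
n_conv = 3 * 3 * 1 * 8 + 8                  # 80 parameters

# Hypothetical FC layer mapping 784 inputs to the same 28*28*8 = 6,272 outputs:
# every output unit gets its own 784 weights plus a bias
n_fc = 784 * (28 * 28 * 8) + 28 * 28 * 8    # 4,923,520 parameters

print(n_conv, n_fc)   # 80 vs 4923520
```

The conv layer achieves the same output shape with roughly 60,000× fewer parameters, precisely because one 3×3 filter is shared across all 28×28 positions.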

🎬 Recommended Videos

Watch these two videos before class. The first gives an excellent visual walkthrough of how CNNs work end-to-end; the second covers CNN architecture, training, and design choices in depth from a university course perspective.

🖥 Interactive tool: CNN Explainer lets you run a small CNN in your browser and watch activations, filters, and feature maps update in real time as you change the input image. Great for building intuition before the homework.

💻 Worked Example — CNN on DigitDataset

This example mirrors the Lesson 36 FC network but replaces every featureInputLayer + fullyConnectedLayer with proper CNN layers. The key difference: images are no longer flattened. The network sees the full 28×28 image and can exploit spatial structure.

Three parameters at the top control the experiment. The defaults are chosen to show good generalization. To make a fair parameter-count comparison against the L36 FC network, run the Lesson 36 example with n_hidden = 11, learn_rate = 0.001, and n_per_class = 200 (50 epochs). That FC network has \(784 \times 11 + 11 + 11 \times 10 + 10 = 8{,}755\) parameters — roughly equal to this CNN’s ≈9,100. Compare accuracy and Training Progress plots: the CNN should win decisively despite similar parameter counts, because it exploits the spatial structure of images.

%% Lesson 37 — CNN on MATLAB DigitDataset
% Parallel to L36 but uses convolutional layers: no flattening needed.
% Architecture: imageInputLayer → Conv+BN+ReLU → Pool → Conv+BN+ReLU → Pool → FC → Softmax
clear; clc;

%% ── Parameters to experiment with ──────────────────────────────────────────
n_per_class = 200;   % images per digit class     (max 1000; try 100–500)
n_filters   = 8;     % filters in first Conv layer (second gets 2×n_filters)
learn_rate  = 0.01;  % Adam learning rate          (try 0.001, 0.01, 0.05)
%% ─────────────────────────────────────────────────────────────────────────

%% 1. Load dataset — no flattening: images stay as 28×28 arrays
digitPath = fullfile(matlabroot,'toolbox','nnet','nndemos', ...
                     'nndatasets','DigitDataset');
imds = imageDatastore(digitPath, ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
rng(356);
imds_sub = splitEachLabel(imds, n_per_class, 'randomize');
[imds_train, imds_test] = splitEachLabel(imds_sub, 0.8, 'randomize');

fprintf('Train: %d  |  Test: %d images\n', ...
    numel(imds_train.Files), numel(imds_test.Files));

%% 2. Define CNN architecture
% imageInputLayer accepts full images — the network learns spatial features
layers = [
    imageInputLayer([28 28 1], 'Normalization', 'rescale-zero-one', ...
                               'Name', 'input')

    % Block 1: 28×28×1 → 28×28×n_filters → 14×14×n_filters
    convolution2dLayer(3, n_filters, 'Padding', 'same', 'Name', 'conv1')
    batchNormalizationLayer('Name', 'bn1')
    reluLayer('Name', 'relu1')
    maxPooling2dLayer(2, 'Stride', 2, 'Name', 'pool1')

    % Block 2: 14×14×n_filters → 14×14×(2*n_filters) → 7×7×(2*n_filters)
    convolution2dLayer(3, 2*n_filters, 'Padding', 'same', 'Name', 'conv2')
    batchNormalizationLayer('Name', 'bn2')
    reluLayer('Name', 'relu2')
    maxPooling2dLayer(2, 'Stride', 2, 'Name', 'pool2')

    % Classifier
    fullyConnectedLayer(10,  'Name', 'fc')
    softmaxLayer('Name', 'softmax')
    classificationLayer('Name', 'output')
];

%% 3. Set training options (same API as L36)
opts = trainingOptions('adam', ...
    'MaxEpochs',           20, ...
    'MiniBatchSize',       64, ...
    'InitialLearnRate',    learn_rate, ...
    'ValidationData',      imds_test, ...
    'ValidationFrequency', 5, ...
    'Plots',               'training-progress', ...
    'Verbose',             false);

%% 4. Train — pass imageDatastore directly (no manual flattening!)
net = trainNetwork(imds_train, layers, opts);

%% 5. Evaluate
Y_pred = classify(net, imds_test);
Y_true = imds_test.Labels;
acc    = 100 * mean(Y_pred == Y_true);
fprintf('n_per_class=%d  n_filters=%d  lr=%.4f  Test acc: %.1f%%\n', ...
    n_per_class, n_filters, learn_rate, acc);

%% 6. Visualize 25 random test examples in a 5×5 grid
%  Test images are stored in class order, so randperm gives a varied mix.
imgs_test = readall(imds_test);
idx = randperm(numel(imgs_test), 25);
figure('Name', 'CNN Test Predictions', 'NumberTitle', 'off');
for k = 1:25
    subplot(5,5,k);
    imshow(imgs_test{idx(k)});
    pred_str = char(Y_pred(idx(k)));
    true_str = char(Y_true(idx(k)));
    if strcmp(pred_str, true_str)
        title(sprintf('Pred: %s', pred_str), 'Color', 'g', 'FontSize', 11);
    else
        title(sprintf('Pred:%s (T:%s)', pred_str, true_str), ...
              'Color', 'r', 'FontSize', 11);
    end
end
sgtitle('CNN Test Predictions  (green = correct  |  red = wrong)');
What does batchNormalizationLayer do? Batch normalization normalizes the activations within each mini-batch to have zero mean and unit variance, then applies learned scale and shift parameters. This stabilizes training, allows higher learning rates, and reduces sensitivity to weight initialization. It has largely replaced dropout in modern CNN architectures.
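A one-feature sketch of what batch normalization computes (plain Python for illustration; gamma and beta stand in for the learned scale and shift, and the batch values are made up):

```python
import math

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch of scalar activations to zero mean and unit
    variance, then apply the learned scale (gamma) and shift (beta)."""
    m = sum(x) / len(x)                              # batch mean
    v = sum((xi - m) ** 2 for xi in x) / len(x)      # batch variance
    return [gamma * (xi - m) / math.sqrt(v + eps) + beta for xi in x]

batch = [2.0, 4.0, 6.0, 8.0]   # pre-activation values for one feature
y = batch_norm(batch)          # zero mean, unit variance (up to eps)
```

In a real conv layer the same normalization is applied per channel across the batch and all spatial positions, with one (gamma, beta) pair learned per channel.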
Why no flattening? imageInputLayer accepts full 2-D images. The CNN layers operate on the \(H\times W\times C\) volume directly. MATLAB flattens automatically before the fullyConnectedLayer. This is a major difference from the L36 FC approach, where you had to manually compute [v{:}]' to get an \(N\times784\) matrix.

📝 Summary

Concept | Key idea
Local connectivity | Each neuron connects to only a small spatial region; captures nearby correlations efficiently
Parameter sharing | Same filter weights used at every position; one edge detector works everywhere in the image
Convolution output size | \(\lfloor(H - F + 2P)/S\rfloor + 1\); same padding (\(P = \lfloor F/2\rfloor\)) preserves spatial size
Feature hierarchy | Early layers: edges; middle: textures; deep: object parts — all learned from data
Max pooling | Take regional maxima; halves spatial size; provides translation invariance; zero extra parameters
Volume notation | Track \([H\times W\times D]\) at each layer; \(H,W\) shrink at pool layers; \(D\) grows at conv layers
MATLAB CNN | imageInputLayer + convolution2dLayer + maxPooling2dLayer; pass imageDatastore directly to trainNetwork
vs. FC (L36) | CNN: same or fewer parameters, but spatially aware → higher accuracy on image tasks
📋 HW 37 — CNN on DigitDataset

📚 References

  1. Karpathy, A. et al. CS231n: Convolutional Neural Networks for Visual Recognition — Lecture 5. Stanford University. cs231n.github.io/convolutional-networks/
  2. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
  3. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. NeurIPS 2012.
  4. Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. ECCV 2014.