HW 38 — CNNs, Transformers & Large Language Models
Overview
This final ML assignment focuses on conceptual understanding of modern architectures — CNNs, transformers, and LLMs. Rather than implementing full backpropagation (you did that in HW 35), you will reason about these architectures mathematically, trace computations by hand for small examples, and reflect on their applications to physics and space domain awareness.
An optional MATLAB section allows you to experiment with MATLAB's Deep Learning Toolbox for a simple CNN — this is encouraged but not required.
Part A — Convolutional Neural Networks
A1. Conceptual
- Explain in your own words: why does parameter sharing in a convolutional layer reduce the number of parameters compared to a fully-connected layer? Give a specific numerical example: for a \(32 \times 32\) input, compare the parameter count of a fully-connected layer with 32 output neurons against a convolutional layer with 32 filters of size \(3 \times 3\).
- What does max pooling do to (a) spatial dimensions and (b) the number of channels? Why does pooling help a CNN generalize to slightly shifted or scaled inputs?
- A CNN trained on radar satellite imagery achieves 95% accuracy on training data but only 72% on test data. List three possible causes and one mitigation strategy for each.
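As a sanity check on the parameter-counting question, the comparison can be sketched in a few lines of Python. The 32×32 input, 32-output FC layer, and 32 filters of size 3×3 come from the question above; the single-channel assumption is ours:

```python
# Parameter counts for the 32x32 example in A1 (single channel assumed).
H, W, C = 32, 32, 1      # input height, width, channels
n_out = 32               # fully-connected output neurons
n_filters, F = 32, 3     # conv layer: 32 filters of size 3x3

# Fully connected: every input pixel connects to every output neuron.
fc_params = (H * W * C) * n_out + n_out      # weights + biases

# Convolutional: each filter is reused at every spatial position,
# so the count depends only on filter size, not input size.
conv_params = n_filters * (F * F * C + 1)    # weights + one bias each

print(fc_params)    # 32800
print(conv_params)  # 320
```

The two orders of magnitude between the counts is exactly the effect of parameter sharing.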
A2. Hand Calculation
Perform the following convolution by hand. Show all work.
Input (4×4, zero-padded to 6×6 with pad=1):
3×3 filter (stride 1):
- Compute the output feature map (4×4). What are the dimensions of the output given input \(H \times W\), filter \(F \times F\), padding \(P\), stride \(S = 1\)? Derive the formula.
- Apply a 2×2 max pool to your feature map. What is the output size?
- The Laplacian filter \(K\) above detects edges (regions of rapid intensity change). If this filter is the learned weight of a convolutional layer, what does it mean for the network to have "learned" this filter? In what layer (early vs. late) would you expect to find this type of filter in a deep CNN trained on images?
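You can verify your hand calculation with a short pure-Python sketch. The input matrix and filter values below are hypothetical placeholders (a standard Laplacian kernel and an arbitrary 4×4 input); substitute the matrices given in the assignment:

```python
# Illustrative 2D convolution + 2x2 max pool in pure Python.
# Input and filter values are hypothetical; swap in the assignment's matrices.

def out_size(H, F, P, S=1):
    """Output dimension: (H - F + 2P) / S + 1."""
    return (H - F + 2 * P) // S + 1

def pad(img, P):
    n = len(img)
    z = [[0] * (n + 2 * P) for _ in range(n + 2 * P)]
    for i in range(n):
        for j in range(n):
            z[i + P][j + P] = img[i][j]
    return z

def conv2d(img, k, P=1, S=1):
    F = len(k)
    x = pad(img, P)
    m = out_size(len(img), F, P, S)
    return [[sum(x[i*S + a][j*S + b] * k[a][b]
                 for a in range(F) for b in range(F))
             for j in range(m)] for i in range(m)]

def maxpool2(fm):
    m = len(fm) // 2
    return [[max(fm[2*i][2*j], fm[2*i][2*j+1],
                 fm[2*i+1][2*j], fm[2*i+1][2*j+1])
             for j in range(m)] for i in range(m)]

x = [[1, 2, 3, 4],        # hypothetical 4x4 input
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
lap = [[0, -1, 0],        # standard Laplacian kernel
       [-1, 4, -1],
       [0, -1, 0]]

y = conv2d(x, lap)        # 4x4: (4 - 3 + 2*1)/1 + 1 = 4
p = maxpool2(y)           # 2x2 after a 2x2 max pool
print(out_size(4, 3, 1), len(y), len(p))   # 4 4 2
```

Note how `out_size` encodes the formula you are asked to derive, and how the pooled map halves each spatial dimension while leaving the channel count untouched.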
Part B — The Attention Mechanism
B1. Conceptual
- Explain the Query-Key-Value metaphor in your own words. What does the dot product \(\mathbf{q}_i \cdot \mathbf{k}_j\) measure? What does the softmax do to these dot products?
- Why do we divide by \(\sqrt{d_k}\) before the softmax? What would happen if \(d_k = 512\) and we didn't scale? (Hint: consider what large dot products do to softmax outputs.)
- In self-attention, a sequence attends to itself. For a sentence like "The satellite maneuvered and it changed altitude," describe qualitatively what a well-trained attention head might learn: which token would "it" attend to most strongly, and why?
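For the scaling question, a small numerical experiment makes the point concrete. The score values below are hypothetical; with \(d_k = 512\), unscaled dot products of roughly independent unit-variance components have variance near 512, so magnitudes in the tens are typical:

```python
import math

def softmax(a):
    m = max(a)                        # subtract max for numerical stability
    e = [math.exp(x - m) for x in a]
    s = sum(e)
    return [x / s for x in e]

d_k = 512
scores = [30.0, 20.0, 10.0]           # hypothetical unscaled dot products

unscaled = softmax(scores)
scaled = softmax([s / math.sqrt(d_k) for s in scores])

# Unscaled: softmax saturates, putting essentially all weight on one
# token; gradients through the other positions vanish.
print([round(w, 3) for w in unscaled])   # [1.0, 0.0, 0.0]
# Scaled by sqrt(d_k): a soft distribution over all three tokens.
print([round(w, 3) for w in scaled])
```

The saturated case is the failure mode the \(\sqrt{d_k}\) factor is designed to prevent.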
B2. Small Attention Calculation
Consider a sequence of 3 tokens with embedding dimension \(d_k = 2\). Given:
- Compute the scaled dot-product matrix \(\mathbf{A} = \frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\) (3×3).
- Apply softmax row-wise to \(\mathbf{A}\) to get the attention weights \(\hat{\mathbf{A}}\). (You may use the formula \(\text{softmax}(\mathbf{a})_i = e^{a_i}/\sum_j e^{a_j}\); compute numerically to 3 decimal places.)
- Compute the output \(\hat{\mathbf{A}}\mathbf{V}\) (3×2). Interpret: what is the output for token 3 (the last row)? Which input token does it most closely resemble, and why?
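The three steps of B2 can also be organized as code to check your arithmetic. The Q, K, V matrices below are hypothetical 3×2 placeholders; substitute the values given in the problem:

```python
import math

def softmax(row):
    m = max(row)
    e = [math.exp(x - m) for x in row]
    s = sum(e)
    return [x / s for x in e]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col))
             for col in zip(*B)] for row in A]

d_k = 2
# Hypothetical Q, K, V (3 tokens, d_k = 2); use the assignment's matrices.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

KT = [list(col) for col in zip(*K)]
A = [[x / math.sqrt(d_k) for x in row] for row in matmul(Q, KT)]  # scaled scores
A_hat = [softmax(row) for row in A]      # attention weights; each row sums to 1
out = matmul(A_hat, V)                   # 3x2 output

for row in A_hat:
    print([round(w, 3) for w in row])
for row in out:
    print([round(v, 3) for v in row])
```

Each row of the output is a convex combination of the rows of \(\mathbf{V}\), weighted by that token's attention distribution, which is exactly the interpretation asked for in the last bullet.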
Part C — Large Language Models
- Describe the pre-training task for a decoder-only LLM (like GPT). Why is this task "self-supervised"? What does "self-supervised" mean in the context of the ML pipeline from Lesson 33?
- GPT-3 has 175 billion parameters. Assuming each parameter is stored as a 16-bit float (2 bytes), how much memory is required to store the model weights alone? Express in gigabytes. (For reference: a modern GPU has ~80 GB VRAM.)
- An LLM is sometimes described as performing "in-context learning" — giving it a few examples in the prompt improves its performance on a new task. How does this differ from gradient-descent training? Which LLM parameters change during in-context learning?
- Propose a specific application of an LLM or transformer in space domain awareness or physics research. Describe: (a) what task it would perform, (b) what data it would be trained or fine-tuned on, and (c) one major challenge or limitation for this application.
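The memory question above reduces to a single multiplication; here is a quick check, using the decimal convention 1 GB = 10^9 bytes that GPU vendors quote:

```python
params = 175e9           # GPT-3 parameter count
bytes_per_param = 2      # 16-bit (float16) storage

total_bytes = params * bytes_per_param
gb = total_bytes / 1e9   # decimal gigabytes

print(gb)                # 350.0 -- well beyond a single ~80 GB GPU
```

Serving a model this size therefore requires sharding the weights across multiple devices, before even counting activations or optimizer state.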
Part D — Module Reflection
- Trace the full ML pipeline (from Lesson 33) as it applies to training a CNN to classify satellite imagery. For each step — data, features, model, loss, optimization, evaluation — give a concrete description specific to this application. Which step do you think is most difficult in practice, and why?
- In this module you studied four building blocks: the linear classifier (L34), gradient descent (L35), neural networks (L36–37), and CNNs/transformers (L38). For each of the following real-world problems, identify the most appropriate model and justify your choice:
- Predicting orbital period from 3 orbital parameters (regression).
- Classifying a 256×256 radar image of a satellite as debris vs. active.
- Modeling 10-step temporal sequences of satellite telemetry to detect anomalies.
- Predicting satellite drag coefficient from 50 tabular sensor features.
Optional MATLAB CNN Demo
If you have the MATLAB Deep Learning Toolbox, implement a simple CNN using MATLAB's layer API and train it on the built-in DigitDataset (handwritten digit recognition). This gives you hands-on experience with a production-grade CNN framework.
%% Optional HW38 — CNN with MATLAB Deep Learning Toolbox
% Requires: Deep Learning Toolbox

% Load built-in digit dataset
digitDatasetPath = fullfile(matlabroot,'toolbox','nnet','nndemos', ...
    'nndatasets','DigitDataset');
imds = imageDatastore(digitDatasetPath, ...
    'IncludeSubfolders',true,'LabelSource','foldernames');
[imdsTrain, imdsTest] = splitEachLabel(imds, 0.8, 'randomized');

% Define CNN architecture
layers = [
    imageInputLayer([28 28 1])
    convolution2dLayer(3, 16, 'Padding','same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2, 'Stride',2)
    convolution2dLayer(3, 32, 'Padding','same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2, 'Stride',2)
    fullyConnectedLayer(128)
    reluLayer
    fullyConnectedLayer(10)
    softmaxLayer
    classificationLayer
];

% Training options
opts = trainingOptions('sgdm', ...
    'MaxEpochs', 10, ...
    'MiniBatchSize', 128, ...
    'InitialLearnRate', 0.01, ...
    'Plots', 'training-progress', ...
    'Verbose', false);

% Train the network
net = trainNetwork(imdsTrain, layers, opts);

% Evaluate on the held-out test set
y_pred = classify(net, imdsTest);
acc = mean(y_pred == imdsTest.Labels);
fprintf('Test accuracy: %.2f%%\n', acc*100);
For full bonus credit: (1) plot the test confusion matrix, (2) visualize the learned filters of the first convolutional layer, and (3) show 5 misclassified examples with their predicted vs. true labels.