Homework Assignment

HW 36 — Training Parameter Exploration

📘 Related: Lesson 36 🛠 MATLAB + Deep Learning Toolbox required 📁 Dataset: MATLAB DigitDataset

🌍 Background

In the Lesson 36 worked example you trained a small fully-connected network on handwritten digits. The default parameters — a high learning rate, small network, and limited data — produce an overfitting pattern in the Training Progress plot: training accuracy climbs while validation accuracy lags. In this assignment you will systematically vary those parameters to diagnose and correct each of the training curve patterns discussed in the lesson.

Use the same dataset loading, flattening, and Deep Learning Toolbox layer API (featureInputLayer, fullyConnectedLayer, trainingOptions, trainNetwork) from the Lesson 36 example. Use 200 images per class (2,000 total) and an 80/20 train/test split unless a problem specifies otherwise. Set rng(356) at the top of your script so your data split is reproducible.
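For reference, the shared setup can be sketched as below. This follows the flattening pattern used in the Problem 2 starter code; the `DigitDataset` path shown is the one shipped with MATLAB, but adapt it (and any variable names) to match your own Lesson 36 script.

```matlab
% Sketch of the shared setup (adapt to your Lesson 36 script)
rng(356);                                    % reproducible data split
digitPath = fullfile(matlabroot, 'toolbox', 'nnet', ...
    'nndemos', 'nndatasets', 'DigitDataset');
imds = imageDatastore(digitPath, ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
imds = splitEachLabel(imds, 200, 'randomize');                    % 200 per class
[imds_train, imds_test] = splitEachLabel(imds, 0.8, 'randomize'); % 80/20 split

% Flatten each 28x28 grayscale image into a 784-element feature row
imgs    = readall(imds_train);
v       = cellfun(@(I) double(I(:))/255, imgs, 'UniformOutput', false);
X_train = [v{:}]';
Y_train = imds_train.Labels;
imgs_t  = readall(imds_test);
v_t     = cellfun(@(I) double(I(:))/255, imgs_t, 'UniformOutput', false);
X_test  = [v_t{:}]';
Y_test  = imds_test.Labels;
```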

Problem 1 — Fixing Underfitting: Network Size

  1. Train your network with at least four hidden-layer sizes: 10, 50, 100, and 200 neurons. Use 'adam' as your optimizer, 50 max epochs, and an 'InitialLearnRate' of 0.001 (the Adam default — a stable starting point). For each run, record the test accuracy.
  2. For each architecture, examine the Training Progress window and report what you observe. Does the pattern change as the network grows? At what hidden-layer size, if any, do you see a transition in behavior?
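One way to organize these runs is a loop over hidden sizes. The sketch below assumes the `X_train`/`Y_train`/`X_test`/`Y_test` variables from the shared Lesson 36 setup; the layer stack follows the same pattern as the Problem 2 starter code.

```matlab
hidden_sizes = [10 50 100 200];
acc = zeros(size(hidden_sizes));
for k = 1:numel(hidden_sizes)
    layers = [
        featureInputLayer(784)
        fullyConnectedLayer(hidden_sizes(k))   % vary the hidden layer size
        reluLayer
        fullyConnectedLayer(10)
        softmaxLayer
        classificationLayer
    ];
    opts = trainingOptions('adam', ...
        'MaxEpochs',        50, ...
        'InitialLearnRate', 0.001, ...
        'ValidationData',   {X_test, Y_test}, ...
        'Plots',            'training-progress', ...
        'Verbose',          false);
    net    = trainNetwork(X_train, Y_train, layers, opts);
    Y_pred = classify(net, X_test);
    acc(k) = 100 * mean(Y_pred == Y_test);
    fprintf('hidden=%3d  Test acc: %.1f%%\n', hidden_sizes(k), acc(k));
end
```

Keep each Training Progress window open (or screenshot it) before starting the next run, since `trainNetwork` opens a new window per run.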

Problem 2 — Inducing and Controlling Overfitting

  1. Using a very large network on limited data is a reliable way to produce overfitting. Create a small subset (30 images per class, 300 total) and train a large network on it. Run the experiment twice — once with early stopping active ('ValidationPatience', 6) and once with it disabled ('ValidationPatience', Inf). Use 200 max epochs and 'adam' with its default learning rate.
    rng(356);
    imds_small = splitEachLabel(imds, 30, 'randomize');
    [imds_s_train, imds_s_test] = splitEachLabel(imds_small, 0.8, 'randomize');
    
    % Flatten (same pattern as Lesson 36 example)
    imgs_s    = readall(imds_s_train);
    v_s       = cellfun(@(I) double(I(:))/255, imgs_s, 'UniformOutput', false);
    X_s_train = [v_s{:}]';
    imgs_st   = readall(imds_s_test);
    v_st      = cellfun(@(I) double(I(:))/255, imgs_st, 'UniformOutput', false);
    X_s_test  = [v_st{:}]';
    Y_s_train = imds_s_train.Labels;
    Y_s_test  = imds_s_test.Labels;
    
    layers_big = [
        featureInputLayer(784)
        fullyConnectedLayer(200)
        reluLayer
        dropoutLayer(0.3)   % note: dropout partially counteracts the overfitting you are inducing
        fullyConnectedLayer(10)
        softmaxLayer
        classificationLayer
    ];
    
    % Run 1: early stopping active
    opts1 = trainingOptions('adam', 'MaxEpochs', 200, ...
        'ValidationData',     {X_s_test, Y_s_test}, ...
        'ValidationPatience', 6, ...
        'Plots', 'training-progress', 'Verbose', false);
    net1 = trainNetwork(X_s_train, Y_s_train, layers_big, opts1);
    
    % Run 2: early stopping disabled
    opts2 = trainingOptions('adam', 'MaxEpochs', 200, ...
        'ValidationData',     {X_s_test, Y_s_test}, ...
        'ValidationPatience', Inf, ...
        'Plots', 'training-progress', 'Verbose', false);
    net2 = trainNetwork(X_s_train, Y_s_train, layers_big, opts2);
    
  2. Examine the Training Progress plot for Run 2 (no early stopping). Do you see the overfitting pattern? At approximately which epoch does validation accuracy reach its maximum?
  3. Compare the final test accuracy of Run 1 (early stopping) vs. Run 2 (no early stopping). Does early stopping help? Why?
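For the comparison in part 3, you can evaluate both trained networks on the same held-out set (a sketch, using the variables defined in the starter code above):

```matlab
% Compare the two runs on the held-out set
Y_pred1 = classify(net1, X_s_test);   % early stopping active
Y_pred2 = classify(net2, X_s_test);   % early stopping disabled
fprintf('Early stopping:     %.1f%%\n', 100 * mean(Y_pred1 == Y_s_test));
fprintf('No early stopping:  %.1f%%\n', 100 * mean(Y_pred2 == Y_s_test));
```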

Problem 3 — Effect of Learning Rate

  1. Return to the full 200-per-class dataset and your best architecture from Problem 1. Switch to SGD with momentum ('sgdm') and try at least four learning rates (e.g., 0.001, 0.01, 0.1, 0.5). Use 100 max epochs, mini-batch size 64, and momentum 0.9. For each run record the test accuracy and describe the shape of the Training Progress curve.
    lr = 0.01;   % vary this
    opts = trainingOptions('sgdm', ...
        'MaxEpochs',        100, ...
        'MiniBatchSize',    64, ...
        'InitialLearnRate', lr, ...
        'Momentum',         0.9, ...
        'ValidationData',   {X_test, Y_test}, ...
        'Plots',            'training-progress', ...
        'Verbose',          false);
    net    = trainNetwork(X_train, Y_train, layers, opts);
    Y_pred = classify(net, X_test);
    acc    = 100 * mean(Y_pred == Y_test);
    fprintf('lr=%.3f  Test acc: %.1f%%\n', lr, acc);
    
  2. For the learning rate that gave the best accuracy, include the Training Progress plot in your submission. Which of the four curve patterns from Figure 2 of the lesson does it most resemble?
  3. Now retrain with 'adam' (keeping all other options the same, and removing the explicit InitialLearnRate so it uses the Adam default of 0.001). Compare its test accuracy and convergence speed to your best 'sgdm' result. Was Adam better or worse? Explain in 2–3 sentences what advantage Adam provides even if its peak accuracy on this problem is lower than a well-tuned SGDM.
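A sketch of the Adam rerun for part 3 is below. Note that "keeping all other options the same" means dropping both 'InitialLearnRate' and 'Momentum' — the latter is an SGDM-only option that trainingOptions rejects for 'adam'. The `layers` variable is assumed to be your best architecture from Problem 1.

```matlab
% Same options as the sgdm run, minus sgdm-only and learn-rate settings
opts_adam = trainingOptions('adam', ...
    'MaxEpochs',      100, ...
    'MiniBatchSize',  64, ...
    'ValidationData', {X_test, Y_test}, ...
    'Plots',          'training-progress', ...
    'Verbose',        false);
net_adam = trainNetwork(X_train, Y_train, layers, opts_adam);
Y_pred   = classify(net_adam, X_test);
fprintf('adam  Test acc: %.1f%%\n', 100 * mean(Y_pred == Y_test));
```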