Lesson 35 — Neural Networks

Learning Objectives

Describe a feedforward neural network in terms of layers, neurons, weights, biases, and activations.
Write the forward pass equations for a multi-layer network in both scalar and matrix form.
Compare common activation functions: sigmoid, tanh, and ReLU, and explain why ReLU is preferred.
State the Universal Approximation Theorem and understand its significance and limitations.
Implement the forward pass of a shallow (1 hidden layer) neural network in MATLAB.

Notation used in this lesson

\(\mathbf{x}\)Input feature vector, \(\mathbf{x} \in \mathbb{R}^d\) where \(d\) is the number of features

\(L\)Number of layers (subscripts 1…L index each layer)

\(\mathbf{W}_\ell\)Weight matrix, layer \(\ell\)

\(\mathbf{b}_\ell\)Bias vector, layer \(\ell\)

\(\mathbf{z}_\ell\)Pre-activation (linear combination)

\(\mathbf{a}_\ell\)Activation (post-activation)

\(n_\ell\)Number of neurons in layer \(\ell\)

\(\phi(\cdot)\)Activation function

\(\hat{\mathbf{y}}\)Network output (predictions)

Subscript convention: Subscript \(\ell\) denotes the layer index (e.g. \(\mathbf{W}_1\) is the first layer's weight matrix, \(\mathbf{W}_2\) the second, etc.). This avoids superscript clutter when equations also involve powers or transposes.

🌍 Background & Motivation

The limitation of linear models

The linear classifier from Lesson 34 is powerful but fundamentally limited: it can only learn linear decision boundaries. Real data is rarely linearly separable. Satellite orbit families, material phases, particle signatures — these require curved, complex boundaries.

A naive fix might be to simply stack multiple linear layers. But this does not help: the composition of linear transformations is itself a linear transformation. No matter how many layers you add, the model collapses to a single matrix multiply — equivalent to one linear layer with different weights. Stacking linear layers buys you nothing.

Why nonlinearity is essential: A neural network interleaves linear transformations with a nonlinear activation function \(\phi\). Because \(\phi\) is nonlinear, the composition \(\phi(\mathbf{W}_2 \phi(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2)\) cannot collapse to a single linear layer — it genuinely increases the model's expressive power with each added layer.

Neural networks exploit this by stacking many such layers. The composition of these layers can represent arbitrarily complex functions — a result formalized by the Universal Approximation Theorem.

Biological inspiration (and where it ends)

Artificial neurons are loosely inspired by biological neurons: they receive inputs, compute a weighted sum, and fire (activate) if the sum exceeds a threshold. However, modern neural networks are best thought of as parameterized function approximators, not brain simulations. The biological analogy is more historical than mechanistic.

🧠 Key Concepts

1. Network Architecture

A feedforward neural network (also called a multi-layer perceptron, MLP) consists of:

Input layer: receives the feature vector \(\mathbf{x} \in \mathbb{R}^d\). No computation here.
Hidden layers: one or more layers where learning happens. Each neuron receives all outputs from the previous layer.
Output layer: produces the prediction \(\hat{\mathbf{y}}\). For regression: linear output. For classification: sigmoid (binary) or softmax (multi-class).

We describe the architecture by listing the number of neurons per layer. Example: a network with input \(d=10\), one hidden layer of 64 neurons, and output of 1 is written as [10 → 64 → 1].

Mathematical model of a single neuron: weighted inputs summed with bias, then passed through activation function — Figure 1: A single artificial neuron. Inputs x_i are multiplied by learned weights w_i, summed with bias b to form the pre-activation z, then passed through a nonlinear activation function f to produce output a. (Source: CS231n Lecture 4.)

2-layer neural network with 3 inputs, 4 hidden neurons, and 2 outputs — Figure 2a: A 2-layer network [3 → 4 → 2].

3-layer neural network with 3 inputs, two hidden layers of 4 neurons each, and 1 output — Figure 2b: A 3-layer network [3 → 4 → 4 → 1].

Every neuron in one layer connects to every neuron in the next (“fully connected”). CS231n counts layers by the number of weight matrices, so these are called 2-layer and 3-layer networks respectively (the input layer is not counted). (Source: CS231n Lecture 4.)

2. The Forward Pass (Single Layer)

For layer \(\ell\), the computation is:

\underbrace{\mathbf{z}_\ell}_{\text{pre-activation}} = \mathbf{W}_\ell \mathbf{a}_{\ell-1} + \mathbf{b}_\ell \qquad \longrightarrow \qquad \underbrace{\mathbf{a}_{\ell-1}}_{\text{activation}} = \phi\!\left(\mathbf{z}_{\ell-1}\right)

Here, \(\mathbf{a}_{\ell-1}\) is the activation output from the previous layer (for the first hidden layer, \(\mathbf{a}_0 = \mathbf{x}\)), \(\mathbf{W}_\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}\) is the weight matrix, \(\mathbf{b}_\ell \in \mathbb{R}^{n_\ell}\) is the bias vector, and \(\phi\) is the nonlinear activation function applied element-wise.

3. Full Forward Pass (Two-Layer Network)

A network with one hidden layer and a binary classification output:

\mathbf{a}_0 = \mathbf{x} \quad \text{(input)} \] \[ \mathbf{z}_1 = \mathbf{W}_1 \mathbf{a}_0 + \mathbf{b}_1 \] \[ \mathbf{a}_1 = \text{ReLU}(\mathbf{z}_1) \] \[ \mathbf{z}_2 = \mathbf{W}_2 \mathbf{a}_1 + \mathbf{b}_2 \] \[ \hat{y} = \phi(\mathbf{z}_2)

The network applies a learned nonlinear transformation from \(\mathbf{x}\) to \(\hat{y}\) in three matrix operations. The power comes from learning \(\mathbf{W}_1, \mathbf{b}_1, \mathbf{W}_2, b_2\) jointly.

4. Activation Functions

Many different nonlinear functions can serve as \(\phi\). The choice affects training stability, gradient flow, and model performance. Some of the most commonly used activation functions are shown below, ordered by prevalence in modern practice.

Function	Formula	Range	Common Use
ReLU	\(\max(0, z)\)	\([0,\infty)\)	Hidden layers (modern default)
Sigmoid	\(\sigma(z) = \frac{1}{1+e^{-z}}\)	\((0,1)\)	Binary output layer
Softmax	\(\frac{e^{z_k}}{\sum_j e^{z_j}}\)	\((0,1)\), sums to 1	Multi-class output layer
Tanh	\(\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\)	\((-1,1)\)	Hidden layers (older architectures)

ReLU activation function — ReLU — max(0, z)

Sigmoid activation function — Sigmoid — squashes to (0, 1)

Tanh activation function — Tanh — squashes to (−1, 1)

Figure 3: Common activation functions. Sigmoid and tanh saturate (flat slope) for large |z|, causing vanishing gradients. ReLU has slope exactly 1 for z > 0, making it the modern default for hidden layers. (Source: CS231n.)

Why ReLU? Sigmoid and tanh saturate (gradient ≈ 0) for large inputs — the “vanishing gradient” problem that makes deep networks hard to train. ReLU has gradient exactly 1 for positive inputs, enabling training of much deeper networks.

5. Parameter Count

Every weight matrix and bias vector in the network contains learnable parameters — values adjusted during training. For each layer \(\ell\), the weight matrix \(\mathbf{W}_\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}\) contributes \(n_\ell \cdot n_{\ell-1}\) parameters, and the bias vector \(\mathbf{b}_\ell \in \mathbb{R}^{n_\ell}\) contributes \(n_\ell\) more. For a three-layer network \(d \to n_1 \to n_2 \to n_\text{out}\), the total is:

\text{Total params} = \underbrace{d \cdot n_1 + n_1}_{\mathbf{W}_1,\mathbf{b}_1} + \underbrace{n_1 \cdot n_2 + n_2}_{\mathbf{W}_2,\mathbf{b}_2} + \underbrace{n_2 \cdot n_\text{out} + n_\text{out}}_{\mathbf{W}_3,\mathbf{b}_3}

Understanding the parameter count helps you reason about model capacity and overfitting risk — more parameters means more expressive power, but also a greater demand for training data. To put this in perspective, some well-known models:

Model	Year	Parameters	Task
LeNet-5	1998	~60 K	Handwritten digit recognition
AlexNet	2012	~61 M	ImageNet classification
ResNet-50	2015	~25 M	ImageNet classification
GPT-2	2019	~1.5 B	Text generation
GPT-3	2020	~175 B	General language tasks
GPT-4 (est.)	2023	~1.8 T	General language & reasoning

6. Universal Approximation Theorem

A feedforward network with one hidden layer and a nonlinear activation can approximate any continuous function on a compact domain to arbitrary accuracy, given enough neurons. This is a theoretical existence result — it does not say learning is easy, only that the capacity is there.

Important caveat: The theorem guarantees expressiveness, not learnability. In practice, deep networks (many layers, fewer neurons each) generalize better and are easier to train than very wide shallow networks.

🎬 Recommended Videos

CS231n Lecture 4 covers neural network forward passes, activation functions, and the biological motivation:

▶

Lecture 4 — Introduction to Neural Networks

CS231n Stanford — YouTube

Watch 55:11 – 1:44:00 only.

The 3Blue1Brown series gives the best visual intuition for what the network is doing:

▶

But what is a neural network? (Chapter 1)

3Blue1Brown — YouTube

💻 Worked Example — MATLAB Forward Pass with `patternnet`

We use MATLAB's patternnet (Deep Learning Toolbox) to build a shallow network, run a forward pass before and after training, and classify all samples. This example uses a synthetic spectral dataset — not sat_data.mat, which you will work with in HW 35.

%% Lesson 35 — Neural Network Forward Pass with patternnet
% Domain: synthetic spectral classification (3 features, 2 classes)
% HW 35 uses sat_data.mat — this example uses a different dataset.

rng(356);
d = 3;   K = 2;   N_per = 60;   N = 2 * N_per;

%% --- Generate synthetic spectral data ---
% Two emission source types measured at three spectral bands
X = [randn(N_per, d)*0.5 + [0.2  0.8  0.3];     % Class 1
     randn(N_per, d)*0.5 + [0.8  0.2  0.7]]';   % Class 2
% X is d × N  (patternnet expects each column = one sample)

labels = [ones(1, N_per), 2*ones(1, N_per)];     % 1 × N

% Convert integer labels to one-hot encoding (K × N)
T = full(ind2vec(labels));    % e.g. label 1 → [1;0], label 2 → [0;1]

fprintf('Dataset: %d samples, %d features, %d classes\n', N, d, K);

%% --- 1. Define network: 3 → 10 → 2 ---
net = patternnet(10);         % one hidden layer, 10 neurons
net = configure(net, X, T);  % set input/output sizes (random weights assigned)
view(net);                    % opens network architecture diagram

%% --- 2. Forward pass BEFORE training (random initial weights) ---
x1 = X(:, 1);                % d × 1 column (first sample)
p1 = net(x1);                % K × 1 softmax probabilities

fprintf('\n--- Forward Pass (BEFORE training) ---\n');
fprintf('Input:     [%.3f  %.3f  %.3f]\n', x1);
fprintf('Output:    [%.4f  %.4f]\n', p1);
fprintf('Predicted: Class %d  |  True: Class %d\n', vec2ind(p1), labels(1));
fprintf('(Random weights — prediction is not yet meaningful)\n');

%% --- 3. Train the network ---
net = train(net, X, T);

%% --- 4. Forward pass AFTER training ---
p1_trained = net(x1);

fprintf('\n--- Forward Pass (AFTER training) ---\n');
fprintf('Output:    [%.4f  %.4f]\n', p1_trained);
fprintf('Predicted: Class %d  |  True: Class %d\n', ...
    vec2ind(p1_trained), labels(1));

%% --- Classify all samples and compute accuracy ---
P_all = net(X);                        % K × N output probabilities
pred  = vec2ind(P_all);                % 1 × N predicted class labels
acc   = mean(pred == labels) * 100;
fprintf('\nOverall accuracy: %.1f%%\n', acc);

Why patternnet? For small datasets (hundreds of samples, few features), patternnet uses a second-order optimizer (scaled conjugate gradient) that processes all data each iteration. This converges far faster and more reliably than mini-batch SGD, which was designed for large-scale deep learning.

📝 Summary

Concept	Key Idea
Layer computation	\(\mathbf{z}_\ell = \mathbf{W}_\ell\mathbf{a}_{\ell-1}+\mathbf{b}_\ell\), \(\mathbf{a}_\ell=\phi(\mathbf{z}_\ell)\)
ReLU	\(\max(0,z)\) — default hidden layer activation; avoids vanishing gradients
Parameters	Each layer \(\ell\) has \(n_\ell \times n_{\ell-1}\) weights + \(n_\ell\) biases
Forward pass	Chain of matrix-vector products + pointwise nonlinearities
Universal approximation	One hidden layer + nonlinearity can approximate any continuous function

📋 HW 35 — Shallow Neural Network (sat_data.mat)

📚 References

Karpathy, A. et al. (2017). CS231n: Convolutional Neural Networks for Visual Recognition — Neural Networks Part 1 notes. Stanford University. http://cs231n.github.io/neural-networks-1/
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Chs. 6–7. MIT Press.
3Blue1Brown. (2017). But what is a neural network? [Video]. YouTube.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314.
He, K. et al. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet. ICCV 2015.

← L34: Linear Classification

Next Lesson
L36: Backprop & Training NNs

→

Neural Networks — Architecture & Forward Pass

Learning Objectives

Notation used in this lesson

🌍 Background & Motivation

The limitation of linear models

Biological inspiration (and where it ends)

🧠 Key Concepts

1. Network Architecture

2. The Forward Pass (Single Layer)

3. Full Forward Pass (Two-Layer Network)

4. Activation Functions

5. Parameter Count

6. Universal Approximation Theorem

🎬 Recommended Videos

💻 Worked Example — MATLAB Forward Pass with patternnet

📝 Summary

📚 References

💻 Worked Example — MATLAB Forward Pass with `patternnet`