Lesson 35 • Physics 356

Neural Networks — Architecture & Forward Pass

⏱ ~45 min read

Learning Objectives

Notation used in this lesson

\(\mathbf{x}\)Input feature vector, \(\mathbf{x} \in \mathbb{R}^d\) where \(d\) is the number of features
\(L\)Number of layers (subscripts 1…L index each layer)
\(\mathbf{W}_\ell\)Weight matrix, layer \(\ell\)
\(\mathbf{b}_\ell\)Bias vector, layer \(\ell\)
\(\mathbf{z}_\ell\)Pre-activation (linear combination)
\(\mathbf{a}_\ell\)Activation (post-activation)
\(n_\ell\)Number of neurons in layer \(\ell\)
\(\phi(\cdot)\)Activation function
\(\hat{\mathbf{y}}\)Network output (predictions)

Subscript convention: Subscript \(\ell\) denotes the layer index (e.g. \(\mathbf{W}_1\) is the first layer's weight matrix, \(\mathbf{W}_2\) the second, etc.). This avoids superscript clutter when equations also involve powers or transposes.

🌍 Background & Motivation

The limitation of linear models

The linear classifier from Lesson 34 is powerful but fundamentally limited: it can only learn linear decision boundaries. Real data is rarely linearly separable. Satellite orbit families, material phases, particle signatures — these require curved, complex boundaries.

A naive fix might be to simply stack multiple linear layers. But this does not help: the composition of linear transformations is itself a linear transformation. No matter how many layers you add, the model collapses to a single matrix multiply — equivalent to one linear layer with different weights. Stacking linear layers buys you nothing.

Why nonlinearity is essential: A neural network interleaves linear transformations with a nonlinear activation function \(\phi\). Because \(\phi\) is nonlinear, the composition \(\phi(\mathbf{W}_2 \phi(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2)\) cannot collapse to a single linear layer — it genuinely increases the model's expressive power with each added layer.

Neural networks exploit this by stacking many such layers. The composition of these layers can represent arbitrarily complex functions — a result formalized by the Universal Approximation Theorem.

Biological inspiration (and where it ends)

Artificial neurons are loosely inspired by biological neurons: they receive inputs, compute a weighted sum, and fire (activate) if the sum exceeds a threshold. However, modern neural networks are best thought of as parameterized function approximators, not brain simulations. The biological analogy is more historical than mechanistic.

🧠 Key Concepts

1. Network Architecture

A feedforward neural network (also called a multi-layer perceptron, MLP) consists of:

We describe the architecture by listing the number of neurons per layer. Example: a network with input \(d=10\), one hidden layer of 64 neurons, and output of 1 is written as [10 → 64 → 1].

Mathematical model of a single neuron: weighted inputs summed with bias, then passed through activation function
Figure 1: A single artificial neuron. Inputs xi are multiplied by learned weights wi, summed with bias b to form the pre-activation z, then passed through a nonlinear activation function f to produce output a. (Source: CS231n Lecture 4.)
2-layer neural network with 3 inputs, 4 hidden neurons, and 2 outputs
Figure 2a: A 2-layer network [3 → 4 → 2].
3-layer neural network with 3 inputs, two hidden layers of 4 neurons each, and 1 output
Figure 2b: A 3-layer network [3 → 4 → 4 → 1].

Every neuron in one layer connects to every neuron in the next (“fully connected”). CS231n counts layers by the number of weight matrices, so these are called 2-layer and 3-layer networks respectively (the input layer is not counted). (Source: CS231n Lecture 4.)

2. The Forward Pass (Single Layer)

For layer \(\ell\), the computation is:

\[ \underbrace{\mathbf{z}_\ell}_{\text{pre-activation}} = \mathbf{W}_\ell \mathbf{a}_{\ell-1} + \mathbf{b}_\ell \qquad \longrightarrow \qquad \underbrace{\mathbf{a}_{\ell-1}}_{\text{activation}} = \phi\!\left(\mathbf{z}_{\ell-1}\right) \]

Here, \(\mathbf{a}_{\ell-1}\) is the activation output from the previous layer (for the first hidden layer, \(\mathbf{a}_0 = \mathbf{x}\)), \(\mathbf{W}_\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}\) is the weight matrix, \(\mathbf{b}_\ell \in \mathbb{R}^{n_\ell}\) is the bias vector, and \(\phi\) is the nonlinear activation function applied element-wise.

3. Full Forward Pass (Two-Layer Network)

A network with one hidden layer and a binary classification output:

\[ \mathbf{a}_0 = \mathbf{x} \quad \text{(input)} \] \[ \mathbf{z}_1 = \mathbf{W}_1 \mathbf{a}_0 + \mathbf{b}_1 \] \[ \mathbf{a}_1 = \text{ReLU}(\mathbf{z}_1) \] \[ \mathbf{z}_2 = \mathbf{W}_2 \mathbf{a}_1 + \mathbf{b}_2 \] \[ \hat{y} = \phi(\mathbf{z}_2) \]

The network applies a learned nonlinear transformation from \(\mathbf{x}\) to \(\hat{y}\) in three matrix operations. The power comes from learning \(\mathbf{W}_1, \mathbf{b}_1, \mathbf{W}_2, b_2\) jointly.

4. Activation Functions

Many different nonlinear functions can serve as \(\phi\). The choice affects training stability, gradient flow, and model performance. Some of the most commonly used activation functions are shown below, ordered by prevalence in modern practice.

FunctionFormulaRangeCommon Use
ReLU \(\max(0, z)\) \([0,\infty)\) Hidden layers (modern default)
Sigmoid \(\sigma(z) = \frac{1}{1+e^{-z}}\) \((0,1)\) Binary output layer
Softmax \(\frac{e^{z_k}}{\sum_j e^{z_j}}\) \((0,1)\), sums to 1 Multi-class output layer
Tanh \(\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\) \((-1,1)\) Hidden layers (older architectures)
ReLU activation function
ReLU — max(0, z)
Sigmoid activation function
Sigmoid — squashes to (0, 1)
Tanh activation function
Tanh — squashes to (−1, 1)

Figure 3: Common activation functions. Sigmoid and tanh saturate (flat slope) for large |z|, causing vanishing gradients. ReLU has slope exactly 1 for z > 0, making it the modern default for hidden layers. (Source: CS231n.)

Why ReLU? Sigmoid and tanh saturate (gradient ≈ 0) for large inputs — the “vanishing gradient” problem that makes deep networks hard to train. ReLU has gradient exactly 1 for positive inputs, enabling training of much deeper networks.

5. Parameter Count

Every weight matrix and bias vector in the network contains learnable parameters — values adjusted during training. For each layer \(\ell\), the weight matrix \(\mathbf{W}_\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}\) contributes \(n_\ell \cdot n_{\ell-1}\) parameters, and the bias vector \(\mathbf{b}_\ell \in \mathbb{R}^{n_\ell}\) contributes \(n_\ell\) more. For a three-layer network \(d \to n_1 \to n_2 \to n_\text{out}\), the total is:

\[ \text{Total params} = \underbrace{d \cdot n_1 + n_1}_{\mathbf{W}_1,\mathbf{b}_1} + \underbrace{n_1 \cdot n_2 + n_2}_{\mathbf{W}_2,\mathbf{b}_2} + \underbrace{n_2 \cdot n_\text{out} + n_\text{out}}_{\mathbf{W}_3,\mathbf{b}_3} \]

Understanding the parameter count helps you reason about model capacity and overfitting risk — more parameters means more expressive power, but also a greater demand for training data. To put this in perspective, some well-known models:

ModelYearParametersTask
LeNet-51998~60 KHandwritten digit recognition
AlexNet2012~61 MImageNet classification
ResNet-502015~25 MImageNet classification
GPT-22019~1.5 BText generation
GPT-32020~175 BGeneral language tasks
GPT-4 (est.)2023~1.8 TGeneral language & reasoning

6. Universal Approximation Theorem

A feedforward network with one hidden layer and a nonlinear activation can approximate any continuous function on a compact domain to arbitrary accuracy, given enough neurons. This is a theoretical existence result — it does not say learning is easy, only that the capacity is there.

Important caveat: The theorem guarantees expressiveness, not learnability. In practice, deep networks (many layers, fewer neurons each) generalize better and are easier to train than very wide shallow networks.

🎬 Recommended Videos

CS231n Lecture 4 covers neural network forward passes, activation functions, and the biological motivation:

The 3Blue1Brown series gives the best visual intuition for what the network is doing:

💻 Worked Example — MATLAB Forward Pass with patternnet

We use MATLAB's patternnet (Deep Learning Toolbox) to build a shallow network, run a forward pass before and after training, and classify all samples. This example uses a synthetic spectral dataset — not sat_data.mat, which you will work with in HW 35.

%% Lesson 35 — Neural Network Forward Pass with patternnet
% Domain: synthetic spectral classification (3 features, 2 classes)
% HW 35 uses sat_data.mat — this example uses a different dataset.

rng(356);
d = 3;   K = 2;   N_per = 60;   N = 2 * N_per;

%% --- Generate synthetic spectral data ---
% Two emission source types measured at three spectral bands
X = [randn(N_per, d)*0.5 + [0.2  0.8  0.3];     % Class 1
     randn(N_per, d)*0.5 + [0.8  0.2  0.7]]';   % Class 2
% X is d × N  (patternnet expects each column = one sample)

labels = [ones(1, N_per), 2*ones(1, N_per)];     % 1 × N

% Convert integer labels to one-hot encoding (K × N)
T = full(ind2vec(labels));    % e.g. label 1 → [1;0], label 2 → [0;1]

fprintf('Dataset: %d samples, %d features, %d classes\n', N, d, K);

%% --- 1. Define network: 3 → 10 → 2 ---
net = patternnet(10);         % one hidden layer, 10 neurons
net = configure(net, X, T);  % set input/output sizes (random weights assigned)
view(net);                    % opens network architecture diagram

%% --- 2. Forward pass BEFORE training (random initial weights) ---
x1 = X(:, 1);                % d × 1 column (first sample)
p1 = net(x1);                % K × 1 softmax probabilities

fprintf('\n--- Forward Pass (BEFORE training) ---\n');
fprintf('Input:     [%.3f  %.3f  %.3f]\n', x1);
fprintf('Output:    [%.4f  %.4f]\n', p1);
fprintf('Predicted: Class %d  |  True: Class %d\n', vec2ind(p1), labels(1));
fprintf('(Random weights — prediction is not yet meaningful)\n');

%% --- 3. Train the network ---
net = train(net, X, T);

%% --- 4. Forward pass AFTER training ---
p1_trained = net(x1);

fprintf('\n--- Forward Pass (AFTER training) ---\n');
fprintf('Output:    [%.4f  %.4f]\n', p1_trained);
fprintf('Predicted: Class %d  |  True: Class %d\n', ...
    vec2ind(p1_trained), labels(1));

%% --- Classify all samples and compute accuracy ---
P_all = net(X);                        % K × N output probabilities
pred  = vec2ind(P_all);                % 1 × N predicted class labels
acc   = mean(pred == labels) * 100;
fprintf('\nOverall accuracy: %.1f%%\n', acc);
Why patternnet? For small datasets (hundreds of samples, few features), patternnet uses a second-order optimizer (scaled conjugate gradient) that processes all data each iteration. This converges far faster and more reliably than mini-batch SGD, which was designed for large-scale deep learning.

📝 Summary

ConceptKey Idea
Layer computation\(\mathbf{z}_\ell = \mathbf{W}_\ell\mathbf{a}_{\ell-1}+\mathbf{b}_\ell\), \(\mathbf{a}_\ell=\phi(\mathbf{z}_\ell)\)
ReLU\(\max(0,z)\) — default hidden layer activation; avoids vanishing gradients
ParametersEach layer \(\ell\) has \(n_\ell \times n_{\ell-1}\) weights + \(n_\ell\) biases
Forward passChain of matrix-vector products + pointwise nonlinearities
Universal approximationOne hidden layer + nonlinearity can approximate any continuous function
📋 HW 35 — Shallow Neural Network (sat_data.mat)

📚 References

  1. Karpathy, A. et al. (2017). CS231n: Convolutional Neural Networks for Visual Recognition — Neural Networks Part 1 notes. Stanford University. http://cs231n.github.io/neural-networks-1/
  2. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Chs. 6–7. MIT Press.
  3. 3Blue1Brown. (2017). But what is a neural network? [Video]. YouTube.
  4. Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314.
  5. He, K. et al. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet. ICCV 2015.