Lesson 34 — Linear Classification & Loss Functions

Learning Objectives

Write the linear score function and interpret its geometric meaning as a hyperplane.
Explain what a decision boundary is and derive it for a binary linear classifier.
Explain the sigmoid function and why it maps a single score to a probability.
Define the softmax function as the multi-class generalization of sigmoid.
Compare MSE, binary cross-entropy, and multi-class cross-entropy loss functions.
Set up a multi-class linear classification problem in MATLAB on real space-object data.

Notation used in this lesson

Symbol	Meaning
\(\mathbf{x} \in \mathbb{R}^d\)	Feature vector (\(d\) features)
\(\mathbf{w} \in \mathbb{R}^d\)	Weight vector
\(b \in \mathbb{R}\)	Scalar bias
\(s\)	Raw score (pre-activation)
\(\hat{p}\)	Predicted probability (binary)
\(y \in \{0,1\}\)	Binary label
\(K\)	Number of classes
\(s_k\)	Score for class \(k\)
\(P_k\)	Predicted probability of class \(k\)
\(\mathbf{W} \in \mathbb{R}^{K \times d}\)	Weight matrix (multi-class)
\(L\)	Loss (scalar, measures error)
\(\sigma(s)\)	Sigmoid function (binary)
\(\text{softmax}(\mathbf{s})\)	Softmax function (multi-class)

🌍 Background & Motivation

From regression to classification

In Lesson 33 we introduced regression — predicting a continuous output. Classification is the complementary problem: given features \(\mathbf{x}\), assign the example to one of \(K\) discrete classes. The simplest case is binary classification: debris vs. active satellite, signal vs. noise, threat vs. benign.

The CSpOC (Combined Space Operations Center) tracks thousands of objects in Earth orbit. Classifying these objects — determining whether a radar return corresponds to an active satellite, rocket body, or debris — is precisely a binary (or multi-class) classification problem where the features are derived from radar cross-section (RCS), brightness, and orbital elements. RCS is the effective scattering area of an object as seen by a radar: large, metallic active satellites tend to have high RCS, while small debris fragments have low RCS, making it a useful discriminating feature.

Why not just use regression for classification?

You could fit a regression model to the 0/1 labels and threshold its output at 0.5. But a linear regression output is unbounded, so it cannot be read as a probability, and MSE penalizes predictions that are confidently correct: an output of 3 for a true label of 1 is charged the same squared error as an output of -1. We want a model that outputs a probability in \([0,1]\), and a loss function that penalizes confident wrong predictions harshly. That leads us to the sigmoid and cross-entropy.

🧠 Key Concepts

1. The Linear Score Function

The simplest classifier computes a score as a linear function of the features:

\[ s = \mathbf{w}^\top \mathbf{x} + b, \qquad s \in \mathbb{R} \]

Here \(\mathbf{w}\) is a weight vector (one weight per feature) and \(b\) is a scalar bias. Together, \(\boldsymbol{\theta} = (\mathbf{w}, b)\) are the model's parameters — the things we will learn from data.

Geometric interpretation: In 2D feature space (\(d=2\)), the equation \(\mathbf{w}^\top \mathbf{x} + b = 0\) defines a line. In \(d\) dimensions it defines a hyperplane. Points on one side have \(s > 0\); points on the other side have \(s < 0\).

2. Decision Boundary

We classify a new example as class 1 if \(s > 0\) and as class 0 otherwise (a point exactly on the boundary, \(s = 0\), is assigned to class 0 by convention):

\[ \hat{y} = \begin{cases} 1 & \text{if } \mathbf{w}^\top \mathbf{x} + b > 0 \\ 0 & \text{otherwise} \end{cases} \]

The decision boundary is the set of points where \(s = 0\), i.e., the hyperplane \(\mathbf{w}^\top \mathbf{x} + b = 0\). Learning the classifier means finding the \(\mathbf{w}\) and \(b\) that place this boundary correctly between the two classes.
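The score and decision rule above can be sketched in a few lines of MATLAB. The weights, bias, and sample points here are made-up values chosen only so the arithmetic is easy to follow; they are not learned from any data.

```matlab
% Hypothetical parameters for a 2-feature classifier (illustrative values only)
w = [2; -1];        % weight vector, d x 1
b = -1;             % scalar bias

% Three example points stored as columns (d x N)
X = [3  0  1;
     1  2  1];

s     = w' * X + b;       % 1 x N row of raw scores
y_hat = double(s > 0);    % class 1 where s > 0, class 0 otherwise

disp(s)       % 4  -3   0
disp(y_hat)   % 1   0   0
```

The third point lies exactly on the boundary (\(s = 0\)), so the "otherwise" branch assigns it to class 0.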

3. Interpreting the Linear Classifier: Template Matching

There are two complementary ways to understand what the weight vector \(\mathbf{w}\) actually learns. The first is template matching (from CS231n, Stanford).

Think of \(\mathbf{w}\) as a learned prototype or template for the positive class. The raw score is simply an inner product:

\[ s = \mathbf{w}^\top \mathbf{x} + b \]

A large positive \(s\) means \(\mathbf{x}\) is highly aligned with the template \(\mathbf{w}\) — the input looks like the learned prototype. A large negative \(s\) means \(\mathbf{x}\) is anti-aligned — it looks like the opposite class. This is conceptually similar to nearest-neighbor classification, but instead of comparing against every training example we compare against a single learned prototype per class.

Learned weight templates for each class in CIFAR-10 — Figure from CS231n notes (Karpathy et al., Stanford 2017): Each tile shows the row \(\mathbf{w}_k\) reshaped into image dimensions — the learned template for that class. The horse template appears two-headed because the linear classifier merges left- and right-facing examples into a single prototype. The ship template shows strong blue pixels consistent with water and sky.

In our space-object context: The feature vector is \(\mathbf{x} = [\text{RCS},\ \text{brightness}]^\top\). After training a satellite-vs-debris classifier, \(\mathbf{w}\) points roughly from the debris cluster toward the satellite cluster (high RCS, bright), so objects that look more satellite-like receive higher scores. In the multi-class case, a new object is assigned to the class whose template gives the highest score \(\mathbf{w}_k^\top \mathbf{x} + b_k\).

4. Interpreting the Linear Classifier: Geometric View

The second interpretation is geometric. The score \(s = \mathbf{w}^\top \mathbf{x} + b\) is a linear function over feature space, and its geometry directly reveals what the classifier is doing.

The set of all points where \(s = 0\) is the decision boundary — a hyperplane (a line in 2D).
The weight vector \(\mathbf{w}\) is perpendicular to this hyperplane and points toward the positive class. Moving along \(\mathbf{w}\) increases the score; moving against it decreases it.
The bias \(b\) shifts the hyperplane away from the origin. Without it, every decision boundary is forced through the origin — a severe restriction that prevents the model from fitting data that is not origin-centered.
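A quick numerical check of these geometric claims, as a sketch with hypothetical parameters: any direction lying in the boundary should be orthogonal to \(\mathbf{w}\), and stepping along \(\mathbf{w}\) should raise the score.

```matlab
w = [2; -1];  b = -1;                 % hypothetical parameters (not learned)

% Two points that satisfy w'*x + b = 0, i.e. they lie on the boundary
x1 = [1; 1];                          % 2 - 1 - 1 = 0
x2 = [2; 3];                          % 4 - 3 - 1 = 0

along_boundary = x2 - x1;             % a direction inside the hyperplane
dot_product    = w' * along_boundary; % 2*1 + (-1)*2 = 0, so w is perpendicular

% Starting on the boundary and stepping by w raises the score by ||w||^2
x_step = x1 + w;
s_step = w' * x_step + b;             % 0 + (2^2 + 1^2) = 5
```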

Cartoon of feature space showing three linear classifiers as hyperplanes — Figure from CS231n notes (Karpathy et al., Stanford 2017): Feature space with each example as a point (shown in 2D for visualization). Each linear classifier cuts the space with a hyperplane; points on opposite sides receive opposite-sign scores. The arrow perpendicular to each boundary shows the direction of increasing score for that class.

Why bias matters for space objects: Both the satellite and debris clusters sit at positive RCS and brightness values, away from the origin. A boundary forced through the origin may be unable to separate them cleanly. The bias \(b\) gives the model the freedom to place the boundary wherever the data requires.

5. The Sigmoid Function

Rather than a hard threshold, we often want a smooth probability estimate. The sigmoid function maps any real number to \((0,1)\):

\[ \sigma(s) = \frac{1}{1 + e^{-s}} \]

Key properties:

\(\sigma(0) = 0.5\) — at the decision boundary, the model is 50/50.
\(\sigma(s) \to 1\) as \(s \to +\infty\) — large positive scores → confident class 1.
\(\sigma(s) \to 0\) as \(s \to -\infty\) — large negative scores → confident class 0.
\(\sigma'(s) = \sigma(s)(1-\sigma(s))\) — a handy identity for backprop.

Figure: The sigmoid function \(\sigma(s) = 1/(1+e^{-s})\). The gold dot marks \(\sigma(0) = 0.5\).

We define the predicted probability as \(\hat{p} = \sigma(\mathbf{w}^\top \mathbf{x} + b)\).
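The sigmoid and its derivative identity are easy to check numerically. This is a minimal sketch; the anonymous function name `sigmoid` and the sample scores are choices made here for illustration.

```matlab
sigmoid = @(s) 1 ./ (1 + exp(-s));    % element-wise, works on vectors and matrices

s = [-10 -2 0 2 10];                  % sample scores
p = sigmoid(s);                       % all values lie strictly between 0 and 1

% Compare the identity sigma'(s) = sigma(s)(1 - sigma(s)) with a
% central finite-difference estimate of the derivative
h = 1e-6;
dp_identity = p .* (1 - p);
dp_numeric  = (sigmoid(s + h) - sigmoid(s - h)) / (2*h);
max_err     = max(abs(dp_identity - dp_numeric));   % should be tiny
```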

6. From Binary to Multi-Class: The Softmax Function

The sigmoid function works perfectly for binary classification (two classes). When there are \(K > 2\) classes, we need to produce \(K\) probabilities — one per class — that are all positive and sum to 1. The softmax function does exactly this.

Given a vector of \(K\) raw scores \(\mathbf{s} = [s_1, s_2, \ldots, s_K]^\top\), softmax returns a probability vector:

\[ P_k = \frac{e^{s_k}}{\displaystyle\sum_{j=1}^{K} e^{s_j}}, \qquad k = 1, \ldots, K \]

Key properties:

\(P_k > 0\) for all \(k\) — exponential is always positive.
\(\sum_{k=1}^{K} P_k = 1\) — the outputs form a valid probability distribution.
The class with the largest score receives the highest probability.
Adding a constant to all scores leaves the probabilities unchanged (only differences matter).

Softmax reduces to sigmoid when \(K = 2\). With two classes, \(s_1\) and \(s_2\): \[ P_1 = \frac{e^{s_1}}{e^{s_1} + e^{s_2}} = \frac{1}{1 + e^{-(s_1 - s_2)}} = \sigma(s_1 - s_2) \] The sigmoid is just a special case of softmax. This means everything you learn about logistic regression (binary) generalizes directly to multi-class softmax regression.
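The two-class reduction can be confirmed numerically. The scores below are arbitrary illustrative values.

```matlab
sigmoid = @(s) 1 ./ (1 + exp(-s));

s   = [1.5; -0.5];               % hypothetical scores s_1 and s_2
P   = exp(s) / sum(exp(s));      % softmax over the two classes
p1  = sigmoid(s(1) - s(2));      % sigmoid of the score difference
gap = abs(P(1) - p1);            % differs only by floating-point round-off
```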

In the multi-class linear classifier, a separate weight vector is learned for each class. These are stacked into a weight matrix \(\mathbf{W} \in \mathbb{R}^{K \times d}\):

\[ \mathbf{s} = \mathbf{W}\mathbf{x} + \mathbf{b} \in \mathbb{R}^K, \qquad P_k = \text{softmax}(\mathbf{s})_k = \frac{e^{s_k}}{\sum_j e^{s_j}} \]

For a batch of \(N\) examples packed as columns into a matrix \(\mathbf{X} \in \mathbb{R}^{d \times N}\), the entire forward pass is (with the bias vector \(\mathbf{b} \in \mathbb{R}^K\) added to every column):

\[ \mathbf{Z} = \mathbf{W}\mathbf{X} + \mathbf{b} \in \mathbb{R}^{K \times N}, \qquad P_{kn} = \frac{e^{Z_{kn}}}{\sum_{j=1}^{K} e^{Z_{jn}}} \]

Numerical stability: Computing \(e^{s_k}\) for large \(s_k\) can overflow. In practice, subtract the maximum score before exponentiating — this leaves \(P_k\) unchanged but keeps numbers in a safe range: \[ P_k = \frac{e^{s_k - \max_j s_j}}{\sum_{j} e^{s_j - \max_j s_j}} \] In MATLAB: Z = Z - max(Z, [], 1); before exp(Z).
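The sketch below shows why the max-subtraction matters. The score matrix is invented for illustration, with a first column deliberately large enough to overflow `exp`; it assumes MATLAB R2016b or later for implicit expansion.

```matlab
% Scores for K = 3 classes (rows) and N = 3 examples (columns); made-up values
Z = [1000  1  -2;
     1001  2   0;
     1002  3   5];

% Naive softmax: exp(1000) overflows to Inf, and Inf/Inf gives NaN
P_naive = exp(Z) ./ sum(exp(Z), 1);

% Stable softmax: subtract each column's maximum before exponentiating
Zs = Z - max(Z, [], 1);
P  = exp(Zs) ./ sum(exp(Zs), 1);      % K x N, every column sums to 1
```

Columns 1 and 2 of `Z` differ only by a constant (999), so the stable version gives them identical probabilities, which illustrates the shift-invariance property listed earlier.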

7. Loss Functions

A loss function \(L(\hat{y}, y)\) quantifies how wrong the model's prediction is. We want to minimize the average loss over all training examples.

7a. Mean Squared Error (MSE) — for regression

\[ L_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2 \]

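MSE is the natural loss for regression. It penalizes large errors quadratically. For classification with a sigmoid output, however, it has poor gradient behavior: the gradient with respect to the score contains the factor \(\sigma'(s) = \sigma(s)(1-\sigma(s))\), which shrinks toward zero when the sigmoid output is near 0 or 1, so even a confidently wrong prediction receives only a weak correction.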

7b. Binary Cross-Entropy — for binary classification

\[ L_{\text{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \bigl[ y_i \log(\hat{p}_i) + (1-y_i) \log(1-\hat{p}_i) \bigr] \]

Cross-entropy has two key advantages over MSE for classification:

Probabilistic interpretation: it is the negative log-likelihood under a Bernoulli model.
Better gradients: when the model is confidently wrong, cross-entropy provides a large gradient signal to correct it quickly.
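To make the gradient comparison concrete, here is a short per-example derivation, assuming \(\hat{p} = \sigma(s)\) and using the identity \(\sigma'(s) = \sigma(s)(1-\sigma(s))\) from section 5. The per-example symbols \(\ell_{\text{MSE}}\) and \(\ell_{\text{BCE}}\) are introduced here only for illustration.

\[ \ell_{\text{MSE}} = (\hat{p} - y)^2 \quad\Rightarrow\quad \frac{\partial \ell_{\text{MSE}}}{\partial s} = 2(\hat{p} - y)\,\hat{p}(1-\hat{p}) \]

\[ \ell_{\text{BCE}} = -\bigl[ y \log\hat{p} + (1-y)\log(1-\hat{p}) \bigr] \quad\Rightarrow\quad \frac{\partial \ell_{\text{BCE}}}{\partial s} = \left( \frac{\hat{p} - y}{\hat{p}(1-\hat{p})} \right) \hat{p}(1-\hat{p}) = \hat{p} - y \]

For a confidently wrong prediction, say \(y = 1\) and \(\hat{p} = 0.01\), the MSE gradient is \(2(-0.99)(0.01)(0.99) \approx -0.0196\), while the cross-entropy gradient is \(-0.99\), about 50 times larger in magnitude.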

Concrete example. Suppose we have \(N = 3\) training objects and the current model produces these predictions:

\(i\)	Object	\(y_i\) (true label)	\(\hat{p}_i\) (predicted prob. of satellite)	Loss term
1	Satellite	1	0.90	\(-\log(0.90) \approx 0.105\) ✓ low
2	Debris	0	0.15	\(-\log(1-0.15) \approx 0.163\) ✓ low
3	Satellite	1	0.08	\(-\log(0.08) \approx 2.526\) ✗ high

\(L_\text{BCE} = \tfrac{1}{3}(0.105 + 0.163 + 2.526) \approx 0.931\)

Notice: when \(y_i = 1\), only the \(\log(\hat{p}_i)\) term matters (the \((1-y_i)\) factor zeros out the other term). When \(y_i = 0\), only the \(\log(1-\hat{p}_i)\) term matters. Object 3 — a satellite the model nearly called debris — dominates the loss and will receive the largest gradient correction.
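The table's numbers can be reproduced directly in MATLAB. This sketch uses only the three labels and predictions listed above; the variable names are choices made here.

```matlab
y     = [1; 0; 1];              % true labels from the table
p_hat = [0.90; 0.15; 0.08];     % predicted probabilities of "satellite"

% Per-example loss: only one of the two log terms is active for each row
terms = -(y .* log(p_hat) + (1 - y) .* log(1 - p_hat));
L_bce = mean(terms);            % average over the N = 3 objects

fprintf('%.3f  ', terms);       % 0.105  0.163  2.526
fprintf('\nL_BCE = %.3f\n', L_bce);   % L_BCE = 0.931
```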

Intuition for cross-entropy: If \(y_i = 1\) and \(\hat{p}_i \approx 0\), then \(\log(\hat{p}_i) \to -\infty\) — very large penalty. If \(\hat{p}_i \approx 1\), then \(\log(\hat{p}_i) \approx 0\) — small penalty. The loss rewards confidence when correct and punishes it when wrong.

7c. Multi-class Cross-Entropy — for softmax classifiers

When there are \(K\) classes, the binary cross-entropy generalizes naturally. The true label for each example is represented as a one-hot vector \(\mathbf{y} \in \{0,1\}^K\) with a 1 in the position of the correct class and 0s elsewhere. The multi-class cross-entropy loss is:

\[ L_{\text{CE}} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} y_{kn} \log P_{kn} \]

Because \(y_{kn}\) is one-hot, only one term in the inner sum is nonzero for each example — the term corresponding to the true class. So the loss simplifies to:

\[ L_{\text{CE}} = -\frac{1}{N} \sum_{n=1}^{N} \log P_{y_n, n} \]

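i.e., the average negative log-probability assigned to the correct class, where \(y_n \in \{1, \ldots, K\}\) denotes the index of the true class of example \(n\). When \(K = 2\), with one-hot labels \([y, 1-y]^\top\) and predicted probabilities \([\hat{p}, 1-\hat{p}]^\top\), this reduces exactly to the binary cross-entropy in section 7b.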
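As a sketch of how the two forms of the loss agree in practice, the snippet below runs the batch forward pass on synthetic data and evaluates both. Everything here is an assumption for illustration: the sizes, the random untrained weights, the random features, and the class indices are not real space-object data.

```matlab
K = 4;  d = 2;  N = 5;                  % sizes chosen for illustration
rng(0);                                 % reproducible random values
W  = 0.1 * randn(K, d);                 % hypothetical, untrained weights
b  = zeros(K, 1);                       % bias vector, one entry per class
X  = randn(d, N);                       % synthetic features (columns = examples)
ID = [1 3 2 4 1];                       % true class index for each example

Z = W * X + b;                          % K x N scores; b is added to every column
Z = Z - max(Z, [], 1);                  % stabilize before exponentiating
P = exp(Z) ./ sum(exp(Z), 1);           % K x N softmax probabilities

I = eye(K);
Y = I(:, ID);                           % K x N one-hot labels
L_onehot = -mean(sum(Y .* log(P), 1));  % double-sum form of the loss

idx     = sub2ind([K N], ID, 1:N);      % linear indices of the true-class entries
L_index = -mean(log(P(idx)));           % simplified form: true-class entries only
```

With small untrained weights the probabilities are nearly uniform, so both values should land near \(\log 4 \approx 1.386\).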

Connection to HW 34: The homework uses \(K = 4\) classes (Active Payload, Inactive Payload, Rocket Body, Debris) with features packed into \(\mathbf{X} \in \mathbb{R}^{2 \times N}\) and one-hot labels \(\mathbf{Y} \in \{0,1\}^{4 \times N}\). The model is \(\mathbf{Z} = \mathbf{W}\mathbf{X} + \mathbf{b}\), \(P_{kn} = \text{softmax}(\mathbf{Z})_{kn}\), and the loss is the multi-class cross-entropy above.

8. The Big Picture: Logistic & Softmax Regression

The components covered in this lesson combine into two closely related models:

	Binary (2 classes)	Multi-class (\(K\) classes)
Scores	\(s = \mathbf{w}^\top\mathbf{x} + b\)	\(\mathbf{s} = \mathbf{W}\mathbf{x} + \mathbf{b}\)
Activation	Sigmoid \(\sigma(s)\)	Softmax \(P_k = e^{s_k}/\sum e^{s_j}\)
Loss	Binary cross-entropy	Multi-class cross-entropy
Name	Logistic regression	Softmax regression

In both cases we find \(\boldsymbol{\theta}\) that minimizes \(L\). This requires a numerical optimization method — that is the subject of Lesson 35.

▶ Recommended Video

Two recommended videos — StatQuest for a step-by-step derivation of logistic regression and cross-entropy, and CS231n Lecture 2 for the broader linear classification framework covered in this lesson:

Logistic Regression — Details Pt 1: Coefficients

StatQuest with Josh Starmer — YouTube

CS231n Lecture 2 — Image Classification, Linear Classify, SVM, Softmax

Stanford University (Karpathy, Johnson, Li) — YouTube

💻 MATLAB Tips & Tricks

Building a One-Hot Label Matrix

Many classifiers require labels as a one-hot matrix \(\mathbf{Y}\) rather than a vector of integers. If ID is an \(N \times 1\) vector of class labels (integers 1–K), you need a \(K \times N\) matrix where column \(n\) has a 1 in row ID(n) and zeros everywhere else.

Three common approaches in MATLAB:

Option 1 — Using ind2vec (requires Deep Learning Toolbox, formerly Neural Network Toolbox)

% ID is N×1, e.g. [1; 3; 2; 4; 1]
% ind2vec expects a row vector, so transpose first
Y_oh = full(ind2vec(ID'));   % K × N one-hot matrix
% full() converts the sparse output to a regular dense matrix

Option 2 — Manual construction (no toolbox required)

K = 4;   % number of classes
N = length(ID);
Y_oh = zeros(K, N);
for n = 1:N
    Y_oh(ID(n), n) = 1;
end

Option 3 — Indexing the columns of eye (compact, no toolbox)

% eye(K) is the K×K identity matrix; each column is already one-hot.
% Indexing its columns by ID selects the right one-hot vector for each object.
% MATLAB cannot index the result of a function call directly, so store it first.
K = 4;
I = eye(K);
Y_oh = I(:, ID);   % K × N one-hot matrix

Verify your result: Each column of Y_oh should sum to 1, and the row index of the 1 in column n should equal ID(n). A quick check: all(sum(Y_oh,1) == 1) should return true, and [~, rows] = max(Y_oh) should reproduce ID'.
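The checks described above can be run as a short script. This sketch reuses the example labels from the Option 1 comment and builds the matrix by column indexing; the flag names `cols_ok` and `ids_ok` are hypothetical.

```matlab
ID = [1; 3; 2; 4; 1];             % example labels, N x 1
K  = 4;
I  = eye(K);
Y_oh = I(:, ID);                  % K x N one-hot matrix (no toolbox needed)

cols_ok = all(sum(Y_oh, 1) == 1); % every column contains exactly one 1

[~, rows] = max(Y_oh, [], 1);     % row index of the 1 in each column
ids_ok = isequal(rows, ID');      % recovered labels match the originals
```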

📝 Summary

Concept	Formula / Key Idea
Linear score	\(s = \mathbf{w}^\top \mathbf{x} + b\)
Decision boundary	Hyperplane \(\mathbf{w}^\top \mathbf{x} + b = 0\)
Sigmoid	\(\sigma(s) = 1/(1+e^{-s})\), maps \(\mathbb{R} \to (0,1)\) — binary
Softmax	\(P_k = e^{s_k}/\sum_j e^{s_j}\), maps \(\mathbb{R}^K \to \Delta^{K-1}\) — multi-class
MSE loss	\(\frac{1}{N}\sum(\hat{y}-y)^2\) — for regression
Binary cross-entropy	\(-\frac{1}{N}\sum[y\log\hat{p} + (1-y)\log(1-\hat{p})]\)
Multi-class cross-entropy	\(-\frac{1}{N}\sum_n\sum_k y_{kn}\log P_{kn}\)
Logistic regression	Linear score + sigmoid + binary CE (2 classes)
Softmax regression	Linear scores + softmax + multi-class CE (\(K\) classes)

📋 HW 34 — Linear Classifier (Space Objects)

📚 References

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Ch. 6. MIT Press.
Karpathy, A., Johnson, J., & Li, F.-F. (2017). CS231n: Convolutional Neural Networks for Visual Recognition — Lecture 2 notes: Linear Classification. Stanford University. cs231n.github.io/linear-classify
StatQuest. (2019). Logistic Regression [Video series]. YouTube.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Ch. 4. Springer.
Ng, A. (2012). Machine Learning (Coursera lecture notes). Stanford University.
