Lesson 33 • Physics 356

What is Machine Learning & AI?

⏱ ~35 min read

Learning Objectives

Notation used in this lesson

| Symbol | Meaning |
| --- | --- |
| \(\mathbf{x}\) | Input feature vector |
| \(y\) | True label/output |
| \(\hat{y}\) | Model prediction |
| \(f(\cdot)\) | Learned function (model) |
| \(L\) | Loss function |
| \(N\) | Number of training samples |
| \(\boldsymbol{\beta}\) | Model parameters (weights) |
| \(\lambda\) | Regularization strength (hyperparameter) |
| \(D\) | Polynomial degree |

🌍 Background & Motivation

The old way: explicit rules

Classical programming takes the form rules + data → answers. A spacecraft attitude-control algorithm, a ballistic trajectory solver, a radar signal processor — each is a set of equations a physicist or engineer derives from first principles and encodes explicitly. This works beautifully when we understand the physics completely.

The ML way: data → model → answers

Machine learning flips the script: data + answers → rules (model). Instead of writing the decision logic yourself, you collect examples \((\mathbf{x}_i, y_i)\) and let an optimization algorithm discover a function \(f\) such that \(f(\mathbf{x}_i) \approx y_i\). The model is the extracted rules, encoded in its parameters.

This is not magic — it works because many real-world relationships are too complex to derive from first principles (e.g., classifying satellite maneuver intent from radar cross-section, or detecting anomalies in sensor streams). ML excels precisely in these high-dimensional, data-rich regimes.

Why physicists should care

ML is increasingly a core tool in physics: gravitational-wave signal detection (LIGO), particle identification at CERN, materials discovery, orbit determination, and more. Understanding the mathematical foundations — which are essentially applied calculus and linear algebra — puts you ahead of practitioners who treat ML as a black box.

🧠 Key Concepts

The AI ⊃ ML ⊃ Deep Learning hierarchy

Artificial Intelligence (AI) is the broad goal: machines that exhibit intelligent behavior. Machine Learning (ML) is one approach to AI: systems that improve from experience without being explicitly programmed. Deep Learning (DL) is a subset of ML using multi-layer neural networks.

Analogy: AI is like "transportation"; ML is "motor vehicles"; deep learning is "electric cars." All electric cars are motor vehicles, but not all motor vehicles are electric cars.

Three types of learning

| Type | Training Data | Goal | Physics Example |
| --- | --- | --- | --- |
| Supervised | Labeled pairs \((\mathbf{x}_i, y_i)\) | Learn \(f: \mathbf{x} \to y\) | Classify space objects by radar signature |
| Unsupervised | Unlabeled \(\mathbf{x}_i\) | Find structure / clusters | Grouping satellite orbits by behavior |
| Reinforcement | Reward signals | Learn a policy via trial & error | Autonomous spacecraft attitude control |

Regression vs. Classification

Within supervised learning, the output type matters: regression predicts a continuous quantity (e.g., an object's altitude), while classification assigns a discrete label (e.g., debris vs. active payload). This lesson focuses on regression.

The ML Pipeline

Every ML project follows the same high-level workflow:

\[ \underbrace{\text{Raw Data}}_{\text{collection}} \;\longrightarrow\; \underbrace{\text{Features } \mathbf{x}}_{\text{preprocessing}} \;\longrightarrow\; \underbrace{f_{\boldsymbol{\beta}}(\mathbf{x})}_{\text{model}} \;\longrightarrow\; \underbrace{L(\hat{y}, y)}_{\text{loss}} \;\longrightarrow\; \underbrace{\min_{\boldsymbol{\beta}} L}_{\text{optimization}} \]
  1. Data collection: gather labeled examples. Quality matters more than quantity.
  2. Feature engineering: transform raw measurements into a numeric vector \(\mathbf{x}\).
  3. Model selection: choose the form of \(f\) (linear, neural network, etc.).
  4. Loss function: define what "wrong" means quantitatively.
  5. Optimization: update model parameters \(\boldsymbol{\beta}\) to reduce loss.
  6. Evaluation: test on held-out data; report accuracy, F1, RMSE, etc.

Parameters vs. Hyperparameters

Parameters (e.g., weights \(\mathbf{W}\), biases \(\mathbf{b}\)) are learned from data during training. Hyperparameters (e.g., learning rate \(\alpha\), number of layers, batch size) are set by the user before training. Confusing the two is a common beginner mistake.

Overfitting and Generalization

A model overfits when it memorizes the training data but fails on new data. We guard against this by splitting data into training, validation, and test sets, and by monitoring validation loss during training. The validation loss is not used to update the model, but rather to indicate how well the model can generalize to new data.

\[ \underbrace{L_{\text{train}} \ll L_{\text{test}}}_{\text{overfitting sign}} \qquad\qquad \underbrace{L_{\text{train}} \approx L_{\text{test}} \approx \text{small}}_{\text{good generalization}} \]

📈 Regression: Fitting a Model to Data

What is regression?

Regression is the supervised learning task of predicting a continuous output \(y\) from input features \(x\). The machine learning framing is identical to curve fitting you already know — the difference is in how we think about the model, the loss, and generalization.

Formally, we assume the data was generated by some unknown function \(f\) plus noise:

\[ y = f(x) + \varepsilon \]

Our goal is to learn an approximation \(\hat{y} = f_{\boldsymbol{\beta}}(x)\) whose parameters \(\boldsymbol{\beta}\) we optimize from data.

Polynomial regression model

A polynomial of degree \(D\) is one of the simplest parametric models. For a scalar input \(x\):

\[ \hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_D x^D = \sum_{k=0}^{D} \beta_k\, x^k \]

Here \(x\) is the input (a single observed value, e.g. a physics score) and \(\boldsymbol{\beta} = [\beta_0, \beta_1, \ldots, \beta_D]^\top\) is the vector of model parameters — the coefficients we will learn from training data. Although the model is nonlinear in \(x\), it is linear in the parameters \(\boldsymbol{\beta}\), which makes the math tractable. We can write the prediction for a single input compactly using a feature vector \(\boldsymbol{\phi}(x)\) that collects the powers of \(x\):

\[ \boldsymbol{\phi}(x) = \begin{bmatrix} 1 \\ x \\ x^2 \\ \vdots \\ x^D \end{bmatrix} \qquad \Rightarrow \qquad \hat{y} = \boldsymbol{\beta}^\top \boldsymbol{\phi}(x) \]
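The compact form \(\hat{y} = \boldsymbol{\beta}^\top \boldsymbol{\phi}(x)\) is easy to verify numerically. A minimal sketch with made-up coefficients (all values here are illustrative, not from the lesson's data):

```matlab
% Evaluate a degree-3 polynomial through its feature vector phi(x)
D    = 3;
beta = [1; 2; 0.5; -0.1];   % made-up coefficients [beta_0; ...; beta_D]
x    = 2;

phi   = (x .^ (0:D))';      % phi(x) = [1; x; x^2; x^3] = [1; 2; 4; 8]
y_hat = beta' * phi;        % same value as beta_0 + beta_1*x + ... + beta_D*x^D
```

Here `beta' * phi` computes exactly the sum \(\sum_{k=0}^{D} \beta_k x^k\), so the two views of the model agree.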

For \(N\) training points \(x_1, x_2, \ldots, x_N\), each has its own feature vector \(\boldsymbol{\phi}(x_i)\). Stacking these as rows gives the design matrix \(\boldsymbol{\Phi}\) (size \(N \times (D+1)\)):

\[ \boldsymbol{\Phi} = \begin{bmatrix} \boldsymbol{\phi}(x_1)^\top \\ \vdots \\ \boldsymbol{\phi}(x_N)^\top \end{bmatrix} = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^D \\ \vdots & & & & \vdots \\ 1 & x_N & x_N^2 & \cdots & x_N^D \end{bmatrix} \]

With this notation the predictions for all \(N\) training points collapse to a single matrix–vector product:

\[ \hat{\mathbf{y}} = \boldsymbol{\Phi}\,\boldsymbol{\beta} \]
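With implicit expansion, building the design matrix takes one line of MATLAB. A small sketch with hypothetical training inputs (assumes R2016b+ for the `x .^ (0:D)` broadcasting):

```matlab
% Build the design matrix Phi and predict for all points at once
D    = 3;
x    = [0; 1; 2; 4];        % N = 4 hypothetical training inputs
beta = [1; 2; 0.5; -0.1];   % made-up coefficients

Phi   = x .^ (0:D);         % N-by-(D+1); row i is phi(x_i)'
y_hat = Phi * beta;         % all N predictions in one matrix-vector product
```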

The loss function (Mean Squared Error)

To measure how well our model fits the training data we use Mean Squared Error (MSE):

\[ L(\boldsymbol{\beta}) = \frac{1}{N} \sum_{i=1}^{N} \bigl(\hat{y}_i - y_i\bigr)^2 \]

Squaring the errors ensures all terms are positive and penalizes large errors more than small ones. Training the model means finding the \(\boldsymbol{\beta}\) that minimizes \(L\):

\[ \hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\operatorname{argmin}}\; L(\boldsymbol{\beta}) \]
Connection to what you already know: MSE minimization is exactly least-squares curve fitting. The ML framing just makes the loss explicit and separates training data from test data.
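Because the MSE is quadratic in \(\boldsymbol{\beta}\), its minimizer satisfies the normal equations \(\boldsymbol{\Phi}^\top\boldsymbol{\Phi}\,\boldsymbol{\beta} = \boldsymbol{\Phi}^\top\mathbf{y}\), which MATLAB's backslash operator solves directly. A sketch on synthetic data (the data and true coefficients are invented for illustration):

```matlab
% Least-squares fit of a degree-2 polynomial via the backslash operator
rng(1);
N = 50;  D = 2;
x = linspace(-1, 1, N)';
y = 0.5 - x + 2*x.^2 + 0.1*randn(N, 1);   % true coefficients [0.5; -1; 2] plus noise

Phi      = x .^ (0:D);                    % design matrix
beta_hat = Phi \ y;                       % minimizes the MSE; no iterations needed
```

`polyfit(x, y, D)` computes the same fit, with the coefficients returned in descending-power order.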

⚖️ Overfitting, Regularization & the Bias-Variance Tradeoff

Overfitting

As you increase the polynomial degree \(D\), the model becomes more flexible and can fit the training data more exactly — eventually passing through every training point perfectly. But a perfectly fit training curve often performs worse on new test data, because it has learned the noise, not the signal. This is overfitting.

Classic example: A degree-1 polynomial (a line) may underfit a nonlinear trend. A degree-15 polynomial may overfit by wiggling through every noisy training point. The test MSE will be high in both cases — but for opposite reasons.
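This is easy to demonstrate on synthetic data. The sketch below fits degree-1 and degree-9 polynomials to 10 noisy samples of a sine wave (degree 9 stands in for the degree-15 example to keep the linear algebra well conditioned). The degree-9 fit passes through every training point, so its training error is essentially zero, while its error against the noise-free curve is typically far larger:

```matlab
% Underfitting vs. overfitting on 10 noisy samples of a sine wave
rng(0);
n_train = 10;
x = linspace(0, 1, n_train)';
y = sin(2*pi*x) + 0.2*randn(n_train, 1);   % nonlinear trend + noise
x_test = linspace(0, 1, 200)';
f_test = sin(2*pi*x_test);                 % noise-free truth for evaluation

fit  = @(D) (x .^ (0:D)) \ y;              % least-squares coefficients, degree D
pred = @(b, xq) (xq .^ (0:numel(b)-1)) * b;

b1 = fit(1);                               % underfits: a straight line
b9 = fit(9);                               % interpolates all 10 points
fprintf('train MSE: line %.4f, degree-9 %.2e\n', ...
        mean((pred(b1, x) - y).^2), mean((pred(b9, x) - y).^2));
fprintf('test  MSE: line %.4f, degree-9 %.4f\n', ...
        mean((pred(b1, x_test) - f_test).^2), mean((pred(b9, x_test) - f_test).^2));
```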

Bias-Variance Tradeoff

Recall from the data model that \(y = f(x) + \varepsilon\), where \(\varepsilon\) is random measurement noise with variance \(\sigma^2_\varepsilon\). If we train our model on many different datasets drawn from the same process, our prediction \(\hat{y}\) will vary. The expected test error (expected squared difference between our prediction and the true value) decomposes into exactly three terms:

\[ \underbrace{\mathbb{E}\!\left[(\hat{y} - y)^2\right]}_{\text{expected test error}} \;=\; \underbrace{\bigl(\mathbb{E}[\hat{y}] - f(x)\bigr)^2}_{\text{Bias}^2} \;+\; \underbrace{\mathbb{E}\!\left[\bigl(\hat{y} - \mathbb{E}[\hat{y}]\bigr)^2\right]}_{\text{Variance}} \;+\; \underbrace{\sigma^2_\varepsilon}_{\text{irreducible noise}} \]

Good models have both low bias and low variance. Increasing model complexity decreases bias but increases variance. The goal is to find the sweet spot between the two.
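The decomposition follows from one algebraic step: add and subtract \(\mathbb{E}[\hat{y}]\) inside the square. Using \(y = f(x) + \varepsilon\),

\[ \hat{y} - y = \underbrace{\bigl(\hat{y} - \mathbb{E}[\hat{y}]\bigr)}_{\text{fluctuation}} + \underbrace{\bigl(\mathbb{E}[\hat{y}] - f(x)\bigr)}_{\text{bias}} - \varepsilon \]

Squaring and taking the expectation, every cross term vanishes: \(\mathbb{E}\bigl[\hat{y} - \mathbb{E}[\hat{y}]\bigr] = 0\) by construction, and \(\mathbb{E}[\varepsilon] = 0\) with \(\varepsilon\) independent of \(\hat{y}\) (the test-point noise never appears in the training set). What survives is exactly the three terms above: Bias², Variance, and \(\sigma^2_\varepsilon\).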

Regularization

Regularization is a technique for controlling overfitting by adding a penalty to the loss function that discourages large parameter values. The most common form is L2 regularization (also called Ridge regression):

\[ L_{\text{reg}}(\boldsymbol{\beta}) = \underbrace{\frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2}_{\text{data fit (MSE)}} \;+\; \underbrace{\lambda \sum_{k=1}^{D} \beta_k^2}_{\text{regularization penalty}} \]

The second term penalizes large coefficients. \(\lambda \geq 0\) is a hyperparameter you choose before training: \(\lambda = 0\) recovers the plain MSE fit, while larger values of \(\lambda\) shrink the coefficients more strongly, trading some bias for reduced variance.

Note: The bias term \(\beta_0\) is typically not regularized, since penalizing the intercept would shift all predictions toward zero regardless of the data.

For general penalties there is no closed-form solution (the L2 penalty above is a special case that does admit one), so in practice we minimize \(L_{\text{reg}}\) numerically. In MATLAB, fminsearch does this without requiring you to supply the gradient.
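Concretely, the penalized loss drops into the same fminsearch pattern as the worked example below. A minimal sketch on synthetic data (variable names and values are illustrative):

```matlab
% Ridge-penalized polynomial fit with fminsearch (synthetic data)
rng(7);
x = linspace(0, 1, 40)';
y = sin(2*pi*x) + 0.2*randn(40, 1);

D = 3;  lambda = 0.01;                  % hyperparameters, fixed before training
Phi = x .^ (0:D);

mse      = @(beta) mean((Phi*beta - y).^2);
penalty  = @(beta) lambda * sum(beta(2:end).^2);  % beta(1), the intercept, is not penalized
loss_reg = @(beta) mse(beta) + penalty(beta);

beta_reg = fminsearch(loss_reg, zeros(D+1, 1));
```

Note that only the loss function changed; the optimizer call is identical to the unregularized case.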

Recommended Video

StatQuest gives a clear, visual explanation of bias and variance — the core tradeoff behind overfitting and regularization covered in this lesson.

💻 Worked Example — MATLAB

This example walks through the full ML pipeline using a linear (degree-1) polynomial model, an MSE loss, and MATLAB's fminsearch for optimization.

%% Lesson 33 — Linear Regression with fminsearch
% Model:  y_hat = beta(1) + beta(2)*x   (line: intercept + slope)
% Loss:   L = mean((y_hat - y).^2)

rng(42);

%% 1. Generate synthetic data (in HW 33 you will load physics_grades.mat)
N = 80;
x = linspace(0, 4, N)';
y = 2*x + 1 + randn(N,1)*0.8;   % noisy line: true slope=2, intercept=1

%% 2. Train/test split  (75 / 25)
n_train = round(0.75 * N);
idx = randperm(N);
x_train = x(idx(1:n_train));     y_train = y(idx(1:n_train));
x_test  = x(idx(n_train+1:end)); y_test  = y(idx(n_train+1:end));

%% 3. Set up the linear model as an anonymous function
% beta(1) = intercept, beta(2) = slope
model = @(beta, x_in) beta(1) + beta(2)*x_in;

%% 4. Set up the loss function (MSE on training data)
loss_train = @(beta) mean((model(beta, x_train) - y_train).^2);

%% 5. Optimize using fminsearch (no gradient required)
beta0    = [0; 0];               % initial guess
beta_opt = fminsearch(loss_train, beta0);

fprintf('Learned parameters: intercept=%.3f, slope=%.3f\n', beta_opt);
fprintf('Training loss: %.4f\n', loss_train(beta_opt));

%% 6. Evaluate on test data
loss_test = @(beta) mean((model(beta, x_test) - y_test).^2);
fprintf('Test loss:     %.4f\n', loss_test(beta_opt));

%% 7. Plot results
x_fine = linspace(min(x), max(x), 200)';
figure; hold on;
scatter(x_train, y_train, 25, 'b', 'filled', 'DisplayName','Train');
scatter(x_test,  y_test,  25, 'r', 'filled', 'DisplayName','Test');
plot(x_fine, model(beta_opt, x_fine), 'k-', 'LineWidth', 2, 'DisplayName','Fitted model');
xlabel('x'); ylabel('y');
title('Linear Regression via fminsearch');
legend; grid on;
Why fminsearch? Because the MSE loss is quadratic in the parameters, this particular fit has a closed-form least-squares solution, but fminsearch works for any loss function, including ones with extra terms you add yourself. In HW 33 you will extend this loss to include a regularization penalty.

📝 Summary

| Concept | Key Idea |
| --- | --- |
| ML definition | Systems that learn \(f: \mathbf{x} \to y\) from data rather than explicit rules. |
| Supervised learning | Labeled pairs \((\mathbf{x}_i, y_i)\); learn to predict \(y\) from \(\mathbf{x}\). |
| ML pipeline | Data → features → model → loss → optimization → evaluation. |
| Polynomial model | \(\hat{y} = \sum_{k=0}^{D} \beta_k x^k\); parameters \(\boldsymbol{\beta}\) learned from data. |
| MSE loss | \(L = \frac{1}{N}\sum(\hat{y}_i - y_i)^2\); measures average squared error. |
| Regularization (L2) | \(L_{\text{reg}} = \text{MSE} + \lambda\sum_{k=1}^{D}\beta_k^2\); penalizes large coefficients to prevent overfitting. |
| Bias-variance tradeoff | Simple models: high bias, low variance. Complex models: low bias, high variance. |
| Parameters vs. hyperparameters | \(\boldsymbol{\beta}\) learned during training; \(\lambda\), \(D\) set beforehand. |
| Optimization | fminsearch minimizes any anonymous loss function numerically. |
📋 HW 33 — Regression from an ML Perspective

📚 References

  1. Maini, V. & Sabri, S. (2017). Machine Learning for Humans. Medium. Parts 1 & 2.1 — Introduction and Supervised Learning (pp. 1–29). medium.com/machine-learning-for-humans
  2. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. deeplearningbook.org
  3. MathWorks. fminsearch — Multidimensional unconstrained nonlinear minimization. mathworks.com/help/matlab/ref/fminsearch.html