Learning Objectives
- Distinguish artificial intelligence, machine learning, and deep learning.
- Classify ML problems as supervised, unsupervised, or reinforcement learning.
- Describe the end-to-end ML pipeline: data → model → loss → optimization → evaluation.
- Explain why ML is relevant to physics and engineering applications.
- Recognize the difference between model parameters and hyperparameters.
- Construct a polynomial regression model and write its loss function.
- Explain overfitting, regularization, and the bias-variance tradeoff.
- Use MATLAB's fminsearch with an anonymous loss function to train a model.
Background & Motivation
The old way: explicit rules
Classical programming takes the form rules + data → answers. A spacecraft attitude-control algorithm, a ballistic trajectory solver, a radar signal processor — each is a set of equations a physicist or engineer derives from first principles and encodes explicitly. This works beautifully when we understand the physics completely.
The ML way: data → model → answers
Machine learning flips the script: data + answers → rules (model). Instead of writing the decision logic yourself, you collect examples \((\mathbf{x}_i, y_i)\) and let an optimization algorithm discover a function \(f\) such that \(f(\mathbf{x}_i) \approx y_i\). The model is the extracted rules, encoded in its parameters.
This is not magic — it works because many real-world relationships are too complex to derive from first principles (e.g., classifying satellite maneuver intent from radar cross-section, or detecting anomalies in sensor streams). ML excels precisely in these high-dimensional, data-rich regimes.
Why physicists should care
ML is increasingly a core tool in physics: gravitational-wave signal detection (LIGO), particle identification at CERN, materials discovery, orbit determination, and more. Understanding the mathematical foundations — which are essentially applied calculus and linear algebra — puts you ahead of practitioners who treat ML as a black box.
Key Concepts
The AI ⊃ ML ⊃ Deep Learning hierarchy
Artificial Intelligence (AI) is the broad goal: machines that exhibit intelligent behavior. Machine Learning (ML) is one approach to AI: systems that improve from experience without being explicitly programmed. Deep Learning (DL) is a subset of ML using multi-layer neural networks.
Three types of learning
| Type | Training Data | Goal | Physics Example |
|---|---|---|---|
| Supervised | Labeled pairs \((\mathbf{x}_i, y_i)\) | Learn \(f: \mathbf{x} \to y\) | Classify space objects by radar signature |
| Unsupervised | Unlabeled \(\mathbf{x}_i\) | Find structure / clusters | Grouping satellite orbits by behavior |
| Reinforcement | Reward signals | Learn policy via trial & error | Autonomous spacecraft attitude control |
Regression vs. Classification
Within supervised learning, the output type matters:
- Regression: predict a continuous value. Example: predict the apogee altitude of a satellite from telemetry features. Output \(\hat{y} \in \mathbb{R}\).
- Classification: predict a discrete class label. Example: "is this radar return from debris or an active satellite?" Output \(\hat{y} \in \{0,1\}\) (binary) or \(\hat{y} \in \{1,\ldots,K\}\) (multi-class).
The ML Pipeline
Every ML project follows the same high-level workflow:
- Data collection: gather labeled examples. Quality matters more than quantity.
- Feature engineering: transform raw measurements into a numeric vector \(\mathbf{x}\).
- Model selection: choose the form of \(f\) (linear, neural network, etc.).
- Loss function: define what "wrong" means quantitatively.
- Optimization: update model parameters \(\boldsymbol{\theta}\) to reduce loss.
- Evaluation: test on held-out data; report accuracy, F1, RMSE, etc.
Parameters vs. Hyperparameters
Parameters (e.g., weights \(\mathbf{W}\), biases \(\mathbf{b}\)) are learned from data during training. Hyperparameters (e.g., learning rate \(\alpha\), number of layers, batch size) are set by the user before training. Confusing the two is a common beginner mistake.
Overfitting and Generalization
A model overfits when it memorizes the training data but fails on new data. We guard against this by splitting data into training, validation, and test sets, and by monitoring validation loss during training. The validation loss is not used to update the model, but rather to indicate how well the model can generalize to new data.
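A three-way split can be sketched in MATLAB as follows (the 70/15/15 proportions and variable names are illustrative choices, not fixed conventions):

```matlab
% Three-way split of N samples into train / validation / test (70/15/15)
rng(0);
N = 100;
x = linspace(0, 4, N)';  y = 2*x + 1 + 0.8*randn(N,1);  % example data
idx  = randperm(N);                      % shuffle before splitting
n_tr = round(0.70*N);  n_va = round(0.15*N);
tr = idx(1:n_tr);                        % fit parameters on these
va = idx(n_tr+1:n_tr+n_va);              % tune hyperparameters on these
te = idx(n_tr+n_va+1:end);               % touch only once, at the very end
x_tr = x(tr); y_tr = y(tr);
x_va = x(va); y_va = y(va);
x_te = x(te); y_te = y(te);
```

The shuffle matters: if the data is ordered (by time, by magnitude), a contiguous split would put systematically different samples in each set.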
Regression: Fitting a Model to Data
What is regression?
Regression is the supervised learning task of predicting a continuous output \(y\) from input features \(x\). The machine learning framing is identical to curve fitting you already know — the difference is in how we think about the model, the loss, and generalization.
Formally, we assume the data was generated by some unknown function \(f\) plus noise:
\[ y = f(x) + \varepsilon, \]
where \(\varepsilon\) is random measurement noise.
Our goal is to learn an approximation \(\hat{y} = f_{\boldsymbol{\beta}}(x)\) whose parameters \(\boldsymbol{\beta}\) we optimize from data.
Polynomial regression model
A polynomial of degree \(D\) is one of the simplest parametric models. For a scalar input \(x\):
\[ \hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_D x^D = \sum_{k=0}^{D} \beta_k x^k. \]
Here \(x\) is the input (a single observed value, e.g. a physics score) and \(\boldsymbol{\beta} = [\beta_0, \beta_1, \ldots, \beta_D]^\top\) is the vector of model parameters — the coefficients we will learn from training data. Although the model is nonlinear in \(x\), it is linear in the parameters \(\boldsymbol{\beta}\), which makes the math tractable. We can write the prediction for a single input compactly using a feature vector \(\boldsymbol{\phi}(x)\) that collects the powers of \(x\):
\[ \boldsymbol{\phi}(x) = [1,\; x,\; x^2,\; \ldots,\; x^D]^\top, \qquad \hat{y} = \boldsymbol{\phi}(x)^\top \boldsymbol{\beta}. \]
For \(N\) training points \(x_1, x_2, \ldots, x_N\), each has its own feature vector \(\boldsymbol{\phi}(x_i)\). Stacking these as rows gives the design matrix \(\boldsymbol{\Phi}\) (size \(N \times (D+1)\)):
\[ \boldsymbol{\Phi} = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^D \\ 1 & x_2 & x_2^2 & \cdots & x_2^D \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_N & x_N^2 & \cdots & x_N^D \end{bmatrix}. \]
With this notation the predictions for all \(N\) training points collapse to a single matrix–vector product:
\[ \hat{\mathbf{y}} = \boldsymbol{\Phi}\boldsymbol{\beta}. \]
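As a concrete sketch, the design matrix and the stacked predictions take only a few lines of MATLAB (the values of D, x, and beta below are illustrative, not from the lesson data):

```matlab
% Build the N x (D+1) design matrix Phi for a degree-D polynomial
D = 3;
x = [0.5; 1.0; 1.5; 2.0];      % example inputs (N = 4), as a column vector
Phi = x .^ (0:D);              % implicit expansion: column k+1 holds x.^k
beta = [1; 2; 0; 0];           % example coefficients [beta_0; ...; beta_D]
y_hat = Phi * beta;            % predictions for all N points at once
```

Note that `x .^ (0:D)` relies on MATLAB's implicit expansion (column vector against row vector), which requires R2016b or later; on older versions, `bsxfun(@power, x, 0:D)` does the same thing.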
The loss function (Mean Squared Error)
To measure how well our model fits the training data we use Mean Squared Error (MSE):
\[ L(\boldsymbol{\beta}) = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2. \]
Squaring the errors ensures all terms are positive and penalizes large errors more than small ones. Training the model means finding the \(\boldsymbol{\beta}\) that minimizes \(L\):
\[ \boldsymbol{\beta}^{\ast} = \arg\min_{\boldsymbol{\beta}} L(\boldsymbol{\beta}). \]
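Because the model is linear in the parameters, this particular minimization has a closed-form least-squares solution, which MATLAB's backslash operator computes directly. A minimal sketch with synthetic data (the data and degree here are illustrative):

```matlab
% Closed-form least-squares fit for a degree-D polynomial
rng(0);
x = linspace(0, 4, 40)';            % example inputs
y = 2*x + 1 + 0.5*randn(40,1);      % noisy example targets
D = 2;
Phi  = x .^ (0:D);                  % N x (D+1) design matrix
beta = Phi \ y;                     % backslash solves min ||Phi*beta - y||^2
mse  = mean((Phi*beta - y).^2);     % MSE at the minimizer
```

This closed form exists only for the plain MSE loss; once you add arbitrary terms to the loss, you need a numerical optimizer such as fminsearch, as in the worked example below.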
Overfitting, Regularization & the Bias-Variance Tradeoff
Overfitting
As you increase the polynomial degree \(D\), the model becomes more flexible and can fit the training data more exactly — eventually passing through every training point perfectly. But a perfectly fit training curve often performs worse on new test data, because it has learned the noise, not the signal. This is overfitting.
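You can see this directly by fitting increasing polynomial degrees and comparing train and test error. A sketch using polyfit/polyval (the data and degree range are illustrative):

```matlab
% Train vs. test MSE as polynomial degree grows
rng(1);
x    = linspace(0, 4, 30)';  y    = sin(x)    + 0.2*randn(30,1);  % train
x_te = linspace(0.1, 3.9, 30)'; y_te = sin(x_te) + 0.2*randn(30,1);  % test
for D = 1:7
    p    = polyfit(x, y, D);                      % fit on training data
    e_tr = mean((polyval(p, x)    - y   ).^2);    % training error keeps falling
    e_te = mean((polyval(p, x_te) - y_te).^2);    % test error eventually rises
    fprintf('D=%d  train=%.4f  test=%.4f\n', D, e_tr, e_te);
end
```

Typically the training error decreases monotonically with \(D\), while the test error reaches a minimum at a moderate degree and then climbs — the signature of overfitting.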
Bias-Variance Tradeoff
Recall from the data model that \(y = f(x) + \varepsilon\), where \(\varepsilon\) is random measurement noise with variance \(\sigma^2_\varepsilon\). If we train our model on many different datasets drawn from the same process, our prediction \(\hat{y}\) will vary. The expected test error (expected squared difference between our prediction and the true value) decomposes into exactly three terms:
\[ \mathbb{E}\big[(y - \hat{y})^2\big] = \underbrace{\big(\mathbb{E}[\hat{y}] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{y} - \mathbb{E}[\hat{y}])^2\big]}_{\text{Variance}} + \sigma^2_\varepsilon. \]
- Bias\(^2\) — the squared difference between the average prediction \(\mathbb{E}[\hat{y}]\) and the true value \(f(x)\). Bias measures systematic error: does the model consistently predict too high, too low, or miss the shape of \(f\)? It is squared because we are measuring error in squared units (consistent with MSE), and because the bias itself — \(\mathbb{E}[\hat{y}] - f(x)\) — can be positive or negative; squaring makes it always non-negative. A model that is too simple (e.g., a straight line fit to curved data) has high bias.
- Variance — how much the prediction \(\hat{y}\) fluctuates across different training sets. A model with high variance has "memorized" the training data rather than learning the underlying trend, so small changes in the training set cause large swings in the predictions. A high-degree polynomial typically has high variance.
- Irreducible noise \(\sigma^2_\varepsilon\) — the variance of the observation noise \(\varepsilon\) built into the data-generating process \(y = f(x) + \varepsilon\). No matter how good our model is, we cannot predict the random noise on each measurement. This sets a hard lower bound on the test error.
Good models have both low bias and low variance. Increasing model complexity decreases bias but increases variance. The goal is to find the sweet spot between the two.
Regularization
Regularization is a technique for controlling overfitting by adding a penalty to the loss function that discourages large parameter values. The most common form is L2 regularization (also called Ridge regression):
\[ L_{\text{reg}}(\boldsymbol{\beta}) = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2 + \lambda \sum_{k=1}^{D} \beta_k^2. \]
The second term penalizes large coefficients. \(\lambda \geq 0\) is a hyperparameter you choose before training:
- \(\lambda = 0\): no regularization — pure MSE, free to overfit.
- Small \(\lambda\): mild penalty, allows moderate flexibility.
- Large \(\lambda\): strong penalty, forces small coefficients, model behaves more like a low-degree polynomial.
With regularization there is generally no closed-form solution, so we minimize \(L_{\text{reg}}\) numerically. In MATLAB, fminsearch does this without requiring you to supply the gradient.
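A minimal sketch of such a numerical minimization, assuming the same polynomial setup as above (the data, degree, and \(\lambda\) value are illustrative):

```matlab
% Regularized polynomial fit with fminsearch (no gradient required)
rng(0);
x = linspace(0, 4, 40)';
y = 2*x + 1 + 0.5*randn(40,1);
D = 5;  lambda = 0.1;               % hyperparameters: degree and penalty strength
Phi = x .^ (0:D);                   % N x (D+1) design matrix
% The L2 penalty sums k = 1..D, i.e. it excludes the intercept beta(1)
loss = @(b) mean((Phi*b - y).^2) + lambda*sum(b(2:end).^2);
beta0    = zeros(D+1, 1);           % initial guess
beta_reg = fminsearch(loss, beta0); % numerical minimizer of L_reg
```

Try rerunning with \(\lambda\) spanning several orders of magnitude (e.g. 0, 0.01, 1, 100) and compare the fitted curves: larger \(\lambda\) visibly shrinks the high-order coefficients.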
Recommended Video
StatQuest gives a clear, visual explanation of bias and variance — the core tradeoff behind overfitting and regularization covered in this lesson.
Worked Example — MATLAB
This example walks through the full ML pipeline using a simple linear model, an MSE loss, and MATLAB's fminsearch for optimization. (In HW 33 you will extend the loss with a regularization penalty.)
%% Lesson 33 — Linear Regression with fminsearch
% Model: y_hat = beta(1) + beta(2)*x (line: intercept + slope)
% Loss: L = mean((y_hat - y).^2)
rng(42);
%% 1. Generate synthetic data (in HW 33 you will load physics_grades.mat)
N = 80;
x = linspace(0, 4, N)';
y = 2*x + 1 + randn(N,1)*0.8; % noisy line: true slope=2, intercept=1
%% 2. Train/test split (75 / 25)
n_train = round(0.75 * N);
idx = randperm(N);
x_train = x(idx(1:n_train)); y_train = y(idx(1:n_train));
x_test = x(idx(n_train+1:end)); y_test = y(idx(n_train+1:end));
%% 3. Set up the linear model as an anonymous function
% beta(1) = intercept, beta(2) = slope
model = @(beta, x_in) beta(1) + beta(2)*x_in;
%% 4. Set up the loss function (MSE on training data)
loss_train = @(beta) mean((model(beta, x_train) - y_train).^2);
%% 5. Optimize using fminsearch (no gradient required)
beta0 = [0; 0]; % initial guess
beta_opt = fminsearch(loss_train, beta0);
fprintf('Learned parameters: intercept=%.3f, slope=%.3f\n', beta_opt);
fprintf('Training loss: %.4f\n', loss_train(beta_opt));
%% 6. Evaluate on test data
loss_test = @(beta) mean((model(beta, x_test) - y_test).^2);
fprintf('Test loss: %.4f\n', loss_test(beta_opt));
%% 7. Plot results
x_fine = linspace(min(x), max(x), 200)';
figure; hold on;
scatter(x_train, y_train, 25, 'b', 'filled', 'DisplayName','Train');
scatter(x_test, y_test, 25, 'r', 'filled', 'DisplayName','Test');
plot(x_fine, model(beta_opt, x_fine), 'k-', 'LineWidth', 2, 'DisplayName','Fitted model');
xlabel('x'); ylabel('y');
title('Linear Regression via fminsearch');
legend; grid on;
Why fminsearch? For a loss that is quadratic in the parameters, as here, there is a closed-form least-squares solution, but fminsearch works for any loss function — including ones with extra terms you add yourself. In HW 33 you will extend this loss to include a regularization penalty.
Summary
| Concept | Key Idea |
|---|---|
| ML definition | Systems that learn \(f: \mathbf{x} \to y\) from data rather than explicit rules. |
| Supervised learning | Labeled pairs \((\mathbf{x}_i, y_i)\); learn to predict \(y\) from \(\mathbf{x}\). |
| ML pipeline | Data → features → model → loss → optimization → evaluation. |
| Polynomial model | \(\hat{y} = \sum_{k=0}^{D} \beta_k x^k\); parameters \(\boldsymbol{\beta}\) learned from data. |
| MSE loss | \(L = \frac{1}{N}\sum(\hat{y}_i - y_i)^2\); measures average squared error. |
| Regularization (L2) | \(L_{\text{reg}} = \text{MSE} + \lambda\sum_{k=1}^{D}\beta_k^2\); penalizes large coefficients to prevent overfitting. |
| Bias-variance tradeoff | Simple models: high bias, low variance. Complex models: low bias, high variance. |
| Parameters vs. hyperparameters | \(\boldsymbol{\beta}\) learned during training; \(\lambda\), \(D\) set beforehand. |
| Optimization | fminsearch minimizes any anonymous loss function numerically. |
References
- Maini, V. & Sabri, S. (2017). Machine Learning for Humans. Medium. Parts 1 & 2.1 — Introduction and Supervised Learning (pp. 1–29). medium.com/machine-learning-for-humans
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. deeplearningbook.org
- MathWorks. fminsearch — Multidimensional unconstrained nonlinear minimization. mathworks.com/help/matlab/ref/fminsearch.html