Lesson 33 • Physics 356

What is Machine Learning & AI?

⏱ ~35 min read

Learning Objectives

Notation used in this lesson

| Symbol | Meaning |
| --- | --- |
| \(\mathbf{x}\) | Input feature vector |
| \(y\) | True label/output |
| \(\hat{y}\) | Model prediction |
| \(f(\cdot)\) | Learned function (model) |
| \(L\) | Loss function |
| \(N\) | Number of training samples |
| \(\boldsymbol{\beta}\) | Model parameters (weights) |
| \(\lambda\) | Regularization strength (hyperparameter) |
| \(D\) | Polynomial degree |

🌍 Background & Motivation

The old way: explicit rules

Classical programming takes the form rules + data → answers. A spacecraft attitude-control algorithm, a ballistic trajectory solver, a radar signal processor — each is a set of equations a physicist or engineer derives from first principles and encodes explicitly. This works beautifully when we understand the physics completely.

The ML way: data → model → answers

Machine learning flips the script: data + answers → rules (model). Instead of writing the decision logic yourself, you collect examples \((\mathbf{x}_i, y_i)\) and let an optimization algorithm discover a function \(f\) such that \(f(\mathbf{x}_i) \approx y_i\). The model is the extracted rules, encoded in its parameters.

This is not magic — it works because many real-world relationships are too complex to derive from first principles (e.g., classifying satellite maneuver intent from radar cross-section, or detecting anomalies in sensor streams). ML excels precisely in these high-dimensional, data-rich regimes.

Why physicists should care

ML is increasingly a core tool in physics: gravitational-wave signal detection (LIGO), particle identification at CERN, materials discovery, orbit determination, and more. Understanding the mathematical foundations — which are essentially applied calculus and linear algebra — puts you ahead of practitioners who treat ML as a black box.

🧠 Key Concepts

The AI ⊃ ML ⊃ Deep Learning hierarchy

Artificial Intelligence (AI) is the broad goal: machines that exhibit intelligent behavior. Machine Learning (ML) is one approach to AI: systems that improve from experience without being explicitly programmed. Deep Learning (DL) is a subset of ML using multi-layer neural networks.

Analogy: AI is like "transportation"; ML is "motor vehicles"; deep learning is "electric cars." All electric cars are motor vehicles, but not all motor vehicles are electric cars.

Three types of learning

| Type | Training Data | Goal | Physics Example |
| --- | --- | --- | --- |
| Supervised | Labeled pairs \((\mathbf{x}_i, y_i)\) | Learn \(f: \mathbf{x} \to y\) | Classify space objects by radar signature |
| Unsupervised | Unlabeled \(\mathbf{x}_i\) | Find structure / clusters | Grouping satellite orbits by behavior |
| Reinforcement | Reward signals | Learn a policy via trial & error | Autonomous spacecraft attitude control |

Regression vs. Classification

Within supervised learning, the output type matters: regression predicts a continuous quantity (e.g., an object's altitude), while classification assigns a discrete label (e.g., debris vs. active payload). This lesson focuses on regression.

The ML Pipeline

Every ML project follows the same high-level workflow:

\[ \underbrace{\text{Raw Data}}_{\text{collection}} \;\longrightarrow\; \underbrace{\text{Features } \mathbf{x}}_{\text{preprocessing}} \;\longrightarrow\; \underbrace{f_{\boldsymbol{\beta}}(\mathbf{x})}_{\text{model}} \;\longrightarrow\; \underbrace{L(\hat{y}, y)}_{\text{loss}} \;\longrightarrow\; \underbrace{\min_{\boldsymbol{\beta}} L}_{\text{optimization}} \]
  1. Data collection: gather labeled examples. Quality matters more than quantity.
  2. Feature engineering: transform raw measurements into a numeric vector \(\mathbf{x}\).
  3. Model selection: choose the form of \(f\) (linear, neural network, etc.).
  4. Loss function: define what "wrong" means quantitatively.
  5. Optimization: update model parameters \(\boldsymbol{\beta}\) to reduce loss.
  6. Evaluation: test on held-out data; report accuracy, F1, RMSE, etc.

Parameters vs. Hyperparameters

Parameters (e.g., weights \(\mathbf{W}\), biases \(\mathbf{b}\)) are learned from data during training. Hyperparameters (e.g., learning rate \(\alpha\), number of layers, batch size) are set by the user before training. Confusing the two is a common beginner mistake.

Overfitting and Generalization

A model overfits when it memorizes the training data but fails on new data. We guard against this by splitting data into training, validation, and test sets, and by monitoring validation loss during training. The validation loss is not used to update the model, but rather to indicate how well the model can generalize to new data.

\[ \underbrace{L_{\text{train}} \ll L_{\text{test}}}_{\text{overfitting sign}} \qquad\qquad \underbrace{L_{\text{train}} \approx L_{\text{test}} \approx \text{small}}_{\text{good generalization}} \]

📈 Regression: Fitting a Model to Data

What is regression?

Regression is the supervised learning task of predicting a continuous output \(y\) from input features \(x\). The machine learning framing is identical to curve fitting you already know — the difference is in how we think about the model, the loss, and generalization.

Formally, we assume the data was generated by some unknown function \(f\) plus noise:

\[ y = f(x) + \varepsilon \]

Our goal is to learn an approximation \(\hat{y} = f_{\boldsymbol{\beta}}(x)\) whose parameters \(\boldsymbol{\beta}\) we optimize from data.

Polynomial regression model

A polynomial of degree \(D\) is one of the simplest parametric models. For a scalar input \(x\):

\[ \hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_D x^D = \sum_{k=0}^{D} \beta_k\, x^k \]

Here \(x\) is the input (a single observed value, e.g. a physics score) and \(\boldsymbol{\beta} = [\beta_0, \beta_1, \ldots, \beta_D]^\top\) is the vector of model parameters — the coefficients we will learn from training data. Although the model is nonlinear in \(x\), it is linear in the parameters \(\boldsymbol{\beta}\), which makes the math tractable. We can write the prediction for a single input compactly using a feature vector \(\boldsymbol{\phi}(x)\) that collects the powers of \(x\):

\[ \boldsymbol{\phi}(x) = \begin{bmatrix} 1 \\ x \\ x^2 \\ \vdots \\ x^D \end{bmatrix} \qquad \Rightarrow \qquad \hat{y} = \boldsymbol{\beta}^\top \boldsymbol{\phi}(x) \]
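The compact form \(\hat{y} = \boldsymbol{\beta}^\top \boldsymbol{\phi}(x)\) is easy to verify numerically. A minimal sketch with made-up coefficients (all values here are illustrative, not from the lesson's data):

```matlab
% Evaluate a degree-3 polynomial through its feature vector phi(x)
D    = 3;
beta = [1; 2; 0.5; -0.1];   % made-up coefficients [beta_0; ...; beta_D]
x    = 2;

phi   = (x .^ (0:D))';      % phi(x) = [1; x; x^2; x^3] = [1; 2; 4; 8]
y_hat = beta' * phi;        % same value as beta_0 + beta_1*x + ... + beta_D*x^D
```

Here `beta' * phi` computes exactly the sum \(\sum_{k=0}^{D} \beta_k x^k\), so the two views of the model agree.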

For \(N\) training points \(x_1, x_2, \ldots, x_N\), each has its own feature vector \(\boldsymbol{\phi}(x_i)\). Stacking these as rows gives the design matrix \(\boldsymbol{\Phi}\) (size \(N \times (D+1)\)):

\[ \boldsymbol{\Phi} = \begin{bmatrix} \boldsymbol{\phi}(x_1)^\top \\ \vdots \\ \boldsymbol{\phi}(x_N)^\top \end{bmatrix} = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^D \\ \vdots & & & & \vdots \\ 1 & x_N & x_N^2 & \cdots & x_N^D \end{bmatrix} \]

With this notation the predictions for all \(N\) training points collapse to a single matrix–vector product:

\[ \hat{\mathbf{y}} = \boldsymbol{\Phi}\,\boldsymbol{\beta} \]
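With implicit expansion, building the design matrix takes one line of MATLAB. A small sketch with hypothetical training inputs (assumes R2016b+ for the `x .^ (0:D)` broadcasting):

```matlab
% Build the design matrix Phi and predict for all points at once
D    = 3;
x    = [0; 1; 2; 4];        % N = 4 hypothetical training inputs
beta = [1; 2; 0.5; -0.1];   % made-up coefficients

Phi   = x .^ (0:D);         % N-by-(D+1); row i is phi(x_i)'
y_hat = Phi * beta;         % all N predictions in one matrix-vector product
```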

The loss function (Mean Squared Error)

To measure how well our model fits the training data we use Mean Squared Error (MSE):

\[ L(\boldsymbol{\beta}) = \frac{1}{N} \sum_{i=1}^{N} \bigl(\hat{y}_i - y_i\bigr)^2 \]

Squaring the errors ensures all terms are positive and penalizes large errors more than small ones. Training the model means finding the \(\boldsymbol{\beta}\) that minimizes \(L\):

\[ \hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\operatorname{argmin}}\; L(\boldsymbol{\beta}) \]
Connection to what you already know: MSE minimization is exactly least-squares curve fitting. The ML framing just makes the loss explicit and separates training data from test data.
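Because the MSE is quadratic in \(\boldsymbol{\beta}\), its minimizer satisfies the normal equations \(\boldsymbol{\Phi}^\top\boldsymbol{\Phi}\,\boldsymbol{\beta} = \boldsymbol{\Phi}^\top\mathbf{y}\), which MATLAB's backslash operator solves directly. A sketch on synthetic data (the data and true coefficients are invented for illustration):

```matlab
% Least-squares fit of a degree-2 polynomial via the backslash operator
rng(1);
N = 50;  D = 2;
x = linspace(-1, 1, N)';
y = 0.5 - x + 2*x.^2 + 0.1*randn(N, 1);   % true coefficients [0.5; -1; 2] plus noise

Phi      = x .^ (0:D);                    % design matrix
beta_hat = Phi \ y;                       % minimizes the MSE; no iterations needed
```

`polyfit(x, y, D)` computes the same fit, with the coefficients returned in descending-power order.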

⚖️ Overfitting, Regularization & the Bias-Variance Tradeoff

Overfitting

As you increase the polynomial degree \(D\), the model becomes more flexible and can fit the training data more exactly — eventually passing through every training point perfectly. But a perfectly fit training curve often performs worse on new test data, because it has learned the noise, not the signal. This is overfitting.

Classic example: A degree-1 polynomial (a line) may underfit a nonlinear trend. A degree-15 polynomial may overfit by wiggling through every noisy training point. The test MSE will be high in both cases — but for opposite reasons.
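This is easy to demonstrate on synthetic data. The sketch below fits degree-1 and degree-9 polynomials to 10 noisy samples of a sine wave (degree 9 stands in for the degree-15 example to keep the linear algebra well conditioned). The degree-9 fit passes through every training point, so its training error is essentially zero, while its error against the noise-free curve is typically far larger:

```matlab
% Underfitting vs. overfitting on 10 noisy samples of a sine wave
rng(0);
n_train = 10;
x = linspace(0, 1, n_train)';
y = sin(2*pi*x) + 0.2*randn(n_train, 1);   % nonlinear trend + noise
x_test = linspace(0, 1, 200)';
f_test = sin(2*pi*x_test);                 % noise-free truth for evaluation

fit  = @(D) (x .^ (0:D)) \ y;              % least-squares coefficients, degree D
pred = @(b, xq) (xq .^ (0:numel(b)-1)) * b;

b1 = fit(1);                               % underfits: a straight line
b9 = fit(9);                               % interpolates all 10 points
fprintf('train MSE: line %.4f, degree-9 %.2e\n', ...
        mean((pred(b1, x) - y).^2), mean((pred(b9, x) - y).^2));
fprintf('test  MSE: line %.4f, degree-9 %.4f\n', ...
        mean((pred(b1, x_test) - f_test).^2), mean((pred(b9, x_test) - f_test).^2));
```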

Bias-Variance Tradeoff

Recall from the data model that \(y = f(x) + \varepsilon\), where \(\varepsilon\) is random measurement noise with variance \(\sigma^2_\varepsilon\). If we train our model on many different datasets drawn from the same process, our prediction \(\hat{y}\) will vary. The expected test error (expected squared difference between our prediction and the true value) decomposes into exactly three terms:

\[ \underbrace{\mathbb{E}\!\left[(\hat{y} - y)^2\right]}_{\text{expected test error}} \;=\; \underbrace{\bigl(\mathbb{E}[\hat{y}] - f(x)\bigr)^2}_{\text{Bias}^2} \;+\; \underbrace{\mathbb{E}\!\left[\bigl(\hat{y} - \mathbb{E}[\hat{y}]\bigr)^2\right]}_{\text{Variance}} \;+\; \underbrace{\sigma^2_\varepsilon}_{\text{irreducible noise}} \]

Good models have both low bias and low variance. Increasing model complexity decreases bias but increases variance. The goal is to find the sweet spot between the two.
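The decomposition follows from one algebraic step: add and subtract \(\mathbb{E}[\hat{y}]\) inside the square. Using \(y = f(x) + \varepsilon\),

\[ \hat{y} - y = \underbrace{\bigl(\hat{y} - \mathbb{E}[\hat{y}]\bigr)}_{\text{fluctuation}} + \underbrace{\bigl(\mathbb{E}[\hat{y}] - f(x)\bigr)}_{\text{bias}} - \varepsilon \]

Squaring and taking the expectation, every cross term vanishes: \(\mathbb{E}\bigl[\hat{y} - \mathbb{E}[\hat{y}]\bigr] = 0\) by construction, and \(\mathbb{E}[\varepsilon] = 0\) with \(\varepsilon\) independent of \(\hat{y}\) (the test-point noise never appears in the training set). What survives is exactly the three terms above: Bias², Variance, and \(\sigma^2_\varepsilon\).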

Regularization

Regularization is a technique for controlling overfitting by adding a penalty to the loss function that discourages large parameter values. The most common form is L2 regularization (also called Ridge regression):

\[ L_{\text{reg}}(\boldsymbol{\beta}) = \underbrace{\frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2}_{\text{data fit (MSE)}} \;+\; \underbrace{\lambda \sum_{k=1}^{D} \beta_k^2}_{\text{regularization penalty}} \]

The second term penalizes large coefficients. \(\lambda \geq 0\) is a hyperparameter you choose before training: \(\lambda = 0\) recovers the plain MSE fit, while larger values of \(\lambda\) shrink the coefficients more strongly, trading some bias for reduced variance.

Note: The bias term \(\beta_0\) is typically not regularized, since penalizing the intercept would shift all predictions toward zero regardless of the data.

For general penalties there is no closed-form solution (the L2 penalty above is a special case that does admit one), so in practice we minimize \(L_{\text{reg}}\) numerically. In MATLAB, fminsearch does this without requiring you to supply the gradient.
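Concretely, the penalized loss drops into the same fminsearch pattern as the worked example below. A minimal sketch on synthetic data (variable names and values are illustrative):

```matlab
% Ridge-penalized polynomial fit with fminsearch (synthetic data)
rng(7);
x = linspace(0, 1, 40)';
y = sin(2*pi*x) + 0.2*randn(40, 1);

D = 3;  lambda = 0.01;                  % hyperparameters, fixed before training
Phi = x .^ (0:D);

mse      = @(beta) mean((Phi*beta - y).^2);
penalty  = @(beta) lambda * sum(beta(2:end).^2);  % beta(1), the intercept, is not penalized
loss_reg = @(beta) mse(beta) + penalty(beta);

beta_reg = fminsearch(loss_reg, zeros(D+1, 1));
```

Note that only the loss function changed; the optimizer call is identical to the unregularized case.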

Recommended Video

StatQuest gives a clear, visual explanation of bias and variance — the core tradeoff behind overfitting and regularization covered in this lesson.

💻 Worked Example — MATLAB

This example walks through the full ML pipeline using a linear (degree-1) polynomial model, an MSE loss, and MATLAB's fminsearch for optimization.

%% Lesson 33 — Linear Regression with fminsearch
% Model:  y_hat = beta(1) + beta(2)*x   (line: intercept + slope)
% Loss:   L = mean((y_hat - y).^2)

rng(42);

%% 1. Generate synthetic data (in HW 33 you will load physics_grades.mat)
N = 80;
x = linspace(0, 4, N)';
y = 2*x + 1 + randn(N,1)*0.8;   % noisy line: true slope=2, intercept=1

%% 2. Train/test split  (75 / 25)
n_train = round(0.75 * N);
idx = randperm(N);
x_train = x(idx(1:n_train));     y_train = y(idx(1:n_train));
x_test  = x(idx(n_train+1:end)); y_test  = y(idx(n_train+1:end));

%% 3. Set up the linear model as an anonymous function
% beta(1) = intercept, beta(2) = slope
model = @(beta, x_in) beta(1) + beta(2)*x_in;

%% 4. Set up the loss function (MSE on training data)
loss_train = @(beta) mean((model(beta, x_train) - y_train).^2);

%% 5. Optimize using fminsearch (no gradient required)
beta0    = [0; 0];               % initial guess
beta_opt = fminsearch(loss_train, beta0);

fprintf('Learned parameters: intercept=%.3f, slope=%.3f\n', beta_opt);
fprintf('Training loss: %.4f\n', loss_train(beta_opt));

%% 6. Evaluate on test data
loss_test = @(beta) mean((model(beta, x_test) - y_test).^2);
fprintf('Test loss:     %.4f\n', loss_test(beta_opt));

%% 7. Plot results
x_fine = linspace(min(x), max(x), 200)';
figure; hold on;
scatter(x_train, y_train, 25, 'b', 'filled', 'DisplayName','Train');
scatter(x_test,  y_test,  25, 'r', 'filled', 'DisplayName','Test');
plot(x_fine, model(beta_opt, x_fine), 'k-', 'LineWidth', 2, 'DisplayName','Fitted model');
xlabel('x'); ylabel('y');
title('Linear Regression via fminsearch');
legend; grid on;
Why fminsearch? Because the MSE loss is quadratic in the parameters, this particular fit has a closed-form least-squares solution, but fminsearch works for any loss function, including ones with extra terms you add yourself. In HW 33 you will extend this loss to include a regularization penalty.

📝 Summary

| Concept | Key Idea |
| --- | --- |
| ML definition | Systems that learn \(f: \mathbf{x} \to y\) from data rather than explicit rules. |
| Supervised learning | Labeled pairs \((\mathbf{x}_i, y_i)\); learn to predict \(y\) from \(\mathbf{x}\). |
| ML pipeline | Data → features → model → loss → optimization → evaluation. |
| Polynomial model | \(\hat{y} = \sum_{k=0}^{D} \beta_k x^k\); parameters \(\boldsymbol{\beta}\) learned from data. |
| MSE loss | \(L = \frac{1}{N}\sum(\hat{y}_i - y_i)^2\); measures average squared error. |
| Regularization (L2) | \(L_{\text{reg}} = \text{MSE} + \lambda\sum_{k=1}^{D}\beta_k^2\); penalizes large coefficients to prevent overfitting. |
| Bias-variance tradeoff | Simple models: high bias, low variance. Complex models: low bias, high variance. |
| Parameters vs. hyperparameters | \(\boldsymbol{\beta}\) learned during training; \(\lambda\), \(D\) set beforehand. |
| Optimization | fminsearch minimizes any anonymous loss function numerically. |
📋 HW 33 — Regression from an ML Perspective

📚 References

  1. Maini, V. & Sabri, S. (2017). Machine Learning for Humans. Medium. Parts 1 & 2.1 — Introduction and Supervised Learning (pp. 1–29). medium.com/machine-learning-for-humans
  2. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. deeplearningbook.org
  3. MathWorks. fminsearch — Multidimensional unconstrained nonlinear minimization. mathworks.com/help/matlab/ref/fminsearch.html