Lesson 31 • Physics 356

Statistical Estimation I: Frequentist Methods & Maximum Likelihood

⏱ ~40 min read

Learning Objectives

Notation used in this lesson

| Symbol | Meaning |
|---|---|
| \(X \in \mathbb{R}^p\) | \(X\) is a \(p\)-dimensional vector of real numbers |
| \(\theta\) | Unknown parameter (scalar or vector) |
| \(\hat\theta\) | Estimate of \(\theta\) |
| \(\bar\theta\) | Mean of the estimator, \(\mathbb{E}[\hat\theta]\) |
| \(Y\) | Observed random vector (data) |
| \(p_\theta(y)\) | PDF of \(Y\), parameterized by \(\theta\) |
| \(\mathbb{E}[X]\) | Expected value (mean) of random variable \(X\) |
| \(\mu,\,\sigma^2\) | Mean and variance of a Gaussian |
| \(R\) | Covariance matrix (multivariate Gaussian) |
| \(\Omega\) | Parameter space: the set of all admissible values of \(\theta\) |
| \(\text{MSE}_\theta\) | Mean squared error of an estimate |

🔭 Motivation: Estimating Unknown Quantities from Data

In many physics experiments you cannot measure a quantity of interest directly. Instead, you observe a noisy, indirect signal and must infer the underlying quantity from that signal. This is the estimation problem.

Examples are everywhere: inferring the mass of a particle from a track in a detector, estimating the distance to a star from its apparent brightness, determining the temperature of a plasma from its emission spectrum, or recovering the true intensity of a laser beam from speckle-corrupted measurements. In each case, your measurement is a random variable that depends on the unknown quantity you want to estimate.

Two broad frameworks address this problem: frequentist (the subject of this lesson) and Bayesian (Lesson 32). Both are valuable; understanding when to use each is a core skill.

📊 Common Probability Distributions

The Gaussian (Normal) Distribution

The most widely used distribution in physics and engineering is the Gaussian distribution, written \(X \sim \mathcal{N}(\mu, \sigma^2)\). Its probability density function (PDF) is:

\[ p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left\{ -\frac{1}{2\sigma^2}(x - \mu)^2 \right\} \]
Note: For a Gaussian random variable \(X \sim \mathcal{N}(\mu, \sigma^2)\): \(\mathbb{E}[X] = \mu\) and \(\mathbb{E}[(X-\mu)^2] = \sigma^2\). The parameters \(\mu\) and \(\sigma^2\) are exactly the mean and variance — they are not random variables.
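As a quick numerical sanity check, here is a short Python sketch (standard library only; the particular values of \(\mu\) and \(\sigma^2\) are illustrative) that samples from a Gaussian and confirms the stated moments:

```python
import math
import random

def gaussian_pdf(x, mu, sigma2):
    """PDF of N(mu, sigma2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

random.seed(0)
mu, sigma2 = 3.0, 4.0                      # illustrative parameter values

# Monte Carlo check of E[X] = mu and E[(X - mu)^2] = sigma^2
samples = [random.gauss(mu, math.sqrt(sigma2)) for _ in range(200_000)]
mean_hat = sum(samples) / len(samples)
var_hat = sum((s - mean_hat) ** 2 for s in samples) / len(samples)

print(f"sample mean     ~ {mean_hat:.3f}  (mu      = {mu})")
print(f"sample variance ~ {var_hat:.3f}  (sigma^2 = {sigma2})")
```

With enough samples, the empirical mean and variance land close to \(\mu\) and \(\sigma^2\), which is exactly what the note above asserts.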

The Multivariate Gaussian Distribution

When \(X \in \mathbb{R}^p\) is a vector rather than a scalar, with jointly Gaussian (and in general correlated) components, the Gaussian distribution generalizes to:

\[ p(x) = \frac{1}{(2\pi)^{p/2}} |R|^{-1/2} \exp\!\left\{ -\frac{1}{2}(x - \mu)^T R^{-1}(x - \mu) \right\} \]

where \(\mu \in \mathbb{R}^p\) is the mean vector and \(R \in \mathbb{R}^{p \times p}\) is the covariance matrix (symmetric positive-definite). We write \(X \sim \mathcal{N}(\mu, R)\).

Note: For \(X \sim \mathcal{N}(\mu, R)\): \(\mathbb{E}[X] = \mu\) and \(\mathbb{E}[(X-\mu)(X-\mu)^T] = R\). The \((i,j)\) entry of \(R\) is the covariance between components \(X_i\) and \(X_j\). Diagonal entries are variances.
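To build intuition for \(R\), here is a brief Python/NumPy sketch (the particular \(\mu\), \(R\), and sample size are arbitrary choices for illustration): draw correlated Gaussian vectors via a Cholesky factor of \(R\) and confirm that the empirical covariance matches.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])                # illustrative mean vector
R = np.array([[2.0, 0.8],
              [0.8, 1.0]])                # symmetric positive-definite covariance

# If Z has i.i.d. N(0,1) entries and R = L L^T, then mu + L Z ~ N(mu, R)
L = np.linalg.cholesky(R)
Z = rng.standard_normal((100_000, 2))
X = mu + Z @ L.T                          # each row is a sample from N(mu, R)

mu_hat = X.mean(axis=0)
R_hat = np.cov(X, rowvar=False)
print("empirical mean:", np.round(mu_hat, 3))
print("empirical covariance:\n", np.round(R_hat, 3))
```

The off-diagonal entry of the empirical covariance recovers the 0.8 correlation term, illustrating that \(R\) encodes dependence between components, not just their individual variances.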

Other distributions

Not all physical measurements are Gaussian. Counting experiments (photons, decay events) follow Poisson distributions. Some intensity measurements (e.g., laser speckle) follow exponential distributions. The estimation framework below applies to any distribution — the key is knowing (or assuming) the form of the probability density function for your measurements.
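A small Python/NumPy sketch of one non-Gaussian case (the rate and sample size are made-up illustrative values): Poisson counts have the signature property that mean and variance are both equal to the rate, unlike the Gaussian, where \(\mu\) and \(\sigma^2\) are separate parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 4.2                                 # Poisson rate (illustrative value)
Y = rng.poisson(lam, size=100_000)        # counting data, e.g. photon counts

# Poisson signature: mean and variance are BOTH equal to the rate lambda
print(f"sample mean     ~ {Y.mean():.3f}")
print(f"sample variance ~ {Y.var():.3f}")
```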

📐 The Frequentist Framework

Throughout this lesson \(\theta\) denotes the unknown parameter we want to estimate — for example, the mean \(\mu\) of a Gaussian, the rate \(\lambda\) of an exponential distribution, or any other scalar or vector that characterizes \(p_\theta(y)\). Our job is to infer \(\theta\) from the observed data \(Y\).

Setup

In the frequentist approach, the unknown quantity \(\theta\) is treated as a fixed but unknown constant (not random). We observe data \(Y \in \mathbb{R}^n\) drawn from a distribution \(p_\theta(y)\) that depends on \(\theta\). Our goal is to design a function \(T: \mathbb{R}^n \to \Omega\) such that

\[ \hat\theta = T(Y) \]

is a good estimate of the true \(\theta\). Because \(T\) is a function of the random vector \(Y\), the estimate \(\hat\theta\) is itself a random variable — it varies across repeated experiments.

Note: The estimate \(\hat\theta\) is a random variable because it is a function of the random data \(Y\). The mean of the estimate, \(\bar\theta = \mathbb{E}[\hat\theta]\), tells us where the estimates tend to cluster on average.

Bias, variance, and MSE

Three quantities characterize the quality of an estimator:

| Quantity | Definition | Meaning |
|---|---|---|
| Bias | \(\text{bias}_\theta = \bar\theta - \theta\) | Systematic offset: does the estimator on average give the right answer? |
| Variance | \(\text{var}_\theta = \mathbb{E}_\theta\!\left[(\hat\theta - \bar\theta)^2\right]\) | Spread of estimates around their mean: how consistent is it? |
| MSE | \(\text{MSE}_\theta = \mathbb{E}_\theta\!\left[(\hat\theta - \theta)^2\right]\) | Overall error from the true value: the quantity we most care about. |

These are related by the fundamental identity:

\[ \text{MSE}_\theta = \text{var}_\theta + (\text{bias}_\theta)^2 \]

An estimator is unbiased if \(\text{bias}_\theta = 0\) for all \(\theta\), i.e., \(\bar\theta = \theta\). An unbiased estimator's MSE equals its variance.
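The identity is easy to check numerically. Below is a Python/NumPy sketch (the shrinkage estimator \(\hat\theta = \sum_k Y_k/(n+1)\) and all constants are invented for illustration): simulate many repeated experiments with a deliberately biased estimator and verify that the empirical MSE decomposes into variance plus squared bias.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma2, n, M = 3.0, 4.0, 10, 200_000   # truth, noise, samples, trials

# M repeated experiments, each with n Gaussian observations
Y = theta + np.sqrt(sigma2) * rng.standard_normal((M, n))

# A deliberately biased estimator: divide by n+1 instead of n
est = Y.sum(axis=1) / (n + 1)                 # E[est] = n*theta/(n+1), not theta

bias = est.mean() - theta
var = est.var()
mse = np.mean((est - theta) ** 2)
print(f"var + bias^2 = {var + bias**2:.4f}")
print(f"MSE          = {mse:.4f}")
```

The two printed numbers agree, and the empirical bias matches the theoretical value \(-\theta/(n+1)\) for this estimator.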

Consistency

An estimator \(\hat\theta_n\) based on \(n\) observations is consistent if it converges to the true \(\theta\) in probability as \(n \to \infty\). Consistency is the minimum property we expect of an estimator: with enough data, it should converge to the truth.

The uniformly minimum variance unbiased (UMVU) estimator is the best possible unbiased estimator — it achieves the lowest variance among all unbiased estimators, for every value of \(\theta\).

🎯 The Maximum Likelihood (ML) Estimator

Definition

The most widely used frequentist estimator is the maximum likelihood (ML) estimator. The idea is simple: choose the value of \(\theta\) that makes the observed data \(Y\) most probable.

\[ \hat\theta_{\mathrm{ML}} = \underset{\theta \in \Omega}{\operatorname{arg\,max}}\; p_\theta(Y) = \underset{\theta \in \Omega}{\operatorname{arg\,max}}\; \log p_\theta(Y) \]
Note: We maximize \(\log p_\theta(Y)\) (the log-likelihood) rather than \(p_\theta(Y)\) directly. Since \(\log\) is a monotonically increasing function, the maximum is at the same \(\theta\). The log turns products of probabilities into sums, which is much easier to differentiate and optimize.
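A small pure-Python sketch of this point (the sample, grid, and seed are arbitrary): grid-search both the raw likelihood and the log-likelihood and confirm they peak at the same \(\theta\).

```python
import math
import random

random.seed(0)
sigma2 = 4.0
Y = [random.gauss(3.0, math.sqrt(sigma2)) for _ in range(20)]

def pdf(y, theta):
    return math.exp(-(y - theta) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

grid = [1.0 + 0.001 * k for k in range(4001)]              # theta in [1, 5]
lik    = [math.prod(pdf(y, t) for y in Y) for t in grid]   # product of PDFs
loglik = [sum(math.log(pdf(y, t)) for y in Y) for t in grid]

t_lik = grid[lik.index(max(lik))]
t_log = grid[loglik.index(max(loglik))]
print(f"argmax of likelihood:     {t_lik:.3f}")
print(f"argmax of log-likelihood: {t_log:.3f}")
```

Both searches return the same grid point, which also sits next to the sample mean, anticipating the Gaussian derivation below.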

Properties of the ML estimator

Under mild regularity conditions, the ML estimator is consistent and asymptotically efficient: as \(n \to \infty\), it converges to the true \(\theta\) and its variance approaches the smallest value any unbiased estimator can achieve.

Note: The Cramér-Rao bound says that the variance of any unbiased estimator is bounded below by \(1 / \mathcal{I}(\theta)\), where \(\mathcal{I}(\theta)\) is the Fisher information. The ML estimator achieves this bound asymptotically — it is as good as any unbiased estimator can be, given enough data.
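For the i.i.d. Gaussian-mean setting used in the example below, the Fisher information follows directly from the log-likelihood (a short derivation; \(\sigma^2\) is assumed known):

\[ \mathcal{I}(\theta) = -\mathbb{E}\!\left[\frac{\partial^2 \log p_\theta(Y)}{\partial \theta^2}\right] = -\mathbb{E}\!\left[-\frac{n}{\sigma^2}\right] = \frac{n}{\sigma^2}, \qquad \text{so} \qquad \text{var}_\theta(\hat\theta) \ge \frac{1}{\mathcal{I}(\theta)} = \frac{\sigma^2}{n} \]

for any unbiased estimator \(\hat\theta\). This \(\sigma^2/n\) curve is exactly the Cramér-Rao reference line plotted in the MATLAB worked example.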

Deriving the ML estimator: Gaussian mean (Example 2.3)

Let \(\{Y_k\}_{k=1}^n\) be i.i.d. random variables with distribution \(\mathcal{N}(\theta, \sigma^2)\) where \(\sigma^2\) is known and \(\theta\) is the unknown mean. The log-likelihood for \(n\) independent observations is:

\[ \log p_\theta(Y) = -\frac{1}{2\sigma^2}\sum_{k=1}^n (Y_k - \theta)^2 - \frac{n}{2}\log(2\pi\sigma^2) \]

Differentiating with respect to \(\theta\) and setting equal to zero:

\[ \frac{d \log p_\theta(Y)}{d\theta}\bigg|_{\theta = \hat\theta} = \frac{1}{\sigma^2}\sum_{k=1}^n (Y_k - \hat\theta) = 0 \]

Solving for \(\hat\theta\):

\[ \hat\theta_{\mathrm{ML}} = \frac{1}{n}\sum_{k=1}^n Y_k \]
Note: The ML estimate of the Gaussian mean is simply the sample mean — the average of the observations. Moreover, for this specific case, the ML estimator is also unbiased and UMVU (the best possible unbiased estimator). This is a special property of the Gaussian; most ML estimators are not unbiased for finite \(n\).
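As a closing numerical check (a pure-Python sketch; the seed, \(n\), and constants are arbitrary), the first-order condition from the derivation can be verified directly: the score vanishes at \(\hat\theta = \bar Y\), and the log-likelihood is lower at nearby values.

```python
import math
import random

random.seed(7)
sigma2 = 4.0
Y = [random.gauss(3.0, math.sqrt(sigma2)) for _ in range(100)]
theta_hat = sum(Y) / len(Y)                # the ML estimate (sample mean)

def loglik(theta):
    # Gaussian log-likelihood up to the theta-independent constant
    return -sum((y - theta) ** 2 for y in Y) / (2 * sigma2)

# The score (derivative of the log-likelihood) vanishes at the sample mean
score = sum(y - theta_hat for y in Y) / sigma2
print(f"score at theta_hat: {score:.2e}")

# And theta_hat beats nearby candidate values
print(loglik(theta_hat) > loglik(theta_hat + 0.1))
print(loglik(theta_hat) > loglik(theta_hat - 0.1))
```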

Recommended Video

An intuitive and visual walkthrough of maximum likelihood estimation — builds up the core idea from first principles and explains why MLE is the most important theory in statistics.

💻 Worked Example — MATLAB

This example demonstrates ML estimation of the mean of a Gaussian distribution. We generate data, compute the ML estimate, and verify that it is unbiased by averaging over many realizations. The Gaussian distribution is used here for illustration — your HW 31 applies ML estimation to a different distribution (exponential).

```matlab
%% Lesson 31 — ML Estimation of Gaussian Mean (demonstration)
% True parameter: theta = 3 (the mean)
% Distribution: Y ~ N(theta, sigma^2), sigma^2 = 4 (known)
% ML estimator: theta_hat = mean(Y)
% (HW 31 uses a different distribution -- do not use this directly)

rng(42);
theta_true = 3.0;
sigma2     = 4.0;

%% Single experiment: estimate from n=20 samples
n = 20;
Y = theta_true + sqrt(sigma2) * randn(n, 1);
theta_hat = mean(Y);
fprintf('True theta = %.2f,  ML estimate = %.4f\n', theta_true, theta_hat);

%% Demonstrate unbiasedness: average over M=1000 experiments
M     = 1000;
n_vec = [5, 10, 20, 50, 100, 200];
mean_est = zeros(size(n_vec));
var_est  = zeros(size(n_vec));

for j = 1:length(n_vec)
  estimates = zeros(M, 1);
  for i = 1:M
    Y_i = theta_true + sqrt(sigma2) * randn(n_vec(j), 1);
    estimates(i) = mean(Y_i);
  end
  mean_est(j) = mean(estimates);
  var_est(j)  = var(estimates);
end

%% Plot
figure;
subplot(2,1,1);
plot(n_vec, mean_est, 'bo-', 'LineWidth', 1.5);
yline(theta_true, 'r--', 'LineWidth', 1.5);
xlabel('n (samples)'); ylabel('Mean of $\hat{\theta}$', 'Interpreter', 'latex');
title('ML Estimator Mean vs. Number of Samples');
legend({'Mean of estimates', 'True $\theta$'}, 'Interpreter', 'latex'); grid on;

subplot(2,1,2);
semilogy(n_vec, var_est, 'bs-', 'LineWidth', 1.5); hold on;
semilogy(n_vec, sigma2 ./ n_vec, 'r--', 'LineWidth', 1.5);
xlabel('n (samples)'); ylabel('$\mathrm{Var}(\hat{\theta})$ [log scale]', 'Interpreter', 'latex');
title('Variance of ML Estimator (scales as $\sigma^2/n$)', 'Interpreter', 'latex');
legend({'Empirical variance', '$\sigma^2/n$ (Cramer-Rao bound)'}, 'Interpreter', 'latex'); grid on;
```

What to observe: The mean of the estimates stays near the true \(\theta = 3\) for all \(n\) (unbiasedness). The variance decreases as \(1/n\), matching the Cramér-Rao bound for this problem. This is the signature of a consistent, efficient estimator.

📝 Summary

| Concept | Key Idea |
|---|---|
| Frequentist framework | \(\theta\) is unknown but fixed. Estimate it from data \(Y\) using \(\hat\theta = T(Y)\). |
| Bias | \(\bar\theta - \theta\): systematic offset of the estimator from the truth. |
| Variance | Spread of the estimator around its mean across repeated experiments. |
| MSE | \(\text{var} + \text{bias}^2\): overall mean squared distance from the true \(\theta\). |
| Consistency | Estimator converges to \(\theta\) as \(n \to \infty\). Minimum requirement. |
| ML estimator | \(\hat\theta_{\mathrm{ML}} = \arg\max_\theta \log p_\theta(Y)\). Consistent and asymptotically efficient. |
| Cramér-Rao bound | Lower bound on variance of any unbiased estimator. ML attains it asymptotically. |
| Gaussian mean ML | Sample mean \((1/n)\sum Y_k\): unbiased and UMVU for this case. |
📋 HW 31 — ML Estimation: Laser Speckle

📚 References

  1. Bouman, C.A. Foundations of Computational Imaging, SIAM, 2022, §§2.2–2.3.1.
  2. Wikipedia contributors. "Estimation theory." Wikipedia, The Free Encyclopedia, 2019. (Read through the Maximum Likelihood section.)
  3. Scratchapixel. "Estimators." Mathematics & Physics for Computer Graphics. (Alternative reference with less mathematical notation.)