Learning Objectives
- Explain the key difference between the frequentist and Bayesian frameworks.
- Define the prior, likelihood, and posterior distributions.
- Apply Bayes' rule to derive the posterior distribution from the forward model and prior.
- State the MAP estimator and interpret it as minimizing negative log-posterior.
- Derive the MMSE (minimum mean squared error) estimator as the posterior mean.
- Work through the Gaussian signal-in-noise example and obtain closed-form MAP/MMSE estimates.
- Explain the noise-to-signal ratio and its effect on the MMSE estimate.
- Compare ML, MAP, and MMSE estimators and identify when each is preferred.
Notation used in this lesson
- \(X\): the unknown signal, modeled as a random variable; \(Y\): the observation; \(W\): additive noise.
- \(p_x(x)\): prior; \(p_{y|x}(y|x)\): likelihood (forward model); \(p_{x|y}(x|y)\): posterior.
- \(\sigma_x^2, \sigma_w^2\): signal and noise variances (scalar case); \(R_x, R_w\): signal and noise covariance matrices (multivariate case).
Motivation: Going Beyond Maximum Likelihood
The ML estimator is asymptotically optimal: as the amount of data grows, it performs as well as any unbiased estimator can. However, it ignores one potentially powerful source of information: prior knowledge about what values \(\theta\) is likely to take.
Consider trying to estimate a smooth physical signal from a small number of noisy measurements. The ML estimator simply returns the noisy observations unchanged; it does not know the signal is smooth. A physicist, however, would not accept a wildly oscillating estimate if they knew the true signal varies slowly. Bayesian methods formalize exactly this kind of prior knowledge.
The Bayesian Framework
Unknown as a random variable
The fundamental shift in the Bayesian approach: the unknown quantity \(X\) is modeled as a random variable, not a fixed constant. This allows us to assign a prior distribution \(p_x(x)\) that encodes our beliefs about \(X\) before seeing any data.
The three key distributions
| Name | Notation | Meaning |
|---|---|---|
| Prior | \(p_x(x)\) | What we believe about \(X\) before observing data |
| Likelihood (forward model) | \(p_{y|x}(y|x)\) | Probability of observing \(y\) if the true value is \(x\) |
| Posterior | \(p_{x|y}(x|y)\) | Updated belief about \(X\) after observing data \(y\) |
The posterior is the central object in Bayesian estimation. It combines everything we know (the data and our prior) into a single distribution over \(X\).
Bayes' Rule and the Posterior
Bayes' rule
Our goal is the posterior \(p_{x|y}(x|y)\): the distribution of the unknown \(X\) given the observed data \(Y\). In most practical problems, this distribution cannot be written down directly: we simply do not have a physics-based model that tells us the probability of a particular signal value given a particular measurement.
What we can usually formulate is the forward model \(p_{y|x}(y|x)\): the probability of obtaining measurement \(Y = y\) if the true underlying signal is \(X = x\). This is a physics model of the measurement process (noise statistics, sensor response, blur kernels, etc.) and is often straightforward to write down from first principles.
Bayes' rule flips the conditioning, letting us obtain the posterior we want from the forward model we have:
\[ p_{x|y}(x|y) = \frac{p_{y|x}(y|x)\, p_x(x)}{p_y(y)}, \qquad p_y(y) = \int p_{y|x}(y|x)\, p_x(x)\, dx , \]
where the denominator \(p_y(y)\) is a normalizing constant that does not depend on \(x\).
MAP estimator
The maximum a posteriori (MAP) estimator finds the mode of the posterior distribution:
\[ \hat X_{\mathrm{MAP}} = \arg\max_x \, p_{x|y}(x|y) . \]
Using Bayes' rule and dropping the constant denominator, this becomes:
\[ \hat X_{\mathrm{MAP}} = \arg\max_x \left\{ \log p_{y|x}(y|x) + \log p_x(x) \right\} = \arg\min_x \left\{ -\log p_{y|x}(y|x) - \log p_x(x) \right\} . \]
MMSE estimator
The minimum mean squared error (MMSE) estimator minimizes the expected squared error \(\mathbb{E}[\|X - \hat X\|^2]\) over the posterior. It is the posterior mean:
\[ \hat X_{\mathrm{MMSE}} = \mathbb{E}[X \mid Y = y] = \int x \, p_{x|y}(x|y)\, dx . \]
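The fact that the posterior mean is the minimizer follows from a one-line calculus argument. Writing the candidate estimate as a constant \(\hat x\) and differentiating the conditional expected squared error:
\[ \frac{\partial}{\partial \hat x}\, \mathbb{E}\big[(X - \hat x)^2 \mid Y = y\big] = -2\,\mathbb{E}[X \mid Y = y] + 2\hat x = 0 \quad\Longrightarrow\quad \hat x = \mathbb{E}[X \mid Y = y] . \]
Since the second derivative is \(2 > 0\), this stationary point is indeed the minimum.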
For a Gaussian posterior, the mean and mode coincide, so \(\hat X_{\mathrm{MMSE}} = \hat X_{\mathrm{MAP}}\) in that special case.
Example 1: Gaussian Signal in Gaussian Noise (Scalar)
Problem setup
Let \(X \sim \mathcal{N}(0, \sigma_x^2)\) be the unknown scalar signal, \(W \sim \mathcal{N}(0, \sigma_w^2)\) be independent noise, and the observation be \(Y = X + W\). We want to estimate \(X\) from \(Y\).
The prior and forward model are:
\[ p_x(x) = \frac{1}{\sqrt{2\pi\sigma_x^2}} \exp\!\left(-\frac{x^2}{2\sigma_x^2}\right), \qquad p_{y|x}(y|x) = \frac{1}{\sqrt{2\pi\sigma_w^2}} \exp\!\left(-\frac{(y-x)^2}{2\sigma_w^2}\right) . \]
Posterior distribution
By Bayes' rule, the posterior is proportional to the product of the likelihood and the prior: \[ p_{x|y}(x|y) \;\propto\; p_{y|x}(y|x)\, p_x(x) \] Multiplying the two Gaussians and completing the square in \(x\), the product takes the form of an unnormalized Gaussian in \(x\). Identifying the mean and variance of that Gaussian and then dividing by the normalizing constant \(p_y(y)\) (which does not depend on \(x\)), the posterior is:
\[ p_{x|y}(x|y) = \frac{1}{\sqrt{2\pi\sigma_{x|y}^2}} \exp\!\left(-\frac{(x - \mu_{x|y})^2}{2\sigma_{x|y}^2}\right) , \]
where the posterior mean and variance are:
\[ \mu_{x|y} = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_w^2}\, y , \qquad \sigma_{x|y}^2 = \frac{\sigma_x^2 \sigma_w^2}{\sigma_x^2 + \sigma_w^2} . \]
MMSE estimate and the noise-to-signal ratio
Since the posterior is Gaussian, its mean and mode coincide, and \(\hat X_{\mathrm{MMSE}} = \mu_{x|y}\):
\[ \hat X_{\mathrm{MMSE}} = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_w^2}\, Y = \frac{1}{1 + \sigma_w^2/\sigma_x^2}\, Y . \]
The factor multiplying \(Y\) is an attenuation between 0 and 1, controlled entirely by the noise-to-signal ratio \(\sigma_w^2/\sigma_x^2\):
- Low noise (\(\sigma_w^2 \ll \sigma_x^2\)): the attenuation factor tends to 1, so \(\hat X \approx Y\) (trust the data).
- High noise (\(\sigma_w^2 \gg \sigma_x^2\)): the attenuation factor tends to 0, so \(\hat X \approx 0\) (trust the prior mean).
Compare with ML
The ML estimate treats \(X\) as a fixed unknown parameter. Maximizing the likelihood \(p_{y|x}(y|x)\) over \(x\) gives simply:
\[ \hat X_{\mathrm{ML}} = \arg\max_x \, p_{y|x}(y|x) = Y . \]
The ML estimate is unbiased (\(\mathbb{E}[\hat X_{\mathrm{ML}} | X] = X\)) but noisy: it just returns the raw observation. The MMSE estimate is biased but has lower MSE when \(\sigma_w^2\) is not negligible.
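The "lower MSE" claim can be made quantitative. Averaging over both the signal and the noise:
\[ \mathrm{MSE}_{\mathrm{ML}} = \mathbb{E}[(Y - X)^2] = \mathbb{E}[W^2] = \sigma_w^2 , \qquad \mathrm{MSE}_{\mathrm{MMSE}} = \frac{\sigma_x^2 \sigma_w^2}{\sigma_x^2 + \sigma_w^2} = \frac{\sigma_w^2}{1 + \sigma_w^2/\sigma_x^2} \;\le\; \sigma_w^2 , \]
with near-equality only in the low-noise limit \(\sigma_w^2/\sigma_x^2 \to 0\). The noisier the measurement, the larger the payoff from the prior.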
Example 2: Multivariate Gaussian Estimation
The scalar result generalizes directly to vectors. Now \(X \in \mathbb{R}^n\) is an \(n\)-dimensional unknown (for example, all \(n\) pixel values in an image, or \(n\) physical quantities measured simultaneously). The distribution \(X \sim \mathcal{N}(0, R_x)\) is still Gaussian, but the variance is replaced by a covariance matrix \(R_x \in \mathbb{R}^{n \times n}\).
The covariance matrix \(R_x\) captures both the variance of each individual component and the statistical relationships between components:
- Diagonal entry \([R_x]_{ii} = \mathbb{E}[X_i^2]\): the variance of component \(X_i\).
- Off-diagonal entry \([R_x]_{ij} = \mathbb{E}[X_i X_j]\): the covariance between \(X_i\) and \(X_j\). Zero means uncorrelated.
- \(R_x\) is always symmetric (\(R_x = R_x^T\)) and positive semi-definite.
- If all components are independent, \(R_x\) is diagonal (off-diagonal entries are zero).
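As a concrete (hypothetical) illustration, consider two components with unit variances and strong positive correlation:
\[ R_x = \begin{bmatrix} 1 & 0.9 \\ 0.9 & 1 \end{bmatrix} . \]
This prior says \(X_1\) and \(X_2\) each have variance 1 and tend to move together, so observing a large \(Y_1\) also pushes the estimate of \(X_2\) upward. A smooth-signal prior corresponds to exactly this kind of strong correlation between neighboring components.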
Similarly \(W \sim \mathcal{N}(0, R_w)\) is independent noise with covariance \(R_w\), and \(Y = X + W\). The MMSE estimate of \(X\) given \(Y\) is:
\[ \hat X_{\mathrm{MMSE}} = R_x \,(R_x + R_w)^{-1}\, Y . \]
The posterior covariance (the uncertainty remaining after observing \(Y\)) is:
\[ R_{x|y} = R_x - R_x \,(R_x + R_w)^{-1}\, R_x . \]
This is the formula you will implement in HW 32 for a 50-dimensional signal reconstruction problem.
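Before tackling the 50-dimensional HW problem, it can help to see the formula in action on a toy case. The sketch below uses illustrative 2-D covariances chosen here for demonstration (not the HW values): it draws one correlated Gaussian signal, adds independent noise, and applies the multivariate MMSE formula.

```matlab
%% Minimal 2-D sketch of the multivariate MMSE formula (illustrative values)
rng(1);
Rx = [1 0.9; 0.9 1];            % strongly correlated signal prior
Rw = 0.5 * eye(2);              % independent noise, variance 0.5 per component
L  = chol(Rx, 'lower');         % factor Rx so that L*L' = Rx
X  = L * randn(2,1);            % draw X ~ N(0, Rx)
Y  = X + sqrt(0.5) * randn(2,1);% observation Y = X + W
X_hat  = Rx * ((Rx + Rw) \ Y);         % MMSE estimate R_x (R_x + R_w)^{-1} Y
R_post = Rx - Rx * ((Rx + Rw) \ Rx);   % posterior covariance
disp([X X_hat])                 % true signal vs. estimate, side by side
```

Note that `(Rx + Rw) \ Y` solves the linear system rather than forming an explicit inverse, which is the numerically preferred idiom in MATLAB.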
Recommended Videos
3Blue1Brown gives a geometric and visual derivation of Bayes' theorem; ideal for building the intuition behind prior, likelihood, and posterior before working through the math.
Katie Bouman explains how Bayesian estimation and computational imaging were used to reconstruct the first image of a black hole from sparse radio telescope data: a direct application of the prior + likelihood framework from this lesson.
Veritasium motivates Bayesian thinking with real-world examples and explains why our intuition about probabilities often fails without a prior.
A more technical look at the computational imaging methods behind the first black hole image: sparse measurements, atmospheric corruption, and the Bayesian regularization that made reconstruction possible.
Worked Example: MATLAB
This example demonstrates the scalar Gaussian case from Example 1: comparing the ML and MMSE estimates for a single observation. Your HW 32 extends this to the full 50-dimensional multivariate case from Example 2; use this example to build intuition first.
```matlab
%% Lesson 32: Scalar Bayesian Estimation (demonstration, Example 1)
% X ~ N(0, sigma_x^2), W ~ N(0, sigma_w^2), Y = X + W
% ML estimate:   X_hat_ML   = Y
% MMSE estimate: X_hat_MMSE = (sigma_x^2 / (sigma_x^2 + sigma_w^2)) * Y
rng(7);
sigma_x2 = 2.0;   % signal variance
sigma_w2 = 3.0;   % noise variance (high-noise scenario)

%% Single trial
X_true = randn * sqrt(sigma_x2);
W      = randn * sqrt(sigma_w2);
Y      = X_true + W;
X_ML   = Y;
X_MMSE = (sigma_x2 / (sigma_x2 + sigma_w2)) * Y;
fprintf('True X        = %+.4f\n', X_true);
fprintf('Observation Y = %+.4f\n', Y);
fprintf('ML estimate   = %+.4f (error = %.4f)\n', X_ML, abs(X_ML - X_true));
fprintf('MMSE estimate = %+.4f (error = %.4f)\n', X_MMSE, abs(X_MMSE - X_true));

%% Monte Carlo: compare MSE over many trials
M = 10000;
err_ML   = zeros(M,1);
err_MMSE = zeros(M,1);
for i = 1:M
    Xi = randn * sqrt(sigma_x2);
    Yi = Xi + randn * sqrt(sigma_w2);
    err_ML(i)   = (Yi - Xi)^2;
    err_MMSE(i) = ((sigma_x2/(sigma_x2+sigma_w2))*Yi - Xi)^2;
end
fprintf('\nEmpirical MSE (ML)     = %.4f\n', mean(err_ML));
fprintf('Empirical MSE (MMSE)   = %.4f\n', mean(err_MMSE));
% Theoretical MSE: for ML it is sigma_w^2; for MMSE it is the Bayes risk,
% which equals the posterior variance sigma_x^2*sigma_w^2/(sigma_x^2+sigma_w^2)
sigma_xgy2 = sigma_x2 * sigma_w2 / (sigma_x2 + sigma_w2);
fprintf('Theoretical MSE (ML)   = %.4f\n', sigma_w2);
fprintf('Theoretical MSE (MMSE) = %.4f\n', sigma_xgy2);
```
Try changing sigma_w2 to 0.1 to see the low-noise regime where ML and MMSE nearly coincide.
Comparing ML, MAP, and MMSE
| Estimator | Treats \(X\) as | Formula | Biased? | Best when |
|---|---|---|---|---|
| ML | Fixed unknown | \(\arg\max_x \log p_{y|x}(y|x)\) | Approx. unbiased for large \(n\) | Lots of data; weak/no prior |
| MAP | Random | \(\arg\max_x \{\log p_{y|x} + \log p_x\}\) | Generally yes | Prior is known; finding mode matters |
| MMSE | Random | \(\mathbb{E}[X|Y]\) (posterior mean) | Generally yes | Prior is known; minimizing MSE matters |
Summary
| Concept | Key Idea |
|---|---|
| Bayesian vs. frequentist | Bayesian models \(X\) as random and incorporates a prior; frequentist treats \(\theta\) as fixed. |
| Prior \(p_x(x)\) | Encodes knowledge about \(X\) before seeing data. |
| Posterior \(p_{x|y}(x|y)\) | Updated distribution of \(X\) after observing \(Y\); computed via Bayes' rule. |
| MAP estimator | Mode of the posterior; minimizes \(-\log p_{y|x} - \log p_x\). |
| MMSE estimator | Posterior mean; minimizes expected squared error. |
| Gaussian case (scalar) | \(\hat X_{\mathrm{MMSE}} = \frac{\sigma_x^2}{\sigma_x^2+\sigma_w^2} Y\); attenuated version of \(Y\). |
| Noise-to-signal ratio | \(\sigma_w^2/\sigma_x^2\) controls the attenuation. High noise means strong shrinkage toward the prior mean. |
| Multivariate case | \(\hat X_{\mathrm{MMSE}} = R_x(R_x+R_w)^{-1}Y\); uses signal covariance matrix as prior. |
| Gaussian special property | MAP = MMSE = conditional median; posterior mean is linear in \(Y\); posterior variance independent of \(Y\). |
References
- Bouman, C.A. Foundations of Computational Imaging, SIAM, 2022, §§2.3.2–2.3.3. (See embedded notes in the course PDF.)
- Harvard Smithsonian CfA. "Black Hole Image." Event Horizon Telescope, 2019. (Bayesian imaging methods at work.) Video: youtube.com/watch?v=iiAi6Y3yaI4