Lesson 32 • Physics 356

Statistical Estimation II: Bayesian Estimation, MAP & MMSE

โฑ ~45 min read

Learning Objectives

Notation used in this lesson

\(X\) : Unknown quantity (random in Bayesian framework)
\(Y\) : Observed data (random vector)
\(p_x(x)\) : Prior distribution of \(X\)
\(p_{y|x}(y|x)\) : Likelihood (forward model)
\(p_{x|y}(x|y)\) : Posterior distribution of \(X\) given \(Y\)
\(\hat X_{\mathrm{MAP}}\) : Maximum a posteriori estimate
\(\hat X_{\mathrm{MMSE}}\) : Minimum MSE (posterior mean) estimate
\(\sigma_x^2,\,\sigma_w^2\) : Signal and noise variances
\(R_x,\,R_w\) : Signal and noise covariance matrices

💡 Motivation: Going Beyond Maximum Likelihood

The ML estimator is asymptotically optimal — as the amount of data grows, it performs as well as any unbiased estimator can. However, it ignores one potentially powerful source of information: prior knowledge about what values \(\theta\) is likely to take.

Consider trying to estimate a smooth physical signal from a small number of noisy measurements. The ML estimator simply returns the noisy observations unchanged — it does not know the signal is smooth. A physicist, however, would not accept a wildly oscillating estimate if they knew the true signal varies slowly. Bayesian methods formalize exactly this kind of prior knowledge.

When Bayesian methods help most: When the amount of data is small relative to the dimension of the unknown, or when strong prior information is available (e.g., the signal is smooth, non-negative, or constrained to a physical range). When data is abundant and prior information is weak, the ML and Bayesian estimates converge.

🎲 The Bayesian Framework

Unknown as a random variable

The fundamental shift in the Bayesian approach: the unknown quantity \(X\) is modeled as a random variable, not a fixed constant. This allows us to assign a prior distribution \(p_x(x)\) that encodes our beliefs about \(X\) before seeing any data.

Note: In the frequentist framework (Lesson 31), the unknown was written \(\theta\) (a fixed parameter). In the Bayesian framework, it is written \(X\) (a random variable). This is a notational convention to emphasize the philosophical difference, not a mathematical one.

The three key distributions

Name | Notation | Meaning
Prior | \(p_x(x)\) | What we believe about \(X\) before observing data
Likelihood (forward model) | \(p_{y|x}(y|x)\) | Probability of observing \(y\) if the true value is \(x\)
Posterior | \(p_{x|y}(x|y)\) | Updated belief about \(X\) after observing data \(y\)

The posterior is the central object in Bayesian estimation. It combines everything we know — the data and our prior — into a single distribution over \(X\).

🔄 Bayes' Rule and the Posterior

Bayes' rule

Our goal is the posterior \(p_{x|y}(x|y)\) — the distribution of the unknown \(X\) given the observed data \(Y\). In most practical problems, this distribution cannot be written down directly: we simply do not have a physics-based model that tells us the probability of a particular signal value given a particular measurement.

What we can usually formulate is the forward model \(p_{y|x}(y|x)\): the probability of obtaining measurement \(Y = y\) if the true underlying signal is \(X = x\). This is a physics model of the measurement process — noise statistics, sensor response, blur kernels, etc. — and is often straightforward to write down from first principles.

Bayes' rule flips the conditioning, letting us obtain the posterior we want from the forward model we have:

\[ p_{x|y}(x|y) = \frac{p_{y|x}(y|x)\, p_x(x)}{p_y(y)} \]
Note: The denominator \(p_y(y) = \int p_{y|x}(y|x)\,p_x(x)\,dx\) is the normalizing constant (also called the partition function \(z_y\)). It does not depend on \(x\) — it just ensures the posterior integrates to 1. When maximizing over \(x\), it can be dropped.
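As a quick numerical sanity check (an illustrative two-state example, sketched in Python rather than the course's MATLAB), Bayes' rule for a discrete unknown makes the role of the normalizing constant concrete: it is simply the sum of the unnormalized numerator.

```python
import numpy as np

# Hypothetical discrete example: X takes values {0, 1}.
p_x = np.array([0.8, 0.2])          # prior: P(X=0)=0.8, P(X=1)=0.2
p_y_given_x = np.array([0.1, 0.7])  # likelihood of one observed y under each x

unnormalized = p_y_given_x * p_x    # numerator of Bayes' rule
p_y = unnormalized.sum()            # normalizing constant p_y(y)
posterior = unnormalized / p_y      # posterior p(x|y), sums to 1

print(posterior)
print(posterior.argmax())  # the MAP estimate picks the posterior's largest entry
```

Note how a likely observation under \(x = 1\) overturns a prior that favored \(x = 0\): the posterior puts most of its mass on \(x = 1\).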

MAP estimator

The maximum a posteriori (MAP) estimator finds the mode of the posterior distribution:

\[ \hat X_{\mathrm{MAP}} = \underset{x \in \Omega}{\operatorname{arg\,max}}\; p_{x|y}(x|Y) \]

Using Bayes' rule and dropping the constant denominator, this becomes:

\begin{align} \hat X_{\mathrm{MAP}} &= \underset{x}{\operatorname{arg\,max}}\;\bigl\{ \log p_{y|x}(y|x) + \log p_x(x) \bigr\} \\[4pt] &= \underset{x}{\operatorname{arg\,min}}\;\bigl\{ -\log p_{y|x}(y|x) - \log p_x(x) \bigr\} \end{align}
Note: The MAP estimate balances two competing objectives: (1) fit the data well (minimize negative log-likelihood) and (2) be consistent with the prior (minimize negative log-prior). This is the fundamental trade-off in Bayesian estimation.
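To see the data-fit vs. prior trade-off concretely, here is a small sketch (in Python; the values \(\sigma_x^2 = 2\), \(\sigma_w^2 = 3\), \(y = 1.5\) are illustrative) that checks the closed-form MAP solution of the Gaussian case against a brute-force grid search over the negative log-posterior:

```python
import numpy as np

sigma_x2, sigma_w2, y = 2.0, 3.0, 1.5   # illustrative values

# Negative log-posterior up to constants: data-fit term + prior penalty
cost = lambda x: (y - x)**2 / (2 * sigma_w2) + x**2 / (2 * sigma_x2)

xs = np.linspace(-5, 5, 200001)          # fine grid over candidate x
x_map_numeric = xs[np.argmin(cost(xs))]  # brute-force minimizer

# Closed form from setting the derivative of the quadratic cost to zero
x_map_closed = sigma_x2 / (sigma_x2 + sigma_w2) * y

print(x_map_numeric, x_map_closed)  # both approximately 0.6
```

Because both negative-log terms are quadratic in \(x\), the minimizer has the same attenuated-observation form derived in Example 1 below.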

MMSE estimator

The minimum mean squared error (MMSE) estimator minimizes the expected squared error \(\mathbb{E}[\|X - \hat X\|^2]\) over all estimators \(\hat X(Y)\). It is the posterior mean:

\[ \hat X_{\mathrm{MMSE}} = \mathbb{E}[X \mid Y] = \int x\; p_{x|y}(x|Y)\, dx \]

For a Gaussian posterior, the mean and mode coincide, so \(\hat X_{\mathrm{MMSE}} = \hat X_{\mathrm{MAP}}\) in that special case.
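An illustrative Monte Carlo sketch of the "minimum" in MMSE (in Python; restricting attention to linear estimators \(aY\) is a simplification made only for this check): among all gains \(a\), the empirical MSE is smallest near the posterior-mean gain \(\sigma_x^2/(\sigma_x^2+\sigma_w^2)\).

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_x2, sigma_w2 = 2.0, 3.0   # illustrative variances
n = 200_000

X = rng.normal(0, np.sqrt(sigma_x2), n)   # draws from the prior
Y = X + rng.normal(0, np.sqrt(sigma_w2), n)  # noisy observations

# Empirical MSE of the linear estimator a*Y for a grid of gains
gains = np.linspace(0, 1, 101)
mse = [np.mean((a * Y - X)**2) for a in gains]

best_gain = gains[np.argmin(mse)]
print(best_gain)   # close to sigma_x2/(sigma_x2+sigma_w2) = 0.4
```

For jointly Gaussian \(X\) and \(Y\) the best linear estimator is also the overall MMSE estimator, so this grid search recovers the attenuation factor derived in Example 1.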

📈 Example 1: Gaussian Signal in Gaussian Noise (Scalar)

Problem setup

Let \(X \sim \mathcal{N}(0, \sigma_x^2)\) be the unknown scalar signal, \(W \sim \mathcal{N}(0, \sigma_w^2)\) be independent noise, and the observation be \(Y = X + W\). We want to estimate \(X\) from \(Y\).

The prior and forward model are:

\[ p_x(x) = \frac{1}{\sqrt{2\pi\sigma_x^2}} \exp\!\left\{-\frac{x^2}{2\sigma_x^2}\right\}, \qquad p_{y|x}(y|x) = \frac{1}{\sqrt{2\pi\sigma_w^2}} \exp\!\left\{-\frac{(y-x)^2}{2\sigma_w^2}\right\} \]

Posterior distribution

By Bayes' rule, the posterior is proportional to the product of the likelihood and the prior: \[ p_{x|y}(x|y) \;\propto\; p_{y|x}(y|x)\, p_x(x) \] Multiplying the two Gaussians and completing the square in \(x\), the product takes the form of an unnormalized Gaussian in \(x\). Identifying the mean and variance of that Gaussian and then dividing by the normalizing constant \(p_y(y)\) (which does not depend on \(x\)), the posterior is:

\[ p_{x|y}(x|y) = \frac{1}{\sqrt{2\pi\sigma_{x|y}^2}} \exp\!\left\{-\frac{1}{2\sigma_{x|y}^2}(x - \mu_{x|y})^2\right\} \]

where the posterior mean and variance are:

\[ \mu_{x|y} = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_w^2}\, y, \qquad \sigma_{x|y}^2 = \frac{\sigma_x^2\, \sigma_w^2}{\sigma_x^2 + \sigma_w^2} \]
Note: The posterior is still Gaussian — this is a special property of Gaussian priors with Gaussian likelihoods. For jointly Gaussian \(X\) and \(Y\), the MAP, MMSE, and conditional median estimates are all identical: \(\hat X_{\mathrm{MAP}} = \hat X_{\mathrm{MMSE}} = \hat X_{\mathrm{median}}\).
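For completeness, the completing-the-square step behind these formulas reads (collecting all terms that do not depend on \(x\) into constants):

\begin{align} \log p_{y|x}(y|x) + \log p_x(x) &= -\frac{(y-x)^2}{2\sigma_w^2} - \frac{x^2}{2\sigma_x^2} + \text{const} \\[4pt] &= -\frac{1}{2}\left(\frac{1}{\sigma_w^2} + \frac{1}{\sigma_x^2}\right)x^2 + \frac{y}{\sigma_w^2}\,x + \text{const} \\[4pt] &= -\frac{(x - \mu_{x|y})^2}{2\sigma_{x|y}^2} + \text{const} \end{align}

Matching coefficients gives \(\sigma_{x|y}^2 = \left(\frac{1}{\sigma_x^2} + \frac{1}{\sigma_w^2}\right)^{-1} = \frac{\sigma_x^2\,\sigma_w^2}{\sigma_x^2 + \sigma_w^2}\) and \(\mu_{x|y} = \sigma_{x|y}^2 \cdot \frac{y}{\sigma_w^2} = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_w^2}\, y\), exactly the posterior mean and variance stated above.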

MMSE estimate and the noise-to-signal ratio

Since the posterior is Gaussian, \(\hat X_{\mathrm{MMSE}} = \mu_{x|y}\):

\[ \hat X_{\mathrm{MMSE}} = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_w^2}\, Y = \frac{1}{1 + \sigma_w^2/\sigma_x^2}\, Y \]
Note: The MMSE estimate is \(Y\) attenuated by a factor between 0 and 1. The attenuation depends on the noise-to-signal ratio \(\sigma_w^2/\sigma_x^2\):
  • Low noise (\(\sigma_w^2 \ll \sigma_x^2\)): attenuation ≈ 1 → \(\hat X \approx Y\) (trust the data).
  • High noise (\(\sigma_w^2 \gg \sigma_x^2\)): attenuation ≈ 0 → \(\hat X \approx 0\) (trust the prior mean).
This attenuation reduces MSE but introduces bias: \(\mathbb{E}[\hat X_{\mathrm{MMSE}} \mid X] = \frac{\sigma_x^2}{\sigma_x^2+\sigma_w^2}\, X \neq X\), i.e., the estimate is shrunk toward zero rather than centered on the true \(X\).
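The bias-variance bookkeeping behind this can be checked with plain arithmetic (sketched in Python, using the same illustrative variances \(\sigma_x^2 = 2\), \(\sigma_w^2 = 3\) as the worked example later in this lesson):

```python
sigma_x2, sigma_w2 = 2.0, 3.0
a = sigma_x2 / (sigma_x2 + sigma_w2)   # attenuation factor = 0.4

# Error of the MMSE estimate: a*(X+W) - X = (a-1)*X + a*W.
# Averaging over X and W, the MSE splits into squared bias + variance:
avg_bias2 = (1 - a)**2 * sigma_x2   # E[((a-1)X)^2]
variance  = a**2 * sigma_w2         # E[(aW)^2]
mse = avg_bias2 + variance

# The total equals the posterior variance sigma_x|y^2
post_var = sigma_x2 * sigma_w2 / (sigma_x2 + sigma_w2)
print(mse, post_var)  # both 1.2
```

So for jointly Gaussian \(X\) and \(Y\), the Bayes MSE of the MMSE estimator is exactly the posterior variance: the bias it accepts is more than repaid in reduced variance.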

Compare with ML

The ML estimate treats \(X\) as a fixed unknown parameter. Maximizing the likelihood \(p_{y|x}(y|x)\) over \(x\) gives simply:

\[ \hat X_{\mathrm{ML}} = Y \]

The ML estimate is unbiased (\(\mathbb{E}[\hat X_{\mathrm{ML}} | X] = X\)) but noisy — it just returns the raw observation. The MMSE estimate is biased but has lower MSE when \(\sigma_w^2\) is not negligible.

🔢 Example 2: Multivariate Gaussian Estimation

The scalar result generalizes directly to vectors. Now \(X \in \mathbb{R}^n\) is an \(n\)-dimensional unknown — for example, all \(n\) pixel values in an image, or \(n\) physical quantities measured simultaneously. The distribution \(X \sim \mathcal{N}(0, R_x)\) is still Gaussian, but the variance is replaced by a covariance matrix \(R_x \in \mathbb{R}^{n \times n}\).

The covariance matrix \(R_x\) captures both the variance of each individual component (its diagonal entries) and the statistical relationships between components (its off-diagonal entries, the pairwise covariances).

Similarly, \(W \sim \mathcal{N}(0, R_w)\) is independent noise with covariance \(R_w\), and \(Y = X + W\). The MMSE estimate of \(X\) given \(Y\) is:

\[ \hat X_{\mathrm{MMSE}} = \mathbb{E}[X \mid Y] = R_x(R_x + R_w)^{-1} Y \]

The posterior covariance (uncertainty after observing \(Y\)) is:

\[ R_{x|y} = (R_x^{-1} + R_w^{-1})^{-1} \]
Note: Two important observations for jointly Gaussian variables: (1) the posterior mean is a linear function of the data \(Y\); (2) the posterior covariance \(R_{x|y}\) does not depend on \(Y\). These two properties are special to Gaussians and do not hold for general distributions.

This is the formula you will implement in HW 32 for a 50-dimensional signal reconstruction problem.
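A minimal NumPy sketch of the multivariate formulas (the exponentially decaying prior covariance below is an illustrative assumption, not the HW 32 model). It also checks that the posterior covariance above agrees with the equivalent form \(R_x - R_x(R_x+R_w)^{-1}R_x\):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5  # small dimension for illustration

# Illustrative prior covariance: exponentially decaying correlations
idx = np.arange(n)
R_x = 2.0 * 0.8 ** np.abs(idx[:, None] - idx[None, :])
R_w = 0.5 * np.eye(n)  # white noise covariance

# Draw one realization X ~ N(0, R_x) and its noisy observation Y = X + W
L = np.linalg.cholesky(R_x)
X = L @ rng.normal(size=n)
Y = X + rng.normal(0, np.sqrt(0.5), n)

# MMSE estimate R_x (R_x + R_w)^{-1} Y, via solve rather than explicit inverse
X_mmse = R_x @ np.linalg.solve(R_x + R_w, Y)

# Posterior covariance in the two standard (equivalent) forms
R_post     = np.linalg.inv(np.linalg.inv(R_x) + np.linalg.inv(R_w))
R_post_alt = R_x - R_x @ np.linalg.solve(R_x + R_w, R_x)
print(np.allclose(R_post, R_post_alt))  # True
```

Using `np.linalg.solve` instead of forming \((R_x+R_w)^{-1}\) explicitly is the numerically preferred way to apply these formulas.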

▶ Recommended Videos

3Blue1Brown gives a geometric and visual derivation of Bayes' theorem — ideal for building the intuition behind prior, likelihood, and posterior before working through the math.

Katie Bouman explains how Bayesian estimation and computational imaging were used to reconstruct the first image of a black hole from sparse radio telescope data — a direct application of the prior + likelihood framework from this lesson.

Veritasium motivates Bayesian thinking with real-world examples and explains why our intuition about probabilities often fails without a prior.

A more technical look at the computational imaging methods behind the first black hole image — sparse measurements, atmospheric corruption, and the Bayesian regularization that made reconstruction possible.

💻 Worked Example — MATLAB

This example demonstrates the scalar Gaussian case from Example 1: comparing the ML and MMSE estimates for a single observation. Your HW 32 extends this to the full 50-dimensional multivariate case from Example 2 — use this example to build intuition first.

%% Lesson 32 -- Scalar Bayesian Estimation (demonstration, Example 1)
% X ~ N(0, sigma_x^2),  W ~ N(0, sigma_w^2),  Y = X + W
% ML estimate:   X_hat_ML   = Y
% MMSE estimate: X_hat_MMSE = (sigma_x^2 / (sigma_x^2 + sigma_w^2)) * Y

rng(7);
sigma_x2 = 2.0;   % signal variance
sigma_w2 = 3.0;   % noise variance  (high noise scenario)

%% Single trial
X_true  = randn * sqrt(sigma_x2);
W       = randn * sqrt(sigma_w2);
Y       = X_true + W;

X_ML   = Y;
X_MMSE = (sigma_x2 / (sigma_x2 + sigma_w2)) * Y;

fprintf('True X     = %+.4f\n', X_true);
fprintf('Observation Y = %+.4f\n', Y);
fprintf('ML  estimate  = %+.4f  (error = %.4f)\n', X_ML,   abs(X_ML   - X_true));
fprintf('MMSE estimate = %+.4f  (error = %.4f)\n', X_MMSE, abs(X_MMSE - X_true));

%% Monte Carlo: compare MSE over many trials
M     = 10000;
err_ML   = zeros(M,1);
err_MMSE = zeros(M,1);

for i = 1:M
  Xi = randn * sqrt(sigma_x2);
  Yi = Xi + randn * sqrt(sigma_w2);
  err_ML(i)   = (Yi - Xi)^2;
  err_MMSE(i) = ((sigma_x2/(sigma_x2+sigma_w2))*Yi - Xi)^2;
end

fprintf('\nEmpirical MSE (ML)   = %.4f\n', mean(err_ML));
fprintf('Empirical MSE (MMSE) = %.4f\n', mean(err_MMSE));

% Theoretical MSE for the MMSE estimator: for jointly Gaussian X and Y,
% the Bayes MSE equals the posterior variance sigma_x|y^2
% (averaged squared bias plus estimator variance sum to exactly this value)
sigma_xgy2 = sigma_x2 * sigma_w2 / (sigma_x2 + sigma_w2);
fprintf('Theoretical MSE (MMSE) = %.4f\n', sigma_xgy2);
What to observe: With high noise (\(\sigma_w^2 > \sigma_x^2\)), the MMSE estimate has lower MSE than ML despite being biased. The attenuation factor pulls estimates toward the prior mean (zero), which helps on average when the noise is large. Try changing sigma_w2 to 0.1 to see the low-noise regime where ML and MMSE nearly coincide.

โš–๏ธ Comparing ML, MAP, and MMSE

Estimator | Treats \(X\) as | Formula | Biased? | Best when
ML | Fixed unknown | \(\arg\max_x \log p_{y|x}(y|x)\) | Approx. unbiased for large \(n\) | Lots of data; weak/no prior
MAP | Random | \(\arg\max_x \{\log p_{y|x} + \log p_x\}\) | Generally yes | Prior is known; finding mode matters
MMSE | Random | \(\mathbb{E}[X|Y]\) (posterior mean) | Generally yes | Prior is known; minimizing MSE matters
When Bayesian can hurt: If the assumed prior \(p_x(x)\) does not reflect reality, both MAP and MMSE estimates can be significantly worse than ML. A wrong prior is worse than no prior.

๐Ÿ“ Summary

Concept | Key Idea
Bayesian vs. frequentist | Bayesian models \(X\) as random and incorporates a prior; frequentist treats \(\theta\) as fixed.
Prior \(p_x(x)\) | Encodes knowledge about \(X\) before seeing data.
Posterior \(p_{x|y}(x|y)\) | Updated distribution of \(X\) after observing \(Y\); computed via Bayes' rule.
MAP estimator | Mode of the posterior; minimizes \(-\log p_{y|x} - \log p_x\).
MMSE estimator | Posterior mean; minimizes expected squared error.
Gaussian case (scalar) | \(\hat X_{\mathrm{MMSE}} = \frac{\sigma_x^2}{\sigma_x^2+\sigma_w^2} Y\); attenuated version of \(Y\).
Noise-to-signal ratio | \(\sigma_w^2/\sigma_x^2\) controls the attenuation. High noise → strong shrinkage toward prior mean.
Multivariate case | \(\hat X_{\mathrm{MMSE}} = R_x(R_x+R_w)^{-1}Y\); uses signal covariance matrix as prior.
Gaussian special property | MAP = MMSE = conditional median; posterior mean is linear in \(Y\); posterior variance independent of \(Y\).

📋 HW 32 — ML vs. MMSE: Recreating MBIP Figure 2.1

📚 References

  1. Bouman, C. A. Foundations of Computational Imaging, SIAM, 2022, §§2.3.2–2.3.3. (See embedded notes in the course PDF.)
  2. Harvard-Smithsonian CfA. "Black Hole Image." Event Horizon Telescope, 2019. (Bayesian imaging methods at work.) Video: youtube.com/watch?v=iiAi6Y3yaI4