Lesson 32 • Physics 356

Statistical Estimation II: Bayesian Estimation, MAP & MMSE

โฑ ~45 min read

Learning Objectives

Notation used in this lesson

\(X\) : Unknown quantity (random in Bayesian framework)
\(Y\) : Observed data (random vector)
\(p_x(x)\) : Prior distribution of \(X\)
\(p_{y|x}(y|x)\) : Likelihood (forward model)
\(p_{x|y}(x|y)\) : Posterior distribution of \(X\) given \(Y\)
\(\hat X_{\mathrm{MAP}}\) : Maximum a posteriori estimate
\(\hat X_{\mathrm{MMSE}}\) : Minimum MSE (posterior mean) estimate
\(\sigma_x^2,\,\sigma_w^2\) : Signal and noise variances
\(R_x,\,R_w\) : Signal and noise covariance matrices

💡 Motivation: Going Beyond Maximum Likelihood

The ML estimator is asymptotically optimal — as the amount of data grows, it performs as well as any unbiased estimator can. However, it ignores one potentially powerful source of information: prior knowledge about what values \(\theta\) is likely to take.

Consider trying to estimate a smooth physical signal from a small number of noisy measurements. The ML estimator simply returns the noisy observations unchanged — it does not know the signal is smooth. A physicist, however, would not accept a wildly oscillating estimate if they knew the true signal varies slowly. Bayesian methods formalize exactly this kind of prior knowledge.

When Bayesian methods help most: When the amount of data is small relative to the dimension of the unknown, or when strong prior information is available (e.g., the signal is smooth, non-negative, or constrained to a physical range). When data is abundant and prior information is weak, the ML and Bayesian estimates converge.

🎲 The Bayesian Framework

Unknown as a random variable

The fundamental shift in the Bayesian approach: the unknown quantity \(X\) is modeled as a random variable, not a fixed constant. This allows us to assign a prior distribution \(p_x(x)\) that encodes our beliefs about \(X\) before seeing any data.

Note: In the frequentist framework (Lesson 31), the unknown was written \(\theta\) (a fixed parameter). In the Bayesian framework, it is written \(X\) (a random variable). This is a notational convention to emphasize the philosophical difference, not a mathematical one.

The three key distributions

Name | Notation | Meaning
Prior | \(p_x(x)\) | What we believe about \(X\) before observing data
Likelihood (forward model) | \(p_{y|x}(y|x)\) | Probability of observing \(y\) if the true value is \(x\)
Posterior | \(p_{x|y}(x|y)\) | Updated belief about \(X\) after observing data \(y\)

The posterior is the central object in Bayesian estimation. It combines everything we know — the data and our prior — into a single distribution over \(X\).

🔄 Bayes' Rule and the Posterior

Bayes' rule

Our goal is the posterior \(p_{x|y}(x|y)\) — the distribution of the unknown \(X\) given the observed data \(Y\). In most practical problems, this distribution cannot be written down directly: we simply do not have a physics-based model that tells us the probability of a particular signal value given a particular measurement.

What we can usually formulate is the forward model \(p_{y|x}(y|x)\): the probability of obtaining measurement \(Y = y\) if the true underlying signal is \(X = x\). This is a physics model of the measurement process — noise statistics, sensor response, blur kernels, etc. — and is often straightforward to write down from first principles.

Bayes' rule flips the conditioning, letting us obtain the posterior we want from the forward model we have:

\[ p_{x|y}(x|y) = \frac{p_{y|x}(y|x)\, p_x(x)}{p_y(y)} \]
Note: The denominator \(p_y(y) = \int p_{y|x}(y|x)\,p_x(x)\,dx\) is the normalizing constant (also called the partition function \(z_y\)). It does not depend on \(x\) — it just ensures the posterior integrates to 1. When maximizing over \(x\), it can be dropped.
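As a quick numerical sanity check (an illustrative two-state example, sketched in Python rather than the course's MATLAB), Bayes' rule for a discrete unknown makes the role of the normalizing constant concrete: it is simply the sum of the unnormalized numerator.

```python
import numpy as np

# Hypothetical discrete example: X takes values {0, 1}.
p_x = np.array([0.8, 0.2])          # prior: P(X=0)=0.8, P(X=1)=0.2
p_y_given_x = np.array([0.1, 0.7])  # likelihood of one observed y under each x

unnormalized = p_y_given_x * p_x    # numerator of Bayes' rule
p_y = unnormalized.sum()            # normalizing constant p_y(y)
posterior = unnormalized / p_y      # posterior p(x|y), sums to 1

print(posterior)
print(posterior.argmax())  # the MAP estimate picks the posterior's largest entry
```

Note how a likely observation under \(x = 1\) overturns a prior that favored \(x = 0\): the posterior puts most of its mass on \(x = 1\).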

MAP estimator

The maximum a posteriori (MAP) estimator finds the mode of the posterior distribution:

\[ \hat X_{\mathrm{MAP}} = \underset{x \in \Omega}{\operatorname{arg\,max}}\; p_{x|y}(x|Y) \]

Using Bayes' rule and dropping the constant denominator, this becomes:

\begin{align} \hat X_{\mathrm{MAP}} &= \underset{x}{\operatorname{arg\,max}}\;\bigl\{ \log p_{y|x}(y|x) + \log p_x(x) \bigr\} \\[4pt] &= \underset{x}{\operatorname{arg\,min}}\;\bigl\{ -\log p_{y|x}(y|x) - \log p_x(x) \bigr\} \end{align}
Note: The MAP estimate balances two competing objectives: (1) fit the data well (minimize negative log-likelihood) and (2) be consistent with the prior (minimize negative log-prior). This is the fundamental trade-off in Bayesian estimation.
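To see the data-fit vs. prior trade-off concretely, here is a small sketch (in Python; the values \(\sigma_x^2 = 2\), \(\sigma_w^2 = 3\), \(y = 1.5\) are illustrative) that checks the closed-form MAP solution of the Gaussian case against a brute-force grid search over the negative log-posterior:

```python
import numpy as np

sigma_x2, sigma_w2, y = 2.0, 3.0, 1.5   # illustrative values

# Negative log-posterior up to constants: data-fit term + prior penalty
cost = lambda x: (y - x)**2 / (2 * sigma_w2) + x**2 / (2 * sigma_x2)

xs = np.linspace(-5, 5, 200001)          # fine grid over candidate x
x_map_numeric = xs[np.argmin(cost(xs))]  # brute-force minimizer

# Closed form from setting the derivative of the quadratic cost to zero
x_map_closed = sigma_x2 / (sigma_x2 + sigma_w2) * y

print(x_map_numeric, x_map_closed)  # both approximately 0.6
```

Because both negative-log terms are quadratic in \(x\), the minimizer has the same attenuated-observation form derived in Example 1 below.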

MMSE estimator

The minimum mean squared error (MMSE) estimator minimizes the expected squared error \(\mathbb{E}[\|X - \hat X\|^2]\) over all estimators \(\hat X(Y)\). It is the posterior mean:

\[ \hat X_{\mathrm{MMSE}} = \mathbb{E}[X \mid Y] = \int x\; p_{x|y}(x|Y)\, dx \]

For a Gaussian posterior, the mean and mode coincide, so \(\hat X_{\mathrm{MMSE}} = \hat X_{\mathrm{MAP}}\) in that special case.
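An illustrative Monte Carlo sketch of the "minimum" in MMSE (in Python; restricting attention to linear estimators \(aY\) is a simplification made only for this check): among all gains \(a\), the empirical MSE is smallest near the posterior-mean gain \(\sigma_x^2/(\sigma_x^2+\sigma_w^2)\).

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_x2, sigma_w2 = 2.0, 3.0   # illustrative variances
n = 200_000

X = rng.normal(0, np.sqrt(sigma_x2), n)   # draws from the prior
Y = X + rng.normal(0, np.sqrt(sigma_w2), n)  # noisy observations

# Empirical MSE of the linear estimator a*Y for a grid of gains
gains = np.linspace(0, 1, 101)
mse = [np.mean((a * Y - X)**2) for a in gains]

best_gain = gains[np.argmin(mse)]
print(best_gain)   # close to sigma_x2/(sigma_x2+sigma_w2) = 0.4
```

For jointly Gaussian \(X\) and \(Y\) the best linear estimator is also the overall MMSE estimator, so this grid search recovers the attenuation factor derived in Example 1.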

📈 Example 1: Gaussian Signal in Gaussian Noise (Scalar)

Problem setup

Let \(X \sim \mathcal{N}(0, \sigma_x^2)\) be the unknown scalar signal, \(W \sim \mathcal{N}(0, \sigma_w^2)\) be independent noise, and the observation be \(Y = X + W\). We want to estimate \(X\) from \(Y\).

The prior and forward model are:

\[ p_x(x) = \frac{1}{\sqrt{2\pi\sigma_x^2}} \exp\!\left\{-\frac{x^2}{2\sigma_x^2}\right\}, \qquad p_{y|x}(y|x) = \frac{1}{\sqrt{2\pi\sigma_w^2}} \exp\!\left\{-\frac{(y-x)^2}{2\sigma_w^2}\right\} \]

Posterior distribution

By Bayes' rule, the posterior is proportional to the product of the likelihood and the prior: \[ p_{x|y}(x|y) \;\propto\; p_{y|x}(y|x)\, p_x(x) \] Multiplying the two Gaussians and completing the square in \(x\), the product takes the form of an unnormalized Gaussian in \(x\). Identifying the mean and variance of that Gaussian and then dividing by the normalizing constant \(p_y(y)\) (which does not depend on \(x\)), the posterior is:

\[ p_{x|y}(x|y) = \frac{1}{\sqrt{2\pi\sigma_{x|y}^2}} \exp\!\left\{-\frac{1}{2\sigma_{x|y}^2}(x - \mu_{x|y})^2\right\} \]

where the posterior mean and variance are:

\[ \mu_{x|y} = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_w^2}\, y, \qquad \sigma_{x|y}^2 = \frac{\sigma_x^2\, \sigma_w^2}{\sigma_x^2 + \sigma_w^2} \]
Note: The posterior is still Gaussian — this is a special property of Gaussian priors with Gaussian likelihoods. For jointly Gaussian \(X\) and \(Y\), the MAP, MMSE, and conditional median estimates are all identical: \(\hat X_{\mathrm{MAP}} = \hat X_{\mathrm{MMSE}} = \hat X_{\mathrm{median}}\).
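For completeness, the completing-the-square step behind these formulas reads (collecting all terms that do not depend on \(x\) into constants):

\begin{align} \log p_{y|x}(y|x) + \log p_x(x) &= -\frac{(y-x)^2}{2\sigma_w^2} - \frac{x^2}{2\sigma_x^2} + \text{const} \\[4pt] &= -\frac{1}{2}\left(\frac{1}{\sigma_w^2} + \frac{1}{\sigma_x^2}\right)x^2 + \frac{y}{\sigma_w^2}\,x + \text{const} \\[4pt] &= -\frac{(x - \mu_{x|y})^2}{2\sigma_{x|y}^2} + \text{const} \end{align}

Matching coefficients gives \(\sigma_{x|y}^2 = \left(\frac{1}{\sigma_x^2} + \frac{1}{\sigma_w^2}\right)^{-1} = \frac{\sigma_x^2\,\sigma_w^2}{\sigma_x^2 + \sigma_w^2}\) and \(\mu_{x|y} = \sigma_{x|y}^2 \cdot \frac{y}{\sigma_w^2} = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_w^2}\, y\), exactly the posterior mean and variance stated above.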

MMSE estimate and the noise-to-signal ratio

Since the posterior is Gaussian, \(\hat X_{\mathrm{MMSE}} = \mu_{x|y}\):

\[ \hat X_{\mathrm{MMSE}} = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_w^2}\, Y = \frac{1}{1 + \sigma_w^2/\sigma_x^2}\, Y \]
Note: The MMSE estimate is \(Y\) attenuated by a factor between 0 and 1. The attenuation depends on the noise-to-signal ratio \(\sigma_w^2/\sigma_x^2\):
  • Low noise (\(\sigma_w^2 \ll \sigma_x^2\)): attenuation ≈ 1 → \(\hat X \approx Y\) (trust the data).
  • High noise (\(\sigma_w^2 \gg \sigma_x^2\)): attenuation ≈ 0 → \(\hat X \approx 0\) (trust the prior mean).
This attenuation reduces MSE but introduces bias: \(\mathbb{E}[\hat X_{\mathrm{MMSE}} \mid X] = \frac{\sigma_x^2}{\sigma_x^2+\sigma_w^2}\, X \neq X\), i.e., the estimate is shrunk toward zero rather than centered on the true \(X\).
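The bias-variance bookkeeping behind this can be checked with plain arithmetic (sketched in Python, using the same illustrative variances \(\sigma_x^2 = 2\), \(\sigma_w^2 = 3\) as the worked example later in this lesson):

```python
sigma_x2, sigma_w2 = 2.0, 3.0
a = sigma_x2 / (sigma_x2 + sigma_w2)   # attenuation factor = 0.4

# Error of the MMSE estimate: a*(X+W) - X = (a-1)*X + a*W.
# Averaging over X and W, the MSE splits into squared bias + variance:
avg_bias2 = (1 - a)**2 * sigma_x2   # E[((a-1)X)^2]
variance  = a**2 * sigma_w2         # E[(aW)^2]
mse = avg_bias2 + variance

# The total equals the posterior variance sigma_x|y^2
post_var = sigma_x2 * sigma_w2 / (sigma_x2 + sigma_w2)
print(mse, post_var)  # both 1.2
```

So for jointly Gaussian \(X\) and \(Y\), the Bayes MSE of the MMSE estimator is exactly the posterior variance: the bias it accepts is more than repaid in reduced variance.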

Compare with ML

The ML estimate treats \(X\) as a fixed unknown parameter. Maximizing the likelihood \(p_{y|x}(y|x)\) over \(x\) gives simply:

\[ \hat X_{\mathrm{ML}} = Y \]

The ML estimate is unbiased (\(\mathbb{E}[\hat X_{\mathrm{ML}} | X] = X\)) but noisy — it just returns the raw observation. The MMSE estimate is biased but has lower MSE when \(\sigma_w^2\) is not negligible.

🔢 Example 2: Multivariate Gaussian Estimation

The scalar result generalizes directly to vectors. Now \(X \in \mathbb{R}^n\) is an \(n\)-dimensional unknown — for example, all \(n\) pixel values in an image, or \(n\) physical quantities measured simultaneously. The distribution \(X \sim \mathcal{N}(0, R_x)\) is still Gaussian, but the variance is replaced by a covariance matrix \(R_x \in \mathbb{R}^{n \times n}\).

The covariance matrix \(R_x\) captures both the variance of each individual component (its diagonal entries) and the statistical relationships between components (its off-diagonal entries, the pairwise covariances).

Similarly, \(W \sim \mathcal{N}(0, R_w)\) is independent noise with covariance \(R_w\), and \(Y = X + W\). The MMSE estimate of \(X\) given \(Y\) is:

\[ \hat X_{\mathrm{MMSE}} = \mathbb{E}[X \mid Y] = R_x(R_x + R_w)^{-1} Y \]

The posterior covariance (uncertainty after observing \(Y\)) is:

\[ R_{x|y} = (R_x^{-1} + R_w^{-1})^{-1} \]
Note: Two important observations for jointly Gaussian variables: (1) the posterior mean is a linear function of the data \(Y\); (2) the posterior covariance \(R_{x|y}\) does not depend on \(Y\). These two properties are special to Gaussians and do not hold for general distributions.

This is the formula you will implement in HW 32 for a 50-dimensional signal reconstruction problem.
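A minimal NumPy sketch of the multivariate formulas (the exponentially decaying prior covariance below is an illustrative assumption, not the HW 32 model). It also checks that the posterior covariance above agrees with the equivalent form \(R_x - R_x(R_x+R_w)^{-1}R_x\):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5  # small dimension for illustration

# Illustrative prior covariance: exponentially decaying correlations
idx = np.arange(n)
R_x = 2.0 * 0.8 ** np.abs(idx[:, None] - idx[None, :])
R_w = 0.5 * np.eye(n)  # white noise covariance

# Draw one realization X ~ N(0, R_x) and its noisy observation Y = X + W
L = np.linalg.cholesky(R_x)
X = L @ rng.normal(size=n)
Y = X + rng.normal(0, np.sqrt(0.5), n)

# MMSE estimate R_x (R_x + R_w)^{-1} Y, via solve rather than explicit inverse
X_mmse = R_x @ np.linalg.solve(R_x + R_w, Y)

# Posterior covariance in the two standard (equivalent) forms
R_post     = np.linalg.inv(np.linalg.inv(R_x) + np.linalg.inv(R_w))
R_post_alt = R_x - R_x @ np.linalg.solve(R_x + R_w, R_x)
print(np.allclose(R_post, R_post_alt))  # True
```

Using `np.linalg.solve` instead of forming \((R_x+R_w)^{-1}\) explicitly is the numerically preferred way to apply these formulas.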

▶ Recommended Videos

3Blue1Brown gives a geometric and visual derivation of Bayes' theorem — ideal for building the intuition behind prior, likelihood, and posterior before working through the math.

Katie Bouman explains how Bayesian estimation and computational imaging were used to reconstruct the first image of a black hole from sparse radio telescope data — a direct application of the prior + likelihood framework from this lesson.

Veritasium motivates Bayesian thinking with real-world examples and explains why our intuition about probabilities often fails without a prior.

A more technical look at the computational imaging methods behind the first black hole image — sparse measurements, atmospheric corruption, and the Bayesian regularization that made reconstruction possible.

💻 Worked Example — MATLAB

This example demonstrates the scalar Gaussian case from Example 1: comparing the ML and MMSE estimates for a single observation. Your HW 32 extends this to the full 50-dimensional multivariate case from Example 2 — use this example to build intuition first.

%% Lesson 32 -- Scalar Bayesian Estimation (demonstration, Example 1)
% X ~ N(0, sigma_x^2),  W ~ N(0, sigma_w^2),  Y = X + W
% ML estimate:   X_hat_ML   = Y
% MMSE estimate: X_hat_MMSE = (sigma_x^2 / (sigma_x^2 + sigma_w^2)) * Y

rng(7);
sigma_x2 = 2.0;   % signal variance
sigma_w2 = 3.0;   % noise variance  (high noise scenario)

%% Single trial
X_true  = randn * sqrt(sigma_x2);
W       = randn * sqrt(sigma_w2);
Y       = X_true + W;

X_ML   = Y;
X_MMSE = (sigma_x2 / (sigma_x2 + sigma_w2)) * Y;

fprintf('True X     = %+.4f\n', X_true);
fprintf('Observation Y = %+.4f\n', Y);
fprintf('ML  estimate  = %+.4f  (error = %.4f)\n', X_ML,   abs(X_ML   - X_true));
fprintf('MMSE estimate = %+.4f  (error = %.4f)\n', X_MMSE, abs(X_MMSE - X_true));

%% Monte Carlo: compare MSE over many trials
M     = 10000;
err_ML   = zeros(M,1);
err_MMSE = zeros(M,1);

for i = 1:M
  Xi = randn * sqrt(sigma_x2);
  Yi = Xi + randn * sqrt(sigma_w2);
  err_ML(i)   = (Yi - Xi)^2;
  err_MMSE(i) = ((sigma_x2/(sigma_x2+sigma_w2))*Yi - Xi)^2;
end

fprintf('\nEmpirical MSE (ML)   = %.4f\n', mean(err_ML));
fprintf('Empirical MSE (MMSE) = %.4f\n', mean(err_MMSE));

% Theoretical MSE for the MMSE estimator: for jointly Gaussian X and Y,
% the Bayes MSE equals the posterior variance sigma_x|y^2
% (averaged squared bias plus estimator variance sum to exactly this value)
sigma_xgy2 = sigma_x2 * sigma_w2 / (sigma_x2 + sigma_w2);
fprintf('Theoretical MSE (MMSE) = %.4f\n', sigma_xgy2);
What to observe: With high noise (\(\sigma_w^2 > \sigma_x^2\)), the MMSE estimate has lower MSE than ML despite being biased. The attenuation factor pulls estimates toward the prior mean (zero), which helps on average when the noise is large. Try changing sigma_w2 to 0.1 to see the low-noise regime where ML and MMSE nearly coincide.

โš–๏ธ Comparing ML, MAP, and MMSE

Estimator | Treats \(X\) as | Formula | Biased? | Best when
ML | Fixed unknown | \(\arg\max_x \log p_{y|x}(y|x)\) | Approx. unbiased for large \(n\) | Lots of data; weak/no prior
MAP | Random | \(\arg\max_x \{\log p_{y|x} + \log p_x\}\) | Generally yes | Prior is known; finding mode matters
MMSE | Random | \(\mathbb{E}[X|Y]\) (posterior mean) | Generally yes | Prior is known; minimizing MSE matters
When Bayesian can hurt: If the assumed prior \(p_x(x)\) does not reflect reality, both MAP and MMSE estimates can be significantly worse than ML. A wrong prior is worse than no prior.

๐Ÿ“ Summary

Concept | Key Idea
Bayesian vs. frequentist | Bayesian models \(X\) as random and incorporates a prior; frequentist treats \(\theta\) as fixed.
Prior \(p_x(x)\) | Encodes knowledge about \(X\) before seeing data.
Posterior \(p_{x|y}(x|y)\) | Updated distribution of \(X\) after observing \(Y\); computed via Bayes' rule.
MAP estimator | Mode of the posterior; minimizes \(-\log p_{y|x} - \log p_x\).
MMSE estimator | Posterior mean; minimizes expected squared error.
Gaussian case (scalar) | \(\hat X_{\mathrm{MMSE}} = \frac{\sigma_x^2}{\sigma_x^2+\sigma_w^2} Y\); attenuated version of \(Y\).
Noise-to-signal ratio | \(\sigma_w^2/\sigma_x^2\) controls the attenuation. High noise → strong shrinkage toward prior mean.
Multivariate case | \(\hat X_{\mathrm{MMSE}} = R_x(R_x+R_w)^{-1}Y\); uses signal covariance matrix as prior.
Gaussian special property | MAP = MMSE = conditional median; posterior mean is linear in \(Y\); posterior variance independent of \(Y\).

📋 HW 32 — ML vs. MMSE: Recreating MBIP Figure 2.1

📚 References

  1. Bouman, C. A. Foundations of Computational Imaging, SIAM, 2022, §§2.3.2–2.3.3. (See embedded notes in the course PDF.)
  2. Harvard-Smithsonian CfA. "Black Hole Image." Event Horizon Telescope, 2019. (Bayesian imaging methods at work.) Video: youtube.com/watch?v=iiAi6Y3yaI4