Lesson 38 • Physics 356

Transformers, LLMs & Diffusion Models

⏱ ~50 min read

Learning Objectives

🌍 Background & Motivation

From discriminative to generative models

The networks from Lessons 35–36 are discriminative models — they map inputs to class labels or probabilities. The architectures in this lesson do something more ambitious: transformers process sequences with arbitrary long-range dependencies, and diffusion models learn to generate entirely new data.

Why this matters for physicists

Transformers power the LLMs that are transforming how scientists write, code, and analyze data. AlphaFold 2/3 uses transformers for protein structure prediction. Diffusion models are now standard tools for molecular generation, inverse problems in imaging, and probabilistic weather forecasting. Understanding these architectures at a mechanistic level is increasingly essential for physicists and engineers working at the frontier.

🔄 Transformers & the Attention Mechanism

The problem with sequences

For sequential data (text, time series, sensor streams), the key challenge is long-range dependency: the meaning of a word can depend on something said many sentences ago. Recurrent networks (RNNs, LSTMs) struggled with this because they process sequences step-by-step and "forget" distant context. Transformers solved this with attention.

Scaled Dot-Product Attention

Given a sequence of \(n\) input vectors, attention computes, for each position, a weighted average of all other positions, where the weights encode relevance. Each input \(\mathbf{x}_i\) is linearly projected into three roles: a query \(\mathbf{q}_i = \mathbf{W}_Q\mathbf{x}_i\) ("what am I looking for?"), a key \(\mathbf{k}_i = \mathbf{W}_K\mathbf{x}_i\) ("what do I contain?"), and a value \(\mathbf{v}_i = \mathbf{W}_V\mathbf{x}_i\) ("what information do I carry?").

Stacking the queries, keys, and values into matrices \(\mathbf{Q}\), \(\mathbf{K}\), \(\mathbf{V}\) (one row per position), the attention output is:

\[ \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V} \]

The dot product \(\mathbf{q}_i \cdot \mathbf{k}_j\) measures how relevant position \(j\) is to position \(i\). The softmax normalizes these into weights summing to 1. The output is a weighted average of the values — a context-aware representation of position \(i\).

Why \(\sqrt{d_k}\)? The scaling factor prevents dot products from growing too large when \(d_k\) is large (which would push softmax into saturation, killing gradients).
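The equation above is a few lines of linear algebra. Here is a minimal NumPy sketch (illustrative only, with random matrices standing in for learned projections):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention. Q, K: (n, d_k); V: (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n, n): relevance of position j to i
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # (n, d_v): context-aware outputs

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Each output row is a convex combination of the value rows, with mixing weights set by query-key dot products.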

Multi-Head Attention

Instead of one attention operation, transformers run \(h\) attention "heads" in parallel, each with different \(\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V\) projections. The outputs are concatenated and projected:

\[ \text{MultiHead}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,\mathbf{W}_O \]

Each head can attend to different relationships simultaneously — one head might track grammatical agreement, another might track coreference (who "they" refers to).
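Multi-head attention is just the single-head computation repeated with separate projections, then concatenated. A minimal sketch, with random matrices as stand-ins for the learned \(\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V, \mathbf{W}_O\):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, WQ, WK, WV, WO):
    """X: (n, d_model). WQ/WK/WV: lists of per-head (d_model, d_k) projections."""
    heads = []
    for Wq, Wk, Wv in zip(WQ, WK, WV):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # this head's attention pattern
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ WO        # concat then project: (n, d_model)

rng = np.random.default_rng(1)
n, d_model, h = 5, 16, 4
d_k = d_model // h                                     # heads split the model dimension
WQ = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
WK = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
WV = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
WO = rng.normal(size=(h * d_k, d_model))
X = rng.normal(size=(n, d_model))
out = multi_head_attention(X, WQ, WK, WV, WO)
print(out.shape)  # (5, 16)
```

In practice the per-head projections are fused into single matrices and batched, but the math is identical.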

The Transformer Block

A single transformer block consists of:

  1. Multi-head self-attention: each token attends to all others.
  2. Add & Norm: residual connection (\(\mathbf{x} + \text{attention}(\mathbf{x})\)) then layer normalization.
  3. Feed-forward network: two-layer MLP applied independently to each position.
  4. Add & Norm again.

Stacking \(N_\ell = 12, 24, 96, \ldots\) such blocks gives the full transformer depth. Depth is what gives transformers their power.
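The four steps above can be sketched directly. This is a simplified single-head version without learned LayerNorm gains/biases, meant only to show the data flow:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's feature vector to zero mean, unit variance."""
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def transformer_block(X, WQ, WK, WV, W1, W2):
    # 1. self-attention (one head for brevity; real blocks use multi-head)
    Q, K, V = X @ WQ, X @ WK, X @ WV
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
    X = layer_norm(X + A)                        # 2. add & norm (residual connection)
    F = np.maximum(X @ W1, 0.0) @ W2             # 3. position-wise two-layer MLP (ReLU)
    return layer_norm(X + F)                     # 4. add & norm again

rng = np.random.default_rng(2)
n, d, d_ff = 6, 8, 32
WQ, WK, WV = [0.1 * rng.normal(size=(d, d)) for _ in range(3)]
W1 = 0.1 * rng.normal(size=(d, d_ff))
W2 = 0.1 * rng.normal(size=(d_ff, d))
X = rng.normal(size=(n, d))
out = transformer_block(X, WQ, WK, WV, W1, W2)
print(out.shape)  # (6, 8) -- same shape in and out, so blocks stack freely
```

Because the block maps \((n, d_\text{model}) \to (n, d_\text{model})\), stacking \(N_\ell\) of them is just function composition.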

Self-attention vs. cross-attention: In self-attention, Q, K, V all come from the same sequence. In cross-attention (encoder-decoder), Q comes from the decoder, K and V from the encoder — enabling translation and summarization tasks.

🤖 Large Language Models (LLMs)

What is an LLM?

A Large Language Model is a transformer trained on massive text corpora (hundreds of billions of tokens) to predict the next token in a sequence. The training task is deceptively simple — predict the next word — but requires the model to develop a rich internal representation of language, facts, and reasoning.

\[ p(\text{token}_{t+1} \mid \text{token}_1, \ldots, \text{token}_t) = \text{softmax}(\mathbf{W}_\text{out} \cdot \mathbf{h}_t) \]

where \(\mathbf{h}_t\) is the final transformer layer's hidden state at position \(t\). At inference time, tokens are sampled autoregressively: each predicted token is fed back as input to generate the next.
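The autoregressive loop itself is short. In the sketch below, `toy_logits` is a hypothetical stand-in for a real transformer; any function mapping the context to vocabulary scores plugs into the same loop:

```python
import numpy as np

def sample_autoregressive(logits_fn, prompt, n_new, rng):
    """Next-token sampling loop: each sampled token is appended and fed back in."""
    tokens = list(prompt)
    for _ in range(n_new):
        logits = logits_fn(tokens)            # stand-in for W_out . h_t
        p = np.exp(logits - logits.max())
        p /= p.sum()                          # softmax over the vocabulary
        tokens.append(int(rng.choice(len(p), p=p)))
    return tokens

# Toy stand-in "model" (NOT a transformer): strongly prefers (last token + 1) mod V
V = 5
def toy_logits(tokens):
    logits = np.zeros(V)
    logits[(tokens[-1] + 1) % V] = 5.0
    return logits

rng = np.random.default_rng(0)
out = sample_autoregressive(toy_logits, [0], 4, rng)
print(out)  # most likely [0, 1, 2, 3, 4], since each step favors last+1
```

Real LLM decoders add refinements (temperature, top-k/top-p truncation, KV caching), but all of them are variations on this loop.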

Training: Pre-training + Fine-tuning

| Stage | Task | Data | Outcome |
| --- | --- | --- | --- |
| Pre-training | Next-token prediction (self-supervised) | Entire internet, books, code (\(\sim\)10T tokens) | General language understanding |
| Supervised Fine-Tuning (SFT) | Imitate expert demonstrations | Human-written instruction-response pairs | Follows instructions |
| RLHF | Maximize human-preference reward | Human preference rankings | Aligned, helpful, safe |

Scale and Emergent Abilities

LLM capabilities improve predictably with scale (parameters × data × compute). Strikingly, certain abilities — multi-step reasoning, code generation, in-context learning — appear to emerge discontinuously as models scale, even though the training task never changes. This is not fully understood and is an active research area.

Physics Applications of LLMs/Transformers

🌫 Diffusion Models

The core idea: learn to denoise

Diffusion models are a class of generative model that produce new data by learning to reverse a noising process. The key insight is elegant: if you can learn to remove a small amount of noise from a corrupted image, you can iteratively apply that denoising to pure random noise until a coherent image emerges.

Forward process (noising)

Given a clean data sample \(\mathbf{x}_0\), the forward process gradually adds Gaussian noise over \(T\) steps:

\[ q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t;\, \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\, \beta_t \mathbf{I}) \]

where \(\beta_t\) is a small noise variance, typically following a schedule that increases linearly from \(10^{-4}\) to \(0.02\). After \(T \approx 1000\) steps, \(\mathbf{x}_T\) is approximately pure Gaussian noise. Crucially, you can sample any noisy \(\mathbf{x}_t\) directly from \(\mathbf{x}_0\) in closed form:

\[ \mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \]

where \(\bar{\alpha}_t = \prod_{s=1}^t (1-\beta_s)\).
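The closed-form marginal is easy to check numerically. A small NumPy sketch using the linear schedule quoted above (illustrative values, not a production implementation):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # linear noise-variance schedule
alpha_bar = np.cumprod(1.0 - betas)       # \bar{alpha}_t = prod_s (1 - beta_s)

def q_sample(x0, t, rng):
    """Sample x_t directly from x_0 via the closed-form forward marginal."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = np.ones(10_000)                      # toy "clean data": every pixel equals 1
early = q_sample(x0, 0, rng)              # t = 1: nearly clean, mean close to 1
late = q_sample(x0, T - 1, rng)           # t = T: essentially pure N(0, 1) noise
print(early.mean(), late.mean(), late.std())
```

With this schedule \(\bar{\alpha}_T \approx e^{-10}\), so by \(t = T\) the signal coefficient \(\sqrt{\bar{\alpha}_T}\) is below \(10^{-2}\) and the sample is dominated by Gaussian noise, which is exactly what the reverse process will start from.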

Reverse process (denoising)

The model learns the reverse: given a noisy sample \(\mathbf{x}_t\), predict the noise \(\boldsymbol{\epsilon}\) that was added. This is a neural network \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\) trained with a simple mean-squared loss:

\[ L = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\!\left[\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right\|^2\right] \]

At inference, start from \(\mathbf{x}_T \sim \mathcal{N}(\mathbf{0},\mathbf{I})\) and iteratively apply the learned denoiser to get \(\mathbf{x}_{T-1}, \ldots, \mathbf{x}_0\) — a brand-new sample.
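One Monte-Carlo estimate of the training loss takes only a few lines. In the sketch below, `zero_model` is a hypothetical baseline standing in for a real \(\boldsymbol{\epsilon}_\theta\) network (which would be a U-Net or DiT); a model that always predicts zero noise scores a loss of about 1, the variance of the true noise:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def ddpm_loss(eps_model, x0_batch, rng):
    """One Monte-Carlo estimate of E[ ||eps - eps_theta(x_t, t)||^2 ]."""
    t = rng.integers(0, T, size=len(x0_batch))       # random timestep per sample
    eps = rng.normal(size=x0_batch.shape)            # the true noise that was added
    a = np.sqrt(alpha_bar[t])[:, None]
    s = np.sqrt(1.0 - alpha_bar[t])[:, None]
    x_t = a * x0_batch + s * eps                     # closed-form forward sample
    return np.mean((eps - eps_model(x_t, t)) ** 2)   # simple mean-squared error

# Baseline "network" that always predicts zero noise (hypothetical stand-in)
zero_model = lambda x_t, t: np.zeros_like(x_t)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(256, 8))                       # toy 8-dimensional "data"
loss = ddpm_loss(zero_model, x0, rng)
print(loss)  # close to 1.0: the variance of the true noise
```

Training is then ordinary SGD on this loss; any model that beats the zero baseline has learned something about the data.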

Architecture: The denoising network \(\boldsymbol{\epsilon}_\theta\) is typically a U-Net (a CNN with skip connections) or a Diffusion Transformer (DiT) — transformers applied to image patches. The same training and inference infrastructure from Lessons 35–37 (forward pass, backprop, SGD) applies directly.
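The inference loop is the ancestral sampling update from the DDPM paper (reference 5). The sketch below uses an analytic "oracle" denoiser in place of a trained network: for toy data concentrated at zero, the true noise in \(\mathbf{x}_t\) is exactly \(\mathbf{x}_t / \sqrt{1-\bar{\alpha}_t}\), so the sampler should drive every sample back toward zero:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def ddpm_sample(eps_model, shape, rng):
    """Ancestral DDPM sampling: start from pure noise, denoise step by step."""
    x = rng.normal(size=shape)                               # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        eps_hat = eps_model(x, t)                            # predicted noise
        # remove the predicted noise contribution (posterior mean)
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * rng.normal(size=shape)  # re-inject sampling noise
    return x

# Analytic oracle for data that is a point mass at 0 (illustration, not a trained net):
# x_t = sqrt(1 - abar_t) * eps  =>  eps = x_t / sqrt(1 - abar_t)
oracle = lambda x, t: x / np.sqrt(1.0 - alpha_bar[t])

rng = np.random.default_rng(0)
samples = ddpm_sample(oracle, (1000,), rng)
print(abs(samples).max())  # essentially zero: the sampler recovers the data
```

Swapping the oracle for a trained \(\boldsymbol{\epsilon}_\theta\) gives the full generative model; nothing else in the loop changes.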

Why diffusion models produce such high-quality outputs

Unlike GANs (which can be unstable to train) or VAEs (which can produce blurry outputs), diffusion models have a stable training objective and iteratively refine outputs over many steps. Each denoising step makes a small, learned correction — errors do not compound catastrophically.

Physics applications of diffusion models

Recommended Video

Andrej Karpathy's "Let's build GPT from scratch" is the best hands-on walkthrough of transformers available. Watch at least the first 45 minutes for the core architecture.

📝 Summary

| Architecture | Key Innovation | Best For |
| --- | --- | --- |
| MLP (L35–36) | Universal approximation via depth | Tabular data, general function fitting |
| CNN (L37) | Parameter sharing via convolution; spatial locality | Images, 2D/3D data, spectrograms |
| Transformer | Self-attention captures arbitrary long-range dependencies | Sequences: text, time series, molecules |
| LLM | Pre-trained transformer at massive scale | Language, code, reasoning, general-purpose AI |
| Diffusion Model | Learns to reverse a Gaussian noising process; trained with a simple MSE loss on predicted noise | Image/molecule/signal generation, inverse problems |
📋 HW 38 — Transformers & Diffusion (Qualitative + Optional Demo)

📚 References

  1. Vaswani, A., et al. (2017). Attention is all you need. NeurIPS 2017. arXiv:1706.03762.
  2. Karpathy, A. (2023). Let's build GPT: from scratch, in code, spelled out [Video]. YouTube.
  3. Brown, T. B., et al. (2020). Language models are few-shot learners (GPT-3). NeurIPS 2020. arXiv:2005.14165.
  4. Jumper, J., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583–589.
  5. Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models (DDPM). NeurIPS 2020. arXiv:2006.11239.
  6. Song, Y., et al. (2021). Score-based generative modeling through stochastic differential equations. ICLR 2021. arXiv:2011.13456.