Lesson 38 • Physics 356

Transformers, LLMs & Diffusion Models

⏱ ~50 min read

Learning Objectives

🌍 Background & Motivation

From discriminative to generative models

The networks from Lessons 35–36 are discriminative models — they map inputs to class labels or probabilities. The architectures in this lesson do something more ambitious: transformers process sequences with arbitrary long-range dependencies, and diffusion models learn to generate entirely new data.

Why this matters for physicists

Transformers power the LLMs that are transforming how scientists write, code, and analyze data. AlphaFold 2/3 uses transformers for protein structure prediction. Diffusion models are now standard tools for molecular generation, inverse problems in imaging, and probabilistic weather forecasting. Understanding these architectures at a mechanistic level is increasingly essential for physicists and engineers working at the frontier.

🔄 Transformers & the Attention Mechanism

The problem with sequences

For sequential data (text, time series, sensor streams), the key challenge is long-range dependency: the meaning of a word can depend on something said many sentences ago. Recurrent networks (RNNs, LSTMs) struggled with this because they process sequences step-by-step and "forget" distant context. Transformers solved this with attention.

Scaled Dot-Product Attention

Given a sequence of \(n\) input vectors, attention computes, for each position, a weighted average of all other positions, where the weights encode relevance. Each input \(\mathbf{x}_i\) is linearly projected into three roles: a query \(\mathbf{q}_i = \mathbf{W}_Q\mathbf{x}_i\) ("what am I looking for?"), a key \(\mathbf{k}_i = \mathbf{W}_K\mathbf{x}_i\) ("what do I contain?"), and a value \(\mathbf{v}_i = \mathbf{W}_V\mathbf{x}_i\) ("what information do I carry?").

Stacking the queries, keys, and values into matrices \(\mathbf{Q}\), \(\mathbf{K}\), \(\mathbf{V}\) (one row per position), the attention output is:

\[ \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V} \]

The dot product \(\mathbf{q}_i \cdot \mathbf{k}_j\) measures how relevant position \(j\) is to position \(i\). The softmax normalizes these into weights summing to 1. The output is a weighted average of the values — a context-aware representation of position \(i\).

Why \(\sqrt{d_k}\)? The scaling factor prevents dot products from growing too large when \(d_k\) is large (which would push softmax into saturation, killing gradients).
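The equation above is a few lines of linear algebra. Here is a minimal NumPy sketch (illustrative only, with random matrices standing in for learned projections):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention. Q, K: (n, d_k); V: (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n, n): relevance of position j to i
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # (n, d_v): context-aware outputs

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Each output row is a convex combination of the value rows, with mixing weights set by query-key dot products.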

Multi-Head Attention

Instead of one attention operation, transformers run \(h\) attention "heads" in parallel, each with different \(\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V\) projections. The outputs are concatenated and projected:

\[ \text{MultiHead}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,\mathbf{W}_O \]

Each head can attend to different relationships simultaneously — one head might track grammatical agreement, another might track coreference (who "they" refers to).
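Multi-head attention is just the single-head computation repeated with separate projections, then concatenated. A minimal sketch, with random matrices as stand-ins for the learned \(\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V, \mathbf{W}_O\):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, WQ, WK, WV, WO):
    """X: (n, d_model). WQ/WK/WV: lists of per-head (d_model, d_k) projections."""
    heads = []
    for Wq, Wk, Wv in zip(WQ, WK, WV):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # this head's attention pattern
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ WO        # concat then project: (n, d_model)

rng = np.random.default_rng(1)
n, d_model, h = 5, 16, 4
d_k = d_model // h                                     # heads split the model dimension
WQ = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
WK = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
WV = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
WO = rng.normal(size=(h * d_k, d_model))
X = rng.normal(size=(n, d_model))
out = multi_head_attention(X, WQ, WK, WV, WO)
print(out.shape)  # (5, 16)
```

In practice the per-head projections are fused into single matrices and batched, but the math is identical.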

The Transformer Block

A single transformer block consists of:

  1. Multi-head self-attention: each token attends to all others.
  2. Add & Norm: residual connection (\(\mathbf{x} + \text{attention}(\mathbf{x})\)) then layer normalization.
  3. Feed-forward network: two-layer MLP applied independently to each position.
  4. Add & Norm again.

Stacking \(N_\ell = 12, 24, 96, \ldots\) such blocks gives the full transformer depth. Depth is what gives transformers their power.
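The four steps above can be sketched directly. This is a simplified single-head version without learned LayerNorm gains/biases, meant only to show the data flow:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's feature vector to zero mean, unit variance."""
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def transformer_block(X, WQ, WK, WV, W1, W2):
    # 1. self-attention (one head for brevity; real blocks use multi-head)
    Q, K, V = X @ WQ, X @ WK, X @ WV
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
    X = layer_norm(X + A)                        # 2. add & norm (residual connection)
    F = np.maximum(X @ W1, 0.0) @ W2             # 3. position-wise two-layer MLP (ReLU)
    return layer_norm(X + F)                     # 4. add & norm again

rng = np.random.default_rng(2)
n, d, d_ff = 6, 8, 32
WQ, WK, WV = [0.1 * rng.normal(size=(d, d)) for _ in range(3)]
W1 = 0.1 * rng.normal(size=(d, d_ff))
W2 = 0.1 * rng.normal(size=(d_ff, d))
X = rng.normal(size=(n, d))
out = transformer_block(X, WQ, WK, WV, W1, W2)
print(out.shape)  # (6, 8) -- same shape in and out, so blocks stack freely
```

Because the block maps \((n, d_\text{model}) \to (n, d_\text{model})\), stacking \(N_\ell\) of them is just function composition.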

Self-attention vs. cross-attention: In self-attention, Q, K, V all come from the same sequence. In cross-attention (encoder-decoder), Q comes from the decoder, K and V from the encoder — enabling translation and summarization tasks.

🤖 Large Language Models (LLMs)

What is an LLM?

A Large Language Model is a transformer trained on massive text corpora (hundreds of billions of tokens) to predict the next token in a sequence. The training task is deceptively simple — predict the next word — but requires the model to develop a rich internal representation of language, facts, and reasoning.

\[ p(\text{token}_{t+1} \mid \text{token}_1, \ldots, \text{token}_t) = \text{softmax}(\mathbf{W}_\text{out} \cdot \mathbf{h}_t) \]

where \(\mathbf{h}_t\) is the final transformer layer's hidden state at position \(t\). At inference time, tokens are sampled autoregressively: each predicted token is fed back as input to generate the next.
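The autoregressive loop itself is short. In the sketch below, `toy_logits` is a hypothetical stand-in for a real transformer; any function mapping the context to vocabulary scores plugs into the same loop:

```python
import numpy as np

def sample_autoregressive(logits_fn, prompt, n_new, rng):
    """Next-token sampling loop: each sampled token is appended and fed back in."""
    tokens = list(prompt)
    for _ in range(n_new):
        logits = logits_fn(tokens)            # stand-in for W_out . h_t
        p = np.exp(logits - logits.max())
        p /= p.sum()                          # softmax over the vocabulary
        tokens.append(int(rng.choice(len(p), p=p)))
    return tokens

# Toy stand-in "model" (NOT a transformer): strongly prefers (last token + 1) mod V
V = 5
def toy_logits(tokens):
    logits = np.zeros(V)
    logits[(tokens[-1] + 1) % V] = 5.0
    return logits

rng = np.random.default_rng(0)
out = sample_autoregressive(toy_logits, [0], 4, rng)
print(out)  # most likely [0, 1, 2, 3, 4], since each step favors last+1
```

Real LLM decoders add refinements (temperature, top-k/top-p truncation, KV caching), but all of them are variations on this loop.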

Training: Pre-training + Fine-tuning

| Stage | Task | Data | Outcome |
| --- | --- | --- | --- |
| Pre-training | Next-token prediction (self-supervised) | Entire internet, books, code (\(\sim\)10T tokens) | General language understanding |
| Supervised Fine-Tuning (SFT) | Imitate expert demonstrations | Human-written instruction-response pairs | Follows instructions |
| RLHF | Maximize human-preference reward | Human preference rankings | Aligned, helpful, safe |

Scale and Emergent Abilities

LLM capabilities improve predictably with scale (parameters × data × compute). Strikingly, certain abilities — multi-step reasoning, code generation, in-context learning — appear to emerge discontinuously as models scale, even though the training task never changes. This is not fully understood and is an active research area.

Physics Applications of LLMs/Transformers

🌫 Diffusion Models

The core idea: learn to denoise

Diffusion models are a class of generative model that produce new data by learning to reverse a noising process. The key insight is elegant: if you can learn to remove a small amount of noise from a corrupted image, you can iteratively apply that denoising to pure random noise until a coherent image emerges.

Forward process (noising)

Given a clean data sample \(\mathbf{x}_0\), the forward process gradually adds Gaussian noise over \(T\) steps:

\[ q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t;\, \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\, \beta_t \mathbf{I}) \]

where \(\beta_t\) is a small noise variance, typically following a schedule that increases linearly from \(10^{-4}\) to \(0.02\). After \(T \approx 1000\) steps, \(\mathbf{x}_T\) is approximately pure Gaussian noise. Crucially, you can sample any noisy \(\mathbf{x}_t\) directly from \(\mathbf{x}_0\) in closed form:

\[ \mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \]

where \(\bar{\alpha}_t = \prod_{s=1}^t (1-\beta_s)\).
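The closed-form marginal is easy to check numerically. A small NumPy sketch using the linear schedule quoted above (illustrative values, not a production implementation):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # linear noise-variance schedule
alpha_bar = np.cumprod(1.0 - betas)       # \bar{alpha}_t = prod_s (1 - beta_s)

def q_sample(x0, t, rng):
    """Sample x_t directly from x_0 via the closed-form forward marginal."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = np.ones(10_000)                      # toy "clean data": every pixel equals 1
early = q_sample(x0, 0, rng)              # t = 1: nearly clean, mean close to 1
late = q_sample(x0, T - 1, rng)           # t = T: essentially pure N(0, 1) noise
print(early.mean(), late.mean(), late.std())
```

With this schedule \(\bar{\alpha}_T \approx e^{-10}\), so by \(t = T\) the signal coefficient \(\sqrt{\bar{\alpha}_T}\) is below \(10^{-2}\) and the sample is dominated by Gaussian noise, which is exactly what the reverse process will start from.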

Reverse process (denoising)

The model learns the reverse: given a noisy sample \(\mathbf{x}_t\), predict the noise \(\boldsymbol{\epsilon}\) that was added. This is a neural network \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\) trained with a simple mean-squared loss:

\[ L = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\!\left[\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right\|^2\right] \]

At inference, start from \(\mathbf{x}_T \sim \mathcal{N}(\mathbf{0},\mathbf{I})\) and iteratively apply the learned denoiser to get \(\mathbf{x}_{T-1}, \ldots, \mathbf{x}_0\) — a brand-new sample.
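One Monte-Carlo estimate of the training loss takes only a few lines. In the sketch below, `zero_model` is a hypothetical baseline standing in for a real \(\boldsymbol{\epsilon}_\theta\) network (which would be a U-Net or DiT); a model that always predicts zero noise scores a loss of about 1, the variance of the true noise:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def ddpm_loss(eps_model, x0_batch, rng):
    """One Monte-Carlo estimate of E[ ||eps - eps_theta(x_t, t)||^2 ]."""
    t = rng.integers(0, T, size=len(x0_batch))       # random timestep per sample
    eps = rng.normal(size=x0_batch.shape)            # the true noise that was added
    a = np.sqrt(alpha_bar[t])[:, None]
    s = np.sqrt(1.0 - alpha_bar[t])[:, None]
    x_t = a * x0_batch + s * eps                     # closed-form forward sample
    return np.mean((eps - eps_model(x_t, t)) ** 2)   # simple mean-squared error

# Baseline "network" that always predicts zero noise (hypothetical stand-in)
zero_model = lambda x_t, t: np.zeros_like(x_t)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(256, 8))                       # toy 8-dimensional "data"
loss = ddpm_loss(zero_model, x0, rng)
print(loss)  # close to 1.0: the variance of the true noise
```

Training is then ordinary SGD on this loss; any model that beats the zero baseline has learned something about the data.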

Architecture: The denoising network \(\boldsymbol{\epsilon}_\theta\) is typically a U-Net (a CNN with skip connections) or a Diffusion Transformer (DiT) — transformers applied to image patches. The same training and inference infrastructure from Lessons 35–37 (forward pass, backprop, SGD) applies directly.
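The inference loop is the ancestral sampling update from the DDPM paper (reference 5). The sketch below uses an analytic "oracle" denoiser in place of a trained network: for toy data concentrated at zero, the true noise in \(\mathbf{x}_t\) is exactly \(\mathbf{x}_t / \sqrt{1-\bar{\alpha}_t}\), so the sampler should drive every sample back toward zero:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def ddpm_sample(eps_model, shape, rng):
    """Ancestral DDPM sampling: start from pure noise, denoise step by step."""
    x = rng.normal(size=shape)                               # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        eps_hat = eps_model(x, t)                            # predicted noise
        # remove the predicted noise contribution (posterior mean)
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * rng.normal(size=shape)  # re-inject sampling noise
    return x

# Analytic oracle for data that is a point mass at 0 (illustration, not a trained net):
# x_t = sqrt(1 - abar_t) * eps  =>  eps = x_t / sqrt(1 - abar_t)
oracle = lambda x, t: x / np.sqrt(1.0 - alpha_bar[t])

rng = np.random.default_rng(0)
samples = ddpm_sample(oracle, (1000,), rng)
print(abs(samples).max())  # essentially zero: the sampler recovers the data
```

Swapping the oracle for a trained \(\boldsymbol{\epsilon}_\theta\) gives the full generative model; nothing else in the loop changes.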

Why diffusion models produce such high-quality outputs

Unlike GANs (which can be unstable to train) or VAEs (which can produce blurry outputs), diffusion models have a stable training objective and iteratively refine outputs over many steps. Each denoising step makes a small, learned correction — errors do not compound catastrophically.

Physics applications of diffusion models

Recommended Video

Andrej Karpathy's "Let's build GPT from scratch" is the best hands-on walkthrough of transformers available. Watch at least the first 45 minutes for the core architecture.

📝 Summary

| Architecture | Key Innovation | Best For |
| --- | --- | --- |
| MLP (L35–36) | Universal approximation via depth | Tabular data, general function fitting |
| CNN (L37) | Parameter sharing via convolution; spatial locality | Images, 2D/3D data, spectrograms |
| Transformer | Self-attention captures arbitrary long-range dependencies | Sequences: text, time series, molecules |
| LLM | Pre-trained transformer at massive scale | Language, code, reasoning, general-purpose AI |
| Diffusion Model | Learns to reverse a Gaussian noising process; trained with a simple MSE loss on predicted noise | Image/molecule/signal generation, inverse problems |
📋 HW 38 — Transformers & Diffusion (Qualitative + Optional Demo)

📚 References

  1. Vaswani, A., et al. (2017). Attention is all you need. NeurIPS 2017. arXiv:1706.03762.
  2. Karpathy, A. (2023). Let's build GPT: from scratch, in code, spelled out [Video]. YouTube.
  3. Brown, T. B., et al. (2020). Language models are few-shot learners (GPT-3). NeurIPS 2020. arXiv:2005.14165.
  4. Jumper, J., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583–589.
  5. Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models (DDPM). NeurIPS 2020. arXiv:2006.11239.
  6. Song, Y., et al. (2021). Score-based generative modeling through stochastic differential equations. ICLR 2021. arXiv:2011.13456.