Learning Objectives
- Articulate the attention mechanism: what it computes and why it's powerful for sequential data.
- Describe the transformer architecture at a high level (encoder, self-attention, feed-forward sublayers).
- Explain what large language models (LLMs) are and how they are trained (pre-training + fine-tuning).
- Describe diffusion models conceptually: the forward noising process, the reverse denoising process, and how they are trained.
- Identify key physics applications of transformers and diffusion models.
Background & Motivation
From discriminative to generative models
The networks from Lessons 35–36 are discriminative models — they map inputs to class labels or probabilities. The architectures in this lesson do something more ambitious: transformers process sequences with arbitrary long-range dependencies, and diffusion models learn to generate entirely new data.
Why this matters for physicists
Transformers power the LLMs that are transforming how scientists write, code, and analyze data. AlphaFold 2/3 uses transformers for protein structure prediction. Diffusion models are now standard tools for molecular generation, inverse problems in imaging, and probabilistic weather forecasting. Understanding these architectures at a mechanistic level is increasingly essential for physicists and engineers working at the frontier.
Transformers & the Attention Mechanism
The problem with sequences
For sequential data (text, time series, sensor streams), the key challenge is long-range dependency: the meaning of a word can depend on something said many sentences ago. Recurrent networks (RNNs, LSTMs) struggled with this because they process sequences step-by-step and "forget" distant context. Transformers solved this with attention.
Scaled Dot-Product Attention
Given a sequence of \(n\) input vectors, attention computes, for each position, a weighted average over all positions, where the weights encode relevance. Each input \(\mathbf{x}_i\) is linearly projected into three roles:
- Query \(\mathbf{q}_i = \mathbf{W}_Q \mathbf{x}_i\): "what am I looking for?"
- Key \(\mathbf{k}_j = \mathbf{W}_K \mathbf{x}_j\): "what do I contain?"
- Value \(\mathbf{v}_j = \mathbf{W}_V \mathbf{x}_j\): "what information do I provide?"
The attention output for position \(i\) is:

\[
\mathbf{z}_i = \sum_{j=1}^{n} \alpha_{ij}\,\mathbf{v}_j,
\qquad
\alpha_{ij} = \operatorname{softmax}_j\!\left(\frac{\mathbf{q}_i \cdot \mathbf{k}_j}{\sqrt{d_k}}\right).
\]

The dot product \(\mathbf{q}_i \cdot \mathbf{k}_j\) measures how relevant position \(j\) is to position \(i\); dividing by \(\sqrt{d_k}\) (the key dimension) keeps the scores from growing with dimension and saturating the softmax. The softmax normalizes the scores into weights summing to 1, and the output is a weighted average of the values: a context-aware representation of position \(i\).
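As a concrete illustration, here is a minimal NumPy sketch of scaled dot-product attention. The function name, dimensions, and random projection matrices are illustrative, not part of any particular library:

```python
import numpy as np

def scaled_dot_product_attention(X, W_Q, W_K, W_V):
    """Single-head attention over a sequence X of shape (n, d_model)."""
    Q = X @ W_Q                                   # queries (n, d_k)
    K = X @ W_K                                   # keys    (n, d_k)
    V = X @ W_V                                   # values  (n, d_k)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # relevance of j to i, (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                            # weighted average of values

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 8, 4
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Z = scaled_dot_product_attention(X, W_Q, W_K, W_V)
print(Z.shape)  # (5, 4): one context-aware vector per position
```

Each row of `Z` mixes information from every position in the sequence, with mixing weights determined by query-key similarity.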
Multi-Head Attention
Instead of one attention operation, transformers run \(h\) attention "heads" in parallel, each with its own \(\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V\) projections. The outputs are concatenated and projected back to the model dimension:

\[
\operatorname{MultiHead}(\mathbf{X}) = \operatorname{Concat}(\text{head}_1, \ldots, \text{head}_h)\,\mathbf{W}_O.
\]
Each head can attend to different relationships simultaneously — one head might track grammatical agreement, another might track coreference (who "they" refers to).
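The concatenate-and-project step can be sketched in a few lines of NumPy; the helper names and shapes here are illustrative assumptions:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, W_O):
    """heads: list of (W_Q, W_K, W_V) tuples; W_O maps (h*d_k) -> d_model."""
    outs = []
    for W_Q, W_K, W_V in heads:                   # each head attends independently
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        outs.append(A @ V)
    return np.concatenate(outs, axis=-1) @ W_O    # concat heads, project back

rng = np.random.default_rng(1)
n, d_model, h, d_k = 6, 8, 2, 4
X = rng.normal(size=(n, d_model))
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))
Y = multi_head_attention(X, heads, W_O)
print(Y.shape)  # (6, 8): back to the model dimension
```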
The Transformer Block
A single transformer block consists of:
- Multi-head self-attention: each token attends to all others.
- Add & Norm: residual connection (\(\mathbf{x} + \text{attention}(\mathbf{x})\)) then layer normalization.
- Feed-forward network: two-layer MLP applied independently to each position.
- Add & Norm again.
Stacking \(N_\ell = 12, 24, 96, \ldots\) such blocks gives the full transformer depth. Depth is what gives transformers their power.
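The four sublayers above compose as follows; this is a minimal NumPy sketch of one block with random weights, not a trained or production implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, d_ff = 4, 8, 16
X = rng.normal(size=(n, d))

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # normalize each position's vector to zero mean, unit variance
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

# single-head self-attention with random projections (illustration only)
W_Q, W_K, W_V = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
def self_attention(X):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(d)) @ V

# position-wise two-layer feed-forward network
W1, W2 = rng.normal(size=(d, d_ff)) * 0.1, rng.normal(size=(d_ff, d)) * 0.1

def transformer_block(X):
    X = layer_norm(X + self_attention(X))         # sublayer 1: attention, Add & Norm
    ffn = np.maximum(0.0, X @ W1) @ W2            # sublayer 2: ReLU MLP per position
    return layer_norm(X + ffn)                    # Add & Norm again

Y = transformer_block(X)
print(Y.shape)  # (4, 8): same shape in, same shape out, so blocks stack
```

Because the block maps \((n, d)\) to \((n, d)\), stacking is just repeated application of `transformer_block`.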
Large Language Models (LLMs)
What is an LLM?
A Large Language Model is a transformer trained on massive text corpora (hundreds of billions of tokens) to predict the next token in a sequence. The training task is deceptively simple — predict the next word — but requires the model to develop a rich internal representation of language, facts, and reasoning.
\[
P(x_{t+1} \mid x_1, \ldots, x_t) = \operatorname{softmax}(\mathbf{W}\,\mathbf{h}_t),
\]

where \(\mathbf{h}_t\) is the final transformer layer's hidden state at position \(t\) and \(\mathbf{W}\) projects to logits over the vocabulary. At inference time, tokens are sampled autoregressively: each predicted token is fed back as input to generate the next.
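The autoregressive loop can be sketched with a toy stand-in for the transformer stack; the embedding-averaging "model" below is purely illustrative, chosen only to make the decoding loop runnable:

```python
import numpy as np

rng = np.random.default_rng(3)
vocab_size, d = 10, 8
emb = rng.normal(size=(vocab_size, d))        # toy token embeddings
W_out = rng.normal(size=(d, vocab_size))      # projection to vocabulary logits

def hidden_state(tokens):
    # stand-in for the transformer stack: average of token embeddings
    return emb[tokens].mean(axis=0)

def generate(prompt, steps):
    """Greedy autoregressive decoding: feed each prediction back in."""
    tokens = list(prompt)
    for _ in range(steps):
        logits = hidden_state(tokens) @ W_out # scores over the vocabulary
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                  # softmax(W h_t)
        tokens.append(int(np.argmax(probs)))  # greedy: pick the most likely token
    return tokens

out = generate([1, 4, 2], steps=5)
print(len(out))  # 8: the 3-token prompt plus 5 generated tokens
```

Real LLMs replace `hidden_state` with the full transformer stack and typically sample from `probs` (with temperature, top-k, etc.) rather than always taking the argmax.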
Training: Pre-training + Fine-tuning
| Stage | Task | Data | Outcome |
|---|---|---|---|
| Pre-training | Next token prediction (self-supervised) | Entire internet, books, code (\(\sim\)10T tokens) | General language understanding |
| Supervised Fine-Tuning (SFT) | Imitate expert demonstrations | Human-written instruction-response pairs | Follows instructions |
| RLHF | Maximize human preference reward | Human preference rankings | Aligned, helpful, safe |
Scale and Emergent Abilities
LLM capabilities improve predictably with scale (parameters × data × compute). Strikingly, certain abilities — multi-step reasoning, code generation, in-context learning — appear to emerge discontinuously as models scale, even though the training task never changes. This is not fully understood and is an active research area.
Physics Applications of LLMs/Transformers
- AlphaFold 2/3: transformer + equivariant layers for protein structure prediction.
- Plasma physics: transformers for time-series modeling of tokamak disruptions.
- Climate science: Pangu-Weather, GraphCast — transformer-based forecast models.
- Space domain awareness: sequence modeling of orbital maneuver patterns.
Diffusion Models
The core idea: learn to denoise
Diffusion models are a class of generative models that produce data by learning to reverse a noising process. The key insight is elegant: if you can learn to remove a small amount of noise from a corrupted image, you can iteratively apply that denoiser to pure random noise until a coherent image emerges.
Forward process (noising)
Given a clean data sample \(\mathbf{x}_0\), the forward process gradually adds Gaussian noise over \(T\) steps:

\[
q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\, \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\, \beta_t \mathbf{I}\right),
\]

where \(\beta_t\) follows a small noise-variance schedule (e.g., increasing linearly from \(10^{-4}\) to \(0.02\)). After \(T \approx 1000\) steps, \(\mathbf{x}_T\) is approximately pure Gaussian noise. Crucially, any noisy \(\mathbf{x}_t\) can be sampled directly from \(\mathbf{x}_0\) in closed form:

\[
q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_t;\, \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\, (1-\bar{\alpha}_t)\mathbf{I}\right),
\]
where \(\bar{\alpha}_t = \prod_{s=1}^t (1-\beta_s)\).
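The closed-form sampler is a few lines of NumPy. This sketch assumes the linear schedule mentioned above; the schedule endpoints and \(T\) are the standard DDPM choices:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)            # linear noise-variance schedule
alpha_bar = np.cumprod(1.0 - betas)           # \bar{alpha}_t = prod_s (1 - beta_s)

def q_sample(x0, t, eps):
    """Sample x_t ~ q(x_t | x_0) in closed form, given noise eps ~ N(0, I)."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(4)
x0 = rng.normal(size=(16,))
eps = rng.normal(size=(16,))
x_mid = q_sample(x0, 500, eps)                # partially noised
x_end = q_sample(x0, T - 1, eps)              # almost pure noise

# by the last step almost no signal remains:
print(alpha_bar[-1] < 1e-3)  # True
```

Being able to jump straight from \(\mathbf{x}_0\) to any \(\mathbf{x}_t\) is what makes training efficient: each training step picks a random \(t\) without simulating the whole chain.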
Reverse process (denoising)
The model learns the reverse: given a noisy sample \(\mathbf{x}_t\), predict the noise \(\boldsymbol{\epsilon}\) that was added. A neural network \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\) is trained with a simple mean-squared loss:

\[
\mathcal{L}(\theta) = \mathbb{E}_{\mathbf{x}_0,\; \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0},\mathbf{I}),\; t}\left[\left\lVert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\!\left(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\; t\right)\right\rVert^2\right].
\]
At inference, start from \(\mathbf{x}_T \sim \mathcal{N}(\mathbf{0},\mathbf{I})\) and iteratively apply the learned denoiser to get \(\mathbf{x}_{T-1}, \ldots, \mathbf{x}_0\) — a brand-new sample.
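A single DDPM training step then looks like the sketch below. The linear "denoiser" is a deliberately trivial placeholder for a real network (in practice a U-Net or transformer), used only to make the loss computation concrete:

```python
import numpy as np

rng = np.random.default_rng(5)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

# hypothetical denoiser: a single linear map standing in for a real network
W = rng.normal(size=(16, 16)) * 0.01
def eps_theta(x_t, t):
    return x_t @ W                             # placeholder noise prediction

def training_loss(x0):
    """One DDPM training step: pick t, noise x0, predict the noise, MSE."""
    t = rng.integers(0, T)                     # random timestep
    eps = rng.normal(size=x0.shape)            # the noise we will try to recover
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
    return np.mean((eps - eps_theta(x_t, t)) ** 2)

loss = training_loss(rng.normal(size=(16,)))
print(loss >= 0)  # True: an MSE is always non-negative
```

Training minimizes this loss over many random \((\mathbf{x}_0, t, \boldsymbol{\epsilon})\) draws; sampling then runs the learned denoiser backwards from \(\mathbf{x}_T\) to \(\mathbf{x}_0\).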
Why diffusion models produce such high-quality outputs
Unlike GANs (which can be unstable to train) or VAEs (which can produce blurry outputs), diffusion models have a stable training objective and iteratively refine outputs over many steps. Each denoising step makes a small, learned correction — errors do not compound catastrophically.
Physics applications of diffusion models
- Molecular generation: Generating novel drug-like molecules or crystal structures (e.g., DiffSBDD, DiffCSP).
- Inverse problems: Reconstructing clean signals from noisy measurements — directly analogous to physics image reconstruction (MRI, CT, SAR).
- Weather & climate: Probabilistic forecast generation (SEEDS, GenCast).
- Simulation acceleration: Generating synthetic training data or emulating expensive physics simulations.
Recommended Video
Andrej Karpathy's "Let's build GPT from scratch" is the best hands-on walkthrough of transformers available. Watch at least the first 45 minutes for the core architecture.
Summary
| Architecture | Key Innovation | Best For |
|---|---|---|
| MLP (L35–36) | Universal approximation via depth | Tabular data, general function fitting |
| CNN (L37) | Parameter sharing via convolution; spatial locality | Images, 2D/3D data, spectrograms |
| Transformer | Self-attention captures arbitrary long-range dependencies | Sequences: text, time series, molecules |
| LLM | Pre-trained transformer at massive scale | Language, code, reasoning, general-purpose AI |
| Diffusion Model | Learns to reverse a Gaussian noising process; trained with a simple MSE loss on predicted noise | Image/molecule/signal generation, inverse problems |
References
- Vaswani, A., et al. (2017). Attention is all you need. NeurIPS 2017. arXiv:1706.03762.
- Karpathy, A. (2023). Let's build GPT: from scratch, in code, spelled out [Video]. YouTube.
- Brown, T. B., et al. (2020). Language models are few-shot learners (GPT-3). NeurIPS 2020. arXiv:2005.14165.
- Jumper, J., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583–589.
- Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models (DDPM). NeurIPS 2020. arXiv:2006.11239.
- Song, Y., et al. (2021). Score-based generative modeling through stochastic differential equations. ICLR 2021. arXiv:2011.13456.