**Slater Stich, Bain Capital Ventures**
We recorded a “History of Diffusion” interview series with Jascha Sohl-Dickstein, Yang Song, and Sander Dieleman. (The Jascha interview is out now; Yang Song and Sander are coming soon.)
Those interviews are full of surprises about the hunches, near-misses, and other ups-and-downs that led to landmark diffusion research. The stories are even more interesting once you have some background info about diffusion. When Yang Song says, for example, that nobody expected the reverse-time SDE to be so simple… it helps to know what that is!
That’s the purpose of this post: to give some background context that will let you get the most out of the interviews with Jascha, Yang Song, and Sander. The post has two parts: Part 1 is a self-contained explanation of diffusion models. It covers the SDE interpretation, and works out a 2D example “by hand” to give some geometric intuition. Part 2 is a brief history, focusing on Jascha’s original paper, DDPM, and SDEs.
To read this, it helps to remember a little vector calculus. Here’s an (open-book) quiz. Let $f(x,y) = (x^2 + y^2, x+y)$. What is $\partial_y f$? What is the work integral of $f$ along the line segment from $(0,0)$ to $(1,1)$? Using Stokes’/Green’s Theorem, what is $\int_{\partial D} f$, where $D$ is the unit disk? If you can answer these questions, maybe with a little help from your favorite reasoning model, you’re good to go.
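(For reference, here’s one way to work those out, assuming the intended reading of the work and boundary integrals is as line integrals of $f \cdot dr$, with the segment parametrized as $(x,y) = (t,t)$ and $\partial D$ traversed counterclockwise:)

$$\partial_y f = (2y,\ 1), \qquad \int_0^1 f(t,t)\cdot(1,1)\,dt = \int_0^1 \left(2t^2 + 2t\right)dt = \tfrac{5}{3},$$

$$\oint_{\partial D} f\cdot dr \;=\; \iint_D \Big(\partial_x(x+y) - \partial_y\!\left(x^2+y^2\right)\Big)\,dA \;=\; \iint_D (1-2y)\,dA \;=\; \pi.$$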
In Part 1, we’ll explain diffusion three times: first at a high level in just a few sentences, then by working out a 2D example by hand, and finally in the SDE formulation.
The advantage of doing an example in 2D is that it builds some geometric intuition for what’s going on. Throughout all of this, we’ll talk mainly about image generation models, since that’s most intuitive and that’s what many of the original papers were about.
We’re not going to cover diffusion guidance at all. This is a major omission, since classifier-free guidance is what most people have in mind when they think about Midjourney, Ideogram, or other image-generation tools (typing a prompt like “ultra realistic iPhone photo of a panda on a water slide” into Midjourney, Ideogram, DALL·E, etc. is classifier-free text guidance in action). The reason I’m not covering guidance is that you don’t need to understand it to appreciate the papers we cover in the interviews. However, guidance is an interesting and important subject, and I highly recommend Sander’s excellent post as a starting point; it should be very accessible after reading this post.
At a high level, image gen diffusion models work like this (a minimal code sketch follows the list):
1. You get a big dataset of images.
2. You corrupt those images. Specifically, you iteratively add noise to them over time. Let’s call $x_0$ your original image and $x_t$ your image at time $t$. When $t$ is small, you’ve only added a little noise, and $x_t$ looks like a fuzzy, static-y version of $x_0$. When $t$ is large, $x_t$ looks like “pure noise”.
3. During training, you teach a neural network to remove the noise: it takes the input $(x_t, t)$ and produces the output $x_0$, and it is rewarded for successfully reconstructing the original $x_0$ from the noisy $x_t$. (This step is actually a little more complicated, and I’m fudging the explanation, as you’ll see when you keep reading; but this paragraph is still approximately correct.)
4. During inference, you input pure noise into your trained model. The model will treat this as $(x_T, T)$ for some very large time $T$, and it’ll try to output the corresponding image $x_0$. If you trained your model well, $x_0$ will look like the original images from your dataset. Another way to think about this is that the model will “hallucinate, on purpose” a clean image $x_0$ when given pure noise $x_T$.
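To make this nutshell concrete, here’s a minimal sketch of the recipe above in PyTorch. Everything in it is an illustrative assumption rather than how production models are built: a toy 2D “dataset” of points on a circle stands in for images, a small MLP stands in for the usual U-Net, the noising rule $x_t = x_0 + \sqrt{t}\,\varepsilon$ is just one simple choice of corruption, and sampling is done in the fudged “one shot” way described in step 4 (real models predict the noise and denoise iteratively, as we’ll see below).

```python
import torch
import torch.nn as nn

# Toy denoiser: takes a noisy point x_t and a time t, predicts the clean x_0.
# (Real image models use a U-Net; a small MLP keeps this sketch short.)
class Denoiser(nn.Module):
    def __init__(self, dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x_t, t):
        # Feed the time in as an extra input feature.
        return self.net(torch.cat([x_t, t[:, None]], dim=-1))

T = 1.0  # total corruption time
model = Denoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in "dataset": points on the unit circle, playing the role of clean images x_0.
def sample_data(n):
    theta = 2 * torch.pi * torch.rand(n)
    return torch.stack([theta.cos(), theta.sin()], dim=-1)

# Training: corrupt x_0 into x_t, reward the network for recovering x_0 from (x_t, t).
for step in range(5000):
    x0 = sample_data(256)
    t = T * torch.rand(x0.shape[0])           # a random corruption time per example
    noise = torch.randn_like(x0)
    x_t = x0 + t[:, None].sqrt() * noise      # "add noise over time": variance grows with t
    loss = ((model(x_t, t) - x0) ** 2).mean() # reconstruction error against the clean x_0
    opt.zero_grad(); loss.backward(); opt.step()

# Inference (the fudged one-shot version): feed pure noise labeled as time T, read off x_0.
with torch.no_grad():
    x_T = T ** 0.5 * torch.randn(1000, 2)
    samples = model(x_T, torch.full((1000,), T))
```

Even in this toy version you can see the division of labor: the forward (noising) process needs no learning at all, and all of the model’s capacity goes into the reverse (denoising) direction.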
That’s it in a nutshell, and it’s useful to keep the high-level picture in mind. However, this explanation isn’t quite correct (some details have been over-simplified), and it leaves a lot unspecified. Let’s go deeper.