Diffusion: Learning a Path from Noise to Data
Diffusion models convert generation into a sequence of denoising problems. The model does not memorize a direct jump from random noise to a finished image, video, action, or latent state. It learns local corrections along a path where structure gradually appears.
The useful mental model starts with an object \(x_0\) sampled from the data distribution. For an image model, \(x_0\) may be pixels. For Stable Diffusion or Cosmos-style video models, \(x_0\) may be a compressed latent tensor. For a robot policy, it can even include action or value slots, as long as those variables have been represented as continuous tensors.
Diffusion is the generative mechanism applied to that representation. The representation says what the model operates on; diffusion says how the model learns to generate it.
The Forward Process Destroys Structure
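The first half of a diffusion model is deliberately simple. Take a clean sample \(x_0\), choose a noise level \(t\), and corrupt the sample with Gaussian noise. The usual DDPM notation writes this as:
\[ x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \]
Here \(\bar{\alpha}_t\) is the fraction of signal variance that survives at noise level \(t\). It starts near 1 and decays toward 0 as \(t\) grows.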
Early noise levels preserve most of the sample. Late noise levels wash out the sample until it is close to a Gaussian. The model is given examples of many partially corrupted samples. Each example asks the same local question: from this noisy point, what correction points back toward clean data?
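The corruption step can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names are hypothetical, `alpha_bar[t]` stands for the fraction of signal variance kept at step `t`, and the linear schedule from 1e-4 to 0.02 over 1000 steps is assumed because it is the setting commonly associated with DDPM.

```python
import numpy as np

def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal retention alpha_bar[t] for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t from x_0: scaled clean signal plus scaled Gaussian noise."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar()
x0 = rng.standard_normal((4, 8, 8))  # stand-in for a batch of clean samples

# Compare an early, a middle, and a late noise level (0-indexed steps).
for t in (0, 499, 999):
    x_t, _ = forward_noise(x0, t, alpha_bar, rng)
    print(f"t={t:4d}  signal weight={np.sqrt(alpha_bar[t]):.3f}  "
          f"noise weight={np.sqrt(1.0 - alpha_bar[t]):.3f}")
```

The printed weights show the behavior described above: the signal weight is close to 1 at the first step and close to 0 at the last.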
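This design turns a hard density modeling problem into many supervised denoising tasks. The training data supplies \(x_0\). The noise process supplies \(x_t\). The target can be the clean sample, the noise that was added, a velocity parameterization, or the score. The common form used in many explanations predicts the noise:
\[ \mathcal{L} = \mathbb{E}_{x_0,\, c,\, t,\, \epsilon} \left[ \lVert \epsilon - \epsilon_\theta(x_t, t, c) \rVert^2 \right] \]
Here \(\epsilon\) is the Gaussian noise used to produce \(x_t\), and \(\epsilon_\theta\) is the network that tries to recover it.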
The condition \(c\) may be a class label, text embedding, image prompt, camera state, language instruction, or robot task description. Conditioning gives the denoiser a reason to choose one part of the data distribution over another.
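As a rough sketch of one training-loss evaluation under the noise-prediction target, assuming a variance-preserving forward process and a hypothetical `model(x_t, t, c)` callable (here replaced by a placeholder that always predicts zero):

```python
import numpy as np

def noise_prediction_loss(model, x0, c, alpha_bar, rng):
    """One Monte Carlo estimate of the noise-prediction MSE for a batch."""
    batch = x0.shape[0]
    t = rng.integers(0, len(alpha_bar), size=batch)         # one noise level per sample
    eps = rng.standard_normal(x0.shape)                     # regression target
    a = alpha_bar[t].reshape(batch, *([1] * (x0.ndim - 1))) # broadcast over feature dims
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps          # corrupted input
    eps_hat = model(x_t, t, c)
    return np.mean((eps_hat - eps) ** 2)

def zero_model(x_t, t, c):
    """Placeholder network (an assumption for illustration): always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed linear schedule
x0 = rng.standard_normal((64, 16))
c = np.zeros(64, dtype=int)                                  # dummy class labels
loss = noise_prediction_loss(zero_model, x0, c, alpha_bar, rng)
```

A model that ignores its input scores a loss near 1, the variance of the target noise. A trained denoiser has to beat that baseline at every noise level.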
The Model Learns a Local Direction
There are several closely related ways to describe the learned object. A DDPM implementation often says the network predicts noise. A score-based paper says the network estimates \(\nabla_x \log p_t(x)\), the direction in noisy space along which probability density increases fastest. An engineering implementation may call the same network a denoiser. Under Gaussian corruption, these outputs can be converted into one another by a known rescaling at each noise level.
| Name | What it predicts | Why it matters |
|---|---|---|
| Denoiser | A cleaner version of a corrupted sample | Directly matches the intuition of iterative refinement. |
| Noise predictor | The Gaussian noise component added to the sample | Gives a simple MSE training target in many DDPM-style systems. |
| Score model | The gradient of the log density at a noise level | Connects diffusion to probability flows, SDEs, and Langevin sampling. |
These names emphasize different mathematical views. For intuition, the important fact is that the model does not need to solve the whole generation problem in one pass. It only needs to make a useful local prediction at a chosen noise level.
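The three views in the table can be related with a couple of lines of algebra. The sketch below assumes the variance-preserving form \(x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon\); the helper names are hypothetical, and the score shown is the one for the noising distribution conditioned on a single clean sample.

```python
import numpy as np

def eps_to_x0(x_t, eps_hat, alpha_bar_t):
    """Denoiser view: recover the implied clean sample from a noise prediction."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)

def eps_to_score(eps_hat, alpha_bar_t):
    """Score view: the noise prediction rescaled and sign-flipped."""
    return -eps_hat / np.sqrt(1.0 - alpha_bar_t)

rng = np.random.default_rng(0)
alpha_bar_t = 0.5
x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# With a perfect noise prediction, both conversions are exact.
x0_rec = eps_to_x0(x_t, eps, alpha_bar_t)
score = eps_to_score(eps, alpha_bar_t)
```

The choice of target changes how errors are weighted across noise levels, which is one reason parameterization is treated as a design decision rather than a cosmetic one.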
Sampling Repeats the Correction
Training samples a noisy point and asks for a correction. Sampling starts with noise and repeats correction many times. Each step calls the same network with a different noise level:
x_T ~ Normal(0, I)
for t = T ... 1:
    correction = denoiser(x_t, t, condition)
    x_{t-1} = sampler_step(x_t, correction, t)
return x_0
The sampler is an algorithm wrapped around the denoising network. It decides step sizes, noise schedules, stochasticity, and numerical integration details. This is why diffusion papers can share the same broad training idea but differ in samplers, schedules, parameterizations, and speed-quality tradeoffs. Karras et al.'s EDM paper is valuable because it separates many of these design choices that are often tangled together.
Slow sampling comes from this repeated network evaluation. The same property also gives diffusion its strength: a high-dimensional sample is assembled through many smaller corrections, with conditioning injected at every correction.
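To make the loop concrete, here is a minimal runnable sketch of ancestral DDPM sampling. Several things are assumptions for illustration: the steps are 0-indexed, the step variance is set to `betas[t]`, and `toy_denoiser` replaces a trained network with the exact noise prediction for a dataset in which every sample equals `condition`.

```python
import numpy as np

def ddpm_sample(denoiser, shape, betas, condition, rng):
    """Start from Gaussian noise and apply one correction per noise level."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)                        # x_T ~ Normal(0, I)
    for t in range(len(betas) - 1, -1, -1):               # noisiest step first
        eps_hat = denoiser(x, t, condition)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0  # final step is deterministic
        x = mean + np.sqrt(betas[t]) * noise
    return x

betas = np.linspace(1e-4, 0.02, 200)                      # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)
target = np.array([2.0, -1.0, 0.5])

def toy_denoiser(x_t, t, condition):
    """Exact noise prediction when all data equals `condition` (stand-in for a network)."""
    return (x_t - np.sqrt(alpha_bar[t]) * condition) / np.sqrt(1.0 - alpha_bar[t])

sample = ddpm_sample(toy_denoiser, target.shape, betas, target, np.random.default_rng(0))
print(sample)
```

Because the toy denoiser is exact, the loop lands on `target`. The structure is the point: the network is called once per step, and everything else in the loop body belongs to the sampler.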
Latent Diffusion Moves the Same Mechanism
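Pixel diffusion applies the noising and denoising process directly to pixel arrays. Latent diffusion first trains an autoencoder with encoder \(\mathcal{E}\) and decoder \(\mathcal{D}\), then applies diffusion to the compressed representation:
\[ z_0 = \mathcal{E}(x_0), \qquad z_0 \xrightarrow{\text{noise}} z_t \xrightarrow{\text{denoise}} \hat{z}_0, \qquad \hat{x}_0 = \mathcal{D}(\hat{z}_0) \]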
The generative mechanism stays the same. The space changes. The model denoises \(z_t\) instead of \(x_t\), and the decoder maps the final latent back to pixels. This is the main reason Stable Diffusion can be much cheaper than pixel-space diffusion for high-resolution images. The model works in a smaller, perceptually meaningful space.
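A toy example shows where the savings come from. The encoder and decoder below are deliberately crude stand-ins (average pooling and nearest-neighbor upsampling), not a learned autoencoder, and the downsampling factor of 8 is an assumption chosen to resemble common latent diffusion setups.

```python
import numpy as np

def average_pool_encoder(image, factor=8):
    """Toy stand-in for a learned encoder: average each factor x factor patch."""
    h, w, c = image.shape
    patches = image.reshape(h // factor, factor, w // factor, factor, c)
    return patches.mean(axis=(1, 3))

def nearest_decoder(latent, factor=8):
    """Toy stand-in for a learned decoder: repeat each latent cell back to pixel size."""
    return latent.repeat(factor, axis=0).repeat(factor, axis=1)

image = np.random.default_rng(0).random((512, 512, 3))
latent = average_pool_encoder(image)   # diffusion would run on this smaller tensor
recon = nearest_decoder(latent)        # decoding happens once, after sampling
ratio = image.size / latent.size
print(latent.shape, recon.shape, ratio)
```

Every denoising step now touches a tensor with far fewer elements, while the decoder runs only once at the end. A learned autoencoder keeps much more perceptual detail than this pooling toy, which is what makes the smaller space usable.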
Guidance Makes Generation Conditional
Text-to-image systems need the sample to obey a condition. Early systems used classifier guidance: a separate classifier supplies a gradient that pushes the sample toward a label. Classifier-free guidance trains a model that can run both with and without a condition, then amplifies the difference between conditional and unconditional predictions at sampling time. One common way to write the guided noise prediction is:
\[ \hat{\epsilon} = \epsilon_\theta(x_t, t, \varnothing) + w \left( \epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing) \right) \]
where \(\epsilon_\theta\) is the noise-predicting network and \(\varnothing\) denotes the dropped condition.
The guidance scale \(w\) changes how strongly the condition shapes the trajectory. Higher guidance often improves prompt adherence but can damage diversity or naturalness. This is the same tradeoff seen across generative systems: stronger control can narrow the distribution the sampler explores.
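In code, the classifier-free guidance combination is a one-line extrapolation. The sketch assumes the convention in which `w = 1` recovers the plain conditional prediction; some papers shift the scale by one, so the numeric value of `w` is not portable across codebases.

```python
import numpy as np

def classifier_free_guidance(eps_cond, eps_uncond, w):
    """Move from the unconditional prediction toward (and past) the conditional one."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0])   # made-up predictions for illustration
eps_cond = np.array([1.0, 1.0])

unguided = classifier_free_guidance(eps_cond, eps_uncond, w=0.0)   # ignores the condition
plain = classifier_free_guidance(eps_cond, eps_uncond, w=1.0)      # conditional prediction
amplified = classifier_free_guidance(eps_cond, eps_uncond, w=3.0)  # extrapolates past it
```

Only the component where the two predictions disagree is amplified. That is the mechanism behind both the stronger prompt adherence and the reduced diversity.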
Video Adds Time to the Sample
Video diffusion expands the denoised object from an image to a spatiotemporal tensor. The sample now has frame, height, width, and channel dimensions. The model must preserve visual detail and maintain temporal coherence. A single bad correction can create flicker, identity drift, or motion that looks locally plausible and globally wrong.
This is why video diffusion papers spend so much effort on temporal attention, 3D VAEs, frame conditioning, windowed attention, cascades, and training curricula. The denoising idea is still recognizable. The sample being denoised has become a short world trajectory.
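One way to see why temporal attention is a separate design question is to look at how the same video tensor is regrouped for spatial versus temporal mixing. This sketch assumes a factorized space-time layout, which is one common pattern rather than the design of any particular model; the tensor sizes are arbitrary.

```python
import numpy as np

batch, frames, height, width, channels = 2, 8, 16, 16, 4
video = np.zeros((batch, frames, height, width, channels))

# Spatial mixing: each frame is its own sequence of height * width tokens.
spatial_tokens = video.reshape(batch * frames, height * width, channels)

# Temporal mixing: each pixel location is its own sequence of `frames` tokens.
temporal_tokens = (
    video.transpose(0, 2, 3, 1, 4)
    .reshape(batch * height * width, frames, channels)
)

print(spatial_tokens.shape, temporal_tokens.shape)
```

Spatial sequences never see other frames, so coherence across time has to come from the temporal pass or from attention over the full spatiotemporal sequence, which costs more.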
The Cosmos Policy Connection
Cosmos Policy becomes easier to place after the representation/process split. The paper starts from a pretrained video latent diffusion model. Its native object is a sequence of video latent frames. The control trick is to insert robot variables into that same sequence:
current observation latent
-> action chunk latent slot
-> future observation latent
-> value latent slot
The latent sequence is the representation interface. Diffusion denoising is the mechanism that fills missing slots. Conditioning masks decide which slots are visible and which slots are targets. With one mask the model behaves like a policy; with another it behaves like a world model; with another it predicts value.
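The masking idea can be sketched with a toy slot sequence. Everything specific here is an assumption for illustration and is not taken from the Cosmos Policy implementation: the slot order follows the list above, but which slots are visible in each mode, the flat latent shape, and the single `noise_weight` are hypothetical choices.

```python
import numpy as np

SLOTS = ["current_obs", "action", "future_obs", "value"]

# Hypothetical presets: True = visible (kept clean), False = target (noised, then denoised).
MASKS = {
    "policy":      np.array([True, False, False, False]),
    "world_model": np.array([True, True, False, False]),
    "value":       np.array([True, True, True, False]),
}

def noise_targets(latents, visible, noise_weight, rng):
    """Corrupt only the target slots; `latents` has shape (num_slots, dim)."""
    eps = rng.standard_normal(latents.shape)
    noisy = np.sqrt(1.0 - noise_weight**2) * latents + noise_weight * eps
    return np.where(visible[:, None], latents, noisy)

rng = np.random.default_rng(0)
latents = np.ones((len(SLOTS), 6))
out = noise_targets(latents, MASKS["world_model"], noise_weight=0.5, rng=rng)
```

The denoiser would then see clean values in the visible slots and be asked to fill in the rest. Changing the mask changes the question, not the network.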
Cosmos Policy does not treat diffusion as a picture-only algorithm. It relies on the fact that diffusion can denoise continuous tensors. Once action and value have been encoded as latent-shaped tensors, the same denoising machinery can generate them beside future video latents.
What Diffusion Gives Up
Diffusion pays for quality with repeated computation. Every generated sample requires many network calls unless the model is distilled or the sampler is aggressively shortened. Latent diffusion reduces the size of the denoised variable, but it still keeps an iterative sampling loop.
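The cost is easy to count. The sketch below uses hypothetical helper names: it picks an evenly spaced subset of training noise levels, as a shortened sampler might, and counts network evaluations, assuming classifier-free guidance needs one conditional and one unconditional prediction per step.

```python
import numpy as np

def strided_timesteps(train_steps, sample_steps):
    """Evenly spaced subset of training noise levels, noisiest first."""
    return np.linspace(train_steps - 1, 0, sample_steps).round().astype(int)

def network_calls(sample_steps, use_guidance):
    """Guidance is assumed to need two predictions per step."""
    return sample_steps * (2 if use_guidance else 1)

steps = strided_timesteps(train_steps=1000, sample_steps=50)
full_cost = network_calls(1000, use_guidance=True)
short_cost = network_calls(len(steps), use_guidance=True)
print(steps[:5], full_cost, short_cost)
```

Skipping noise levels is only safe when the update rule is designed for larger jumps, so a shorter schedule usually comes paired with a different sampler or a distilled model.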
Diffusion also makes generation depend on the geometry of noise levels. The model can be strong at some noise ranges and weak at others. Schedules, preconditioning, parameterization, and solver choices become part of model quality. The clean conceptual story hides a large engineering surface.
The final limitation is control. Guidance, conditioning, masks, and value functions can steer the sample, but they do not guarantee that the generated object is causally correct. In video and robotics, visual plausibility and action-relevant correctness can separate. Cosmos Policy is interesting precisely because it tests whether a video diffusion prior can be turned into an action interface rather than only a visual generator.
Sources
- Denoising Diffusion Probabilistic Models, Ho, Jain, and Abbeel, NeurIPS 2020.
- What are Diffusion Models?, Lilian Weng.
- The Annotated Diffusion Model, Hugging Face.
- Generative Modeling by Estimating Gradients of the Data Distribution, Yang Song.
- Diffusion models are autoencoders, Sander Dieleman.
- Perspectives on diffusion, Sander Dieleman.
- High-Resolution Image Synthesis with Latent Diffusion Models, Rombach et al., CVPR 2022.
- Diffusion Models for Video Generation, Lilian Weng.
- Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning, Kim et al., arXiv 2026.