Diffusion: Learning a Path from Noise to Data

June 2026

Diffusion models convert generation into a sequence of denoising problems. The model does not memorize a direct jump from random noise to a finished image, video, action, or latent state. It learns local corrections along a path where structure gradually appears.

The useful mental model starts with an object \(x_0\) sampled from the data distribution. For an image model, \(x_0\) may be pixels. For Stable Diffusion or Cosmos-style video models, \(x_0\) may be a compressed latent tensor. For a robot policy, it can even include action or value slots, as long as those variables have been represented as continuous tensors.

Diffusion is the generative mechanism applied to that representation. The representation says what the model operates on; diffusion says how the model learns to generate it.

Core Split

Representation answers: what is \(x\)? Pixels, VAE latents, video latents, audio latents, action vectors, or a mixed latent sequence. Diffusion answers: how do we move from noise back to samples from the data distribution?

The forward path is fixed and cheap to sample from. The reverse path is learned. Generation runs the reverse path from noise back to data-like structure.

The Forward Process Destroys Structure

The first half of a diffusion model is deliberately simple. Take a clean sample \(x_0\), choose a noise level \(t\), and corrupt the sample with Gaussian noise. The usual DDPM notation writes this as:

\[ q(x_t \mid x_0) = \mathcal{N} \left( \sqrt{\bar{\alpha}_t}x_0,\, (1-\bar{\alpha}_t)I \right) \]

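Here \(\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)\) is the cumulative product of a variance schedule \(\beta_s\), so it falls from near 1 toward 0 as \(t\) grows. Early noise levels preserve most of the sample. Late noise levels wash out the sample until it is close to a Gaussian. The model is trained on many partially corrupted samples. Each example asks the same local question: from this noisy point, what correction points back toward clean data?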
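The forward process can be sketched in a few lines of NumPy. This is a minimal sketch, not a reference implementation: the linear schedule from 1e-4 to 0.02 over 1000 steps is an assumed, commonly used choice, the names `make_alpha_bar` and `q_sample` are illustrative, and `t` is a 0-based index into the schedule.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal coefficient alpha_bar_t for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) and return it with the noise that was used."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = np.ones((4, 4))  # stand-in for a clean sample

x_early, _ = q_sample(x0, 10, alpha_bar, rng)   # mostly signal
x_late, _ = q_sample(x0, 999, alpha_bar, rng)   # almost pure noise
print(np.sqrt(alpha_bar[10]), np.sqrt(alpha_bar[999]))  # signal scale at each level
```

Because \(x_t\) is available in closed form, a training loop can jump straight to any noise level without simulating the intermediate steps.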

This design turns a hard density modeling problem into many supervised denoising tasks. The training data supplies \(x_0\). The noise process supplies \(x_t\). The target can be the clean sample, the noise that was added, a velocity parameterization, or the score. The common form used in many explanations predicts the noise:

\[ \mathcal{L}(\theta)= \mathbb{E}_{x_0,\epsilon,t} \left[ \left\| \epsilon - \epsilon_\theta(x_t,t,c) \right\|^2 \right], \qquad x_t = \sqrt{\bar{\alpha}_t}x_0+ \sqrt{1-\bar{\alpha}_t}\epsilon \]

The condition \(c\) may be a class label, text embedding, image prompt, camera state, language instruction, or robot task description. Conditioning gives the denoiser a reason to choose one part of the data distribution over another.
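A minimal sketch of one Monte Carlo estimate of this loss follows. It assumes a linear beta schedule and a batch of flat feature vectors; `noise_prediction_loss` and `zero_model` are hypothetical names, and `zero_model` is only a placeholder for a trained network \(\epsilon_\theta\).

```python
import numpy as np

def noise_prediction_loss(eps_model, x0, cond, alpha_bar, rng):
    """Monte Carlo estimate of the noise-prediction loss on one batch."""
    batch = x0.shape[0]
    # One random noise level per example (0-based index into alpha_bar).
    t = rng.integers(0, len(alpha_bar), size=batch)
    eps = rng.standard_normal(x0.shape)
    # Reshape so the per-example coefficient broadcasts over feature axes.
    a = alpha_bar[t].reshape((batch,) + (1,) * (x0.ndim - 1))
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    pred = eps_model(x_t, t, cond)
    # Mean over elements: the squared norm in the formula up to a constant factor.
    return float(np.mean((eps - pred) ** 2))

def zero_model(x_t, t, cond):
    """Placeholder predictor that always outputs zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
x0 = rng.standard_normal((256, 8))
loss = noise_prediction_loss(zero_model, x0, None, alpha_bar, rng)
print(loss)  # near 1, the variance of the noise, because the placeholder predicts nothing
```

A real training step would replace `zero_model` with a network and take a gradient of this quantity with respect to its parameters.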

The Model Learns a Local Direction

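There are several closely related ways to describe the learned object. A DDPM implementation often says the network predicts noise. A score-based paper says the network estimates \(\nabla_x \log p_t(x)\), the direction in noisy space in which probability density increases fastest. For Gaussian corruption the two differ only by a known scale: the score is the expected noise given \(x_t\), multiplied by \(-1/\sqrt{1-\bar{\alpha}_t}\). An engineering implementation may call the same network a denoiser.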

Name	What it predicts	Why it matters
Denoiser	A cleaner version of a corrupted sample	Directly matches the intuition of iterative refinement.
Noise predictor	The Gaussian noise component added to the sample	Gives a simple MSE training target in many DDPM-style systems.
Score model	The gradient of the log density at a noise level	Connects diffusion to probability flows, SDEs, and Langevin sampling.

These names emphasize different mathematical views. For intuition, the important fact is that the model does not need to solve the whole generation problem in one pass. It only needs to make a useful local prediction at a chosen noise level.
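Under the Gaussian forward process used above, the three views in the table can be converted into one another with simple algebra. The sketch below assumes a noise prediction `eps_hat` is available; the function names are illustrative and not taken from any library.

```python
import numpy as np

def x0_from_eps(x_t, eps_hat, alpha_bar_t):
    """Denoiser view: invert x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps for x0."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)

def score_from_eps(eps_hat, alpha_bar_t):
    """Score view: grad_x log p_t(x_t) is approximated by -eps_hat / sqrt(1 - ab)."""
    return -eps_hat / np.sqrt(1.0 - alpha_bar_t)

# Round trip with the true noise: the clean sample is recovered exactly.
rng = np.random.default_rng(0)
alpha_bar_t = 0.3
x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

recovered = x0_from_eps(x_t, eps, alpha_bar_t)
score = score_from_eps(eps, alpha_bar_t)
print(np.allclose(recovered, x0))
```

A trained network only has access to \(x_t\), so its `eps_hat` is an average over all clean samples that could have produced that noisy point. The recovered \(x_0\) is therefore blurry at high noise levels and sharp at low ones.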

Sampling Repeats the Correction

Training samples a noisy point and asks for a correction. Sampling starts with noise and repeats correction many times. Each step calls the same network with a different noise level:

x_T ~ Normal(0, I)
for t = T ... 1:
  correction = denoiser(x_t, t, condition)
  x_{t-1} = sampler_step(x_t, correction, t)
return x_0

The sampler is an algorithm wrapped around the denoising network. It decides step sizes, noise schedules, stochasticity, and numerical integration details. This is why diffusion papers can share the same broad training idea but differ in samplers, schedules, parameterizations, and speed-quality tradeoffs. Karras et al.'s EDM paper (Elucidating the Design Space of Diffusion-Based Generative Models) is valuable because it separates many of these design choices that are often tangled together.
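The loop above can be made concrete with one possible sampler, the DDPM ancestral step with variance \(\beta_t\). To keep the sketch runnable without training anything, it assumes toy one-dimensional data \(x_0 \sim \mathcal{N}(\mu, \sigma^2)\), for which the ideal noise prediction has a closed form. `MU`, `SIGMA`, the linear schedule, and the function bodies are illustrative assumptions; a real system would call a trained network inside `denoiser`.

```python
import numpy as np

MU, SIGMA = 2.0, 0.5  # assumed toy data distribution: x_0 ~ Normal(MU, SIGMA^2)

betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def denoiser(x_t, t, condition=None):
    """Closed-form E[eps | x_t] for the toy Gaussian data; stands in for a network."""
    var_t = alpha_bar[t] * SIGMA**2 + (1.0 - alpha_bar[t])
    return np.sqrt(1.0 - alpha_bar[t]) * (x_t - np.sqrt(alpha_bar[t]) * MU) / var_t

def sampler_step(x_t, eps_hat, t, rng):
    """One DDPM ancestral step; no fresh noise is added on the final step."""
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

def sample(num_samples, rng):
    """Start from pure noise and apply the correction at every noise level."""
    x = rng.standard_normal(num_samples)
    for t in reversed(range(len(betas))):  # 0-based: t = 999 ... 0
        x = sampler_step(x, denoiser(x, t), t, rng)
    return x

samples = sample(5000, np.random.default_rng(0))
print(samples.mean(), samples.std())  # should land near MU and SIGMA
```

Swapping `sampler_step` for a deterministic or higher-order solver changes speed and quality without touching `denoiser`, which is the separation the paragraph above describes.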

Slow sampling comes from this repeated network evaluation. The same property also gives diffusion its strength: a high-dimensional sample is assembled through many smaller corrections, with conditioning injected at every correction.

Latent Diffusion Moves the Same Mechanism

Pixel diffusion applies the noising and denoising process directly to pixel arrays. Latent diffusion first trains an autoencoder, then applies diffusion to the compressed representation:

\[ z_0 = E(x_0), \qquad \hat{x}_0 = D(z_0) \]

\[ z_t = \sqrt{\bar{\alpha}_t}z_0+ \sqrt{1-\bar{\alpha}_t}\epsilon, \qquad \epsilon_\theta(z_t,t,c)\approx \epsilon \]

The generative mechanism stays the same. The space changes. The model denoises \(z_t\) instead of \(x_t\), and the decoder maps the final latent back to pixels. This is the main reason Stable Diffusion can be much cheaper than pixel-space diffusion for high-resolution images. The model works in a smaller, perceptually meaningful space.
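A rough size comparison shows where the savings come from. The sketch assumes an autoencoder that downsamples each spatial side by 8 and produces 4 latent channels, the configuration commonly cited for Stable Diffusion v1; other models use different factors, and `latent_shape` is a hypothetical helper.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape of the latent tensor for an assumed spatial downsampling factor."""
    return (height // downsample, width // downsample, latent_channels)

def num_elements(shape):
    """Total number of scalar entries in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

pixels = (512, 512, 3)
latent = latent_shape(512, 512)
ratio = num_elements(pixels) / num_elements(latent)
print(latent, ratio)  # (64, 64, 4) 48.0
```

Every denoising step touches 48 times fewer values in this configuration, and the saving repeats at each of the many sampling steps.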

Why Latent And Diffusion Coexist

Latent is the representation. Diffusion is the process run over that representation. A latent diffusion model is a diffusion model whose noisy variable is a latent tensor rather than raw pixels.

Guidance Makes Generation Conditional

Text-to-image systems need the sample to obey a condition. Early systems used classifier guidance: a separate classifier supplies a gradient that pushes the sample toward a label. Classifier-free guidance trains a model that can run both with and without a condition, then amplifies the difference between conditional and unconditional predictions at sampling time.

\[ \epsilon_{\text{guided}} = \epsilon_\theta(x_t,t,\varnothing) + w\left[ \epsilon_\theta(x_t,t,c) - \epsilon_\theta(x_t,t,\varnothing) \right] \]

The guidance scale \(w\) changes how strongly the condition shapes the trajectory. In this form, \(w = 0\) gives the unconditional prediction, \(w = 1\) gives the plain conditional prediction, and \(w > 1\) extrapolates past it. Higher guidance often improves prompt adherence but can damage diversity or naturalness. This is the same tradeoff seen across generative systems: stronger control can narrow the distribution the sampler explores.
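The guidance formula is a single line of arithmetic once the two predictions exist. In the sketch below the two arrays are made-up stand-ins for \(\epsilon_\theta(x_t,t,\varnothing)\) and \(\epsilon_\theta(x_t,t,c)\), and `guided_eps` is an illustrative name.

```python
import numpy as np

def guided_eps(eps_uncond, eps_cond, w):
    """Classifier-free guidance: move from the unconditional prediction toward
    (and, for w > 1, past) the conditional one."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0])  # stand-in for the unconditional prediction
eps_cond = np.array([1.0, 1.0])    # stand-in for the conditional prediction

print(guided_eps(eps_uncond, eps_cond, 0.0))  # [0. 1.]  unconditional
print(guided_eps(eps_uncond, eps_cond, 1.0))  # [1. 1.]  plain conditional
print(guided_eps(eps_uncond, eps_cond, 3.0))  # [3. 1.]  extrapolated
```

Only the coordinate where the two predictions disagree is amplified, which is why large \(w\) sharpens whatever the condition changes and leaves the rest alone. Each guided step costs two network evaluations unless they are batched together.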

Video Adds Time To The Sample

Video diffusion expands the denoised object from an image to a spatiotemporal tensor. The sample now has frame, height, width, and channel dimensions. The model must preserve visual detail and maintain temporal coherence. A single bad correction can create flicker, identity drift, or motion that looks locally plausible and globally wrong.

This is why video diffusion papers spend so much effort on temporal attention, 3D VAEs, frame conditioning, windowed attention, cascades, and training curricula. The denoising idea is still recognizable. The sample being denoised has become a short world trajectory.

The Cosmos Policy Connection

Cosmos Policy becomes easier to place after the representation/process split. The paper starts from a pretrained video latent diffusion model. Its native object is a sequence of video latent frames. The control trick is to insert robot variables into that same sequence:

current observation latent
-> action chunk latent slot
-> future observation latent
-> value latent slot

The latent sequence is the representation interface. Diffusion denoising is the mechanism that fills missing slots. Conditioning masks decide which slots are visible and which slots are targets. With one mask the model behaves like a policy; with another it behaves like a world model; with another it predicts value.
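The masking idea can be illustrated with a toy sequence of four slots. This is a hypothetical sketch, not the paper's implementation: the slot shapes, the mode names, and the choice of which slots are visible in each mode are all assumptions made for illustration.

```python
import numpy as np

SLOTS = ["observation", "action", "future_observation", "value"]

# Assumed masks: each mode lists the slots that stay clean (visible).
# Every other slot is treated as a denoising target.
MASKS = {
    "policy": {"observation"},
    "world_model": {"observation", "action"},
    "value": {"observation", "action", "future_observation"},
}

def noise_targets(sequence, mode, alpha_bar_t, rng):
    """Return a copy of the sequence with only the target slots corrupted."""
    visible = MASKS[mode]
    noisy = {}
    for name in SLOTS:
        z0 = sequence[name]
        if name in visible:
            noisy[name] = z0.copy()  # conditioning slot, left untouched
        else:
            eps = rng.standard_normal(z0.shape)
            noisy[name] = np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return noisy

rng = np.random.default_rng(0)
sequence = {name: np.ones((2, 3)) for name in SLOTS}  # stand-in latent slots
noisy = noise_targets(sequence, "world_model", alpha_bar_t=0.5, rng=rng)
for name in SLOTS:
    print(name, np.array_equal(noisy[name], sequence[name]))
```

The denoiser sees the whole sequence in every mode. Only the mask changes, so one set of weights can serve as several different predictors.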

Cosmos Policy does not treat diffusion as a picture-only algorithm. It relies on the fact that diffusion can denoise continuous tensors. Once action and value have been encoded as latent-shaped tensors, the same denoising machinery can generate them beside future video latents.

What Diffusion Gives Up

Diffusion pays for quality with repeated computation. Every generated sample requires many network calls unless the model is distilled or the sampler is aggressively shortened. Latent diffusion reduces the size of the denoised variable, but it still keeps an iterative sampling loop.

Diffusion also makes generation depend on the geometry of noise levels. The model can be strong at some noise ranges and weak at others. Schedules, preconditioning, parameterization, and solver choices become part of model quality. The clean conceptual story hides a large engineering surface.

The final limitation is control. Guidance, conditioning, masks, and value functions can steer the sample, but they do not guarantee that the generated object is causally correct. In video and robotics, visual plausibility and action-relevant correctness can separate. Cosmos Policy is interesting precisely because it tests whether a video diffusion prior can be turned into an action interface rather than only a visual generator.

Sources

Denoising Diffusion Probabilistic Models, Ho, Jain, and Abbeel, NeurIPS 2020.
What are Diffusion Models?, Lilian Weng.
The Annotated Diffusion Model, Hugging Face.
Generative Modeling by Estimating Gradients of the Data Distribution, Yang Song.
Diffusion models are autoencoders, Sander Dieleman.
Perspectives on diffusion, Sander Dieleman.
High-Resolution Image Synthesis with Latent Diffusion Models, Rombach et al., CVPR 2022.
Diffusion Models for Video Generation, Lilian Weng.
Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning, Kim et al., arXiv 2026.
