Cosmos Policy: How a Video Model Outputs Actions

June 2026

Cosmos is NVIDIA's world-model platform for Physical AI. Cosmos Policy takes the Cosmos-Predict2 Video2World backbone and fine-tunes it on robot data so that selected video-latent slots encode actions, future observations, and values. The model still denoises latent tensors; the training data and the decoding rules change what some of those tensors mean.

The central question is simple. Cosmos-Predict2 was built to predict future video. Cosmos Policy asks whether the same tensor interface can also carry robot control variables.

The answer in the paper is to keep the video model's shape contract and change the sequence stored inside it.

Figure 1 (from the Cosmos Policy paper): Cosmos Policy overview. The same backbone receives the current state and a task description, then outputs an action chunk, a future state, and a value.

Start From The Tensor

A video diffusion model emits latent tensors. When those tensors pass through a video VAE decoder, they become image frames. The tensors do not carry a fixed semantic label by themselves. Their meaning comes from the training target and the way the output is read.

Cosmos-Predict2 starts with a video sequence interface:

current observation
-> future video latents
-> video decoder
-> future frames

Cosmos Policy keeps the latent-frame interface and changes the slots:

current observation
-> action chunk
-> future observation
-> value

This is the technical reason a video model can output actions. The network is still producing tensors with the expected shape. Fine-tuning teaches the model that some tensor slots now correspond to robot actions and scalar values.

The Slot Assignment

Image observations enter normally. The video VAE encodes each camera observation into video latent frames. Robot variables are handled differently. Proprioception, action chunks, and scalar values are normalized, repeated across a video-latent-shaped volume, and inserted into reserved slots.

\[ x_0 = [ z(o_t),\ \phi_q(q_t),\ \phi_a(a_{t:t+K-1}),\ z(o_{t+K}),\ \phi_q(q_{t+K}),\ \phi_v(V(s_{t+K})) ] \]

\(z(o_t)\) is the video latent for the current observation, and \(z(o_{t+K})\) is the latent for the observation after the \(K\)-step action chunk. \(\phi_q\), \(\phi_a\), and \(\phi_v\) map proprioception, the action chunk, and the scalar value into video-latent-shaped slots. After denoising, action and value extraction runs the inverse operation: average over the repeated volume, then un-normalize back to the original action or value scale.
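The round trip through \(\phi_a\) or \(\phi_v\) can be sketched in a few lines of NumPy. This is a minimal illustration, not the released implementation: the latent shape `LATENT_SHAPE`, the min-max normalization to \([-1, 1]\), and the helper names `encode_slot` and `decode_slot` are all assumptions made for the example.

```python
import numpy as np

# Hypothetical latent-frame shape (channels, height, width); not taken from the paper.
LATENT_SHAPE = (16, 8, 8)

def encode_slot(x, lo, hi, shape=LATENT_SHAPE):
    """Normalize a low-dimensional vector to [-1, 1] and tile it over a latent-shaped volume."""
    x = np.asarray(x, dtype=np.float64).ravel()
    normed = 2.0 * (x - lo) / (hi - lo) - 1.0      # assumed min-max normalization
    n = int(np.prod(shape))
    reps = -(-n // normed.size)                    # ceiling division
    return np.tile(normed, reps)[:n].reshape(shape)

def decode_slot(slot, dim, lo, hi):
    """Average the repeated copies, then un-normalize back to the original scale."""
    flat = slot.ravel()
    n_full = (flat.size // dim) * dim              # drop the trailing partial copy
    normed = flat[:n_full].reshape(-1, dim).mean(axis=0)
    return (normed + 1.0) / 2.0 * (hi - lo) + lo

# Usage: a chunk of K=2 actions with 7 dimensions each (toy numbers).
rng = np.random.default_rng(0)
action_chunk = rng.uniform(-1.0, 1.0, size=(2, 7))
slot = encode_slot(action_chunk, lo=-1.5, hi=1.5)
recovered = decode_slot(slot, dim=14, lo=-1.5, hi=1.5).reshape(2, 7)
```

Averaging many copies also explains why the readout tolerates small residual noise in the denoised slot: independent errors across the repeated entries shrink when they are averaged.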

Figure 2 (from the paper): latent frame injection, showing the conditioning sequence, the target sequence, and the robot proprioception, action chunk, future state, and value inserted into video latent frames. The red-bordered slots are the control variables injected into the video latent sequence.

The method spends extra representation capacity to preserve compatibility with the pretrained video backbone. A low-dimensional action vector becomes an image-shaped latent slot. The gain is that the transformer and diffusion objective remain close to the original video model.

Diffusion Learns The Interface

Denoising intuition

During training, the model sees many clean state -> action -> future -> value trajectories. The training procedure deliberately corrupts those trajectories with noise, and the model learns to restore them given the current state. At inference time, the unknown slots start as pure noise, and the denoising process turns them step by step into one plausible action, future, and value sample.

The clean sequence \(x_0\) is the robot trajectory represented as latent slots. The forward process adds Gaussian noise. The denoiser learns to recover the clean sequence under a task condition and a mask:

\[ x_\sigma = x_0 + n,\quad n \sim \mathcal{N}(0,\sigma^2 I), \qquad D_\theta(x_\sigma,\sigma,c,m) \approx x_0 \]
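A single training step under this formula might look like the sketch below. It assumes a per-slot mask where 1 marks a known input and 0 marks a slot to predict, keeps known slots clean, and uses a plain squared error on the predicted slots. Any \(\sigma\)-dependent loss weighting is omitted, and `denoiser` is a hypothetical callable standing in for \(D_\theta\).

```python
import numpy as np

def add_noise(x0, sigma, rng):
    """Forward process: x_sigma = x0 + n with n ~ N(0, sigma^2 I)."""
    return x0 + sigma * rng.standard_normal(x0.shape)

def denoising_loss(denoiser, x0, sigma, cond, mask, rng):
    """Mean squared error over the slots that must be predicted.

    x0 has shape (num_slots, ...); mask has one entry per slot
    (1 = known input, 0 = slot to predict). Known slots stay clean (assumption).
    """
    m = mask.reshape(-1, *([1] * (x0.ndim - 1)))   # broadcast mask over each slot volume
    x_sigma = m * x0 + (1 - m) * add_noise(x0, sigma, rng)
    pred = denoiser(x_sigma, sigma, cond, mask)
    n_pred = int((1 - mask).sum()) * x0[0].size    # number of predicted entries
    return float(np.sum((1 - m) * (pred - x0) ** 2) / n_pred)
```

With a denoiser that simply returns its noisy input, the loss comes out close to \(\sigma^2\), which is a quick sanity check on the noise scale.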

The mask is important. It marks which slots are known inputs and which slots must be predicted. With the current state visible and action/future/value hidden, the same denoising model becomes a policy. With state and action visible, it becomes a world model. With state, action, and future visible, it becomes a value estimator.

\[ \begin{aligned} \text{policy} &: p(a, s^{\prime}, V(s^{\prime}) \mid s) \\ \text{world model} &: p(s^{\prime}, V(s^{\prime}) \mid s, a) \\ \text{value model} &: p(V(s^{\prime}) \mid s, a, s^{\prime}) \end{aligned} \]
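The three roles differ only in which slots are marked as known. The sketch below builds the per-slot masks for a single-camera layout that follows the order of \(x_0\) above, then fills the hidden slots with a first-order Euler sampler over a decreasing noise schedule. The slot names, the sampler, and the schedule are illustrative assumptions; the paper's sampler and its multi-camera layout may differ.

```python
import numpy as np

# Slot order follows x_0 above; names are hypothetical labels for this example.
SLOTS = ("obs_t", "proprio_t", "action_chunk", "obs_next", "proprio_next", "value")

# Slots given as clean inputs in each role; every other slot is sampled.
VISIBLE = {
    "policy": {"obs_t", "proprio_t"},
    "world_model": {"obs_t", "proprio_t", "action_chunk"},
    "value_model": {"obs_t", "proprio_t", "action_chunk", "obs_next", "proprio_next"},
}

def conditioning_mask(role):
    """Per-slot mask: 1 = known input, 0 = slot to predict."""
    return np.array([int(name in VISIBLE[role]) for name in SLOTS])

def sample_hidden_slots(denoiser, x_known, mask, cond, sigmas, rng):
    """Deterministic Euler sampler; known slots are clamped to their clean values.

    sigmas must be decreasing and end at 0. denoiser(x, sigma, cond, mask)
    returns an estimate of the clean sequence.
    """
    m = mask.reshape(-1, *([1] * (x_known.ndim - 1)))
    x = m * x_known + (1 - m) * sigmas[0] * rng.standard_normal(x_known.shape)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x0_hat = denoiser(x, sigma, cond, mask)
        x = x + (sigma_next - sigma) * (x - x0_hat) / sigma   # Euler step toward x0_hat
        x = m * x_known + (1 - m) * x                          # re-clamp known slots
    return x
```

Calling `sample_hidden_slots` with `conditioning_mask("policy")` samples action, future, and value together; switching to `"world_model"` or `"value_model"` reuses the same function with more slots clamped.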

Why The Paper Needs Value

A direct policy samples an action chunk and executes it. The planning version samples multiple candidate action chunks, predicts the future state for each candidate, estimates the value of each future, and executes the action with the highest predicted value.

\[ a^\star = \arg\max_{a_i \sim \pi(\cdot \mid s)} V(\hat{s}^{\prime}_i), \qquad \hat{s}^{\prime}_i \sim \hat{T}(\cdot \mid s, a_i) \]
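The selection rule above reduces to a short loop. In this sketch the three callables stand in for the policy, world-model, and value roles of the same network; the 1-D toy functions and the name `best_of_n` are hypothetical and exist only to make the example runnable. Any aggregation over repeated future or value samples is left out.

```python
import numpy as np

def best_of_n(state, sample_action, predict_future, predict_value, n, rng):
    """Sample n candidate action chunks, imagine each future, keep the best-scoring one."""
    candidates = [sample_action(state, rng) for _ in range(n)]
    futures = [predict_future(state, a, rng) for a in candidates]
    values = np.array([predict_value(state, a, f, rng)
                       for a, f in zip(candidates, futures)])
    best = int(np.argmax(values))
    return candidates[best], futures[best], float(values[best])

# Toy stand-ins (hypothetical): a 1-D state that should reach GOAL.
GOAL = 1.0

def toy_policy(state, rng):
    return rng.normal(0.0, 1.0)            # propose a random displacement

def toy_world_model(state, action, rng):
    return state + action                  # predicted next state

def toy_value(state, action, future, rng):
    return -abs(future - GOAL)             # closer to the goal scores higher

rng = np.random.default_rng(0)
action, future, value = best_of_n(0.0, toy_policy, toy_world_model, toy_value,
                                  n=16, rng=rng)
```

The cost structure is visible in the loop: each candidate needs its own future and value prediction, so inference cost grows roughly linearly with `n` unless candidates are evaluated in parallel.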

This is where Cosmos Policy moves from video prediction toward a control-oriented world model. The future is used for action selection. The model predicts an action, predicts what that action will lead to, then scores the predicted future.

Evidence And Boundary

The paper reports strong direct-policy results on LIBERO, RoboCasa, and ALOHA. The more useful mechanism check is the RoboCasa ablation. Removing the future-state and value auxiliary targets drops average success from 67.1% to 44.4%, a decline of 22.7 percentage points. That row indicates that the future/value slots contribute to action learning.

Table 5 (from the paper): RoboCasa ablations and inference latency results. The key ablation removes the future-state and value targets and sharply weakens the policy.

The result supports a specific claim: a pretrained video diffusion model can become a strong manipulation policy when robot actions, future observations, and values are represented inside its native latent sequence.

The result leaves three questions open. First, the full best-of-N planning setup is expensive; the paper reports about 4.9 seconds on 8 H100 GPUs in the ALOHA planning setting. Second, planning needs rollout data so the world model and value model see failures beyond demonstrations. Third, repeating actions and scalar values across a video-latent volume preserves compatibility, but it is not obviously the most efficient representation for control.

Cosmos Policy is useful because it makes the interface problem explicit. A video world model becomes relevant to robotics only after action, future, and value enter the same object that the model can train on and sample from.

Sources

NVIDIA Cosmos, NVIDIA's world foundation model platform for Physical AI.
Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning, Kim et al., arXiv 2026.
Cosmos Policy project page, NVIDIA Research.
NVlabs/cosmos-policy, released code, models, and training data references.
