WEAVER: a world model that is faithful, consistent, and fast at once

June 2026 · world models · robotic manipulation · WEAVER (arXiv:2606.13672)

A learned simulator is useful for robotics — to evaluate a policy, improve it, or plan at test time — only if it is faithful, coherent over long horizons, and fast, all at the same time. Prior robot world models give up at least one. WEAVER is a multi-view latent world model that, on its own real-robot setup, reaches all three together.

Three desiderata, and why prior models miss at least one

A world model can drive three downstream uses with little real interaction: policy evaluation, policy improvement, and test-time planning. To deliver them, the paper argues a world model must jointly satisfy three properties — fidelity (simulated trajectories correlate with reality), consistency (they stay coherent over long horizons), and efficiency (they are produced quickly).

The claim is that no prior robot world model has all three. Video-generation world models are high-fidelity but slow. JEPA-style latent models learn representations that may not decode into the images needed to score an arbitrary visuomotor policy. The state-of-the-art manipulation world model, Ctrl-World, runs far slower than real time, which rules out test-time planning and makes policy improvement expensive. Manipulation sharpens the tension: it involves multiple camera views and occlusions, and it needs physically faithful state rather than merely good-looking frames.

How WEAVER reaches all three

WEAVER (World Estimation Across Views for Embodied Reasoning, 928M parameters) is a multi-view latent world model, and each desideratum maps to a specific design choice.

Fidelity — every camera view (external and wrist) is encoded into patch tokens by a pretrained Stable Diffusion 3 VAE encoder; proprioceptive state is projected to a token and concatenated. WEAVER predicts both views and, unlike Ctrl-World, explicitly predicts future proprioceptive state, which the paper argues matters for contact-rich and deformable manipulation.

Dynamics and objective — given a sparse long-term memory (every k-th past latent) plus a short recent history and an h-step action plan, a 2D transformer autoregressively generates the next h latents, trained with a flow-matching loss (predict the velocity of the latent flow toward the ground-truth next latents).
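The context construction and training target described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the linear noise-to-data path is the common rectified-flow convention rather than something confirmed for WEAVER, the sparse memory is assumed to stop where the recent window starts, and the conditioning on memory, history, and actions is folded into a stand-in callable.

```python
import random

def context_indices(t, k, recent):
    """Indices of past latents used as context at time t (hypothetical helper).

    Sparse long-term memory: every k-th index before the recent window.
    Short history: the last `recent` indices before t.
    """
    start_recent = max(0, t - recent)
    memory = list(range(0, start_recent, k))
    history = list(range(start_recent, t))
    return memory, history

def flow_matching_loss(velocity_fn, target, tau, noise):
    """Mean squared error between predicted and true velocity for one latent.

    Assumes the linear path x_tau = (1 - tau) * noise + tau * target,
    whose velocity d x_tau / d tau is (target - noise).
    """
    x_tau = [(1 - tau) * n + tau * x for n, x in zip(noise, target)]
    true_v = [x - n for n, x in zip(noise, target)]
    pred_v = velocity_fn(x_tau, tau)
    return sum((p - v) ** 2 for p, v in zip(pred_v, true_v)) / len(target)

# Toy usage with made-up values.
memory, history = context_indices(t=20, k=5, recent=4)
rng = random.Random(0)
target = [0.3, -1.2, 0.8]                      # stand-in for a ground-truth next latent
noise = [rng.gauss(0.0, 1.0) for _ in target]  # Gaussian starting point of the flow
tau = rng.random()                             # interpolation time in [0, 1)
untrained = lambda x, t: [0.0] * len(x)        # placeholder model predicting zero velocity
loss = flow_matching_loss(untrained, target, tau, noise)
```

With `t=20, k=5, recent=4`, the memory is indices 0, 5, 10, 15 and the history is 16 through 19. A model that predicted `target - noise` exactly would drive the loss to zero.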

Consistency — Diffusion Forcing (independent noise levels per future step) keeps long rollouts coherent rather than collapsing.
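The "independent noise levels per future step" idea can be illustrated as follows. This is a sketch only: the uniform sampling of levels and the linear interpolation between noise and data are assumptions, and the paper's actual noise schedule is not given here.

```python
import random

def noise_future_latents(latents, rng):
    """Noise each future latent with its own independently sampled level.

    latents: list of h latent vectors (lists of floats).
    Returns (noised, levels), where levels[i] in [0, 1) is the fraction of
    clean data kept for step i (near 1 = almost clean, 0 = pure noise).
    """
    noised, levels = [], []
    for z in latents:
        tau = rng.random()  # one level per step, not one shared level
        eps = [rng.gauss(0.0, 1.0) for _ in z]
        noised.append([(1 - tau) * e + tau * x for e, x in zip(eps, z)])
        levels.append(tau)
    return noised, levels

rng = random.Random(0)
future = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # made-up latents, h = 3
noised, levels = noise_future_latents(future, rng)
```

Because the model sees every mix of noisy and clean steps during training, it can condition on earlier steps that are already nearly clean while later ones are still noisy, which is the usual argument for why this scheme helps long rollouts.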

Efficiency — SPRINT blocks drop patch tokens aggressively, KV-caching reuses memory/history tokens across denoising steps, and a rectified-flow distillation step (WEAVER-REFLOW) cuts generation to a few forward passes; imagination runs at about 5 Hz.
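The reason a straighter flow needs fewer forward passes can be shown with a toy Euler integrator, where each step stands for one model call. This is a generic sketch of the usual motivation for rectified-flow distillation, not WEAVER-REFLOW itself; both velocity functions are hand-written stand-ins.

```python
def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dtau = velocity_fn(x, tau) from tau = 0 to 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        tau = i * dt
        v = velocity_fn(x, tau)  # one forward pass per step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

start = [0.0, 0.0]
target = [2.0, -1.0]

# Straight flow: constant velocity pointing from start to target.
straight = lambda x, tau: [t - s for t, s in zip(target, start)]
# Curved flow with the same endpoints: velocity grows with tau.
curved = lambda x, tau: [2.0 * tau * (t - s) for t, s in zip(target, start)]

straight_1 = euler_sample(straight, start, num_steps=1)  # lands on target exactly
curved_1 = euler_sample(curved, start, num_steps=1)      # does not move at all
curved_4 = euler_sample(curved, start, num_steps=4)      # closer, still short of target
```

The straight field reaches the target in a single step, while the curved field needs many steps to approach it, so a distillation stage that straightens trajectories trades training effort for cheaper sampling.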

Scoring without decoding — a reward head distilled from an off-the-shelf reward model scores rewards directly on latents plus the language instruction, and a critic estimates returns beyond the imagined horizon. So scoring a candidate future does not require decoding frames and calling a separate VLM judge.
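One way to combine the two heads into a single score for an imagined rollout is a discounted sum of predicted rewards plus the critic's estimate at the final latent. The sketch below assumes that form; the function name and the discount factor are illustrative choices, not values from the paper.

```python
def bootstrapped_return(rewards, terminal_value, gamma=0.99):
    """Discounted imagined rewards plus a critic estimate past the horizon.

    rewards: per-step reward-head outputs over the h imagined latents.
    terminal_value: critic output at the last imagined latent.
    gamma: discount factor (an assumed value).
    """
    total = 0.0
    for i, r in enumerate(rewards):
        total += (gamma ** i) * r
    return total + (gamma ** len(rewards)) * terminal_value

# Made-up numbers: two imagined steps with reward 1.0, critic says 10.0 beyond.
score = bootstrapped_return([1.0, 1.0], terminal_value=10.0, gamma=0.5)
```

With these toy inputs the score is 1.0 + 0.5 × 1.0 + 0.25 × 10.0 = 4.0. Everything in the computation is a scalar read off the latents, which is why no frame decoding is needed.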

[Schematic (original) of the WEAVER pipeline. Boxes, in order: observation (multi-view RGB + proprioception); SD3 VAE → latent zₜ; flow-matching 2D transformer dynamics; future latents (h steps) + memory / history; reward + critic scoring on latents (no VLM decode). Outputs: policy evaluation (replay, record reward); policy improvement (advantage → distill); test-time planning (best-of-N). Design notes: fidelity via multi-view + proprio; consistency via Diffusion Forcing; efficiency via SPRINT / ReFlow. One model, read three ways. The paper's figures give the full architecture.]
[Figure: WEAVER architecture]
Figure 2 from WEAVER (Jain et al., arXiv:2606.13672), CC BY 4.0. The world model encodes memory, history, and an action sequence to imagine future rollouts in latent space.

One model, three downstream uses

The same world model is read three ways. Policy evaluation: replay a recorded real action trajectory open-loop inside WEAVER and record the predicted reward. Policy improvement: sample action chunks, forward-simulate them, compute a Monte-Carlo advantage from the reward and critic heads, keep the best rollout above a threshold, and distill it into the base policy. Test-time planning: a single-chunk best-of-N — sample candidate action chunks, imagine their outcomes, and execute the highest-advantage one, all scored in latent space without an external judge.
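The selection step shared by policy improvement and test-time planning can be sketched as a best-of-N over candidate action chunks. This is an illustration under assumptions: `best_of_n` and `toy_advantage` are hypothetical names, and the advantage function here is a toy stand-in for "imagine the chunk in the world model, then score it with the reward and critic heads".

```python
def best_of_n(candidates, advantage_fn, threshold=None):
    """Pick the candidate with the highest advantage.

    Returns (candidate, advantage). If a threshold is given and the best
    advantage does not exceed it, returns (None, None), as in the
    policy-improvement case where weak rollouts are discarded.
    """
    advantages = [advantage_fn(c) for c in candidates]
    best_idx = max(range(len(advantages)), key=advantages.__getitem__)
    best_adv = advantages[best_idx]
    if threshold is not None and best_adv <= threshold:
        return None, None
    return candidates[best_idx], best_adv

# Toy stand-in: chunks whose summed action is closer to 1.0 score higher.
def toy_advantage(chunk):
    return -abs(sum(chunk) - 1.0)

chunks = [[0.1], [0.9], [0.5]]  # made-up candidate action chunks
plan, plan_adv = best_of_n(chunks, toy_advantage)                 # test-time planning
kept, kept_adv = best_of_n(chunks, toy_advantage, threshold=0.0)  # improvement filter
```

In the planning case the chosen chunk is executed on the robot; in the improvement case a chunk is kept for distillation only if it clears the threshold, so here `kept` is `None` because no toy candidate has positive advantage.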

Evidence

Downstream use        Result (real hardware)
Policy evaluation     ρ = 0.870 correlation between WEAVER's score and real-world success rate
Policy improvement    +38% real-world success on top of the π0.5 robot foundation model
Test-time planning    +14% real-world success, with a 5–10× speedup over prior world models
Out-of-distribution   better than prior world models on OOD scenarios
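The policy-evaluation number is a correlation between two lists: one simulated score per policy and one measured success rate per policy. The summary above does not say whether ρ is a Pearson or a rank correlation, so the sketch below assumes Pearson, and every number in it is a made-up placeholder rather than data from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# Placeholder values: one entry per evaluated policy.
simulated_score = [0.2, 0.4, 0.5, 0.8]    # predicted reward inside the world model
real_success_rate = [0.1, 0.5, 0.4, 0.9]  # measured on hardware
rho = pearson(simulated_score, real_success_rate)
```

A value near 1 means the simulator ranks and spaces policies much as real trials do, which is what makes it usable as a cheap proxy for hardware evaluation.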

The main world-model baseline is Ctrl-World (a 1.5B diffusion model trained on DROID); the paper reports that WEAVER Pareto-dominates it on fidelity versus speed. The policy-improvement base is the π0.5 foundation model. WEAVER itself is pretrained on DROID (~1M steps on 4×H100, about 10 days) and runs imagination at roughly 5 Hz.

[Figure: FVD versus inference time, WEAVER vs Ctrl-World]
Figure 5 from WEAVER (arXiv:2606.13672), CC BY 4.0. FVD (lower is more faithful) versus inference time; WEAVER sits below and to the left of Ctrl-World: more faithful and faster at once.

Limitations

WEAVER leans on pretrained components (an SD3 VAE encoder/decoder and a reward head distilled from an off-the-shelf reward model), so their coverage bounds it. The "all three at once" claim is relative to prior world models on the paper's own manipulation suite and real-robot tasks, not a universal result. The version read here does not list author affiliations. As with any single real-robot system, the numbers come from its own task suite. The boundary worth keeping: a more faithful, faster simulator is a better substrate for evaluation and planning, not a replacement for the verification and trust-horizon questions that sit on top of it.

References

  1. WEAVER: WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation (arXiv:2606.13672). Code, models, and videos on the project page linked from the paper.
  2. Main world-model baseline: Ctrl-World. Policy-improvement base model: π0.5.
