EV-WM: verifying events, not features

June 2026 · long-horizon manipulation · EV-WM (arXiv:2606.13053)

A feature-space world model can imagine future visual features, but feature similarity does not tell you whether the task actually progressed. EV-WM adds a verifier on top of such a model: it decodes each imagined future into task predicates and checks whether the event happened — whether the object moved, the contact state changed, the placement predicate holds — instead of whether the predicted features land near a goal.

Where feature-space planning breaks

A feature-space world model — the DINO-WM line — keeps a frozen visual encoder and trains an action-conditioned dynamics model that predicts future features. To act, the planner rolls out candidate action sequences in feature space and scores them by how close the predicted features land to a goal feature, typically with the cross-entropy method (CEM): sample action sequences, keep the high-scoring elites, resample, execute the first action, replan.
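The planning loop described above can be sketched in a few lines. This is a minimal illustration, not DINO-WM's implementation: `rollout` is a toy additive stand-in for the learned dynamics model, and the function names, the squared-distance cost, and all hyperparameters (`horizon`, `samples`, `elites`, `iters`) are assumptions chosen for the example.

```python
import random

def rollout(z0, actions):
    """Toy stand-in for the learned dynamics: each action shifts the latent."""
    z = list(z0)
    for a in actions:
        z = [zi + ai for zi, ai in zip(z, a)]
    return z

def feature_cost(z, z_goal):
    """Squared Euclidean distance between predicted and goal features."""
    return sum((zi - gi) ** 2 for zi, gi in zip(z, z_goal))

def cem_plan(z0, z_goal, horizon=5, samples=64, elites=8, iters=15, seed=0):
    """Cross-entropy method over action sequences, scored by feature distance."""
    rng = random.Random(seed)
    dim = len(z0)
    mean = [[0.0] * dim for _ in range(horizon)]
    std = [[1.0] * dim for _ in range(horizon)]
    for _ in range(iters):
        # Sample candidate action sequences from the current diagonal Gaussian.
        cands = [[[rng.gauss(mean[t][d], std[t][d]) for d in range(dim)]
                  for t in range(horizon)] for _ in range(samples)]
        # Rank by the feature distance of the final predicted latent (lower is better).
        cands.sort(key=lambda seq: feature_cost(rollout(z0, seq), z_goal))
        top = cands[:elites]
        # Refit the sampling distribution to the elites.
        for t in range(horizon):
            for d in range(dim):
                vals = [seq[t][d] for seq in top]
                m = sum(vals) / elites
                var = sum((v - m) ** 2 for v in vals) / elites
                mean[t][d] = m
                std[t][d] = max(var ** 0.5, 1e-3)
    # A receding-horizon controller would execute mean[0] and then replan.
    return mean
```

Note that the only signal the planner sees is `feature_cost`. Anything the feature distance fails to capture is invisible to the search.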

In long-horizon manipulation this scoring is the weak point. A small feature distance does not mean a task predicate holds. What decides progress is relational and physical: whether an object moved, whether a drawer or contact state changed, whether a placement predicate is satisfied, and whether a predicted future is reliable enough to execute. Feature similarity measures none of these, so the planner can select an action whose predicted features look right while the task does not advance.
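A toy case makes the mismatch concrete. Everything here is hypothetical and chosen for illustration: the two candidates, their 2D "features", the `cup_height` quantity, and the `placed_on_shelf` predicate are not from the paper.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    features: tuple     # predicted final features (toy 2D vector)
    cup_height: float   # hypothetical decoded quantity: height above the shelf, in metres

def feature_distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def placed_on_shelf(c, tol=0.01):
    """Hypothetical placement predicate: the cup rests within tol of the shelf."""
    return c.cup_height <= tol

goal = (1.0, 1.0)
hover = Candidate("hover", (0.98, 1.01), cup_height=0.05)  # looks like the goal, cup still in the air
place = Candidate("place", (0.80, 1.10), cup_height=0.00)  # looks less like the goal, cup is down

# Feature-distance scoring prefers the candidate that only looks right.
by_feature = min([hover, place], key=lambda c: feature_distance(c.features, goal))
# Predicate scoring prefers the candidate where the event actually happened.
by_event = max([hover, place], key=placed_on_shelf)
```

Here `by_feature` is the hovering candidate and `by_event` is the placed one, which is the failure mode the section describes.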

The verifier: from feature distance to event predicates

EV-WM adds a verification layer on top of the same frozen-encoder feature world model and does not retrain that model. It works in three steps:

1. Imagine. Roll out candidate futures in feature space with the frozen encoder and the action-conditioned dynamics model — no pixel reconstruction.

2. Decode to an event state. An event predictor maps each imagined future, together with the task context, to a structured event state: binary event predicates (object and relation changes), continuous distances and margins, contact and progress quantities, a task-success flag, and an uncertainty term.

3. Score and gate. A verifier combines four terms into one score,

\[ S_{\text{EV-WM}} \;=\; S_{\text{task}} \;+\; \lambda_s\,S_{\text{semantic}} \;+\; \lambda_p\,S_{\text{physical}} \;-\; \lambda_u\,U \]

where the four terms are task progress, semantic consistency, physical feasibility, and an uncertainty penalty \(U\), and \(\lambda_s\), \(\lambda_p\), and \(\lambda_u\) weight the last three. This score guides CEM sampling, gates candidate actions through a conservative hybrid rule, and, in the contact-sensitive LIBERO wine-rack setting, selects among proposals from a residual PPO policy. The unit of judgment moves from "predicted features are near a goal" to "task predicates are satisfied".
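The scoring rule above can be written down directly. In the sketch below, `ev_wm_score` follows the formula term by term, while the `EventState` field names, the default weights, and `gate_top_k` are assumptions: the gate is a simple stand-in (an uncertainty cutoff followed by top-k selection), not the paper's conservative hybrid rule, whose details are not given here.

```python
from dataclasses import dataclass

@dataclass
class EventState:
    s_task: float       # task-progress term
    s_semantic: float   # semantic-consistency term
    s_physical: float   # physical-feasibility term
    uncertainty: float  # U, the predictor's uncertainty

def ev_wm_score(e, lam_s=1.0, lam_p=1.0, lam_u=1.0):
    """S = S_task + lam_s * S_semantic + lam_p * S_physical - lam_u * U."""
    return e.s_task + lam_s * e.s_semantic + lam_p * e.s_physical - lam_u * e.uncertainty

def gate_top_k(events, k=2, max_uncertainty=0.5):
    """Assumed gate: drop candidates that are too uncertain, keep the k best by score."""
    kept = [i for i, e in enumerate(events) if e.uncertainty <= max_uncertainty]
    kept.sort(key=lambda i: ev_wm_score(events[i]), reverse=True)
    return kept[:k]

events = [
    EventState(0.9, 0.8, 0.7, 0.10),  # solid on every term
    EventState(1.0, 1.0, 1.0, 0.90),  # high terms, but too uncertain to trust
    EventState(0.2, 0.5, 0.5, 0.05),  # confident, little progress
    EventState(0.6, 0.6, 0.9, 0.20),  # moderate progress, feasible
]
selected = gate_top_k(events)  # indices of the candidates that pass the gate
```

With these toy values the second candidate is rejected despite its high raw terms, because its uncertainty exceeds the cutoff; the gate returns candidates 0 and 3.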

Figure (original schematic): the EV-WM loop. The paper's Figure 1 gives the full framework diagram.

Training the verifier without manual labels

The event predictor is trained on rollout windows, and its supervision comes from the simulator rather than from human annotation. Object and relation predicates, progress margins, and task success are read directly from simulator state through check_success and predicate checks. Binary predicates are trained with cross-entropy, continuous distances with regression, and task success with binary classification. For LIBERO-goal, an additional ranked verifier is trained with a pairwise ranking loss, so that demonstration and successful windows rank above Gaussian-noise, zero-action, and shuffled windows.
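The three supervision types and the ranking objective can be sketched as plain loss functions. The concrete forms are assumptions: the text names cross-entropy, regression, and a pairwise ranking loss, and this sketch picks binary cross-entropy, squared error, and a logistic pairwise loss as common instances. The dictionary keys and the weights `w_dist` and `w_succ` are hypothetical.

```python
import math

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and a 0/1 label y."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def mse(pred, target):
    """Squared error for one continuous quantity (assumed regression loss)."""
    return (pred - target) ** 2

def pairwise_ranking_loss(s_pos, s_neg):
    """Logistic pairwise loss: small when the positive window outscores the negative one."""
    return math.log1p(math.exp(-(s_pos - s_neg)))

def event_loss(pred, label, w_dist=1.0, w_succ=1.0):
    """Sum of predicate, distance, and success losses for one rollout window."""
    pred_loss = sum(bce(p, y) for p, y in zip(pred["predicates"], label["predicates"]))
    dist_loss = sum(mse(d, t) for d, t in zip(pred["distances"], label["distances"]))
    succ_loss = bce(pred["success"], label["success"])
    return pred_loss + w_dist * dist_loss + w_succ * succ_loss

# Labels as they might be read from a simulator (values are made up).
label = {"predicates": [1, 0], "distances": [0.1], "success": 1}
good = {"predicates": [0.99, 0.01], "distances": [0.1], "success": 0.99}
bad = {"predicates": [0.10, 0.90], "distances": [0.5], "success": 0.20}
```

In a ranked-verifier setup of this kind, `s_pos` would be the score of a demonstration or successful window and `s_neg` the score of a Gaussian-noise, zero-action, or shuffled window.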

Does the verifier help?

The test is a controlled comparison: EV-WM (feature world model plus verifier) against DINO-WM (the same feature world model, scored by feature distance), with everything else held equal. Four settings:

Setting	What it is	DINO-WM → EV-WM
PointMaze	2D continuous navigation	random-state success 0.90 → 0.94
Wall-Single	obstacle-constrained control	0.88 → 0.92 (early-archive top-50: 0.95); mean state distance drops
Deformable	deformable-object manipulation	zero-init 0% → 94% with a nearest-latent start + conservative planner
LIBERO-goal	language-described manipulation	verifier AUC 0.99; wine-rack H=20 online top-2 hybrid 97/100

Three checks show the score tracks progress rather than appearance. A feature-weight ablation on PointMaze gives success of 0.84, 0.92, 0.94, and 0.88 for weights 1, 5, 10, and 20: the verifier contributes, and its weight has to be tuned. A planning-score sanity check shows demonstration windows scoring above Gaussian (0.998), zero-action (0.984), and shuffled (0.866) windows. And against an Oracle@32 baseline, which tries 32 candidates with access to the true outcome and keeps the actual best, EV-WM matches it by selecting the top-2 with the verifier alone.
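The oracle comparison can be illustrated on toy data. This sketch assumes one reading of the setup: the oracle succeeds on a trial if any candidate would truly succeed, and the verifier succeeds if any of its k best-scored candidates would. The trial data, the four-candidate size, and the function names are made up for the example and are not the paper's protocol or numbers.

```python
def oracle_at_n(scores, true_success):
    """Oracle: sees the true outcomes, so it succeeds if any candidate would succeed."""
    return any(true_success)

def verifier_top_k(scores, true_success, k=2):
    """Verifier: picks the k best-scored candidates without seeing true outcomes."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return any(true_success[i] for i in top)

def success_rate(trials, select):
    """Fraction of trials on which the selection rule ends in success."""
    return sum(select(scores, truth) for scores, truth in trials) / len(trials)

# Each trial: (verifier scores per candidate, true success per candidate).
trials = [
    ([0.9, 0.2, 0.1, 0.4], [True, False, False, False]),   # verifier ranks the winner first
    ([0.3, 0.8, 0.7, 0.1], [False, False, True, False]),   # winner is second-ranked
    ([0.5, 0.6, 0.2, 0.9], [False, False, True, False]),   # winner ranked last: verifier misses
    ([0.1, 0.2, 0.3, 0.4], [False, False, False, False]),  # no candidate succeeds
]

oracle_rate = success_rate(trials, oracle_at_n)
verifier_rate = success_rate(trials, verifier_top_k)
```

On this toy data the oracle reaches 0.75 and the verifier 0.5; the gap closes only when the verifier's ranking puts a truly successful candidate in its top k, which is what matching the oracle amounts to.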

Figure (original schematic): success rates from the paper's Tables 1, 3, and 4. The DINO-WM/zero-init bar for Deformable is 0%. EV-WM uses a verifier-guided planner on the same feature world model.

Limitations

Event labels come from the simulator. On a real robot there is no such ground-truth source; the paper states that deployment would need robust perception, a VLM labeler, or manual auditing, and the method is not validated on real hardware. The LIBERO online evaluation is a short H=20 window, not full episode-level autonomous execution. The residual-PPO proposal study covers a single wine-rack task. Finally, the added verification costs extra inference at every planning step.

References

EV-WM: Event-Verified World Models for Long-Horizon Robotic Manipulation (arXiv:2606.13053).
Feature-space world-model baseline: the DINO-WM line (world models on pre-trained visual features, with planning by feature distance).
A different trust signal on the offline side: NEUBAY uses long rollouts and a Bayesian posterior rather than a predicate verifier.
