EV-WM: verifying events, not features
A feature-space world model can imagine future visual features, but feature similarity does not tell you whether the task actually progressed. EV-WM adds a verifier on top of such a model: it decodes each imagined future into task predicates and checks whether the event happened — whether the object moved, the contact state changed, the placement predicate holds — instead of whether the predicted features land near a goal.
Where feature-space planning breaks
A feature-space world model — the DINO-WM line — keeps a frozen visual encoder and trains an action-conditioned dynamics model that predicts future features. To act, the planner rolls out candidate action sequences in feature space and scores them by how close the predicted features land to a goal feature, typically with the cross-entropy method (CEM): sample action sequences, keep the high-scoring elites, resample, execute the first action, replan.
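The planning loop described above can be sketched in a few lines. This is a minimal illustration, not the DINO-WM implementation: `toy_dynamics` is a hypothetical stand-in for the learned action-conditioned dynamics model, features are short Python lists, and the hyperparameters (`horizon`, `samples`, `elites`, `iters`) are arbitrary assumptions.

```python
import random

def feature_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def toy_dynamics(feat, action):
    """Hypothetical stand-in for the learned dynamics: the action shifts the feature."""
    return [f + a for f, a in zip(feat, action)]

def rollout(feat, actions):
    """Roll an action sequence forward in feature space; return the final feature."""
    for action in actions:
        feat = toy_dynamics(feat, action)
    return feat

def cem_plan(start, goal, horizon=5, samples=64, elites=8, iters=10, seed=0):
    """Return the refined mean action sequence found by CEM."""
    rng = random.Random(seed)
    dim = len(start)
    mean = [[0.0] * dim for _ in range(horizon)]
    std = [[1.0] * dim for _ in range(horizon)]
    for _ in range(iters):
        # Sample candidate action sequences around the current mean.
        candidates = [
            [[rng.gauss(mean[t][d], std[t][d]) for d in range(dim)]
             for t in range(horizon)]
            for _ in range(samples)
        ]
        # Score by distance of the predicted final feature to the goal (lower is better).
        candidates.sort(key=lambda seq: feature_distance(rollout(start, seq), goal))
        top = candidates[:elites]
        # Refit the sampling distribution to the elites.
        for t in range(horizon):
            for d in range(dim):
                vals = [seq[t][d] for seq in top]
                m = sum(vals) / elites
                var = sum((v - m) ** 2 for v in vals) / elites
                mean[t][d] = m
                std[t][d] = max(var ** 0.5, 1e-3)
    return mean

start, goal = [0.0, 0.0], [2.0, -1.0]
plan = cem_plan(start, goal)
first_action = plan[0]  # execute only this action, then replan from the new observation
```

Note that the only signal the planner sees is `feature_distance`; nothing in the loop knows whether a task predicate changed.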
In long-horizon manipulation this scoring is the weak point. A small feature distance does not mean a task predicate holds. What decides progress is relational and physical: whether an object moved, whether a drawer or contact state changed, whether a placement predicate is satisfied, and whether a predicted future is reliable enough to execute. Feature similarity measures none of these, so the planner can select an action whose predicted features look right while the task does not advance.
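A toy example makes the mismatch concrete. The two futures, their feature vectors, and the `placed` flags below are invented for illustration only; they are not data from the paper.

```python
def feature_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

goal_feature = [1.0, 1.0, 1.0]

# Hypothetical imagined futures: a predicted feature plus the true placement predicate.
futures = {
    "hover_above_rack": {"feature": [0.95, 1.0, 0.9], "placed": False},
    "placed_but_occluded": {"feature": [0.6, 1.2, 0.7], "placed": True},
}

# Feature-distance scoring picks the future that merely looks like the goal.
by_distance = min(
    futures, key=lambda k: feature_distance(futures[k]["feature"], goal_feature)
)

# Predicate-based scoring keeps only futures in which the event actually happened.
by_predicate = [k for k, v in futures.items() if v["placed"]]
```

Here the distance-based choice is the hovering future, while the only future that satisfies the predicate is the other one.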
The verifier: from feature distance to event predicates
EV-WM adds a verification layer on top of the same frozen-encoder feature world model; the world model itself is not retrained. The pipeline has three steps:
1. Imagine. Roll out candidate futures in feature space with the frozen encoder and the action-conditioned dynamics model — no pixel reconstruction.
2. Decode to an event state. An event predictor maps each imagined future, together with the task context, to a structured event state: binary event predicates (object and relation changes), continuous distances and margins, contact and progress quantities, a task-success flag, and an uncertainty term.
3. Score and gate. A verifier combines four terms into one score: task progress, semantic consistency, and physical feasibility, minus uncertainty. This score guides CEM sampling, gates candidate actions through a conservative hybrid rule, and, in the contact-sensitive LIBERO wine-rack setting, selects among proposals from a residual PPO policy. The unit of judgment moves from "predicted features are near a goal" to "task predicates are satisfied".
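The scoring step can be sketched as follows. This is a simplified illustration under stated assumptions: `EventState` collapses the structured event state into one scalar per score term, the weights `w_prog`, `w_sem`, `w_feas`, and `w_unc` are hypothetical (the text does not give their values), and `select_top_k` is an illustrative helper rather than the paper's hybrid gating rule.

```python
from dataclasses import dataclass

@dataclass
class EventState:
    """Simplified event state; field names are assumptions for illustration."""
    progress: float      # task-progress estimate, assumed in [0, 1]
    semantic: float      # consistency with the task context, assumed in [0, 1]
    feasibility: float   # physical-feasibility estimate, assumed in [0, 1]
    uncertainty: float   # predictor uncertainty, assumed >= 0
    success: bool = False

def verifier_score(e, w_prog=1.0, w_sem=1.0, w_feas=1.0, w_unc=1.0):
    """Progress + semantic consistency + feasibility, minus uncertainty."""
    return (w_prog * e.progress + w_sem * e.semantic
            + w_feas * e.feasibility - w_unc * e.uncertainty)

def select_top_k(candidates, k=2):
    """Rank (name, EventState) pairs by verifier score and keep the best k."""
    ranked = sorted(candidates, key=lambda c: verifier_score(c[1]), reverse=True)
    return ranked[:k]

candidates = [
    ("a", EventState(0.90, 0.8, 0.9, 0.10)),
    ("b", EventState(0.95, 0.9, 0.9, 0.90)),  # looks good but highly uncertain
    ("c", EventState(0.50, 0.7, 0.8, 0.05)),
]
top2 = [name for name, _ in select_top_k(candidates, k=2)]
```

Candidate `b` has the highest raw progress but is dropped because its uncertainty penalty outweighs the gain, which is the behavior the subtraction term is meant to produce.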
Training the verifier without manual labels
The event predictor is trained on rollout windows, and its supervision comes from the simulator rather than from human annotation. Object and relation predicates, progress margins, and task success are read directly through check_success and predicate checks. Binary predicates use cross-entropy, continuous distances use regression, and success uses binary classification. For LIBERO-goal, an additional ranked verifier is trained with a pairwise ranking loss, so that demonstration and successful windows rank above Gaussian-noise, zero-action, and shuffled windows.
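The loss terms named above can be written down directly. The sketch below is an assumption-laden illustration: the text says only "regression" and "pairwise ranking loss", so squared error and the logistic pairwise form are choices made here, as are the equal weighting of terms and the dictionary layout of `pred` and `label`.

```python
import math

def binary_cross_entropy(p, y, eps=1e-7):
    """Cross-entropy for one predicted probability p and a 0/1 label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def squared_error(pred, target):
    """Regression loss for one continuous quantity (assumed form)."""
    return (pred - target) ** 2

def pairwise_ranking_loss(score_pos, score_neg):
    """Logistic pairwise loss log(1 + exp(-(s_pos - s_neg))), computed stably."""
    diff = score_pos - score_neg
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

def event_loss(pred, label):
    """Sum of per-head losses for one rollout window (equal weights assumed)."""
    loss = sum(binary_cross_entropy(p, y)
               for p, y in zip(pred["predicates"], label["predicates"]))
    loss += sum(squared_error(d, t)
                for d, t in zip(pred["distances"], label["distances"]))
    loss += binary_cross_entropy(pred["success"], label["success"])
    return loss

label = {"predicates": [1, 0], "distances": [0.2], "success": 1}
good = {"predicates": [0.9, 0.1], "distances": [0.25], "success": 0.95}
bad = {"predicates": [0.2, 0.8], "distances": [0.9], "success": 0.1}
```

In the ranked-verifier setting, `score_pos` would be the score of a demonstration or successful window and `score_neg` the score of a Gaussian-noise, zero-action, or shuffled window; the loss is small when the first exceeds the second by a margin.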
Does the verifier help?
The test is a controlled comparison: EV-WM (feature world model plus verifier) against DINO-WM (the same feature world model, scored by feature distance), with everything else held equal. Four settings:
| Setting | What it is | Result (DINO-WM → EV-WM unless noted) |
|---|---|---|
| PointMaze | 2D continuous navigation | random-state success 0.90 → 0.94 |
| Wall-Single | obstacle-constrained control | 0.88 → 0.92 (early-archive top-50: 0.95); mean state distance drops |
| Deformable | deformable-object manipulation | zero-init 0% → 94% with a nearest-latent start + conservative planner |
| LIBERO-goal | language-described manipulation | verifier AUC 0.99; wine-rack H=20 online top-2 hybrid 97/100 |
Three checks show the score tracks progress rather than appearance. A feature-weight ablation on PointMaze gives 0.84 / 0.92 / 0.94 / 0.88 for weights 1/5/10/20: the verifier contributes, but the result peaks at weight 10 and falls off on either side, so the weight has to be tuned. A planning-score sanity check shows demonstration windows scoring above Gaussian (0.998), zero-action (0.984), and shuffled (0.866) windows. And against an Oracle@32 baseline, which tries 32 candidates with access to the true outcome and keeps the one that is actually best, EV-WM matches it by selecting the top-2 with the verifier alone.
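One common way to quantify a sanity check like the second one is a pairwise win rate: the fraction of (demonstration, negative) pairs in which the demonstration window gets the higher score. Whether the figures reported above are exactly this statistic is not stated here, so treat the sketch as an assumption; the scores in the example are made up.

```python
def pairwise_win_rate(pos_scores, neg_scores):
    """Fraction of (pos, neg) pairs where pos scores higher; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical verifier scores for three demonstration and three shuffled windows.
demo_scores = [0.9, 0.8, 0.7]
shuffled_scores = [0.75, 0.2, 0.1]
rate = pairwise_win_rate(demo_scores, shuffled_scores)  # 8 of 9 pairs are won
```

A value near 1.0 means the score separates demonstrations from that negative type almost perfectly; a lower value, as for shuffled windows, means the negatives are harder to tell apart.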
Limitations
Event labels come from the simulator. On a real robot there is no such ground-truth source; the paper states that deployment would need robust perception, a VLM labeler, or manual auditing, and the method is not validated on real hardware. The LIBERO online evaluation is a short H=20 window, not full episode-level autonomous execution. The residual-PPO proposal study covers a single wine-rack task. Finally, the verifier adds inference cost to every planning step.
References
- EV-WM: Event-Verified World Models for Long-Horizon Robotic Manipulation (arXiv:2606.13053).
- Feature-space world-model baseline: the DINO-WM line (world models on pre-trained visual features, with planning by feature distance).
- A different trust signal on the offline side: NEUBAY uses long rollouts and a Bayesian posterior rather than a predicate verifier.