WAV: verifying actions through forward-inverse asymmetry

June 2026 · world models · robot data collection · WAV (arXiv:2604.01985)

Predicting a full future from a state and an action is hard; inferring which action could have produced an observed change is often easier. WAV names this difficulty gap the forward-inverse asymmetry, and turns it into a way for a world model to find its own errors and decide which interactions to collect next.

Why a world model must be reliable off the expert path

Policy learning mostly cares about good actions: how to grasp the cup, how to place the block. A world model is queried over a far wider space, because planners, reinforcement learning, and policy evaluation all probe suboptimal, exploratory, and random actions. Those actions are exactly what action-labeled robot data underrepresents, and that data is expensive. So the operative question is: given many candidate actions, which one, once executed, would most expose the model's prediction errors and add the most information to training?

The common answer is uncertainty or learning progress. The difficulty is that those signals come from the current world model itself, and that model is least able to judge its own errors in exactly the unfamiliar regions that matter most. WAV instead changes the direction of verification.

The asymmetry: verifying is easier than predicting

Write an action-conditioned world model as \(\hat s_{t+1}=f_\theta(s_t,a_t)\). Direct forward prediction is hard on three counts: the output space is large (object positions, occlusions, contacts, background), the environment is stochastic, and the low-data regions are precisely where the model is weakest. WAV decomposes the prediction into two factors that can each be verified on their own:

\[ p(s_{t+1}\mid s_t,a_t)\ \propto\ \underbrace{p(s_{t+1}\mid s_t)}_{\text{state plausibility}}\ \cdot\ \underbrace{p(a_t\mid s_t,s_{t+1})}_{\text{action reachability}} \]

State plausibility asks whether a future looks like something the environment actually produces; it carries no action label, so it can be learned from large amounts of action-free video. Action reachability asks which action could have caused a given change; it is an inverse-dynamics question that ignores most visual detail. The paper gives two reasons these factors are more tractable than full forward prediction: action-free data is far more plentiful, and action-relevant features are lower-dimensional. WAV therefore uses a sparse inverse model that reads only a subset of state features.
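One way to see where the proportionality comes from is Bayes' rule applied to the action, with the current state held fixed. This is a sketch for orientation, not the paper's derivation; the normaliser \(p(a_t\mid s_t)\) and the summation variable \(s'\) do not appear in the text above and are introduced here only for illustration.

\[ p(s_{t+1}\mid s_t,a_t)\ =\ \frac{p(a_t\mid s_t,s_{t+1})\;p(s_{t+1}\mid s_t)}{p(a_t\mid s_t)},\qquad p(a_t\mid s_t)\ =\ \sum_{s'} p(a_t\mid s_t,s')\;p(s'\mid s_t) \]

For a fixed \(s_t\) and \(a_t\), the denominator does not depend on \(s_{t+1}\), so it can be absorbed into the proportionality constant. Note that \(p(s_{t+1}\mid s_t)\) here is the next-state distribution with the action marginalised out, which is why it can be estimated without action labels.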

[Figure: the WAV loop. state sₜ → video prior samples subgoals, p(s′ | s) → sparse inverse model infers an action, p(a | s, s′) → forward world model predicts the outcome, f_θ(s, a) → cycle-consistency discrepancy, dist(subgoal, forward rollout) → execute the top action → append the new transition → update the world model, repeat.]
Schematic (original) of the WAV loop; the paper's figure gives the full diagram. The forward model sits last, so candidates stay anchored to plausible, reachable futures.

The WAV loop

From the current state, WAV samples a set of plausible subgoals from the video prior, lets the sparse inverse model infer an action for each, and rolls those actions forward with the current world model. It then enforces cycle consistency among the proposed subgoals, the inferred actions, and the forward rollouts, and executes the action with the largest discrepancy — the one the model currently fails to follow. The resulting real transition is appended to the dataset. Putting the forward model last keeps candidate futures anchored to plausible, reachable states, rather than letting an unreliable forward model drift off the real state distribution. WAV does not remove the need for environment interaction; it changes how the data budget is spent, so each executed action is the one most likely to teach the model something.
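The selection step of the loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `wav_select` and every `toy_*` function are hypothetical names, the three models are passed in as plain callables, and Euclidean distance stands in for whatever discrepancy measure WAV actually uses. The toy forward model is deliberately wrong for leftward moves so that one candidate has a nonzero gap.

```python
import math

def wav_select(state, sample_subgoals, infer_action, forward_model):
    """Pick the candidate whose forward rollout disagrees most with its subgoal."""
    subgoals = sample_subgoals(state)                       # video prior: plausible s'
    actions = [infer_action(state, g) for g in subgoals]    # inverse model: a from (s, s')
    rollouts = [forward_model(state, a) for a in actions]   # forward model: f(s, a)
    # Assumed discrepancy: Euclidean distance between subgoal and rollout.
    gaps = [math.dist(g, r) for g, r in zip(subgoals, rollouts)]
    top = max(range(len(gaps)), key=gaps.__getitem__)
    return actions[top], subgoals[top], gaps[top]

# Hypothetical toy stand-ins on a 2-D plane.
def toy_prior(state):
    offsets = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0)]
    return [(state[0] + dx, state[1] + dy) for dx, dy in offsets]

def toy_inverse(state, subgoal):
    return (subgoal[0] - state[0], subgoal[1] - state[1])

def toy_forward(state, action):
    # Deliberately flawed: underestimates leftward motion by half.
    scale = 0.5 if action[0] < 0 else 1.0
    return (state[0] + scale * action[0], state[1] + scale * action[1])

def toy_env_step(state, action):
    # Ground-truth dynamics used only when the action is really executed.
    return (state[0] + action[0], state[1] + action[1])

state = (0.0, 0.0)
dataset = []
action, subgoal, gap = wav_select(state, toy_prior, toy_inverse, toy_forward)
next_state = toy_env_step(state, action)
dataset.append((state, action, next_state))   # new transition for retraining
print(action, gap)
```

In this toy the leftward action is selected, because it is the only candidate where the forward rollout misses the proposed subgoal. Retraining the forward model on `dataset` and repeating would close the loop; that update step is omitted here.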

Evidence

| Study | What it tests | Result |
| --- | --- | --- |
| MiniGrid robustness | Is the sparse inverse model easier to learn and more robust than forward prediction? | With fewer samples, more objects, and noisy floors, the sparse inverse model infers actions more stably, which supports the asymmetry. |
| MiniGrid active learning | Does selecting transitions with WAV reduce world-model error faster? | WAV approaches an Oracle and beats Random, Uncertainty, Progress, and a vanilla IDM; it reaches rare but informative interaction actions. |
| RoboMimic / ManiSkill | Does it transfer to continuous-control manipulation? | On six tasks the paper reports 32-frame prediction MSE; WAV is the strongest non-Oracle method, with a larger margin in the low-data regime. |
| OOD adaptation | Can an existing world model adapt to new visuals/objects with little target data? | On RoboMimic Can shifts, ~200 target trajectories give lower prediction error and higher reward; policy improves by about 22%. |
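For readers unfamiliar with multi-frame prediction error, the following sketch shows one plausible reading of a 32-frame prediction MSE: an open-loop rollout in which the model is fed its own predictions, scored against the true trajectory. Whether the paper computes its metric exactly this way is an assumption; `rollout_mse`, `true_step`, and `biased_step` are hypothetical names for illustration.

```python
def rollout_mse(model_step, start_state, actions, true_states):
    """Mean squared error of an open-loop rollout against ground-truth states."""
    if len(actions) != len(true_states):
        raise ValueError("need one ground-truth state per action")
    state, total, count = start_state, 0.0, 0
    for action, target in zip(actions, true_states):
        state = model_step(state, action)   # the model sees only its own prediction
        total += sum((p - t) ** 2 for p, t in zip(state, target))
        count += len(target)
    return total / count

# Hypothetical 1-D dynamics: the true step, and a model with a small per-frame bias.
def true_step(state, action):
    return (state[0] + action[0],)

def biased_step(state, action):
    return (state[0] + action[0] + 0.1,)

horizon = 32
actions = [(1.0,)] * horizon
truth, s = [], (0.0,)
for a in actions:
    s = true_step(s, a)
    truth.append(s)

perfect = rollout_mse(true_step, (0.0,), actions, truth)
drift = rollout_mse(biased_step, (0.0,), actions, truth)
print(perfect, drift)
```

The biased model's error grows with each frame because the mistake is fed back in, which is why long-horizon error is a stricter test than one-step error.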

The headline across nine tasks over MiniGrid, RoboMimic, and ManiSkill: WAV reaches 2× world-model sample efficiency and improves downstream policy performance by over 22%. The caveat worth keeping in mind is that the evidence comes from simulation and benchmarks. WAV is a data-collection and world-model quality-control mechanism; it does not yet demonstrate long-horizon deployment on a real factory robot.

Limitations

WAV depends on a good subgoal generator: if the video prior proposes implausible futures, the inverse and forward checks inherit the error. The sparse inverse model assumes action effects are identifiable from a small set of features, which breaks down under complex contact, multi-object chain reactions, or when several actions produce the same visual change. And it is a self-improving framework that still needs environment interaction: the gain is making each real transition more worthwhile, not removing interaction.
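The last failure case, several actions producing the same change, is easy to reproduce in a toy setting. The sketch below is an illustration only, with hypothetical names (`step`, `inverse_posterior`) and an assumed three-cell corridor with walls at both ends; it estimates the inverse posterior by counting over uniformly collected transitions.

```python
from collections import Counter, defaultdict

ACTIONS = {"left": -1, "stay": 0, "right": 1}

def step(pos, action, size=3):
    """Move along a corridor of `size` cells; walls clamp the position."""
    return min(max(pos + ACTIONS[action], 0), size - 1)

def inverse_posterior(transitions):
    """Empirical p(a | s, s') from (s, a, s') triples, by counting."""
    counts = defaultdict(Counter)
    for s, a, s_next in transitions:
        counts[(s, s_next)][a] += 1
    return {
        pair: {a: n / sum(c.values()) for a, n in c.items()}
        for pair, c in counts.items()
    }

# One transition per (state, action) pair, i.e. uniform coverage.
data = [(s, a, step(s, a)) for s in range(3) for a in ACTIONS]
posterior = inverse_posterior(data)

print(posterior[(0, 1)])   # moving from cell 0 to cell 1 identifies the action
print(posterior[(0, 0)])   # staying at the wall is ambiguous between two actions
```

At the left wall, "left" and "stay" both leave the state unchanged, so no inverse model can tell them apart from the observed change alone. In such regions the inferred action is a guess, and the cycle-consistency signal built on it becomes correspondingly less reliable.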

References

  1. WAV: World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry (arXiv:2604.01985). Project page.
  2. Benchmarks: RoboMimic, ManiSkill.
