Five Interfaces of Long-Horizon World Models

June 2026

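Long-horizon failure can be decomposed by interface within the closed-loop system: how the planner uses the model, where the dynamics drift, how events and memory maintain task progress, how action abstraction changes task length, and how the world model enters evaluation and data loops.

Once a robot world model enters a closed-loop system, the object of study shifts from pure future prediction to an execution chain. The system derives a state from the current observation, generates a batch of candidate action sequences, rolls them out in the world model, ranks the candidates using reward, value, event, progress, or risk scores, and then executes the next short segment of actions. When a new observation arrives, the system plans again.

Long-horizon difficulties often appear at the interfaces of this chain. The planner selects the highest-scoring future in the model; the dynamics model accumulates error at contacts, occlusions, and subtask switches; the policy has to remember events that occurred in the past; the granularity of the action space determines the effective task length; and the world model may also be used to evaluate policies, generate data, or perform test-time filtering.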

A long-horizon robot system repeatedly turns observations into imagined futures, scores those futures, executes a short action chunk, and observes again. The five interfaces below describe where reliability can fail.
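
The observe, imagine, score, execute, and replan cycle can be sketched on a toy one-dimensional system. Everything here is an assumption made for illustration: `encode`, `model_step`, `env_step`, the 10% model bias, and the fixed grid of constant-action candidates are hypothetical and do not come from any of the works discussed.

```python
def encode(observation):
    # Assumption: in this toy setting the observation already is the state.
    return observation

def model_step(state, action):
    # Hypothetical learned dynamics that underestimates motion by 10%.
    return state + 0.9 * action

def env_step(state, action):
    # Toy stand-in for the real environment.
    return state + action

def imagine(state, actions):
    # Roll the action sequence out inside the model only.
    for action in actions:
        state = model_step(state, action)
    return state

def plan(state, goal, horizon=3):
    # Candidates: constant action sequences drawn from a small fixed grid.
    candidates = [[a] * horizon for a in (-1.0, -0.5, 0.0, 0.5, 1.0)]
    # Score each imagined future by negative distance to the goal.
    return max(candidates, key=lambda seq: -abs(imagine(state, seq) - goal))

def run_closed_loop(observation=0.0, goal=3.0, replans=8, chunk=1):
    trajectory = [observation]
    for _ in range(replans):
        state = encode(observation)
        best = plan(state, goal)
        for action in best[:chunk]:          # execute only a short chunk
            observation = env_step(observation, action)
        trajectory.append(observation)       # then observe and replan
    return trajectory

trajectory = run_closed_loop()
```

With these toy settings the loop stalls at 2.5, short of the goal at 3.0: every candidate the planner can imagine scores worse than standing still, although a single real step of 0.5 would reach the goal. The model error is small, yet the interface between candidate set, horizon, and score decides the outcome.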

The five interfaces

Interface	Question	Representative works
Planner use of the model	Are planner-selected trajectories more likely than ordinary rollouts to expose model weaknesses?	TD-MPC, GRASP, Closing the Train-Test Gap, WEAVER, Slot-MPC, Cosmos Policy
Dynamics / rollout fidelity	How far ahead can the model predict under contact, occlusion, object displacement, and task-stage changes?	PRISM-WM, VLWM, PointWorld, Slot-MPC, NEUBAY, Universal Horizon Models
Event / memory / progress	How does the system remember key events and judge how far the task has progressed?	EV-WM, Mem-World, KEMO, EventVLA, H-WM, LoHo-Manip, tau0-WM
Action abstraction / hierarchy	Does the task need a higher-level action granularity instead of step-by-step low-level actions?	Puppeteer, Psi-Zero, Cosmos Policy, Newt, LoHo-Manip, tau0-WM
Evaluation / data infrastructure	Can the world model evaluate policies, filter data, generate synthetic rollouts, or serve as benchmark infrastructure?	WEAVER, Mem-World, Cosmos 3, PointWorld, Newt, world-model surveys

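These five interfaces correspond to five classes of problems in the same closed-loop robot system: how the planner calls the model, how prediction error varies with state, how task progress is stored, how action granularity affects effective task length, and how the world model enters evaluation and data infrastructure.

Long-horizon research can therefore start from an interface and only then settle on a method name. Whether a direction is worth pursuing depends on whether it can define the failure clearly, design an experiment that exposes that failure, and use metrics to show that the change actually alters the closed-loop outcome.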

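Subset 1: Planner use of the model

This subset studies the distribution a world model sees when it is actually used. A model may look stable under passive prediction; once it is placed inside MPC, CEM, MPPI, or a gradient planner, the planner actively searches for high-score action sequences. The selected trajectories may land in regions where training-data coverage is weak and the model is least stable, yet reward or value looks high.

The problem it targets is specific: low error on random rollouts does not imply that the rollout the planner finally selects is reliable. In long tasks, the risk concentrates on the trajectory the planner selects.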

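Representative works

Work	Role in this subset
TD-MPC	Provides the classic interface of short-horizon latent MPC plus a terminal value: the model predicts only a short segment and the value function covers the rest.
GRASP	Points out that long-horizon gradient planning becomes hard because of serial rollout optimization, and changes the optimization interface with lifted-state planning.
Closing the Train-Test Gap	Explicitly treats planner-induced states and rollout distribution shift as failure sources in world-model planning.
WEAVER	Uses the world model for policy evaluation, policy improvement, and test-time planning, exposing the reliability problems that appear once the model is actually used.
Slot-MPC	Shows that a world model with good visual metrics is not necessarily one a planner can use well; the planner needs control-relevant representations.
Cosmos Policy	Connects a video world model to a policy, world model, and value stack, and selects actions with best-of-N planning.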

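Experiment shape

Experiments in this direction should not plot only average rollout error. A more informative approach is to save all candidate trajectories inside the planner, split them into three groups, and measure each group's outcome in the real environment, on replayed futures, or in a high-fidelity simulator.

Candidate group	What to measure
random candidates	Ordinary candidates before planner optimization, used as a control for the model's average predictive ability.
high-score rejected candidates	Candidates that score high in the model but are not finally selected, used to check whether reward/value has already started to overestimate.
selected candidate	The candidate the planner finally wants to execute, used to measure latent error, event error, value ranking error, and the real success gap.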
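
One possible way to turn the three groups into a number is to compare, per group, the score inside the model with the outcome measured outside it. The sketch below assumes a hypothetical planner log in which each candidate carries a `model_score` and a `measured` outcome on the same scale; the field names, the `top_k` cutoff, and all numbers are made up for illustration.

```python
from statistics import mean

def overestimation_by_group(initial_samples, final_population, top_k=2):
    # Rank the optimized population by model score; the top one is "selected".
    ranked = sorted(final_population, key=lambda c: c["model_score"], reverse=True)
    groups = {
        "random": initial_samples,
        "high_score_rejected": ranked[1:1 + top_k],
        "selected": ranked[:1],
    }
    # Gap = score inside the model minus outcome measured outside the model.
    return {
        name: mean([c["model_score"] - c["measured"] for c in members])
        for name, members in groups.items()
    }

# Made-up numbers, for illustration only.
initial_samples = [
    {"model_score": 0.2, "measured": 0.2},
    {"model_score": 0.4, "measured": 0.3},
]
final_population = [
    {"model_score": 0.9, "measured": 0.3},
    {"model_score": 0.8, "measured": 0.5},
    {"model_score": 0.7, "measured": 0.5},
    {"model_score": 0.5, "measured": 0.4},
]
gaps = overestimation_by_group(initial_samples, final_population)
```

If the gap grows from the random group to the rejected group to the selected candidate, as it does in this toy log, that pattern is the kind of evidence the experiment is looking for: the planner is concentrating on futures the model overrates.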

Possible research questions

Planner-selected rollouts are less trustworthy than random rollouts in contact-rich long-horizon manipulation.
A planner-aware trust score can reduce model exploitation without training a bigger world model.
Gradient planning and sampling-based MPC fail in different ways; long-horizon failure should be diagnosed by planner interface, not only by model architecture.

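Boundaries

This subset can produce a verifier, but also planner logging, planner-induced OOD measurement, train-test rollout distribution correction, or a redesigned optimizer interface. The core is to show that the way the planner calls the model changes the failure distribution.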

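Subset 2: Dynamics and rollout fidelity

This subset studies how far the model can be trusted in different physical states. Pre-contact states are relatively smooth; during contact, friction, collisions, grasp success or failure, and object displacement appear; once an object is occluded, the current observation is no longer enough to recover the full state. The same 10-step rollout varies in difficulty with the state.

The problem it targets is: under which states, contact modes, object configurations, or task stages does the world model's rollout fidelity drop abruptly? Long-horizon failure here depends on state and cannot be summarized by a single uniform time length.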

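Representative works

Work	Role in this subset
PRISM-WM	Treats oversmoothing of hybrid / contact dynamics as the core failure mode and uses mixture-of-experts dynamics to improve long-horizon latent fidelity.
VLWM	Argues that fixed one-step latent prediction accumulates error easily, and changes the prediction-length interface with a variable-length latent world model.
PointWorld	Predicts geometric changes of objects and scenes with 3D point flow, moving rollout fidelity from pixels to geometric motion.
Slot-MPC	Represents state with object slots so the planner attends directly to object state, avoiding the irrelevant burden of predicting the whole image.
NEUBAY	Discusses when and how far to trust the model from the perspective of adaptive horizon and Bayesian evaluation.
Universal Horizon Models	Attempts to model predictions at different horizons in a unified way, covering time scales beyond fixed-length predictors.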

Experiment shape

State group	What it exposes	Useful metrics
free-space motion	Baseline error under smooth dynamics.	latent error, reward/value error
pre-contact	Whether action selection is about to enter a high-risk state.	uncertainty, local sensitivity
contact / grasp / push	Hybrid dynamics, friction, and object displacement.	object pose error, point-flow error
occluded / out-of-view	The current observation lacks historical evidence.	event consistency, memory retrieval accuracy

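Metrics need to go beyond pixels. Metrics closer to control include latent error, object pose error, point-flow error, event error, value ranking error, and closed-loop success. The most useful plot is often a set of trust-horizon curves grouped by state category.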
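
A trust horizon can be defined in several ways. The sketch below assumes one simple definition: the number of consecutive rollout steps, counted from the start, whose error stays at or below a threshold. The error curves and the threshold are made-up values, and `trust_horizon` is a hypothetical helper, not a metric taken from the cited works.

```python
def trust_horizon(errors, threshold):
    # Count consecutive steps from the start with error <= threshold.
    horizon = 0
    for err in errors:
        if err > threshold:
            break
        horizon += 1
    return horizon

def trust_horizon_by_group(curves, threshold):
    # One trust horizon per state group, e.g. per contact regime.
    return {group: trust_horizon(errs, threshold) for group, errs in curves.items()}

# Made-up per-step rollout errors for three state groups.
curves = {
    "free_space": [0.01, 0.02, 0.03, 0.04, 0.05, 0.06],
    "pre_contact": [0.01, 0.03, 0.06, 0.12, 0.20, 0.30],
    "contact": [0.02, 0.08, 0.25, 0.50, 0.80, 1.00],
}
horizons = trust_horizon_by_group(curves, threshold=0.1)
# horizons == {"free_space": 6, "pre_contact": 3, "contact": 2}
```

Grouping before averaging is the point of the design: a single average over all three curves would hide the fact that the contact group loses trust after two steps.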

Possible research questions

State-dependent rollout fidelity: learned robot world models have different trust horizons across contact regimes.
Contact-aware world model evaluation predicts closed-loop planning failure better than average rollout error.
Object and point representations help planning because they preserve decision-relevant state, not because they generate prettier futures.

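Boundaries

This subset can start as pure diagnosis, without training a full policy up front. It is well suited to entry through reproduction stacks such as TD-MPC / PRISM-WM, because these methods already place latent dynamics, rollout horizon, and planning success in the same experimental framework.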

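Subset 3: Event, memory, and progress

A long task is made up of a sequence of events. Whether the cup has been picked up, whether the drawer has been opened, where the target object was before it was occluded, whether a subtask is already complete: this information often cannot be read directly from the current image.

The problem it targets is: a world model cannot only predict whether the next frame looks right; it must also answer whether key events have occurred, which historical evidence must be kept, and how far the current task has progressed. If the visual future looks plausible but the event order is wrong, the system still fails.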

Representative works

Work	Role in this subset
EV-WM	Connects event / predicate verification to planning, so that futures are scored by event progress as well as visual quality.
Mem-World	Handles wrist-view occlusion and camera motion, using geometry-indexed memory to keep long tasks consistent.
KEMO	Uses event-driven keyframe memory for memory-dependent long-horizon manipulation.
EventVLA	Treats transient event evidence as an object of long-task memory, complementing schemes that store only visual anchors.
H-WM	Uses a hierarchical logical world model and visual guidance to decompose task progress into explicit logical states.
LoHo-Manip	Uses a task manager to predict remaining subtasks and visual traces in service of receding-horizon execution.
tau0-WM	Puts action generation, video prediction, and progress scoring in one system and uses test-time compute to improve multi-stage tasks.

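Experiment shape

This class of experiments needs memory-dependent tasks. Tasks can go beyond single-step pick-and-place: let a key object appear briefly and then be occluded, and require the system to recover object identity, position, or event state after several subtasks.

Task design	Failure exposed	Useful metrics
Object appears briefly, then is occluded	Object persistence and memory retrieval.	object identity accuracy, pose error
Multiple intermediate states must be retained	Whether event memory has stored facts that already occurred.	event accuracy, stage completion
Subtask order must be correct	Whether the progress state can distinguish task stages.	progress score, subtask success
Current frame cannot distinguish task stages	Whether the model actually uses historical evidence.	memory ablation drop, final success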
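
Two of the metrics in the table are simple enough to sketch directly. The code below assumes that events are logged as string labels and episodes as boolean success flags; the helper names, event labels, and outcomes are hypothetical.

```python
def order_is_consistent(predicted_events, required_order):
    # True if the required events occur in the prediction in the same
    # relative order (other events may be interleaved between them).
    remaining = iter(predicted_events)
    return all(event in remaining for event in required_order)

def success_rate(outcomes):
    return sum(outcomes) / len(outcomes)

def memory_ablation_drop(with_memory, without_memory):
    # How much success is lost when the memory module is switched off.
    return success_rate(with_memory) - success_rate(without_memory)

# Made-up event sequences and episode outcomes.
plausible = ["open_drawer", "pick_cup", "place_cup"]
wrong_order = ["pick_cup", "open_drawer"]
ok = order_is_consistent(plausible, ["open_drawer", "place_cup"])
bad = order_is_consistent(wrong_order, ["open_drawer", "pick_cup"])
drop = memory_ablation_drop([True, True, True, False], [True, False, False, False])
```

A large drop suggests the policy really depends on stored evidence; a drop near zero suggests the task can be solved from the current frame and does not test memory at all.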

Possible research questions

Event memory is a smaller and more useful interface than full video memory for long-horizon robot manipulation.
Progress-state prediction can diagnose long-horizon failure earlier than pixel or latent error.
Keyframe selection should be event-driven, not frame-uniform, in memory-dependent manipulation.

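Boundaries

This subset leans toward VLA, manipulation, and real robots. It does not necessarily need MPC; it can also take the form of a policy wrapper, memory module, task manager, or progress verifier. Its difficulty is that tasks, data, annotation, and the real platform are all heavier.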

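Subset 4: Action abstraction and hierarchy

Long horizons can also come from action granularity. Emitting a low-level joint action at every step stretches the task into a very long sequence. Higher-level action chunks, subgoals, waypoints, or hierarchies can shorten the effective decision length, but they shift the problem to the low-level controller and to subgoal reachability.

The problem it targets is: the difficulty of a long task may lie in the action interface instead of the size of the world model. If the low-level action space is too fine, the planner has to reason over a very long sequence; if the high-level action space is too coarse, contact details may be lost.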

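Representative works

Work	Role in this subset
Puppeteer	Uses a low-level motion prior and a high-level visual puppeteering agent so that a humanoid performs visual whole-body control with natural motion.
Psi-Zero	Uses human video pretraining and humanoid action-expert post-training to turn humanoid loco-manipulation into an executable task.
Cosmos Policy	Fine-tunes a video foundation model into a policy, world model, and value model, and uses latent frame injection to support action generation and planning.
Newt	Demonstrates multitask scaling, language, demos, and open-loop horizon for a TD-MPC2-style agent on 200 continuous-control tasks.
LoHo-Manip	Uses a task manager for long-task decomposition and intermediate visual traces, while a low-level policy executes local actions.
tau0-WM	Uses progress scoring and test-time compute to select better multi-stage actions.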

Experiment shape

Experiments on this interface should compare different action interfaces, with model size serving only as a control variable.

Action interface	What it tests
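low-level action	Step-by-step actions are the most precise but give the longest effective horizon.
action chunk	Outputs a segment of actions at once, reducing the number of decisions but requiring continuity and error correction to be handled.
subgoal / waypoint	High-level planning is shorter but requires a low-level controller that can reach the target reliably.
event-conditioned action	Uses task events to guide action segments, testing whether actions stay aligned with task progress.
hierarchy	The high level selects goals; the low level keeps body motion natural and executable.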
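
The effect of granularity on effective decision length is plain arithmetic. The sketch below assumes a task of 600 low-level control steps and made-up granularities of 1, 16, and 100 steps per decision; none of these numbers come from the works above.

```python
import math

def num_decisions(task_steps, steps_per_decision):
    # How many times the high-level decision maker must choose
    # when each decision covers a fixed number of low-level steps.
    return math.ceil(task_steps / steps_per_decision)

# Hypothetical granularities: low-level steps covered by one decision.
interfaces = {"low_level": 1, "action_chunk": 16, "subgoal": 100}
decisions = {name: num_decisions(600, k) for name, k in interfaces.items()}
# decisions == {"low_level": 600, "action_chunk": 38, "subgoal": 6}
```

The count only measures how long the decision sequence is. It says nothing about whether a 100-step subgoal is reachable by the low-level controller, which is the cost the coarser interfaces take on.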

Beyond task success, metrics should include subtask success, action continuity, motion naturalness, human preference, latency, and the task stage at which failure occurs.

Possible research questions

Long-horizon failure can be reduced by changing action granularity before changing model scale.
Hierarchical action interfaces are useful when low-level rollout length, rather than perception, is the bottleneck.
Naturalness and success can diverge; humanoid long-horizon work needs both task metrics and motion-quality metrics.

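Boundaries

This subset is close to humanoid, whole-body control, and VLA work, but somewhat further from the model-based RL core of TD-MPC / PRISM-WM. It suits real-robot and industrial Physical AI directions, but is not necessarily the smallest feasible academic problem.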

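Subset 5: Evaluation and data infrastructure

The uses of a world model already extend beyond control itself. It can evaluate policies, generate synthetic data, filter test-time candidates, or construct benchmarks designed specifically to expose long-horizon failure.

The problem it targets is: if a world model is not yet good enough for direct closed-loop control, can it still deliver value as infrastructure? Examples include evaluating offline which policy is more likely to succeed, generating additional training data, filtering out dangerous actions at test time, or extending benchmarks from single-task success rate to long-horizon progress, event, memory, and geometry.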

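Representative works

Work	Role in this subset
WEAVER	Explicitly shows that a world model can do policy evaluation, policy improvement, and test-time planning.
Mem-World	Uses a memory-based world model for policy evaluation and synthetic data improvement.
Cosmos 3	An industrial-grade Physical AI backbone that places robotics action generation inside an omnimodal foundation model.
Cosmos Policy	Goes from a video foundation model to policy, world model, and value, showing a deployable interface on the industrial route.
Newt	Provides a multitask benchmark and a scaling narrative, emphasizing benchmark, data, and generalization.
PointWorld	Uses a large-scale 3D point-flow world model for control-relevant prediction and zero-shot MPC.
World-model surveys	Emphasize that world-model evaluation should look beyond video quality to control, planning, generalization, and reliability.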

Experiment shape

This direction should not stop at model metrics. At a minimum, world-model metrics must be linked to real robot or policy outcomes.

Experiment	What it proves
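policy evaluation correlation	Whether the world model's predicted success rate correlates with the real success rate.
synthetic data improvement	Whether the policy actually improves after training on world-model-generated data.
test-time candidate filtering	Whether the world model can filter out actions that would fail.
benchmark stress test	Whether the new benchmark exposes failures that old benchmarks cannot see.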
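
For the policy evaluation correlation row, one common choice is a rank correlation, since what matters is whether the world model orders policies the same way reality does. The sketch below hand-rolls Spearman's rho under the assumption of no tied values; the success rates for the four hypothetical policies are made up.

```python
def ranks(values):
    # Rank 1 = smallest value. Assumes no ties, for simplicity.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    # Spearman's rho via the squared rank-difference formula (no ties).
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Made-up success rates for four policies.
predicted = [0.9, 0.6, 0.7, 0.2]   # estimated inside the world model
real = [0.8, 0.5, 0.4, 0.1]        # measured on the robot or simulator
rho = spearman(predicted, real)    # approximately 0.8 for these values
```

Here the world model swaps the second and third policies, which lowers rho from 1.0 to about 0.8 even though every predicted rate is within 0.3 of the real one. Rank agreement and absolute accuracy are separate questions.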

Possible research questions

World model evaluation should be measured by policy-ranking correlation, not only video fidelity.
Synthetic rollouts are useful only when they preserve task events and contact-relevant state.
A long-horizon benchmark should expose memory, event progress, planner exploitation, and state-dependent drift separately.

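Boundaries

This subset is the closest to industry but also the easiest to let grow too large. For academic output, the problem has to be narrowed to a provable claim, such as policy ranking correlation, synthetic data being effective on a particular class of failure, or a benchmark that can separate two classes of older methods.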

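Relationships among the five subsets

This map can be read along the robot execution chain. The system first remembers the past, then imagines the future, then decides action granularity, then uses the model to select actions, and finally uses evaluation and data to improve the next iteration of the system.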

Execution position	Research subset
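Remember the past	Event / memory / progress
Imagine the future	Dynamics / rollout fidelity
Decide action granularity	Action abstraction / hierarchy
Select actions with the model	Planner use of the model
Evaluate and improve the system	Evaluation / data infrastructure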

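Where to start

The first priority is dynamics / rollout fidelity. It is closest to TD-MPC and PRISM-WM and the easiest to turn into a clean experiment. This direction can answer "when does a long task start to drift" without building a full robot system up front.

The second priority is the planner's use of the model. It is closer to the core of model-based control than pure prediction. The key experiment compares the selected candidate against ordinary rollouts, instead of plotting only a generic error curve.

The third priority is event / memory / progress. It is closer to real long-horizon manipulation and industrial VLA, but it needs tasks, data, annotation, and a robot platform behind it. With real-robot or imitation / VLA resources already in hand, this direction is highly valuable.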

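Where Verifier-MPC sits on this map

Verifier-MPC sits at the intersection of the planner interface, dynamics fidelity, and the event/progress interface. It works well as a first set of experiments because MPC explicitly produces candidate trajectories, rollout horizons, scores, and replanning decisions.

The larger object of study is how closed-loop agents can use imagined futures in a trustworthy way. A verifier can output trust horizon, event score, failure risk, reachability, coverage, or local sensitivity; MPC is only the first control interface to consume these signals.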
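
As one hypothetical way an MPC loop might consume two of these signals, the sketch below rejects candidates whose failure risk exceeds a limit and credits each remaining candidate only with the reward inside its trust horizon. The selection rule, the field names, and all numbers are assumptions for illustration; the text above does not specify how Verifier-MPC combines the signals.

```python
def select_with_verifier(candidates, max_risk=0.3):
    # Drop candidates the verifier considers too risky.
    trusted = [c for c in candidates if c["failure_risk"] <= max_risk]
    if not trusted:
        return None  # caller should fall back to a safe action or replan
    # Credit only the imagined reward that lies inside the trust horizon.
    def trusted_return(c):
        return sum(c["step_rewards"][:c["trust_horizon"]])
    return max(trusted, key=trusted_return)

# Made-up candidates with verifier outputs attached.
candidates = [
    {"name": "A", "step_rewards": [1, 1, 5, 5], "trust_horizon": 2, "failure_risk": 0.1},
    {"name": "B", "step_rewards": [1, 2, 1, 0], "trust_horizon": 4, "failure_risk": 0.2},
    {"name": "C", "step_rewards": [9, 9, 9, 9], "trust_horizon": 4, "failure_risk": 0.8},
]
chosen = select_with_verifier(candidates)
```

Without the verifier, C wins on raw imagined reward and A comes second. With it, C is rejected on risk, A keeps only the reward from its first two steps, and B is selected.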

Research direction

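A more complete question is: on the same learned robot world model, which signal among pixel error, latent error, uncertainty, coverage, event/progress, persistent-state inconsistency, and local sensitivity gives the earliest warning of real task failure, and once that signal is connected to a planner or failure monitor, does it change closed-loop success, replanning cost, or safety?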
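
"Earliest warning" can be made measurable as a lead time: the number of steps between a signal's first alarm and the observed failure. The sketch below assumes per-step signal traces, one alarm threshold per signal, and a known failure step; the traces, thresholds, and signal names are made up for illustration.

```python
def first_alarm(values, threshold):
    # Index of the first step at which the signal reaches its threshold.
    for step, value in enumerate(values):
        if value >= threshold:
            return step
    return None

def lead_times(signals, thresholds, failure_step):
    # Lead time = failure step minus first alarm step, counting only
    # alarms raised strictly before the failure. None means no warning.
    leads = {}
    for name, values in signals.items():
        alarm = first_alarm(values[:failure_step], thresholds[name])
        leads[name] = None if alarm is None else failure_step - alarm
    return leads

# Made-up traces for one episode that fails at step 5.
signals = {
    "pixel_error": [0.1, 0.1, 0.1, 0.2, 0.6, 0.9],
    "latent_error": [0.1, 0.2, 0.3, 0.6, 0.7, 0.9],
    "event_inconsistency": [0, 0, 1, 1, 1, 1],
    "uncertainty": [0.1, 0.1, 0.1, 0.1, 0.1, 0.9],
}
thresholds = {"pixel_error": 0.5, "latent_error": 0.5,
              "event_inconsistency": 1, "uncertainty": 0.5}
leads = lead_times(signals, thresholds, failure_step=5)
```

Lead time alone is not enough to rank signals, because a threshold low enough to alarm early will also alarm on episodes that succeed. A fair comparison would fix the false-alarm rate across signals first.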
