From Rollout to Context

June 2026

A research map of the long-horizon problem across robot world models, robot policies, and language-model agents.

Long horizon is becoming a shared phrase across robotics and language-model agents, but it does not name one technical object. In robot world models, the pressure comes from repeated prediction: a model is rolled forward under candidate actions, and small transition errors accumulate into wrong imagined futures. In robot policies, the pressure comes from embodied task execution: walking, reaching, grasping, moving, and recovering must survive many control steps and many changes of scene state. In coding and web agents, the pressure comes from context and state: specifications, files, tests, tool outputs, and partial decisions must remain coherent over a long sequence of actions.

The same phrase therefore points to several failure modes. A map is useful because it separates papers that directly study long-horizon reliability from papers that supply mechanisms a long-horizon system needs.

Working distinction

World-model horizon is measured in model steps. The failure mode is rollout drift, wrong event prediction, or a planner optimizing against a false future.

Robot-policy horizon is measured in control steps and task stages. The failure mode is losing progress, failing at contacts, or completing one subtask while making later subtasks impossible.

Agent horizon is measured in human task time or tool-use trajectories. The failure mode is context loss, state drift, missing tests, or a specification that no longer constrains the work.
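Rollout drift, the world-model failure mode above, can be made concrete with a toy example. The sketch below is not drawn from any of the surveyed papers: the scalar dynamics, the 1% coefficient error, and every function name are assumptions chosen for illustration. It compares the same imperfect model when it consumes its own predictions (open-loop rollout) and when it is re-grounded in the true state at every step.

```python
# Toy illustration of rollout drift. All dynamics and numbers are hypothetical.
# True system:   x' = 0.99 * x + a
# Learned model: x' = 1.00 * x + a   (a 1% error in one coefficient)

def true_step(x: float, a: float) -> float:
    return 0.99 * x + a

def model_step(x: float, a: float) -> float:
    return 1.00 * x + a

def open_loop_error(x0: float, actions: list[float]) -> list[float]:
    """Roll both systems forward; the model is fed its own predictions."""
    x_true, x_model, errors = x0, x0, []
    for a in actions:
        x_true = true_step(x_true, a)
        x_model = model_step(x_model, a)      # prediction becomes the next input
        errors.append(abs(x_model - x_true))
    return errors

def one_step_error(x0: float, actions: list[float]) -> list[float]:
    """Same model, but every prediction starts from the true state."""
    x_true, errors = x0, []
    for a in actions:
        predicted = model_step(x_true, a)
        x_true = true_step(x_true, a)
        errors.append(abs(predicted - x_true))
    return errors

actions = [0.1] * 100
drift = open_loop_error(1.0, actions)   # grows with the horizon
local = one_step_error(1.0, actions)    # stays small at every step
print(f"step 1:   open-loop {drift[0]:.3f}, one-step {local[0]:.3f}")
print(f"step 100: open-loop {drift[-1]:.3f}, one-step {local[-1]:.3f}")
```

Under these assumed numbers the one-step error stays below 0.1 while the open-loop error keeps growing with the horizon, which is the sense in which a planner can end up optimizing against a false future.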

The first split

Three levels of relevance help keep the literature clean. Some works directly study long-horizon prediction, planning, or execution. Some works are adjacent: they provide future prediction, action chunks, data, retargeting, reward priors, or memory. Some works are background: they establish the vocabulary or the interface that newer systems build on.

Group	Representative works	What long horizon means
Robot world models	PRISM-WM, GRASP, WEAVER, EV-WM, H-WM, World-Ego Modeling	The model is repeatedly queried by a planner. Prediction error, event mismatch, or wrong state factorization changes action selection.
Robot policies and VLAs	Psi-Zero, VIRAL, EgoHumanoid, ULTRA, MEM, DexDrummer, FlowDPG	The robot must complete multi-stage embodied tasks under perception, contact, latency, and partial memory.
Enabling components	Cosmos Policy, A2A, LAPO, HumanNet, GMR, SPIDER, SMP, AnySkill	The work supplies a mechanism a long-horizon system may need: future-state prediction, action generation, human video, retargeting, or reward priors.
Language and coding agents	OpenAI Codex, Anthropic long-running agents, Kimi K2.6, Mage, Odysseys, SWE-Marathon	The agent must preserve task state across context windows, tool calls, tests, and multi-session work.

Robot world models

The central problem is whether imagined futures remain useful to a planner.

PRISM-WM, from Beijing Institute of Technology, Peking University, and BIGAI/NLCo Lab, is the most direct fit for this problem. It studies hybrid physical systems where continuous motion is interrupted by contacts and impacts. A monolithic latent transition model can smooth over boundaries such as sticking versus sliding or flight versus stance. For a planner, this creates compounding errors during long-horizon lookahead. PRISM-WM changes the transition function into a context-aware mixture of experts with latent orthogonalization, so different local dynamics can be represented separately.
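The idea of routing between local dynamics can be sketched on a bouncing ball, a minimal hybrid system with a flight mode and a contact mode. This is not PRISM-WM's architecture: the experts and the gate here are hand-written rather than learned, the state is a raw (height, velocity) pair rather than a latent, latent orthogonalization is not shown, and all names and constants are assumptions.

```python
import math

DT, GRAVITY, RESTITUTION = 0.01, 9.81, 0.8   # hypothetical constants

def flight_expert(h: float, v: float) -> tuple[float, float]:
    """Smooth ballistic motion."""
    return h + v * DT, v - GRAVITY * DT

def contact_expert(h: float, v: float) -> tuple[float, float]:
    """Impact with the floor: reset height, reflect velocity with energy loss."""
    return 0.0, -RESTITUTION * v

def gate(h: float, v: float, sharpness: float = 200.0) -> tuple[float, float]:
    """Soft weights (flight, contact) computed from a hand-written context feature."""
    penetration = -(h + v * DT)          # > 0 if a flight step would cross the floor
    z = max(-60.0, min(60.0, sharpness * penetration))   # clamp so exp() stays finite
    w_contact = 1.0 / (1.0 + math.exp(-z))
    return 1.0 - w_contact, w_contact

def moe_step(h: float, v: float) -> tuple[float, float]:
    """Transition as a gate-weighted mixture of the expert predictions."""
    w_flight, w_contact = gate(h, v)
    hf, vf = flight_expert(h, v)
    hc, vc = contact_expert(h, v)
    return w_flight * hf + w_contact * hc, w_flight * vf + w_contact * vc

def rollout(h: float, v: float, steps: int) -> list[tuple[float, float]]:
    states = [(h, v)]
    for _ in range(steps):
        h, v = moe_step(h, v)
        states.append((h, v))
    return states

trajectory = rollout(1.0, 0.0, 100)      # drop from height 1.0, simulate 1 s
print(min(h for h, _ in trajectory), max(v for _, v in trajectory))
```

Far from the floor the gate selects the flight expert almost exclusively, and near impact it switches to the contact expert, so the velocity reversal is represented as its own regime instead of being averaged into one smooth transition.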

GRASP approaches the same pressure from the planner side. It uses learned dynamics for gradient-based planning and adds virtual states, stochastic exploration, and gradient reshaping. The object is not only to predict a future, but to make the gradients through that future useful enough for control.
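A minimal sketch of gradient-based planning through a dynamics model may help fix the idea. It is not GRASP: the scalar linear model stands in for learned dynamics, virtual states and gradient reshaping are not implemented, and the annealed Gaussian noise is only a crude stand-in for stochastic exploration. All names and constants are assumptions.

```python
import random

A, B = 0.9, 0.5        # hypothetical stand-in for learned dynamics: s' = A*s + B*u

def rollout(s0: float, actions: list[float]) -> list[float]:
    states = [s0]
    for u in actions:
        states.append(A * states[-1] + B * u)
    return states

def cost(s0: float, actions: list[float], goal: float, reg: float = 1e-3) -> float:
    """Squared terminal error plus a small penalty on action magnitude."""
    s_final = rollout(s0, actions)[-1]
    return (s_final - goal) ** 2 + reg * sum(u * u for u in actions)

def cost_gradient(s0: float, actions: list[float], goal: float,
                  reg: float = 1e-3) -> list[float]:
    """Backpropagate the terminal cost through the rollout, last step first."""
    s_final = rollout(s0, actions)[-1]
    adjoint = 2.0 * (s_final - goal)               # d cost / d s_T
    grads = [0.0] * len(actions)
    for t in reversed(range(len(actions))):
        grads[t] = adjoint * B + 2.0 * reg * actions[t]
        adjoint *= A                               # d s_{t+1} / d s_t = A
    return grads

def plan(s0: float, goal: float, horizon: int = 20, iters: int = 200,
         lr: float = 0.1, noise: float = 0.05, seed: int = 0) -> list[float]:
    rng = random.Random(seed)
    actions = [0.0] * horizon
    for i in range(iters):
        grads = cost_gradient(s0, actions, goal)
        scale = noise * (1.0 - i / iters)          # anneal exploration noise
        actions = [u - lr * g + rng.gauss(0.0, scale)
                   for u, g in zip(actions, grads)]
    return actions

planned = plan(s0=0.0, goal=1.0)
print(cost(0.0, [0.0] * 20, 1.0), cost(0.0, planned, 1.0))
```

In this convex toy problem the noise is unnecessary. It is included only to show where exploration would enter when the learned dynamics make the planning objective non-convex.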

WEAVER pushes manipulation world models toward high fidelity, fast generation, and longer context. Its promise is practical: a world model should evaluate policies, improve them, and plan with them. EV-WM adds another useful constraint: visual futures should be checked against predicates and task events. A future that looks plausible can still be useless if the required event did not happen.
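The event check can be illustrated with a small predicate filter over predicted futures. This is a sketch of the general idea, not EV-WM's method: it assumes some readout already converts each predicted frame into symbolic values, and the state keys, threshold, and function names are hypothetical.

```python
from typing import Callable

State = dict[str, float]          # hypothetical symbolic readout of one predicted frame
Predicate = Callable[[State], bool]

def lifted(state: State) -> bool:
    """Task event: the gripper is closed and the object is off the table."""
    return state["gripper_closed"] > 0.5 and state["object_height"] > 0.05

def verify_future(trajectory: list[State],
                  required: dict[str, Predicate]) -> dict[str, bool]:
    """For each required event, report whether it occurs anywhere in the future."""
    return {name: any(pred(s) for s in trajectory)
            for name, pred in required.items()}

def accept(trajectory: list[State], required: dict[str, Predicate]) -> bool:
    """Keep a predicted future only if every required event occurs."""
    return all(verify_future(trajectory, required).values())

required = {"lifted": lifted}

# The gripper closes, but the object never leaves the table.
plausible_but_useless = [
    {"gripper_closed": 0.0, "object_height": 0.0},
    {"gripper_closed": 1.0, "object_height": 0.0},
    {"gripper_closed": 1.0, "object_height": 0.0},
]
# The gripper closes and the object rises.
successful = [
    {"gripper_closed": 0.0, "object_height": 0.0},
    {"gripper_closed": 1.0, "object_height": 0.0},
    {"gripper_closed": 1.0, "object_height": 0.12},
]

print(accept(plausible_but_useless, required), accept(successful, required))
```

Both futures could look reasonable frame by frame. Only the second one contains the event the task requires, so only that one would be passed on to a planner.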

Works such as H-WM, tau0-WM, AR Forcing, World-Ego Modeling, Lifting Embodied World Models, and Slot-MPC are variations on the same theme. They move some part of the planning problem into a higher-level state, a video-action sequence, a stabilized autoregressive rollout, a world/ego factorization, a waypoint space, or an object-centric representation.

Work	Organization signal	Date	Main object
PRISM-WM	BIT, Peking University, BIGAI/NLCo Lab	2026-05	Compositional latent dynamics for hybrid/contact-rich systems.
GRASP	UC Berkeley / BAIR	2026-04	Gradient-based planning through learned dynamics.
WEAVER	CMU / Mila signal	2026-06	Manipulation world model for evaluation, improvement, and planning.
EV-WM	Academic arXiv line	2026-06	World model with event and predicate verification.
World-Ego Modeling	Multi-institution arXiv paper	2026-05	Separates persistent world evolution from robot-centric ego dynamics.

Robot policies and embodied memory

These papers study long embodied execution more than long model rollout.

Psi-Zero is a humanoid VLA paper. It trains from egocentric human video for task semantics, then uses real humanoid data to train a flow-based joint-space action expert. Its long-horizon problem is task execution: a humanoid has to coordinate perception, locomotion, manipulation, and action chunking across thousands of low-level control steps.

VIRAL is a visual sim-to-real robot-policy system from NVIDIA, CMU, UC Berkeley, and CUHK. It trains a privileged teacher and an RGB student in simulation, then deploys on a Unitree G1 for sustained loco-manipulation loops. Its long-horizon claim is embodied deployment under visual perception, not learned-world-model rollout reliability.

EgoHumanoid addresses the data side of the same problem. It uses robot-free egocentric human demonstrations and aligns human view/action data to a humanoid policy. The long-horizon value comes from diverse multi-stage human demonstrations in real environments: the policy can see more objects, scenes, lighting, and task sequences than a lab-only teleoperation set would provide.

MEM makes the memory problem explicit. It combines short-term visual memory with long-term language memory so a robot can keep task-relevant history over longer interactions. This is closer to the language-agent version of long horizon, but it happens inside a robot policy.
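One way to picture the two memory tiers is a bounded buffer of recent frames next to an append-only log of language notes. The sketch below is an assumption about the general shape of such a memory, not MEM's design: the class name, the capacity, and the use of plain strings as frames are all hypothetical.

```python
from collections import deque

class TwoTierMemory:
    """Short-term buffer of recent frames plus a long-term log of language notes."""

    def __init__(self, short_term_capacity: int = 4) -> None:
        # deque with maxlen silently drops the oldest frame when full
        self.recent_frames: deque = deque(maxlen=short_term_capacity)
        self.language_log: list[str] = []

    def observe(self, frame) -> None:
        self.recent_frames.append(frame)

    def note(self, text: str) -> None:
        self.language_log.append(text)

    def context(self) -> dict:
        """What a policy would condition on at the current step."""
        return {"recent_frames": list(self.recent_frames),
                "history": list(self.language_log)}

memory = TwoTierMemory(short_term_capacity=4)
for t in range(6):
    memory.observe(f"frame_{t}")          # strings stand in for images
    if t == 1:
        memory.note("picked up the cup")
    if t == 4:
        memory.note("placed the cup in the sink")

ctx = memory.context()
print(ctx["recent_frames"], ctx["history"])
```

The early frames are gone from the short-term buffer, but the fact that the cup was picked up survives in the language log, which is the kind of task-relevant history a longer interaction needs.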

Adjacent mechanisms

Cosmos Policy is adjacent to long-horizon world modeling rather than a pure long-horizon reliability paper. It fine-tunes a video diffusion model so the same model can output actions, future states, and values. The planning interface matters for long horizon, because imagined futures and value estimates can rank candidate actions, but the paper's main object is the interface that turns a video model into a robot policy/world/value model.
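The ranking step can be sketched independently of any particular model. In the example below, `predict_future` and `estimate_value` are hypothetical stand-ins passed in as plain functions. In Cosmos Policy a single fine-tuned video model plays both roles, which this toy does not attempt to reproduce.

```python
from typing import Any, Callable, Sequence

def rank_candidates(state: Any,
                    candidates: Sequence[Any],
                    predict_future: Callable[[Any, Any], Any],
                    estimate_value: Callable[[Any], float]) -> list[tuple[float, Any]]:
    """Score each candidate action by the value of its imagined future, best first."""
    scored = [(estimate_value(predict_future(state, action)), action)
              for action in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

# Toy stand-ins: the state is a 1-D position and the goal is position 1.0.
def toy_future(position: float, displacement: float) -> float:
    return position + displacement

def toy_value(future_position: float) -> float:
    return -abs(future_position - 1.0)

ranked = rank_candidates(0.0, [-0.5, 0.2, 0.9, 1.5], toy_future, toy_value)
best_value, best_action = ranked[0]
print(best_action, [action for _, action in ranked])
```

The ranking is only as good as the imagined futures and value estimates behind it, which is why this interface matters for long horizon even when the paper's main object lies elsewhere.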

A2A attacks a different bottleneck: generative action policies are often too slow for real-time control when they start from random noise. A2A starts generation from historical proprioceptive action latents. LAPO and HumanNet address data scaling from videos. GMR and SPIDER address retargeting and physics feasibility. SMP and AnySkill provide reusable motion or semantic reward priors.

These mechanisms matter because long-horizon systems fail at their interfaces. The future model may drift, the action generator may be too slow, the demonstration may be physically infeasible, the reward may be too sparse, or the policy may forget what already happened. Long horizon turns interface flaws into task failures.

Language and coding agents

The agent side gives the same phrase a different measurement. OpenAI's Codex long-horizon guidance focuses on persistent software work: repository setup, AGENTS.md, continuation, verification, and context management. Anthropic's writing on long-running agents emphasizes harnesses: progress files, tests, handoff artifacts, and external state. Kimi K2.6 is an industrial model signal around long tool-call and long coding trajectories.

Mage gives the cleanest research formulation: memory should be treated as execution-state management. A long task is not only a set of facts to retrieve. It is a branching state of constraints, decisions, progress, and errors that must be maintained and revised.
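Treating memory as execution state, combined with the file-based handoff that the harness writing describes, might look like the sketch below. It is not Mage's formulation or any vendor's harness: the field names, the `progress.json` file name, and the example task are assumptions, and the branching structure mentioned above is flattened into simple lists.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path

@dataclass
class ExecutionState:
    constraints: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)
    completed: list[str] = field(default_factory=list)
    open_errors: list[str] = field(default_factory=list)

    def resolve(self, error: str, decision: str) -> None:
        """Revise the state: close an error and record the decision that fixed it."""
        self.open_errors.remove(error)
        self.decisions.append(decision)

    def save(self, path: Path) -> None:
        path.write_text(json.dumps(asdict(self), indent=2), encoding="utf-8")

    @classmethod
    def load(cls, path: Path) -> "ExecutionState":
        return cls(**json.loads(path.read_text(encoding="utf-8")))

# Session 1: work on the task and revise the state as things change.
state = ExecutionState(
    constraints=["public API must not change"],
    completed=["reproduce failing test"],
    open_errors=["test_parse fails on empty input"],
)
state.resolve("test_parse fails on empty input", "return [] for empty input")

# Session 2: resume from the handoff file rather than from conversation history.
with tempfile.TemporaryDirectory() as tmp:
    handoff = Path(tmp) / "progress.json"
    state.save(handoff)
    resumed = ExecutionState.load(handoff)

print(resumed.constraints, resumed.open_errors)
```

The constraint recorded in the first session still binds the second one because it lives in external state, not in a context window that may have been truncated.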

The parallel to world models is useful, but limited. A coding agent loses reliability because context, tests, and task state degrade over a long tool trajectory. A robot world model loses reliability because predicted states become inputs to later predictions. Both are long-horizon problems. The unit, failure mode, and repair mechanism differ.

What this map is for

The most relevant line for robot world models is not the broad slogan of long horizon. It is the narrower question of how a model remains useful under repeated closed-loop use: when a planner asks it to imagine futures, when a robot has to use those futures to choose actions, and when small state or event errors change the whole trajectory. PRISM-WM, GRASP, WEAVER, EV-WM, and related hierarchical or object-centric models sit closest to that question.

Robot-policy papers such as Psi-Zero, VIRAL, EgoHumanoid, and MEM are still important. They show the deployment pressure: real embodied tasks are long before the world model is perfect. They also reveal the missing interfaces that a useful robot world model will eventually have to meet: memory, data, action chunks, retargeting, reward, and closed-loop correction.


From Rollout to Context

June 2026

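A research map of the long-horizon problem: what the long-task problem means in world models, robot policies, and language agents.

Long horizon now appears in discussions of both robotics and language-model agents, but it does not point to a single technical object. In robot world models, the pressure comes from repeated prediction: the model is rolled forward under candidate actions, and small transition errors accumulate into a wrong imagined future. In robot policies, the pressure comes from embodied task execution: walking, reaching, grasping, moving, and recovering balance have to span many control steps and many scene states. In coding agents and web agents, the pressure comes from context and state: requirements, files, tests, tool outputs, and intermediate decisions have to stay consistent over a long action sequence.

The same term therefore covers several different failure modes. Only after separating these problems can one judge whether a paper studies long horizon directly or supplies a component that a long-task system needs.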

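Working distinction

World-model horizon is measured in model timesteps. The failure mode is rollout drift, wrong event prediction, or a planner optimizing actions against a wrong future.

Robot-policy horizon is measured in control steps and task stages. The failure mode is lost task progress, contact failure, or finishing the current subtask in a way that makes later subtasks impossible to execute.

Agent horizon is measured in human task time or tool-call trajectories. The failure mode is context loss, state drift, missed tests, or an original requirement that no longer constrains later work.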

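The first split

There are three levels of relevance here. First-level works directly study long rollouts, long task execution, closed-loop planning, or long-trajectory evaluation. Second-level works supply mechanisms that a long-task system needs, such as future prediction, action chunks, data, retargeting, reward priors, or memory. Third-level works supply the basic grammar, such as MPC, latent dynamics, motion priors, or action-free pretraining.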

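Group	Representative works	What long horizon means here
Robot world models	PRISM-WM, GRASP, WEAVER, EV-WM, H-WM, World-Ego Modeling	The model is called repeatedly by a planner. Prediction error, event inconsistency, or a wrong state representation changes action selection.
Robot policies and VLAs	Psi-Zero, VIRAL, EgoHumanoid, ULTRA, MEM, DexDrummer, FlowDPG	The robot must complete multi-stage embodied tasks under perception, contact, latency, and partial memory.
Enabling components	Cosmos Policy, A2A, LAPO, HumanNet, GMR, SPIDER, SMP, AnySkill	These works supply components a long-task system may need: future-state prediction, action generation, human video, retargeting, or reward priors.
Language and coding agents	OpenAI Codex, Anthropic long-running agents, Kimi K2.6, Mage, Odysseys, SWE-Marathon	The agent must preserve task state across context windows, tool calls, tests, and multi-session work.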

Robot world models

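The core question is whether imagined futures can keep serving the planner.

PRISM-WM fits this question most closely. It comes from Beijing Institute of Technology, Peking University, and BIGAI/NLCo Lab, and it studies hybrid physical systems, where continuous motion is interrupted by contact and impact. A single latent transition model tends to smooth away boundaries such as sticking/sliding and flight/stance. When a planner performs long-horizon lookahead, this smoothing turns into compounding error. PRISM-WM changes the transition function into a context-aware mixture of experts and uses latent orthogonalization so that different local dynamics are represented separately.

GRASP handles the same kind of pressure from the planner side. It uses learned dynamics for gradient-based planning and adds virtual states, stochastic exploration, and gradient reshaping. The object here includes predicting the future, and also making the gradients through the future trajectory usable enough for control.

WEAVER pushes manipulation world models toward higher fidelity, faster generation, and longer context, for policy evaluation, improvement, and planning. EV-WM adds a very useful constraint: visual futures must be verifiable against predicates and task events. A future can look plausible, but if the target event did not happen, it is useless for the task.

H-WM, tau0-WM, AR Forcing, World-Ego Modeling, Lifting Embodied World Models, and Slot-MPC all rewrite the same kind of problem. They move part of long-horizon planning into a higher-level state, a video-action sequence, a stabilized autoregressive rollout, a world/ego factorization, a waypoint space, or an object-centric representation.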

Work	Organization signal	Date	Main object
PRISM-WM	BIT, Peking University, BIGAI/NLCo Lab	2026-05	Compositional latent dynamics for hybrid/contact-rich systems.
GRASP	UC Berkeley / BAIR	2026-04	Gradient-based planning through learned dynamics.
WEAVER	CMU / Mila signal	2026-06	Manipulation world model for evaluation, improvement, and planning.
EV-WM	Academic arXiv line	2026-06	World model with event and predicate verification.
World-Ego Modeling	Multi-institution arXiv paper	2026-05	Separates persistent world evolution from robot-centric ego dynamics.

Robot policies and embodied memory

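These works mainly study long embodied execution; long model rollout appears only as adjacent background.

Psi-Zero is a humanoid VLA paper. It learns task semantics from egocentric human video, then trains a flow-based joint-space action expert on real humanoid data. Its long-horizon problem is task execution: the humanoid has to coordinate perception, locomotion, manipulation, and action chunking across several thousand low-level control steps.

VIRAL is a visual sim-to-real robot-policy system from NVIDIA, CMU, UC Berkeley, and CUHK. It trains a privileged teacher and an RGB student in simulation, then deploys on a Unitree G1 to complete continuous loco-manipulation loops. Its long-horizon evidence comes from embodied deployment and sustained execution under visual perception.

EgoHumanoid handles the data side of the same kind of problem. It uses robot-free egocentric human demonstrations and aligns human view/action data to a humanoid policy. Its long-horizon value comes from multi-stage human demonstrations in real environments: the model can see richer objects, scenes, lighting, and task orderings than lab teleoperation provides.

MEM makes the memory problem explicit. It combines short-term visual memory with long-term language memory so that a robot retains task-relevant history over longer interactions. It is close to long horizon in the language-agent sense, but its object is a robot policy.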

Adjacent mechanisms

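Cosmos Policy is better described as adjacent to long-horizon world modeling. It fine-tunes a video diffusion model so that the same model outputs actions, future states, and values. This planning interface matters for long horizon because imagined futures and value estimates can rank candidate actions, but the paper's main object is the interface that turns a video model into a robot policy/world/value model.

A2A handles a different bottleneck: generative action policies that start from random noise are often too slow for real-time control. A2A starts generation from historical proprioceptive action latents. LAPO and HumanNet handle video data scale. GMR and SPIDER handle retargeting and physical feasibility. SMP and AnySkill provide reusable motion or semantic reward priors.

These mechanisms matter because long-task systems often fail at their interfaces. The future model may drift, the action generator may be too slow, the demonstration may be physically infeasible, the reward may be too sparse, and the policy may forget events that have already happened. Long horizon turns interface flaws into task failures.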

Language and coding agents

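The agent side gives the term a different set of measurements. OpenAI's Codex long-horizon guidance focuses on sustained software work: repository setup, AGENTS.md, continuation, verification, and context management. Anthropic's articles on long-running agents emphasize harnesses: progress files, tests, handoff artifacts, and external state. Kimi K2.6 is an industry model signal around long tool-call and long coding trajectories.

Mage gives a fairly clear research formulation: memory should be viewed as execution-state management. A long task requires retrieving facts, and it also requires maintaining a constantly changing state tree that holds constraints, decisions, progress, and errors.

There is a similarity to world models here, and also a clear boundary. A coding agent loses reliability because context, tests, and task state degrade over a long tool trajectory. A robot world model loses reliability because predicted states serve as inputs to the next prediction. Both are long-horizon problems, but the unit, failure mode, and repair mechanism differ.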

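What this map is for

The most relevant question for robot world models is concrete closed-loop use: when a planner asks the model to imagine futures, when a robot has to use those futures to choose actions, and when small state or event errors change the whole trajectory, can the model remain useful? PRISM-WM, GRASP, WEAVER, EV-WM, and related hierarchical/object-centric models are closest to this question.

Robot-policy works such as Psi-Zero, VIRAL, EgoHumanoid, and MEM are equally important. They show the deployment pressure: real embodied tasks are already long before the world model is perfect. They also expose the interfaces that a useful robot world model will ultimately have to face: memory, data, action chunks, retargeting, reward, and closed-loop correction.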