From Rollout to Context

June 2026

A research map of the long-horizon problem across robot world models, robot policies, and language-model agents.

Long horizon is becoming a shared phrase across robotics and language-model agents, but it does not name one technical object. In robot world models, the pressure comes from repeated prediction: a model is rolled forward under candidate actions, and small transition errors accumulate into wrong imagined futures. In robot policies, the pressure comes from embodied task execution: walking, reaching, grasping, moving, and recovering must survive many control steps and many changes of scene state. In coding and web agents, the pressure comes from context and state: specifications, files, tests, tool outputs, and partial decisions must remain coherent over a long sequence of actions.

The same phrase therefore points to several failure modes. A map is useful because it separates papers that directly study long-horizon reliability from papers that supply mechanisms a long-horizon system needs.

Working distinction

World-model horizon is measured in model steps. The failure mode is rollout drift, wrong event prediction, or a planner optimizing against a false future.

Robot-policy horizon is measured in control steps and task stages. The failure mode is losing progress, failing at contacts, or completing one subtask while making later subtasks impossible.

Agent horizon is measured in human task time or tool-use trajectories. The failure mode is context loss, state drift, missing tests, or a specification that no longer constrains the work.
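
The rollout-drift failure mode in the world-model case can be made concrete with a toy calculation. The sketch below is illustrative only: the scalar dynamics, the 1% one-step error, and the names `rollout`, `TRUE_A`, and `MODEL_A` are assumptions chosen for the example, not taken from any of the works discussed here. It rolls a slightly wrong model forward on its own predictions and compares the result with the true trajectory.

```python
def rollout(a, x0, steps):
    """Roll the scalar model x_{t+1} = a * x_t forward on its own predictions."""
    xs = [x0]
    for _ in range(steps):
        xs.append(a * xs[-1])
    return xs


TRUE_A = 1.00   # hypothetical true dynamics (state stays constant)
MODEL_A = 1.01  # hypothetical learned model with a 1% one-step error

true_traj = rollout(TRUE_A, 1.0, 50)
model_traj = rollout(MODEL_A, 1.0, 50)
errors = [abs(m - t) for m, t in zip(model_traj, true_traj)]

for step in (1, 10, 50):
    print(f"step {step:2d}: abs error = {errors[step]:.3f}")
```

Under these assumptions the one-step error is 1%, but feeding predictions back in compounds it to roughly 10% after ten steps and about 64% after fifty. A planner that scores candidate actions at step fifty is therefore scoring a state the real system would not reach.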

The first split

Three levels of relevance help keep the literature clean. Some works directly study long-horizon prediction, planning, or execution. Some works are adjacent: they provide future prediction, action chunks, data, retargeting, reward priors, or memory. Some works are background: they establish the vocabulary or the interface that newer systems build on.

| Group | Representative works | What long horizon means |
| --- | --- | --- |
| Robot world models | PRISM-WM, GRASP, WEAVER, EV-WM, H-WM, World-Ego Modeling | The model is repeatedly queried by a planner. Prediction error, event mismatch, or wrong state factorization changes action selection. |
| Robot policies and VLAs | Psi-Zero, VIRAL, EgoHumanoid, ULTRA, MEM, DexDrummer, FlowDPG | The robot must complete multi-stage embodied tasks under perception, contact, latency, and partial memory. |
| Enabling components | Cosmos Policy, A2A, LAPO, HumanNet, GMR, SPIDER, SMP, AnySkill | The work supplies a mechanism a long-horizon system may need: future-state prediction, action generation, human video, retargeting, or reward priors. |
| Language and coding agents | OpenAI Codex, Anthropic long-running agents, Kimi K2.6, Mage, Odysseys, SWE-Marathon | The agent must preserve task state across context windows, tool calls, tests, and multi-session work. |

Robot world models

The central problem is whether imagined futures remain useful to a planner.

PRISM-WM, from Beijing Institute of Technology, Peking University, and BIGAI/NLCo Lab, is the most direct fit for this problem. It studies hybrid physical systems where continuous motion is interrupted by contacts and impacts. A monolithic latent transition model can smooth over boundaries such as sticking versus sliding or flight versus stance. For a planner, this creates compounding errors during long-horizon lookahead. PRISM-WM changes the transition function into a context-aware mixture of experts with latent orthogonalization, so different local dynamics can be represented separately.
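
Why a single smooth transition model struggles at a mode boundary can be seen in a toy bouncing ball. The sketch below is not PRISM-WM's architecture: the gate is hand-written rather than learned, there is no latent orthogonalization, and all names and constants (`flight_expert`, `impact_expert`, `impact_weight`, `moe_step`, `G`, `DT`, `RESTITUTION`) are hypothetical. It only contrasts a gate that commits to one local dynamics model with a gate that averages across the boundary.

```python
import math

G, DT, RESTITUTION = 9.81, 0.01, 0.8  # hypothetical toy constants


def flight_expert(h, v):
    """Free fall: integrate height and velocity for one step."""
    return h + v * DT, v - G * DT


def impact_expert(h, v):
    """Ground impact: clamp to the floor and reverse the velocity."""
    return 0.0, -RESTITUTION * v


def impact_weight(h, v, sharpness):
    """Hand-written gate: weight of the impact expert, in [0, 1].

    The score is positive when the next flight step would end below ground.
    tanh is used instead of exp so that large scores cannot overflow.
    """
    score = -(h + v * DT)
    return 0.5 * (1.0 + math.tanh(0.5 * sharpness * score))


def moe_step(h, v, sharpness):
    """Blend the two experts' predictions using the gate weight."""
    w = impact_weight(h, v, sharpness)
    fh, fv = flight_expert(h, v)
    ih, iv = impact_expert(h, v)
    return (1 - w) * fh + w * ih, (1 - w) * fv + w * iv


# State one step before the ball would pass through the floor.
h, v = 0.01, -2.0
_, v_sharp = moe_step(h, v, sharpness=1e4)   # gate commits to the impact expert
_, v_blurry = moe_step(h, v, sharpness=1.0)  # gate averages the two modes
print(f"sharp gate:  v' = {v_sharp:+.3f}")
print(f"blurry gate: v' = {v_blurry:+.3f}")
```

With these numbers the sharp gate returns the rebound velocity of +1.6 (0.8 times the incoming speed of 2.0), while the blurry gate predicts about -0.24, a velocity that belongs to neither flight nor impact. Rolled forward, that averaged state is the kind of boundary smoothing that compounds during lookahead.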

GRASP approaches the same pressure from the planner side. It uses learned dynamics for gradient-based planning and adds virtual states, stochastic exploration, and gradient reshaping. The object is not only to predict a future, but to make the gradients through that future useful enough for control.
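
The basic loop that such methods build on, gradient descent on an action sequence through a differentiable dynamics model, can be sketched in a few lines. This is a plain baseline under stated assumptions, not GRASP itself: it has no virtual states, stochastic exploration, or gradient reshaping, and the scalar model `x' = A*x + B*tanh(u)` and the names `step`, `rollout`, `cost_and_grad`, and `plan` are hypothetical stand-ins for a learned model and planner.

```python
import math

A, B = 0.9, 0.5  # hypothetical "learned" model: x' = A*x + B*tanh(u)


def step(x, u):
    return A * x + B * math.tanh(u)


def rollout(x0, actions):
    xs = [x0]
    for u in actions:
        xs.append(step(xs[-1], u))
    return xs


def cost_and_grad(x0, actions, goal):
    """Terminal cost (x_T - goal)^2 and its gradient w.r.t. every action."""
    xs = rollout(x0, actions)
    cost = (xs[-1] - goal) ** 2
    adjoint = 2.0 * (xs[-1] - goal)  # d cost / d x_T
    grad = [0.0] * len(actions)
    for t in reversed(range(len(actions))):
        # d x_{t+1} / d u_t = B * (1 - tanh(u_t)^2)
        grad[t] = adjoint * B * (1.0 - math.tanh(actions[t]) ** 2)
        # d x_{t+1} / d x_t = A, so the adjoint shrinks by A per step back
        adjoint *= A
    return cost, grad


def plan(x0, goal, horizon=10, iters=200, lr=0.2):
    """Plain gradient descent on the action sequence."""
    actions = [0.0] * horizon
    for _ in range(iters):
        _, grad = cost_and_grad(x0, actions, goal)
        actions = [u - lr * g for u, g in zip(actions, grad)]
    return actions


x0, goal = 0.0, 1.0
actions = plan(x0, goal)
final_cost, _ = cost_and_grad(x0, actions, goal)
print(f"final state {rollout(x0, actions)[-1]:.4f}, cost {final_cost:.2e}")
```

Even in this toy, the backward pass shows where the difficulty comes from: the gradient reaching early actions is scaled by a product of per-step Jacobians (here powers of `A` and the `tanh` slope), so it can shrink or saturate as the horizon grows.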

WEAVER pushes manipulation world models toward high fidelity, fast generation, and longer context. Its promise is practical: a world model should evaluate policies, improve them, and plan with them. EV-WM adds another useful constraint: visual futures should be checked against predicates and task events. A future that looks plausible can still be useless if the required event did not happen.
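
The idea of checking a predicted future against task events can be illustrated with a small predicate check. The sketch below is a hypothetical illustration, not EV-WM's method: it assumes predicted frames have already been decoded into symbolic states, and the state fields, the predicates `grasped` and `lifted`, and the helper `events_in_order` are all invented for the example.

```python
def events_in_order(rollout, predicates):
    """True if every predicate holds somewhere in the rollout, in the given order.

    A later predicate may first hold on the same step as the previous one.
    """
    idx = 0
    for holds in predicates:
        while idx < len(rollout) and not holds(rollout[idx]):
            idx += 1
        if idx == len(rollout):
            return False
    return True


# Hypothetical predicates over symbolic states decoded from predicted frames.
def grasped(state):
    return state["gripper_closed"] and state["object_in_gripper"]


def lifted(state):
    return state["object_height"] > 0.10


good = [
    {"gripper_closed": False, "object_in_gripper": False, "object_height": 0.00},
    {"gripper_closed": True, "object_in_gripper": True, "object_height": 0.00},
    {"gripper_closed": True, "object_in_gripper": True, "object_height": 0.15},
]
# Ends in a similar-looking state (object is up), but no grasp event occurred.
bad = [
    {"gripper_closed": False, "object_in_gripper": False, "object_height": 0.00},
    {"gripper_closed": False, "object_in_gripper": False, "object_height": 0.08},
    {"gripper_closed": False, "object_in_gripper": False, "object_height": 0.15},
]

print(events_in_order(good, [grasped, lifted]))  # True
print(events_in_order(bad, [grasped, lifted]))   # False
```

The second rollout ends with the object raised, which could pass a purely visual plausibility check, yet it is rejected because the grasp event that should precede the lift never occurs.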

Works such as H-WM, tau0-WM, AR Forcing, World-Ego Modeling, Lifting Embodied World Models, and Slot-MPC are variations on the same theme. Each moves some part of the planning problem into a different representation: respectively, a higher-level state, a video-action sequence, a stabilized autoregressive rollout, a world/ego factorization, a waypoint space, and an object-centric representation.

| Work | Organization signal | Date | Main object |
| --- | --- | --- | --- |
| PRISM-WM | BIT, Peking University, BIGAI/NLCo Lab | 2026-05 | Compositional latent dynamics for hybrid/contact-rich systems. |
| GRASP | UC Berkeley / BAIR | 2026-04 | Gradient-based planning through learned dynamics. |
| WEAVER | CMU / Mila signal | 2026-06 | Manipulation world model for evaluation, improvement, and planning. |
| EV-WM | Academic arXiv line | 2026-06 | World model with event and predicate verification. |
| World-Ego Modeling | Multi-institution arXiv paper | 2026-05 | Separates persistent world evolution from robot-centric ego dynamics. |

Robot policies and embodied memory

These papers study long embodied execution more than long model rollout.

Psi-Zero is a humanoid VLA paper. It trains from egocentric human video for task semantics, then uses real humanoid data to train a flow-based joint-space action expert. Its long-horizon problem is task execution: a humanoid has to coordinate perception, locomotion, manipulation, and action chunking across thousands of low-level control steps.

VIRAL is a visual sim-to-real robot-policy system from NVIDIA, CMU, UC Berkeley, and CUHK. It trains a privileged teacher and an RGB student in simulation, then deploys on a Unitree G1 for sustained loco-manipulation loops. Its long-horizon claim is embodied deployment under visual perception, not learned-world-model rollout reliability.

EgoHumanoid addresses the data side of the same problem. It uses robot-free egocentric human demonstrations and aligns human view/action data to a humanoid policy. The long-horizon value comes from diverse multi-stage human demonstrations in real environments: the policy can see more objects, scenes, lighting, and task sequences than a lab-only teleoperation set would provide.

MEM makes the memory problem explicit. It combines short-term visual memory with long-term language memory so a robot can keep task-relevant history over longer interactions. This is closer to the language-agent version of long horizon, but it happens inside a robot policy.
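
The two time scales can be pictured as a small data structure. The sketch below is only a schematic of the idea, not MEM's implementation: the class `TwoScaleMemory`, its fields, and the string placeholders standing in for image frames and language notes are all hypothetical.

```python
from collections import deque


class TwoScaleMemory:
    """Hypothetical container: recent observations plus persistent text notes."""

    def __init__(self, short_term_size=3):
        self.recent = deque(maxlen=short_term_size)  # oldest entries fall off
        self.notes = []                              # kept for the whole task

    def observe(self, frame, note=None):
        """Store a new frame; optionally record a lasting language note."""
        self.recent.append(frame)
        if note is not None:
            self.notes.append(note)

    def context(self):
        """What a policy would be conditioned on at this step."""
        return {"recent": list(self.recent), "notes": list(self.notes)}


memory = TwoScaleMemory(short_term_size=3)
memory.observe("frame_0", note="drawer opened")
for i in range(1, 6):
    memory.observe(f"frame_{i}")
print(memory.context())
```

After six observations the visual buffer holds only the last three frames, while the note about the drawer survives. That asymmetry is the point: the fact that matters for later stages is retained even after the frames that showed it have been discarded.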

Adjacent mechanisms

Cosmos Policy is adjacent to long-horizon world modeling rather than a pure long-horizon reliability paper. It fine-tunes a video diffusion model so the same model can output actions, future states, and values. The planning interface matters for long horizon, because imagined futures and value estimates can rank candidate actions, but the paper's main object is the interface that turns a video model into a robot policy/world/value model.

A2A attacks a different bottleneck: generative action policies are often too slow for real-time control when they start from random noise. A2A starts generation from historical proprioceptive action latents. LAPO and HumanNet address data scaling from videos. GMR and SPIDER address retargeting and physics feasibility. SMP and AnySkill provide reusable motion or semantic reward priors.

These mechanisms matter because long-horizon systems fail at their interfaces. The future model may drift, the action generator may be too slow, the demonstration may be physically infeasible, the reward may be too sparse, or the policy may forget what already happened. Long horizon turns interface flaws into task failures.

Language and coding agents

The agent side gives the same phrase a different measurement. OpenAI's Codex long-horizon guidance focuses on persistent software work: repository setup, AGENTS.md, continuation, verification, and context management. Anthropic's writing on long-running agents emphasizes harnesses: progress files, tests, handoff artifacts, and external state. Kimi K2.6 is an industry signal in the same direction: a model release positioned around long tool-call and long coding trajectories.

Mage gives the cleanest research formulation: memory should be treated as execution-state management. A long task is not only a set of facts to retrieve. It is a branching state of constraints, decisions, progress, and errors that must be maintained and revised.
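
Treating memory as state to be maintained, rather than facts to be retrieved, can be sketched as a small record that is revised in place and serialized for handoff between sessions. This is a hypothetical illustration, not Mage's design or any vendor's file format: the class `ExecutionState`, its fields, and the example values are invented, and branching is not modeled.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ExecutionState:
    """Hypothetical task state an agent persists between sessions."""
    constraints: list = field(default_factory=list)
    decisions: dict = field(default_factory=dict)  # decision name -> current choice
    done: list = field(default_factory=list)
    open_errors: list = field(default_factory=list)

    def revise(self, name, choice):
        """Record or overwrite a decision; revision is part of the state."""
        self.decisions[name] = choice

    def to_json(self):
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text):
        return cls(**json.loads(text))


state = ExecutionState(constraints=["public API must not change"])
state.revise("storage", "sqlite")
state.open_errors.append("test_migration fails")
state.revise("storage", "postgres")  # an earlier decision is revised
state.done.append("schema drafted")

handoff = state.to_json()            # what a progress file might contain
resumed = ExecutionState.from_json(handoff)
print(resumed.decisions, resumed.open_errors)
```

A new session that loads `handoff` sees the current decision, the standing constraint, and the unresolved error without replaying the conversation that produced them.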

The parallel to world models is useful, but limited. A coding agent loses reliability because context, tests, and task state degrade over a long tool trajectory. A robot world model loses reliability because predicted states become inputs to later predictions. Both are long-horizon problems, but the unit of measurement, the failure mode, and the repair mechanism differ.

What this map is for

The most relevant line for robot world models is not the broad slogan of long horizon. It is the narrower question of how a model remains useful under repeated closed-loop use: when a planner asks it to imagine futures, when a robot has to use those futures to choose actions, and when small state or event errors change the whole trajectory. PRISM-WM, GRASP, WEAVER, EV-WM, and related hierarchical or object-centric models sit closest to that question.

Robot-policy papers such as Psi-Zero, VIRAL, EgoHumanoid, and MEM are still important. They show the deployment pressure: real embodied tasks are already long, well before any world model is perfect. They also reveal the missing interfaces (memory, data, action chunks, retargeting, reward, and closed-loop correction) that a useful robot world model will eventually have to meet.

References

  1. Prismatic World Model
  2. GRASP, UC Berkeley BAIR
  3. WEAVER
  4. EV-WM
  5. Cosmos Policy
  6. Psi-Zero
  7. VIRAL
  8. EgoHumanoid
  9. MEM: Multi-Scale Embodied Memory
  10. OpenAI Codex long-horizon tasks
  11. Anthropic, Effective harnesses for long-running agents
  12. Kimi K2.6
  13. Mage: Memory as Execution State Management