Long Horizon Is Not One Problem
Coding agents, robot world models, and contact-rich control all complain about the long horizon, yet their solutions almost never cite each other. The reason is structural: long horizon decomposes into six different failures, and recent work answers them with five recurring moves. One of the six has no move at all.
The phrase "long horizon" now appears in three fields that do not read each other's papers. A coding agent is long-horizon when it must keep a software task coherent across many sessions. A robot world model is long-horizon when it must imagine many steps of future dynamics before acting. A locomotion controller is long-horizon when reward arrives only after thousands of control steps. The vocabulary does not transfer, and neither do the methods.
A label shared by three communities with no shared solution is usually not one problem. It is a surface name for several different failures that happen to share a cause. The useful first step is to separate the cause from the failures.
The shared structure
The cause is the same everywhere: a task runs longer than the span the system can hold reliably in one piece. What that span is differs by field, and naming it literally is more useful than calling all of it "long horizon."
In a world model, the span is the number of timesteps a learned transition function stays accurate. Each imagined step takes the previous prediction as input, so per-step error compounds. In an agent, the span is the amount of task state a context window can hold. Each new step pushes earlier state toward the edge of the window, where it is dropped or summarized away. In control, the span is the number of steps a reward signal can directly cover. Each additional step spreads credit assignment thinner over a longer trajectory.
So the unit being measured is different in each field — model timesteps, context tokens, reward steps — but the structure is identical: something accurate, retained, or assignable degrades per step. Split that single structure apart and six distinct failures fall out, each with a different thing that breaks.
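The compounding in the world-model case is easy to make concrete. The sketch below is a toy, not a result from any cited system: a scalar system whose true update leaves the state unchanged, and a hypothetical model whose gain is off by 1%. Because the model consumes its own output, the gap grows geometrically rather than linearly.

```python
def rollout_error(true_gain, model_gain, x0, steps):
    """Gap between the true state and the imagined state after each step.

    Both systems follow x <- gain * x. The model is wrong only in its gain,
    and it is fed its own previous prediction, never the true state.
    """
    x_true, x_model = x0, x0
    errors = []
    for _ in range(steps):
        x_true = true_gain * x_true
        x_model = model_gain * x_model
        errors.append(abs(x_true - x_model))
    return errors


# Hypothetical numbers: a 1% gain error, rolled out for 50 steps.
errors = rollout_error(true_gain=1.0, model_gain=1.01, x0=1.0, steps=50)
print(f"step 1: {errors[0]:.3f}, step 50: {errors[-1]:.3f}")
```

With these numbers a one-step error of about 0.01 becomes about 0.64 by step 50. The same loop with a shorter `steps` is the whole argument for shortening the horizon.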
What actually breaks
These six are coupled but not interchangeable. A method that fixes one can leave the others untouched, which is exactly why papers attacking different ones do not cite each other.
1. Rollout drift
A learned transition function feeds its own prediction back as the next input, so a small one-step error becomes a structured multi-step error in latents, rewards, contacts, or object poses. The error is largest where dynamics are discontinuous: a foot touching the ground and a foot in flight follow different transition rules, and a single smoothed model averages them. This is the failure TD-MPC avoids by refusing to roll out far — it plans a short horizon and bootstraps the rest with a value function — and the failure PRISM-WM attacks directly by splitting the transition function across contact regimes.
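The foot example can be reduced to a few lines. This is a hand-built illustration, not PRISM-WM or any learned model: heights are integers in centimetres, the "averaged" rule is simply the mean of the two true rules, and the split model is assumed to know the contact gate exactly.

```python
def true_step(height_cm):
    """Ground truth: fall 10 cm per step in flight, stay at 0 once in contact."""
    return max(0, height_cm - 10) if height_cm > 0 else 0


def averaged_step(height_cm):
    """One smoothed rule: the mean of 'fall 10' and 'fall 0', applied everywhere."""
    return height_cm - 5


def split_step(height_cm):
    """Two rules behind a contact gate (the gate is assumed known here)."""
    in_flight = height_cm > 0
    return max(0, height_cm - 10) if in_flight else 0


def rollout(step_fn, height_cm, steps):
    """Feed the model its own prediction for a fixed number of steps."""
    for _ in range(steps):
        height_cm = step_fn(height_cm)
    return height_cm


final_true = rollout(true_step, 50, 20)
final_averaged = rollout(averaged_step, 50, 20)
final_split = rollout(split_step, 50, 20)
```

After 20 steps the averaged model has the foot 50 cm below the ground, a state the true system cannot reach, while the split model stays at 0. In a real system the gate would itself have to be learned, so the split relocates the difficulty rather than removing it.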
2. Planner exploitation
A world model trained for passive next-step prediction is used at test time as an optimization target. A planner searches for the action sequence that scores highest inside the model, and that search lands on states the model never saw in training, where its error is unconstrained. The model can look good under prediction and still be bad for control, because the planner actively seeks out its blind spots. This is a train-test gap specific to model-based planning, not a generic accuracy problem.
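A toy makes the gap between passive accuracy and planning use visible. Everything here is an assumption made for illustration: the objective, the training range [0, 2], and a hand-built model that is exact on that range and extrapolates upward outside it.

```python
def true_return(action):
    """Real objective: best at action = 1, worse the further away."""
    return -(action - 1.0) ** 2


def model_return(action):
    """Hand-built model: exact on the training range [0, 2], with a spurious
    upward extrapolation outside it (an assumption made for illustration)."""
    if 0.0 <= action <= 2.0:
        return true_return(action)
    distance_outside = -action if action < 0.0 else action - 2.0
    return -1.0 + 0.5 * distance_outside


candidates = [i / 10 for i in range(-50, 51)]  # the planner's search space
in_range = [a for a in candidates if 0.0 <= a <= 2.0]

# Passive evaluation: error measured only where training data exists.
prediction_error = max(abs(model_return(a) - true_return(a)) for a in in_range)

# Active use: the planner maximizes the model over the whole search space.
planned = max(candidates, key=model_return)
```

The model's error on the training range is zero, yet the planner selects `-5.0`, where the true return is `-36.0`. No amount of in-range evaluation would have flagged it.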
3. Trust horizon
Even a good model rarely reports how far the current rollout can be trusted. The missing quantity is a number of steps that depends on the present state, task phase, and contact regime: the longest rollout worth using, not the longest the model can physically produce. "Errors accumulate" is a statement about the model; the trust horizon is a statement the model should make about itself, and almost no system makes it.
4. Plausible is not correct
A predicted future can be visually and physically plausible while the actual task event did not happen: the gripper looks closed but the object was not grasped, the trajectory looks smooth but a precondition was violated. The unit of progress is wrong. The model is scored on whether the future looks right rather than on whether the task predicate became true.
5. Temporal granularity
Planning at every motor step or every token makes a long task intractable, because the number of decisions grows with the horizon. A flat policy mapping pixels to torques over thousands of steps, or a flat agent reasoning token by token over a multi-hour task, is solving a problem whose decision count is too large by construction.
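The arithmetic is worth writing down once. The numbers below are hypothetical (a 60-second task, a 50 Hz controller, one high-level decision per 100 control steps); the function only counts decisions and says nothing about how hard each one is.

```python
import math


def decision_count(duration_s, rate_hz, steps_per_decision=1):
    """Number of decisions needed to cover a task at a given granularity."""
    total_steps = math.ceil(duration_s * rate_hz)
    return math.ceil(total_steps / steps_per_decision)


flat = decision_count(60, 50)          # one decision per control step
chunked = decision_count(60, 50, 100)  # one decision per 100-step chunk
```

Here `flat` is 3000 and `chunked` is 30. The task is the same length in seconds; only the unit of decision changed.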
6. State persistence
Over a long execution the system loses object identity, location, occluded facts, which subtask is done, and which constraints are still satisfied. The same current observation can require different actions depending on an earlier event that has since left the frame. This is a non-Markovian failure: the present observation is no longer a sufficient state.
Five moves that recur
Across coding agents, world models, and control, the responses collapse into five families. Each one shortens, externalizes, checks, stabilizes, or re-trains — and each maps onto specific failures above rather than onto "long horizon" in general.
A. Shorten the horizon
Do not cross the whole span at once. Plan a short rollout and bootstrap the tail with a value function (TD-MPC); stack a slow high-level planner over a fast low-level controller so each level sees a shorter horizon (hierarchical world models such as Puppeteer; the classic options framework); plan in a coarser action space of waypoints, subgoals, or logical predicates (LLM task decomposition grounded by affordances, as in SayCan); predict an action chunk instead of one step. This addresses rollout drift and temporal granularity. It can fail when the high level is assumed reliable: if a subgoal is wrong, a perfectly executed low level still misses the task.
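A minimal sketch of the first tactic, a short rollout plus a terminal value, on a hypothetical 1-D task. It is not TD-MPC, which plans in a learned latent space. Here the dynamics are known, the search is exhaustive over three-step sequences, and `terminal_value` is a hand-written stand-in for a learned value function.

```python
from itertools import product

ACTIONS = (-1, 0, 1)
GOAL = 10


def step(state, action):
    """Known toy dynamics: move one cell left, stay, or move one cell right."""
    return state + action


def reward(state):
    """Negative distance to the goal, paid on arrival in each state."""
    return -abs(state - GOAL)


def terminal_value(state):
    """Stand-in for a learned value: cost-to-go when walking straight to the goal."""
    d = abs(state - GOAL)
    return -d * (d - 1) / 2


def plan(state, horizon=3):
    """Score every short action sequence, bootstrap the tail, return the first action."""
    best_score, best_first = float("-inf"), 0
    for seq in product(ACTIONS, repeat=horizon):
        s, score = state, 0.0
        for a in seq:
            s = step(s, a)
            score += reward(s)
        score += terminal_value(s)  # everything beyond the short horizon
        if score > best_score:
            best_score, best_first = score, seq[0]
    return best_first


# Closed loop: replan at every step, execute only the first action.
state = 0
for _ in range(12):
    state = step(state, plan(state))
```

The planner never looks more than three steps ahead, yet the closed loop reaches the goal at 10 and stays there, because the terminal value carries the rest. If `terminal_value` were wrong, the short rollout would inherit its error.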
B. Externalize state
When the context cannot hold the span, move it outside. The hard part is retrieval, not capacity. Progress files, structured handoffs, and execution-state memory let an agent continue across sessions that start with no memory (Anthropic's harness work); episodic, keyframe, and geometry-indexed memory let a robot policy fetch the one past frame that the current decision depends on. This addresses state persistence. Simply lengthening the context window is the weaker version of this move; the stronger version retrieves the right evidence at the right time rather than carrying everything forward.
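On the agent side the simplest form of this move is a progress file. The sketch below is an assumption-laden minimum, not the format used in any cited harness: the file name, the three fields, and the helper names are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def save_progress(path: Path, done: list, pending: list, notes: str) -> None:
    """Write the task state that must survive the end of this session."""
    payload = {"done": done, "pending": pending, "notes": notes}
    path.write_text(json.dumps(payload, indent=2))


def resume(path: Path) -> Optional[str]:
    """A fresh session reads the file and picks the first unfinished subtask."""
    if not path.exists():
        return None
    state = json.loads(path.read_text())
    return state["pending"][0] if state["pending"] else None


with tempfile.TemporaryDirectory() as tmp:
    progress = Path(tmp) / "progress.json"  # hypothetical file name
    save_progress(
        progress,
        done=["parse config"],
        pending=["add retries", "write tests"],
        notes="retry budget agreed at 3 attempts",
    )
    next_task = resume(progress)  # what a memoryless new session would pick up
```

The file holds little, which is the point: `resume` retrieves the one fact the next session needs instead of replaying the whole history.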
C. Verify, don't trust
Since plausible is not correct, check whether the task event actually occurred instead of whether the future looks right. In coding agents this is the test oracle and executable verification; in robot world models it is predicate or event prediction layered on top of visual features, and using model rollouts as a policy evaluator scored against real success rather than visual quality. This addresses planner exploitation and the plausible-is-not-correct failure. It can fail when the verifier itself is wrong or gameable, which moves the problem rather than removing it.
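The difference between the two checks fits in a few lines. The state fields, thresholds, and both functions below are hypothetical stand-ins: `looks_plausible` plays the role of an appearance-level score, `grasp_succeeded` the role of a task predicate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictedState:
    gripper_closed: bool
    gripper_z: float  # metres above the table
    object_z: float   # metres above the table


def looks_plausible(final: PredictedState) -> bool:
    """Appearance-level check: the gripper is closed and has been raised."""
    return final.gripper_closed and final.gripper_z > 0.1


def grasp_succeeded(final: PredictedState, tol: float = 0.02) -> bool:
    """Task predicate: the object is off the table and travelling with the gripper."""
    lifted = final.object_z > 0.1
    attached = abs(final.object_z - final.gripper_z) <= tol
    return final.gripper_closed and lifted and attached


missed = PredictedState(gripper_closed=True, gripper_z=0.30, object_z=0.00)
held = PredictedState(gripper_closed=True, gripper_z=0.30, object_z=0.29)
```

Both futures pass `looks_plausible`; only `held` passes `grasp_succeeded`. The predicate is also the part that can be wrong or gamed, so it deserves the same scrutiny as the model it checks.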
D. Stabilize the rollout
Attack the error source directly. Split dynamics by contact regime, expert, or object so the model stops smoothing across incompatible modes (MoE / compositional dynamics such as PRISM-WM; object-centric latents built on slot representations); keep the latent control-relevant rather than reconstruction-oriented (TD-MPC, Dreamer-style models); penalize epistemic uncertainty so the planner avoids high-error regions (ensemble disagreement, as in PETS-style model-based RL). This addresses rollout drift and, by closing the train-test gap, planner exploitation. The split granularity — how many experts, how many regimes — is usually set by hand.
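The uncertainty penalty can be sketched with a three-member ensemble. This is a simple disagreement penalty on a toy objective, not the PETS algorithm: the members, their extrapolation slopes, and the penalty weight are all assumptions chosen so the effect is visible.

```python
from statistics import mean, pstdev

SLOPES = (0.5, -1.0, 2.0)  # hypothetical: each member extrapolates differently


def member_return(slope, action):
    """Every member is exact on the training range [0, 2] and differs outside it."""
    if 0.0 <= action <= 2.0:
        return -(action - 1.0) ** 2
    distance_outside = -action if action < 0.0 else action - 2.0
    return -1.0 + slope * distance_outside


def score(action, penalty):
    """Ensemble mean minus a penalty on ensemble disagreement."""
    predictions = [member_return(s, action) for s in SLOPES]
    return mean(predictions) - penalty * pstdev(predictions)


candidates = [i / 10 for i in range(-50, 51)]
greedy = max(candidates, key=lambda a: score(a, penalty=0.0))
cautious = max(candidates, key=lambda a: score(a, penalty=1.0))
```

Without the penalty the planner picks `-5.0`, far outside the data; with it the planner picks `1.0`, the true optimum. The penalty works here only because the members disagree exactly where they are wrong, which a real ensemble does not guarantee.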
E. Fix signal and data
Long-horizon failure is often credit assignment, not model capacity. Shape dense, stage-aware progress rewards for sparse tasks; pretrain on human and egocentric video, recover latent actions from action-free video, and use feasible retargeting and sim-to-real so the policy sees enough of the long-task distribution. This addresses the training side of the plausible-is-not-correct and temporal-granularity failures. Dense reward is itself gameable, which feeds straight back into planner exploitation.
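A stage-aware progress reward can be as small as the sketch below. The stage names and the within-stage fraction are hypothetical; how that fraction is measured is exactly the part a policy can learn to game.

```python
STAGES = ("reach", "grasp", "lift", "place")  # hypothetical task stages


def sparse_reward(stages_done):
    """Signal only when the whole task is finished."""
    return 1.0 if stages_done == len(STAGES) else 0.0


def progress_reward(stages_done, fraction_of_current):
    """Dense signal: completed stages plus partial credit inside the current one."""
    if stages_done >= len(STAGES):
        return 1.0
    fraction = min(max(fraction_of_current, 0.0), 1.0)  # clamp to [0, 1]
    return (stages_done + fraction) / len(STAGES)


halfway_through_grasp = progress_reward(stages_done=1, fraction_of_current=0.5)
```

Halfway through the grasp, the sparse reward is still `0.0` while the progress reward is `0.375`. The dense version gives the learner something to follow long before the task completes.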
| Failure mode | A Shorten | B Externalize | C Verify | D Stabilize | E Signal & data |
|---|---|---|---|---|---|
| 1. Rollout drift | ● | | | ● | ○ |
| 2. Planner exploitation | | | ● | ● | |
| 3. Trust horizon | | | | | |
| 4. Plausible ≠ correct | | | ● | | ○ |
| 5. Temporal granularity | ● | | | | ○ |
| 6. State persistence | | ● | | ○ | |
Two worlds, the same five moves
The five families are domain-independent. A coding agent and a robot world model look unrelated, but when a failure appears in both, the move that answers it is the same — only the implementation changes. The clearest way to see this is to put the two worlds in the same table.
| Move | Coding / LLM agent | Robot world model |
|---|---|---|
| A · Shorten | Spec decomposition into features; subagents with branch-and-merge | Hierarchical world model; short-horizon MPC with terminal value |
| B · Externalize | Progress file, compaction of history, structured handoff | Episodic and geometry-indexed memory; keyframe retrieval |
| C · Verify | Test oracle, executable checks, held-out tests | Predicate / event verification of imagined futures |
| D · Stabilize | Context curation that prunes stale, drift-inducing content | Regime-split and object-centric dynamics |
| E · Signal & data | Synthesizing open-ended tasks that elicit long-horizon behavior | Dense progress reward; human-video and latent-action pretraining |
This is the practical payoff of the decomposition. There is no need to track a hundred long-horizon papers as a hundred separate ideas. Most of them are one of five moves applied to one of six failures in one domain. The map is small.
The one move that is missing: trust horizon
Return to the matrix. Five rows are covered by at least one family. Row three is empty, and the emptiness is not an accident of how the table was drawn.
Moves A through E all improve the quality of something: the model, the planner, the memory, the data. A better model drifts more slowly; a planner that trains on its own induced states is harder to exploit; better retrieval keeps more state; better reward shaping assigns credit further. None of them produces a calibrated estimate of how long the rollout stays valid. They make the rollout more trustworthy without telling the planner how far to trust it, and a more accurate model with no trust estimate is still queried blindly.
That gap matters because closed-loop planning has to use the rollout. Without an estimate of how far the rollout remains valid, the planning horizon becomes a fixed hyperparameter: too short on easy states, where the model could safely look further, and too long on brittle states such as contact and mode switches, where the model has already drifted. The horizon should be a function of the state, and right now it is a constant.
One route to a state-dependent horizon runs through move D. Once dynamics are split by regime or expert (the direction PRISM-WM takes for contact-rich control), each regime can carry its own instrumentation: its own prediction consistency, its own uncertainty, its own evidence of when it stops applying. That structure is what a state-dependent trust horizon needs. The intervention is small and testable: add a horizon estimate on top of an existing world model and gate the planner with it, without retraining the base dynamics. It is the one place on this map where the move does not yet exist, which makes it the most concrete thing to build next.
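One possible shape for such a horizon estimate, offered as a sketch and not as an existing method: roll an ensemble forward from the current state and stop at the first step where the members disagree by more than a tolerance. The ensemble, the bounce coefficients, and the tolerance are all hypothetical, and disagreement is only a proxy for a calibrated trust estimate.

```python
def make_member(bounce):
    """Ensemble member: agrees with the others in flight, differs at ground contact."""
    def step(height, velocity):
        nxt = height + velocity
        return nxt if nxt >= 0.0 else -bounce * nxt
    return step


def trust_horizon(members, height, velocities, tol):
    """Count the leading steps on which all members stay within tol of each other."""
    heights = [height] * len(members)
    for k, velocity in enumerate(velocities):
        heights = [m(h, velocity) for m, h in zip(members, heights)]
        if max(heights) - min(heights) > tol:
            return k  # steps 0..k-1 were still in agreement
    return len(velocities)


members = [make_member(b) for b in (0.0, 0.4, 0.8)]  # hypothetical contact models
plan = [-1.0] * 10                                    # candidate action sequence

far = trust_horizon(members, 20.0, plan, tol=0.1)   # contact never reached
near = trust_horizon(members, 5.0, plan, tol=0.1)   # contact on the sixth step
```

From a height of 20 the full 10-step plan is usable; from a height of 5 only the first 5 steps are. A planner gated by this value would use a horizon of `min(max_horizon, trust_horizon(...))`, so the horizon becomes a function of the state without touching the base dynamics.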
The six failures and five moves already account for most of the recent work. What no family yet does is let the model report its own trustable number of steps. That is the open problem hiding inside the phrase "long horizon."
Sources
- Temporal Difference Learning for Model Predictive Control, Hansen, Wang, and Su, ICML 2022 — short-horizon latent rollout plus terminal value.
- Effective harnesses for long-running agents and Harness design for long-running application development, Anthropic Engineering — long-horizon agent failure modes and artifact handoff.
- Between MDPs and semi-MDPs: A framework for temporal abstraction, Sutton, Precup, and Singh, Artificial Intelligence 1999 — the options framework for temporal abstraction.
- Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero), Schrittwieser et al., Nature 2020 — a world model organized around a planning interface.