Long Horizon Is Not One Problem

June 2026

Coding agents, robot world models, and contact-rich control all complain about the long horizon, yet the papers proposing solutions almost never cite each other. The reason is structural: long horizon decomposes into six different failures, and recent work answers them with five recurring moves. One of the six has no move at all.

The phrase "long horizon" now appears in three fields that do not read each other's papers. A coding agent is long-horizon when it must keep a software task coherent across many sessions. A robot world model is long-horizon when it must imagine many steps of future dynamics before acting. A locomotion controller is long-horizon when reward arrives only after thousands of control steps. The vocabulary does not transfer, and neither do the methods.

A label shared by three communities with no shared solution is usually not one problem. It is a surface name for several different failures that happen to share a cause. The useful first step is to separate the cause from the failures.

The shared structure

The cause is the same everywhere: a task runs longer than the span the system can hold reliably in one piece. What that span is differs by field, and naming it literally is more useful than calling all of it "long horizon."

In a world model, the span is the number of timesteps a learned transition function stays accurate. Each imagined step takes the previous prediction as input, so per-step error compounds. In an agent, the span is the amount of task state a context window can hold. Each new step pushes earlier state toward the edge of the window, where it is dropped or summarized away. In control, the span is the number of steps a reward signal can directly cover. Each additional step spreads credit assignment thinner over a longer trajectory.

So the unit being measured is different in each field — model timesteps, context tokens, reward steps — but the structure is identical: something accurate, retained, or assignable degrades per step. Split that single structure apart and six distinct failures fall out, each with a different thing that breaks.
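For the world-model case, the per-step degradation can be written down under two assumptions that are not stated above: the learned transition function \hat f differs from the true f by at most \varepsilon in a single step, and \hat f is L-Lipschitz. A sketch of the resulting bound on the rollout error e_t = \lVert \hat s_t - s_t \rVert, starting from e_0 = 0:

```latex
\begin{aligned}
e_{t+1} &= \lVert \hat f(\hat s_t) - f(s_t) \rVert
        \le \lVert \hat f(\hat s_t) - \hat f(s_t) \rVert
          + \lVert \hat f(s_t) - f(s_t) \rVert
        \le L\, e_t + \varepsilon, \\
e_H &\le \varepsilon \sum_{k=0}^{H-1} L^{k}
     = \begin{cases}
         \varepsilon\, \dfrac{L^{H}-1}{L-1}, & L \neq 1, \\[6pt]
         H\varepsilon, & L = 1.
       \end{cases}
\end{aligned}
```

With L > 1 the bound grows geometrically in H, which is the compounding described above. The analogous quantities for context tokens and reward steps are not modeled here.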

What actually breaks

These six are coupled but not interchangeable. A method that fixes one can leave the others untouched, which is exactly why papers attacking different ones do not cite each other.

1. Rollout drift

A learned transition function feeds its own prediction back as the next input, so a small one-step error becomes a structured multi-step error in latents, rewards, contacts, or object poses. The error is largest where dynamics are discontinuous: a foot touching the ground and a foot in flight follow different transition rules, and a single smoothed model averages them. This is the failure TD-MPC avoids by refusing to roll out far — it plans a short horizon and bootstraps the rest with a value function — and the failure PRISM-WM attacks directly by splitting the transition function across contact regimes.
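The averaging effect can be reproduced in a toy setting. The sketch below is an illustration only, not TD-MPC or PRISM-WM: the one-dimensional bounce dynamics, the linear fits, and every name in it are assumptions made for the example. It fits one line across both regimes and one line per regime, then rolls each model out open-loop.

```python
import numpy as np

def true_step(x):
    """Toy hybrid dynamics: fall by 0.25 while x > 0, bounce once x <= 0."""
    return np.where(x > 0, x - 0.25, -0.5 * x + 1.0)

# One-step training data covering both regimes.
xs = np.linspace(-0.25, 1.25, 61)
ys = true_step(xs)

# A single smoothed model: one line fit through both regimes.
smooth_coef = np.polyfit(xs, ys, 1)
# A regime-split model: one line per regime.
flight_coef = np.polyfit(xs[xs > 0], ys[xs > 0], 1)
contact_coef = np.polyfit(xs[xs <= 0], ys[xs <= 0], 1)

def smooth_step(x):
    return np.polyval(smooth_coef, x)

def split_step(x):
    # Assumption: the regime can be read off the model's own predicted state.
    return np.polyval(flight_coef if x > 0 else contact_coef, x)

def rollout(step, x0, horizon):
    """Open-loop rollout: each prediction is fed back as the next input."""
    traj = [x0]
    for _ in range(horizon):
        traj.append(float(step(traj[-1])))
    return np.array(traj)

truth = rollout(true_step, 0.9, 20)
smooth_err = np.abs(rollout(smooth_step, 0.9, 20) - truth).max()
split_err = np.abs(rollout(split_step, 0.9, 20) - truth).max()
print(f"smoothed model max rollout error: {smooth_err:.3f}")
print(f"regime-split model max rollout error: {split_err:.2e}")
```

The split model is exact here only because the toy dynamics are piecewise linear and the regime label is free. In a real system both the regime boundaries and the per-regime models have to be learned.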

2. Planner exploitation

A world model trained for passive next-step prediction is used at test time as an optimization target. A planner searches for the action sequence that scores highest inside the model, and that search lands on states the model never saw in training, where its error is unconstrained. The model can look good under prediction and still be bad for control, because the planner actively seeks out its blind spots. This is a train-test gap specific to model-based planning, not a generic accuracy problem.
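A minimal sketch of this gap, under assumptions chosen for the example: a scalar action, a true reward with a cliff just outside the training range, and a linear reward model. None of these names come from a specific system. The model has essentially zero error on its training data, and the planner still picks an action that is bad in reality.

```python
import numpy as np

def true_reward(a):
    """Reward rises with the action up to 1.0, then falls off a cliff."""
    a = np.asarray(a, dtype=float)
    return np.where(a <= 1.0, a, 1.0 - 4.0 * (a - 1.0))

# Passive training data: actions only from the range [-1, 1].
train_a = np.linspace(-1.0, 1.0, 21)
coef = np.polyfit(train_a, true_reward(train_a), 1)

def model_reward(a):
    return np.polyval(coef, a)

# The planner searches a wider range than the model was trained on.
candidates = np.linspace(-3.0, 3.0, 61)
planned = float(candidates[np.argmax(model_reward(candidates))])

train_error = np.abs(model_reward(train_a) - true_reward(train_a)).max()
print(f"max error on training actions: {train_error:.2e}")
print(f"planned action: {planned:.1f}")
print(f"model predicts {float(model_reward(planned)):.2f}, "
      f"reality gives {float(true_reward(planned)):.2f}")
```

The planned action sits at the edge of the search range, as far from the training data as the planner is allowed to go.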

3. Trust horizon

Even a good model never reports how far the current rollout can be trusted. The missing quantity is a number of steps that depends on the present state, task phase, and contact regime — the longest rollout worth using, not the longest the model can physically produce. "Errors accumulate" is a statement about the model; the trust horizon is a statement the model should make about itself, and almost no system makes it.

4. Plausible is not correct

A predicted future can be visually and physically plausible while the actual task event did not happen: the gripper looks closed but the object was not grasped, the trajectory looks smooth but a precondition was violated. The unit of progress is wrong. The model is scored on whether the future looks right rather than on whether the task predicate became true.

5. Temporal granularity

Planning at every motor step or every token makes a long task intractable, because the number of decisions grows with the horizon. A flat policy mapping pixels to torques over thousands of steps, or a flat agent reasoning token by token over a multi-hour task, is solving a problem whose decision count is too large by construction.
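The arithmetic behind "too large by construction" is short. The sketch below assumes a fixed number of choices per decision at every level of abstraction, which is a simplification, and the function names are made up for the example. It counts decisions and the base-10 logarithm of the number of candidate plans, with and without chunking.

```python
import math

def decision_count(horizon_steps, steps_per_decision=1):
    """Number of decisions needed to cover the horizon."""
    return math.ceil(horizon_steps / steps_per_decision)

def log10_plan_count(horizon_steps, branching, steps_per_decision=1):
    """log10 of branching ** decisions, i.e. the size of the plan space."""
    return decision_count(horizon_steps, steps_per_decision) * math.log10(branching)

flat = log10_plan_count(3000, branching=4)
chunked = log10_plan_count(3000, branching=4, steps_per_decision=50)
print(f"flat: {decision_count(3000)} decisions, about 10^{flat:.0f} plans")
print(f"chunked: {decision_count(3000, 50)} decisions, about 10^{chunked:.0f} plans")
```

Both plan spaces are far too large to enumerate. The point is only that the exponent scales with the number of decisions, so a coarser decision unit shrinks it linearly.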

6. State persistence

Over a long execution the system loses object identity, location, occluded facts, which subtask is done, and which constraints are still satisfied. The same current observation can require different actions depending on an earlier event that has since left the frame. This is non-Markovian failure: the present observation is no longer a sufficient state.
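A toy version of the non-Markovian case, with a hypothetical fetch-the-cup task and made-up observation strings: two histories end in the same observation but call for different actions. A policy that sees only the current frame cannot tell them apart, while a policy that keeps one remembered fact can.

```python
def reactive_policy(observation):
    """Acts on the current observation only."""
    return "open_drawer" if observation == "drawer_closed" else "wait"

class MemoryPolicy:
    """Keeps one fact that has left the frame: whether the cup went into the drawer."""

    def __init__(self):
        self.cup_in_drawer = False

    def act(self, observation):
        if observation == "cup_placed_in_drawer":
            self.cup_in_drawer = True
        if observation == "drawer_closed":
            return "open_drawer" if self.cup_in_drawer else "search_table"
        return "wait"

def last_action(policy_step, history):
    """Feed a history of observations and return the final action."""
    action = None
    for observation in history:
        action = policy_step(observation)
    return action

history_a = ["cup_placed_in_drawer", "drawer_closed"]
history_b = ["cup_on_table", "drawer_closed"]

reactive = [last_action(reactive_policy, h) for h in (history_a, history_b)]
with_memory = [last_action(MemoryPolicy().act, h) for h in (history_a, history_b)]
print("reactive:", reactive)
print("with memory:", with_memory)
```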

Five moves that recur

Across coding agents, world models, and control, the responses collapse into five families. Each one shortens, externalizes, checks, stabilizes, or re-trains — and each maps onto specific failures above rather than onto "long horizon" in general.

A. Shorten the horizon

Do not cross the whole span at once. Plan a short rollout and bootstrap the tail with a value function (TD-MPC); stack a slow high-level planner over a fast low-level controller so each level sees a shorter horizon (hierarchical world models such as Puppeteer; the classic options framework); plan in a coarser action space of waypoints, subgoals, or logical predicates (LLM task decomposition grounded by affordances, as in SayCan); predict an action chunk instead of one step. This addresses rollout drift and temporal granularity. It can fail when the high level is assumed reliable: if a subgoal is wrong, a perfectly executed low level still misses the task.
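The first of these, a short rollout with a value-function tail, can be sketched on a one-dimensional chain. This is an illustration of the idea rather than TD-MPC itself: the dynamics are known exactly, the planner enumerates every action sequence, and `value` is a hand-written stand-in for a learned value function. All names are assumptions.

```python
from itertools import product

GOAL = 10

def step(state, action):
    """Known deterministic dynamics on an integer line."""
    return state + action

def reward(state):
    """Sparse reward: only the goal state pays."""
    return 1.0 if state == GOAL else 0.0

def value(state):
    """Hand-written stand-in for a learned terminal value."""
    return -abs(GOAL - state)

def plan(state, horizon, use_value):
    """Enumerate all short action sequences and return the best first action."""
    best_score, best_first = None, None
    for seq in product((-1, 1), repeat=horizon):
        s, score = state, 0.0
        for action in seq:
            s = step(s, action)
            score += reward(s)
        if use_value:
            score += value(s)  # bootstrap everything beyond the short rollout
        if best_score is None or score > best_score:
            best_score, best_first = score, seq[0]
    return best_first

def run_mpc(state, horizon, use_value, max_steps=30):
    """Replan every step; return steps taken to reach the goal, or None."""
    for t in range(max_steps):
        if state == GOAL:
            return t
        state = step(state, plan(state, horizon, use_value))
    return None

print("with terminal value:", run_mpc(0, horizon=3, use_value=True))
print("without terminal value:", run_mpc(0, horizon=3, use_value=False))
```

With a three-step rollout the goal is invisible from the start state, so the planner without a terminal value has nothing to optimize and never arrives. The same sketch also shows the stated failure mode: a wrong `value` would steer a perfectly executed short plan to the wrong place.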

B. Externalize state

When the context cannot hold the span, move it outside — and the hard part is retrieval, not capacity. Progress files, structured handoffs, and execution-state memory let an agent continue across sessions that start with no memory (Anthropic's harness work); episodic, keyframe, and geometry-indexed memory let a robot policy fetch the one past frame that the current decision depends on. This addresses state persistence. The longer-context reflex is the weaker version of it; the stronger version retrieves the right evidence at the right time rather than carrying everything forward.
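A minimal sketch of a progress file used as a handoff between sessions. The JSON layout, the task names, and the `run_session` function are hypothetical and are not the format used in Anthropic's harness work. Each session starts with no memory, reads the file, does one unit of work, and writes the file back.

```python
import json
import tempfile
from pathlib import Path

def load_progress(path, all_tasks):
    """Read the handoff file, or start a fresh one on the first session."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text())
    return {"pending": list(all_tasks), "done": [], "notes": []}

def run_session(path, all_tasks):
    """One session: pick up the next pending task and record the result."""
    progress = load_progress(path, all_tasks)
    if progress["pending"]:
        task = progress["pending"].pop(0)
        progress["done"].append(task)  # real work would happen here
        progress["notes"].append(f"{task}: finished")
    Path(path).write_text(json.dumps(progress, indent=2))
    return progress

tasks = ["parse_input", "validate_schema", "export_report"]
with tempfile.TemporaryDirectory() as workdir:
    progress_file = Path(workdir) / "progress.json"
    for _ in range(3):  # three sessions that share nothing except the file
        state = run_session(progress_file, tasks)

print(state["done"])
```

This covers only the storage half of the move. The retrieval half, deciding which recorded fact the current decision needs, is not shown and is the harder part.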

C. Verify, don't trust

Since plausible is not correct, check whether the task event actually occurred instead of whether the future looks right. In coding agents this is the test oracle and executable verification. In robot world models it is predicate or event prediction layered on top of visual features, plus the use of model rollouts as a policy evaluator that is scored against real task success rather than visual quality. This addresses planner exploitation and the plausible-is-not-correct failure. It can fail when the verifier itself is wrong or gameable, which moves the problem rather than removing it.
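A sketch of the difference between ranking imagined futures by how they look and ranking them by a task predicate. The `Future` fields, the lift threshold, and the numbers are invented for the example. In practice the predicate inputs would themselves be predicted, which is where a wrong or gameable verifier comes in.

```python
from dataclasses import dataclass

@dataclass
class Future:
    name: str
    plausibility: float   # how right the imagined rollout looks, in [0, 1]
    gripper_closed: bool
    object_lift_m: float  # how far the object rose with the gripper

def grasp_succeeded(future, min_lift_m=0.05):
    """Task predicate: the object actually came up with the gripper."""
    return future.gripper_closed and future.object_lift_m >= min_lift_m

def pick_by_looks(futures):
    return max(futures, key=lambda f: f.plausibility)

def pick_by_predicate(futures):
    """Keep only verified futures, then break ties by plausibility."""
    verified = [f for f in futures if grasp_succeeded(f)]
    return max(verified, key=lambda f: f.plausibility) if verified else None

futures = [
    Future("smooth_but_empty", 0.95, gripper_closed=True, object_lift_m=0.0),
    Future("jerky_but_grasped", 0.70, gripper_closed=True, object_lift_m=0.12),
]
print("by looks:", pick_by_looks(futures).name)
print("by predicate:", pick_by_predicate(futures).name)
```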

D. Stabilize the rollout

Attack the error source directly. Split dynamics by contact regime, expert, or object so the model stops smoothing across incompatible modes (MoE / compositional dynamics such as PRISM-WM; object-centric latents built on slot representations); keep the latent control-relevant rather than reconstruction-oriented (TD-MPC, Dreamer-style models); penalize epistemic uncertainty so the planner avoids high-error regions (ensemble disagreement, as in PETS-style model-based RL). This addresses rollout drift and, by closing the train-test gap, planner exploitation. The split granularity — how many experts, how many regimes — is usually set by hand.
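The uncertainty penalty can be sketched with a hand-built ensemble. This is not the procedure of any specific paper: the three members below are fixed functions that agree on an assumed training range and extrapolate differently outside it, standing in for separately trained models. The planner scores each candidate by the ensemble mean minus a multiple of the ensemble spread.

```python
import numpy as np

# Hypothetical ensemble: members agree on |a| <= 1 (the assumed training
# range) and curve away from each other outside it.
CURVATURES = (-0.5, 0.0, 2.0)

def ensemble_returns(actions):
    """Predicted return of every member for every candidate action."""
    actions = np.asarray(actions, dtype=float)
    excess = np.maximum(0.0, np.abs(actions) - 1.0)
    return np.stack([actions + c * excess**2 for c in CURVATURES])

def plan(actions, penalty=0.0):
    """Pick the action with the best mean return minus penalty * disagreement."""
    preds = ensemble_returns(actions)
    score = preds.mean(axis=0) - penalty * preds.std(axis=0)
    return float(actions[np.argmax(score)])

candidates = np.linspace(-3.0, 3.0, 61)
plain = plan(candidates)
cautious = plan(candidates, penalty=2.0)
print(f"no penalty: a = {plain:.1f}")
print(f"disagreement penalty: a = {cautious:.1f}")
```

The penalty weight is one more hand-set number, in the same way the split granularity is.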

E. Fix signal and data

Long-horizon failure is often credit assignment, not model capacity. Shape dense, stage-aware progress rewards for sparse tasks; pretrain on human and egocentric video, recover latent actions from action-free video, and use feasible retargeting and sim-to-real so the policy sees enough of the long-task distribution. This addresses the training side of the plausible-is-not-correct and temporal-granularity failures. Dense reward is itself gameable, which feeds straight back into planner exploitation.
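A sketch of what a stage-aware progress signal looks like next to a sparse one. The stage names and the equal weighting of stages are assumptions for illustration, and the within-stage progress estimate is exactly the part that can be gamed.

```python
STAGES = ("reach", "grasp", "lift", "place")

def sparse_reward(stages_done):
    """Pays only when every stage is complete."""
    return 1.0 if stages_done == len(STAGES) else 0.0

def staged_reward(stages_done, stage_progress):
    """Dense progress in [0, 1]: completed stages plus progress in the current one."""
    if stages_done >= len(STAGES):
        return 1.0
    clipped = min(max(stage_progress, 0.0), 1.0)
    return (stages_done + clipped) / len(STAGES)

# Halfway through the lift, after reach and grasp are done.
print("sparse:", sparse_reward(2))
print("staged:", staged_reward(2, 0.5))
```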

Failure mode	A Shorten	B Externalize	C Verify	D Stabilize	E Signal & data
1. Rollout drift	●	-	-	●	○
2. Planner exploitation	-	-	●	●	-
3. Trust horizon	-	-	-	-	-
4. Plausible ≠ correct	-	-	●	-	○
5. Temporal granularity	●	○	-	-	-
6. State persistence	-	●	-	○	-

● the family addresses this failure directly · ○ partially · - not addressed. Read down each column to see what a move does; read across each row to see how a failure is covered. Row 3 is the subject of the last section: no column addresses the trust horizon.

Author's schematic. Six failures against five moves. The matrix is built from the mechanism each move operates on, not from any single paper. The one empty row is the argument of this essay.

Two worlds, the same five moves

The five families are domain-independent. A coding agent and a robot world model look unrelated, but when a failure appears in both, the move that answers it is the same — only the implementation changes. The clearest way to see this is to put the two worlds in the same table.

Move	Coding / LLM agent	Robot world model
A · Shorten	Spec decomposition into features; subagents with branch-and-merge	Hierarchical world model; short-horizon MPC with terminal value
B · Externalize	Progress file, compaction of history, structured handoff	Episodic and geometry-indexed memory; keyframe retrieval
C · Verify	Test oracle, executable checks, held-out tests	Predicate / event verification of imagined futures
D · Stabilize	Context curation that prunes stale, drift-inducing content	Regime-split and object-centric dynamics
E · Signal & data	Synthesizing open-ended tasks that elicit long-horizon behavior	Dense progress reward; human-video and latent-action pretraining

The same five moves, read across two fields. Compaction is latent compression; a progress file is episodic memory; a test oracle is event verification; spec decomposition is hierarchical subgoaling. The experience accumulated in one field transfers to the other once the move, not the domain, is treated as the unit.

This is the practical payoff of the decomposition. There is no need to track a hundred long-horizon papers as a hundred separate ideas. Each one is one of five moves applied to one of six failures in one domain. The map is small.

The one move that is missing: trust horizon

Return to the matrix. Five of the six rows are covered by at least one family. Row three is empty, and the emptiness is not an accident of how the table was drawn.

Moves A through E all improve the quality of something: the model, the planner, the memory, the data. A better model drifts more slowly; a model trained on the states its planner induces is harder to exploit; better retrieval keeps more state; better reward shaping assigns credit further. None of them produces a calibration. They make the rollout more trustworthy without telling the planner how far to trust it, and a more accurate model with no trust estimate is still queried blindly.

That gap matters because closed-loop planning has to use the rollout. Without an estimate of how far the rollout remains valid, the planning horizon becomes a fixed hyperparameter: too short on easy states, where the model could safely look further, and too long on brittle states such as contact and mode switches, where the model has already drifted. The horizon should be a function of the state, and right now it is a constant.

One route to it runs through move D. Once dynamics are split by regime or expert — the direction PRISM-WM takes for contact-rich control — each regime carries its own instrumentation: its own prediction consistency, its own uncertainty, its own evidence of when it stops applying. That structure is what a state-dependent trust horizon needs. The intervention is small and testable: add a horizon estimate on top of an existing world model and gate the planner with it, without retraining the base dynamics. It is the one place on this map where the move does not yet exist, which makes it the most concrete thing to build next.
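To make the proposed intervention concrete, here is one possible shape for it, offered as a sketch and not as an existing method or as PRISM-WM's design. It assumes an ensemble whose members share the flight dynamics and disagree only about contact, and it uses ensemble spread as an uncalibrated proxy for trust. The base dynamics are left untouched, and only the planning horizon is gated.

```python
import numpy as np

# Hypothetical ensemble: members share the flight dynamics and disagree
# only about the bounce, standing in for model uncertainty at contact.
RESTITUTIONS = (0.3, 0.5, 0.7)

def member_step(x, restitution):
    """One ensemble member: fall by 0.25 in flight, bounce at contact."""
    return x - 0.25 if x > 0 else -restitution * x + 1.0

def trust_horizon(x0, max_horizon=20, tolerance=0.01):
    """Steps the members can be rolled out before their spread exceeds tolerance."""
    states = [x0] * len(RESTITUTIONS)
    for t in range(1, max_horizon + 1):
        states = [member_step(x, r) for x, r in zip(states, RESTITUTIONS)]
        if np.std(states) > tolerance:
            return t - 1
    return max_horizon

def gated_horizon(x0, nominal=10):
    """Planning horizon as a function of state, never below one step."""
    return max(1, min(nominal, trust_horizon(x0)))

for x0 in (5.0, 0.9, 0.15):
    print(f"state {x0}: plan {gated_horizon(x0)} steps ahead")
```

The horizon comes out long far from contact and short just before it, which is the state dependence the fixed hyperparameter lacks. Whether spread of this kind is calibrated against real rollout error is the open question, and this sketch does not answer it.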

The six failures and five moves already account for most of the recent work. What no family yet does is let the model report its own trustable number of steps. That is the open problem hiding inside the phrase "long horizon."

Sources

Temporal Difference Learning for Model Predictive Control, Hansen, Wang, and Su, ICML 2022 — short-horizon latent rollout plus terminal value.
Effective harnesses for long-running agents and Harness design for long-running application development, Anthropic Engineering — long-horizon agent failure modes and artifact handoff.
Between MDPs and semi-MDPs: A framework for temporal abstraction, Sutton, Precup, and Singh, Artificial Intelligence 1999 — the options framework for temporal abstraction.
Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero), Schrittwieser et al., Nature 2020 — a world model organized around a planning interface.
