Two Long Horizons

June 2026

The same phrase names a reliability problem in language-model agents and a fidelity problem in world models. Telling the two apart explains why one horizon has been doubling every few months while the other, in its hardest form, is still measured in minutes.

Two research literatures use the phrase "long horizon" constantly, and they rarely cite each other. In work on language-model agents, a long horizon is a claim about reliability over time: how long a task an agent can carry out before it makes a fatal mistake. In work on world models and model-based reinforcement learning, a long horizon is a claim about fidelity over rollout: how many steps a learned model can predict before its imagined future stops matching the real one. The phrase is shared. The axis underneath it is not.

The distinction is worth holding precisely, because the two horizons behave differently enough that advice from one field is often wrong in the other. One has been doubling on a calendar; the other gets longer only when the model itself gets more accurate.

The agent's horizon is measured in human time

The cleanest definition of an agent's long horizon comes from METR. Their metric is the 50%-task-completion time horizon: the length of tasks, measured by how long a human expert takes, that a model can finish with 50% success (Kwa et al., 2025). The axis is human time. A model with a one-hour horizon succeeds half the time on tasks that take a person about an hour; Claude 3.7 Sonnet was at roughly that level in March 2025. The headline result is a trend: the horizon has doubled roughly every seven months since 2019, and faster, closer to every four months, through 2024–2025. Later frontier models pushed the 50% figure from under an hour to several hours.
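The doubling trend is simple arithmetic, and a few lines make the difference between the two quoted rates concrete. This is a minimal sketch that assumes clean exponential growth; the function name `project_horizon`, the one-hour starting point, and the 24-month window are illustrative choices, not METR data.

```python
def project_horizon(h0_minutes: float, months_elapsed: float, doubling_months: float) -> float:
    """Time horizon after `months_elapsed`, assuming one doubling every `doubling_months`."""
    return h0_minutes * 2.0 ** (months_elapsed / doubling_months)


# Hypothetical starting point: a one-hour horizon, projected 24 months ahead.
START_MINUTES = 60.0
for doubling_months in (7.0, 4.0):
    projected = project_horizon(START_MINUTES, 24.0, doubling_months)
    print(f"doubling every {doubling_months:.0f} months: {projected / 60.0:.1f} hours")
```

At a four-month doubling time, 24 months is six doublings, a factor of 64. At seven months it is a little under three and a half doublings, roughly a factor of 11.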

Two details keep this from being a general autonomy claim, and both matter for the comparison. First, the y-axis is human labor, not machine runtime. METR is explicit that the horizon measures "the amount of serial human labor [a model] can replace," not how long it can run on its own (METR, 2026). A five-hour horizon is not five hours of unattended operation. Second, reliability does most of the work. The 80%-success horizon is about five times shorter than the 50% horizon for the same model. Demanding higher reliability collapses the usable task length, because the failures tend to lie not in reasoning but in execution stretched over many steps.
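Reading a horizon off a success curve at a chosen reliability can be sketched as follows. The sketch assumes the curve is logistic in the logarithm of human task length, with an illustrative 50% horizon of 60 minutes; the names `success_prob` and `horizon_at` are hypothetical, and the slope is chosen by hand so that the 80% horizon comes out five times shorter, matching the ratio quoted above rather than any fitted value.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


def success_prob(task_minutes: float, h50_minutes: float, slope: float) -> float:
    """Assumed logistic success curve in log2(task length); equals 0.5 at h50."""
    x = slope * (math.log2(h50_minutes) - math.log2(task_minutes))
    return 1.0 / (1.0 + math.exp(-x))


def horizon_at(reliability: float, h50_minutes: float, slope: float) -> float:
    """Invert the curve: the task length at which success equals `reliability`."""
    return h50_minutes * 2.0 ** (-logit(reliability) / slope)


# Slope picked so that the 80% horizon is one fifth of the 50% horizon.
slope = logit(0.8) / math.log2(5.0)
h50 = 60.0  # illustrative value, in minutes
h80 = horizon_at(0.8, h50, slope)
print(f"50% horizon: {h50:.0f} min, 80% horizon: {h80:.0f} min")
```

The same model has both horizons at once. Which one matters depends on how much failure the surrounding workflow can absorb.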

That last point has its own paper. The Illusion of Diminishing Returns isolates execution by handing the model the full plan and knowledge for a task and asking only that it carry out many simple steps in sequence. Models that reach 100% accuracy on the first step still degrade over the run, and the degradation has a clean shape: completing an H-step task requires succeeding at each step, so end-to-end success behaves like single-step accuracy raised to the power of the horizon. A small gain in per-step accuracy compounds into a large gain in achievable length, which is why short benchmarks look saturated while long tasks keep improving. Toby Ord makes the same curve explicit from the other side: model an agent as failing at a constant rate per human-minute of work, and you get an exponentially declining success rate with task length, and a characteristic "half-life" for each agent (Ord, 2025).
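Both pictures in this paragraph reduce to a few lines of arithmetic. The sketch below assumes independent steps with a fixed per-step accuracy, and, for the half-life form, a constant failure rate per human-minute. All function names and the example accuracies are illustrative, not taken from either source.

```python
import math


def end_to_end_success(step_accuracy: float, horizon_steps: int) -> float:
    """Success on an H-step task when every step must succeed independently."""
    return step_accuracy ** horizon_steps


def max_steps_at(step_accuracy: float, target_success: float) -> int:
    """Largest H for which step_accuracy ** H is still at least target_success."""
    return math.floor(math.log(target_success) / math.log(step_accuracy))


def success_from_half_life(task_minutes: float, half_life_minutes: float) -> float:
    """Constant-hazard model: success halves every `half_life_minutes` of task length."""
    return 0.5 ** (task_minutes / half_life_minutes)


def horizon_from_half_life(reliability: float, half_life_minutes: float) -> float:
    """Task length at which the constant-hazard success rate equals `reliability`."""
    return half_life_minutes * math.log(reliability) / math.log(0.5)


# A small per-step gain compounds into a large gain in achievable length.
for accuracy in (0.99, 0.999):
    print(accuracy, "->", max_steps_at(accuracy, 0.5), "steps at 50% success")

# Under a constant hazard, the 80% horizon is a fixed fraction of the half-life.
print(horizon_from_half_life(0.8, 60.0), "minutes at 80% for a 60-minute half-life")
```

Moving per-step accuracy from 99% to 99.9% raises the 50%-success length about tenfold. Under the constant-hazard assumption the 80% horizon is about a third of the 50% horizon, a smaller gap than the roughly fivefold one quoted above, so the pure exponential should be read as an approximation rather than a fit.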

So the agent's long horizon is a measurement of reliability. The unit is human task-time; the quantity that accumulates is the probability of a task-level mistake; and the way to extend it is to raise per-step reliability or to shorten the chain of steps that all have to go right.

The world model's horizon is counted in its own steps

A world model's long horizon lives on a different axis. Here the horizon is the number of steps the learned model is unrolled forward, and the quantity that accumulates is prediction error.

The canonical statement is MBPO's. A one-step dynamics model applied autoregressively feeds each prediction back in as the next input, so "small errors compound over the prediction horizon" (Janner et al., 2019). The classical result behind this, the simulation lemma, bounds value error by the one-step model error multiplied by a factor that grows with the effective horizon, quadratically in the worst case. MBPO's response is to keep the model horizon short and decoupled from the task horizon: instead of a few long rollouts from the start state, it runs many short rollouts branched from real states sampled out of the replay buffer. In practice the rollout length is often one, and the authors note their model is accurate enough for 200-step rollouts yet learns better from far shorter ones; 500-step rollouts are too inaccurate to use at all.
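A toy example shows why branching short rollouts from real states keeps error small. This is a sketch under loud assumptions: the dynamics are a one-dimensional linear map, the "learned model" is the same map with a 1% coefficient error, and the names `rollout` and `branched_errors` are hypothetical. MBPO itself uses learned neural dynamics models, not anything this simple.

```python
def rollout(a: float, x0: float, steps: int) -> list[float]:
    """Unroll the scalar dynamics x[t+1] = a * x[t], feeding each output back in."""
    states = [x0]
    for _ in range(steps):
        states.append(a * states[-1])
    return states


TRUE_A = 1.00   # hypothetical real dynamics
MODEL_A = 1.01  # hypothetical learned model with a 1% one-step error

real = rollout(TRUE_A, 1.0, 200)
imagined = rollout(MODEL_A, 1.0, 200)
long_rollout_error = [abs(m - r) for m, r in zip(imagined, real)]


def branched_errors(real_states: list[float], k: int) -> list[float]:
    """Error at the end of k-step model rollouts that each start from a real state."""
    return [
        abs(rollout(MODEL_A, s, k)[-1] - rollout(TRUE_A, s, k)[-1])
        for s in real_states
    ]


print("error after 1, 5, 200 steps:",
      long_rollout_error[1], long_rollout_error[5], long_rollout_error[200])
print("worst error over 1-step branches:", max(branched_errors(real[:50], 1)))
```

The single long rollout lets a 1% one-step error grow by orders of magnitude, while every one-step branch restarts from a real state and never sees more than the one-step error.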

The same instinct — roll the model a few steps, no further — runs through the planning world models. Dreamer learns behavior entirely inside imagined latent trajectories of a fixed length: the imagination horizon is exactly 15 steps, held constant across Dreamer, DreamerV2, and DreamerV3 (Hafner et al., 2023). TD-MPC plans over a latent horizon of 5; TD-MPC2 shortens the default to 3. MuZero unrolls its learned model for only 5 hypothetical steps during training. None of these models is asked to predict far. Each is unrolled briefly and then stopped.

For generative video world models, the horizon is measured in frames and seconds, and the failure is visible rather than scalar. DIAMOND traces long-rollout collapse to compounding error that pushes the model out of distribution, and shows the diffusion formulation changes how fast that happens: a DDPM-based model drifts badly over a 1000-step rollout while an EDM-based one stays stable, even with a single denoising step. Genie 3 generates an interactive world at 24 frames per second and 720p that stays consistent for a few minutes, with visual memory reaching about one minute back — and DeepMind names that same window as the limit, because inaccuracies accumulate over time. The honest summary of the video line is that "long horizon" still means minutes, and the field describes its own state as "far from persistence."

So the world-model long horizon is a measurement of how far the prediction holds. The unit is model timesteps or frames; the quantity that accumulates is divergence of the predicted state from the real one; and the way to extend it is to make the model more accurate per step, or to roll it fewer steps before handing off.

Figure (schematic of the two axes, not reproduced data). Left: the success-versus-human-task-length curve METR fits to read off a time horizon at a chosen reliability; the 80% horizon sits to the left of the 50% horizon. Right: prediction error growing over an autoregressive rollout, the curve MBPO and DIAMOND measure, with a learned value standing in for everything past the short horizon H.

Side by side

Reading the two horizons against each other makes the structural differences concrete. Every row below is the same word resolving to a different object.

Dimension	In a language-model agent	In a world model
The axis	Human task-completion time. A model's "time horizon" is the human duration of tasks it finishes at a set reliability.	The model's own timesteps, or frames of video, that it is unrolled forward.
What accumulates per step	Probability of a task-level mistake. End-to-end success ≈ single-step accuracy raised to the horizon.	State-prediction error. One-step model error compounds over the rollout.
How it is measured	Fit a success curve against human task length, read a horizon at 50% / 80%, track its doubling over calendar time.	Fix a horizon `H`, then measure rollout drift directly — pixel error, consistency duration.
How it is shortened	Split the task into short grounded sessions; persist memory to disk; verify each step.	Roll the model a few steps, then bootstrap a learned value; branch from real states; abstract into latent space.
Grounding	Added at every step (tool output, test results). The main thing that extends the horizon.	Removed during imagination. The model runs ungrounded for `H` steps by design.
The object at the far end	A sparse, binary verifier (tests pass or fail). It creates the credit-assignment problem.	A dense, continuous value (TD bootstrap). It dissolves the credit-assignment problem.
What more scale or "thinking" buys	The time horizon doubles on a calendar; thinking removes self-conditioning and lengthens execution.	A longer `H` often hurts; usable horizon grows from model fidelity, not calendar time.

The shared mechanism: a model conditioning on its own output

The two horizons are not unrelated. Both fields, when they look for the deepest cause of failure, arrive at the same place: a model consuming its own previous outputs. And both name the same ancestor for it — exposure bias, the train-test mismatch first framed for sequence models a decade ago, where a network trained on ground-truth prefixes is, at inference, fed its own imperfect generations and never learned to recover (Ranzato et al., 2015).

What differs is the surface the loop runs on. In a language-model agent it is a reliability effect. The Illusion of Diminishing Returns shows it causally: inject errors into the model's own context and its next-step accuracy falls, even at fixed context length — the model becomes more likely to err when its history already contains errors. The paper calls this self-conditioning, and reports two things that matter for the comparison. Scaling the model does not remove it: 200-billion-parameter models still self-condition. Explicit "thinking" — RL-trained reasoning — does remove it, and also stretches how far a model can execute in a single turn, from around 120 steps for one model to over a thousand for another.
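The shape of the self-conditioning effect can be illustrated with a toy model. This is not the paper's experiment: it simply assumes that the chance of an error at each step equals a base rate plus a term proportional to the fraction of earlier steps that were wrong. The function name and both parameters are hypothetical.

```python
def expected_step_accuracy(base_error: float, sensitivity: float, steps: int) -> list[float]:
    """Expected accuracy at each step when past errors in context raise the next error rate."""
    expected_errors = 0.0  # expected number of errors in the history so far
    accuracies = []
    for t in range(steps):
        error_fraction = expected_errors / t if t > 0 else 0.0
        p_error = min(1.0, base_error + sensitivity * error_fraction)
        accuracies.append(1.0 - p_error)
        expected_errors += p_error
    return accuracies


no_conditioning = expected_step_accuracy(base_error=0.02, sensitivity=0.0, steps=100)
self_conditioning = expected_step_accuracy(base_error=0.02, sensitivity=0.5, steps=100)
print("step 1 vs step 100, no conditioning:  ", no_conditioning[0], no_conditioning[-1])
print("step 1 vs step 100, self-conditioning:", self_conditioning[0], self_conditioning[-1])
```

With the sensitivity set to zero, per-step accuracy stays flat and only the compounding of independent steps remains. With a positive sensitivity, later steps are less accurate than early ones even though nothing about the task has changed.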

In a world model the same loop is a fidelity effect. Each predicted state or frame becomes the input to the next, so error does not just persist, it pushes the rollout off the distribution the model was trained on, which raises the next error. This is exactly DIAMOND's drift, and the video literature's pair of failure modes — history forgetting and temporal degradation — both attributed to the teacher-forcing mismatch (Meta-ARVDM, 2025). The cures differ in kind: the agent side adds grounding and trains recovery; the world-model side shortens rollouts, changes the diffusion objective, conditions on more history, or anchors generation to its opening frames. The same loop produces a reliability problem on one side and a perceptual-drift problem on the other.

The same trick, inverted: a learned value at the end of the horizon

Both fields also stop short of running the full horizon, and lean on a learned object to cover the rest. This is where the comparison is most useful, because the object plays the same structural role and produces the opposite consequence.

In a world model the terminal object is a value function. TD-MPC plans five steps and then adds a single learned terminal Q-value that estimates all return beyond the horizon, so the short rollout stands in for an effectively infinite one. MuZero bootstraps onto an n-step search value; Dreamer closes its 15-step return with a critic, and reports that without that bootstrap the agent becomes "shortsighted" and only solves reactive tasks. This terminal value is dense and continuous, and it is trained by one-step temporal-difference bootstrapping — so the long horizon is never rolled out anywhere. It is folded into a value function during training. There is no sparse-reward credit-assignment crisis, because the value carries information at every step.
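The way a terminal value lets a short rollout stand in for a long one can be sketched directly. The example assumes a constant reward of 1 per step and a discount of 0.99, and it passes the terminal value in as a plain number. In TD-MPC that number would come from a learned Q-function over latent states, which is not modeled here, and `h_step_return` is a hypothetical name.

```python
def h_step_return(rewards: list[float], terminal_value: float, gamma: float) -> float:
    """Discounted sum of H imagined rewards plus gamma**H times a terminal value."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total + (gamma ** len(rewards)) * terminal_value


GAMMA = 0.99
# With reward 1 at every step forever, the true discounted return is 1 / (1 - gamma).
true_value = 1.0 / (1.0 - GAMMA)

five_imagined_rewards = [1.0] * 5
truncated = h_step_return(five_imagined_rewards, 0.0, GAMMA)
bootstrapped = h_step_return(five_imagined_rewards, true_value, GAMMA)
print("true:", true_value, "truncated:", truncated, "bootstrapped:", bootstrapped)
```

Without the terminal value the five-step return sees less than 5% of the true return, which is the "shortsighted" failure in miniature. With an accurate terminal value the five-step estimate recovers the full return, so all remaining error comes from the value function rather than from a long rollout.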

In a language-model agent the object at the end of the horizon is a verifier, and it is sparse and discrete. Most agentic tasks emit a single scalar at the end: hidden tests pass or fail, an evaluator returns pass or needs-work, functional correctness is graded once at termination. Because that signal is terminal and binary, it creates the credit-assignment problem rather than dissolving it. Value-free RL such as GRPO has to spread one outcome reward uniformly across every turn of a long trajectory, unable to tell which tool call mattered (Zeng et al., 2025). A whole line of work exists only to densify that terminal signal — turn-level rewards, hindsight redistribution, tree-search value estimation — and RL theory warns that the obvious fix is not enough: adding a constant makes the reward non-sparse without making it informative (Arumugam et al., 2021).
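The uniform spreading of one outcome reward can be sketched in a few lines. This is a simplified, group-relative advantage in the style of GRPO, assuming one scalar outcome reward per trajectory and normalization by the population standard deviation. The function names, the four-trajectory group, and the six-turn length are illustrative, and real implementations differ in details.

```python
import statistics


def group_relative_advantages(outcome_rewards: list[float]) -> list[float]:
    """Normalize each trajectory's outcome reward against its sampling group."""
    mean = statistics.fmean(outcome_rewards)
    std = statistics.pstdev(outcome_rewards)
    if std == 0.0:
        return [0.0 for _ in outcome_rewards]  # no contrast within the group, no signal
    return [(r - mean) / std for r in outcome_rewards]


def broadcast_to_turns(advantage: float, num_turns: int) -> list[float]:
    """Every turn of a trajectory receives the same trajectory-level advantage."""
    return [advantage] * num_turns


# Four sampled trajectories for one task; the verifier says pass (1) or fail (0).
sparse = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
per_turn = broadcast_to_turns(sparse[0], num_turns=6)

# Adding a constant makes every reward nonzero but changes nothing that is learned.
shifted = group_relative_advantages([2.0, 1.0, 1.0, 2.0])
print(sparse, per_turn, shifted)
```

Every turn in a passing trajectory gets the same credit, whether or not that turn mattered. The shifted rewards yield identical advantages, which illustrates the warning that a reward can be made non-sparse without becoming more informative.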

The same structural move, capping the horizon with a learned estimate instead of rolling it all the way out, does opposite work in the two fields. The world model's value is dense and trained by TD bootstrapping, so the long horizon is folded into a function and the long-horizon problem dissolves. The agent's verifier is a single binary signal at termination, which is what turns long-horizon credit assignment into a research problem.

Where the two horizons meet

The cleanest evidence that these are genuinely different horizons is the place they stop being different. METR's time horizon is a software-and-reasoning result. Run the same methodology on tasks that need perception and physical grounding and it collapses: horizons for visual computer-use tasks are 40 to 100 times shorter than for software, and Claude Sonnet 4.5's horizon for real-world coffee-making is about two minutes (METR, 2026).

Two minutes is the world-model regime. It is the same order as Genie 3's consistency window, and it appears for the same reason: in a physical, partially observed setting, an agent has to maintain an internal model of a world that keeps changing, and that model drifts. In domains with no clean symbolic state to check against, how long the agent can stay reliable becomes bounded by how far the model's own predictions stay faithful. The two axes stay distinct in the places each field usually measures — clean software tasks for agents, controlled dynamics for world models — and they appear to converge exactly in the messy physical middle that neither field has solved.

Open questions

The comparison is useful mostly because it makes a few questions sharper than either field asks alone.

Is the agent horizon in physical domains literally bounded by world-model rollout fidelity, or only correlated with it? The coffee-making number suggests a bound; nobody has measured it as one.
Self-conditioning has one name and two cures. Is there a world-model analog of RL-trained recovery — a model taught to correct its own drift — or is drift-suppression a fundamentally different lever?
The "densify the terminal signal" work in agent RL is, in effect, an attempt to grow a value function where there was only a verifier. How close can a sparse outcome signal be pushed toward the dense terminal value that already makes world-model planning stable?
METR's doubling is a calendar-time scaling law. World-model horizon progress reads as architectural jumps in rollout duration, not a smooth curve. Is there a measurable scaling law for consistency duration, or is it gated by architecture?
How much of the agent horizon depends on clean, auto-scorable tasks? METR's own messiness analysis suggests the headline trend rides the cleanest tasks; on messy, partially observed tasks the distinction from world-model drift may narrow.

Sources

Measuring AI Ability to Complete Long Tasks, Kwa et al. (METR), 2025 — the 50% time-horizon metric and the ~7-month doubling trend.
Clarifying limitations of time horizon, METR, 2026 — horizon as human serial labor, and its 40–100× collapse under perception / physical grounding.
The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs, Sinha, Arun, Goel, Staab, Geiping, 2025 — per-step-accuracy compounding and the causal self-conditioning effect.
Is there a half-life for the success rates of AI agents?, Toby Ord, 2025 — constant-hazard model of agent success decay.
Effective harnesses for long-running agents, Anthropic, 2025 — decomposition, durable on-disk handoff, and verification against false success.
When to Trust Your Model: Model-Based Policy Optimization, Janner, Fu, Zhang, Levine, NeurIPS 2019 — compounding error and short branched rollouts.
Mastering Diverse Domains through World Models (DreamerV3), Hafner, Pasukonis, Ba, Lillicrap, 2023 — imagination horizon H=15 with a value-bootstrapped return.
Temporal Difference Learning for Model Predictive Control (TD-MPC), Hansen, Wang, Su, 2022, and TD-MPC2, 2024 — short latent planning plus a learned terminal value.
Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero), Schrittwieser et al., 2020 — five-step model unroll with a bootstrapped search value.
Diffusion for World Modeling: Visual Details Matter in Atari (DIAMOND), Alonso et al., NeurIPS 2024 — rollout drift measured directly, DDPM vs EDM.
Genie 3: A new frontier for world models, Google DeepMind, 2025 — interactive worlds consistent for a few minutes.
Error Analyses of Auto-Regressive Video Diffusion Models, 2025 — history forgetting and temporal degradation as exposure bias.
From Masks to Worlds: A Hitchhiker's Guide to World Models, 2025 — the Genie progression and the "far from persistence" framing.
Sequence Level Training with Recurrent Neural Networks, Ranzato et al., 2015 — the original exposure-bias framing and its RL remedy.
Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Reward Design, 2025, and An Information-Theoretic Perspective on Credit Assignment, 2021 — why a sparse terminal reward makes long-horizon credit assignment hard.
Long-horizon agent benchmarks: OSWorld, GAIA, τ-bench — step budgets, tiered horizons, and pass^k reliability decay.
