Two Long Horizons

June 2026

The same phrase names a reliability problem in language-model agents and a fidelity problem in world models. Telling the two apart explains why one horizon has been doubling every few months while the other, in its hardest form, is still measured in seconds.

Two research literatures use the phrase long horizon constantly, and they rarely cite each other. In work on language-model agents, a long horizon is a claim about reliability over time: how long a task an agent can carry before it makes one fatal mistake. In work on world models and model-based reinforcement learning, a long horizon is a claim about fidelity over rollout: how many steps a learned model can predict before its imagined future stops matching the real one. The phrase is shared. The axis underneath it is not.

The distinction is worth holding precisely, because the two horizons behave differently enough that advice from one field is often wrong in the other. One has been doubling on a calendar; the other gets longer only when the model itself gets more accurate.

The agent's horizon is measured in human time

The cleanest definition of an agent's long horizon comes from METR. Their metric is the 50%-task-completion time horizon: the length of tasks, measured by how long a human expert takes, that a model can finish with 50% success (Kwa et al., 2025). The axis is human time. A model with a one-hour horizon succeeds half the time on tasks that take a person about an hour, which is roughly where Claude 3.7 Sonnet stood in March 2025. The headline is a trend: that horizon has doubled roughly every seven months since 2019, and faster, closer to every four months, through 2024–2025. Later frontier models pushed the 50% figure from under an hour to several hours.

Two details keep this from being a general autonomy claim, and both matter for the comparison. First, the horizon is measured in human labor, not machine runtime. METR is explicit that the horizon measures "the amount of serial human labor [a model] can replace," not how long it can run on its own (METR, 2026). A five-hour horizon is not five hours of unattended operation. Second, reliability does most of the work. The 80%-success horizon is about five times shorter than the 50% horizon for the same model. Demanding higher reliability collapses the usable task length, because the failures are not in reasoning; they are in execution stretched over many steps.
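The gap between the 50% and 80% horizons can be made concrete with a small sketch. It assumes, for illustration only, that success falls off as a logistic curve in the logarithm of human task length. The function names, the 60-minute horizon, and the slope are hypothetical; the slope is chosen solely so that the 80% horizon comes out five times shorter than the 50% one.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def success_probability(task_minutes: float, h50_minutes: float, slope: float) -> float:
    """Assumed success curve: logistic in log2(task length).

    Equals 0.5 exactly when task_minutes == h50_minutes and falls as tasks get longer.
    """
    x = slope * (math.log2(h50_minutes) - math.log2(task_minutes))
    return 1.0 / (1.0 + math.exp(-x))


def horizon_at(reliability: float, h50_minutes: float, slope: float) -> float:
    """Task length at which the assumed curve crosses the requested reliability."""
    return h50_minutes * 2.0 ** (-logit(reliability) / slope)


# Hypothetical numbers: a 60-minute 50% horizon, and a slope picked so that
# the 80% horizon is five times shorter (an illustration, not a fitted value).
slope = logit(0.8) / math.log2(5.0)
h50 = horizon_at(0.5, 60.0, slope)
h80 = horizon_at(0.8, 60.0, slope)
print(f"50% horizon: {h50:.1f} min, 80% horizon: {h80:.1f} min")
```

Under these assumptions the same curve that gives a one-hour horizon at 50% gives a twelve-minute horizon at 80%, which is the sense in which the reliability threshold, not the model, sets the usable task length.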

That last point has its own paper. The Illusion of Diminishing Returns isolates execution by handing the model the full plan and knowledge for a task and asking only that it carry out many simple steps in sequence. Models that reach 100% accuracy on the first step still degrade over the run, and the degradation has a clean shape: completing an H-step task requires succeeding at each step, so end-to-end success behaves like single-step accuracy raised to the power of the horizon. A small gain in per-step accuracy compounds into a large gain in achievable length, which is why short benchmarks look saturated while long tasks keep improving. Toby Ord makes the same curve explicit from the other side: model an agent as failing at a constant rate per human-minute of work, and you get an exponentially declining success rate with task length, and a characteristic "half-life" for each agent (Ord, 2025).
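Both curves in this paragraph reduce to a few lines of arithmetic. The sketch below assumes independent steps with a fixed per-step accuracy for the first model, and a constant failure rate per human-minute for the second; the function names and the example numbers are hypothetical, not taken from either paper.

```python
import math


def end_to_end_success(step_accuracy: float, num_steps: int) -> float:
    """Success on an H-step chain if every step must succeed independently."""
    return step_accuracy ** num_steps


def steps_at_success(step_accuracy: float, target_success: float) -> float:
    """Chain length (real-valued) at which end-to-end success equals the target."""
    return math.log(target_success) / math.log(step_accuracy)


def half_life_success(task_minutes: float, half_life_minutes: float) -> float:
    """Constant-hazard model: success halves every half_life_minutes of task length."""
    return 0.5 ** (task_minutes / half_life_minutes)


def horizon_from_half_life(half_life_minutes: float, target_success: float) -> float:
    """Task length at which the constant-hazard curve crosses the target success."""
    return half_life_minutes * math.log(target_success) / math.log(0.5)


# Halving the per-step error rate (1% -> 0.5%) roughly doubles the 50% chain length.
print(steps_at_success(0.99, 0.5))    # roughly 69 steps
print(steps_at_success(0.995, 0.5))   # roughly 138 steps

# Under a constant hazard, the 80% horizon is a fixed fraction of the half-life.
print(horizon_from_half_life(60.0, 0.8))  # about a third of the 60-minute half-life
```

The first pair of numbers is the compounding effect in miniature: a per-step change that would be invisible on a short benchmark moves the achievable length by a factor of about two.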

So the agent's long horizon is a measurement of reliability. The unit is human task-time; the quantity that accumulates is the probability of a task-level mistake; and the way to extend it is to raise per-step reliability or to shorten the chain of steps that all have to go right.

The world model's horizon is counted in its own steps

A world model's long horizon lives on a different axis. Here the horizon is the number of steps the learned model is unrolled forward, and the quantity that accumulates is prediction error.

The canonical statement is MBPO's. A one-step dynamics model applied autoregressively feeds each prediction back in as the next input, so "small errors compound over the prediction horizon" (Janner et al., 2019). The classical bound behind this, the simulation lemma, scales value error by one-step model error times the effective horizon — up to quadratically in the worst case. MBPO's response is to keep the model horizon short and decoupled from the task horizon: instead of a few long rollouts from the start state, it runs many short rollouts branched from real states sampled out of the replay buffer. In practice the rollout length is often one, and the authors note their model is accurate enough for 200-step rollouts yet learns better from far shorter ones; 500-step rollouts are too inaccurate to use at all.
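A toy example shows why branching short rollouts from real states helps. This is a minimal sketch on made-up scalar dynamics, not MBPO itself: the "true" system, the slightly wrong "learned" model, and every function name are assumptions chosen so the arithmetic is easy to follow.

```python
def true_step(x: float) -> float:
    """Hypothetical real dynamics."""
    return 0.9 * x + 1.0


def model_step(x: float) -> float:
    """Hypothetical learned model with a small one-step error (0.92 instead of 0.9)."""
    return 0.92 * x + 1.0


def rollout(step_fn, x0: float, num_steps: int) -> list:
    """Apply step_fn autoregressively, feeding each output back in as the next input."""
    states = [x0]
    for _ in range(num_steps):
        states.append(step_fn(states[-1]))
    return states


def long_rollout_error(x0: float, num_steps: int) -> list:
    """Per-step gap between one long model rollout and the real trajectory."""
    real = rollout(true_step, x0, num_steps)
    imagined = rollout(model_step, x0, num_steps)
    return [abs(a - b) for a, b in zip(imagined, real)]


def branched_rollout_error(real_states: list, branch_length: int) -> float:
    """Worst end-of-branch gap over short model rollouts restarted from real states."""
    worst = 0.0
    for start in real_states:
        real_end = rollout(true_step, start, branch_length)[-1]
        model_end = rollout(model_step, start, branch_length)[-1]
        worst = max(worst, abs(model_end - real_end))
    return worst


replay_buffer = rollout(true_step, 0.0, 50)  # stand-in for real states in a buffer
long_errors = long_rollout_error(0.0, 50)
short_error = branched_rollout_error(replay_buffer, branch_length=1)
print(f"error after 50 imagined steps: {long_errors[-1]:.3f}")
print(f"worst error of 1-step branches: {short_error:.3f}")
```

In this toy the long rollout drifts toward the model's own fixed point, while each one-step branch only ever pays the one-step error, because it restarts from a state the real system actually visited.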

The same instinct — roll the model a few steps, no further — runs through the planning world models. Dreamer learns behavior entirely inside imagined latent trajectories of a fixed length: the imagination horizon is exactly 15 steps, held constant across Dreamer, DreamerV2, and DreamerV3 (Hafner et al., 2023). TD-MPC plans over a latent horizon of 5; TD-MPC2 shortens the default to 3. MuZero unrolls its learned model for only 5 hypothetical steps during training. None of these models is asked to predict far. Each is unrolled briefly and then stopped.

For generative video world models, the horizon is measured in frames and seconds, and the failure is visible rather than scalar. DIAMOND traces long-rollout collapse to compounding error that pushes the model out of distribution, and shows the diffusion formulation changes how fast that happens: a DDPM-based model drifts badly over a 1000-step rollout while an EDM-based one stays stable, even with a single denoising step. Genie 3 generates an interactive world at 24 frames per second and 720p that stays consistent for a few minutes, with visual memory reaching about one minute back — and DeepMind names that same window as the limit, because inaccuracies accumulate over time. The honest summary of the video line is that "long horizon" still means minutes, and the field describes its own state as "far from persistence."

So the world-model long horizon is a measurement of how far the prediction holds. The unit is model timesteps or frames; the quantity that accumulates is divergence of the predicted state from the real one; and the way to extend it is to make the model more accurate per step, or to roll it fewer steps before handing off.

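[Figure: two panels. Left panel, LANGUAGE-MODEL AGENT: task success against human task length, with the horizons t₈₀ and t₅₀ marked where the curve crosses 0.8 and 0.5; what accumulates is per-step reliability. Right panel, WORLD MODEL: prediction error against rollout step k, comparing a naïve rollout with a short rollout of length H plus a value that covers the rest; what accumulates is state-prediction error.]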
Schematic of the two axes, not reproduced data. Left: the success-versus-human-task-length curve METR fits to read off a time horizon at a chosen reliability; the 80% horizon sits to the left of the 50% horizon. Right: prediction error growing over an autoregressive rollout, the curve MBPO and DIAMOND measure, with a learned value standing in for everything past the short horizon H.

Side by side

Reading the two horizons against each other makes the structural differences concrete. Every row below is the same word resolving to a different object.

| Dimension | In a language-model agent | In a world model |
| --- | --- | --- |
| The axis | Human task-completion time. A model's "time horizon" is the human duration of tasks it finishes at a set reliability. | The model's own timesteps, or frames of video, that it is unrolled forward. |
| What accumulates per step | Task-level unreliability. End-to-end success ≈ single-step accuracy raised to the power of the horizon. | State-prediction error. One-step model error compounds over the rollout. |
| How it is measured | Fit a success curve against human task length, read a horizon at 50% / 80%, track its doubling over calendar time. | Fix a horizon H, then measure rollout drift directly: pixel error, consistency duration. |
| How it is shortened | Split the task into short grounded sessions; persist memory to disk; verify each step. | Roll the model a few steps, then bootstrap a learned value; branch from real states; abstract into latent space. |
| Grounding | Added at every step (tool output, test results). The main thing that extends the horizon. | Removed during imagination. The model runs ungrounded for H steps by design. |
| The object at the far end | A sparse, binary verifier (tests pass or fail). It creates the credit-assignment problem. | A dense, continuous value (TD bootstrap). It dissolves the credit-assignment problem. |
| What more scale or "thinking" buys | The time horizon doubles over calendar time; thinking removes self-conditioning and lengthens execution. | A longer H often hurts; usable horizon grows from model fidelity, not calendar time. |

The shared mechanism: a model conditioning on its own output

The two horizons are not unrelated. Both fields, when they look for the deepest cause of failure, arrive at the same place: a model consuming its own previous outputs. And both name the same ancestor for it: exposure bias, the train-test mismatch first framed for sequence models a decade ago, where a network trained on ground-truth prefixes is fed its own imperfect generations at inference and has never learned to recover from them (Ranzato et al., 2015).

What differs is the surface the loop runs on. In a language-model agent it is a reliability effect. The Illusion of Diminishing Returns shows it causally: inject errors into the model's own context and its next-step accuracy falls, even at fixed context length — the model becomes more likely to err when its history already contains errors. The paper calls this self-conditioning, and reports two things that matter for the comparison. Scaling the model does not remove it: 200-billion-parameter models still self-condition. Explicit "thinking" — RL-trained reasoning — does remove it, and also stretches how far a model can execute in a single turn, from around 120 steps for one model to over a thousand for another.

In a world model the same loop is a fidelity effect. Each predicted state or frame becomes the input to the next, so error does not just persist; it pushes the rollout off the distribution the model was trained on, which raises the next error. This is exactly DIAMOND's drift, and it underlies the video literature's pair of failure modes, history forgetting and temporal degradation, both of which are attributed to the teacher-forcing mismatch (Meta-ARVDM, 2025). The cures differ in kind: the agent side adds grounding and trains recovery; the world-model side shortens rollouts, changes the diffusion objective, conditions on more history, or anchors generation to its opening frames. The same loop produces a reliability problem on one side and a perceptual-drift problem on the other.

The same trick, inverted: a learned value at the end of the horizon

Both fields also stop short of running the full horizon, and lean on a learned object to cover the rest. This is where the comparison is most useful, because the object plays the same structural role and produces the opposite consequence.

In a world model the terminal object is a value function. TD-MPC plans five steps and then adds a single learned terminal Q-value that estimates all return beyond the horizon, so the short rollout stands in for an effectively infinite one. MuZero bootstraps onto an n-step search value; Dreamer closes its 15-step return with a critic, and reports that without that bootstrap the agent becomes "shortsighted" and only solves reactive tasks. This terminal value is dense and continuous, and it is trained by one-step temporal-difference bootstrapping — so the long horizon is never rolled out anywhere. It is folded into a value function during training. There is no sparse-reward credit-assignment crisis, because the value carries information at every step.
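The way a terminal value stands in for the rest of the horizon can be sketched with the plain n-step return. This is the simplest textbook form, assumed here for illustration; the published agents use their own variants, and the function names, rewards, and discount are hypothetical.

```python
def bootstrapped_return(rewards: list, terminal_value: float, discount: float) -> float:
    """Discounted sum of H imagined rewards, closed with a learned value at step H."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (discount ** k) * reward
    # The terminal value estimates all return beyond the last imagined step.
    return total + (discount ** len(rewards)) * terminal_value


def td_target(reward: float, next_value: float, discount: float) -> float:
    """One-step temporal-difference target used to train the value itself."""
    return reward + discount * next_value


# Toy case: reward 1 at every step with discount 0.9 has true value 1 / (1 - 0.9) = 10.
# If the terminal value is exact, the estimate does not depend on the rollout length H.
for horizon in (3, 5, 15):
    estimate = bootstrapped_return([1.0] * horizon, terminal_value=10.0, discount=0.9)
    print(horizon, round(estimate, 6))
```

In this toy case a 3-step rollout and a 15-step rollout give the same answer, which is the sense in which the long horizon lives in the value function rather than in the rollout.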

In a language-model agent the object at the end of the horizon is a verifier, and it is sparse and discrete. Most agentic tasks emit a single scalar at the end: hidden tests pass or fail, an evaluator returns pass or needs-work, functional correctness is graded once at termination. Because that signal is terminal and binary, it creates the credit-assignment problem rather than dissolving it. Value-free RL such as GRPO has to spread one outcome reward uniformly across every turn of a long trajectory, unable to tell which tool call mattered (Zeng et al., 2025). A whole line of work exists only to densify that terminal signal — turn-level rewards, hindsight redistribution, tree-search value estimation — and RL theory warns that the obvious fix is not enough: adding a constant makes the reward non-sparse without making it informative (Arumugam et al., 2021).
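The uniform spreading of one outcome reward can be sketched as follows. This is a minimal GRPO-style illustration, not a reference implementation: the function names are hypothetical, and using the population standard deviation with a small epsilon is an assumption made here, since implementations differ on such details.

```python
from statistics import mean, pstdev


def group_relative_advantages(outcome_rewards: list, eps: float = 1e-8) -> list:
    """Score each trajectory against the other trajectories sampled for the same task."""
    mu = mean(outcome_rewards)
    sigma = pstdev(outcome_rewards)  # assumption: population std; eps avoids divide-by-zero
    return [(r - mu) / (sigma + eps) for r in outcome_rewards]


def broadcast_to_turns(advantage: float, num_turns: int) -> list:
    """With no value function, every turn of a trajectory receives the same advantage."""
    return [advantage] * num_turns


# Four hypothetical trajectories for one task, each graded pass (1) or fail (0) at the end.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
per_turn = broadcast_to_turns(advantages[0], num_turns=12)
print(advantages)
print(per_turn[:3])
```

Every one of the twelve turns in the passing trajectory gets an identical score, so nothing in the signal distinguishes the tool call that mattered from the ones that did not.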

[Figure: two panels. WORLD MODEL: a chain from a real state through latent states z₀, z₁, z₂, z₃ to a terminal value (dense, continuous); unroll H steps, then bootstrap one TD value, so credit assignment is dissolved. LANGUAGE-MODEL AGENT: a chain from start through step, step, step, ···, step to a verifier (pass / fail, binary); one scalar at the end, smeared over every step, so credit assignment is hard.]
The same structural move — cap the horizon with a learned estimate instead of rolling it all the way out — points in opposite directions. The world model's value is dense and trained by TD bootstrapping, so the long horizon is folded into a function. The agent's verifier is a single binary signal at termination, which is what turns long-horizon credit assignment into a research problem.

So the same move — cap the horizon with a learned estimate instead of rolling it out — produces opposite work. The world model's value dissolves the long-horizon problem. The agent's verifier is what makes it hard.

Where the two horizons meet

The cleanest evidence that these are genuinely different horizons is the place they stop being different. METR's time horizon is a software-and-reasoning result. Run the same methodology on tasks that need perception and physical grounding and it collapses: horizons for visual computer-use tasks are 40 to 100 times shorter than for software, and Claude Sonnet 4.5's horizon for real-world coffee-making is about two minutes (METR, 2026).

Two minutes is the world-model regime. It is the same order as Genie 3's consistency window, and it appears for the same reason: in a physical, partially observed setting, an agent has to maintain an internal model of a world that keeps changing, and that model drifts. In domains with no clean symbolic state to check against, how long the agent can stay reliable becomes bounded by how far the model's own predictions stay faithful. The two axes stay distinct in the places each field usually measures — clean software tasks for agents, controlled dynamics for world models — and they appear to converge exactly in the messy physical middle that neither field has solved.

Open questions

The comparison is useful mostly because it makes a few questions sharper than either field asks alone.

Sources

  1. Measuring AI Ability to Complete Long Tasks, Kwa et al. (METR), 2025 — the 50% time-horizon metric and the ~7-month doubling trend.
  2. Clarifying limitations of time horizon, METR, 2026 — horizon as human serial labor, and its 40–100× collapse under perception / physical grounding.
  3. The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs, Sinha, Arun, Goel, Staab, Geiping, 2025 — per-step-accuracy compounding and the causal self-conditioning effect.
  4. Is there a half-life for the success rates of AI agents?, Toby Ord, 2025 — constant-hazard model of agent success decay.
  5. Effective harnesses for long-running agents, Anthropic, 2025 — decomposition, durable on-disk handoff, and verification against false success.
  6. When to Trust Your Model: Model-Based Policy Optimization, Janner, Fu, Zhang, Levine, NeurIPS 2019 — compounding error and short branched rollouts.
  7. Mastering Diverse Domains through World Models (DreamerV3), Hafner, Pasukonis, Ba, Lillicrap, 2023 — imagination horizon H=15 with a value-bootstrapped return.
  8. Temporal Difference Learning for Model Predictive Control (TD-MPC), Hansen, Wang, Su, 2022, and TD-MPC2, 2024 — short latent planning plus a learned terminal value.
  9. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero), Schrittwieser et al., 2020 — five-step model unroll with a bootstrapped search value.
  10. Diffusion for World Modeling: Visual Details Matter in Atari (DIAMOND), Alonso et al., NeurIPS 2024 — rollout drift measured directly, DDPM vs EDM.
  11. Genie 3: A new frontier for world models, Google DeepMind, 2025 — interactive worlds consistent for a few minutes.
  12. Error Analyses of Auto-Regressive Video Diffusion Models, 2025 — history forgetting and temporal degradation as exposure bias.
  13. From Masks to Worlds: A Hitchhiker's Guide to World Models, 2025 — the Genie progression and the "far from persistence" framing.
  14. Sequence Level Training with Recurrent Neural Networks, Ranzato et al., 2015 — the original exposure-bias framing and its RL remedy.
  15. Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Reward Design, 2025, and An Information-Theoretic Perspective on Credit Assignment, 2021 — why a sparse terminal reward makes long-horizon credit assignment hard.
  16. Long-horizon agent benchmarks: OSWorld, GAIA, τ-bench — step budgets, tiered horizons, and pass^k reliability decay.