TD-MPC: A World Model for Action

June 2026

TD-MPC makes a useful turn in model-based RL: it moves the world model's job from reconstructing every visible detail to predicting the parts that matter for action.

The entry point of TD-MPC is an old problem in model-based RL. If an agent wants to evaluate the future before acting, it needs an internal model. A more complete model looks more reliable for planning, while computation and model error both grow with the planning horizon.

The paper compresses that problem into a concrete interface: if an agent wants to imagine the future before acting, what kind of world model does it actually need?

The natural answer is to build the most complete model possible. Given the current state and an action, it should predict the next image, the next state, and every changing detail in the environment. This sounds reasonable. In continuous control, it is often too expensive.

A walking robot is surrounded by details: floor texture, shadows, background colors, pixel noise, and body details irrelevant to the task. Training a model to reproduce all of them spends capacity on information the controller may never use. Planning then calls the model repeatedly. A small one-step error can become a larger rollout error after several imagined steps.
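A toy calculation, not taken from the paper, makes the compounding concrete. It assumes a scalar linear system and a hypothetical learned model whose coefficient is off by 0.02; every name and number below is illustrative.

```python
def rollout(coef, x0, actions):
    """Roll the scalar system x' = coef * x + a forward over the given actions."""
    xs = [x0]
    for a in actions:
        xs.append(coef * xs[-1] + a)
    return xs

TRUE_COEF = 0.90    # assumed "real" dynamics (hypothetical)
MODEL_COEF = 0.92   # learned model with a small coefficient error (hypothetical)
actions = [0.1] * 20

true_traj = rollout(TRUE_COEF, 1.0, actions)
model_traj = rollout(MODEL_COEF, 1.0, actions)
errors = [abs(m - t) for m, t in zip(model_traj, true_traj)]

# The gap between imagined and real state widens as the rollout gets longer.
for step in (1, 5, 10, 20):
    print(f"step {step:2d}: |model - true| = {errors[step]:.3f}")
```

In this toy setting the error after 20 imagined steps is about ten times the one-step error, which is the pressure that pushes planners toward short horizons.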

TD-MPC makes a direct judgment: for an acting agent, the important test for a world model is whether it helps choose better actions.

The Old Problem Of Long-Horizon Planning

A simple route in reinforcement learning is to avoid explicit environment modeling and learn a policy through trial and error. Given a state, the policy outputs an action. Model-free methods such as SAC live in this family. They are relatively simple and stable, but their sample efficiency can be poor.

Model-based RL takes another route. The agent first learns a model, then tries actions inside that model. It can ask:

If I execute action a, what will happen over the next few steps?

This is model predictive control, or MPC. The procedure is fixed: at each time step, look ahead for a short horizon, evaluate many candidate action sequences, choose the sequence with the highest predicted return, and execute only its first action. At the next observation, plan again.
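The loop can be sketched on a toy problem. This is a minimal random-shooting planner under stated assumptions: the 1-D dynamics, the reward, and the sample counts are hypothetical, and the planner here uses the true dynamics as its model. It is not the paper's planner.

```python
import random

def step(x, a):
    """Toy 1-D dynamics (hypothetical): the action nudges the position."""
    return x + 0.1 * a

def reward(x):
    """Toy reward: stay close to the origin."""
    return -x * x

def plan(x, rng, horizon=5, num_samples=200):
    """Sample action sequences, score each one, return the best first action."""
    best_return, best_first = float("-inf"), 0.0
    for _ in range(num_samples):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        sim_x, total = x, 0.0
        for a in seq:
            sim_x = step(sim_x, a)      # imagined step inside the model
            total += reward(sim_x)
        if total > best_return:
            best_return, best_first = total, seq[0]
    return best_first

def run_episode(x0=1.0, steps=30, seed=0):
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        a = plan(x, rng)    # replan from the newest state at every step
        x = step(x, a)      # execute only the first action of the best plan
    return x

final_x = run_episode()
print(f"final |x| = {abs(final_x):.3f}")
```

Only the first action of each plan is executed; the rest of the sequence is thrown away and recomputed from the next state.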

This mechanism fits continuous control well. The action can be joint torque, velocity, direction, or another continuous vector, instead of a discrete button such as left, right, or jump. Humanoid and Dog locomotion tasks have high-dimensional action spaces, where pure trial-and-error learning can be slow.

MPC also has a clear failure mode. A short horizon can be myopic. A long horizon is expensive, and errors in the learned model accumulate. TD-MPC's central compromise is simple: use the model for the near future, and use a value function for the far future.

Figure 1 from the TD-MPC paper (overview: latent model rollouts, reward, value, and benchmark curves). The top diagram is the method in miniature: encode the observation into a latent state, perform model rollouts, score them with reward and value, then execute an action. The bottom plots show the paper's headline Humanoid and Dog locomotion results.

Short-Term Model, Long-Term Q

The name TD-MPC has two parts. MPC handles short-horizon planning. Temporal difference learning trains long-term value estimates.

Long-term value here is a number produced by a network. More precisely, the paper uses Q(z, a): given a latent state z and an action a, it outputs the expected long-term return, that is, the cumulative discounted reward from that point onward.

During inference, TD-MPC encodes the current observation into a latent state z_0. It then samples many candidate action sequences. For each sequence, it repeatedly calls the dynamics model:

z_0 + a_0 -> z_1
z_1 + a_1 -> z_2
z_2 + a_2 -> z_3
...
z_H

This is a model rollout. The robot is still in the real environment; the learned dynamics model is being called several times internally to imagine possible futures. Each candidate trajectory receives a score:

short-term rewards + terminal Q value

The first term comes from reward predicted along the short rollout. The second term comes from the Q function at the end of the planning horizon. MPC can therefore avoid rolling out very far while still accounting for long-term return.

This is the paper's clean interface: model-based planning handles local action details, and TD value learning reconnects the local plan to the long-term objective.
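The scoring rule can be sketched in a few lines. The functions d, R, Q, and pi below are hypothetical scalar stand-ins for the learned networks, the discount gamma is an assumed hyperparameter, and taking the terminal action from pi is one reasonable choice made here for illustration.

```python
def score_sequence(z0, actions, d, R, Q, pi, gamma=0.99):
    """Discounted predicted rewards along the rollout plus a discounted terminal Q."""
    z, total, discount = z0, 0.0, 1.0
    for a in actions:
        total += discount * R(z, a)   # short-term reward from the reward head
        z = d(z, a)                   # imagined next latent state
        discount *= gamma
    # The terminal value stands in for everything beyond the planning window.
    return total + discount * Q(z, pi(z))

# Hypothetical scalar stand-ins for the learned networks.
d = lambda z, a: z + 0.1 * a
R = lambda z, a: -z * z
Q = lambda z, a: -10.0 * z * z
pi = lambda z: -z

towards = score_sequence(1.0, [-1.0, -1.0, -1.0], d, R, Q, pi)
away = score_sequence(1.0, [1.0, 1.0, 1.0], d, R, Q, pi)
print(f"towards origin: {towards:.2f}, away from origin: {away:.2f}")
```

With an empty action list the score reduces to the terminal Q alone, which is the limit where planning contributes nothing and the value function does all the work.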

TOLD: A Latent Model Built For Acting

The model inside TD-MPC is called TOLD: Task-Oriented Latent Dynamics. The name can sound abstract, but the object is a set of neural networks:

h: observation -> latent state z
d: z + action -> next z
R: z + action -> reward
Q: z + action -> long-term value
pi: z -> candidate action

The raw observation can be robot state or image input. Either way, it must become numbers. The network that turns raw input into a latent vector is the encoder, also called the representation network. TD-MPC uses h to encode an observation into z, and planning happens inside that latent representation.
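As a shape-level sketch of the five components, the snippet below wires them together with NumPy. Each network is a single random linear layer standing in for a trained MLP, and all dimensions are hypothetical; it shows only how the pieces connect, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, ACT_DIM, LATENT_DIM = 8, 2, 4   # hypothetical sizes

def linear(in_dim, out_dim):
    """One random linear layer standing in for a trained MLP."""
    W = rng.normal(scale=0.1, size=(in_dim, out_dim))
    return lambda x: x @ W

_h = linear(OBS_DIM, LATENT_DIM)
_d = linear(LATENT_DIM + ACT_DIM, LATENT_DIM)
_R = linear(LATENT_DIM + ACT_DIM, 1)
_Q = linear(LATENT_DIM + ACT_DIM, 1)
_pi = linear(LATENT_DIM, ACT_DIM)

def h(obs):
    return _h(obs)                                # observation -> latent state z

def d(z, a):
    return _d(np.concatenate([z, a]))             # (z, action) -> next z

def R(z, a):
    return float(_R(np.concatenate([z, a]))[0])   # (z, action) -> reward

def Q(z, a):
    return float(_Q(np.concatenate([z, a]))[0])   # (z, action) -> long-term value

def pi(z):
    return np.tanh(_pi(z))                        # z -> bounded candidate action

obs = rng.normal(size=OBS_DIM)
z = h(obs)
a = pi(z)
z_next = d(z, a)
print(z.shape, a.shape, z_next.shape, R(z, a), Q(z_next, pi(z_next)))
```

Note that the observation is encoded once; every later call operates on latent vectors only.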

The key move is that TD-MPC does not require z to reconstruct the full world. It does not train the model to predict the next image or recover every state detail. Instead, the latent dynamics model is trained through three signals: reward prediction, a TD value loss, and latent consistency.

Figure 2 from the TD-MPC paper (TOLD training diagram: latent states, dynamics, reward, value, policy, online network, and target network). TOLD is trained by rolling forward in latent space: encode the first observation, predict future latent states with `d`, attach reward/value/policy heads at each latent state, and use target networks for the bootstrapped targets.

Read this figure as the training contract. The model is not asked to draw the next camera frame. It is asked to make latent states that support three predictions the planner will actually use: immediate reward, terminal value, and the next latent state. That is where the paper's "task-oriented" claim becomes concrete.
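One way to write that contract down, as a sketch in this article's notation: the loss weights c_1, c_2, c_3, the temporal weight \lambda, the discount \gamma, and the superscript minus marking target networks are notation assumed here for illustration and should be checked against the paper's own equations. Latent states are rolled forward as z_t = h(s_t) and z_{i+1} = d(z_i, a_i).

```latex
\mathcal{J} = \sum_{i=t}^{t+H} \lambda^{\,i-t}\,\mathcal{L}_i,
\qquad
\mathcal{L}_i =
  c_1 \,\bigl\| R(z_i, a_i) - r_i \bigr\|_2^2
+ c_2 \,\bigl\| Q(z_i, a_i) - \bigl( r_i + \gamma\, Q^{-}(z_{i+1}, \pi(z_{i+1})) \bigr) \bigr\|_2^2
+ c_3 \,\bigl\| d(z_i, a_i) - h^{-}(s_{i+1}) \bigr\|_2^2
```

The three terms line up with the three predictions: reward, bootstrapped value, and agreement between the predicted latent and the encoding of the observation that actually followed. None of the terms decodes back to an observation.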

This changes the objective of the world model. The training signal pushes z toward information that affects reward, value, and planning. For Humanoid walking, posture, velocity, and falling matter. Background texture exists in the world, yet it is mostly a burden for the current task.

This leaves an important question: how much generality does a task-oriented representation lose? If the reward changes, or the task changes, can the same latent still be reused? TD-MPC's answer is engineering-first: make the current task strong. The tradeoff pays off on the paper's benchmarks, and it leaves the reuse question open for later world-model work.

What Makes TD-MPC's World Model Different

A common intuition for world models is reconstruction-oriented: a model understands the world when it can predict future observations. TD-MPC is closer to control-oriented modeling: the model's main responsibility is to support action selection.

Terminology

Reconstruction-oriented training is organized around reconstruction: compress the observation into a latent state, then recover the image or full state from that latent. Its standard is reconstruction quality: how faithful the recovery is.

Control-oriented training is organized around action. The latent state only needs to help predict reward, value, and the next latent state so MPC can choose a better action. Its standard is action quality: how good the selected action is in the environment.

This resembles MuZero in spirit. MuZero also organizes its model around a planning interface: representation, reward, value, and policy. TD-MPC brings that style into high-dimensional continuous action spaces such as Dog and Humanoid locomotion.

What The Experiments Actually Test

The experiments are an interface stress test. They ask whether TD-MPC survives several pressures at once: state-based control, image-based control, sparse rewards, goal-conditioned manipulation, high-dimensional actions, multi-task learning, multi-modal input, and limited planning compute. The paper reports 92 continuous-control tasks across DMControl and Meta-World.

The first evidence block is state-based DMControl. The comparison is useful because the baselines attack the problem from different sides: SAC is model-free, LOOP augments SAC with a learned model, MPC:sim plans with a ground-truth simulator, and the TD-MPC ablations remove either the latent model or the latent consistency regularizer.

Figure 3 from the paper (DMControl learning curves comparing TD-MPC with SAC, LOOP, MPC:sim, and ablations across 15 tasks). Each small plot is one DMControl task. The red curve is TD-MPC; blue is SAC; orange is LOOP; the black dashed line is MPC with a ground-truth simulator. The informative pattern is where the gains are largest: tasks with more complex dynamics such as Quadruped and Acrobot.

This figure is the main empirical support for the method story. If the latent model were only a cheap reconstruction shortcut, it would be hard to explain why it helps more on tasks where local planning matters. The ablation curves also matter: removing the latent structure or the consistency regularizer weakens performance, which supports the claim that the latent rollout objective is doing real work.

The image-based setting is a stricter test. Here, TD-MPC receives pixels instead of compact simulator state, but it still does not train a reconstruction model. The DMControl 100k benchmark table (100k environment steps) is useful because several baselines are specialized for pixels, while TD-MPC keeps the same control-oriented modeling principle.

Table 1 from the paper (image-based DMControl 100k returns for SAC, CURL, DrQ, PlaNet, Dreamer, MuZero, EfficientZero, and TD-MPC). The rows are image-based DMControl 100k tasks. TD-MPC is competitive with pixel-specialized baselines on several tasks, and performs strongly on Finger Spin and Walker Walk while avoiding the action discretization used by MuZero-style methods.

Figure 4 from the paper (image-based DMControl learning curves comparing TD-MPC with CURL, DrQ, DrQ-v2, and Dreamer-v2). The longer image-based benchmark shows that TD-MPC's competitiveness extends beyond the 100k-step setting and remains visible on harder pixel-control tasks, including Walker and Quadruped variants.

Meta-World and multi-modal control test a different question: does the reward-centric latent representation still help outside pure locomotion? In the top row below, TD-MPC improves success rate on 50 goal-conditioned manipulation tasks and on MT10 multi-task learning. In the bottom row, the multi-modal version uses proprioception plus an egocentric camera; the blind variant keeps only proprioception and falls behind.

Figure 5 from the paper (Meta-World and multi-modal RL curves comparing TD-MPC, SAC, MPC:sim, and a blind TD-MPC variant). Top: Meta-World success rate. Bottom: multi-modal locomotion with proprioception plus camera input. This is the paper's evidence that TOLD can use different input modalities while keeping the same latent planning interface.

The compute-budget experiment is small, but it is conceptually important. MPC has a knob at inference time: how many model steps and optimization iterations should the agent spend before acting? TD-MPC improves as planning budget increases, while the learned policy prior alone is weaker than planning.
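The knob can be sketched with a plain cross-entropy-method (CEM) loop, used here as a stand-in; the paper's planner may differ in its details. The toy score function, sample counts, and elite count are all hypothetical.

```python
import numpy as np

def score(z0, actions):
    """Toy score (hypothetical): discounted -z^2 rewards plus a terminal value."""
    z, total, discount = z0, 0.0, 1.0
    for a in actions:
        total += discount * -(z ** 2)
        z = z + 0.1 * a
        discount *= 0.99
    return total + discount * -10.0 * z ** 2

def cem_plan(z0, horizon=5, iterations=6, num_samples=128, num_elites=16, seed=0):
    """Refit a Gaussian over action sequences to the top-scoring samples."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(horizon), np.ones(horizon)
    for _ in range(iterations):
        samples = np.clip(rng.normal(mean, std, size=(num_samples, horizon)), -1, 1)
        scores = np.array([score(z0, seq) for seq in samples])
        elites = samples[np.argsort(scores)[-num_elites:]]    # best sequences
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return mean, score(z0, mean)

# More iterations means more model calls before acting.
for iters in (1, 3, 6):
    _, value = cem_plan(1.0, iterations=iters)
    print(f"iterations={iters}: planned score = {value:.3f}")
```

Both `horizon` and `iterations` multiply the number of model calls per decision, so they are the two budget axes a planner of this shape exposes.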

Figure 6 from the paper (Quadruped Walk return as the planning horizon and the number of CEM iterations vary). Left varies the planning horizon; right varies CEM iterations. The orange star marks the default setting. The gray cross is the learned policy without full planning. This figure is the cleanest evidence for the paper's claim that more planning compute can translate into better control.

The strongest ablation for the world-model claim is the latent dynamics objective. The paper compares no regularization, reconstruction loss, contrastive loss, and latent state consistency. Reconstruction and contrastive objectives help over no regularization, but the proposed latent consistency objective is the most consistent across the 15 DMControl tasks.

Figure 10 from the paper appendix (latent dynamics objective ablation: no regularization, reconstruction, contrastive loss, and latent state consistency). This ablation is the empirical version of the paper's modeling claim: the most stable objective aligns the latent model with future latent consistency, reward, and value. Direct reconstruction is weaker in this comparison.

Put together, the experiments support a narrower and stronger claim than "world models help RL." TD-MPC shows that, for continuous control, a task-oriented latent model can be good enough for short-horizon planning, and a terminal value function can cover the rest of the return. The remaining weakness is also visible in the plots: exploration-heavy tasks such as Finger Turn Hard are still difficult, and task-oriented representations may not transfer cleanly when the reward or dynamics regime changes.

Open Questions: The Cost Of Task-Oriented World Models

TD-MPC's core assumption can be written as a testable interface: latent rollout predicts short-term rewards, terminal Q estimates return beyond the planning window, and MPC chooses actions using their combined score. That interface moves the world-model evaluation standard from reconstruction quality to action quality.

The first risk is representation sufficiency. The training objective is pulled by the current task's reward and value, so the latent state tends to preserve variables that are useful for the current task. In Humanoid walking, discarding background texture is usually good. After task changes, reward rewrites, or contact dynamics entering a new regime, a discarded variable may become necessary state information.

The second risk is regime mixing. TD-MPC uses one unified latent dynamics model, while real physical systems often have discontinuous regimes. A foot touching the ground and a foot leaving the ground follow different transition dynamics. A hand hitting an object and a hand moving through free space do as well. A single dynamics model may average these modes. PRISM-WM pushes in a related direction by using mixture-of-experts (MoE) dynamics to explicitly decompose hybrid dynamics.
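The averaging effect is easy to reproduce in a toy setting. The two-regime dynamics below are hypothetical, and the single linear model is a deliberately weak stand-in for a learned network; the point is only the failure pattern, not its size.

```python
def true_next(x):
    """Hypothetical hybrid dynamics: damped in 'contact' (x < 0), free otherwise."""
    return 0.5 * x if x < 0 else 1.0 * x

xs = [i / 10 for i in range(-10, 11)]   # states drawn from both regimes
ys = [true_next(x) for x in xs]

# Least-squares slope for one shared linear model y = k * x.
k = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print(f"single-model slope: {k:.2f}")
print(f"error at x = -1: {abs(k * -1.0 - true_next(-1.0)):.2f}")
print(f"error at x = +1: {abs(k * 1.0 - true_next(1.0)):.2f}")
```

The fitted slope lands between the two regime slopes, so the shared model is wrong on both sides of the switch.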

The third risk is horizon decomposition. TD-MPC splits the future into two parts: short-term model rollout and long-term terminal Q. This is effective in continuous control, but it assumes that the latent state at the end of the short rollout contains enough information for the Q function to estimate the long-term objective. If long-term success depends on variables lost during the early rollout, terminal value can become overly optimistic.

Later world-model work can inherit this evaluation style and push on three pressure points: whether the representation is sufficient, whether latent dynamics can express multiple transition regimes, and whether terminal value can reliably cover the future beyond the planning window.

Sources

Temporal Difference Learning for Model Predictive Control, Hansen, Wang, and Su, ICML 2022.
TD-MPC project page, including videos and benchmark summaries.
MuZero: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model, Schrittwieser et al., Nature 2020.

TD-MPC: A World Model Need Not Reconstruct the World

June 2026

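A paper on model-based RL takes a genuinely interesting turn: it pushes the world model from "predicting what the world looks like" toward "predicting what is useful for action."

TD-MPC starts from an old problem in model-based RL: an agent that wants to evaluate the future before acting needs an internal model. The more complete the model, the more reliable planning appears, but computation and error grow along with it.

The paper narrows the problem to a more concrete interface: if an agent wants to imagine the future before acting, what kind of world model does it actually need?

One natural answer is that the more complete the model, the better. Given the current state and an action, it should predict the next image frame, the next state, and everything in the environment that changes. The answer sounds reasonable and tempting. The problem is that in continuous control it is often too expensive.

When a robot walks, the world is full of details: floor texture, shadows, background colors, pixel noise, and body details unrelated to the task. A model trained to reproduce all of them burns a large amount of capacity. Worse, planning calls the model for many consecutive steps. A slightly wrong one-step prediction grows into a larger error over the rollout.

TD-MPC's judgment is direct: a world model does not first have to be a complete reconstructor of the world. For an agent that has to act, what really matters is whether the model helps it choose better actions.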

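Starting From The Old Problem Of Long-Horizon Planning

The most basic route in reinforcement learning is to skip explicit environment modeling and learn a policy purely by trial and error. Given a state, the policy outputs an action. Model-free methods such as SAC belong to this broad family. They are simple and stable, but their sample efficiency is often low.

The other route is model-based. The agent first learns a model, then tries actions inside it. It can first ask:

If I execute action a, what will happen over the next few steps?

This is model predictive control, or MPC. The MPC procedure is fixed: at each time step, look a few steps ahead, try many candidate action sequences, pick the one with the highest predicted return, and execute only its first step. Once the next observation arrives, plan again.

This mechanism suits continuous control, where actions change from discrete buttons such as "left, right, jump" into continuous values such as joint torque, velocity, and direction. In tasks such as Humanoid or Dog locomotion, the action space can be very high-dimensional, and learning directly by model-free trial and error is slow.

MPC's problem is equally clear. If it only looks a few steps ahead, it tends to be myopic. If it looks far ahead, computation becomes too expensive and the learned model's errors accumulate. TD-MPC's central compromise sits here: compute the near future carefully with the model, and estimate the far future with a value function.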

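Short Term By Model, Long Term By Q

The name TD-MPC splits into two halves. MPC handles short-term planning. Temporal difference learning handles learning long-term value.

Long-term value here is concrete: it is a number output by a network. More precisely, the paper uses Q(z, a): given a latent state z and an action a, it outputs how much long-term reward is expected from this point on.

At inference time, TD-MPC obtains a latent state z_0 from the current observation. It then samples many candidate action sequences and, for each sequence, repeatedly calls the dynamics model:

z_0 + a_0 -> z_1
z_1 + a_1 -> z_2
z_2 + a_2 -> z_3
...
z_H

This is a model rollout. Here, rollout means unrolling the future inside the model: the robot stays put in the real environment while the dynamics model is called for several consecutive steps. Each candidate trajectory receives a score:

short-term rewards + terminal Q value

The first part comes from the rewards predicted during the short rollout. The second part comes from the Q function at the end of the short planning horizon. MPC therefore does not need to plan far ahead, and it is not completely myopic either.

This is also the paper's cleanest interface: model-based planning handles local action details, and TD value learning connects the long-term objective back in.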

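TOLD: A Latent Model In The Service Of Action

The model inside TD-MPC is called TOLD, Task-Oriented Latent Dynamics. The name can seem abstract, but concretely it is a set of neural networks:

h: observation -> latent state z
d: z + action -> next z
R: z + action -> reward
Q: z + action -> long-term value
pi: z -> candidate action

The raw observation may be robot state or an image. Either way it ends up as numbers. The network that turns raw input into a latent vector is called the encoder, or the representation network. TD-MPC uses h to encode the observation into z, and all planning then happens in this latent representation.

The key point is that TD-MPC does not require z to reconstruct the full world. It does not make the model predict the next image, nor does it require it to recover every state detail. It trains the latent dynamics using only reward prediction, TD value loss, and latent consistency.

The TOLD training figure in the paper can be read as a training contract. The model is not asked to draw the next camera image. It is asked to produce the three kinds of predictions that planning will use: immediate reward, terminal value, and the next latent state. This is where the term "task-oriented" lands in the math and the training objective.

This changes the goal of the world model. The training objective pushes z toward information that affects reward, value, and planning. For Humanoid walking, body posture, velocity, and whether the robot is about to fall matter. Background texture exists in the real world but is a burden for the current task.

One point deserves further questioning: does a task-oriented representation sacrifice generality? If the reward changes or the task changes, can this latent still be reused? TD-MPC's answer leans toward engineering: make the current task strong first. The tradeoff is effective, but it also leaves room for later work.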

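What Is Special About TD-MPC's World Model

The intuition behind many world models is reconstruction-oriented: the better a model predicts future observations, the better it understands the world. TD-MPC is closer to control-oriented: the model's main responsibility is to support action selection.

Terminology

The training objective of reconstruction-oriented modeling is reconstruction: compress the observation into a latent state, then recover the image or full state from it. It cares about reconstruction quality, that is, how faithful the recovery is.

The training objective of control-oriented modeling revolves around action: the latent state only needs to help predict reward, value, and the next state, so that MPC can choose a better action. It cares about action quality, that is, how good the chosen action turns out to be.

This bears some resemblance to MuZero. MuZero's model is also organized around a planning interface: representation, reward, value, and policy. TD-MPC carries this idea into robot locomotion with high-dimensional action spaces, such as the Dog and Humanoid continuous control tasks.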

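What The Experiments Actually Test

The paper's experiments are a set of interface stress tests. They check whether the TD-MPC interface can withstand several kinds of pressure at once: state-based control, image-based control, sparse reward, goal-conditioned manipulation, high-dimensional continuous actions, multi-task learning, multi-modal input, and limited planning compute. In total, the paper evaluates 92 continuous control tasks from DMControl and Meta-World.

The first block of evidence is state-based DMControl. The comparison is useful because the baselines represent different routes: SAC is model-free; LOOP adds a learned model on top of SAC; MPC:sim plans directly with a ground-truth simulator; and the TD-MPC ablations remove either the latent model or the latent consistency regularizer.

The state-based DMControl learning curves are the core experimental evidence for the method's claim. If the latent model were only a cheap reconstruction shortcut, it would be hard to explain why the gains are larger on tasks that depend more on local planning. The ablation curves matter as well: performance drops once the latent structure or the consistency regularizer is removed, which indicates that the latent rollout objective is an effective component.

The image-based setting is stricter. Here TD-MPC takes pixels as input, no longer a compact simulator state, yet it still does not train a reconstruction model. The 100k benchmark table is valuable because many of its baselines are tuned specifically for pixel RL, while TD-MPC still sticks to the control-oriented modeling principle.

Meta-World and multi-modal control test something else: whether this reward-centric latent representation still works once it leaves pure locomotion. In the corresponding figure, the top half covers 50 goal-conditioned manipulation tasks and MT10 multi-task learning; the bottom half covers multi-modal locomotion with proprioception plus an egocentric camera. The blind variant, which keeps only proprioception, is clearly much weaker.

The computational budget experiment is small but conceptually key. MPC has a knob at inference time: how many model steps and optimization iterations does the agent spend on planning before acting? TD-MPC gets better as the planning budget increases, while using the learned policy prior alone is weaker than full planning.

The hardest ablation for the world-model claim concerns the latent dynamics objective. The paper compares no regularization, reconstruction loss, contrastive loss, and latent state consistency. Reconstruction and contrastive losses both beat no regularization, but the proposed latent consistency is the most stable across the 15 DMControl tasks.

Taken together, the experiments support a narrower judgment: in continuous control, a task-oriented latent model can be sufficient for short-term planning, with a terminal value function estimating the return beyond the planning window. The weaknesses are also visible in the figures: exploration-heavy tasks such as Finger Turn Hard remain difficult, and whether a task-oriented representation can be reused when the reward or dynamics regime changes is not fully resolved.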

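Open Questions: The Cost Of Task-Oriented World Models

TD-MPC's core assumption can be written as a testable interface: the latent rollout handles short-term reward, the terminal Q handles the return beyond the planning window, and MPC chooses actions using the combined score. This interface moves the evaluation standard for world models from reconstruction quality to action quality.

The first risk comes from representation sufficiency. The training objective is always pulled by the current task's reward and value, so the latent state tends to keep the variables the current task can use. In Humanoid walking, pruning background texture is usually a good thing. After a task switch, a reward rewrite, or contact dynamics entering a new regime, a pruned variable may turn out to be exactly the state information the next stage needs.

The second risk comes from regime mixing. TD-MPC uses one unified latent dynamics model, yet real physics often exhibits multiple discontinuous regimes. A foot touching the ground versus leaving it, and a hand touching an object versus not touching it, correspond to different transition dynamics. A single dynamics model may average these modes out. The direction PRISM-WM pushes in is to use MoE dynamics to explicitly decompose hybrid dynamics.

The third risk comes from horizon decomposition. TD-MPC splits the future into two segments: the short term goes to the model rollout, and the long term goes to the terminal Q. This decomposition is effective in continuous control, but it also assumes that the latent state at the end of the short rollout already contains enough information for the Q function to estimate the long-term objective. If long-term success depends on variables the early rollout did not preserve, the terminal value will produce overly optimistic estimates.

Later world-model work can directly inherit this evaluation style and push on three pressure points: whether the representation is sufficient, whether the latent dynamics can express multi-regime transitions, and whether the terminal value can reliably fill in the future beyond the planning window.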

Sources

Temporal Difference Learning for Model Predictive Control, Hansen, Wang, and Su, ICML 2022.
TD-MPC project page, including videos and benchmark summaries.
MuZero: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model, Schrittwieser et al., Nature 2020.