GRASP: planning by optimizing the states in between

June 2026 · world models · planning · GRASP (arXiv:2602.00475)

GRASP changes the object a planner is allowed to optimize. In a standard world-model rollout, the intermediate states are computed consequences of the action sequence. In GRASP, those intermediate states become virtual states: variables optimized together with the actions, then checked by the learned dynamics.

The identity of a state changes

A learned world model gives a transition function \(F_\theta(s_t, a_t)\). Given the current state \(s_0\) and an action \(a_0\), it predicts the next state. A standard rollout planner optimizes only the action sequence:

\[ \text{variables} = a_0,\ldots,a_{T-1} \] \[ \begin{aligned} s_1 &= F_\theta(s_0,a_0),\\ s_2 &= F_\theta(s_1,a_1),\\ &\ldots\\ s_T &= F_\theta(s_{T-1},a_{T-1}). \end{aligned} \]

Here \(s_1,s_2,\ldots\) are rollout results. The planner cannot move \(s_2\) directly. It can only change earlier actions and ask the world model to recompute the chain.
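
A minimal sketch of that serial dependence, assuming a hypothetical scalar stand-in `toy_F` for \(F_\theta\) (the real model is a learned visual latent dynamics network, not this function). The names `rollout` and `toy_F` are illustrative only.

```python
from typing import Callable, List


def rollout(F: Callable[[float, float], float], s0: float, actions: List[float]) -> List[float]:
    """Return [s_1, ..., s_T] obtained by applying F serially from s0."""
    states = []
    s = s0
    for a in actions:
        s = F(s, a)  # every state is a consequence of all earlier actions
        states.append(s)
    return states


def toy_F(s: float, a: float) -> float:
    """Hypothetical stand-in for F_theta: a damped scalar integrator."""
    return 0.5 * s + a


traj = rollout(toy_F, s0=0.0, actions=[1.0, 1.0, 1.0])
# Changing only a_0 shifts every later state, including s_2.
shifted = rollout(toy_F, s0=0.0, actions=[2.0, 1.0, 1.0])
print(traj)     # [1.0, 1.5, 1.75]
print(shifted)  # [2.0, 2.0, 2.0]
```

There is no handle on \(s_2\) here: the only way to move it is to edit `actions` and run the chain again.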

GRASP keeps \(s_0\) fixed as the current state and \(s_T\) fixed as the goal, then adds the intermediate states to the optimization variables:

\[ \begin{aligned} \text{variables} = &\ a_0,\ldots,a_{T-1},\\ &\ s_1,\ldots,s_{T-1} \end{aligned} \] \[ \begin{aligned} \mathcal{L}_{\text{dyn}} &= \sum_{t=0}^{T-1} \left\lVert F_\theta(s_t,a_t)-s_{t+1}\right\rVert^2 . \end{aligned} \]

The same symbols now mean something different. \(s_1\) and \(s_2\) are no longer just outputs of the model; they are virtual states proposed by the planner. The world model checks them: if the planner proposes \(s_{t+1}\) but \(F_\theta(s_t,a_t)\) does not land on it, the corresponding term of \(\mathcal{L}_{\text{dyn}}\) grows.
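
The penalty \(\mathcal{L}_{\text{dyn}}\) can be sketched in a few lines. This assumes scalar states and a hypothetical `toy_F` in place of \(F_\theta\); `dynamics_loss` is an illustrative name, not an identifier from the paper's code.

```python
from typing import Callable, List


def dynamics_loss(
    F: Callable[[float, float], float],
    s0: float,
    goal: float,
    virtual_states: List[float],
    actions: List[float],
) -> float:
    """L_dyn = sum_t (F(s_t, a_t) - s_{t+1})^2 with s_0 and s_T = goal pinned."""
    states = [s0] + list(virtual_states) + [goal]  # s_0, s_1..s_{T-1}, s_T
    if len(actions) != len(states) - 1:
        raise ValueError("need exactly one action per transition")
    return sum(
        (F(states[t], actions[t]) - states[t + 1]) ** 2
        for t in range(len(actions))
    )


def toy_F(s: float, a: float) -> float:
    """Hypothetical stand-in for F_theta."""
    return 0.5 * s + a


actions = [1.0, 1.0, 1.0]
# Virtual states that the model can actually connect: zero penalty.
consistent = dynamics_loss(toy_F, 0.0, 1.75, [1.0, 1.5], actions)
# Proposing s_2 = 3.0 breaks the transitions into and out of s_2.
inconsistent = dynamics_loss(toy_F, 0.0, 1.75, [1.0, 3.0], actions)
print(consistent, inconsistent)  # 0.0 2.8125
```

Only the two transitions that touch the bad \(s_2\) contribute to the penalty, which is the local check described above.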

Figure (original schematic): GRASP keeps the world model but changes the planner's variables. The paper's Figure 2 shows the same serial-vs-lifted planning distinction.

Why this helps long-horizon planning

In a standard action-only rollout, the final goal error must travel backward through a long composition of \(F_\theta\). The planning problem becomes a long backpropagation path through a learned dynamics model. The paper's linear analysis shows the shooting formulation can acquire curvature that grows exponentially with horizon, while the lifted-state formulation has a horizon-independent smoothness bound.
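
A toy scalar case gives a feel for the contrast. This is an illustration, not the paper's analysis: it assumes linear dynamics \(F(s,a)=\alpha s + a\), looks only at diagonal second derivatives, and the function names are hypothetical. With \(|\alpha|>1\) the shooting curvature with respect to \(a_0\) grows like \(\alpha^{2(T-1)}\), while the lifted penalty has the same diagonal curvature at every horizon.

```python
def shooting_curvature(alpha: float, T: int) -> float:
    """Second derivative of (s_T - g)^2 with respect to a_0 under serial rollout.

    s_T depends on a_0 through alpha**(T-1), so the result is 2 * alpha**(2*(T-1)).
    """
    return 2.0 * alpha ** (2 * (T - 1))


def lifted_curvature(alpha: float) -> float:
    """Largest diagonal second derivative of L_dyn when states are variables.

    d2/da_t^2 = 2, and d2/ds_t^2 = 2*alpha**2 + 2 because s_t appears once as
    an input and once as a target. Neither depends on the horizon T.
    """
    return max(2.0, 2.0 * alpha ** 2 + 2.0)


for T in (10, 20, 40):
    print(T, shooting_curvature(1.2, T), lifted_curvature(1.2))
```

For \(|\alpha|<1\) the same formula shrinks instead of growing, so the signal from the goal to early actions fades rather than explodes; either way the shooting problem is conditioned by the horizon and the lifted one is not.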

The practical consequence is simple: the planner can adjust the middle of the trajectory directly, then use local dynamics consistency to align each adjacent pair of states and actions. This is useful when the correct plan is non-greedy. In Push-T, for example, the right plan may first move the object away from the goal and then bring it back from the other side. A planner that only sees terminal error through a long serial rollout has a harder optimization problem.

The extra mechanisms

Turning states into optimization variables creates new failure modes, so GRASP adds three stabilizers.

Soft dynamics constraint. The transition equality becomes a penalty. The planner can search over intermediate states, but those states are scored by whether the world model can connect them with actions.

Stop-gradient through state inputs. In a high-dimensional visual latent space, \(\nabla_s F_\theta(s,a)\) can be brittle. The optimizer can find a tiny state perturbation that makes the model output the goal, even when the trajectory has no physical meaning. GRASP uses stop-gradient copies \(\bar{s}_t\), so gradients through \(F_\theta(\bar{s}_t,a_t)\) flow through actions, not through the state input.
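
The effect on gradients can be written out by hand for a scalar toy model. This sketch assumes that each virtual state still receives gradient as the target \(s_{t+1}\) of the previous transition; that routing is an assumption of the sketch, not a detail confirmed here. `dynamics_grads` and the `toy_*` functions are hypothetical. In an autodiff framework the `stop_state_grad=True` branch corresponds to wrapping the state input in a stop-gradient operator such as `jax.lax.stop_gradient` or PyTorch's `.detach()`.

```python
from typing import Callable, List, Tuple

Fn = Callable[[float, float], float]


def dynamics_grads(
    F: Fn, dF_ds: Fn, dF_da: Fn,
    s0: float, goal: float,
    virtual_states: List[float], actions: List[float],
    stop_state_grad: bool = True,
) -> Tuple[List[float], List[float]]:
    """Gradients of sum_t (F(s_t, a_t) - s_{t+1})^2 for actions and s_1..s_{T-1}."""
    states = [s0] + list(virtual_states) + [goal]
    T = len(actions)
    g_actions = [0.0] * T
    g_states = [0.0] * (T + 1)  # slots 0 and T belong to pinned states
    for t in range(T):
        r = F(states[t], actions[t]) - states[t + 1]
        g_actions[t] += 2.0 * r * dF_da(states[t], actions[t])
        g_states[t + 1] += -2.0 * r  # s_{t+1} as the target (assumed kept)
        if not stop_state_grad:
            # Input-side path through dF/ds: the brittle direction that
            # the stop-gradient copy removes.
            g_states[t] += 2.0 * r * dF_ds(states[t], actions[t])
    return g_actions, g_states[1:T]


def toy_F(s: float, a: float) -> float:
    return 0.5 * s + a


def toy_dF_ds(s: float, a: float) -> float:
    return 0.5


def toy_dF_da(s: float, a: float) -> float:
    return 1.0


args = (toy_F, toy_dF_ds, toy_dF_da, 0.0, 1.75, [1.0, 3.0], [1.0, 1.0, 1.0])
stopped = dynamics_grads(*args, stop_state_grad=True)
full = dynamics_grads(*args, stop_state_grad=False)
print(stopped)  # ([0.0, -3.0, 1.5], [0.0, 3.0])
print(full)     # ([0.0, -3.0, 1.5], [-1.5, 3.75])
```

The action gradients are identical in both modes. Only the state gradients change, because the stopped version drops every term that passes through \(\partial F/\partial s\).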

State noise plus rollout sync. Noise on virtual states helps the optimizer escape poor local minima. A periodic full-rollout gradient step then pulls the action sequence back toward a strict serial rollout. Noise explores; sync keeps the plan tied to what the model can actually roll out.
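
Put together, the three stabilizers suggest a loop like the following. It is a sketch under strong assumptions, not the paper's algorithm: the dynamics are a hypothetical scalar model \(F(s,a)=\alpha s+a\), the gradients are written by hand, the step sizes and sync period are arbitrary, and the linear decay of the noise scale is a choice of this sketch.

```python
import random
from typing import List

ALPHA = 0.5  # toy dynamics F(s, a) = ALPHA * s + a, a stand-in for F_theta


def rollout_final(s0: float, actions: List[float]) -> float:
    """Terminal state of a strict serial rollout."""
    s = s0
    for a in actions:
        s = ALPHA * s + a
    return s


def plan(s0: float, goal: float, T: int, n_iters: int = 300, lr: float = 0.1,
         lr_sync: float = 0.1, noise: float = 0.02, sync_every: int = 10,
         seed: int = 0) -> List[float]:
    rng = random.Random(seed)
    actions = [0.0] * T
    states = [s0] + [0.0] * (T - 1) + [goal]  # s_0 and s_T stay pinned
    for k in range(n_iters):
        # 1) Noise on the virtual states (decayed linearly in this sketch).
        sigma = noise * (1.0 - k / n_iters)
        for t in range(1, T):
            states[t] += rng.gauss(0.0, sigma)
        # 2) Lifted step with stop-gradient on state inputs.
        residuals = [ALPHA * states[t] + actions[t] - states[t + 1] for t in range(T)]
        for t in range(T):
            actions[t] -= lr * 2.0 * residuals[t]         # dF/da = 1
            if t + 1 < T:
                states[t + 1] += lr * 2.0 * residuals[t]  # target-side gradient is -2r
        # 3) Periodic sync: one gradient step on the serial-rollout goal error.
        if (k + 1) % sync_every == 0:
            err = rollout_final(s0, actions) - goal
            for t in range(T):
                actions[t] -= lr_sync * 2.0 * err * ALPHA ** (T - 1 - t)
    return actions


plan_actions = plan(s0=0.0, goal=1.0, T=5)
final_error = abs(rollout_final(0.0, plan_actions) - 1.0)
print(final_error)
```

The final check uses `rollout_final`, not the virtual states, because a plan only counts if the model can roll it out serially from \(s_0\).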

Evidence

The experiments use learned visual dynamics from DINO-WM on PointMaze, WallSingle, and Push-T. The baselines are CEM, vanilla gradient descent through serial rollout, and LatCo. The headline experiment stretches Push-T to longer horizons.

Horizon (Push-T)	CEM	GD	GRASP
H = 40	61.4% / 35.3s	51.0% / 18.0s	59.0% / 8.5s
H = 50	30.2% / 96.2s	37.6% / 76.3s	43.4% / 15.2s
H = 60	7.2% / 83.1s	16.4% / 146.5s	26.2% / 49.1s
H = 80	2.8% / 132.2s	6.4% / 161.3s	10.4% / 58.9s

Success rate / median successful-trial time on long-range Push-T, from Table 1 of arXiv:2602.00475. LatCo is omitted here for space; it is lower than all three at these horizons.

The pattern is the claim. At H = 40 the baselines remain competitive: CEM's 61.4% success rate is slightly above GRASP's 59.0%, although GRASP already has the shortest median time (8.5s, against 35.3s for CEM and 18.0s for GD). As the horizon grows, CEM and serial-gradient planning degrade faster, while GRASP has the highest success rate and the lowest median success time in every row from H = 50 up. The ablation table also supports the mechanism: removing sync nearly collapses success, allowing state gradients to flow drops the success rate from 59.0% to 46.6%, and the state-noise level changes both success and time.
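
One way to make "degrade faster" concrete is to ask how much of each method's H = 40 success rate survives at H = 80. The numbers below are copied from the table above; the `retention` ratio is an illustrative summary chosen here, not a metric reported in the paper.

```python
# Success rates (%) on long-range Push-T, copied from the table above.
success = {
    "CEM":   {40: 61.4, 50: 30.2, 60: 7.2, 80: 2.8},
    "GD":    {40: 51.0, 50: 37.6, 60: 16.4, 80: 6.4},
    "GRASP": {40: 59.0, 50: 43.4, 60: 26.2, 80: 10.4},
}


def retention(rates: dict, short: int = 40, long: int = 80) -> float:
    """Fraction of the short-horizon success rate that remains at the long horizon."""
    return rates[long] / rates[short]


retained = {name: round(retention(rates), 3) for name, rates in success.items()}
print(retained)  # {'CEM': 0.046, 'GD': 0.125, 'GRASP': 0.176}
```

By this measure CEM keeps under 5% of its H = 40 success rate, GD about 13%, and GRASP about 18%.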

Where it sits in the world-model map

TD-MPC makes latent world models useful for short-horizon control, with a terminal value estimating the return beyond the horizon. PRISM-WM changes the dynamics model, so contact and hybrid regimes are not averaged by a single transition function. GRASP changes the planner that uses the model. It shows that long-horizon failure can come from the optimization interface between planner and world model, even before the model architecture changes.

This also marks the boundary. GRASP makes search through a learned model easier. It does not certify that the searched future is true in the environment. A stronger planner can still find plans that exploit a bad world model. That is why it connects naturally to train-test-gap work, event verifiers, and trust-horizon estimates.

References

Michael Psenka et al., Parallel Stochastic Gradient-Based Planning for World Models, 2026, arXiv:2602.00475. Project page: michaelpsenka.io/grasp.
Gaoyue Zhou et al., DINO-WM, the visual world-model framework used in the experiments.
