GRASP: planning by optimizing the states in between
GRASP changes the object a planner is allowed to optimize. In a standard world-model rollout, the intermediate states are computed consequences of the action sequence. In GRASP, those intermediate states become virtual states: variables optimized together with the actions, then checked by the learned dynamics.
The identity of a state changes
A learned world model gives a transition function \(F_\theta(s_t, a_t)\). Given a state \(s_t\) and an action \(a_t\), it predicts the next state \(s_{t+1}\); starting from the current state \(s_0\), repeated application produces a rollout. A standard rollout planner optimizes only the action sequence:
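Written out, the shooting problem takes roughly the following form. The goal state \(g\) and the distance \(d\) are notation assumed here for illustration; the source does not specify the terminal cost.

\[
\min_{a_0,\ldots,a_{T-1}} \; d(s_T, g)
\quad \text{subject to} \quad
s_{t+1} = F_\theta(s_t, a_t), \qquad t = 0,\ldots,T-1 .
\]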
Here \(s_1,s_2,\ldots\) are rollout results. The planner cannot move \(s_2\) directly. It can only change earlier actions and ask the world model to recompute the chain.
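A minimal sketch of that serial dependence, using a hypothetical scalar stand-in `toy_F` in place of the learned model (the names `rollout`, `shooting_loss`, and `toy_F` are illustrative, not from the paper):

```python
from typing import Callable, List


def toy_F(s: float, a: float) -> float:
    """Hypothetical scalar dynamics standing in for the learned F_theta."""
    return 0.9 * s + 0.5 * a


def rollout(F: Callable[[float, float], float], s0: float, actions: List[float]) -> List[float]:
    """Serial rollout: every state is computed from the previous state and action."""
    states = [s0]
    for a in actions:
        states.append(F(states[-1], a))
    return states


def shooting_loss(F: Callable[[float, float], float], s0: float,
                  actions: List[float], goal: float) -> float:
    """Squared terminal error (an assumed cost); the only free variables are the actions."""
    return (rollout(F, s0, actions)[-1] - goal) ** 2


states = rollout(toy_F, 0.0, [1.0, 1.0, 1.0])
# The only way to move states[2] is to change an earlier action and roll out again.
moved = rollout(toy_F, 0.0, [2.0, 1.0, 1.0])
print(states[2], moved[2])
```

Nothing in this interface lets the caller set `states[2]` directly, which is the limitation the lifted formulation removes.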
GRASP keeps \(s_0\) fixed as the current state and \(s_T\) fixed as the goal, then adds the intermediate states to the optimization variables:
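In its basic penalty form, and assuming a squared-error consistency term (the exact norm and weighting are not given in the source), the lifted problem can be sketched as:

\[
\min_{a_0,\ldots,a_{T-1},\; s_1,\ldots,s_{T-1}} \;
\sum_{t=0}^{T-1} \bigl\lVert F_\theta(s_t, a_t) - s_{t+1} \bigr\rVert^2,
\qquad s_0 \text{ and } s_T \text{ held fixed.}
\]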
The same symbols now mean something different. \(s_1\) and \(s_2\) are no longer only what the model produced. They are virtual states proposed by the planner. The world model checks them: if the planner proposes \(s_{t+1}\) but \(F_\theta(s_t,a_t)\) cannot reach it, the loss increases.
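The check can be illustrated with a small sketch. It assumes a hypothetical scalar model `toy_F` and a squared-error residual; `lifted_loss` is an illustrative name, not the paper's API.

```python
from typing import Callable, List, Tuple


def toy_F(s: float, a: float) -> float:
    """Hypothetical scalar dynamics standing in for the learned F_theta."""
    return 0.9 * s + 0.5 * a


def lifted_loss(F: Callable[[float, float], float], s0: float,
                virtual_states: List[float], actions: List[float],
                goal: float) -> Tuple[float, List[float]]:
    """Sum of squared one-step residuals over a proposed trajectory.

    s0 and goal are fixed endpoints; virtual_states are s_1..s_{T-1}.
    """
    states = [s0] + list(virtual_states) + [goal]
    if len(actions) != len(states) - 1:
        raise ValueError("need exactly one action per transition")
    residuals = [F(states[t], actions[t]) - states[t + 1] for t in range(len(actions))]
    return sum(r * r for r in residuals), residuals


actions = [1.0, 1.0, 1.0]
# Virtual states the model can actually connect: the residuals vanish.
consistent, _ = lifted_loss(toy_F, 0.0, [0.5, 0.95], actions, 1.355)
# A proposed s_2 the model cannot reach from s_1: the loss increases.
inconsistent, residuals = lifted_loss(toy_F, 0.0, [0.5, 2.0], actions, 1.355)
print(consistent, inconsistent, residuals)
```

Each residual depends only on one adjacent pair of states and one action, so all of them can be evaluated independently rather than as one serial chain.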
Why this helps long-horizon planning
In a standard action-only rollout, the final goal error must travel backward through a long composition of \(F_\theta\). The planning problem becomes a long backpropagation path through a learned dynamics model. The paper's linear analysis shows the shooting formulation can acquire curvature that grows exponentially with horizon, while the lifted-state formulation has a horizon-independent smoothness bound.
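A toy calculation gives a feel for the contrast. It is not the paper's analysis: it assumes a scalar linear model \(s_{t+1} = A s_t + B a_t\) with \(|A| > 1\) and squared-error costs, and the function names are illustrative. For the shooting loss \((s_T - g)^2\), the Hessian in the actions is \(2vv^\top\) with \(v_t = A^{T-1-t}B\), so its largest eigenvalue is \(2\lVert v\rVert^2\). For the lifted loss, the Hessian is \(2J^\top J\), where each row of \(J\) holds at most the entries \(A\), \(B\), and \(-1\), which yields a bound with no dependence on \(T\).

```python
def shooting_curvature(A: float, B: float, T: int) -> float:
    """Largest Hessian eigenvalue of (s_T - g)^2 w.r.t. the actions.

    The Hessian is 2 * v v^T with v_t = A**(T-1-t) * B, so the value is 2 * ||v||^2.
    """
    return 2.0 * sum((A ** (T - 1 - t) * B) ** 2 for t in range(T))


def lifted_curvature_bound(A: float, B: float) -> float:
    """Upper bound on the largest Hessian eigenvalue of the lifted loss.

    Uses ||J||_2^2 <= ||J||_1 * ||J||_inf for the residual Jacobian J.
    """
    max_row_sum = abs(A) + abs(B) + 1.0     # one residual touches s_t, a_t, s_{t+1}
    max_col_sum = max(abs(A) + 1.0, abs(B))  # one variable appears in at most two residuals
    return 2.0 * max_row_sum * max_col_sum


for T in (10, 20, 40):
    print(T, shooting_curvature(1.2, 1.0, T), lifted_curvature_bound(1.2, 1.0))
```

In this toy setting the first number grows geometrically with `T` while the second stays constant, which mirrors the qualitative statement above.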
The practical consequence is simple: the planner can adjust the middle of the trajectory directly, then use local dynamics consistency to align each adjacent pair of states and actions. This is useful when the correct plan is non-greedy. Push-T, for example, may require moving away from the goal before reaching it later. A planner that only sees terminal error through a long serial rollout has a harder optimization problem.
The extra mechanisms
Turning states into optimization variables creates new failure modes, so GRASP adds three stabilizers.
Soft dynamics constraint. The transition equality becomes a penalty. The planner can search over intermediate states, but those states are scored by whether the world model can connect them with actions.
Stop-gradient through state inputs. In a high-dimensional visual latent space, \(\nabla_s F_\theta(s,a)\) can be brittle. The optimizer can find a tiny state perturbation that makes the model output the goal, even when the trajectory has no physical meaning. GRASP uses stop-gradient copies \(\bar{s}_t\), so gradients through \(F_\theta(\bar{s}_t,a_t)\) flow through actions, not through the state input.
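The effect on the gradients can be written out by hand for a toy case. The sketch below assumes a hypothetical scalar linear model \(F(s,a) = As + Ba\) and a squared residual \(r_t = F(s_t,a_t) - s_{t+1}\); `lifted_grads` and its `stop_state_grad` flag are illustrative names.

```python
from typing import List, Tuple

A, B = 0.9, 0.5  # hypothetical toy dynamics F(s, a) = A*s + B*a


def lifted_grads(states: List[float], actions: List[float],
                 stop_state_grad: bool = True) -> Tuple[List[float], List[float]]:
    """Gradients of sum_t r_t^2 for the virtual states s_1..s_{T-1} and the actions.

    With stop_state_grad=True the term that flows through the state input of F
    (2 * r_t * dF/ds) is dropped, so a state only receives gradient as a target.
    """
    T = len(actions)
    g_s = [0.0] * (T + 1)
    g_a = [0.0] * T
    for t in range(T):
        r = A * states[t] + B * actions[t] - states[t + 1]
        g_a[t] += 2.0 * r * B          # through the action input: always kept
        g_s[t + 1] += -2.0 * r         # s_{t+1} as the prediction target: always kept
        if not stop_state_grad:
            g_s[t] += 2.0 * r * A      # through the state input: removed by stop-gradient
    return g_s[1:T], g_a               # endpoints s_0 and s_T are fixed


states = [0.0, 0.5, 2.0, 1.355]
actions = [1.0, 1.0, 1.0]
gs_stop, ga_stop = lifted_grads(states, actions, stop_state_grad=True)
gs_full, ga_full = lifted_grads(states, actions, stop_state_grad=False)
print(gs_stop, gs_full, ga_stop)
```

The action gradients are identical in both modes; only the state gradients lose the term that passes through the model's state input. In an automatic-differentiation framework the same effect is typically obtained by wrapping the state input in a stop-gradient or detach operation.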
State noise plus rollout sync. Noise on virtual states helps the optimizer escape poor local minima. A periodic full-rollout gradient step then pulls the action sequence back toward a strict serial rollout. Noise explores; sync keeps the plan tied to what the model can actually roll out.
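Put together, the three stabilizers suggest a loop like the sketch below. It is a toy under stated assumptions, not the paper's algorithm: scalar linear dynamics, hand-written gradients, a linearly annealed noise schedule, and hypothetical hyperparameters (`lr`, `noise_std`, `sync_every`).

```python
import random
from typing import List

A, B = 0.9, 0.5  # hypothetical scalar dynamics F(s, a) = A*s + B*a


def rollout_final(s0: float, actions: List[float]) -> float:
    """Strict serial rollout; returns the final state."""
    s = s0
    for a in actions:
        s = A * s + B * a
    return s


def plan(s0: float, goal: float, T: int = 5, iters: int = 300, lr: float = 0.1,
         noise_std: float = 0.02, sync_every: int = 5, seed: int = 0) -> List[float]:
    rng = random.Random(seed)
    # Virtual states start on a straight line between the fixed endpoints.
    states = [s0 + (goal - s0) * t / T for t in range(T + 1)]
    actions = [0.0] * T
    for k in range(iters):
        # Soft dynamics constraint: one residual per adjacent pair.
        res = [A * states[t] + B * actions[t] - states[t + 1] for t in range(T)]
        sigma = noise_std * (1.0 - k / iters)  # annealing is an assumption of this sketch
        for t in range(T):
            actions[t] -= lr * 2.0 * res[t] * B
            if t + 1 < T:
                # Stop-gradient: s_{t+1} moves only as a target, plus exploration noise.
                states[t + 1] += lr * 2.0 * res[t] + rng.gauss(0.0, sigma)
        if (k + 1) % sync_every == 0:
            # Sync: one gradient step on the actions through the full serial rollout.
            err = rollout_final(s0, actions) - goal
            for t in range(T):
                actions[t] -= lr * 2.0 * err * (A ** (T - 1 - t)) * B
    return actions


planned = plan(0.0, 1.0)
print(planned, rollout_final(0.0, planned))
```

The value that matters at the end is the serial rollout of the returned actions, not the virtual states, since only the actions are executed.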
Evidence
The experiments use learned visual dynamics from DINO-WM on PointMaze, WallSingle, and Push-T. The baselines are CEM, vanilla gradient descent through serial rollout, and LatCo. The headline experiment stretches Push-T to longer horizons.
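| Horizon (Push-T) | CEM (success rate / median success time) | GD (success rate / median success time) | GRASP (success rate / median success time) |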
|---|---|---|---|
| H = 40 | 61.4% / 35.3s | 51.0% / 18.0s | 59.0% / 8.5s |
| H = 50 | 30.2% / 96.2s | 37.6% / 76.3s | 43.4% / 15.2s |
| H = 60 | 7.2% / 83.1s | 16.4% / 146.5s | 26.2% / 49.1s |
| H = 80 | 2.8% / 132.2s | 6.4% / 161.3s | 10.4% / 58.9s |
The pattern is the claim. At short horizons, strong baselines remain competitive: at H = 40, CEM's 61.4% success rate edges out GRASP's 59.0%, although GRASP is faster. As the horizon grows, CEM and serial-gradient planning degrade faster, while GRASP keeps a higher success rate and a lower median success time. The paper's ablation table also supports the mechanism: removing sync nearly collapses success, letting gradients flow through the state inputs drops the success rate from 59.0% to 46.6%, and the state-noise level changes both success and time.
Where it sits in the world-model map
TD-MPC makes latent world models useful for short-horizon control with a terminal value. PRISM-WM changes the dynamics model, so contact and hybrid regimes are not averaged by a single transition function. GRASP changes the planner that uses the model. It shows that long-horizon failure can come from the optimization interface between planner and world model, even before the model architecture changes.
This also marks the boundary. GRASP makes search through a learned model easier. It does not certify that the searched future is true in the environment. A stronger planner can still find plans that exploit a bad world model. That is why it connects naturally to train-test-gap work, event verifiers, and trust-horizon estimates.
References
- Michael Psenka et al., Parallel Stochastic Gradient-Based Planning for World Models, 2026. Project page: michaelpsenka.io/grasp.
- Gaoyue Zhou et al., DINO-WM, the visual world-model framework used in the experiments.