Long-horizon rollouts as overestimation control

June 2026 · offline model-based RL · NEUBAY

Offline model-based RL usually fights value overestimation with conservatism: penalize out-of-dataset actions and keep rollouts short. NEUBAY removes the conservatism and lets a long rollout do the work. The mechanism is concrete: a posterior over world models decides how far to roll out, and the discount factor shrinks the weight of the one term most prone to overestimation.

The setup: learning from a fixed dataset

In offline RL the agent gets one batch of logged transitions and cannot collect more. A model-based agent learns a world model from that batch — a transition-and-reward function that maps a state and action to the next state and reward — and then trains a policy inside the model instead of in the real environment.

This is where value overestimation appears. The policy improves by preferring actions the model scores highly, but the model was fit on a finite dataset, so on actions the data never covered it extrapolates, and it tends to extrapolate upward. The optimizer then selects exactly those actions whose high score is an artifact of the model. The standard defense is conservatism: add a pessimism penalty on out-of-dataset actions, and keep the imagined rollout short so it does not wander into unsupported regions where error compounds.

How far to roll out

NEUBAY targets the rollout length directly. The value of a candidate action is estimated from an imagined rollout of horizon \(H\): a sum of model-predicted rewards along the way, plus one bootstrapped value estimate at the end to cover the rest.

\[ V(s_0,a_0)\;\approx\;\underbrace{\sum_{t=0}^{H-1}\gamma^{t}\,\hat r_t}_{\text{imagined rewards}}\;+\;\underbrace{\gamma^{H}\,\hat V(s_H)}_{\text{bootstrapped terminal value}} \]

The two terms behave differently. The imagined rewards are low-bias when the model generalizes. The bootstrapped terminal value is the term most prone to overestimation — it is the value function reporting a number for a state the rollout has arrived at. And that term is multiplied by \(\gamma^{H}\). With a discount near \(0.99\), a one-step rollout keeps almost all of it; a several-hundred-step rollout drives it to nearly zero.

[Schematic: a candidate action's score is the sum of imagined rewards \(\hat r_0,\dots,\hat r_{H-1}\) (low bias) plus the discounted terminal value \(\gamma^{H}\,\hat V(s_H)\) (prone to overestimation). Weight on the terminal value for \(\gamma = 0.99\): H = 1 → 0.99, H = 100 → 0.37, H = 500 → 0.007.]
Schematic (original). A longer rollout replaces one high-bias terminal value with a longer sum of lower-bias model rewards, because the terminal term carries weight \(\gamma^{H}\). NEUBAY makes the bias–variance version of this argument; see the analysis section of the paper.
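
The weights in the schematic can be checked with a few lines of Python. This is a minimal sketch of the target in the equation above, not code from the NEUBAY repository; the function name `h_step_target` and the constant reward and terminal value are made up for illustration.

```python
def h_step_target(rewards, terminal_value, gamma=0.99):
    """Discounted sum of imagined rewards plus the discounted terminal value.

    Returns the target and the weight gamma**H placed on the terminal value,
    where H is the number of imagined reward steps.
    """
    horizon = len(rewards)
    reward_sum = sum(gamma ** t * r for t, r in enumerate(rewards))
    terminal_weight = gamma ** horizon
    return reward_sum + terminal_weight * terminal_value, terminal_weight


for horizon in (1, 100, 500):
    # Illustrative numbers only: reward 1.0 per step, terminal value 100.0.
    target, weight = h_step_target([1.0] * horizon, terminal_value=100.0)
    print(f"H={horizon:3d}  terminal weight={weight:.3f}  target={target:.2f}")
```

The printed terminal weights reproduce the 0.99, 0.37, and 0.007 shown in the schematic.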

Why longer rollouts lower the bias

The paper frames this as a bias–variance trade-off on the value target. Bootstrapping from a learned value function injects the overestimation bias in a single term; relying instead on many imagined reward steps trades that bias for the model's own generalization error, which stays low when the model is accurate. Because the bootstrapped term is exponentially discounted with \(H\), extending the horizon reduces the agent's sensitivity to value-function error. Removing conservatism does not make overestimation worse here — once it is removed, the long rollout is what keeps the value target honest.
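
A simplified worked version of the argument, under assumptions that are ours rather than the paper's: suppose the value function is off by \(\varepsilon\) at the terminal state, \(\hat V(s_H) = V(s_H) + \varepsilon\), and write \(r_t\) for the true rewards along the same trajectory. The error of the target then splits into a model term and a bootstrap term:

\[ \Big(\sum_{t=0}^{H-1}\gamma^{t}\,\hat r_t + \gamma^{H}\,\hat V(s_H)\Big) - \Big(\sum_{t=0}^{H-1}\gamma^{t}\,r_t + \gamma^{H}\,V(s_H)\Big) \;=\; \sum_{t=0}^{H-1}\gamma^{t}\,(\hat r_t - r_t) \;+\; \gamma^{H}\,\varepsilon \]

With \(\gamma = 0.99\), the bootstrap term is \(0.99\,\varepsilon\) at \(H = 1\) and about \(0.37\,\varepsilon\) at \(H = 100\). The first sum is the model's own error, which is the quantity a longer rollout takes on in exchange.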

What lets it go long: a posterior over world models

A long rollout through a single learned model accumulates its errors, which is the usual reason offline methods keep rollouts short. NEUBAY avoids the single-model trap with a Bayesian treatment: instead of one world model it keeps a posterior over world models, approximated by an ensemble, and trains a history-dependent agent to maximize return averaged over that posterior. The name comes from this neutral Bayesian principle — neutral because it drops the explicit pessimism penalty, Bayesian because it represents what it does not know as a distribution over models rather than a single point estimate.
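
"Return averaged over the posterior" can be illustrated with a Monte Carlo estimate: draw an ensemble member, roll the policy out inside it, and average the returns over draws. The sketch below is a toy under stated assumptions and is not taken from the NEUBAY code. The 1-D dynamics, `make_model`, the memoryless `policy`, and uniform sampling over members are all hypothetical choices made for illustration.

```python
import random


def rollout_return(model, policy, state, horizon, gamma=0.99):
    """Discounted return of `policy` inside a single world model."""
    total = 0.0
    for t in range(horizon):
        action = policy(state)
        state, reward = model(state, action)
        total += gamma ** t * reward
    return total


def posterior_average_return(ensemble, policy, start_state, horizon, n_samples, rng):
    """Monte Carlo estimate: sample a member uniformly, roll out, average."""
    returns = []
    for _ in range(n_samples):
        model = rng.choice(ensemble)  # one member is held fixed per rollout
        returns.append(rollout_return(model, policy, start_state, horizon))
    return sum(returns) / n_samples


def make_model(drift):
    # Toy 1-D dynamics (assumption): state moves by action + drift,
    # and the reward is the negative distance of the next state from 0.
    def model(state, action):
        next_state = state + action + drift
        return next_state, -abs(next_state)
    return model


ensemble = [make_model(d) for d in (-0.1, 0.0, 0.1)]
policy = lambda s: -0.5 * s  # simple proportional controller
estimate = posterior_average_return(
    ensemble, policy, start_state=1.0, horizon=20, n_samples=300, rng=random.Random(0)
)
print(f"estimated posterior-averaged return: {estimate:.3f}")
```

Because each sampled member is held fixed for a whole rollout in this sketch, the observed history carries information about which member generated it. The toy policy ignores that history; the agent described above is history-dependent.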

The ensemble also supplies the stopping rule. Where the data is dense the ensemble members agree, so epistemic uncertainty is low; where the data thins out they disagree. NEUBAY rolls out until that disagreement crosses an uncertainty threshold \(\zeta\), then truncates. The horizon is therefore adaptive and state-dependent: it is long in well-supported regions and short where the model stops being trustworthy.

[Schematic: a posterior over world models along the rollout steps. In-distribution the models agree and uncertainty is low; out-of-distribution the models diverge, and the rollout is truncated where uncertainty exceeds \(\zeta\).]
Schematic (original). The ensemble approximates the posterior; its spread is the epistemic uncertainty. The rollout runs until the spread exceeds \(\zeta\), so the effective horizon is decided by where the model stops agreeing with itself.
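
The stopping rule can be sketched in a few lines. This is a toy illustration, not the paper's implementation: this article does not specify how NEUBAY measures disagreement, so the population standard deviation of the members' next-state predictions is used as a stand-in, the rollout follows the ensemble mean, and the `zeta` values below are not comparable to the thresholds reported in the paper. `make_member`, its `support` parameter, and the constant policy are hypothetical.

```python
import statistics


def adaptive_rollout(ensemble, policy, state, zeta, max_horizon):
    """Roll out until ensemble disagreement exceeds zeta, then truncate."""
    rewards = []
    for _ in range(max_horizon):
        action = policy(state)
        predictions = [member(state, action) for member in ensemble]
        next_states = [p[0] for p in predictions]
        # Assumed uncertainty measure: spread of next-state predictions.
        disagreement = statistics.pstdev(next_states)
        if disagreement > zeta:
            break  # the members no longer agree, so stop here
        state = statistics.fmean(next_states)
        rewards.append(statistics.fmean([p[1] for p in predictions]))
    return rewards, state


def make_member(k, support=1.0):
    # Toy member (assumption): all members agree while |state| <= support
    # and drift apart in proportion to how far the state lies outside it.
    def member(state, action):
        excess = max(0.0, abs(state) - support)
        return state + action + k * excess, 1.0
    return member


ensemble = [make_member(k) for k in (-0.5, 0.0, 0.5)]
policy = lambda s: 0.25  # keeps walking away from the supported region

for zeta in (0.05, 0.15, 0.25):
    rewards, last_state = adaptive_rollout(ensemble, policy, 0.0, zeta, max_horizon=50)
    print(f"zeta={zeta:.2f}  horizon={len(rewards)}  last state={last_state:.2f}")
```

In this toy setting a looser threshold yields a longer horizon, and the rollout always stops shortly after the state leaves the region where the members agree.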

Evidence: 33 datasets and one decisive ablation

NEUBAY is evaluated on 33 datasets across four offline suites: D4RL locomotion (12), NeoRL locomotion (9), Adroit (6), and AntMaze (6). It is compared against 15 baselines spanning conservative model-free methods (CQL, IQL, EDAC, ReBRAC), conservative model-based methods (MOPO, COMBO, RAMBO, MOBILE, and others), and Bayesian-inspired methods (APE-V, MAPLE, CBOP, MoDAP). Without an explicit conservatism penalty it stays competitive on standard data and is strongest on low-quality datasets, where conservative methods tend to fail.

The ablation that carries the central claim is on the uncertainty threshold \(\zeta\), because \(\zeta\) sets how far the rollout runs. Tightening it shortens the horizon, and the resulting failure mode is severe value overestimation; loosening it yields horizons of tens to hundreds of steps and recovers performance. The paper notes that these horizons run counter to the conventional preference for short rollouts.

[Schematic: D4RL locomotion score vs. threshold \(\zeta\), numbers from the paper, Table 2. \(\zeta = 0.9\) (short rollout): 22.5, failure mode overestimation. \(\zeta = 1.0\) (horizon ≈ 64–512 steps): 80.1.]
Schematic (original); the two scores and the 64–512-step horizon are reported in the paper, Table 2. Forcing the rollout short collapses performance through overestimation; allowing long adaptive rollouts recovers it.

Figure 1 from NEUBAY (arXiv:2512.04341), CC BY 4.0. Adaptive long-horizon rollouts improve performance (left) while lowering the estimated Q-value on the offline dataset (middle): longer rollouts suppress overestimation.

Components

The default ensemble size is large — \(N = 100\) world models, with layer normalization inside each — and the paper reports that reducing it to 20 or 5 degrades performance. The posterior is maintained over the joint reward–transition model, not over the value function; the value function is the agent's own estimate trained on top.

Limitations

The threshold \(\zeta\) is a hyperparameter, so "how far to trust the model" is set rather than learned end-to-end. The \(N = 100\) ensemble is computationally heavy, and the argument leans on the ensemble's spread being a faithful stand-in for epistemic uncertainty, which is not guaranteed in high-dimensional learned models. The result also sits in tension with the online latent-MPC line, where TD-MPC deliberately keeps the rollout short and invests in an accurate terminal value instead — the same overestimation problem, addressed from the opposite end of the horizon.

References

  1. NEUBAY: Long-Horizon Model-Based Offline Reinforcement Learning Without Explicit Conservatism (arXiv:2512.04341). Code: github.com/twni2016/neubay.
  2. Contrast on the short-horizon end: Temporal Difference Learning for Model Predictive Control (TD-MPC).
