MOMBO: Deterministic Uncertainty Propagation for Offline RL
TL;DR: MOMBO replaces Monte Carlo Bellman targets with deterministic moment matching, which gives provably tighter suboptimality bounds and the best average AULC rank (1.2) against MOPO and MOBILE across 12 D4RL environment-dataset combinations. NeurIPS 2024.
Introduction
Offline reinforcement learning, which learns policies from pre-collected datasets without environment interaction, is essential for high-stakes domains such as healthcare, robotics, and autonomous driving, where real-world exploration is costly or dangerous. The core obstacle is distributional shift: value estimates for actions underrepresented in the dataset become inflated, and no environment feedback is available to correct them. MOMBO (Moment Matching Offline Model-Based Policy Optimization) traces the training instability of model-based offline RL to high-variance Bellman targets produced by Monte Carlo sampling. It replaces sampling with deterministic moment matching, which yields provably tighter suboptimality bounds and, empirically, faster convergence.
Problem Statement
- Model-based offline RL methods (MOPO, MOBILE) apply Pessimistic Value Iteration (PEVI): penalize Q-value estimates by the learned dynamics model’s uncertainty to keep the policy conservative about unseen state-action pairs.
- All existing PEVI methods sample a single next state (N=1) from the Gaussian dynamics model and evaluate the Q-network on it. A single sample is cheap but injects high variance into every Bellman target.
- This high variance corrupts gradient updates, slows convergence, and forces larger penalty coefficients to compensate, making model-based offline RL often slower than model-free approaches despite having access to synthetic data.
- Theoretically: the suboptimality bound of MC-based PEVI contains a sampling-error term that shrinks as O(1/√N) in the number of MC samples N. The standard choice N=1 therefore gives the loosest bound, and the term vanishes only in the limit N→∞, which sampling cannot reach at finite cost.
- Gap: No existing method propagates next-state uncertainty analytically through the Q-network, despite this being the direct source of training instability.
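The variance argument in the bullets above can be checked numerically. The following is a minimal sketch, not the paper's experiment: `toy_q` is a hypothetical stand-in for a Q-network, the Gaussian next-state parameters are made up, and only the Python standard library is used. It estimates how much a Monte Carlo Bellman target fluctuates across independent re-draws for different sample counts N. It illustrates the 1/√N behavior of the sampling error, not the suboptimality bound itself.

```python
import random
import statistics

def toy_q(state):
    # Hypothetical stand-in for a Q-network: any fixed nonlinear function works here.
    return sum(max(x, 0.0) for x in state)

def mc_bellman_target(mu, sigma, reward, gamma, n_samples, rng):
    """One Monte Carlo estimate of r + gamma * E[Q(s')] with s' ~ N(mu, sigma^2)."""
    total = 0.0
    for _ in range(n_samples):
        s_next = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
        total += toy_q(s_next)
    return reward + gamma * total / n_samples

def target_std(n_samples, n_repeats=4000, seed=0):
    """Empirical standard deviation of the target across independent re-draws."""
    rng = random.Random(seed)
    mu, sigma = [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]  # made-up dynamics-model output
    targets = [
        mc_bellman_target(mu, sigma, 1.0, 0.99, n_samples, rng)
        for _ in range(n_repeats)
    ]
    return statistics.pstdev(targets)

for n in (1, 4, 16):
    print(f"N={n:2d}  std of Bellman target ~ {target_std(n):.3f}")
```

In this toy setting the spread of the target roughly halves each time N is quadrupled, so N=1 is the noisiest choice, while driving the noise toward zero by sampling alone would require many Q-network evaluations per target.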
Methodology
MOMBO replaces Monte Carlo sampling with progressive moment matching: the Gaussian next-state distribution output by the learned dynamics model is propagated through the Q-network layer by layer, analytically tracking the mean and variance of hidden activations.
Pessimistic Bellman target (exact, no sampling):
\[\hat{\mathcal{B}}_{\text{pess}} = r + \gamma \mu_{\text{MM}} - \beta \gamma \sigma_{\text{MM}}\]
Implementation details:
- Linear layers: transform mean and variance analytically (exact Gaussian propagation)
- ReLU activations: compute the first two moments via the Gaussian CDF/PDF (closed-form)
- Result: a Normal distribution over Q-values at each next state, used directly to form the pessimistic target (mean − β × std)
- Requires only two forward passes; no additional parameters or rollouts
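The layer-wise rules listed above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, the network is a plain ReLU MLP given as a list of `(weight, bias)` pairs, the action input is omitted for brevity, and the Gaussian is kept diagonal (units are treated as uncorrelated), so the linear-layer mean is exact but the variance is exact only under that assumption.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def linear_moments(mean, var, weight, bias):
    """Propagate a diagonal Gaussian through y = W x + b.

    The output mean is exact. The output variance is exact only when the
    inputs are uncorrelated (an assumption of this sketch).
    """
    out_mean = [sum(w * m for w, m in zip(row, mean)) + b
                for row, b in zip(weight, bias)]
    out_var = [sum(w * w * v for w, v in zip(row, var)) for row in weight]
    return out_mean, out_var

def relu_moments(mean, var):
    """Closed-form mean and variance of max(0, x) for x ~ N(mean, var), per unit."""
    out_mean, out_var = [], []
    for m, v in zip(mean, var):
        s = math.sqrt(v)
        if s < 1e-12:  # Degenerate case: x is effectively a constant.
            out_mean.append(max(m, 0.0))
            out_var.append(0.0)
            continue
        a = m / s
        first = m * norm_cdf(a) + s * norm_pdf(a)                  # E[relu(x)]
        second = (m * m + v) * norm_cdf(a) + m * s * norm_pdf(a)   # E[relu(x)^2]
        out_mean.append(first)
        out_var.append(max(second - first * first, 0.0))
    return out_mean, out_var

def pessimistic_target(reward, gamma, beta, s_mean, s_var, layers):
    """r + gamma * mu_MM - beta * gamma * sigma_MM, with ReLU between linear layers."""
    mean, var = s_mean, s_var
    for i, (weight, bias) in enumerate(layers):
        mean, var = linear_moments(mean, var, weight, bias)
        if i < len(layers) - 1:  # No activation after the output layer.
            mean, var = relu_moments(mean, var)
    mu_mm, sigma_mm = mean[0], math.sqrt(var[0])
    return reward + gamma * mu_mm - beta * gamma * sigma_mm

# Tiny made-up network: 2 inputs -> 2 hidden units -> 1 output.
layers = [([[1.0, -0.5], [0.3, 0.8]], [0.0, 0.1]), ([[0.7, -0.2]], [0.05])]
print(pessimistic_target(reward=1.0, gamma=0.99, beta=1.0,
                         s_mean=[0.2, -0.1], s_var=[0.04, 0.09], layers=layers))
```

Calling `pessimistic_target` twice with the same inputs returns the same value, which is the point of the deterministic target: no random draw enters the computation.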
Theoretical improvement over MC-based PEVI:
| Method | Bound type | Key term |
|---|---|---|
| MC-based PEVI (N=1) | Probabilistic (holds w/ prob 1−δ) | Scales with R²_max/(1−γ)² |
| MOMBO | Deterministic (always holds) | Depends only on network activation constants G_l, C_l ≤ 1 |
MOMBO’s bound is strictly tighter: it holds without probability qualification and depends only on the network’s Lipschitz structure.
Results
Evaluated on the D4RL offline benchmark across 12 environment-dataset combinations: halfcheetah, hopper, and walker2d × random, medium, medium-replay, and medium-expert (4 seeds). Two metrics: Normalized Reward (final policy quality) and AULC (area under the learning curve, measuring convergence speed and stability).
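To make the AULC metric concrete, here is a minimal sketch of one way to compute it. The exact normalization used in the paper is not given here, so this version is an assumption: it integrates the learning curve with the trapezoidal rule and divides by the step range, which makes the result an average normalized reward over training. The function name `aulc` and both curves are hypothetical.

```python
def aulc(steps, scores):
    """Trapezoidal area under (steps, scores), divided by the step range."""
    if len(steps) != len(scores) or len(steps) < 2:
        raise ValueError("need at least two (step, score) pairs of equal length")
    area = 0.0
    for i in range(1, len(steps)):
        area += 0.5 * (scores[i] + scores[i - 1]) * (steps[i] - steps[i - 1])
    return area / (steps[-1] - steps[0])

# Two hypothetical learning curves that end at the same normalized reward.
steps = [0, 1, 2, 3, 4]
fast = [0.0, 80.0, 90.0, 90.0, 90.0]
slow = [0.0, 10.0, 30.0, 60.0, 90.0]
print(aulc(steps, fast), aulc(steps, slow))
```

Both curves reach a final score of 90, so Normalized Reward cannot tell them apart, while the AULC of the fast curve is roughly twice that of the slow one. This is why the two metrics are reported side by side.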
MOMBO achieves the best average AULC ranking of 1.2 across all 12 settings:
| Dataset type | MOPO AULC rank | MOBILE AULC rank | MOMBO AULC rank |
|---|---|---|---|
| random | 2.7 | 2.0 | 1.3 |
| medium | 2.7 | 2.0 | 1.3 |
| medium-replay | 2.3 | 2.0 | 1.7 |
| medium-expert | 2.7 | 2.0 | 1.3 |
| Overall | 2.7 | 2.2 | 1.2 |
Rank 1 = best. Lower is better.
Selected AULC scores on the most practically relevant settings:
| Task | MOMBO | MOBILE | MOPO |
|---|---|---|---|
| medium — hopper | 95.9 ± 2.5 | 82.2 ± 7.3 | 37.0 ± 15.3 |
| medium — walker2d | 84.0 ± 1.1 | 79.0 ± 1.3 | 77.6 ± 1.3 |
| medium-replay — hopper | 87.3 ± 2.0 | 78.7 ± 4.0 | 81.7 ± 4.6 |
| medium-expert — halfcheetah | 95.2 ± 0.7 | 94.5 ± 1.8 | 77.1 ± 4.0 |
| medium-expert — walker2d | 98.9 ± 3.3 | 94.3 ± 0.9 | 88.3 ± 6.3 |
MOMBO’s advantage shows up more in AULC than in final reward, which is consistent with the hypothesis that lower-variance Bellman targets speed up and stabilize learning. The medium-hopper gap (95.9 vs 82.2 vs 37.0 for MOMBO, MOBILE, and MOPO) is the most striking: MOPO’s learning degrades badly on this medium-quality dataset, while MOMBO stays stable.
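The rank-based summary can be reproduced on the selected tasks as a small worked example. The sketch below uses only the mean AULC values from the selected-task table (standard deviations are ignored), ranks the three methods per task, and averages the ranks. The function name `average_ranks` is hypothetical, and the result covers only these five tasks, so it is not expected to match the 12-task averages reported earlier.

```python
# Mean AULC values copied from the selected-task table (standard deviations omitted).
aulc_means = {
    "medium / hopper":             {"MOMBO": 95.9, "MOBILE": 82.2, "MOPO": 37.0},
    "medium / walker2d":           {"MOMBO": 84.0, "MOBILE": 79.0, "MOPO": 77.6},
    "medium-replay / hopper":      {"MOMBO": 87.3, "MOBILE": 78.7, "MOPO": 81.7},
    "medium-expert / halfcheetah": {"MOMBO": 95.2, "MOBILE": 94.5, "MOPO": 77.1},
    "medium-expert / walker2d":    {"MOMBO": 98.9, "MOBILE": 94.3, "MOPO": 88.3},
}

def average_ranks(table):
    """Rank methods per task (1 = highest AULC) and average the ranks over tasks.

    Ties are not handled specially; the values used here contain none.
    """
    totals = {}
    for scores in table.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, method in enumerate(ordered, start=1):
            totals[method] = totals.get(method, 0) + rank
    return {method: total / len(table) for method, total in totals.items()}

for method, rank in sorted(average_ranks(aulc_means).items(), key=lambda kv: kv[1]):
    print(f"{method:7s} average AULC rank = {rank:.1f}")
```

On these five tasks MOMBO ranks first every time, and medium-replay hopper is the only task where MOPO places ahead of MOBILE.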
Conclusion
- Root cause identified: High MC variance in Bellman targets (not model quality) is the primary source of instability in model-based offline RL.
- Provably tighter guarantees: MOMBO’s deterministic suboptimality bound improves on probabilistic MC bounds; constants depend only on network architecture, not on reward scale or sample count.
- Fastest convergence: Best AULC ranking of 1.2 across all 12 D4RL settings; most striking on medium-hopper (AULC 95.9 vs 82.2 vs 37.0).
- Minimal overhead: Moment matching requires only two forward passes through the Q-network, with no additional parameters or MC rollouts.
- Practically relevant: Advantage is largest on medium-quality datasets (the norm in real applications), where MC variance is most destructive to learning stability.
References
- Akgül, A., Haußmann, M., & Kandemir, M. (2024). Deterministic Uncertainty Propagation for Improved Model-Based Offline Reinforcement Learning. NeurIPS 2024. arXiv:2406.04088
- Jin, Y., Yang, Z., & Wang, Z. (2021). Is Pessimism Provably Efficient for Offline RL? ICML 2021.
- Yu, T., Thomas, G., Yu, L., Ermon, S., Zou, J. Y., Levine, S., Finn, C., & Ma, T. (2020). MOPO: Model-based Offline Policy Optimization. NeurIPS 2020.
- Sun, Y., et al. (2023). Model-Bellman Inconsistency for Model-based Offline Reinforcement Learning. ICML 2023.
- Wu, A., Nowozin, S., Meeds, E., Turner, R. E., Hernández-Lobato, J. M., & Gaunt, A. L. (2019). Deterministic Variational Inference for Robust Bayesian Neural Networks. ICLR 2019.
- Fu, J., Kumar, A., Nachum, O., Tucker, G., & Levine, S. (2020). D4RL: Datasets for Deep Data-Driven Reinforcement Learning. arXiv:2004.07219.