MOMBO: Deterministic Uncertainty Propagation for Offline RL
TL;DR: Moment matching replaces Monte Carlo sampling in pessimistic offline RL, cutting suboptimality and accelerating convergence across D4RL benchmarks. Published at NeurIPS 2024.
Introduction
Offline reinforcement learning (learning policies from pre-collected datasets without environment interaction) is essential for high-stakes domains where real-world exploration is costly or dangerous (healthcare, robotics, autonomous driving). The core obstacle is distributional shift: value estimates for actions underrepresented in the dataset become inflated, with no corrective feedback. MOMBO (Moment Matching Offline Model-Based Policy Optimization) identifies the root cause of training instability in model-based offline RL: high-variance Bellman targets from Monte Carlo sampling. MOMBO fixes this with deterministic moment matching, yielding provably faster convergence.
Problem Statement
- Model-based offline RL methods (MOPO, MOBILE) apply Pessimistic Value Iteration (PEVI): penalize Q-value estimates by the learned dynamics model’s uncertainty to keep the policy conservative about unseen state-action pairs.
- All existing PEVI methods sample a single next state (N=1) from the Gaussian dynamics model and evaluate the Q-network on it. A single sample is cheap but injects high variance into every Bellman target.
- This high variance corrupts gradient updates, slows convergence, and forces larger penalty coefficients to compensate, making model-based offline RL often slower than model-free approaches despite having access to synthetic data.
- Theoretically: suboptimality scales as O(1/√N) in the number of MC samples. At N=1, the bound is at its weakest; at N=1 it is also undefined in the limit, revealing a fundamental limitation.
- Gap: No existing method propagates next-state uncertainty analytically through the Q-network, despite this being the direct source of training instability.
Methodology
MOMBO replaces Monte Carlo sampling with progressive moment matching: the Gaussian next-state distribution output by the learned dynamics model is propagated through the Q-network layer by layer, analytically tracking the mean and variance of hidden activations.
Pessimistic Bellman target (exact, no sampling):
\[\hat{\mathcal{B}}_\text{pess} = r + \gamma \mu_\text{MM} - \beta \gamma \sigma_\text{MM}\]
Implementation details:
- Linear layers: transform mean and variance analytically (exact Gaussian propagation)
- ReLU activations: compute the first two moments via the Gaussian CDF/PDF (closed-form)
- Result: a Normal distribution over Q-values at each next state, used directly to form the pessimistic target (mean − β × std)
- Requires only two forward passes; no additional parameters or rollouts
Theoretical improvement over MC-based PEVI:
| Method | Bound type | Key term |
|---|---|---|
| MC-based PEVI (N=1) | Probabilistic (holds w/ prob 1−δ) | Scales with R²_max/(1−γ)² |
| MOMBO | Deterministic (always holds) | Depends only on network activation constants G_l, C_l ≤ 1 |
MOMBO’s bound is strictly tighter: it holds without probability qualification and depends only on the network’s Lipschitz structure.
Results
Evaluated on the D4RL offline benchmark across 12 environment-dataset combinations: halfcheetah, hopper, and walker2d × random, medium, medium-replay, and medium-expert (4 seeds). Two metrics: Normalized Reward (final policy quality) and AULC (area under the learning curve, measuring convergence speed and stability).
MOMBO achieves the best average AULC ranking of 1.2 across all 12 settings:
| Dataset type | MOPO AULC rank | MOBILE AULC rank | MOMBO AULC rank |
|---|---|---|---|
| random | 2.7 | 2.0 | 1.3 |
| medium | 2.7 | 2.0 | 1.3 |
| medium-replay | 2.3 | 2.0 | 1.7 |
| medium-expert | 2.7 | 2.0 | 1.3 |
| Overall | 2.7 | 2.2 | 1.2 |
Rank 1 = best. Lower is better.
Selected AULC scores on the most practically relevant settings:
| Task | MOMBO | MOBILE | MOPO |
|---|---|---|---|
| medium — hopper | 95.9 ± 2.5 | 82.2 ± 7.3 | 37.0 ± 15.3 |
| medium — walker2d | 84.0 ± 1.1 | 79.0 ± 1.3 | 77.6 ± 1.3 |
| medium-replay — hopper | 87.3 ± 2.0 | 78.7 ± 4.0 | 81.7 ± 4.6 |
| medium-expert — halfcheetah | 95.2 ± 0.7 | 94.5 ± 1.8 | 77.1 ± 4.0 |
| medium-expert — walker2d | 98.9 ± 3.3 | 94.3 ± 0.9 | 88.3 ± 6.3 |
MOMBO’s advantage is largest on AULC rather than final reward, directly confirming the lower-variance Bellman target hypothesis. The medium-hopper gap (95.9 vs 82.2 vs 37.0) is the most striking: MOPO’s high variance under medium-quality data collapses entirely, while MOMBO stays stable.
Conclusion
- Root cause identified: High MC variance in Bellman targets (not model quality) is the primary source of instability in model-based offline RL.
- Provably tighter guarantees: MOMBO’s deterministic suboptimality bound improves on probabilistic MC bounds; constants depend only on network architecture, not on reward scale or sample count.
- Fastest convergence: Best AULC ranking of 1.2 across all 12 D4RL settings; most striking on medium-hopper (AULC 95.9 vs 82.2 vs 37.0).
- Minimal overhead: Moment matching requires only two forward passes through the Q-network, with no additional parameters or MC rollouts.
- Practically relevant: Advantage is largest on medium-quality datasets (the norm in real applications), where MC variance is most destructive to learning stability.
References
- Akgül, A., Haußmann, M., & Kandemir, M. (2024). Deterministic Uncertainty Propagation for Improved Model-Based Offline Reinforcement Learning. NeurIPS 2024. arXiv:2406.04088
- Jin, Y., Yang, Z., & Wang, Z. (2021). Is Pessimism Provably Efficient for Offline RL? ICML 2021.
- Yu, T., Thomas, G., Yu, L., Ermon, S., Zou, J. Y., Levine, S., Finn, C., & Ma, T. (2020). MOPO: Model-based Offline Policy Optimization. NeurIPS 2020.
- Sun, Y., et al. (2023). Model-Bellman Inconsistency for Model-based Offline Reinforcement Learning. ICML 2023.
- Wu, A., Nowozin, S., Meeds, E., Turner, R. E., Hernández-Lobato, J. M., & Gaunt, A. L. (2019). Deterministic Variational Inference for Robust Bayesian Neural Networks. ICLR 2019.
- Fu, J., Kumar, A., Nachum, O., Tucker, G., & Levine, S. (2020). D4RL: Datasets for Deep Data-Driven Reinforcement Learning. arXiv:2004.07219.