CDDP: Continual Learning of Multi-modal Dynamics
TL;DR: A neural episodic memory with a Dirichlet Process prior enables a dynamics model to continually learn new behavioral modes without forgetting old ones. Published at L4DC 2024.
Introduction
Real-world dynamical systems exhibit multiple distinct modes of behavior that can appear sequentially over time: a robot encountering different terrains, a vehicle under varying loads, or weather patterns shifting across climates. Learning these modes continually (without replaying old data) while avoiding catastrophic forgetting is an open challenge. CDDP (Continual Dynamic Dirichlet Process) addresses this with a Bayesian State-Space Model augmented by a neural episodic memory and a Dirichlet Process prior, enabling automatic mode discovery and zero-forgetting transfer across tasks.
This work was the second contribution of my Master’s thesis at Istanbul Technical University, published at L4DC 2024.
Problem Statement
- Bayesian State-Space Models (BSSMs) can fit a single dynamical mode well but are not designed for continual multi-modal learning.
- The standard continual learning fix, Variational Continual Learning (VCL), transfers posterior parameters from one task to the next as the new prior. For classification, this works; for dynamics, it fails because the shared parameter space cannot simultaneously represent modes with fundamentally different transition structures.
- VCL also requires knowing which mode is active at test time, a strong and often unrealistic assumption.
- Catastrophic forgetting: adapting to a new mode overwrites representations of earlier ones when only parameter transfer is used.
- Gap: No prior method handles continual learning of sequential tasks with unknown, multi-modal dynamics without explicit mode labels or per-task network heads.
Methodology
CDDP augments a BSSM with two key components: a neural episodic memory of mode descriptors, and a Dirichlet Process (DP) prior on attention weights.
Memory-gated transition kernel: Given a context window of observations $y_{1:C}$, an encoder $e_\lambda$ maps them to a query. Attention weights over the $R$ memory slots $m_1, \dots, m_R$ are computed as a softmax over the inner-product similarity between each slot and the query (equivalent to cosine similarity when both are normalized); the top-matching descriptor is retrieved and injected into the state transition kernel as an additional input, with no parameter transfer between tasks.
\[w_r(y_{1:C}, m_r) = \frac{e^{\langle m_r,\, e_\lambda(y_{1:C}) \rangle}}{\sum_{j=1}^{R} e^{\langle m_j,\, e_\lambda(y_{1:C}) \rangle}}\]
After observing a new task, memory is updated by a convex interpolation: high-similarity slots absorb the new mode; low-similarity slots are left largely unchanged, preserving old knowledge without parameter transfer.
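The attention and memory-update steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the update rule (each slot moves toward the query by a fraction equal to its attention weight) is one plausible reading of "convex interpolation", and the slot count, descriptor size, and query vector are hypothetical stand-ins for the encoder output.

```python
import numpy as np

def attention_weights(memory, query):
    """Softmax over inner products <m_r, e(y_{1:C})> for r = 1..R."""
    scores = memory @ query              # shape (R,)
    scores = scores - scores.max()       # stabilise the exponentials
    w = np.exp(scores)
    return w / w.sum()

def update_memory(memory, query, weights):
    """Assumed update: slot r moves toward the query by a fraction w_r."""
    w = weights[:, None]                 # shape (R, 1)
    return (1.0 - w) * memory + w * query[None, :]

# Hypothetical sizes: R = 4 slots, descriptor dimension D = 8.
R, D = 4, 8
memory = 2.0 * np.eye(R, D)              # four well-separated descriptors
query = memory[2] + 0.1                  # stand-in for an encoder output near slot 2

weights = attention_weights(memory, query)
retrieved = memory[np.argmax(weights)]   # top-matching descriptor
new_memory = update_memory(memory, query, weights)
```

In this toy setup almost all attention mass lands on slot 2, so that slot absorbs the query while the other three slots barely move.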
Dirichlet Process prior (automatic mode discovery): The mixture weight π follows a GEM (stick-breaking) distribution with concentration α₀. Small α₀ concentrates mass on a few slots; large α₀ spreads mass broadly. The model never needs to be told how many modes exist.
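The effect of the concentration parameter can be checked with a truncated stick-breaking sampler. The sketch below assumes a finite truncation (the last slot takes whatever mass is left) and uses hypothetical values for α₀ and the slot count; it counts how many of the largest weights are needed to cover 90% of the mass.

```python
import numpy as np

def stick_breaking(alpha0, num_slots, rng):
    """Truncated GEM(alpha0) draw: pi_r = v_r * prod_{j<r}(1 - v_j), v_r ~ Beta(1, alpha0)."""
    v = rng.beta(1.0, alpha0, size=num_slots)
    v[-1] = 1.0                                   # truncation: last stick takes the remainder
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

def slots_for_mass(pi, mass=0.9):
    """Number of largest weights needed to cover `mass` of the total."""
    cumulative = np.cumsum(np.sort(pi)[::-1])
    return int(np.searchsorted(cumulative, mass) + 1)

rng = np.random.default_rng(0)
num_slots = 200                                   # hypothetical truncation level
small = np.mean([slots_for_mass(stick_breaking(0.5, num_slots, rng)) for _ in range(200)])
large = np.mean([slots_for_mass(stick_breaking(20.0, num_slots, rng)) for _ in range(200)])
print(f"alpha0=0.5: {small:.1f} slots on average, alpha0=20: {large:.1f} slots on average")
```

A small α₀ needs only a handful of slots to reach 90% of the mass, while a large α₀ needs many more, which matches the qualitative description above.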
Training: the variational objective (ELBO) includes a KL term that aligns learned attention weights with the DP prior, encouraging sparse and interpretable mode assignments.
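To see why such a KL term favors sparse assignments, the following sketch compares two candidate attention vectors against the expected stick-breaking weights, $\mathbb{E}[\pi_r] = \frac{1}{1+\alpha_0}\left(\frac{\alpha_0}{1+\alpha_0}\right)^{r-1}$. Treating the prior as this fixed categorical distribution is a simplifying assumption made here for illustration; the exact form of the KL term in the paper's ELBO may differ.

```python
import numpy as np

def expected_gem_weights(alpha0, num_slots):
    """Expected stick-breaking weights, with the truncated tail folded into the last slot."""
    r = np.arange(num_slots)
    p = (1.0 / (1.0 + alpha0)) * (alpha0 / (1.0 + alpha0)) ** r
    p[-1] = 1.0 - p[:-1].sum()
    return p

def categorical_kl(q, p, eps=1e-12):
    """KL(q || p) between two categorical distributions."""
    q = np.clip(q, eps, 1.0)
    p = np.clip(p, eps, 1.0)
    return float(np.sum(q * (np.log(q) - np.log(p))))

num_slots = 8                                    # hypothetical slot count
prior = expected_gem_weights(0.5, num_slots)     # hypothetical concentration
sparse = np.eye(num_slots)[0]                    # all attention on one slot
uniform = np.full(num_slots, 1.0 / num_slots)    # attention spread evenly

kl_sparse = categorical_kl(sparse, prior)
kl_uniform = categorical_kl(uniform, prior)
```

With a small concentration, the sparse vector incurs a much lower penalty than the uniform one, so minimizing this term pushes attention toward a few slots.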
Results
CDDP is evaluated on 3 synthetic and 2 real-world multi-modal trajectory datasets, each structured as a continual learning sequence. The model sees tasks one at a time; no mode labels are given at test time.
Datasets
These datasets are not standard ML benchmarks; they are drawn from dynamical systems and human motion capture, each presenting a distinct type of multi-modal behavior:
Sine Waves: 1D oscillations $y_t = A\sin(2\pi f t)$ with 5 amplitude levels $A \in \{3,6,9,12,15\}$ and 3 frequency levels $f \in \{\tfrac{2}{3}, 1, \tfrac{4}{3}\}$, yielding 15 modes across 5 tasks. The simplest benchmark; modes differ in scale and oscillation rate.
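The 15 modes follow directly from the amplitude and frequency grids. The sketch below enumerates them; the time grid is an assumption (the text does not specify it), and the sequence length of 15 is taken from the dataset summary.

```python
import numpy as np
from itertools import product

amplitudes = [3, 6, 9, 12, 15]
frequencies = [2 / 3, 1, 4 / 3]
seq_len = 15                               # sequence length from the dataset summary
t = np.linspace(0.0, 1.0, seq_len)         # assumed time grid, not given in the text

# One noise-free trajectory per (amplitude, frequency) pair.
modes = {
    (A, f): A * np.sin(2 * np.pi * f * t)
    for A, f in product(amplitudes, frequencies)
}
print(len(modes))                          # 5 amplitudes x 3 frequencies
```

How the 15 modes are grouped into the 5 tasks is not stated here, so the sketch stops at mode generation.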
Lotka-Volterra: classic predator-prey ordinary differential equation (ODE):
\[\frac{dx_t}{dt} = \alpha x_t - \beta x_t y_t, \qquad \frac{dy_t}{dt} = \delta x_t y_t - \gamma y_t\]
Eight modes generated by varying the biological parameters $(\alpha, \beta, \gamma, \delta)$ across 4 tasks. Sequence length 25, step size Δt = 0.4. Each mode produces qualitatively different oscillatory dynamics between prey (x) and predator (y) populations.
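A single mode of this kind can be simulated with a standard fourth-order Runge-Kutta integrator. The parameter values and initial populations below are hypothetical, and using Δt = 0.4 as the integration step (rather than only as a sampling interval) is an assumption; the integrator the authors used is not stated here.

```python
import numpy as np

def lotka_volterra(state, alpha, beta, gamma, delta):
    """Right-hand side of the predator-prey ODE (x: prey, y: predator)."""
    x, y = state
    return np.array([alpha * x - beta * x * y, delta * x * y - gamma * y])

def rk4_step(f, state, dt, **params):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state, **params)
    k2 = f(state + 0.5 * dt * k1, **params)
    k3 = f(state + 0.5 * dt * k2, **params)
    k4 = f(state + dt * k3, **params)
    return state + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def simulate(params, init, seq_len=25, dt=0.4):
    """Return a (seq_len, 2) array of prey and predator populations."""
    traj = [np.asarray(init, dtype=float)]
    for _ in range(seq_len - 1):
        traj.append(rk4_step(lotka_volterra, traj[-1], dt, **params))
    return np.stack(traj)

# Hypothetical parameter set and initial populations for one mode.
mode_params = dict(alpha=1.0, beta=0.5, gamma=0.6, delta=0.2)
trajectory = simulate(mode_params, init=(2.0, 1.0))
```

Changing any of the four parameters shifts the equilibrium and the oscillation period, which is what distinguishes one mode from another.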
Lorenz Attractor: chaotic 3D system with sensitive dependence on initial conditions:
\[\frac{dx_t}{dt} = \sigma(y_t - x_t), \qquad \frac{dy_t}{dt} = x_t(\rho - z_t) - y_t, \qquad \frac{dz_t}{dt} = x_t y_t - \beta z_t\]
Twelve modes from different parameter triples $(\sigma, \rho, \beta)$ across 4 tasks. Sequence length 50, step size Δt = 0.01. This is the hardest synthetic benchmark: neighboring trajectories diverge exponentially, making mode identification from a short context window especially challenging.
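The divergence of neighboring trajectories is easy to reproduce numerically. The sketch below integrates two copies of the system whose initial states differ by 1e-6 and tracks their separation. It uses the classic chaotic parameter triple (σ = 10, ρ = 28, β = 8/3), which is an assumption and not necessarily one of the twelve modes, and it runs far longer than a 50-step sequence so that the growth is visible.

```python
import numpy as np

def lorenz(state, sigma, rho, beta):
    """Right-hand side of the Lorenz system."""
    x, y, z = state
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def integrate(init, params, steps, dt=0.01):
    """Classical RK4 integration; returns an array of shape (steps, 3)."""
    traj = np.empty((steps, 3))
    traj[0] = init
    for i in range(1, steps):
        s = traj[i - 1]
        k1 = lorenz(s, **params)
        k2 = lorenz(s + 0.5 * dt * k1, **params)
        k3 = lorenz(s + 0.5 * dt * k2, **params)
        k4 = lorenz(s + dt * k3, **params)
        traj[i] = s + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return traj

# Classic chaotic setting, used here only for illustration.
params = dict(sigma=10.0, rho=28.0, beta=8.0 / 3.0)
a = integrate(np.array([1.0, 1.0, 1.0]), params, steps=3000)
b = integrate(np.array([1.0, 1.0, 1.0 + 1e-6]), params, steps=3000)

separation = np.linalg.norm(a - b, axis=1)   # distance between the two runs over time
sequence = a[:50]                            # one dataset-length window
```

Over a single 50-step window the two runs are still nearly identical, which illustrates the flip side of the difficulty: a short context carries limited information about the long-run behavior of a mode.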
Libras Movement: 2D hand-movement trajectories from 15 classes of Brazilian Sign Language (LIBRAS), extracted from video with 45 frames per sequence. 5 tasks, 15 modes, 180 train / 180 test sequences. Modes correspond to distinct sign gestures with different spatial extents and trajectories.
Character Trajectories: 3-attribute stylus-pen trajectories (x position, y position, pen tip force) for 20 English characters, subsampled to length 109. 5 tasks, 20 modes, 1422 train / 1436 test sequences. The most challenging real-world dataset: 20 distinct character shapes with shared stroke primitives require fine-grained mode discrimination.
Dataset summary:
| Type | Dataset | Tasks | Modes | Seq. Length | Attributes |
|---|---|---|---|---|---|
| Synthetic | Sine Waves | 5 | 15 | 15 | 1 |
| Synthetic | Lotka-Volterra | 4 | 8 | 25 | 2 |
| Synthetic | Lorenz Attractor | 4 | 12 | 50 | 3 |
| Real-world | Libras | 5 | 15 | 45 | 2 |
| Real-world | Character Trajectories | 5 | 20 | 109 | 3 |
Quantitative Results
Metrics: area under the curve (AUC) of NMSE and of NLL, each measured as a function of the number of tasks learned and averaged over 10 repetitions. Lower is better for both.
- NMSE (Normalized MSE): prediction error relative to signal magnitude
- NLL (Negative Log-Likelihood): calibration quality of the predictive distribution
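The two metrics and their aggregation over tasks can be sketched as follows. The conventions are assumptions: NMSE is normalized by signal energy, NLL assumes an independent Gaussian predictive distribution, and the AUC uses the trapezoidal rule with unit spacing between tasks. The exact definitions used for the reported numbers may differ.

```python
import numpy as np

def nmse(y_true, y_pred):
    """Squared error normalised by signal energy (one common convention)."""
    return float(np.sum((y_true - y_pred) ** 2) / np.sum(y_true ** 2))

def gaussian_nll(y_true, mean, var):
    """Average negative log-likelihood under an independent Gaussian predictive."""
    return float(np.mean(0.5 * (np.log(2 * np.pi * var) + (y_true - mean) ** 2 / var)))

def auc_over_tasks(values):
    """Trapezoidal area under a metric-vs-tasks-learned curve, unit task spacing."""
    v = np.asarray(values, dtype=float)
    return float(np.sum(0.5 * (v[1:] + v[:-1])))

# Toy example: a metric recorded after each of three tasks.
y = np.array([1.0, -2.0, 3.0])
per_task_nmse = [nmse(y, 0.9 * y), nmse(y, 0.8 * y), nmse(y, 0.7 * y)]
auc_nmse = auc_over_tasks(per_task_nmse)
```

A model that forgets earlier modes shows a rising error curve as tasks accumulate, so a lower AUC reflects both accuracy and retention.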
Main results — AUC NMSE ↓ and AUC NLL ↓ (mean ± SE, 10 seeds). Lower is better.
| Dataset | VCL-BSSM NMSE | CDDP NMSE | VCL-BSSM NLL | CDDP NLL |
|---|---|---|---|---|
| Sine Waves | 1.00 ± 0.04 | 0.91 ± 0.03 | 3.57 ± 0.09 | 3.50 ± 0.09 |
| Lotka-Volterra | 0.58 ± 0.04 | 0.60 ± 0.06 | 1.50 ± 0.05 | 1.32 ± 0.08 |
| Lorenz Attractor | 0.26 ± 0.00 | 0.24 ± 0.01 | 4.42 ± 0.04 | 4.35 ± 0.06 |
| Libras | 0.14 ± 0.00 | 0.14 ± 0.00 | -0.37 ± 0.02 | -0.39 ± 0.04 |
| Character Trajectories | 0.87 ± 0.04 | 0.64 ± 0.01 | 0.14 ± 0.02 | -0.19 ± 0.03 |
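On NMSE, CDDP is better on 3 of 5 datasets, tied on Libras at the reported precision, and marginally worse on Lotka-Volterra (0.60 vs. 0.58, within one standard error); on NLL it is better on all 5. The largest gain is on Character Trajectories: NMSE falls by 26% (0.87 to 0.64) and NLL drops from 0.14 to −0.19 (a better-fitting predictive distribution).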
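The headline figures can be recomputed from the table means. The sketch below copies the reported values and derives the relative NMSE change and the absolute NLL difference per dataset; NLL is compared by difference because it can be negative, which makes a ratio hard to interpret.

```python
# Reported means: (VCL-BSSM NMSE, CDDP NMSE, VCL-BSSM NLL, CDDP NLL)
results = {
    "Sine Waves": (1.00, 0.91, 3.57, 3.50),
    "Lotka-Volterra": (0.58, 0.60, 1.50, 1.32),
    "Lorenz Attractor": (0.26, 0.24, 4.42, 4.35),
    "Libras": (0.14, 0.14, -0.37, -0.39),
    "Character Trajectories": (0.87, 0.64, 0.14, -0.19),
}

# Relative NMSE change in percent (negative means CDDP is better).
nmse_change_pct = {
    name: 100.0 * (cddp - vcl) / vcl
    for name, (vcl, cddp, _, _) in results.items()
}

# Absolute NLL difference (negative means CDDP is better).
nll_diff = {name: cddp - vcl for name, (_, _, vcl, cddp) in results.items()}

for name in results:
    print(f"{name}: NMSE {nmse_change_pct[name]:+.0f}%, NLL {nll_diff[name]:+.2f}")
```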
Ablation on Sine Waves confirms that both learned memory content and the absence of parameter transfer are necessary for best performance. Fixed-initialization variants degrade monotonically with initialization magnitude; adding parameter transfer to CDDP does not help and slightly hurts NLL.
Conclusion
- First study on continual learning of multi-modal dynamical systems, introducing both the problem formulation and the associated continual learning risk objective.
- VCL-BSSM introduced as a strong parameter-transfer baseline for practitioners adapting continual classification methods to dynamics.
- Memory beats parameter transfer: CDDP matches or outperforms VCL-BSSM on 4/5 datasets on NMSE (with a tie on Libras) and outperforms it on 5/5 on NLL; memory preserves structure that shared parameters cannot represent simultaneously.
- No mode labels required: the DP prior discovers the number of active modes automatically from data.
- Broad applicability: the framework could be applied to weather forecasting (features transferred across climates), autonomous driving (adapting across countries), and model-based RL (handling environment changes from agent actions or external factors).
References
- Akgül, A., Unal, G., & Kandemir, M. (2024). Continual Learning of Multi-modal Dynamics with External Memory. Proceedings of the 6th Annual Learning for Dynamics and Control Conference (L4DC 2024). arXiv:2203.00936
- Nguyen, C. V., Li, Y., Bui, T. D., & Turner, R. E. (2018). Variational Continual Learning. ICLR 2018.
- Rangapuram, S. S., et al. (2018). Deep State Space Models for Time Series Forecasting. NeurIPS 2018.
- Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica.
- Graves, A., Wayne, G., & Danihelka, I. (2014). Neural Turing Machines. arXiv:1410.5401.
- Kirkpatrick, J., et al. (2017). Overcoming catastrophic forgetting in neural networks. PNAS.