iS-QL: Bridging Target-free and Target-based Reinforcement Learning
TL;DR: Parameter sharing between online and target networks (keeping only the final linear layer separate) closes the stability gap of target-free RL while halving memory, with gains across Atari, DMC, and language. Published at ICLR 2026.
Introduction
Deep reinforcement learning algorithms rely on accurate value estimates to learn good policies. A standard stabilization trick, the target network, maintains a delayed copy of the value network to compute training targets, preventing the moving-target instability that arises from bootstrapping with a rapidly changing network. Yet target networks come at a cost: they double the memory dedicated to Q-networks, which directly limits how large a network fits on a given GPU. Practitioners therefore face a binary choice between stability and memory. iS-QL (iterated Shared Q-Learning) sidesteps this trade-off by sharing all network parameters except the final output layer between the online and target sides, delivering target-based stability at near target-free memory cost.
In this collaborative work (4th author), I designed and conducted the offline language model experiments — specifically the evaluation of iS-ILQL on the Wordle task using a GPT-2 backbone.
Problem Statement
- Target networks (Mnih et al., 2015) stabilize training by decoupling the regression target from the changing online network. They are critical for large architectures and have been shown to matter even for methods originally designed without them.
- The cost is doubled memory footprint for Q-networks, limiting usable network size on constrained hardware (edge devices, high-dimensional inputs, mixture-of-experts critics).
- Target-free methods avoid the extra memory but suffer severe performance drops: a 10–60% AUC gap relative to their target-based counterparts in standard benchmarks.
- Gap: No prior work escapes this binary choice. The question is whether a hybrid architecture can achieve target-based stability with target-free memory usage.
Methodology
iS-QL uses a single Q-network with K+1 linear output heads, sharing all parameters in the backbone (convolutional or MLP body) while keeping the heads lightweight and separate.
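The layout described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the class name `SharedHeadQNetwork`, the single-layer ReLU backbone, and the layer sizes are all assumptions chosen to keep the example small.

```python
import random


def make_linear(n_in, n_out, rng):
    """Create weights (n_out x n_in) and zero biases for a small dense layer."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return w, b


def linear(layer, x):
    """Apply a dense layer to a single input vector."""
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(x):
    return [max(0.0, v) for v in x]


class SharedHeadQNetwork:
    """One shared backbone (omega) plus K+1 linear heads (omega_0 .. omega_K).

    Hypothetical toy model: a real agent would use a CNN or deeper MLP body.
    """

    def __init__(self, obs_dim, hidden_dim, n_actions, K, seed=0):
        rng = random.Random(seed)
        self.backbone = make_linear(obs_dim, hidden_dim, rng)
        self.heads = [make_linear(hidden_dim, n_actions, rng) for _ in range(K + 1)]

    def features(self, obs):
        """Shared representation, computed once per observation."""
        return relu(linear(self.backbone, obs))

    def q_values(self, obs):
        """Return K+1 lists of Q-values from a single backbone pass."""
        phi = self.features(obs)
        return [linear(head, phi) for head in self.heads]


net = SharedHeadQNetwork(obs_dim=4, hidden_dim=8, n_actions=3, K=2)
all_q = net.q_values([0.1, -0.2, 0.3, 0.0])
print(len(all_q), len(all_q[0]))  # number of heads, number of actions
```

Only the heads are duplicated, so the extra memory per head is one `hidden_dim × n_actions` matrix plus a bias vector, regardless of how large the backbone is.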
Architecture (Figure 1):
Key ideas:
- Let ω denote the shared backbone parameters and ω₀, ω₁, …, ω_K the K+1 head parameters. Define θ_k = (ω, ω_k).
- Head ω₀ is never updated by gradient descent; it plays the role of the target network.
- The training loss sums K temporal-difference objectives in a chain, where head k−1 provides the regression target for head k (with a stop-gradient applied to the target).
- Every T steps, heads are cyclically shifted: ω_k ← ω_{k+1} for k = 0, …, K−1. This propagates learned values backward through the chain and refreshes the frozen head ω₀ with a recent snapshot of ω₁, just as in DQN’s hard target update, but only for a tiny linear layer.
- Learning K consecutive Bellman iterations in parallel improves sample efficiency beyond simply sharing the backbone.
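The chained loss and the head shift can be sketched as follows. This is a hedged illustration under stated assumptions: each term is taken to be a squared one-step TD error of the form (r + γ max_a′ Q_{θ_{k−1}}(s′, a′) − Q_{θ_k}(s, a))², which is the usual Q-learning form but is not spelled out in the text above, and the function names `chained_td_loss` and `shift_heads` are hypothetical. The head update follows the stated rule for k = 0, …, K−1; the sketch leaves the last head ω_K unchanged, since the rule does not mention it.

```python
import copy


def chained_td_loss(q_sa, q_next, reward, gamma=0.99, done=False):
    """Sum of K squared TD errors for one transition.

    q_sa[k]   : Q-value of head k at (s, a), for k = 0..K
    q_next[k] : list of Q-values of head k at s', for k = 0..K
    Head k-1 supplies the target for head k. The target is a plain number
    here; in an autodiff framework it would need an explicit stop-gradient.
    """
    K = len(q_sa) - 1
    loss = 0.0
    for k in range(1, K + 1):
        bootstrap = 0.0 if done else gamma * max(q_next[k - 1])
        target = reward + bootstrap
        loss += (target - q_sa[k]) ** 2
    return loss


def shift_heads(heads):
    """Apply omega_k <- omega_{k+1} for k = 0..K-1 (last head is left as is)."""
    K = len(heads) - 1
    for k in range(K):
        heads[k] = copy.deepcopy(heads[k + 1])
    return heads


# Toy example with K = 2 (three heads) and two actions.
q_sa = [0.0, 1.0, 2.0]
q_next = [[0.5, 1.0], [1.5, 0.2], [9.0, 9.0]]
loss = chained_td_loss(q_sa, q_next, reward=1.0, gamma=0.5)
print(loss)

heads = shift_heads([[0.0], [1.0], [2.0]])
print(heads)
```

Note that head 0 never appears on the prediction side of the loss, which matches its role as the frozen target; it only changes when `shift_heads` is called.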
Why it works (three mechanisms analysed in the paper):
| Mechanism | Target-free | iS-QL K=1 | Target-based |
|---|---|---|---|
| Gradient alignment with target-based | low | high | — |
| Target churn (instability of regression targets) | high | intermediate | zero |
| Feature srank (representational capacity) | low | higher | moderate |
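The three diagnostics in the table can be probed with simple measurements. The sketch below gives one plausible form of each; the function names and the exact definitions (cosine similarity for gradient alignment, mean absolute change of the targets for churn, and a δ-thresholded effective rank computed from singular values for srank) are assumptions for illustration and may differ from the definitions used in the paper.

```python
import math


def cosine_similarity(g1, g2):
    """Alignment between two flattened gradient vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    return dot / (n1 * n2)


def target_churn(targets_before, targets_after):
    """Mean absolute change of regression targets caused by one update."""
    diffs = [abs(a - b) for a, b in zip(targets_before, targets_after)]
    return sum(diffs) / len(diffs)


def srank(singular_values, delta=0.01):
    """Smallest k whose top-k singular values hold a (1 - delta) share of the total."""
    svals = sorted(singular_values, reverse=True)
    total = sum(svals)
    running = 0.0
    for k, s in enumerate(svals, start=1):
        running += s
        if running / total >= 1.0 - delta:
            return k
    return len(svals)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal gradients
print(target_churn([1.0, 2.0], [1.0, 2.0]))       # frozen targets do not move
print(srank([10.0, 1.0, 0.001]))
```

Under these definitions, a fully frozen target network gives zero churn by construction, which is consistent with the "zero" entry in the target-based column.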
Variants evaluated:
- iS-DQN — discrete online RL (Atari)
- iS-CQL — discrete offline RL (Atari)
- iS-SAC — continuous online RL (DeepMind Control Suite)
- iS-ILQL — offline language RL (Wordle, GPT-2 small backbone)
- iS-Stream Q(λ) — streaming RL (no replay buffer, no batch updates)
Results
All AUC scores are normalized by the target-based approach (= 100); higher is better. Results use IQM with 95% stratified bootstrap intervals.
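As a rough illustration of these two aggregation steps, the sketch below computes a normalized AUC and an interquartile mean (IQM). It assumes the AUC is the sum of a learning curve's points and that the IQM discards the lowest and highest 25% of scores before averaging; the helper names are hypothetical, and the stratified bootstrap for the confidence intervals is omitted.

```python
def auc(curve):
    """Area under a learning curve, approximated by the sum of its points."""
    return sum(curve)


def normalized_auc(curve, reference_curve):
    """AUC expressed relative to a reference (target-based) curve, where 100 = parity."""
    return 100.0 * auc(curve) / auc(reference_curve)


def iqm(scores):
    """Interquartile mean: average of the middle 50% of the sorted scores."""
    xs = sorted(scores)
    cut = int(0.25 * len(xs))
    middle = xs[cut:len(xs) - cut]
    return sum(middle) / len(middle)


score = normalized_auc([1.0, 2.0, 3.0], [2.0, 2.0, 4.0])
print(score)  # 75.0

middle_mean = iqm([1, 2, 3, 4, 5, 6, 7, 8])
print(middle_mean)  # 4.5
```

The IQM is less sensitive to a single outlier run than the mean, while using more of the data than the median.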
Online Discrete Control — Atari
Evaluated on 15 Atari games with CNN+LayerNorm:
| Method | Normalized AUC | Parameters vs target-based |
|---|---|---|
| TF-DQN (target-free) | 90% | ~50% |
| TB-DQN (target-based) | 100% | 100% |
| iS-DQN K=9 | 106% | ~50% |
iS-DQN K=9 outperforms the target-based approach by 6% while using approximately half its parameters. Without LayerNorm, where target-free suffers a 60% performance drop, iS-DQN K=1 already cuts this gap to 18% while storing only one extra lightweight linear head. Results on the IMPALA architecture confirm the trend: iS-DQN fully closes the performance gap as K increases.
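The "approximately half" figure follows from counting what is duplicated. The sketch below compares the Q-network parameter count of a target-based agent (two full copies) with a shared-backbone agent holding K+1 heads. The backbone and head sizes are hypothetical round numbers, not the sizes used in the experiments.

```python
def q_param_counts(backbone, head, K):
    """Return (target_based, shared) Q-network parameter counts.

    target_based: online and target networks, each a backbone plus one head.
    shared:       one backbone plus K+1 linear heads.
    """
    target_based = 2 * (backbone + head)
    shared = backbone + (K + 1) * head
    return target_based, shared


# Hypothetical sizes: a 1M-parameter backbone and a 10k-parameter linear head.
tb, shared = q_param_counts(backbone=1_000_000, head=10_000, K=9)
ratio = shared / tb
print(tb, shared, round(ratio, 3))
```

The ratio approaches 50% as the backbone grows relative to the heads, which is why the saving is largest for big architectures.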
Offline Discrete Control — Atari
Evaluated on 10 Atari games with IMPALA+LayerNorm and CQL loss (10% of DQN dataset):
| Method | Performance gap vs target-based |
|---|---|
| TF-CQL (target-free) | −26% |
| iS-CQL K=9 | −6% |
iS-CQL shrinks the offline performance gap from 26% to 6%.
Online Continuous Control — DeepMind Control Suite
Evaluated on 7 hard DMC tasks with SAC+SimbaV2+BatchNorm:
- iS-SAC K=1 fully recovers the performance drop of the target-free approach.
- iS-SAC reduces the total parameter count by 49% (SimbaV2 uses a large critic; only the linear head is duplicated).
Offline Language Modeling — Wordle
Evaluated with Implicit Language Q-Learning (ILQL) on the Wordle word-guessing game using GPT-2 small (264M parameters total):
| Method | Normalized AUC | Total parameters |
|---|---|---|
| TF-ILQL | ≈ TB-ILQL | 176M (−88M vs TB) |
| TB-ILQL | 100% | 264M |
| iS-ILQL K=9 | > 105% | 176M (264M − 88M) |
iS-ILQL K=9 outperforms the target-based approach by more than 5% and saves 88 million parameters (33% RAM reduction). Because both the online and target embeddings share a single forward pass, iS-ILQL also trains faster than TB-ILQL.
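The parameter figures above can be checked with a line of arithmetic. The sketch assumes that the parameter count is a direct proxy for RAM, which is how the 33% figure is read here.

```python
tb_params_m = 264   # TB-ILQL total, in millions of parameters
saved_m = 88        # parameters saved by sharing, in millions

is_params_m = tb_params_m - saved_m
saving_fraction = saved_m / tb_params_m

print(is_params_m, round(100 * saving_fraction))  # 176 33
```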
Streaming RL — Atari (no replay buffer)
Applied to Stream Q(λ) (Elsayed et al., 2024) on 7 Atari games without replay buffer or batch updates:
- iS-Stream Q(λ) K=3 improves over the target-free baseline by more than 10% in AUC, matching or outperforming the target-based reference on 6 out of 7 games.
Conclusion
- Simple modification, broad impact: Sharing all parameters except the final linear head reduces memory to near target-free levels while restoring target-based stability across five distinct RL settings.
- Iterated Bellman updates amplify the gain: Learning K consecutive Bellman updates in parallel with the shared backbone significantly narrows, and in some settings eliminates, the performance gap with target-based methods.
- Scalable to large architectures: The 49% total parameter reduction on SimbaV2 and 33% RAM saving on GPT-2 confirm practical value for memory-constrained hardware.
- Analysis-backed: Gradient alignment, target churn, and srank measurements all confirm that iS-QL’s learning dynamics are systematically closer to target-based than target-free, explaining the empirical gains.
- Orthogonal to existing regularization: iS-QL combines additively with LayerNorm, BatchNorm, MellowMax, and other target-free stabilizers; the gains are complementary.
References
- Vincent, T., Tripathi, Y., Faust, T., Akgül, A., Oren, Y., Kandemir, M., Peters, J., & D’Eramo, C. (2026). Bridging the Performance-gap between Target-free and Target-based Reinforcement Learning. Fourteenth International Conference on Learning Representations (ICLR 2026).
- Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518.
- Vincent, T., et al. (2025). Iterated Q-Network: Beyond One-Step Bellman Updates in Deep Reinforcement Learning. arXiv:2403.02107.
- Gallici, M., et al. (2025). Simplifying Deep Temporal Difference Learning. ICLR 2025.
- Bhatt, A., et al. (2024). CrossQ: Batch Normalization in Deep Reinforcement Learning for Greater Sample Efficiency and Simplicity. ICLR 2024.
- Snell, C., et al. (2023). Offline RL for Natural Language Generation with Implicit Language Q Learning. ICLR 2023.
- Elsayed, M., et al. (2024). Streaming Deep Reinforcement Learning Finally Works. arXiv:2410.10939.
- Lee, H., et al. (2025). Hyperspherical Normalization for Scalable Deep Reinforcement Learning. arXiv:2502.15280.