iS-QL: Bridging Target-free and Target-based Reinforcement Learning

Introduction

Deep RL has so far offered a binary choice between target-based and target-free training: target networks stabilize training but double Q-network memory, while target-free methods cut memory at the cost of a 10–60% performance drop on standard benchmarks. No prior work escaped this binary. iS-QL (iterated Shared Q-Learning) resolves it by sharing all network parameters except the final linear output layer between the online and target sides, delivering target-based stability at near target-free memory cost across five distinct RL settings.

In this collaborative work (4th author), I designed and conducted the offline language model experiments, specifically the evaluation of iS-ILQL on the Wordle task using a GPT-2 backbone.

Problem Statement

Target networks (Mnih et al., 2015) stabilize training by decoupling the regression target from the changing online network. They are critical for large architectures and have been shown to matter even for methods originally designed without them.
The cost is a doubled memory footprint for Q-networks, which limits the usable network size on constrained hardware (edge devices, high-dimensional inputs, mixture-of-experts critics).
Target-free methods avoid the extra memory but suffer severe performance drops: a 10–60% AUC gap relative to their target-based counterparts on standard benchmarks.
Gap: No prior work escapes this binary choice. The question is whether a hybrid architecture can achieve target-based stability with target-free memory usage.

Methodology

iS-QL uses a single Q-network with K+1 linear output heads, sharing all parameters in the backbone (convolutional or MLP body) while keeping the heads lightweight and separate.

Architecture (Figure 1):

Figure 1. Conceptual comparison of target-based, target-free, shared features, and iterated shared features (iS-QL). In the shared-features variant, only the last linear layer is duplicated as the target; the backbone is shared with the live online network. iS-QL extends this with K+1 heads forming a chain of consecutive Bellman iterations; each head is trained to approximate the Bellman update of the previous one. From Vincent et al., ICLR 2026.

Key ideas:

Let ω denote the shared backbone parameters and ω₀, ω₁, …, ω_K the K+1 head parameters. Define θ_k = (ω, ω_k).
Head ω₀ is never updated by gradient descent; it plays the role of the target network.
The training loss sums K temporal-difference objectives in a chain:

\[\mathcal{L}^{\text{iS-QL}}(\theta) = \sum_{k=1}^{K} \mathcal{L}^{\text{TD}}(\theta_k,\, \theta_{k-1})\]

where head k−1 provides regression targets for head k (stop-gradient applied).
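The chained loss can be illustrated with a minimal sketch in plain Python. It assumes a DQN-style target (reward plus discounted maximum next-state value) and precomputed shared features; the names `q_values`, `is_ql_loss`, and `GAMMA`, as well as all numbers, are hypothetical and not taken from the paper's code.

```python
# Illustrative sketch only: tiny linear heads on top of precomputed shared features.
# All names and values are hypothetical, not from the paper's implementation.

GAMMA = 0.99  # discount factor (assumed value)


def q_values(features, head):
    """Apply one linear head: head[a] is the weight vector for action a."""
    return [sum(w * f for w, f in zip(weights, features)) for weights in head]


def is_ql_loss(heads, feats, action, reward, next_feats, done):
    """Sum of K squared TD errors for a single transition.

    heads[k - 1] builds the regression target for heads[k]. The target is a
    plain number here; in a real implementation it would be wrapped in a
    stop-gradient so that only head k and the backbone receive gradients.
    """
    total = 0.0
    for k in range(1, len(heads)):
        next_q = q_values(next_feats, heads[k - 1])      # target side: head k-1
        target = reward + (0.0 if done else GAMMA * max(next_q))
        prediction = q_values(feats, heads[k])[action]   # online side: head k
        total += (prediction - target) ** 2
    return total


# K = 2, two actions, two features per state.
heads = [
    [[1.0, 0.0], [0.0, 1.0]],  # omega_0: frozen head, never trained
    [[0.5, 0.0], [0.0, 0.5]],  # omega_1
    [[0.0, 0.0], [0.0, 0.0]],  # omega_2
]
loss = is_ql_loss(heads, feats=[1.0, 2.0], action=0, reward=1.0,
                  next_feats=[2.0, 1.0], done=True)
print(loss)  # 0.25 (head 1) + 1.0 (head 2) = 1.25
```

Note that the loop starts at k = 1, so head ω₀ only ever appears on the target side, which matches its role as the frozen head.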

Every T steps, the heads are shifted by one position: ω_k ← ω_{k+1} for k = 0, …, K−1 (under this rule, ω_K keeps its current value). This propagates learned values backward through the chain and refreshes the frozen head ω₀ with a recent snapshot of ω₁, just as in DQN’s hard target update, but only for a tiny linear layer.
Learning K consecutive Bellman iterations in parallel improves sample efficiency beyond simply sharing the backbone.
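The periodic head shift described above could be sketched as follows. This is a toy illustration in which each head is a flat list of weights; `shift_heads`, the period `T = 4`, and the weight values are hypothetical choices made for the example.

```python
def shift_heads(heads):
    """Return the heads after omega_k <- omega_{k+1} for k = 0..K-1.

    The last head has no successor, so it keeps its current weights.
    Copies are returned so that no two heads alias the same list.
    """
    return [list(h) for h in heads[1:]] + [list(heads[-1])]


T = 4  # shift period in gradient steps (assumed value)
heads = [[0.0], [1.0], [2.0], [3.0]]  # K = 3, one weight per head

for step in range(1, 9):
    # A gradient update of heads 1..K and the shared backbone would go here.
    if step % T == 0:
        heads = shift_heads(heads)

print(heads)  # two shifts were applied: [[2.0], [3.0], [3.0], [3.0]]
```

Only the small head weights are copied during the shift; the shared backbone is left untouched, which is why the update stays cheap compared with copying a full target network.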

Why it works (three mechanisms analysed in the paper):

Mechanism	Target-free	iS-QL K=1	Target-based
Gradient alignment with target-based	low	high	—
Target churn (instability of regression targets)	high	intermediate	zero
Feature srank (representational capacity)	low	higher	moderate
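For readers unfamiliar with srank, the sketch below shows how the quantity is commonly computed in the deep RL literature: the smallest number of singular values of the feature matrix that together account for a (1 − δ) share of their sum. The threshold `delta=0.01` and the example singular values are assumptions; the text above does not state the exact setting used in the paper.

```python
def srank(singular_values, delta=0.01):
    """Smallest k whose top-k singular values hold a (1 - delta) share of the total."""
    values = sorted(singular_values, reverse=True)
    total = sum(values)
    running = 0.0
    for k, value in enumerate(values, start=1):
        running += value
        if running >= (1.0 - delta) * total:
            return k
    return len(values)


# Hypothetical singular values of a feature matrix (made up for illustration).
spread_out = [10.0, 5.0, 1.0, 0.01, 0.001]
collapsed = [10.0, 0.05, 0.01, 0.001, 0.001]

print(srank(spread_out))  # 3: three directions carry almost all of the mass
print(srank(collapsed))   # 1: the features have collapsed onto one direction
```

A higher srank means the learned features span more directions, which is the sense in which the table lists it as a proxy for representational capacity.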

Variants evaluated:

iS-DQN — discrete online RL (Atari)
iS-CQL — discrete offline RL (Atari)
iS-SAC — continuous online RL (DeepMind Control Suite)
iS-ILQL — offline language RL (Wordle, GPT-2 small backbone)
iS-Stream Q(λ) — streaming RL (no replay buffer, no batch updates)

Results

All AUC scores are normalized by the target-based approach (= 100); higher is better. Results are aggregated with the interquartile mean (IQM) and reported with 95% stratified bootstrap confidence intervals.
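As a rough sketch of how such numbers are derived, the snippet below normalizes an AUC by the target-based AUC and aggregates scores with an interquartile mean. It assumes the AUC is approximated by the sum of evaluation scores along a learning curve; the function names and all values are hypothetical, and the bootstrap confidence intervals are omitted.

```python
def auc(curve):
    """Area under a learning curve, approximated by the sum of its scores."""
    return sum(curve)


def normalized_auc(curve, target_based_curve):
    """AUC expressed relative to the target-based AUC (target-based = 100)."""
    return 100.0 * auc(curve) / auc(target_based_curve)


def iqm(scores):
    """Interquartile mean: average after dropping the lowest and highest 25%."""
    ordered = sorted(scores)
    cut = len(ordered) // 4  # simple truncation when len is not divisible by 4
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


# Made-up learning curves for one game.
print(normalized_auc([1.0, 2.0, 3.0], [1.0, 2.0, 2.0]))  # 120.0

# Made-up normalized AUCs over eight runs; the two lowest and two highest are dropped.
print(iqm([60, 80, 95, 100, 105, 110, 120, 140]))  # 102.5
```

The IQM ignores the extreme runs on both ends, which makes the aggregate less sensitive to outlier games or seeds than a plain mean.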

Online Discrete Control — Atari

Evaluated on 15 Atari games with CNN+LayerNorm:

Method	Normalized AUC	Parameters vs target-based
TF-DQN (target-free)	90%	~50%
TB-DQN (target-based)	100%	100%
iS-DQN K=9	106%	~50%

iS-DQN K=9 outperforms the target-based approach by 6% while using approximately half its parameters. Without LayerNorm, where the target-free approach suffers a 60% performance drop, iS-DQN K=1 already cuts this gap to 18% while storing only one additional lightweight linear head. Results on the IMPALA architecture confirm the trend: iS-DQN fully closes the performance gap as K increases.
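The "approximately half" figure follows from simple parameter bookkeeping, sketched below. The backbone size and the head size (512 features, 18 actions) are hypothetical placeholders, not the sizes used in the experiments.

```python
def target_based_params(backbone, head):
    """Two full networks: online and target each hold a backbone and a head."""
    return 2 * (backbone + head)


def target_free_params(backbone, head):
    """A single network with one head."""
    return backbone + head


def is_ql_params(backbone, head, k):
    """One shared backbone plus K + 1 linear heads."""
    return backbone + (k + 1) * head


backbone = 1_000_000   # hypothetical backbone size
head = 512 * 18 + 18   # hypothetical linear head: weights plus biases

tb = target_based_params(backbone, head)
ratio_tf = target_free_params(backbone, head) / tb
ratio_is = is_ql_params(backbone, head, k=9) / tb

print(ratio_tf)            # exactly 0.5
print(round(ratio_is, 2))  # about 0.54 with these made-up sizes
```

With these made-up sizes, even ten heads add only a few percent on top of the target-free footprint, because the backbone dominates the parameter count.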

Offline Discrete Control — Atari

Evaluated on 10 Atari games with IMPALA+LayerNorm and CQL loss (10% of DQN dataset):

Method	Performance gap vs target-based
TF-CQL (target-free)	−26%
iS-CQL K=9	−6%

iS-CQL shrinks the offline performance gap from 26% to 6%.

Online Continuous Control — DeepMind Control Suite

Evaluated on 7 hard DMC tasks with SAC+SimbaV2+BatchNorm:

iS-SAC K=1 fully recovers the performance drop of the target-free approach.
Reduces total parameter count by 49% (SimbaV2 uses a large critic; only the linear head is duplicated).

Offline Language Modeling — Wordle

Evaluated with Implicit Language Q-Learning (ILQL) on the Wordle word-guessing game using a GPT-2 small backbone (264M parameters in total for the target-based setup):

Figure 2. Performance on the Wordle offline RL task (GPT-2 small backbone). iS-ILQL K=9 improves over the target-based approach by more than 5% in normalized AUC while saving 33% of RAM (88 million parameters). Sharing features also enables computing the TD error in a single forward pass, reducing training time. From Vincent et al., ICLR 2026.

Method	Normalized AUC	Parameters
TF-ILQL	≈ TB-ILQL	−88M vs TB
TB-ILQL	100%	264M
iS-ILQL K=9	> 105%	264M − 88M = 176M

iS-ILQL K=9 outperforms the target-based approach by more than 5% and saves 88 million parameters (33% RAM reduction). Because the online and target heads read the same embeddings from a single forward pass, iS-ILQL also trains faster than TB-ILQL.
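The single-forward-pass argument can be made concrete with a toy sketch. It is not ILQL's actual loss: the backbone is a stand-in that returns one scalar feature per token position, the heads are single scalar weights, and every name and number is hypothetical. The only point is to count how often the backbone is run.

```python
GAMMA = 0.99  # assumed discount factor


class CountingBackbone:
    """Stand-in for a transformer body that counts how often it is run."""

    def __init__(self, scale):
        self.scale = scale
        self.calls = 0

    def __call__(self, tokens):
        self.calls += 1
        return [self.scale * t for t in tokens]  # one scalar feature per position


def td_errors(online_feats, target_feats, online_w, target_w, rewards):
    """TD error at each position t, bootstrapping from position t + 1."""
    errors = []
    for t in range(len(rewards)):
        target = rewards[t] + GAMMA * target_w * target_feats[t + 1]
        errors.append(online_w * online_feats[t] - target)
    return errors


tokens = [1.0, 2.0, 3.0]
rewards = [0.0, 1.0]

# Shared backbone: one pass yields the features read by both heads.
shared = CountingBackbone(scale=1.0)
feats = shared(tokens)
shared_errors = td_errors(feats, feats, online_w=0.5, target_w=0.4, rewards=rewards)

# Separate target network: the same token sequence is encoded twice.
online_body, target_body = CountingBackbone(1.0), CountingBackbone(1.0)
separate_errors = td_errors(online_body(tokens), target_body(tokens),
                            online_w=0.5, target_w=0.4, rewards=rewards)

print(shared.calls, online_body.calls + target_body.calls)  # 1 2
```

In this toy setting both variants produce the same TD errors because the two backbones are identical; the shared version simply gets there with half the backbone passes.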

Streaming RL — Atari (no replay buffer)

Applied to Stream Q(λ) (Elsayed et al., 2024) on 7 Atari games without a replay buffer or batch updates:

iS-Stream Q(λ) K=3 improves over the target-free baseline by more than 10% in AUC, matching or outperforming the target-based reference on 6 out of 7 games.

Conclusion

Simple modification, broad impact: Sharing all parameters except the final linear head reduces memory to near target-free levels while restoring target-based stability across five distinct RL settings.
Iterated Bellman updates amplify the gain: Learning K consecutive Bellman updates in parallel with the shared backbone significantly narrows, and in some settings eliminates, the performance gap with target-based methods.
Scalable to large architectures: The 49% total parameter reduction on SimbaV2 and 33% RAM saving on GPT-2 confirm practical value for memory-constrained hardware.
Analysis-backed: Gradient alignment, target churn, and srank measurements all confirm that iS-QL’s learning dynamics are systematically closer to target-based than target-free, explaining the empirical gains.
Orthogonal to existing regularization: iS-QL combines additively with LayerNorm, BatchNorm, MellowMax, and other target-free stabilizers; the gains are complementary.

References

Vincent, T., Tripathi, Y., Faust, T., Akgül, A., Oren, Y., Kandemir, M., Peters, J., & D’Eramo, C. (2026). Bridging the Performance-gap between Target-free and Target-based Reinforcement Learning. Fourteenth International Conference on Learning Representations (ICLR 2026).
Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518.
Vincent, T., et al. (2025). Iterated Q-Network: Beyond One-Step Bellman Updates in Deep Reinforcement Learning. arXiv:2403.02107.
Gallici, M., et al. (2025). Simplifying Deep Temporal Difference Learning. ICLR 2025.
Bhatt, A., et al. (2024). CrossQ: Batch Normalization in Deep Reinforcement Learning for Greater Sample Efficiency and Simplicity. ICLR 2024.
Snell, C., et al. (2023). Offline RL for Natural Language Generation with Implicit Language Q Learning. ICLR 2023.
Elsayed, M., et al. (2024). Streaming Deep Reinforcement Learning Finally Works. arXiv:2410.10939.
Lee, H., et al. (2025). Hyperspherical Normalization for Scalable Deep Reinforcement Learning. arXiv:2502.15280.