Researchers are tackling the persistent problem of optimising complex policies in continuous-action reinforcement learning, and Qiyang Li and Sergey Levine of UC Berkeley present a promising new solution in their work on Q-learning with Adjoint Matching (QAM). Their algorithm addresses the numerical instability often encountered when using gradient-based optimisation for expressive policies, avoiding both the loss of crucial gradient information and the biases of existing approximation methods. QAM employs adjoint matching, a technique borrowed from generative modelling, to create a stable, step-wise objective function, enabling consistently superior performance on challenging, sparse-reward tasks in both offline and offline-to-online reinforcement learning.
Existing methods typically circumvent this instability by either discarding valuable gradient information or employing approximations that compromise policy expressivity or introduce bias. QAM instead eliminates unstable backpropagation while keeping the policy unbiased and highly expressive at the optimum, and it demonstrates superior performance across a range of benchmarks, setting a new standard for policy optimisation in complex environments.
Prior methods often either ignored this crucial gradient information, sacrificing learning efficiency, or distilled expressive policies into simpler, one-step approximations, thereby limiting their potential. QAM, however, constructs an unbiased gradient estimate by transforming the critic’s gradient at noiseless actions through the behaviour flow model that encodes the prior constraint, effectively aligning the policy’s velocity field with the optimal state-conditioned velocity field implied by both the critic and the prior. This allows efficient convergence to the optimal policy while preserving the full expressivity of multi-step flow models. Experiments show that QAM consistently achieves state-of-the-art results on the OGBench benchmark, attaining an aggregated score of 46 across 50 tasks and significantly exceeding the performance of methods like FAWAC (8), FBRAC (11), CGQL (30), and others. This points to a pathway for unlocking the full potential of expressive policies in reinforcement learning, paving the way for more robust and adaptable agents capable of tackling complex, real-world challenges. The work opens exciting possibilities for applications in robotics, autonomous driving, and other domains where complex action spaces and sparse rewards are prevalent.
Q-learning with Adjoint Matching for Flow Policies offers stable, unbiased optimisation
The research tackles a long-standing challenge in continuous-action RL by directly utilising the critic’s first-order information, something previously hindered by numerical instability during backpropagation through the multi-step denoising processes inherent in flow or diffusion policies. Existing methods circumvented this by either discarding crucial gradient information or employing approximations that reduced policy expressivity or introduced bias. Researchers engineered a system in which a behaviour flow policy, denoted πβ(·|s), is fine-tuned via adjoint matching to align with the optimal behaviour-constrained policy, πθ(·|s) ∝ πβ(·|s)e^Q(s,·). Sampling an action begins by drawing noise z from a normal distribution N(0, Id), which the behaviour velocity field vβ then denoises step by step into an action.
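To make this sampling procedure concrete, here is a minimal sketch assuming a PyTorch velocity-field model; the function name, signature, and Euler discretisation are illustrative assumptions, not the authors’ implementation.

```python
# A minimal sketch (not the authors' code) of drawing an action from a flow
# policy: start from Gaussian noise and Euler-integrate the learned velocity
# field up to t = 1. `velocity_field` stands in for the behaviour flow
# v_beta(x, t | s); its signature and the step count are assumptions.
import torch

def sample_flow_action(velocity_field, state, action_dim, n_steps=10):
    """Integrate dX_t = f(X_t, t) dt from t = 0 (noise z) to t = 1 (action)."""
    x = torch.randn(action_dim)                    # X_0 = z ~ N(0, Id)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = torch.full((1,), k * dt)               # current time as a tensor
        x = x + velocity_field(x, t, state) * dt   # one Euler step of the ODE
    return x                                       # X_1: the noiseless action
```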
This innovative approach circumvents the instability issues associated with traditional gradient-based optimisation of complex policies. Furthermore, the research demonstrated superior performance on the OGBench dataset, achieving an aggregated score of 46 on 50 tasks, surpassing methods like FAWAC (8), FBRAC (11), CGQL (30), and IFQL (33). This method achieves a significant advancement in continuous-action RL, offering a pathway to train highly expressive policies with improved stability and performance.
Q-learning with Adjoint Matching stabilises flow policy optimisation
The method builds on a flow model over intermediate states Xt, defined by the interpolation Xt = (1 − t)X0 + tX1, where X0 is noise distributed as N(0, Id) and X1 is data. The flow model transports noise to data via an ordinary differential equation, dXt = f(Xt, t) dt. The flow matching objective, LFM(θ) = E_{t∼U[0,1], x0∼N, x1∼D} ‖fθ((1 − t)x0 + t x1, t) − (x1 − x0)‖²₂, recovers the marginal distribution pt(xt) of this denoising process for each t ∈ [0, 1]. Further analysis constructs a family of stochastic differential equations (SDEs) admitting the same marginals, dXt = [f(Xt, t) + (σt² t / (2(1 − t)))·(f(Xt, t) − Xt/t)] dt + σt dBt, where Bt is Brownian motion and σt ≥ 0 is a noise schedule.
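The two equations above translate directly into code. Below is a hedged sketch assuming PyTorch; the helper names, the callable noise schedule sigma(t), and the Euler–Maruyama discretisation are assumptions rather than the paper’s implementation.

```python
# A sketch of the flow matching objective L_FM and of one Euler-Maruyama step
# of the noise-augmented SDE with the same marginals. `model` (f_theta), `f`,
# `sigma`, and the batch shapes are placeholder assumptions.
import math
import torch

def flow_matching_loss(model, x1_batch):
    """L_FM: regress f_theta((1-t)x0 + t*x1, t) onto the target velocity x1 - x0."""
    x0 = torch.randn_like(x1_batch)               # noise X_0 ~ N(0, Id)
    t = torch.rand(x1_batch.shape[0], 1)          # t ~ U[0, 1], one per sample
    xt = (1 - t) * x0 + t * x1_batch              # interpolant X_t
    target = x1_batch - x0                        # conditional velocity X_1 - X_0
    return ((model(xt, t) - target) ** 2).sum(-1).mean()

def sde_step(f, x, t, dt, sigma):
    """One Euler-Maruyama step of
    dX_t = [f + (sigma_t^2 t / (2(1-t))) (f - X_t/t)] dt + sigma_t dB_t,  0 < t < 1."""
    v = f(x, t)
    drift = v + (sigma(t) ** 2 * t / (2.0 * (1.0 - t))) * (v - x / t)
    return x + drift * dt + sigma(t) * math.sqrt(dt) * torch.randn_like(x)
```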
Adjoint matching modifies a base flow-matching generative model, fβ, so that it generates a tilted distribution, pθ(X1) ∝ pβ(X1)e^Q(X1), effectively up-weighting or down-weighting the probability of each example. The stochastic optimal control loss, L(θ) = E_{X={Xt}t}[∫₀¹ (2/σt²) ‖fθ(Xt, t) − fβ(Xt, t)‖²₂ dt − Q(X1)], is minimised to achieve the correct tilted marginal distribution over X1. The key ingredient is a ‘lean’ adjoint state, computed via a reverse ODE satisfying dg(X, t) = −∇Xt[2fβ(Xt, t) − Xt/t]ᵀ g(X, t) dt, with the boundary condition g(X, 1) = −∇X1 Q(X1). Because this ‘lean’ adjoint state requires only the base flow model fβ, not fθ, it yields the adjoint matching objective LAM(θ) = E_X[∫₀¹ ‖2(fθ(Xt, t) − fβ(Xt, t))/σt + σt g(X, t)‖²₂ dt].
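As a rough illustration of how the lean adjoint can be computed alongside the adjoint matching loss on a discretised trajectory, here is a sketch; f_beta, f_theta, Q, sigma, and the backward-Euler discretisation are illustrative assumptions, not the authors’ code.

```python
# A sketch of the lean adjoint recursion and the adjoint matching loss on a
# discretised base-SDE rollout {x_k} at times {t_k}. All names are placeholders.
import torch

def adjoint_matching_loss(f_theta, f_beta, Q, sigma, xs, ts):
    """xs: list of trajectory points X_{t_k}; ts: increasing times in (0, 1]."""
    # Terminal condition of the lean adjoint: g(X, 1) = -grad_{X1} Q(X1).
    x1 = xs[-1].detach().requires_grad_(True)
    g = -torch.autograd.grad(Q(x1).sum(), x1)[0]

    loss = 0.0
    # Sweep backwards in time: dg = -grad_x[2 f_beta(x, t) - x/t]^T g dt,
    # which needs only the frozen base flow f_beta, never f_theta.
    for k in range(len(ts) - 1, 0, -1):
        t, dt = ts[k - 1], ts[k] - ts[k - 1]
        x = xs[k - 1].detach().requires_grad_(True)
        base_drift = 2.0 * f_beta(x, t) - x / t             # memoryless base drift
        vjp = torch.autograd.grad(base_drift, x, grad_outputs=g)[0]
        g = (g + vjp * dt).detach()                         # lean adjoint at t_{k-1}

        # Per-step adjoint matching residual; g and f_beta carry no gradient.
        diff = f_theta(xs[k - 1], t) - f_beta(xs[k - 1], t).detach()
        resid = 2.0 * diff / sigma(t) + sigma(t) * g
        loss = loss + (resid ** 2).sum() * dt
    return loss
```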
QAM Stabilises Policy Learning with Adjoint Matching, improving performance on sparse-reward benchmarks
By neither resorting to approximations that sacrifice policy expressiveness nor discarding valuable gradient information, QAM offers a more robust and effective solution for training complex agents. The authors highlight that their method operates within a common offline-to-online framework, first pre-training on offline datasets before fine-tuning online, and could potentially be adapted for purely online settings. They acknowledge that the current evaluation focuses on the offline-to-online RL paradigm, and further research could explore the algorithm’s performance in other contexts. They also note the importance of carefully crafting the step-wise objective function to avoid introducing biases or instability, a challenge inherent in methods that bypass direct backpropagation. Future work might investigate the application of adjoint matching to other areas, such as online reinforcement learning and generative modelling, potentially improving the training of diffusion and flow-matching models, particularly in addressing bias problems associated with guidance techniques.
👉 More information
🗞 Q-learning with Adjoint Matching
🧠 ArXiv: https://arxiv.org/abs/2601.14234
