Researchers are tackling the challenge of inefficient learning in complex reinforcement learning environments, where sparse rewards and vast state spaces often hinder progress. Hadi Salloum, Ali Jnadi, and Yaroslav Kholodov, alongside Alexander Gasnikov from the Phystech School of Applied Mathematics and Computer Science, MIPT, and Innopolis University, present a novel approach, MC+QUBO, which reformulates episode selection as a Quadratic Unconstrained Binary Optimisation (QUBO) problem and leverages quantum-inspired samplers to dramatically improve learning speed and policy quality. By intelligently filtering trajectory batches to maximise reward and encourage exploration, their method demonstrably outperforms standard Monte Carlo techniques in GridWorld environments, suggesting a promising new direction for integrating quantum-inspired optimisation into reinforcement learning algorithms.
MC+QUBO boosts Monte Carlo reinforcement learning efficiency significantly
Researchers explored both Simulated Quantum Annealing (SQA) and Simulated Bifurcation (SB) as effective black-box solvers within this framework, showcasing the versatility of the approach. The study establishes a clear advantage in scenarios where sample efficiency is paramount, offering a pathway to tackle complex reinforcement learning problems previously intractable due to computational limitations. The QUBO formulation, rooted in the principles of the Ising model from statistical mechanics, provides a robust mathematical foundation for this optimisation process. Specifically, the QUBO is constructed with a symmetric matrix Q and a vector q, allowing for the encoding of both reward maximisation and state-space coverage objectives.
This encoding facilitates the use of SQA and SB, which mimic quantum annealing and utilise bifurcation phenomena respectively, to efficiently search for optimal episode subsets. The work opens exciting avenues for combining classical and quantum-inspired techniques to enhance the performance of reinforcement learning agents. Furthermore, the research details the implementation of SQA, which simulates quantum annealing using classical stochastic updates, and SB, a classical dynamical system that drives variables towards binary states via bifurcation. Both methods rely on carefully tuned parameters, annealing schedules for SQA and parameters governing the bifurcation process for SB, to balance exploration and exploitation during the optimisation process. The successful integration of these solvers into the MC+QUBO framework demonstrates the feasibility of leveraging physics-inspired algorithms to address key challenges in reinforcement learning. This breakthrough not only improves the efficiency of learning but also provides a valuable tool for exploring complex decision-making problems across diverse applications, from robotics to game playing and beyond.
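To make the black-box solver role concrete, the sketch below minimises a small QUBO with plain single-flip simulated annealing. This is a classical stand-in with an assumed linear cooling schedule, written for illustration only; it is not the SQA or SB implementation used in the paper.

```python
import numpy as np

def anneal_qubo(Q, q, steps=2000, seed=0):
    """Minimise H(x) = x^T Q x + q^T x over x in {0,1}^n with single-flip
    simulated annealing. A classical stand-in for the SQA/SB black-box
    solvers described in the paper, not their implementation."""
    rng = np.random.default_rng(seed)
    Q = np.asarray(Q, float)
    q = np.asarray(q, float)
    n = len(q)
    x = rng.integers(0, 2, size=n).astype(float)
    energy = x @ Q @ x + q @ x
    best_x, best_e = x.copy(), energy
    for t in range(steps):
        T = max(1e-3, 1.0 - t / steps)   # assumed linear cooling schedule
        i = rng.integers(n)
        y = x.copy()
        y[i] = 1.0 - y[i]                # propose a single bit flip
        e_new = y @ Q @ y + q @ y
        # accept downhill moves always, uphill moves with Boltzmann probability
        if e_new < energy or rng.random() < np.exp((energy - e_new) / T):
            x, energy = y, e_new
            if energy < best_e:
                best_x, best_e = x.copy(), energy
    return best_x, best_e
```

Any solver with this interface — take (Q, q), return a low-energy binary vector — can be swapped in, which is exactly the sense in which SQA and SB act as interchangeable black boxes in the framework.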
QUBO Formulation for Monte Carlo Policy Evaluation
Researchers then explored the efficacy of both Simulated Quantum Annealing (SQA) and Simulated Bifurcation (SB) as black-box solvers within this QUBO framework. The QUBO formulation itself was derived from the Ising model, establishing a connection between spin configurations and binary variables via the substitution s_i = 2x_i − 1. To achieve this, the team constructed the QUBO Hamiltonian H_QUBO(x) = x⊤Qx + q⊤x, where Q is a symmetric matrix encoding the pairwise couplings, q is the vector of linear biases, and x denotes the binary variable vector. The conversion from the Ising Hamiltonian, H_Ising(s) = −∑_{(i,j)∈E} J_ij s_i s_j − ∑_{i∈V} h_i s_i, involved substituting s_i s_j = 4x_i x_j − 2x_i − 2x_j + 1 and collecting terms to define Q_ij = −4J_ij (for i ≠ j) and q_i = 2∑_{j≠i} J_ij − 2h_i. This precise mathematical formulation allowed the researchers to translate the reinforcement learning problem into a computationally tractable QUBO instance. Furthermore, the study employed SQA, which simulates quantum annealing using classical stochastic updates by interpolating between a driver Hamiltonian H_0 and the problem Hamiltonian H_P via H_SQA(t) = A(t)H_0 + B(t)H_P, where A(0) = 1, B(0) = 0 and A(T) = 0, B(T) = 1.
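The change of variables above can be checked mechanically. The sketch below converts Ising parameters (J, h) into QUBO form (Q, q) plus a constant offset, assuming a symmetric, zero-diagonal J with pairs summed once over i < j; the helper name and conventions are illustrative, not taken from the paper's code.

```python
import numpy as np

def ising_to_qubo(J, h):
    """Map H_Ising(s) = -sum_{i<j} J_ij s_i s_j - sum_i h_i s_i to
    H_QUBO(x) = x^T Q x + q^T x + offset via s_i = 2 x_i - 1.
    Assumes J is symmetric with zero diagonal. Illustrative helper."""
    J = np.asarray(J, float)
    h = np.asarray(h, float)
    Q = np.triu(-4.0 * J, k=1)            # Q_ij = -4 J_ij for i < j
    q = 2.0 * J.sum(axis=1) - 2.0 * h     # q_i = 2 sum_{j != i} J_ij - 2 h_i
    offset = -np.triu(J, k=1).sum() + h.sum()  # constant collected terms
    return Q, q, offset
```

Expanding −J_ij s_i s_j with s_i s_j = 4x_i x_j − 2x_i − 2x_j + 1 yields the −4J_ij quadratic coefficient, the +2J_ij contributions to each of q_i and q_j, and a constant −J_ij per edge, which the offset absorbs along with the +h_i constants from the field term.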
MC+QUBO accelerates learning in GridWorld environments
Experiments revealed that this new method, termed MC+QUBO, consistently outperformed vanilla Monte Carlo across a series of finite-horizon GridWorld environments of sizes ranging from 3 × 3 to 20 × 20. Results demonstrate that MC+QUBO converged in fewer batches than the baseline algorithm, with the improvement becoming particularly pronounced in larger environments, specifically those exceeding 10 × 10, where sparse rewards and expansive state spaces typically impede policy evaluation. The algorithm achieves this by focusing learning on informative and diverse trajectories, actively avoiding redundant episodes that contribute little to accurate value estimation. This selection process is encoded as a QUBO, and solutions were delivered within practical computational budgets using a Quantum Inspired Black Box Solver.
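The article does not reproduce the exact Q and q used for episode selection, but a plausible encoding of "maximise reward, penalise redundancy" looks like the sketch below. The weights alpha and beta, the overlap-based redundancy penalty, and the brute-force solver are all assumptions for illustration, not the paper's construction.

```python
import itertools
import numpy as np

def selection_qubo(returns, visited_sets, alpha=1.0, beta=0.5):
    """Build a QUBO where binary x_k = 1 means 'keep episode k'.
    Linear terms reward high returns (negated, since solvers minimise);
    quadratic terms penalise pairs of episodes covering the same states,
    encouraging diversity. Illustrative encoding only."""
    n = len(returns)
    Q = np.zeros((n, n))
    q = -alpha * np.asarray(returns, float)
    for i in range(n):
        for j in range(i + 1, n):
            overlap = len(visited_sets[i] & visited_sets[j])
            Q[i, j] = beta * overlap          # redundancy penalty
    return Q, q

def brute_force_select(Q, q):
    """Exact minimiser for small batches; a stand-in for the SQA/SB solver."""
    n = len(q)
    return min((np.array(bits, float)
                for bits in itertools.product((0, 1), repeat=n)),
               key=lambda x: x @ Q @ x + q @ x)
```

With two near-identical high-return episodes and one distinct low-return episode, such an encoding keeps only one of the duplicates plus the distinct trajectory — the reward/diversity trade-off the selection step is designed to make.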
Data shows that the resulting policies achieved higher average returns than those obtained with vanilla Monte Carlo across all tested grid sizes. The margin of improvement was greatest in the larger grids, where MC+QUBO effectively maintained exploratory diversity while simultaneously accelerating convergence. Specifically, rolling mean returns, measured with a window size of 6, consistently favoured MC+QUBO in environments ranging from 3×3 with 0.22 obstacle density to 20×20 with 0.01 obstacle density. The team’s implementation prioritised state-space coverage over direct reward optimisation, yielding more balanced learning and improved performance. Solver latency, measured during testing, ranged from 0.5 to 2 seconds per batch, dominated by cloud communication, with actual solver time remaining negligible at approximately 10-100 milliseconds for problems with n ≤ 200.
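For reference, a window-6 rolling mean of the kind quoted above can be computed as follows; the exact averaging convention used in the paper is assumed, not confirmed.

```python
import numpy as np

def rolling_mean(returns, window=6):
    """Rolling mean of episode returns over a fixed window, matching the
    window size of 6 reported for the smoothed return curves (the paper's
    precise averaging convention is an assumption here)."""
    r = np.asarray(returns, float)
    kernel = np.ones(window) / window
    return np.convolve(r, kernel, mode="valid")
```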
MC+QUBO boosts Monte Carlo reinforcement learning performance significantly
This improvement suggests that quantum-inspired optimisation techniques can function effectively as decision-making subroutines within reinforcement learning frameworks. The authors acknowledge that the computational latency was dominated by cloud communication for matrix transfer, though the actual solver time remained minimal for the problem sizes tested, up to 200 variables. Future work could explore extending the framework to continuous control problems, hierarchical reinforcement learning, or multi-agent systems. Further investigation into dynamic tuning of selection weights and hybrid criteria is also planned, alongside potential deployment on actual quantum hardware. By integrating reinforcement learning with combinatorial optimisation, this study establishes a pathway for new algorithms where quantum and quantum-inspired methods actively contribute to the learning process.
👉 More information
🗞 Quantum-Inspired Episode Selection for Monte Carlo Reinforcement Learning via QUBO Optimization
🧠 ArXiv: https://arxiv.org/abs/2601.17570
