Quantum Computing Enables Faster Stochastic Contextual Bandits for Actions and Time

Sequential decision-making in complex environments relies heavily on stochastic contextual bandits, yet training effective neural networks for these tasks remains a considerable hurdle. Yuqi Huang, Vincent Y. F. Tan, and Sharu Theresa Jose, from the National University of Singapore and the University of Birmingham, present a new approach to overcome limitations such as over-parameterisation and computational instability often encountered when applying quantum neural networks. Their research introduces the Quantum Neural Tangent Kernel Upper Confidence Bound (QNTK-UCB) algorithm, which freezes the quantum neural network at initialisation and uses its static quantum neural tangent kernel (QNTK) for ridge regression. This method not only avoids unstable training but also exploits a distinctive inductive bias, achieving significantly improved parameter scaling and demonstrably superior sample efficiency in low-data scenarios. The work establishes how the properties of the QNTK can deliver implicit regularisation and sharper spectral decay, potentially unlocking a genuine advantage in online learning applications.

Quantum Algorithm for Contextual Bandits Stabilised

Stochastic contextual bandits are fundamental to sequential decision-making, yet present considerable challenges for current neural network-based algorithms. These difficulties are particularly pronounced when scaling to quantum neural networks (QNNs) owing to issues such as massive over-parameterisation, computational instability, and the barren plateau phenomenon. The research objective is to develop a scalable and stable quantum algorithm for stochastic contextual bandits that avoids the pitfalls associated with training QNNs. The approach centres on utilising the QNTK, a kernel derived from the QNN’s Jacobian at initialisation, to approximate the expected reward function.
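To make the kernel construction concrete, here is a minimal sketch of how an empirical (Q)NTK can be assembled from a model's Jacobian at its frozen initial parameters. The model function `qnn_expectation`, its dimensions, and the finite-difference gradients are illustrative assumptions for a classical simulation, not the authors' circuit or the parameter-shift gradients one would use on quantum hardware.

```python
import numpy as np

def jacobian_at_init(f, theta0, contexts, eps=1e-4):
    """Central finite-difference Jacobian of a scalar model f(x, theta) with
    respect to theta, evaluated once at the frozen initial parameters theta0."""
    jac = np.zeros((len(contexts), len(theta0)))
    for i, x in enumerate(contexts):
        for j in range(len(theta0)):
            tp, tm = theta0.copy(), theta0.copy()
            tp[j] += eps
            tm[j] -= eps
            jac[i, j] = (f(x, tp) - f(x, tm)) / (2 * eps)
    return jac

def empirical_ntk(f, theta0, contexts):
    """Empirical (Q)NTK Gram matrix: K[i, j] = <grad f(x_i; theta0), grad f(x_j; theta0)>.
    Because theta0 stays frozen, this kernel is fixed for the whole bandit run."""
    J = jacobian_at_init(f, theta0, contexts)
    return J @ J.T

# Toy stand-in for a QNN expectation value (a smooth scalar function of context and parameters).
rng = np.random.default_rng(0)
theta0 = rng.normal(size=8)
contexts = [rng.normal(size=8) for _ in range(5)]
qnn_expectation = lambda x, th: np.tanh(x @ th)        # hypothetical placeholder model
K = empirical_ntk(qnn_expectation, theta0, contexts)   # 5 x 5 positive semi-definite matrix
```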

This allows the algorithm to function as a kernel-based method, sidestepping the complexities of gradient descent and the associated optimisation challenges. Through this methodology, the algorithm aims to achieve competitive performance with reduced computational cost and improved stability. A specific contribution of this work is the formulation of QNTK-UCB, which provides a theoretically grounded and practically viable solution for quantum contextual bandit problems.
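As an illustration of the kernel-based decision rule, the following sketch scores candidate arms with kernel ridge regression plus an upper-confidence bonus computed from the same fixed kernel. The regularisation weight `lam`, exploration weight `beta`, and the assumption of a normalised kernel (k(x, x) = 1) are placeholders; the exact constants and bonus in QNTK-UCB are those derived in the paper.

```python
import numpy as np

def kernel_ucb_scores(K_hist, k_new, y_hist, lam=1.0, beta=1.0):
    """UCB scores for candidate arms under a fixed kernel.

    K_hist : (t, t) Gram matrix of the t past context-action pairs
    k_new  : (t, K) kernel values between past pairs and the K candidate arms
    y_hist : (t,)   observed rewards
    """
    t = K_hist.shape[0]
    A = K_hist + lam * np.eye(t)
    alpha = np.linalg.solve(A, y_hist)              # kernel ridge-regression weights
    mean = k_new.T @ alpha                          # reward estimate for each arm
    # Exploration bonus: predictive width under the same frozen kernel,
    # assuming the kernel is normalised so that k(x, x) = 1.
    width = 1.0 - np.einsum('ij,ij->j', k_new, np.linalg.solve(A, k_new))
    return mean + beta * np.sqrt(np.maximum(width, 0.0))

# Each round: score every arm, play the argmax, observe the reward,
# then grow K_hist, k_new and y_hist with the newly played pair.
```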

The algorithm’s design ensures that it benefits from the expressivity of QNNs while remaining computationally tractable. Furthermore, the study demonstrates the algorithm’s effectiveness through simulations, showcasing its ability to learn optimal policies in various bandit environments. By employing this kernelised approach, the team circumvented the non-convex optimisation landscape of variational circuits while retaining the quantum feature map’s inherent advantages. The core innovation lies in harnessing the QNTK to achieve a significantly improved parameter scaling of Ω̃((TK)³) for QNTK-UCB, where T represents the time horizon and K the number of actions.

This represents a substantial reduction compared to the Ω̃((TK)⁸) parameters required by classical NeuralUCB to guarantee similar regret bounds. Experiments employed non-linear synthetic benchmarks and tasks native to quantum computing, specifically quantum initial-state recommendation for the Variational Quantum Eigensolver (VQE). These experiments were designed to demonstrate QNTK-UCB’s superior sample efficiency, particularly in low-data scenarios.

The research leveraged the properties of the QNTK, which provides implicit regularisation and sharper spectral decay, opening a route to a potential “quantum advantage” in online learning. The study also addressed the barren plateau phenomenon, a significant challenge in deep QNN training, by focusing on architectures whose training dynamics are governed by a fixed analytic kernel determined at initialisation. This allowed the team to avoid the exponential measurement requirements typically needed for accurate gradient estimation.

Furthermore, the work demonstrates how the quantum feature space facilitates more efficient linearisation than its classical counterpart. The team’s theoretical analysis and empirical results highlight the potential of QNTK-UCB to unlock more effective online learning algorithms, particularly in contexts where data is limited and computational resources are constrained. This approach addresses limitations inherent in scaling neural networks for stochastic contextual bandits, specifically over-parameterisation, computational instability, and the barren plateau phenomenon.

The team measured a substantially improved parameter scaling of Ω̃((TK)³) for QNTK-UCB, a considerable reduction compared to the Ω̃((TK)⁸) parameters required by classical NeuralUCB to achieve comparable regret guarantees. This advancement opens the door to more efficient learning in complex environments. The data show superior sample efficiency in low-data regimes when tested on non-linear synthetic benchmarks and quantum initial-state recommendation tasks for Variational Quantum Eigensolver (VQE) applications. Together, these results chart a path toward a potential “quantum advantage” in online learning scenarios.

Measurements indicate that the quantum feature space allows for more efficient linearisation than its classical counterpart, a key factor in the algorithm’s performance. The research demonstrates that QNTK-UCB, the first contextual bandit algorithm to use the empirical QNTK for reward estimation, effectively exploits quantum expressive power without the instabilities of explicitly training parameterised quantum circuits. Tests demonstrate the framework’s ability to handle both classical and quantum-native reward functions, broadening its applicability across diverse learning tasks.

Further analysis quantified the quantum effective dimension, revealing its crucial role in achieving the improved parameter scaling. Results demonstrate that the inherent properties of the QNTK provide implicit regularisation and a sharper spectral decay, contributing to the algorithm’s enhanced performance. Empirical evaluations on both synthetic and eigensolver tasks confirm QNTK-UCB’s superior sample efficiency, particularly in low-data scenarios.
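For readers who want to see what "effective dimension" and "spectral decay" mean operationally, the sketch below computes an information-gain style proxy, log det(I + K/λ), from the eigenvalues of a Gram matrix. This is a common surrogate in kernelised bandit analyses; the precise quantum effective dimension used in the paper may be defined differently, and the two toy spectra are purely illustrative.

```python
import numpy as np

def effective_dimension(K, lam=1.0):
    """Information-gain style proxy: log det(I + K / lam).
    A kernel whose eigenvalues decay quickly yields a smaller value."""
    eigvals = np.maximum(np.linalg.eigvalsh(K), 0.0)
    return float(np.sum(np.log1p(eigvals / lam)))

# Two toy spectra with the same trace: one decays geometrically, one is flat.
fast_decay = np.diag(2.0 ** -np.arange(10, dtype=float))
flat = np.diag(np.full(10, fast_decay.trace() / 10))
print(effective_dimension(fast_decay))  # smaller: sharp spectral decay
print(effective_dimension(flat))        # larger: no decay
```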

👉 More information
🗞 Quantum-Enhanced Neural Contextual Bandit Algorithms
🧠 ArXiv: https://arxiv.org/abs/2601.02870

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Neuroevolution Achieves Efficient Network Evolution with Novel Inflate and Deflate Operators
January 17, 2026

Emancipatory Information Access Platforms Achieve Resistance to Authoritarian Capture Amidst Rising Democratic Erosion
January 17, 2026

Transformer Language Models Achieve Improved Arithmetic with Value-Aware Numerical Representations
January 17, 2026