Scientists are tackling the persistent problem of coordination in multi-agent reinforcement learning (MARL), a key obstacle to developing truly collaborative artificial intelligence. John Gardiner, Orlando Romero, Brendan Tivnan, Nicolò Dal Fabbro, and George J. Pappas, of Nasdaq, Inc. and the University of Pennsylvania, present a novel framework that trains MARL agents to utilise quantum entanglement as a coordination resource, moving beyond traditional methods that rely on shared randomness. This research is significant because it shows that the principles of quantum mechanics can unlock cooperative strategies unavailable to classical approaches, offering a substantial advantage in scenarios that demand complex, decentralised decision-making.
The framework admits a strictly broader class of communication-free correlated policies than shared randomness alone can realise.
Motivated by principles from quantum physics, the study demonstrates that shared entanglement can enable superior strategies in cooperative games where direct communication is absent, a phenomenon termed quantum advantage. The work introduces a new approach to MARL, moving beyond traditional methods that rely on shared randomness to achieve coordination among agents.
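The textbook single-round example of such an advantage is the CHSH game (our illustration; the paper's specific games are not listed in this summary): two agents receive uniform random bits x and y, answer with bits a and b without communicating, and win when a XOR b equals x AND y. No shared-randomness strategy wins more than 75% of the time, while measuring a shared Bell pair reaches cos²(π/8) ≈ 85.4%. A short numpy check of the entangled value:

```python
import numpy as np

def proj(theta):
    """Rank-1 projector onto cos(theta)|0> + sin(theta)|1>."""
    v = np.array([np.cos(theta), np.sin(theta)])
    return np.outer(v, v)

# Shared Bell state |Phi+> = (|00> + |11>)/sqrt(2)
phi = np.zeros(4)
phi[0] = phi[3] = 1 / np.sqrt(2)
rho = np.outer(phi, phi)

# Standard optimal measurement angles for the CHSH game
alice = {0: 0.0, 1: np.pi / 4}
bob = {0: np.pi / 8, 1: -np.pi / 8}

win = 0.0
for x in (0, 1):
    for y in (0, 1):          # questions x, y are uniform
        for a in (0, 1):
            for b in (0, 1):  # outcome-1 projector is orthogonal (angle + pi/2)
                Pa = proj(alice[x] + a * np.pi / 2)
                Pb = proj(bob[y] + b * np.pi / 2)
                p = np.trace(np.kron(Pa, Pb) @ rho).real
                if (a ^ b) == (x & y):
                    win += 0.25 * p

print(f"entangled win probability: {win:.4f}")  # ~0.8536 = cos^2(pi/8) > 0.75
```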
Central to this innovation is a differentiable policy parameterisation that allows optimisation over quantum measurements. This, combined with a policy architecture that separates joint policies into a quantum coordinator and decentralised local actors, facilitates end-to-end learning. The researchers demonstrate the effectiveness of their method by first learning strategies that achieve quantum advantage in single-round games treated as black-box oracles.
Subsequently, they extend this capability to a more complex multi-agent sequential decision-making problem, formulated as a decentralised partially observable Markov decision process. This research delineates a hierarchy of joint policy classes for communication-free cooperative MARL, encompassing shared randomness policies, shared quantum entanglement policies, and non-signalling policies.
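For two agents with observations x, y and actions a, b in a single round, these classes correspond to progressively weaker constraints on the joint conditional P(a, b | x, y); the notation below is ours, not necessarily the paper's:

```latex
% Shared randomness: a common random variable \lambda correlates local policies
P_{\mathrm{SR}}(a, b \mid x, y) = \sum_{\lambda} p(\lambda)\,
    \pi_A(a \mid x, \lambda)\, \pi_B(b \mid y, \lambda)

% Shared entanglement: Born-rule statistics of local measurements on \rho_{AB}
P_{\mathrm{QE}}(a, b \mid x, y) = \operatorname{Tr}\!\left[
    \left( E^{x}_{a} \otimes F^{y}_{b} \right) \rho_{AB} \right]

% Non-signalling: each marginal is independent of the other agent's observation
\sum_{b} P(a, b \mid x, y) = P(a \mid x), \qquad
\sum_{a} P(a, b \mid x, y) = P(b \mid y)
```

These classes are strictly nested (SR ⊊ QE ⊊ NS): the CHSH game above witnesses the first strict inclusion, and the non-signalling Popescu-Rohrlich box, which wins CHSH with certainty, witnesses the second.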
The development of QuantumSoftmax, a differentiable transformation mapping matrices to quantum measurements, is a key component of the framework. By enabling gradient-based optimisation over these measurements, the system can learn optimal entangled strategies directly from experience. The resulting policies exhibit a demonstrable advantage in scenarios where communication is restricted or impossible, opening avenues for applications in low-latency systems and other constrained environments.
Validation of the framework involved applying it to single-round cooperative games with established quantum advantage, confirming the recovery of known optimal entangled strategies. Further application to a multi-router multi-server queueing problem, formulated as a decentralised partially observable Markov decision process, yielded sequential decision-making policies that achieve quantum advantage, a result previously demonstrated only through queueing-theoretic methods. This work represents a significant step towards harnessing the power of quantum entanglement for enhanced coordination in multi-agent systems.
Quantum Measurement Optimisation via Differentiable Policy Architecture
A novel differentiable policy parameterisation underpins this work, enabling optimisation over quantum measurements for multi-agent reinforcement learning. The research introduces QuantumSoftmax, a transformation that maps arbitrary square complex-valued matrices to valid quantum measurements, facilitating end-to-end gradient-based optimisation.
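The exact construction is not reproduced in this summary; one natural candidate, offered here purely as an assumption, makes each input matrix positive semidefinite and then normalises the collection so the operators sum to the identity, mirroring how softmax exponentiates and then normalises logits:

```python
import torch

def quantum_softmax(A, eps=1e-9):
    """Map K arbitrary complex matrices A (shape K x d x d) to a POVM
    {E_k}: each E_k is positive semidefinite and sum_k E_k = I.

    Hypothetical construction, not necessarily the paper's definition:
    M_k = A_k^dagger A_k makes each operator PSD, then conjugating by
    S^{-1/2} (with S = sum_k M_k) normalises the set.
    """
    M = A.conj().transpose(-1, -2) @ A                     # PSD pieces
    d = A.shape[-1]
    S = M.sum(dim=0) + eps * torch.eye(d, dtype=A.dtype)   # Hermitian, > 0
    w, V = torch.linalg.eigh(S)                            # S = V diag(w) V^dagger
    S_inv_sqrt = V @ torch.diag((w.clamp_min(eps) ** -0.5).to(A.dtype)) @ V.conj().T
    return S_inv_sqrt @ M @ S_inv_sqrt                     # E_k sums to ~I

# Sanity check: 3 outcomes on a two-qubit (d = 4) system, gradients enabled
A = torch.randn(3, 4, 4, dtype=torch.complex128, requires_grad=True)
E = quantum_softmax(A)
print(torch.allclose(E.sum(dim=0), torch.eye(4, dtype=E.dtype), atol=1e-6))  # True
```

Every step here (matrix products, Hermitian eigendecomposition) is differentiable, so reward gradients can flow back into the raw matrices, which is exactly what end-to-end optimisation over measurements requires.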
This parameterisation allows agents to learn strategies that exploit shared quantum entanglement as a coordination resource, going beyond what shared randomness alone permits. The methodology decomposes joint policies into a quantum coordinator and decentralised local actors, creating a distinct policy architecture.
The quantum coordinator samples correlated advice via quantum measurements, while local actors condition their actions on this advice, seamlessly integrating with policy gradient methods. This architecture was implemented within a modified multi-agent proximal policy optimisation algorithm, specifically designed to learn entangled policies for sequential decision-making.
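A minimal sketch of that split, with names and shapes of our own choosing rather than the paper's: the coordinator holds the shared state ρ and the agents' learned POVMs, samples joint advice via the Born rule, and hands each agent only its own outcome:

```python
import torch

def sample_correlated_advice(rho, E_A, E_B):
    """Quantum coordinator: sample joint advice (i, j) with Born-rule
    probability P(i, j) = Tr[(E_A[i] kron E_B[j]) rho].

    rho: (dA*dB, dA*dB) density matrix of the shared entangled state.
    E_A: (KA, dA, dA) POVM for agent A (e.g. output of quantum_softmax).
    E_B: (KB, dB, dB) POVM for agent B.
    """
    KA, KB = E_A.shape[0], E_B.shape[0]
    probs = torch.empty(KA, KB)
    for i in range(KA):
        for j in range(KB):
            probs[i, j] = torch.trace(torch.kron(E_A[i], E_B[j]) @ rho).real
    idx = torch.multinomial(probs.flatten().clamp_min(0), 1).item()
    return idx // KB, idx % KB  # agent A's advice, agent B's advice

# Each local actor then conditions only on its own observation and advice:
#   logits_i = actor_i(obs_i, advice_i)   -- no cross-agent signalling.
```

For policy-gradient training, the log-probability of the sampled advice enters the objective alongside the actors' action log-probabilities.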
Validation commenced with single-round cooperative games possessing theoretically established quantum advantage, confirming the algorithm’s ability to recover known optimal entangled strategies. Subsequently, the framework was applied to a multi-router multi-server queueing problem, formulated as a decentralised partially observable Markov decision process.
This demonstrated that learned sequential decision-making policies can achieve quantum advantage in a setting previously analysed only through queueing-theoretic methods. The study establishes a hierarchy of joint policy classes, comprising shared randomness policies, shared quantum entanglement policies, and non-signalling policies, and demonstrates the expanded expressiveness of entangled strategies.
Exploiting quantum entanglement for cooperative multi-agent reinforcement learning
This research introduces the first framework for training multi-agent reinforcement learning (MARL) agents to exploit shared quantum entanglement as a coordination resource, exceeding the capabilities of shared randomness alone. Strategies attaining quantum advantage were first learned in single-round games treated as black-box oracles.
The work delineates a hierarchy of joint policy classes for communication-free cooperative MARL, including shared randomness policies, shared quantum entanglement policies, and non-signalling policies. A novel differentiable policy parameterisation, termed QuantumSoftmax, was developed to enable optimisation over quantum measurements.
This transformation maps arbitrary square complex-valued matrices to valid quantum measurements, facilitating end-to-end gradient-based optimisation. The research also introduces a policy architecture that decomposes joint policies into a quantum coordinator and decentralised local actors. This separation allows seamless integration with policy gradient methods, enabling the learning of entangled policies for sequential decision-making.
Validation of the framework was initially performed on single-round cooperative games with established quantum advantage. The algorithm successfully recovered known optimal entangled strategies in these scenarios, demonstrating its ability to learn effective coordination mechanisms. Subsequently, the framework was applied to a multi-router multi-server queueing problem formulated as a decentralised partially observable Markov decision process (Dec-POMDP).
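The paper's queueing model is not reproduced here, but a stripped-down toy of the same shape illustrates the Dec-POMDP structure: each router observes only its own arrivals, both choose a server simultaneously, and collisions inflate waiting time. (In this minimal version shared randomness already suffices for anti-coordination; the richer settings studied in the paper are where entanglement pays off.)

```python
import random

class ToyRouterDecPOMDP:
    """Toy 2-router / 2-server queueing Dec-POMDP (illustrative only).

    Each step, every router independently receives a packet with
    probability p, observes only its own arrival, and picks a server.
    If both forward a packet to the same server, one packet waits an
    extra step; the shared team reward is minus the wait incurred.
    """

    def __init__(self, p_arrival=0.8):
        self.p = p_arrival

    def reset(self):
        return self._observe()

    def _observe(self):
        # Partial observability: obs_i reveals router i's arrival only.
        self.arrivals = [random.random() < self.p for _ in range(2)]
        return [int(a) for a in self.arrivals]

    def step(self, actions):
        # actions[i] in {0, 1}: server chosen by router i
        sent = [actions[i] for i in range(2) if self.arrivals[i]]
        wait = 1 if len(sent) == 2 and sent[0] == sent[1] else 0
        return self._observe(), -wait
```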
Policies achieving quantum advantage were learned in this setting, which had previously been analysed solely through queueing-theoretic methods, showcasing the applicability of the approach to complex sequential decision-making tasks. The study confirms the potential for exploiting quantum entanglement to enhance coordination in multi-agent systems without communication.
Entanglement-enhanced coordination improves multi-agent reinforcement learning performance
Researchers have developed a new framework for training multi-agent reinforcement learning (MARL) agents to utilise shared quantum entanglement as a coordination resource. This approach enables the creation of more effective strategies in scenarios where direct communication is limited, surpassing the capabilities of methods relying solely on shared randomness.
The framework introduces a differentiable policy parameterisation and a novel policy architecture that separates joint policies into a coordinator and individual local actors. Demonstrations within single-round games and a decentralised partially observable Markov decision process (Dec-POMDP) confirm that the method learns strategies exhibiting quantum advantage.
Specifically, the learned policies achieved lower wait times than optimal strategies based on shared randomness, indicating that entanglement mitigates the effects of communication constraints. These findings align with established results in physics showing that shared entanglement can improve performance in cooperative games that lack communication.
The authors acknowledge limitations, including that entanglement is currently restricted to use within a single time step and that realistic constraints such as imperfect state preparation are absent from the learning process. Future research will focus on characterising, theoretically and empirically, the conditions under which entanglement enhances decentralised sequential decision-making.
Further work will also explore extending the framework to allow entanglement to be shared across multiple rounds and investigating the potential for entanglement to reduce the complexity of finite-state controllers. This research highlights a potential shift towards practical applications of quantum entanglement, particularly in areas like high-frequency trading where latency is a critical factor.
👉 More information
🗞 Learning to Coordinate via Quantum Entanglement in Multi-Agent Reinforcement Learning
🧠 ArXiv: https://arxiv.org/abs/2602.08965
