The pursuit of universal machine learning algorithms hinges on efficiently representing data, and researchers are now quantifying exactly how much quantum hardware this requires. Sydney Leither, Michael Kubal, and Sonika Johri, all from Coherent Computing Inc., demonstrate a method for calculating the number of qubits necessary to solve a machine learning problem to a specified accuracy. Their work establishes the first resource estimation framework for variational machine learning, revealing that the recently proposed bit-bit encoding scheme achieves universal approximation constructively and efficiently. Applying this scheme to datasets ranging from medium-sized collections on OpenML to the massive transcriptomic Tahoe-100M dataset, the team finds that encoding complexity does not always increase with data size, and can even decrease with the number of features, potentially opening doors to machine learning tasks beyond the reach of classical computers.
Bit-bit Encoding Enables Scalable Quantum Learning
Researchers have developed a way to assess the feasibility of quantum machine learning by focusing on efficient data encoding. The approach centers on bit-bit encoding, a recently proposed technique that transforms datasets with numeric features and discrete labels into bitstrings, allowing their representation within a quantum computer's quantum state. This directly addresses data compression, a critical bottleneck in quantum machine learning, and makes the accuracy with which complex datasets are represented predictable. The method applies dimensionality reduction to produce the bitstrings, which correspond to the basis states of the quantum computer at initialization and measurement.
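To make the idea concrete, here is a minimal sketch of a bit-bit-style encoding, assuming PCA for dimensionality reduction and uniform binning for quantization; the function names (quantize_to_bits, encode_dataset) and these choices are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch of the bit-bit encoding idea: each (reduced) numeric feature is
# quantized to a fixed number of bits and the bits are concatenated into one
# bitstring, which labels a computational basis state. PCA, uniform binning,
# and the function names are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

def quantize_to_bits(x, lo, hi, n_bits):
    """Map a scalar in [lo, hi] to an n_bits binary string via uniform binning."""
    levels = 2 ** n_bits
    idx = int(np.clip((x - lo) / (hi - lo + 1e-12) * levels, 0, levels - 1))
    return format(idx, f"0{n_bits}b")

def encode_dataset(X, n_components=4, bits_per_feature=3):
    """Reduce dimensionality, then encode every sample as a single bitstring."""
    Z = PCA(n_components=n_components).fit_transform(X)
    lo, hi = Z.min(axis=0), Z.max(axis=0)
    return [
        "".join(quantize_to_bits(z[j], lo[j], hi[j], bits_per_feature)
                for j in range(n_components))
        for z in Z
    ]

X = np.random.rand(100, 20)            # toy data: 100 samples, 20 raw features
codes = encode_dataset(X)
print(codes[0], "->", len(codes[0]), "qubits for the feature register")
```

Each sample ends up as a short bitstring whose length, here n_components × bits_per_feature, sets the size of the basis-state register it occupies.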
To rigorously evaluate performance, the researchers built a resource estimation framework that calculates the number of qubits required to encode a dataset to a desired degree of accuracy. The framework introduces Qdataset, a resource metric quantifying the qubit requirements for modeling a dataset with bit-bit encoding. Experiments that incrementally increase the qubit budget reveal a trade-off between maximizing theoretical training accuracy and preserving generalization. Analysis of datasets from the OpenML platform demonstrates that typical examples require, on average, only about 20 qubits for complete coverage, suggesting limited potential for quantum advantage in these scenarios.
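One way to read the Qdataset idea is as a search over bit budgets: keep increasing the bits per feature until the encoded bitstrings can reproduce the labels to a target training accuracy. The sketch below reuses the hypothetical encode_dataset from the previous snippet and uses the majority-label purity of each bitstring as a stand-in for the paper's accuracy criterion.

```python
# Rough sketch of a Qdataset-style search: increase the bit budget until the
# bitstrings can reproduce the labels to a target training accuracy. The
# purity criterion is a stand-in assumption for the paper's accuracy measure;
# `encoder` defaults to the hypothetical encode_dataset sketched above.
from collections import Counter, defaultdict

def encoding_accuracy(bitstrings, labels):
    """Best training accuracy of a lookup table mapping each bitstring to its majority label."""
    buckets = defaultdict(list)
    for b, y in zip(bitstrings, labels):
        buckets[b].append(y)
    correct = sum(Counter(ys).most_common(1)[0][1] for ys in buckets.values())
    return correct / len(labels)

def estimate_qdataset(X, y, encoder=encode_dataset, target=0.99,
                      max_bits=8, n_components=4):
    """Smallest qubit count (n_components * bits) at which the encoding hits the target."""
    for bits in range(1, max_bits + 1):
        codes = encoder(X, n_components=n_components, bits_per_feature=bits)
        if encoding_accuracy(codes, y) >= target:
            return n_components * bits
    return n_components * max_bits
```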
Further extending the methodology, the team developed techniques to handle very large datasets, such as the giga-scale Tahoe-100M transcriptomic dataset. Scaling experiments with this dataset show that the quantum models require up to approximately 50 qubits, a count that exceeds the capabilities of classical simulation. This finding indicates that large transcriptomic datasets are a promising arena for demonstrating quantum advantage. The researchers also found that dimensionality reduction methods producing independent or de-correlated features achieve lower Qdataset values than non-linear methods or unprocessed data, highlighting the importance of data pre-processing.
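The de-correlation claim can be probed with the hypothetical helpers above by comparing a PCA-based encoding against one that simply truncates and quantizes the raw columns. This is a toy comparison under the assumptions already stated, not a reproduction of the paper's benchmarks.

```python
# Toy comparison of de-correlated versus unprocessed features, using the
# hypothetical helpers above. encode_raw skips PCA and simply truncates and
# quantizes the original columns; the paper's finding suggests the
# de-correlated encoding should reach the accuracy target with fewer qubits.
import numpy as np

def encode_raw(X, n_components=4, bits_per_feature=3):
    Z = X[:, :n_components]                     # no de-correlation, just truncation
    lo, hi = Z.min(axis=0), Z.max(axis=0)
    return ["".join(quantize_to_bits(z[j], lo[j], hi[j], bits_per_feature)
                    for j in range(n_components)) for z in Z]

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20)) @ rng.normal(size=(20, 20))   # correlated features
y = (X[:, 0] + X[:, 1] > 0).astype(int)                       # synthetic labels
print("de-correlated (PCA):", estimate_qdataset(X, y), "qubits")
print("raw features       :", estimate_qdataset(X, y, encoder=encode_raw), "qubits")
```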
The technique reveals that the number of qubits does not necessarily increase with the number of features in a dataset, and may even decrease under certain conditions. In essence, this research presents a promising approach to overcoming the scalability challenges in quantum machine learning by focusing on efficient data encoding, optimized training techniques, and the application of these methods to complex real-world datasets. The focus on multiomics data and high-energy physics suggests that this work could have a significant impact on these fields.
Bit-bit Encoding Achieves Universal Quantum Learning
Researchers have demonstrated that a recently proposed bit-bit encoding scheme constructively and efficiently realizes universal approximation in quantum machine learning, a critical property for any generally applicable learning paradigm. This breakthrough delivers the first resource estimation framework, allowing scientists to calculate the number of qubits required to solve a learning problem with a desired degree of accuracy. The team proved, using established mathematical theorems, that a variational quantum model employing bit-bit encoding possesses this essential universal approximation capability, while also scaling polynomially in dataset size. Experiments reveal that typical medium-sized classical machine learning datasets require, on average, only 20 qubits for complete coverage using this encoding scheme, suggesting these datasets are unlikely to demonstrate a quantum advantage.
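A toy classical illustration of the universal-approximation intuition, though not the paper's variational-circuit construction or proof: with enough bits per feature, a lookup table keyed by the encoded bitstring can approximate any bounded target on the training domain, and the error shrinks as the bit budget grows.

```python
# Toy classical illustration of the universal-approximation intuition behind
# bitstring encodings: a lookup table keyed by the encoded bitstring can
# approximate an arbitrary bounded target, with error shrinking as the bit
# budget grows. This is only the discretization intuition, not the paper's
# variational-circuit construction.
import numpy as np

def bit_key(x, n_bits):
    """Quantize a feature vector in [0, 1)^d to a tuple of bin indices (one per feature)."""
    levels = 2 ** n_bits
    return tuple(np.clip((x * levels).astype(int), 0, levels - 1))

def fit_lookup(X, y, n_bits):
    """Average the targets that fall into each quantization cell."""
    table = {}
    for xi, yi in zip(X, y):
        table.setdefault(bit_key(xi, n_bits), []).append(yi)
    return {k: float(np.mean(v)) for k, v in table.items()}

X = np.random.rand(5000, 2)
y = np.sin(4 * X[:, 0]) * np.cos(3 * X[:, 1])        # arbitrary smooth target
for n_bits in (2, 4, 6):
    table = fit_lookup(X, y, n_bits)
    preds = np.array([table[bit_key(xi, n_bits)] for xi in X])
    print(n_bits, "bits/feature -> training MSE", round(float(np.mean((preds - y) ** 2)), 4))
```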
However, scaling experiments with the massive Tahoe-100M transcriptomic dataset, containing data from over 100 million samples, showed that encoding this dataset requires approximately 50 qubits, beyond the capabilities of classical simulation. This finding indicates that such large datasets represent a promising arena for identifying potential quantum advantages. The team introduced Qdataset, a resource metric that estimates the number of qubits needed to model a dataset using bit-bit encoding. Analysis across various dimensionality reduction methods demonstrates that schemes producing independent or de-correlated features achieve lower Qdataset values, highlighting the importance of data pre-processing. Importantly, the research shows that the number of qubits does not necessarily increase with the number of features in a dataset, and may even decrease, challenging conventional assumptions about data complexity and resource requirements. This work establishes a foundation for assessing the potential of quantum machine learning and for identifying the datasets where quantum advantage is most likely to emerge.
👉 More information
🗞 How many qubits does a machine learning problem require?
🧠 ArXiv: https://arxiv.org/abs/2508.20992
