Flaky tests, which yield inconsistent results without code changes, represent a significant challenge to software dependability, and their prevalence in quantum software remains largely unexplored. Dongchan Kim, Hamidreza Khoramrokh, and Lei Zhang, all from the University of Maryland, Baltimore County, alongside Andriy Miranskyy from Toronto Metropolitan University, now present the first large-scale dynamic investigation into this critical issue. The team repeatedly executed more than 27,000 tests from the Qiskit Terra test suite across 23 releases, revealing that while overall flakiness rates are low, detecting these unreliable tests requires substantial computational resources. Their analysis demonstrates that many flaky tests exhibit very small failure probabilities, necessitating tens of thousands of repetitions for confident identification, and highlights specific software components, such as the ‘transpiler’ and ‘quantum_info’ modules, as particularly susceptible. This research establishes a crucial baseline for understanding and mitigating test flakiness in the rapidly evolving field of quantum software development.
Qiskit Flaky Tests: Causes and Analysis
This work examines the presence of flaky tests in the Qiskit open-source quantum computing framework, where flaky tests are defined as tests that pass or fail inconsistently without any changes to the underlying code. Such behavior poses a significant challenge for software development, testing reliability, and long-term maintenance. To investigate the issue, the researchers systematically executed the Qiskit Terra test suite, a comprehensive set of tests central to the framework’s functionality, to identify and analyze instances of test flakiness.
The study focused on Qiskit versions ranging from 0.25.0 to 1.2.4, enabling the researchers to observe how test flakiness evolved over time across multiple releases. A detailed taxonomy of Qiskit modules was used to provide structural context for the analysis, making it possible to determine where flaky tests were concentrated within the codebase. Automated testing tools such as tox supported consistent execution and management of the test runs throughout the study.
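To make the setup concrete, the sketch below shows one way such repeated executions could be automated and their per-test outcomes recorded. It is a minimal sketch, assuming the suite can be driven through pytest with JUnit XML reports; the paths, run count, and report handling are illustrative and do not reproduce the authors’ tox-based harness.

    # Minimal sketch: repeatedly run a test suite and record per-test outcomes.
    # Assumes the suite can be driven via pytest and that one JUnit XML report
    # per run suffices to tell pass from fail; illustrative only, not the
    # authors' actual tox-based harness.
    import subprocess
    import xml.etree.ElementTree as ET
    from collections import defaultdict

    N_RUNS = 100                      # the study used 10,000 runs per release
    outcomes = defaultdict(list)      # test id -> list of "pass"/"fail"

    for i in range(N_RUNS):
        report = f"run_{i}.xml"
        subprocess.run(
            ["python", "-m", "pytest", "test/python", f"--junitxml={report}", "-q"],
            check=False,              # non-zero exit just means some test failed
        )
        for case in ET.parse(report).iter("testcase"):
            test_id = f"{case.get('classname')}.{case.get('name')}"
            failed = case.find("failure") is not None or case.find("error") is not None
            outcomes[test_id].append("fail" if failed else "pass")

    # A test is flagged as flaky if both outcomes were observed without code changes.
    flaky = [t for t, runs in outcomes.items() if len(set(runs)) > 1]
    print(f"{len(flaky)} potentially flaky tests out of {len(outcomes)}")

In practice, each Qiskit release would be checked out and run in its own isolated environment, as the study did across its 23 releases.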
The results indicate that although flaky tests do exist in Qiskit, they are not uniformly distributed across all modules. Certain components exhibit higher susceptibility to flakiness than others, highlighting the importance of understanding the internal architecture of the software when diagnosing testing issues. The study emphasizes that thorough root cause analysis is critical for developing effective strategies to reduce or eliminate flaky tests, thereby improving the reliability and maintainability of quantum software frameworks like Qiskit.
Large-Scale Flaky Test Characterization in Qiskit
Researchers conducted the first large-scale dynamic characterization of flaky tests in quantum software by re-executing the Terra test suite 10,000 times across 23 distinct releases. This extensive re-execution identified 290 distinct flaky tests among a total of 27,026 test cases. The study employed a carefully controlled environment to isolate genuine flakiness from external factors and quantified flakiness by measuring test-outcome variability. Empirical failure probabilities were estimated, and Wilson confidence intervals were used to determine appropriate rerun budgets for reliable detection of failures.
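As a rough illustration of this statistical machinery, the snippet below computes a Wilson score interval for a test’s empirical failure probability from k observed failures in n reruns; the example counts are made up and the code is not drawn from the paper.

    # Sketch: Wilson score interval for a test's empirical failure probability,
    # given k observed failures in n reruns. Standard formula; the example
    # counts and confidence level are illustrative, not taken from the paper.
    from math import sqrt

    def wilson_interval(k, n, z=1.96):   # z = 1.96 corresponds to ~95% confidence
        p_hat = k / n
        denom = 1 + z * z / n
        centre = p_hat + z * z / (2 * n)
        margin = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
        return (centre - margin) / denom, (centre + margin) / denom

    # Example: 3 failures observed in 10,000 reruns
    lo, hi = wilson_interval(3, 10_000)
    print(f"95% CI for failure probability: [{lo:.2e}, {hi:.2e}]")

Because the interval narrows only slowly as n grows, very rare failures demand large rerun budgets before their probabilities can be pinned down.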
Analysis revealed that while overall flakiness rates were low, ranging from 0 to 0.4%, the behavior was highly episodic, with nearly two-thirds of flaky tests appearing in only a single release. The resulting dataset, comprising per-test execution outcomes, has been released publicly to support future research.
Flaky Quantum Tests Exhibit Episodic, Sparse Failures
Scientists conducted a large-scale dynamic characterization of flaky tests within quantum software, executing the Terra test suite 10,000 times across 23 releases. This work, requiring approximately 70 CPU-years of computation, identified 290 distinct flaky tests among 27,026 test cases. Results demonstrate that while overall flakiness rates were low, ranging from 0 to 0.4%, the behavior of these tests was highly episodic, with nearly two-thirds appearing in only one release. The team measured empirical failure probabilities, revealing that many tests exhibited extremely sparse failures, with probabilities on the order of 10⁻⁴.
This implies that tens of thousands of executions may be necessary to confidently detect these failures, posing a significant challenge for typical continuous integration budgets. Researchers formally quantified the number of repetitions required to detect failures of varying rarity, establishing a probabilistic characterization of detectability. A public dataset containing the curated set of 290 flaky tests, annotated with results from the 10,000 controlled executions per release, is available at https://zenodo.org/records/17979349.
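One standard way to make this detectability argument concrete (not necessarily the exact formulation used in the paper) is to ask how many independent reruns are needed to observe at least one failure with a given confidence:

    # Sketch: reruns needed to see at least one failure with confidence C,
    # if a test fails independently with probability p per run.
    # P(at least one failure in n runs) = 1 - (1 - p)^n >= C
    #   =>  n >= ln(1 - C) / ln(1 - p).
    # The specific p and C values below are illustrative.
    from math import ceil, log

    def runs_needed(p, confidence=0.95):
        return ceil(log(1 - confidence) / log(1 - p))

    print(runs_needed(1e-4))   # ~29,956 runs for p = 1e-4 at 95% confidence
    print(runs_needed(1e-3))   # ~2,995 runs for p = 1e-3

At p = 10⁻⁴, roughly 30,000 reruns are needed for 95% confidence, consistent with the observation that typical continuous integration budgets fall far short of what reliable detection requires.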
Quantum Test Flakiness: Patterns and Challenges
This research presents the first large-scale dynamic characterization of flaky tests within quantum software, identifying 290 distinct flaky tests across 27,026 test cases in the Terra test suite. The study demonstrates that while overall flakiness rates are low, ranging from 0 to 0.4%, these tests exhibit complex behavior, with approximately two-thirds appearing in only a single software release. Analysis reveals that many quantum flaky tests fail with very small empirical probabilities, requiring tens of thousands of executions to achieve confident detection. This poses a substantial challenge for quantum software testing, as standard continuous integration pipelines often lack the necessary rerun budgets to reliably identify these infrequent failures. The researchers have publicly released a comprehensive dataset of per-test execution outcomes, providing valuable ground truth for future work in quantum software testing and debugging.
👉 More information
🗞 Detecting Flaky Tests in Quantum Software: A Dynamic Approach
🧠 arXiv: https://arxiv.org/abs/2512.18088
