AI Testing Focuses on Outcomes, Not Inputs

Researchers are tackling the significant testing challenges posed by increasingly complex artificial intelligence/machine learning (AI/ML) systems and emerging quantum computing software. Lamine Rihani from SAP Labs France, in collaboration with colleagues, presents a novel approach called reverse n-wise output testing to address issues arising from high-dimensional input spaces and probabilistic outputs. This mathematically principled paradigm inverts traditional testing methods by constructing covering arrays based on output characteristics, such as confidence calibration, fairness partitions, and measurement outcome distributions, and then optimising for inputs that elicit these specific behaviours. The framework promises explicit coverage guarantees, improved fault detection, and more efficient validation pipelines for both AI/ML and quantum systems, representing a substantial advance in ensuring the trustworthiness and reliability of these critical technologies.

Scientists are increasingly focused on behavioural testing of AI/ML and quantum systems. Current testing methods struggle with the unpredictable nature of these technologies, leaving room for hidden errors and unreliable results.

These systems, integral to applications like recommendation engines and fraud detection, directly impact financial outcomes and regulatory compliance. Defects in their behaviour can lead to unfair decisions or invalid scientific conclusions, making testing fundamentally more challenging than traditional software testing. Inputs are typically high-dimensional and continuous, outputs are probabilistic, and correctness is defined by behavioural properties such as calibration and fairness, rather than simple functional predicates.

These systems also evolve frequently under MLOps and quantum DevOps practices, necessitating scalable and repeatable testing strategies. N-wise combinatorial testing has proven effective for conventional software by constructing covering arrays over input parameters. However, when applied to AI/ML and quantum systems, this input-centric view has limitations.

Covering all relevant feature combinations does not guarantee that important regions of the output space are exercised. For ML models, properties like confidence calibration and fairness are defined over collections of predictions, not feature tuples. For quantum programs, correctness is expressed through measurement outcome distributions and error syndrome patterns, requiring repeated execution and statistical characterisation.
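
To make this concrete, consider confidence calibration. The following minimal sketch is not taken from the paper; it assumes a simple equal-width bucketing scheme, but it shows why the property only exists at the level of a prediction collection: each bucket's accuracy is an aggregate over many predictions, not a function of any single feature tuple.

```python
import numpy as np

def calibration_buckets(confidences, correct, n_buckets=5):
    """Partition predictions into equal-width confidence buckets and
    report per-bucket accuracy. Miscalibration shows up as a gap
    between a bucket's mean confidence and its empirical accuracy."""
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    report = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # half-open buckets; a confidence of exactly 1.0 would need
        # special handling, which this sketch omits
        mask = (confidences >= lo) & (confidences < hi)
        if mask.any():
            report.append({
                "bucket": (round(lo, 2), round(hi, 2)),
                "mean_confidence": float(confidences[mask].mean()),
                "accuracy": float(correct[mask].mean()),
            })
    return report

# The property is defined over the whole collection, not a single input:
conf = np.array([0.55, 0.62, 0.71, 0.88, 0.93, 0.97])
hits = np.array([0, 1, 1, 1, 1, 1])
for row in calibration_buckets(conf, hits):
    print(row)
```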

Consequently, input-centric testing often yields high nominal input coverage but poor coverage of critical behavioural phenomena. This paper introduces reverse n-wise output testing, a paradigm inversion of traditional combinatorial testing. It constructs covering arrays directly over semantically meaningful output equivalence classes, such as ML confidence calibration buckets, fairness partitions, and quantum measurement outcome/error syndrome distributions, rather than over inputs.

The framework then solves the computationally challenging black-box inverse mapping problem via gradient-free metaheuristic optimisation to derive input configurations capable of eliciting targeted behavioural signatures from opaque models. This delivers synergistic benefits, including explicit customer-centric prediction coverage guarantees, substantial improvements in fault detection rates, enhanced test suite efficiency, and structured MLOps/quantum validation pipelines with automated partition discovery from uncertainty analysis and coverage drift monitoring.

The work addresses how to reorient combinatorial testing from inputs to outputs, enabling n-wise coverage over semantically meaningful behaviours of black-box AI/ML and quantum systems while keeping test generation practical. The main contributions are fourfold: a formal output-centric combinatorial framework introducing output covering arrays with abstract output dimensions; a reverse n-wise testing methodology for AI/ML and quantum systems, covering output partition design, covering array generation, and solution of the inverse mapping problem; an empirical demonstration on an industrial-style ML component, evaluating behavioural coverage and fault detection against a traditional input-centric baseline; and a discussion of integration into MLOps/quantum validation pipelines, complementarity with property-based and metamorphic testing, and new research directions in behavioural coverage criteria and intelligent test input synthesis.

Classical combinatorial testing is grounded in covering array theory, constructing test suites to ensure all t-way combinations of input parameter values appear in at least one test case. This approach has been widely adopted, but remains fundamentally input-centric, treating outputs only as pass/fail oracles. Consequently, coverage guarantees are expressed over input interaction spaces, not behavioural properties.
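
As a reference point for the classical approach, pairwise (t = 2) generation can be sketched with a naive greedy algorithm. This is illustrative only, with invented parameter names; production covering-array generators are far more sophisticated, but the coverage guarantee is the same: every value pair from any two parameters appears in at least one test.

```python
from itertools import combinations, product

def pairwise_covering_array(domains):
    """Greedy construction of a 2-way (pairwise) covering array.
    Far from minimal, but it makes the coverage guarantee concrete."""
    params = list(domains)
    uncovered = {((p, v), (q, w))
                 for p, q in combinations(params, 2)
                 for v in domains[p]
                 for w in domains[q]}

    def pairs_of(test):
        return {((params[i], test[i]), (params[j], test[j]))
                for i, j in combinations(range(len(params)), 2)}

    tests = []
    while uncovered:
        # choose the candidate covering the most still-uncovered pairs
        best = max(product(*domains.values()),
                   key=lambda t: len(pairs_of(t) & uncovered))
        tests.append(dict(zip(params, best)))
        uncovered -= pairs_of(best)
    return tests

suite = pairwise_covering_array({
    "browser": ["chrome", "firefox"],
    "os": ["linux", "windows", "macos"],
    "locale": ["en", "fr"],
})
print(len(suite), "tests instead of 12 exhaustive combinations")
```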

Testing of AI/ML systems has focused on constructing test scenarios over input features, leveraging covering array efficiency but maintaining an input-centric perspective. While broader frameworks emphasise behavioural properties like calibration and fairness, existing ML testing approaches typically address these properties individually, lacking n-wise coverage guarantees over combinations of behavioural dimensions.

Quantum software testing is an emerging area dealing with probabilistic measurement outcomes and hardware noise. Existing work combines formal reasoning, statistical checking, and small-scale test generation. Combinatorial testing has been applied to quantum programs by defining parameters over quantum inputs and generating t-way combinations, evaluating resulting measurement statistics for faults.
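
The statistical oracles used in that line of work are not spelled out in this summary, but the flavour is a distribution-level check rather than a single expected value. A minimal sketch, using an invented Bell-state example, might compare observed measurement counts against the ideal outcome distribution:

```python
from collections import Counter

def check_distribution(counts, expected, shots):
    """Compare observed measurement counts to an ideal outcome
    distribution: a Pearson chi-square statistic over outcomes with
    non-zero probability, plus a hard flag for any outcome the ideal
    circuit should never produce at all."""
    impossible = [o for o, c in counts.items()
                  if expected.get(o, 0.0) == 0.0 and c > 0]
    stat = sum((counts.get(o, 0) - p * shots) ** 2 / (p * shots)
               for o, p in expected.items() if p > 0)
    return stat, impossible

observed = Counter({"00": 483, "11": 502, "01": 9, "10": 6})
ideal = {"00": 0.5, "11": 0.5}
stat, flags = check_distribution(observed, ideal, shots=1000)
print(stat, flags)  # small statistic, but "01"/"10" flagged as impossible
```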

These approaches extend classical ideas but remain input-centric, lacking explicit n-wise coverage guarantees over measurement outcome categories or error syndromes. Input-output relation combinatorial testing attempts to integrate outputs by encoding expected relationships between inputs and outputs within a search-based optimisation framework. However, the underlying covering array remains defined over input parameters, with outputs appearing only as constraints or objectives in the generation process, and there is no explicit notion of covering the output space.

Eliciting system behaviours through output-focused combinatorial testing

Reverse n-wise output testing begins with the construction of output covering arrays, a departure from traditional input-centric methods. These arrays are designed directly over domain-specific output equivalence classes, such as machine learning confidence calibration buckets or quantum error syndrome patterns, effectively defining the combinatorial factors as output dimensions rather than input parameters.
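
A sketch of this first step for an ML classifier might look as follows. The output dimensions here are invented for illustration; in the paper the partitions come from domain analysis. The key point is that the combinatorial factors describe behaviour, and a test cannot set these values directly, it must elicit them.

```python
from itertools import combinations

# Hypothetical output dimensions: categories of model *behaviour*,
# not input parameters.
output_dims = {
    "confidence_bucket": ["low", "medium", "high"],
    "fairness_group":    ["group_a", "group_b"],
    "predicted_class":   ["approve", "reject"],
}

# 2-way output coverage targets: every pair of categories drawn from
# any two output dimensions must be elicited by some test.
targets = {((d1, v1), (d2, v2))
           for d1, d2 in combinations(output_dims, 2)
           for v1 in output_dims[d1]
           for v2 in output_dims[d2]}

print(len(targets), "output pairs to cover")  # 3*2 + 3*2 + 2*2 = 16
```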

This innovative approach allows for explicit guarantees of customer-centric prediction and measurement coverage, focusing on the observable behaviours of a system rather than its internal workings. Following array creation, the research addresses the computationally challenging problem of black-box inverse mapping. This involves employing gradient-free metaheuristic optimisation techniques to synthesize input feature configurations or circuit parameters.

The goal is to elicit targeted behavioural signatures from opaque models, meaning systems where the internal logic is not directly accessible. This process is crucial for identifying inputs that will trigger specific, pre-defined outputs, enabling focused testing and fault detection. To demonstrate the methodology, the study applied reverse n-wise testing to a representative black-box machine learning component within an enterprise analytics workflow.
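
The paper solves this inverse mapping with gradient-free metaheuristic optimisation; the specific algorithm is beyond this summary, but plain random search conveys the shape of the problem. Everything in the sketch below, including the toy model and the distance function, is invented for illustration; the budget of 150 mirrors the roughly 150 oracle calls per tuple reported in the results further down.

```python
import random

def elicit(model, target, sample_input, distance, budget=150):
    """Gradient-free random search for an input whose observed output
    signature matches a target output tuple. The model is a black box;
    `distance` scores how far an observed signature is from the target."""
    best_x, best_d = None, float("inf")
    for _ in range(budget):
        x = sample_input()                 # candidate from the input domain
        d = distance(model(x), target)
        if d < best_d:
            best_x, best_d = x, d
        if best_d == 0:                    # exact behavioural match found
            break
    return best_x, best_d

# Invented toy black box: two features mapped to a confidence bucket.
def toy_model(x):
    score = 0.7 * x[0] + 0.3 * x[1]
    return ("high" if score > 0.8 else "medium" if score > 0.4 else "low",)

x, d = elicit(toy_model, ("high",),
              sample_input=lambda: (random.random(), random.random()),
              distance=lambda obs, tgt: sum(a != b for a, b in zip(obs, tgt)))
print(x, d)  # an input eliciting the "high" bucket, or the closest found
```

A real implementation would swap the random sampler for a metaheuristic such as simulated annealing or a genetic algorithm, but the contract is the same: only oracle calls to the black box, no gradients.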

Output tuples were defined over prediction confidence, temporal aggregations, and geographical regions, allowing a comprehensive evaluation of behavioural coverage. In parallel, the researchers injected behavioural faults to assess the framework's ability to detect failures in calibration, boundary conditions, and error syndromes, providing a robust validation of its performance.
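
The injected faults are described only at the category level in this summary. As a hedged sketch of what one such fault might look like, a confidence-miscalibration wrapper around a black-box classifier could be as simple as:

```python
def inject_miscalibration(model, inflate=1.25):
    """Wrap a black-box classifier so that reported confidences are
    inflated: predicted labels are unchanged, but the confidence signal
    no longer matches empirical accuracy, i.e. a calibration fault."""
    def faulty(x):
        label, confidence = model(x)
        return label, min(1.0, confidence * inflate)
    return faulty

# A suite that deliberately covers high-confidence output buckets routes
# many predictions through the inflated region, exposing the fault; an
# input-centric suite may reach those buckets only by accident.
```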

Superior fault detection and behavioural coverage via reverse n-wise output testing

Reverse n-wise output testing achieved 96.8% output coverage (OCov2) using 189 tests, a substantial improvement over the 62.3% attained by input-centric covering array testing with the same number of tests. This is a 55% relative increase in behavioural coverage, demonstrating the effectiveness of constructing covering arrays directly over output equivalence classes.

The research further quantified performance with a metric η2, measuring 322 tuples per test, indicating a high density of targeted behavioural signatures elicited from the models. Critically, the approach detected 100% of the eight injected faults, which represent realistic machine learning failure modes, a result unmatched by the other methods tested. Fault detection rates varied by category, with reverse n-wise testing successfully identifying all instances of confidence miscalibration, sex- and age-based fairness violations, and boundary misclassifications.
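
The paper's formal definition of OCov2 is not reproduced in this summary; read as 2-way output coverage, it could be computed along these lines (dimension names invented):

```python
from itertools import combinations

def ocov2(signatures, output_dims):
    """Share of all 2-way output tuples hit by at least one test.
    Each signature maps an output dimension to its observed category."""
    all_pairs = {((d1, v1), (d2, v2))
                 for d1, d2 in combinations(output_dims, 2)
                 for v1 in output_dims[d1]
                 for v2 in output_dims[d2]}
    covered = {((d1, sig[d1]), (d2, sig[d2]))
               for sig in signatures
               for d1, d2 in combinations(output_dims, 2)}
    return len(covered & all_pairs) / len(all_pairs)

dims = {"confidence": ["low", "med", "high"], "region": ["eu", "us"]}
runs = [{"confidence": "high", "region": "eu"},
        {"confidence": "low", "region": "us"}]
print(ocov2(runs, dims))  # 2 of 6 confidence-region pairs covered
```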

In contrast, input-centric methods achieved fault detection rates ranging from 25% to 75%, highlighting the limitations of focusing solely on input feature space. The systematic targeting of n-way interactions, specifically combinations of confidence, fairness, and calibration, proved essential for uncovering these complex failure scenarios. The study established the practical scalability of reverse n-wise testing, completing all 189 tests in 1.8 minutes using parallel processing.

OCA (output covering array) generation required less than 10 seconds, while the inverse mapping process converged in 150 oracle calls per tuple. This represents a 1.5x efficiency gain over random sampling, confirming the method’s viability for integration into industrial machine learning pipelines. Furthermore, the test budget of 189 tests represents only 0.0007% of the exhaustive test space (26 million tests), demonstrating significant reduction in testing effort.

Targeting behaviours unlocks systematic testing of AI and quantum systems

The relentless pursuit of trustworthy artificial intelligence has long been hampered by a fundamental mismatch between how we test software and how these complex machine learning systems actually behave. Traditional testing relies on meticulously mapping inputs to outputs, but AI models operate as opaque ‘black boxes’ where even slight input changes can yield unpredictable results.

This work offers a compelling shift in perspective, moving away from exhaustive input searches and towards a strategy of deliberately targeting specific, potentially problematic, outputs. By constructing tests based on desired or undesirable model behaviours (calibration errors, fairness violations, even specific quantum measurement outcomes), researchers have created a framework for systematically probing the boundaries of AI and quantum systems.

This isn’t simply about finding bugs; it’s about guaranteeing a certain level of ‘coverage’ across the full spectrum of possible behaviours, a crucial step towards regulatory compliance as seen with the recent EU AI Act. However, scaling this ‘reverse n-wise output testing’ presents significant challenges. While the initial demonstrations are promising, extending coverage to more complex scenarios, particularly in areas like computer vision and natural language processing, will require substantial computational resources and clever algorithmic improvements.

The success of automated partition discovery, which identifies meaningful groupings of outputs, will be key. Furthermore, establishing theoretical guarantees about the effectiveness of this inverse mapping process remains an open question. Looking ahead, the true potential of this approach lies in its integration into continuous integration and deployment pipelines.

Imagine AI systems that not only learn and adapt but also continuously validate their own behaviour against pre-defined quality criteria. This isn’t just about building better AI; it’s about building AI we can demonstrably trust.

👉 More information
🗞 Reverse N-Wise Output-Oriented Testing for AI/ML and Quantum Computing Systems
🧠 ArXiv: https://arxiv.org/abs/2602.14275

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
