Researchers at Singapore Management University, led by Laura Wynter, have developed a novel methodology to systematically evaluate reasoning fragments and aggregate them using quality-derived weights, offering a principled approach to combining reasoning processes within large language models. These models, while demonstrating proficiency in various expert-level examinations, often exhibit brittleness when applied to specialised, evidence-intensive domains such as law. Errors in these domains stem not only from deficiencies in general world knowledge but also from the models’ inability to discern subtle distinctions between pieces of evidence and their inconsistent application of supporting evidence during reasoning. The team’s work addresses these limitations, aiming to improve the reliability of artificial reasoning systems.
EP-HUBO overcomes limitations in legal reasoning through optimised evidence prioritisation
Implementing EP-HUBO, a method for evidence selection within large language models, raised accuracy on the LEXam benchmark by 23.2 percentage points. Traditional techniques often struggle to prioritise stronger arguments buried within the extensive reasoning chains these models generate. This is a significant barrier in legal reasoning, where nuanced evidence and precise interpretation are paramount and frequently confound standard models. EP-HUBO addresses this by reframing evidence selection as a combinatorial optimisation problem, which allows less frequent yet demonstrably stronger hypotheses to override popular but potentially flawed conclusions that might otherwise dominate the output. The optimisation explores combinations of evidence to identify those that yield the most logically sound and well-supported conclusions.
Initially, a leading large language model favoured a single answer choice on 87.7% of questions in the LEXam benchmark. This points to a tendency resembling confirmation bias, where the model holds to its initial assumption even in the face of contradictory evidence. Supplying the model with HUBO-selected evidence, drawn from the optimised evidence pools, lessened this tendency and yielded an additional 11.4 percentage point accuracy gain, indicating that the method encourages more nuanced and objective reasoning. Experiments were also run on the Dirac-3 photonic entropy quantum machine from Quantum Computing Inc. This hardware achieved results comparable to those obtained via simulated annealing on conventional computers, suggesting that quantum computing approaches could eventually accelerate and enhance this type of reasoning process. The Dirac-3 machine is designed to use quantum effects to explore a vast solution space, which could offer a speed advantage on complex optimisation problems; the results described here concern comparable solution quality.
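For readers unfamiliar with the classical baseline mentioned above, the following is a minimal sketch of simulated annealing on a toy higher-order binary objective. Everything in it is an assumption made for illustration: the `energy` and `anneal` functions, the coefficients, and the four-item instance are hypothetical and are not the authors' objective, data, or solver settings, and no Dirac-3 interface is shown.

```python
import math
import random


def energy(x, linear, pair, triple):
    """Toy higher-order binary objective over selection bits x (lower is better)."""
    e = sum(linear[i] * x[i] for i in range(len(x)))
    e += sum(w * x[i] * x[j] for (i, j), w in pair.items())
    e += sum(w * x[i] * x[j] * x[k] for (i, j, k), w in triple.items())
    return e


def anneal(n, linear, pair, triple, steps=5000, t_start=2.0, t_end=0.01, seed=0):
    """Single-bit-flip simulated annealing with a geometric cooling schedule."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    cur = energy(x, linear, pair, triple)
    best, best_e = x[:], cur
    for step in range(steps):
        t = t_start * (t_end / t_start) ** (step / (steps - 1))
        i = rng.randrange(n)
        x[i] ^= 1  # propose flipping one selection bit
        new = energy(x, linear, pair, triple)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if new <= cur or rng.random() < math.exp((cur - new) / t):
            cur = new
            if cur < best_e:
                best, best_e = x[:], cur
        else:
            x[i] ^= 1  # reject: undo the flip
    return best, best_e


# Hypothetical instance: four evidence items. Negative linear terms reward
# selecting an item; positive pair/triple terms penalise redundant combinations.
linear = [-0.2, -0.3, -0.9, -0.1]
pair = {(0, 1): 0.6}
triple = {(0, 1, 2): 0.5}

best, best_e = anneal(4, linear, pair, triple)
print(best, round(best_e, 3))
```

The cubic term is what makes the objective "higher-order": it penalises a three-way combination that no pairwise term can express. On an instance this small, exhaustive search over all 16 bit strings would do just as well; annealing only becomes worthwhile as the number of evidence items grows.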
The approach parses reasoning fragments into per-hypothesis evidence pools, each containing the evidence relevant to one hypothesis under consideration. The pools are then weighted on three criteria: relevance (how directly the evidence supports the hypothesis), specificity (how precisely the evidence addresses the hypothesis), and distinctiveness (how unique the evidence is compared with other supporting information). This weighting scheme is positioned as an advance in artificial reasoning, particularly in areas lacking thorough pre-existing knowledge. Combining the three factors allows EP-HUBO to identify and prioritise the most compelling evidence, even if it is not the most frequently cited. The initial results are promising but are currently limited to low-contamination domains, meaning the benchmark material has not been extensively “seen” by the models during training. ‘Contamination’ refers to the unintentional inclusion of benchmark data in a language model’s training set, which can artificially inflate performance scores. The benefits of EP-HUBO diminish when benchmark data has already contaminated the model, which raises concerns about broader applicability but does not negate the method’s value as a direction for future work. Further research is needed to assess performance on more widely exposed datasets.
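The difference between frequency-based and quality-weighted aggregation can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the `Fragment` class, the scores, and in particular the choice to multiply the three criteria into a single weight are hypothetical, since the article does not say how the criteria are combined.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Fragment:
    hypothesis: str         # answer choice this reasoning fragment argues for
    relevance: float        # each score is assumed to lie in [0, 1]
    specificity: float
    distinctiveness: float

    def weight(self) -> float:
        # Assumption: combine the criteria by multiplication.
        return self.relevance * self.specificity * self.distinctiveness


def majority_vote(fragments):
    """Baseline: pick the hypothesis that appears most often."""
    return Counter(f.hypothesis for f in fragments).most_common(1)[0][0]


def weighted_choice(fragments):
    """Pool fragments per hypothesis and pick the pool with the largest total weight."""
    pools = {}
    for f in fragments:
        pools[f.hypothesis] = pools.get(f.hypothesis, 0.0) + f.weight()
    return max(pools, key=pools.get)


# Hypothetical scores: three weak fragments for "A", one strong fragment for "B".
fragments = [
    Fragment("A", 0.4, 0.3, 0.2),
    Fragment("A", 0.5, 0.3, 0.2),
    Fragment("A", 0.4, 0.4, 0.1),
    Fragment("B", 0.9, 0.9, 0.8),
]

print(majority_vote(fragments))    # A
print(weighted_choice(fragments))  # B
```

Here the pool for "A" totals 0.07 while the single fragment for "B" scores 0.648, so the less frequent but better-supported hypothesis wins once quality is taken into account.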
Mitigating data contamination is important for enhancing reasoning in artificial intelligence models
Reliably mimicking human reasoning in complex, specialised fields remains a significant challenge despite the rapid advances in artificial intelligence. The success of new techniques, such as EP-HUBO, is contingent upon a vital condition: the absence of pre-existing knowledge within the language model itself regarding the specific benchmark material. This is because pre-existing knowledge can overshadow the model’s ability to engage in genuine reasoning, leading to inflated performance metrics that do not reflect true understanding. Establishing a principled method for aggregating reasoning fragments, as EP-HUBO provides, proves particularly valuable when evaluating large language models in areas demanding careful evidence assessment and logical deduction. This underscores the critical need to address data contamination in artificial intelligence development. EP-HUBO, or Evidence Pool Higher-Order Binary Optimisation, systematically assesses reasoning components, assigning weights based on relevance, specificity and distinctiveness, in stark contrast to simpler methods that merely favour the most frequent answer or the most superficially plausible conclusion. The binary optimisation aspect refers to the process of selecting the best combination of evidence pools to maximise the overall reasoning score.
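For context, HUBO conventionally stands for higher-order unconstrained binary optimisation, and a problem of that kind can be written in the generic form below. The coefficient symbols are placeholders chosen for this illustration; the specific objective used in EP-HUBO is not given in this article.

```latex
\min_{x \in \{0,1\}^n} \; H(x)
  = \sum_{i} a_i x_i
  + \sum_{i<j} b_{ij}\, x_i x_j
  + \sum_{i<j<k} c_{ijk}\, x_i x_j x_k
  + \cdots
```

Each binary variable $x_i$ can be read as "include item $i$ or not". Terms of degree three and above are what distinguish a HUBO from the more familiar quadratic (QUBO) case, and a maximisation problem takes the same form with the sign of $H$ flipped.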
Future work will focus on adapting the method to handle datasets with higher levels of pre-existing model knowledge. Potential strategies include techniques like data augmentation, where new, slightly modified examples are added to the training set to reduce the impact of existing data, and adversarial training, where the model is deliberately exposed to challenging examples designed to expose its weaknesses. Another avenue for exploration is the development of techniques to identify and filter out contaminated data from benchmark datasets. Addressing data contamination is crucial for ensuring that the performance gains observed with EP-HUBO and other advanced reasoning techniques are genuine and reflect a true improvement in the model’s ability to reason, rather than simply its ability to memorise and regurgitate information. The ultimate goal is to create artificial intelligence systems that can not only process information but also understand it, evaluate it critically, and draw sound conclusions based on evidence, mirroring the complexities of human thought.
The research demonstrated that a new method, EP-HUBO, improves reasoning in large language models by optimising the selection of evidence used to support answers. This is important because current models often struggle with tasks requiring careful evidence assessment, such as legal reasoning, and can be misled by common but weak arguments. EP-HUBO systematically weights evidence based on relevance, specificity and distinctiveness, offering a more principled way to aggregate reasoning fragments than simply choosing the most frequent response. The authors intend to adapt this method for datasets where models already possess significant knowledge, and to address the issue of data contamination in benchmark datasets.
👉 More information
🗞 Quantum-Inspired Trace-Augmented Evidence Selection for Reasoning over Structured Hypothesis Spaces
🧠 ArXiv: https://arxiv.org/abs/2606.06941
