LLMs assessed: new method improves open-ended question answering evaluation.

The assessment of open-ended question answering remains a significant challenge in natural language processing, particularly as large language models (LLMs) become increasingly sophisticated. Traditional evaluation metrics often fail to capture the nuances of semantic similarity, while current LLM-based approaches frequently lack transparency. Researchers now propose a method to address these limitations by recognising the fundamental difference between questions demanding factual recall and those requiring more complex reasoning. Yongqi Fan, Yating Wang, Guandong Wang, Jie Zhai, Jingping Liu, Qi Ye, and Tong Ruan, all from the School of Information Science and Engineering at East China University of Science and Technology, detail their approach in the article, “MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs”.

Their work introduces MinosEval, a novel evaluation method that adapts its scoring strategy based on whether a question seeks factual information or a more subjective response, aiming for improved alignment with human judgement and greater interpretability of results.

Automatic evaluation of open-ended question answering systems currently relies heavily on metrics such as ROUGE and BERTScore; however, these often struggle to capture the semantic similarity between generated responses and ideal answers. This limitation stems from the inherent difficulty of quantifying nuanced reasoning and diverse expression, both crucial elements of effective open-ended question answering, and it motivates the development of more sophisticated evaluation techniques. Consequently, the researchers introduce MinosEval, a novel evaluation method designed to address these shortcomings with a more nuanced and adaptive approach to assessment.
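To see why lexical-overlap metrics fall short, consider the minimal sketch below: it uses a simple token-level F1 as a stand-in for ROUGE-style scoring (an illustrative simplification, not the paper's setup) and shows how a perfectly correct paraphrase is penalised simply because it shares few surface words with the reference.

```python
# Minimal sketch: token-level F1 as a stand-in for ROUGE-style lexical overlap.
# A correct paraphrase scores poorly because it shares few words with the reference.

def token_f1(reference: str, candidate: str) -> float:
    """Harmonic mean of token precision and recall between two strings."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    overlap = sum(min(ref_tokens.count(t), cand_tokens.count(t)) for t in set(cand_tokens))
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

reference = "Water boils at 100 degrees Celsius at sea level."
paraphrase = "At standard atmospheric pressure, the boiling point of water is 100 degrees."

print(round(token_f1(reference, paraphrase), 3))  # low score despite a correct answer
```

The same weakness motivates semantically aware, LLM-based evaluation: the judgement should depend on whether the key information is present, not on how many words are reused.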

MinosEval functions by initially classifying questions as either ‘factoid’ or ‘non-factoid’, recognising that different evaluation strategies are appropriate for each type of inquiry. Factoid questions demand concise, factual answers, for example, ‘What is the capital of France?’, while non-factoid questions require more elaborate and nuanced responses, such as ‘Discuss the ethical implications of artificial intelligence’. For factoid questions, MinosEval employs an adaptive key-point scoring strategy, focusing on identifying and assessing crucial information within the response, moving beyond simple lexical overlap – that is, matching words. Conversely, for non-factoid questions, the method utilises an instance-aware listwise ranking strategy, evaluating responses in relation to one another, providing a more holistic assessment of quality and coherence. This ranking considers the entire set of generated answers, rather than evaluating each in isolation.
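The sketch below shows how such a two-branch evaluator might be wired together. It is a simplified illustration under stated assumptions, not the authors' implementation: the prompts, the `call_llm` helper, and the rank-to-score conversion are all hypothetical placeholders standing in for the paper's adaptive key-point scoring and instance-aware listwise ranking.

```python
# Hypothetical sketch of a two-branch evaluator in the spirit of MinosEval.
# `call_llm` is a placeholder for any chat-completion API; prompts and scoring
# details are illustrative assumptions, not the published method.

from typing import Callable, List


def evaluate_answers(question: str, reference: str, candidates: List[str],
                     call_llm: Callable[[str], str]) -> List[float]:
    """Score candidate answers, branching on whether the question is factoid."""
    q_type = call_llm(
        f"Classify this question as 'factoid' or 'non-factoid':\n{question}"
    ).strip().lower()

    if q_type == "factoid":
        # Key-point scoring: extract the facts required by the reference answer,
        # then count how many of them each candidate covers.
        key_points = [
            kp for kp in call_llm(
                f"List the key factual points needed to answer:\n{question}\n"
                f"Reference answer: {reference}"
            ).splitlines() if kp.strip()
        ]
        scores = []
        for cand in candidates:
            covered = sum(
                call_llm(f"Does the answer '{cand}' cover the point '{kp}'? Reply yes or no.")
                .strip().lower().startswith("yes")
                for kp in key_points
            )
            scores.append(covered / max(len(key_points), 1))
        return scores

    # Non-factoid: listwise ranking, judging all candidates relative to one another.
    ranking = call_llm(
        "Rank the following answers from best to worst and return their indices "
        f"as a comma-separated list.\nQuestion: {question}\n"
        + "\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    )
    order = [int(i) for i in ranking.replace(" ", "").split(",") if i.isdigit()]
    n = len(candidates)
    scores = [0.0] * n
    for pos, idx in enumerate(order):
        if 0 <= idx < n:
            scores[idx] = (n - pos) / n  # best-ranked answer gets the highest score
    return scores
```

Note that the non-factoid branch takes the whole set of candidates in a single prompt, reflecting the listwise design: responses are judged relative to each other rather than in isolation.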

Experimental results, conducted across multiple open-ended question answering datasets, including newly constructed resources designed to expand existing community benchmarks, demonstrate that MinosEval exhibits a stronger correlation with human annotations compared to traditional metrics. This improved alignment suggests that MinosEval provides a more reliable and accurate assessment of response quality, offering a more nuanced understanding of system performance. Furthermore, the method’s design facilitates greater interpretability, offering insights into the reasoning behind its evaluations, enabling researchers to pinpoint strengths and weaknesses in generated responses.
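Alignment with human judgement of this kind is usually quantified with rank correlation between the evaluator's scores and human annotations. The snippet below shows the standard calculation; the score lists are made-up placeholders for illustration, not results from the paper.

```python
# How metric-human alignment is typically measured: rank correlation between
# automatic scores and human annotations. The values below are placeholders.

from scipy.stats import kendalltau, spearmanr

human_scores = [4, 2, 5, 3, 1, 4, 2]                         # hypothetical human ratings
metric_scores = [0.82, 0.41, 0.90, 0.55, 0.20, 0.74, 0.48]   # hypothetical evaluator outputs

rho, _ = spearmanr(human_scores, metric_scores)
tau, _ = kendalltau(human_scores, metric_scores)
print(f"Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```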

The study highlights the importance of tailoring evaluation strategies to the specific characteristics of question types, moving away from the limitations of one-size-fits-all approaches. By distinguishing between factoid and non-factoid questions, MinosEval overcomes a key limitation of existing methods, which often apply a uniform assessment framework regardless of question complexity. Researchers constructed comprehensive datasets, encompassing both established benchmarks and newly created resources, to validate the effectiveness of MinosEval, providing a more challenging evaluation environment.

MinosEval's adaptive approach and emphasis on interpretability make it a valuable tool for researchers and developers alike. Future work could explore metrics that better assess the creativity and originality of responses, or that account for the context of a conversation. Further validation across a wider range of datasets and question types will also be crucial to confirm its generalisability and robustness.

The research team plans to release the MinosEval code and datasets to the public, enabling other researchers to reproduce their results and build upon their work. They also plan to organise workshops and tutorials to train other researchers on how to use MinosEval and interpret its results. The development of MinosEval was supported by funding from several sources, including government grants and industry partnerships. The research team is grateful for the support of these funding agencies and industry partners.

The study concludes that MinosEval represents a significant advancement in the field of automatic evaluation of open-ended question answering systems, and it paves the way for more effective system development and evaluation. The research team is committed to continuing to improve MinosEval and to sharing their work with the broader research community.

👉 More information
🗞 MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs
🧠 DOI: https://doi.org/10.48550/arXiv.2506.15215
