The challenge of building question-answering systems that work reliably across diverse languages remains a significant hurdle in artificial intelligence, and researchers are increasingly turning to languages beyond English to address it. Parker Riley, Siamak Shakeri, and Waleed Ammar, from Google and Holistic Intelligence for Global Good, along with Jonathan H. Clark, introduce TyDi QA-WANA, a new benchmark dataset designed to test these systems in ten language varieties of West Asia and North Africa. This dataset, comprising over 28,000 examples, uniquely focuses on information-seeking questions, those genuinely asked out of curiosity, and uses full articles as context, demanding more sophisticated text understanding. By collecting questions directly in each language variety, rather than relying on translations, the team avoids potential cultural biases and provides a more authentic test of a system's ability to access and utilise information across a wider range of linguistic contexts.
Many users worldwide regularly use technology to help them answer information-seeking questions. This creates a need for models capable of effectively utilising large text contexts to find answers, particularly when those answers are embedded within lengthy articles. To evaluate this ability, the researchers created a question-answering dataset in which each question is paired with an entire article that may or may not contain the answer. The relatively large size of these articles makes the task well suited to assessing models' capacity to process and understand extensive textual information. Furthermore, the data was collected directly in each language variety, without the use of translation, to avoid issues of cultural relevance and ensure accurate evaluation across different linguistic contexts. The team has released the code and data to facilitate further improvement by the research community.
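To make the task format concrete, here is a minimal sketch in Python of what a single example in a dataset of this shape might look like. The field names, language code, and sample question are illustrative assumptions, not the released TyDi QA-WANA schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class QAExample:
    """One hypothetical example record; field names are assumptions,
    not the actual TyDi QA-WANA file format."""
    question: str                            # information-seeking question, written natively
    language: str                            # one of the ten language varieties
    article: str                             # full Wikipedia article used as context
    answer_span: Optional[Tuple[int, int]]   # (start, end) offsets, or None if unanswerable

example = QAExample(
    question="متى تأسست جامعة القاهرة؟",   # "When was Cairo University founded?"
    language="arabic-egyptian",              # hypothetical language code
    article="... full article text ...",
    answer_span=None,                        # this article does not answer the question
)
```

The key point the format captures is that the context is an entire article, not a pre-selected passage, and that "no answer" is a legitimate label.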
Minimal Span Extraction for Low-Resource Languages
Researchers introduced TyDi QA-WANA, a new dataset designed for evaluating information-seeking question answering in ten language varieties with relatively little pre-training data. The dataset emphasises minimal span extraction, requiring models to identify the shortest text segment that answers a given question within a larger context. The varieties covered include Arabic, Azerbaijani, Farsi (Persian), and Uzbek, with context articles sourced from Wikipedia. A key feature of the dataset is its focus on testing models' ability to handle long contexts, which is increasingly important with the development of larger language models. The questions are designed to require finding specific information within the provided text, with answers expected to be minimal spans.
The dataset's size allows for robust evaluation of model performance. The authors used modern Large Language Models (LLMs) as baselines, employing them in a zero-shot setting without any fine-tuning, and report performance metrics for how accurately the models extract answers. Analysis of the results reveals performance variations across languages and question types, including a breakdown of performance on answerable versus unanswerable questions. The study demonstrates that modern LLMs can extract answers from long contexts in under-represented languages. The authors are releasing the dataset and code to facilitate further research, and the dataset provides a valuable benchmark for evaluating long-context question answering in diverse languages.
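As a rough illustration of a zero-shot baseline of this kind, the sketch below builds an extraction prompt and applies a simple exact-match check. The instruction wording and the normalisation are assumptions for illustration; the paper's actual prompts and metrics are not reproduced here.

```python
def build_prompt(question: str, article: str) -> str:
    """Format a zero-shot extraction prompt.

    The instruction wording is an illustrative assumption,
    not the prompt used in the paper.
    """
    return (
        "Read the article and answer the question with the shortest "
        "exact span copied from the article, or reply 'unanswerable'.\n\n"
        f"Article:\n{article}\n\nQuestion: {question}\nAnswer:"
    )

def exact_match(prediction: str, gold: str) -> bool:
    """Whitespace-normalised exact match; real evaluations usually add
    language-aware normalisation (e.g. for Arabic orthography)."""
    normalise = lambda s: " ".join(s.strip().split())
    return normalise(prediction) == normalise(gold)
```

In practice an evaluation harness would also score unanswerable questions, checking whether the model correctly declines to produce a span.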
TyDi QA-WANA Challenges Multilingual Question Answering
Researchers have created TyDi QA-WANA, a new question-answering dataset designed to challenge modern artificial intelligence systems and improve their ability to understand and respond to questions in under-represented languages. The dataset comprises over 28,000 questions spanning ten language varieties of West Asia and North Africa, focusing on eliciting genuine information-seeking queries rather than questions with simple, readily available answers. This approach ensures the task requires more sophisticated reasoning and comprehension from the AI models tested. The creation of TyDi QA-WANA addresses a significant gap in current AI evaluation, as many existing datasets are limited to English or a small number of widely used languages, and often lack the complexity needed to assess advanced long-context models.
The researchers specifically focused on languages with limited pre-training data, recognising that AI performance tends to be weaker in these areas, and designed the dataset to utilise entire articles as context, testing a model's ability to process and extract information from substantial amounts of text. This is particularly relevant given recent advances in long-context language models capable of handling over a million input tokens. The data collection process involved prompting annotators with excerpts from Wikipedia articles and asking them to formulate questions they were genuinely curious about, rather than questions directly answered by the prompt. The researchers then retrieved relevant Wikipedia articles and had human annotators identify the minimal span of text that answers each question, or indicate that the question cannot be answered from the provided article; a small validation check of this kind is sketched below.
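A hedged sketch of how such span annotations can be sanity-checked, assuming character-offset spans (the released data may index answers differently):

```python
def validate_annotation(article: str, start: int, end: int, answer_text: str) -> bool:
    """Return True if the annotated (start, end) character offsets
    reproduce the recorded answer text exactly.

    Assumes character offsets into the raw article string; the
    released TyDi QA-WANA files may use a different indexing scheme.
    """
    return article[start:end] == answer_text

article = "Cairo University was founded in 1908."
assert validate_annotation(article, 32, 36, "1908")  # minimal span, not the full sentence
```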
This new dataset allows for a more robust evaluation of AI systems’ ability to handle complex, information-seeking questions in diverse languages, and provides a valuable resource for improving AI performance in under-represented linguistic regions. By focusing on long-context comprehension and genuine curiosity, TyDi QA-WANA pushes the boundaries of current AI capabilities and paves the way for more inclusive and effective language technologies. The dataset and associated code are publicly available to encourage further research and development in this critical area.
TyDi QA-WANA Dataset Challenges Question Answering
The research presents TyDi QA-WANA, a new question-answering dataset comprising over 28,000 examples in ten language varieties of West Asia and North Africa. This dataset distinguishes itself by focusing on eliciting genuine information-seeking questions, rather than those easily answered with simple techniques, and by collecting data directly in each language variety to ensure cultural relevance. The task requires models to identify minimal answer spans within provided articles, or to determine that no answer exists, presenting a challenging evaluation of text-comprehension abilities. The creation of TyDi QA-WANA addresses a gap in available resources for question-answering systems in these language varieties.
By focusing on minimal answer spans, the dataset encourages the development of models capable of precise text understanding and avoids rewarding overly verbose responses; a sketch of a span-overlap metric that captures this is shown below. The authors acknowledge that the articles for the Arabic varieties are written in Modern Standard Arabic while the questions are in the local varieties, which may introduce a linguistic mismatch between questions and their contexts. They have released both the dataset and associated code to facilitate further research and improvement in question-answering technologies for these languages.
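To illustrate why minimal spans discourage verbose answers, here is a token-overlap F1 metric in the style of SQuAD evaluations. This is a common scoring scheme for span-based QA, shown as an assumption about how such evaluation typically works rather than as the paper's exact evaluation code.

```python
from collections import Counter

def span_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and gold answer span.

    A standard metric for span-based QA; shown to illustrate why
    verbose answers are penalised, not as the paper's exact metric.
    """
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# A padded but correct answer scores below a minimal one:
# span_f1("founded in 1908 by decree", "1908") ≈ 0.33 < span_f1("1908", "1908") == 1.0
```

Under this kind of metric, a correct but padded answer earns only partial credit, so models are rewarded for returning exactly the minimal span.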
👉 More information
🗞 TyDi QA-WANA: A Benchmark for Information-Seeking Question Answering in Languages of West Asia and North Africa
🧠 DOI: https://doi.org/10.48550/arXiv.2507.17709
