Synthetic Data for Multilingual AI: 9.5M Data Points Enhance Indic Language Systems

Creating artificial intelligence systems that function effectively across multiple languages and cultures presents a significant challenge, especially for languages with limited digital resources. Pranjal Chitale from Microsoft Corporation, Varun Gumma from Nanyang Technological University, Sanchit Ahuja from Northeastern University, and colleagues address this issue by exploring the potential of synthetic data for Indian languages. Their work investigates a novel method for generating culturally relevant datasets: prompting large language models to create data grounded in language-specific Wikipedia content, an alternative to simply translating existing datasets from languages like English. The team introduces Updesh, a substantial synthetic dataset comprising 9.5 million data points across 13 Indian languages, and demonstrates that models trained on this data achieve substantial performance gains on generative tasks and reduce performance gaps between languages with varying levels of digital resources. The result is compelling evidence for the importance of culturally grounded data curation in multilingual AI development.

Detailed Evaluation of Multilingual Language Models

This appendix details the methodology and evaluation procedures used in the study, providing sufficient detail for reproducibility and rigorous assessment. The team documents the experiments, scoring rubrics, and prompts used throughout. Model performance was assessed using backtranslation, a round-trip technique that measures translation quality and preservation of meaning, which is crucial for evaluating reasoning abilities across languages. Training hyperparameters, including the base model, sequence length, batch size, and optimization algorithms, are reported in full. Evaluation drew on diverse datasets covering natural language understanding, generation, instruction following, and multilingual coverage, including MMLU and IFBench, with a focus on Indian languages. A dedicated prompt translated English instructions into the target languages for the IFEval and IFBench datasets, and standardized evaluation prompts with detailed scoring rubrics were used to assess model outputs objectively.
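The appendix itself is not reproduced here, but the backtranslation idea is simple to sketch. The snippet below is a minimal illustration rather than the study's actual tooling: the `translate` callable is a placeholder for whatever MT system or LLM call is available, and the string-overlap similarity is a cheap stand-in for the stronger metrics (such as embedding similarity) a real evaluation would likely use.

```python
from difflib import SequenceMatcher
from typing import Callable

def backtranslation_score(
    source_en: str,
    target_lang: str,
    translate: Callable[[str, str, str], str],
) -> float:
    """Round-trip a sentence English -> target language -> English,
    then score how much of the original survives the round trip.

    `translate(text, src, tgt)` is a placeholder for any MT system
    or LLM call; it is not part of the paper's released tooling.
    """
    # Forward pass: English into the target language.
    forward = translate(source_en, "en", target_lang)
    # Backward pass: target language back into English.
    round_trip = translate(forward, target_lang, "en")
    # Surface-level similarity as a cheap proxy for meaning
    # preservation; embedding similarity would be more robust.
    return SequenceMatcher(None, source_en.lower(), round_trip.lower()).ratio()
```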

Wikipedia Grounded Multilingual Data Generation

This study introduces a novel framework for generating high-quality, multilingual, and multicultural synthetic data, creating a large-scale dataset for 13 Indian languages. Moving beyond traditional translation methods, the researchers implemented a bottom-up generation strategy grounded in language-specific Wikipedia content to ensure cultural relevance and linguistic accuracy. They prompted large language models exceeding 235 billion parameters to generate data based on retrieved Wikipedia passages, building a dataset of 9.5 million data points that spans diverse reasoning and generative tasks, with an emphasis on long-context, multi-turn capabilities and alignment with Indian cultural contexts.
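To make the bottom-up strategy concrete, here is a hypothetical sketch of what grounding generation in native Wikipedia text might look like. The prompt wording, the `call_llm` callable, and the output format are all illustrative assumptions; the paper's actual prompts are documented in its appendix.

```python
import textwrap
from typing import Callable

# Illustrative prompt template; the study's real prompts may differ.
PROMPT_TEMPLATE = textwrap.dedent("""\
    You are creating training data in {language}.
    Using ONLY the passage below, write one instruction and a detailed,
    culturally appropriate answer, both entirely in {language}.

    Passage (from the {language} Wikipedia article "{title}"):
    {passage}

    Return exactly:
    Instruction: ...
    Answer: ...
    """)

def generate_grounded_pair(
    passage: str,
    title: str,
    language: str,
    call_llm: Callable[[str], str],
) -> str:
    """Bottom-up generation: the model is conditioned on native
    Wikipedia text rather than on translated English data, so names,
    facts, and register stay local to the target language."""
    prompt = PROMPT_TEMPLATE.format(
        language=language, title=title, passage=passage)
    return call_llm(prompt)
```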

The data generation process began with a small set of human-curated seeds. Data quality was then verified through comprehensive evaluation combining automated metrics with over 10,000 human assessments. For downstream evaluation, models were fine-tuned on the generated dataset and assessed across 15 diverse multilingual datasets, demonstrating significant gains on generative tasks and competitive results on multiple-choice natural language understanding tasks, particularly in low- and medium-resource languages.
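The summary does not spell out the fine-tuning recipe, but a run of the kind described could look roughly like the following Hugging Face sketch. The base model name, data file, and every hyperparameter here are placeholders, not the paper's actual choices (which its appendix documents).

```python
# Minimal supervised fine-tuning sketch; all names and values below
# are illustrative stand-ins, not the paper's reported configuration.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL = "meta-llama/Llama-3.1-8B"  # hypothetical base model

tokenizer = AutoTokenizer.from_pretrained(MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for padding
model = AutoModelForCausalLM.from_pretrained(MODEL)

def tokenize(batch):
    # Join instruction and answer into one causal-LM training sequence.
    text = [i + "\n" + a for i, a in zip(batch["instruction"], batch["answer"])]
    return tokenizer(text, truncation=True, max_length=4096)

data = load_dataset("json", data_files="updesh_sample.jsonl")["train"]
data = data.map(tokenize, batched=True,
                remove_columns=["instruction", "answer"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="sft-out",
        per_device_train_batch_size=4,  # illustrative
        num_train_epochs=1,
        learning_rate=2e-5,             # common SFT default
        bf16=True,
    ),
    train_dataset=data,
    # mlm=False gives standard next-token (causal) LM labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```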

Updesh Dataset Boosts Indian Language AI Performance

Scientists have developed Updesh, a high-quality synthetic dataset designed to improve artificial intelligence performance across multiple Indian languages and cultural contexts, addressing the challenge of building effective AI systems when existing data resources are limited. The team generated 9.5 million data points spanning 13 Indian languages, focusing on diverse reasoning and generative tasks, with an emphasis on long-context, multi-turn conversations. Experiments reveal that models trained on Updesh consistently achieve significant gains on generative tasks and remain competitive on multiple-choice natural language understanding tasks, notably in low- and medium-resource languages, narrowing the performance gap with high-resource languages.

The dataset's creation involved prompting large language models exceeding 235 billion parameters to ground data generation in Wikipedia content specific to each language, complementing traditional translation methods. Comprehensive evaluation, combining automated metrics with over 10,000 human assessments, indicates that the generated data is of high quality. This research confirms that context-aware, culturally grounded methodologies are crucial for developing effective multilingual AI systems, and it underscores the importance of involving native speakers in seed data selection and evaluation.
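As a rough illustration of the automated side of such quality checks, the sketch below filters generated pairs with an LLM-as-judge rubric. The rubric axes, JSON format, and threshold are assumptions made for the sake of the example; the paper's actual scoring rubrics are described in its appendix.

```python
import json
from typing import Callable

# Hypothetical rubric; the study's real criteria may differ.
RUBRIC_PROMPT = """Rate the instruction-answer pair on a 1-5 scale for
each criterion and reply with JSON only:
{{"fluency": 0, "faithfulness": 0, "cultural_fit": 0}}

Instruction: {instruction}
Answer: {answer}
Grounding passage: {passage}
"""

def passes_quality_filter(
    instruction: str,
    answer: str,
    passage: str,
    call_llm: Callable[[str], str],
    min_score: int = 4,
) -> bool:
    """Keep a synthetic example only if the judge model rates it at
    least `min_score` on every rubric axis (threshold is illustrative)."""
    raw = call_llm(RUBRIC_PROMPT.format(
        instruction=instruction, answer=answer, passage=passage))
    scores = json.loads(raw)  # assumes the judge returns valid JSON
    return all(v >= min_score for v in scores.values())
```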

Findings, Limitations, and Next Steps for Updesh

This research demonstrates the potential of synthetically generated data to address the scarcity of resources for multilingual artificial intelligence, particularly for Indian languages. The team developed Updesh, a large-scale dataset comprising 9.5 million data points across 13 Indian languages, created through a novel bottom-up approach grounded in culturally specific Wikipedia content. Comprehensive evaluations, incorporating both automated metrics and human assessment, confirm the high quality of the generated data and its effectiveness in improving performance on generative tasks. Models trained on Updesh exhibited significant gains in low- and medium-resource languages, narrowing the performance gap with high-resource languages. The authors acknowledge that the composition of the synthetic reasoning data, which favors long-form answers, may limit gains on certain multiple-choice NLU benchmarks, indicating a need for alignment between training data formats and downstream evaluation tasks. To facilitate further research, the team intends to release the Updesh dataset, evaluation protocols, and detailed analyses.

👉 More information
🗞 The role of synthetic data in Multilingual, Multi-cultural AI systems: Lessons from Indic Languages
🧠 ArXiv: https://arxiv.org/abs/2509.21294

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
