RegSpeech12: Regional Corpus Documents Bengali Spontaneous Speech across Five Dialect Groups

Bengali, spoken by millions across South Asia and its diaspora, poses a significant challenge for computational linguistics because of its considerable dialectal variation. Md. Rezuwan, Azmol Hossain, and Kanij Fatema, alongside Rubayet Sabbir Faruque, Tanmoy Shome, and Ruwad Naswan, address this challenge with a comprehensive study of regional Bengali speech. Their work documents the phonetic and morphological characteristics of five principal dialect groups, alongside variations observed in regions such as Chittagong and Sylhet, and introduces RegSpeech12, a new corpus of spontaneous speech recorded across these diverse areas. The corpus is a step towards more accurate and inclusive Automatic Speech Recognition systems, and it supports the preservation of Bengali’s linguistic heritage while advancing digital tools for its speakers.

Bangladeshi Speech Corpus, Regional and Demographic Diversity

This research details a substantial speech corpus representing diverse regions and demographics within Bangladesh, comprising over 21,000 utterances and more than 100 hours of speech data. A key strength is its geographic coverage, which spans 15 districts and numerous subregions and allows for the development of regionally specific speech models. The data includes speech from male, female, and mixed-gender speakers, providing valuable information for building inclusive speech technologies. Analysis shows that the share of out-of-dictionary words varies from 32% to 67% across regions, which points to substantial linguistic diversity, particularly in Sylhet. The authors acknowledge potential limitations such as data imbalance and variations in recording quality, but the corpus remains a valuable contribution to speech technology and linguistic research in Bangladesh.
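To make the out-of-dictionary figure concrete, here is a minimal sketch of how such a rate could be computed for one region. It assumes whitespace-tokenized transcripts and a rate measured over distinct word types; the article does not say whether the reported 32% to 67% range counts types or tokens, or which reference dictionary was used. The function name `oov_rate` and the toy data are hypothetical.

```python
def oov_rate(transcripts, lexicon):
    """Share of distinct word types in `transcripts` that are absent from `lexicon`.

    Assumption: words are separated by whitespace and compared as exact strings.
    """
    # Collect every distinct word type that appears in the transcripts.
    word_types = {word for line in transcripts for word in line.split()}
    if not word_types:
        return 0.0
    # A type is out-of-dictionary when the reference lexicon does not contain it.
    missing = word_types - set(lexicon)
    return len(missing) / len(word_types)


# Toy example with placeholder tokens, not real corpus data.
region_transcripts = ["a b c", "a d"]
standard_lexicon = {"a", "b"}
rate = oov_rate(region_transcripts, standard_lexicon)
print(f"{rate:.0%}")
```

A token-based variant would count every occurrence instead of each distinct type, which usually gives a lower rate because frequent words tend to be in the dictionary.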

New Bengali Dialect Speech Corpus Constructed

Researchers have created a new speech corpus to address a gap in computational linguistics resources for the Bengali language, specifically regarding dialectal diversity. To remedy the lack of comprehensive regional coverage, the team compiled over 241 hours of speech, collected from 61 native speakers across 8 divisions and 34 districts of Bangladesh. The corpus includes both carefully recorded speech and spontaneous speech sourced from online platforms, capturing a wide range of accents and contemporary language use. Detailed phonetic analysis, using the International Phonetic Alphabet, revealed significant pronunciation differences across regions, demonstrating the nuanced linguistic landscape of Bangladesh. This newly created corpus, combined with the detailed phonetic analysis, provides a valuable resource for developing regional Bengali speech recognition systems and preserving the linguistic diversity of the language.

Bengali Speech Corpus Captures Regional Linguistic Diversity

This research delivers a comprehensive speech corpus documenting linguistic diversity across 12 regions and 99 sub-regions of Bangladesh, capturing 237 unique conversational topics. Data collection strategically targeted geographically diverse districts to reflect linguistic boundaries, and researchers meticulously addressed challenges in data collection, including maintaining protocol adherence and ensuring audio quality. Manual validation identified and rectified imbalances in the corpus, prompting supplementary data acquisition to achieve a more balanced representation of regional dialects. This rigorous approach results in a high-quality, linguistically diverse dataset for advancing automatic speech recognition technology for the Bengali language.

This work successfully constructs a comprehensive speech corpus exceeding 100 hours, documenting 12 distinct regional dialects of Bangladeshi Bangla, and represents the first publicly available resource specifically designed for automatic speech recognition focused on these regional variations. The team meticulously collected and processed speech samples under a uniform protocol, enabling the creation of a valuable resource for enhancing speech-to-text models. The dataset’s potential extends to various applications, including agentic AI, meta-learning, text-to-speech systems, and automated transliteration. Future work will focus on detailed linguistic analyses of dialectal differences and the standardization of transcriptions to facilitate automatic translation between regional dialects and standard Bengali.

👉 More information
🗞 RegSpeech12: A Regional Corpus of Bengali Spontaneous Speech Across Dialects
🧠 ArXiv: https://arxiv.org/abs/2510.24096

Rohail T.

As a quantum scientist exploring the frontiers of physics and technology, I focus on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
