Automatic Speech Recognition (ASR) technology increasingly relies on models trained on extensive multilingual datasets, offering a pathway to supporting languages with limited digital resources. This is particularly relevant for languages like Bangla, where the scarcity of transcribed speech data poses a significant challenge to developing effective speech recognition systems. Md Sazzadul Islam Ridoy, Sumi Akter, and Md. Aminur Rahman, from Ahsanullah University of Science and Technology, investigate the performance of two prominent ASR models, OpenAI’s Whisper and Facebook’s Wav2Vec-BERT, on Bangla in their study, ‘Adaptability of ASR Models on Low-Resource Language: A Comparative Study of Whisper and Wav2Vec-BERT on Bangla’. Their research, utilising datasets such as Mozilla Common Voice-17 and OpenSLR, focuses on systematic fine-tuning and optimisation, assessing performance on metrics including Word Error Rate (WER) and Character Error Rate (CER) alongside computational efficiency.
Recent evaluations reveal significant advances in Bangla speech recognition, though performance remains constrained by limited linguistic resources. The study addresses this data scarcity, which frequently impedes progress in under-resourced languages, by systematically comparing the two models under matched conditions. Experiments draw on both publicly available datasets, including Mozilla Common Voice-17 and OpenSLR, and a newly created Bangla speech corpus, ensuring a robust and comprehensive assessment of each model’s capabilities.
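To give a concrete sense of how such public corpora are typically accessed, the sketch below loads the Bangla split of Common Voice 17 through the Hugging Face datasets library. The library choice, dataset identifier, and column names are assumptions about common tooling rather than details taken from the paper, and exact loading arguments vary between library versions.

```python
# Minimal sketch: loading Bangla speech data from a public corpus.
# Tooling is assumed (Hugging Face `datasets`); the paper does not
# specify how the authors loaded their data.
from datasets import load_dataset, Audio

# Common Voice 17, Bangla ("bn"). The dataset is gated: accept the terms
# on the Hub and authenticate with `huggingface-cli login` first.
common_voice = load_dataset(
    "mozilla-foundation/common_voice_17_0",
    "bn",
    split="train",
    trust_remote_code=True,  # may be required for script-based datasets
)

# Resample to 16 kHz, the input rate both Whisper and Wav2Vec-BERT expect.
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))

print(common_voice[0]["sentence"])                # reference transcript
print(common_voice[0]["audio"]["sampling_rate"])  # 16000
```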
The study employs a rigorous methodology, fine-tuning and optimising hyperparameters such as the learning rate, number of training epochs, and checkpoint selection to ensure a fair comparison and to maximise each model’s performance on Bangla. Evaluation centres on Word Error Rate (WER), which measures the proportion of substituted, deleted, and inserted words relative to the reference transcript, and Character Error Rate (CER), which applies the same edit-distance calculation at the character level, giving a fine-grained view of transcription accuracy. Computational efficiency and training time also receive consideration, providing a holistic assessment of each model’s practical applicability and resource requirements.
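For readers unfamiliar with the metrics: WER = (S + D + I) / N, where S, D, and I count substituted, deleted, and inserted words against a reference transcript of N words, and CER applies the same formula to characters. A minimal scoring sketch using the open-source jiwer package follows; the package and the example strings are illustrative assumptions, not the paper’s actual tooling or data.

```python
# Minimal sketch: scoring an ASR hypothesis with WER and CER.
# WER = (S + D + I) / N over words; CER is the same computation over
# characters. `jiwer` is an assumed tool choice, not named in the paper.
import jiwer

reference = "আমি বাংলায় কথা বলি"    # hypothetical ground-truth transcript
hypothesis = "আমি বাংলা কথা বলি"    # hypothetical model output

wer = jiwer.wer(reference, hypothesis)   # word-level error rate
cer = jiwer.cer(reference, hypothesis)   # character-level error rate
print(f"WER: {wer:.3f}  CER: {cer:.3f}")
```

Lower is better on both metrics; CER is often the more forgiving of the two for morphologically rich languages like Bangla, where a single wrong suffix counts as a whole word error in WER but only one character error in CER.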
Results consistently demonstrate that Wav2Vec-BERT outperforms Whisper across all evaluated metrics, establishing it as the stronger model for Bangla speech recognition. Wav2Vec-BERT achieves higher transcription accuracy while requiring fewer computational resources, making it the more efficient and practical choice for real-world applications. This suggests that Wav2Vec-BERT’s self-supervised pretraining, which lets it learn speech representations from unlabelled audio, proves particularly effective in a low-resource setting, mitigating the challenges posed by limited labelled data.
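A rough sketch of how CTC fine-tuning of a pretrained Wav2Vec-BERT encoder can be set up with Hugging Face Transformers appears below. The checkpoint name, vocabulary size, and hyperparameter values are illustrative assumptions, not the study’s actual configuration.

```python
# Minimal sketch: attaching a fresh CTC head to the self-supervised
# Wav2Vec-BERT encoder for Bangla fine-tuning. All values shown are
# illustrative; the paper's exact configuration may differ.
from transformers import Wav2Vec2BertForCTC, TrainingArguments

# Hypothetical: size of a character-level vocabulary built from the
# Bangla training transcripts.
bangla_vocab_size = 64

model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",       # encoder pretrained on unlabelled audio
    vocab_size=bangla_vocab_size,  # new CTC output layer for Bangla
    ctc_loss_reduction="mean",
)

training_args = TrainingArguments(
    output_dir="w2v-bert-bangla",
    learning_rate=5e-5,            # illustrative; tuned in the study
    num_train_epochs=10,           # illustrative; tuned in the study
    per_device_train_batch_size=8,
    eval_strategy="epoch",         # evaluate each epoch so the best
    save_strategy="epoch",         # checkpoint can be selected by dev WER
    load_best_model_at_end=True,
)
```

Only the CTC output layer starts from random weights here; the encoder retains what it learned from unlabelled speech, which is what makes fine-tuning viable with little labelled Bangla data.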
Bangla presents a considerable challenge for ASR development because, as a low-resource language, it offers comparatively little labelled speech data for training robust models. The success of Wav2Vec-BERT underscores the value of self-supervised techniques, in which models first learn general speech representations from unlabelled audio and are then fine-tuned on the limited labelled examples available.
Researchers are now investigating whether these findings transfer to other low-resource languages, examining both models across a wider range of linguistic contexts to test the generalisability of the observed trends. Future work calls for a deeper analysis of the specific architectural features of Wav2Vec-BERT that drive its superior performance on Bangla, identifying components that could inform the design of even more effective ASR models for low-resource languages.
Expanding the available training data for Bangla remains a priority: larger, more diverse datasets, potentially built through active learning strategies, would further enhance the accuracy and robustness of ASR systems. The researchers acknowledge that while Wav2Vec-BERT currently exhibits stronger performance, both models contribute to advancing the field, offering potential solutions for building robust speech recognition in linguistic environments where labelled data remains limited. The findings provide valuable insights for researchers and developers working to improve speech technology for Bangla and other under-resourced languages, fostering innovation and collaboration.
Wav2Vec-BERT’s lower computational cost is particularly significant when deploying ASR systems in environments with constrained hardware or limited energy availability, making it a practical choice for resource-limited settings. This difference suggests a concrete benefit for real-world applications, enabling wider adoption of ASR technology in diverse contexts. The research demonstrates the potential of self-supervised learning to overcome the challenges of low-resource speech recognition, paving the way for more inclusive and accessible communication technologies.
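One crude but easy probe of that efficiency gap is comparing the size of the publicly released checkpoints, as sketched below. The Whisper variant is an assumption, since the summary does not state which size the study used, and raw parameter counts tell only part of the story: Whisper’s autoregressive decoder runs one forward pass per output token, whereas a CTC-style fine-tuned encoder (the standard recipe for Wav2Vec-family models) transcribes an utterance in a single encoder pass.

```python
# Minimal sketch: comparing checkpoint sizes as a first proxy for compute
# cost. The Whisper variant is an assumption; decoding strategy
# (autoregressive vs. single-pass CTC) matters as much as parameter count.
from transformers import Wav2Vec2BertModel, WhisperForConditionalGeneration

w2v_bert = Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0")
whisper = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

print(f"Wav2Vec-BERT 2.0: {w2v_bert.num_parameters() / 1e6:.0f}M parameters")
print(f"Whisper-small:    {whisper.num_parameters() / 1e6:.0f}M parameters")
```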
👉 More information
🗞 Adaptability of ASR Models on Low-Resource Language: A Comparative Study of Whisper and Wav2Vec-BERT on Bangla
🧠 DOI: https://doi.org/10.48550/arXiv.2507.01931
