The common assumption that cleaning up noisy audio improves speech recognition accuracy is challenged by new research into medical transcription systems. Sujal Chondhekar, Vasanth Murukuri, and Rushabh Vasani, along with colleagues, investigated the effects of speech enhancement on state-of-the-art automatic speech recognition (ASR) models. Their systematic evaluation, using 500 medical recordings subjected to various noise conditions, reveals a surprising result: applying speech enhancement consistently decreases transcription accuracy. The team demonstrates that modern ASR systems already possess considerable internal noise resilience, and that attempting to ‘improve’ the audio through noise reduction can actually remove vital acoustic information, ultimately harming performance and potentially impacting the reliability of medical scribe applications in real-world clinical settings.
Speech enhancement techniques are often employed to improve the performance of automatic speech recognition (ASR) systems, but their effectiveness cannot be taken for granted when applied to modern, large-scale ASR models trained on diverse and noisy data. This work presents a systematic evaluation of MetricGAN-plus-voicebank denoising across four state-of-the-art ASR systems (OpenAI Whisper, NVIDIA Parakeet, Google Gemini Flash 2.0, and Parrotlet-a), using a dataset of 500 medical speech recordings subjected to nine different noise conditions. ASR performance is measured using semantic word error rate (semWER), a word error rate metric computed after domain-specific text normalizations are applied. The results reveal a counterintuitive finding: speech enhancement preprocessing degrades ASR performance across all noise conditions and models tested.
Speech Enhancement Impacts on Indian Clinical ASR
The study systematically evaluated the impact of speech enhancement on modern automatic speech recognition (ASR) systems, challenging the conventional belief that preprocessing always improves performance. Researchers employed MetricGAN-plus-voicebank, a state-of-the-art speech enhancement model, to process 500 medical speech recordings subjected to nine distinct noise conditions representative of typical Indian clinical environments. These recordings then served as input for four leading ASR systems: OpenAI Whisper, NVIDIA Parakeet, Google Gemini Flash 2.0, and Parrotlet-a, a model specifically trained for English speech recognition in Indian healthcare settings.
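For readers who want to try this kind of enhance-then-transcribe comparison themselves, a minimal sketch follows. It assumes the publicly available SpeechBrain metricgan-plus-voicebank checkpoint and the open-source Whisper package as the ASR stand-in; the file name is a placeholder, and the authors' actual harness (released on GitHub) may of course differ.

```python
# Minimal enhance-then-transcribe sketch (an illustration, not the authors'
# released pipeline). Assumes: pip install speechbrain openai-whisper torchaudio
import torch
import torchaudio
import whisper
from speechbrain.inference.enhancement import SpectralMaskEnhancement

# Pretrained MetricGAN+ model, trained on the VoiceBank-DEMAND corpus.
enhancer = SpectralMaskEnhancement.from_hparams(
    source="speechbrain/metricgan-plus-voicebank",
    savedir="pretrained_models/metricgan-plus-voicebank",
)
asr = whisper.load_model("base")  # any Whisper size works for this sketch

def transcribe(path: str) -> str:
    """Transcribe the audio file directly, noise and all."""
    return asr.transcribe(path)["text"]

def enhance_then_transcribe(path: str, tmp: str = "enhanced.wav") -> str:
    """Denoise with MetricGAN+ first, then transcribe the enhanced audio."""
    wav, sr = torchaudio.load(path)
    wav = wav.mean(dim=0, keepdim=True)  # mixdown to mono
    if sr != 16000:  # the enhancement model expects 16 kHz input
        wav = torchaudio.functional.resample(wav, sr, 16000)
    enhanced = enhancer.enhance_batch(wav, lengths=torch.tensor([1.0]))
    torchaudio.save(tmp, enhanced.cpu(), 16000)
    return asr.transcribe(tmp)["text"]

# "recording_0001.wav" is a hypothetical file name.
hyp_noisy = transcribe("recording_0001.wav")
hyp_enhanced = enhance_then_transcribe("recording_0001.wav")
```

Comparing the two hypotheses against a reference transcript, per recording and per noise condition, is exactly the comparison the study scales up to 500 recordings and four ASR systems.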
The core of the methodology involved a rigorous comparative analysis, measuring ASR performance with and without the application of MetricGAN-plus-voicebank denoising. Performance was quantified using semantic Word Error Rate (semWER), a normalized metric designed to account for domain-specific terminology and normalizations crucial for accurate medical transcription. Researchers tested 40 configurations, combining each of the four ASR systems with each of ten audio conditions (the nine noise conditions plus the original recordings), ensuring a comprehensive evaluation of the interaction between speech enhancement and ASR performance. To facilitate reproducibility and further research, the team released their evaluation code, the dataset of 500 medical recordings, and detailed results publicly on GitHub.
This commitment to open science allows other researchers to verify the findings and extend the work to different languages, noise conditions, or ASR systems. The study’s innovative approach lies in its systematic investigation of a potentially counterintuitive phenomenon: that modern ASR systems, trained on vast and diverse datasets, may already possess sufficient internal noise robustness, rendering traditional speech enhancement techniques unnecessary or even detrimental to transcription accuracy. The precise measurement of semWER across a wide range of configurations enabled the team to demonstrate a consistent degradation in performance when speech enhancement was applied, ranging from 1.1% to 46.6% absolute semWER increase.
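The paper's exact normalization rules live in the released code, but the shape of a semWER-style metric is easy to sketch: compute ordinary WER only after both the reference and the hypothesis have passed through domain-aware text normalization. Below is a minimal illustration using the jiwer package; the abbreviation table is a hypothetical stand-in for the study's actual medical normalizations.

```python
# semWER-style scoring: plain WER after domain-specific normalization.
# The normalization table here is illustrative; the study's real rules
# are in its released evaluation code.
import re
import jiwer

# Hypothetical medical equivalences; a real table would be far larger.
MEDICAL_NORMALIZATIONS = {
    "bp": "blood pressure",
    "mg": "milligrams",
    "hr": "heart rate",
}

def normalize(text: str) -> str:
    text = re.sub(r"[^\w\s]", "", text.lower())  # lowercase, drop punctuation
    words = [MEDICAL_NORMALIZATIONS.get(w, w) for w in text.split()]
    return " ".join(words)

def sem_wer(reference: str, hypothesis: str) -> float:
    return jiwer.wer(normalize(reference), normalize(hypothesis))

ref = "Patient's B.P. is 120 over 80; prescribed 500 mg amoxicillin."
hyp = "patients blood pressure is 120 over 80 prescribed 500 milligrams amoxicillin"
print(sem_wer(ref, hyp))  # 0.0: the differences vanish under normalization
```

Under plain WER the hypothesis above would be penalized on several tokens; under the normalized metric it scores perfectly, which is the point of using semWER in a domain full of abbreviations and notation variants.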
Noise Reduction Impairs Speech Recognition Performance
This work presents a systematic evaluation of speech enhancement techniques on modern automatic speech recognition (ASR) systems, revealing a counterintuitive finding: preprocessing audio with noise reduction consistently degrades performance. Researchers tested MetricGAN-plus-voicebank denoising across four state-of-the-art ASR systems (OpenAI Whisper, NVIDIA Parakeet, Google Gemini Flash 2.0, and Parrotlet-a), using a dataset of 500 medical speech recordings exposed to nine distinct noise conditions. Performance was measured using semantic Word Error Rate (semWER), a normalized metric accounting for domain-specific terminology.
The results demonstrate that original, noisy audio consistently achieved lower semWER scores than enhanced audio across all 40 tested configurations. Degradations ranged from 1.1% to 46.6% absolute semWER increase, indicating a substantial negative impact from the applied denoising. This suggests that modern ASR models possess inherent robustness to noise and that traditional speech enhancement methods may inadvertently remove acoustic features critical for accurate transcription.
The team meticulously measured and recorded semWER values for each model, noise condition, and processing state, providing a detailed quantitative analysis of the observed performance differences. These findings challenge the conventional wisdom that speech enhancement always improves ASR accuracy, particularly for large-scale neural ASR systems trained on diverse, noisy data. The research has direct implications for the deployment of medical scribe systems in noisy clinical environments, indicating that applying noise reduction techniques may not only be computationally wasteful but also detrimental to transcription accuracy.
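Structurally, the headline numbers come from a paired sweep: score each (model, condition) cell twice, once on the original audio and once after enhancement, then take the difference. A skeleton of that sweep is shown below; evaluate is a stub to be wired to the real dataset and the four systems' APIs, and the model and condition names are placeholders rather than the authors' identifiers.

```python
# Skeleton of the paired sweep behind the reported deltas. The evaluate()
# stub must be wired to the real dataset and ASR systems; names are
# placeholders, not the authors' identifiers.
from statistics import mean

MODELS = ["whisper", "parakeet", "gemini-flash-2.0", "parrotlet-a"]
CONDITIONS = ["original"] + [f"noise_{i}" for i in range(1, 10)]  # 10 conditions

def evaluate(model: str, condition: str, enhanced: bool) -> float:
    """Mean semWER over the 500 recordings for one configuration (stub)."""
    raise NotImplementedError("wire this to the dataset and ASR systems")

deltas = []
for model in MODELS:
    for condition in CONDITIONS:  # 4 models x 10 conditions = 40 configs
        base = evaluate(model, condition, enhanced=False)
        enh = evaluate(model, condition, enhanced=True)
        deltas.append((model, condition, enh - base))

# The study reports every delta as positive: 1.1 to 46.6 semWER points.
print("mean increase:", mean(d for _, _, d in deltas))
print("worst case:", max(deltas, key=lambda t: t[2]))
```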
Speech Enhancement Degrades ASR Performance
This study systematically evaluated the impact of a widely used speech enhancement method, MetricGAN-plus-voicebank, on four state-of-the-art automatic speech recognition (ASR) systems operating in a noisy clinical environment. Across all tested noise conditions and models, the research team observed a consistent increase in word error rate when enhanced audio was transcribed, compared to directly transcribing the original noisy audio. These findings demonstrate that, in this specific context, applying this particular enhancement technique degrades ASR performance. The results suggest that enhancement approaches designed to optimize human perceptual metrics may not align well with the representations learned by large-scale ASR models trained on noisy, real-world data.
The authors emphasize that this does not necessarily indicate speech enhancement is inherently detrimental to ASR, but rather that its effectiveness is highly dependent on the specific technique, noise characteristics, and the ASR model’s training methodology. The team acknowledges that alternative approaches, such as ASR-aware enhancement or domain-specific fine-tuning, may yield different outcomes and warrant further investigation. Future research directions include detailed analysis of the acoustic phenomena causing enhancement to fail, real-world deployment studies in clinical settings, and evaluation of multi-microphone approaches like beamforming as alternatives to single-channel enhancement. From a practical standpoint, the findings suggest that applying MetricGAN+ preprocessing by default in medical ASR pipelines should be avoided, and its impact carefully evaluated for each specific task and model.
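The beamforming direction mentioned above is worth a concrete picture: with several microphones, the per-channel delays toward the speaker can be compensated so that speech adds coherently while uncorrelated room noise averages out, improving SNR without the spectral surgery that single-channel enhancement performs. Below is a toy delay-and-sum sketch in NumPy, assuming the integer sample delays are already known (in practice they would be estimated, e.g. from array geometry or cross-correlation).

```python
# Toy delay-and-sum beamformer, illustrating the multi-microphone
# alternative flagged as future work. Delays toward the speaker are
# assumed known; real systems estimate them from geometry or correlation.
import numpy as np

def delay_and_sum(channels: np.ndarray, delays: list[int]) -> np.ndarray:
    """channels: (n_mics, n_samples); delays: per-mic delays in samples.

    Each channel is advanced by its delay so speech lines up across mics,
    then the channels are averaged: coherent speech is preserved while
    uncorrelated noise power drops by roughly a factor of n_mics.
    """
    aligned = np.stack([np.roll(ch, -d) for ch, d in zip(channels, delays)])
    return aligned.mean(axis=0)

# Synthetic check: a 220 Hz "speech" tone reaching 4 mics at different
# delays, each mic adding its own independent noise.
rng = np.random.default_rng(0)
n = 16000
delays = [0, 3, 7, 12]
tone = np.sin(2 * np.pi * 220 * np.arange(n) / 16000)
mics = np.stack([np.roll(tone, d) + 0.5 * rng.standard_normal(n) for d in delays])
out = delay_and_sum(mics, delays)
# Noise power falls by ~6 dB (a factor of 4) while the tone stays intact.
```

Unlike a learned spectral mask, this operation is nearly distortion-free for the target signal, which is one reason multi-microphone capture is a plausible alternative for clinical deployments.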
👉 More information
🗞 When De-noising Hurts: A Systematic Study of Speech Enhancement Effects on Modern Medical ASR Systems
🧠 ArXiv: https://arxiv.org/abs/2512.17562
