Hotword Retrieval Enables More Accurate ASR Transcription Using LLMs and Task-Driven Rewards

Recent advances in automatic speech recognition (ASR) leverage the power of large language models (LLMs), yet accurately identifying specific terms such as names or ‘hotwords’ within a vast vocabulary remains a significant hurdle. YuXiang Kong, JunFeng Hou, and Jian Tang, researchers at Alibaba Group, together with their colleagues, address this challenge with a novel framework that combines efficient hotword retrieval with reinforcement learning. Their work introduces a two-stage system that first narrows down potential hotwords from a large list and then integrates these candidates into the language model to improve recognition, achieving substantial reductions in keyword errors while maintaining overall transcription accuracy. This demonstrates a powerful new approach to contextual biasing, paving the way for more responsive and precise voice-activated systems and improved performance across diverse speech recognition applications.

Current LLM-based ASR systems achieve strong performance across diverse tasks, yet contextual biasing for named entities and hotwords under large vocabularies remains challenging. This work proposes a scalable two-stage framework that integrates hotword retrieval with LLM-ASR adaptation. The team extends the Global-Local Contrastive Language-Audio Pretraining model (GLCLAP) to retrieve a compact top-k set of hotword candidates from a large vocabulary via robustness-aware data augmentation and fuzzy matching. The retrieved candidates are then injected as textual prompts into an LLM-ASR model, and the system is fine-tuned.
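The retrieval stage can be pictured as ranking hotwords by similarity between an audio embedding and text embeddings, then keeping the top-k. The sketch below is a minimal illustration under that assumption; the function names, toy embeddings, and cosine scoring are illustrative and not the GLCLAP implementation.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve_topk(audio_emb, hotword_embs, hotwords, k=5):
    """Rank every hotword by similarity to the audio embedding; keep the top-k."""
    scored = sorted(zip(hotwords, hotword_embs),
                    key=lambda pair: cosine(audio_emb, pair[1]),
                    reverse=True)
    return [word for word, _ in scored[:k]]

# Toy example: the audio embedding is closest to "Alibaba" and a near variant.
audio = [1.0, 0.0]
vocab = ["Alibaba", "zebra", "Aliababa"]
embs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(retrieve_topk(audio, embs, vocab, k=2))  # ['Alibaba', 'Aliababa']
```

Narrowing a vocabulary of thousands of entries down to a handful of candidates this way keeps the downstream prompt short, which matters because LLM context length is limited.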

Large Vocabulary Contextual Biasing for ASR

Scientists have developed a new framework that significantly improves the accuracy of speech recognition systems when identifying specific words or phrases, known as hotwords, within a large vocabulary. This work integrates a hotword retrieval system with a large language model-based automatic speech recognition (LLM-ASR) model, achieving substantial reductions in keyword error rates while maintaining high overall transcription accuracy. The core of the breakthrough lies in a two-stage process, beginning with an enhanced Global-Local Contrastive Language-Audio pre-trained model (GLCLAP). This system efficiently narrows down a vast vocabulary to a compact set of top-k hotword candidates, utilizing a data augmentation pipeline and a fuzzy matching strategy to improve recall and robustness.
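The fuzzy matching strategy mentioned above can be sketched as tolerating a small edit distance when matching a noisy recognized string back to the hotword vocabulary. The threshold and helper names below are assumptions chosen for illustration, not the paper's exact method.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def fuzzy_match(candidate, vocabulary, max_dist=2):
    """Return vocabulary entries within max_dist edits of the candidate."""
    return [w for w in vocabulary if edit_distance(candidate, w) <= max_dist]

print(fuzzy_match("aliba", ["alibaba", "amazon", "tea"]))  # ['alibaba']
```

Tolerating a couple of character edits lets the retriever recover a hotword even when the acoustic front end slightly mangles it, which is how fuzzy matching improves recall.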

Researchers constructed a robustness-aware data augmentation pipeline, removing reliably recognized hotwords from the vocabulary to reduce distractors, and then augmented the training data with contextual sentences and deliberately perturbed variants of biasing words. The retrieved candidates are then used as prompts for the LLM-ASR model, which is further refined using a reinforcement learning technique called Group Relative Policy Optimization (GRPO). Experiments on hotword-focused test sets demonstrate the effectiveness of this framework, with the system achieving significant reductions in keyword error rate. The reward function used in GRPO jointly optimizes hotword recognition and overall transcription fidelity, encouraging accurate identification of hotwords when present, avoiding false positives when absent, and preserving general transcription performance. The team also implemented a Conformer-MoE encoder, replacing standard feed-forward networks with a Mixture-of-Experts structure to increase the model's capacity.
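A reward with the shape described above can be sketched as follows. The weighting, the substring-based hotword check, and the edit-distance error proxy are illustrative assumptions, not the paper's exact formulation.

```python
def edit_distance(a, b):
    """Levenshtein distance, used here as a character-error proxy."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def hotword_reward(hyp, ref, hotwords):
    """+1 for each hotword correctly recognized, -1 for each false positive."""
    r = 0.0
    for w in hotwords:
        in_ref, in_hyp = (w in ref), (w in hyp)
        if in_ref and in_hyp:
            r += 1.0
        elif in_hyp and not in_ref:
            r -= 1.0
    return r

def total_reward(hyp, ref, hotwords, alpha=0.5):
    """Blend hotword accuracy with overall transcription fidelity (negative error rate)."""
    fidelity = -edit_distance(hyp, ref) / max(len(ref), 1)
    return alpha * hotword_reward(hyp, ref, hotwords) + (1 - alpha) * fidelity

print(total_reward("call alibaba now", "call alibaba now", ["alibaba", "tencent"]))  # 0.5
```

The key property is that hallucinating an absent hotword is penalized just as recognizing a present one is rewarded, while the fidelity term keeps the policy from degrading general transcription to chase keyword gains.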

Hotword Recognition Enhanced with Language Models

The team developed a two-stage process, beginning with a refined method for selecting a small set of likely keywords from a much larger list, using techniques that account for potential errors in speech. These selected keywords are then used as prompts to guide a large language model-based speech recognition system, which is further refined using a reinforcement learning approach that simultaneously optimizes keyword recognition and overall transcription accuracy. The results demonstrate substantial reductions in keyword error rates on both media and medical speech datasets, while maintaining high levels of accuracy in general speech recognition tasks. This indicates that the combination of efficient keyword selection and reinforcement learning-guided adaptation effectively addresses the challenge of contextual biasing in large-vocabulary speech recognition.
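Using the selected keywords as prompts can be as simple as the sketch below; the prompt wording is an assumption, since the paper's exact template is not given here.

```python
def build_prompt(candidates):
    """Format retrieved hotword candidates as a textual prompt for the LLM decoder."""
    if not candidates:
        return "Transcribe the audio."
    return "Transcribe the audio. Likely hotwords: " + ", ".join(candidates) + "."

print(build_prompt(["Alibaba", "GLCLAP"]))
# Transcribe the audio. Likely hotwords: Alibaba, GLCLAP.
```

Because only the top-k candidates are injected rather than the full vocabulary, the prompt stays short regardless of how large the biasing list grows.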

Keyword Spotting with Guided Language Models

The authors acknowledge that the current framework focuses on single-language scenarios, and future work will investigate extending the system to handle multiple languages and explore even closer integration of the keyword selection and speech decoding processes. These advancements promise more reliable and accurate speech recognition, particularly in applications requiring precise identification of specific terms.

👉 More information
🗞 Contextual Biasing for LLM-Based ASR with Hotword Retrieval and Reinforcement Learning
🧠 ArXiv: https://arxiv.org/abs/2512.21828

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
