Scientists are tackling the challenge of adapting powerful auditory Large Language Models (LLMs) to tasks where labelled data is limited. Haolong Zheng and Siyin Wang, from the University of Illinois Urbana-Champaign and Tsinghua University respectively, together with Zengrui Jin, Mark Hasegawa-Johnson, and colleagues, demonstrate a novel approach to improving performance in these low-resource scenarios. Their research shows that simple In-Context Learning (ICL) can enhance zero-shot performance across various speech and audio tasks, and builds on this with a new post-training method, Speech In-Context Adaptation Training (SICL-AT). This technique strengthens a model’s ability to learn from demonstrations and consistently outperforms traditional fine-tuning when data is scarce, representing a significant step towards more flexible and robust auditory LLMs.
The team’s experiments consistently reveal that SICL-AT outperforms traditional fine-tuning methods in these low-resource settings, marking a substantial advancement in the field. This innovative approach involves training the model to perform inference conditioned on audio demonstrations, effectively teaching it to utilize contextual cues rather than simply memorizing domain-specific information. The enhancement achieved through SICL-AT demonstrably generalizes to audio understanding and reasoning tasks, broadening its potential applications.
Experiments detailed in the research indicate that SICL-AT consistently surpasses direct fine-tuning when dealing with limited data, achieving performance gains across both low-resource automatic speech recognition (ASR) and audio understanding/reasoning (AU/AR) tasks. The team applied SICL-AT to two distinct model backbones, Qwen2.5-Omni and MiMo-Audio, utilizing LoRA adapters with a rank of 8 and an alpha of 32 to prevent overfitting. Furthermore, a case study revealed that the proposed method exhibits greater stability compared to directly fine-tuning with limited resources, ensuring more reliable performance in challenging conditions. This stability is crucial for real-world deployment where data scarcity is a common obstacle.
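To make those LoRA hyperparameters concrete: LoRA freezes a weight matrix W and learns a low-rank update scaled by alpha/r, so a rank of 8 with an alpha of 32 gives a scaling factor of 4 while training only a small fraction of the parameters. The sketch below, in plain Python with an assumed hidden size of 64, illustrates the arithmetic; it is not the authors' implementation.

```python
# Minimal numeric sketch of LoRA's low-rank update: W_eff = W + (alpha/r) * (B @ A).
# With r=8 and alpha=32 (the values reported in the paper), the scale is 4.
# The hidden size d=64 is an illustrative assumption.
import random

d, r, alpha = 64, 8, 32
scale = alpha / r  # -> 4.0

random.seed(0)
W = [[random.gauss(0, 0.02) for _ in range(d)] for _ in range(d)]  # frozen weight
A = [[random.gauss(0, 0.02) for _ in range(d)] for _ in range(r)]  # trainable, r x d
B = [[0.0] * r for _ in range(d)]                                  # trainable, d x r, zero-init

def lora_effective_weight(W, A, B, scale):
    """Return W + scale * (B @ A); only A and B are updated during training."""
    d_out, d_in, rank = len(W), len(W[0]), len(A)
    delta = [[scale * sum(B[i][k] * A[k][j] for k in range(rank))
              for j in range(d_in)] for i in range(d_out)]
    return [[W[i][j] + delta[i][j] for j in range(d_in)] for i in range(d_out)]

# Because B starts at zero, the effective weight initially equals W,
# so adaptation begins exactly from the pre-trained model.
W_eff = lora_effective_weight(W, A, B, scale)

trainable = r * d + d * r  # parameters in A and B
full = d * d               # parameters in the frozen matrix
print(f"trainable fraction: {trainable / full:.2f}")  # 0.25 at d=64; far smaller for real d
```

The zero initialisation of B is the standard LoRA design choice: it lets the adapters be attached without disturbing the pre-trained behaviour, which matches the goal of preserving the model's existing knowledge.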
Notably, the SICL-AT methodology employs an episodic training format mirroring inference-time in-context learning, constructing prompts by concatenating in-context demonstrations with query instances. Training data included English subsets of CommonVoice and multilingual subsets of CoVoST2, alongside the MMSU dataset for speech question answering, enabling a comprehensive evaluation of the technique’s effectiveness. By strategically leveraging high-resource data within this carefully designed episodic format, the method avoids relying solely on scarce in-domain labels. Algorithm 1, detailed in the work, summarises the training procedure, outlining the steps for preparing the data and updating the model parameters.
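The episodic format described above can be sketched as a simple prompt builder that concatenates demonstrations with a query. The field names, audio placeholders, and instruction text below are illustrative assumptions; the paper's exact prompt template is not reproduced here.

```python
# Sketch of an episodic in-context prompt: k (audio, transcript) demonstrations
# followed by a query instance. Placeholder syntax and instruction wording are
# assumptions for illustration, not the paper's actual template.

def build_icl_prompt(demos, query_audio, instruction="Transcribe the audio."):
    """Concatenate (audio_ref, transcript) demonstrations with a query instance."""
    parts = [instruction]
    for i, (audio_ref, transcript) in enumerate(demos, start=1):
        parts.append(f"[Example {i}] <audio:{audio_ref}> -> {transcript}")
    parts.append(f"[Query] <audio:{query_audio}> ->")
    return "\n".join(parts)

demos = [
    ("clip_001.wav", "hello world"),
    ("clip_002.wav", "good morning"),
]
print(build_icl_prompt(demos, "clip_003.wav"))
```

During SICL-AT, the model is trained to predict the answer for the query slot conditioned on the demonstrations, which is the same conditioning it will see at inference time.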
Crucially, the study pioneered a method of post-training adaptation, avoiding brittle direct fine-tuning on limited in-domain data. Instead of updating all model weights, the team utilised LoRA, a parameter-efficient fine-tuning technique, to modify only a small subset of parameters, preserving the pre-trained knowledge within the auditory LLM. Table 1, presented within the research, lists the specific datasets used for both the pre-training and SICL-AT stages, detailing the size and characteristics of each corpus. This approach enables the model to generalise better to unseen data distributions, particularly in low-resource scenarios.
The researchers then evaluated the performance of the SICL-AT-trained model on low-resource Automatic Speech Recognition (ASR) and Audio Understanding/Reasoning (AU/AR) tasks. Experiments consistently revealed that SICL-AT outperformed direct fine-tuning, demonstrating a significant improvement in robustness and downstream performance. A case study further highlighted the stability of the proposed method, showing that the SICL-AT-adapted model exhibited less performance degradation under domain shift compared to models fine-tuned directly on scarce in-domain data. This innovative methodology unlocks the potential for leveraging abundant out-of-domain data to enhance performance on challenging, low-resource audio tasks.
SICL-AT boosts speech model in-context learning performance
The team measured performance gains across several benchmarks, including children’s ASR, multilingual ASR, speech translation (ST), and general audio understanding/reasoning (AU/AR). Specifically, on the multilingual ASR task using the de, zh, and fr subsets of CommonVoice, the baseline model achieved Word Error Rates (WER) of 14.25%, 31.39%, and 66.90% respectively, alongside accuracy scores of 54.70% and 69.74%. Incorporating SICL-AT resulted in WERs of 11.49%, 16.59%, and 71.90%, with the corresponding accuracy scores improving to 57.70% and holding at 69.74%. These results demonstrate a clear enhancement in ASR performance on German and Chinese, though the French figure shows the gains are not uniform across languages.
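The WER figures above can be reproduced for any transcript pair with a short word-level edit-distance routine. This is a generic sketch of the standard metric, not code from the paper.

```python
# Word Error Rate (WER): the word-level edit distance (substitutions +
# insertions + deletions) between hypothesis and reference, divided by the
# number of reference words. Lower is better.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via one-row dynamic programming.
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            if ref[i - 1] == hyp[j - 1]:
                dp[j] = prev
            else:
                dp[j] = 1 + min(prev, dp[j], dp[j - 1])
            prev = cur
    return dp[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
print(wer("the cat sat on the mat", "the cat sit on mat"))  # -> 0.333...
```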
Further experiments incorporated speech translation data (SICL-AT2), raising the BLEU score on the non-overlap ST evaluation and indicating improved translation quality. Notably, AU/AR performance also increased for both models, even though neither ASR nor ST overlaps with those tasks. The team recorded BLEU scores of 36.92 and 16.76 for ST with SICL-AT2, a significant gain over the baseline. Adding SQA data (SICL-AT3) yielded additional gains on AU/AR, reaching a peak accuracy of 73.40%, though at the cost of slight degradations on ASR/ST. These tests indicate that direct fine-tuning on narrowly matched data can overspecialize and hinder generalization in low-resource scenarios.
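BLEU, the ST metric quoted above, combines modified n-gram precision with a brevity penalty. The minimal single-reference sketch below illustrates the standard formula, without the smoothing and tokenization a production scorer such as sacrebleu applies; it is not taken from the paper.

```python
# Minimal single-reference BLEU: geometric mean of clipped 1..4-gram
# precisions, multiplied by a brevity penalty for short hypotheses.
# No smoothing is applied, so very short sentences can score near zero.
import math
from collections import Counter

def bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        log_prec += math.log(max(overlap, 1e-9) / total) / max_n
    # Brevity penalty: 1 if the hypothesis is at least as long as the reference.
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return 100 * bp * math.exp(log_prec)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # -> 100.0
```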
In contrast, SICL-AT consistently outperformed direct fine-tuning, even when leveraging high-resource data for pre-training. For example, fine-tuning Qwen2.5-Omni on Common Voice English improved children’s ASR, but SICL-AT still delivered stronger adaptation ability, confirming the necessity of explicitly training for ICL behavior. Measurements confirm that SICL-AT strengthens gradient-free, demonstration-conditioned adaptation, enabling robust performance in low-resource scenarios.
SICL-AT boosts LLM adaptation and inference
The findings suggest that SICL-AT consistently outperforms direct fine-tuning, particularly when labelled in-domain data is limited. This is because SICL-AT fosters robust adaptation at inference time, avoiding the over-specialization that can occur with direct fine-tuning on narrowly matched data. The researchers found that aligning post-training tasks with downstream task formats can further improve targeted capabilities, and that the benefits of SICL-AT extend beyond the specific skills used during training. The authors acknowledge that their experiments were conducted on two model families and a fixed set of benchmarks, and that a comprehensive analysis of inference cost with longer contexts remains to be completed. They also note that SICL performance depends on the quality of retrieved examples, which may be limited in truly data-scarce situations. Future work could explore the scaling of inference costs and a more detailed qualitative failure analysis to further refine the method.
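Since SICL depends on the quality of retrieved demonstrations, a common strategy (assumed here for illustration, not specified by the paper) is nearest-neighbour retrieval: embed the query and a candidate pool, then pick the top-k most similar examples as in-context demonstrations.

```python
# Nearest-neighbour demonstration retrieval by cosine similarity.
# Embeddings and example ids are illustrative; in practice they would come
# from an audio encoder over the candidate pool.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_demos(query_emb, pool, k=2):
    """pool: list of (example_id, embedding); returns the k most similar ids."""
    ranked = sorted(pool, key=lambda item: cosine(query_emb, item[1]), reverse=True)
    return [ex_id for ex_id, _ in ranked[:k]]

pool = [
    ("child_utt_1", [0.9, 0.1, 0.0]),
    ("adult_utt_1", [0.1, 0.9, 0.0]),
    ("child_utt_2", [0.8, 0.2, 0.1]),
]
print(retrieve_demos([1.0, 0.0, 0.0], pool, k=2))  # the two child clips rank highest
```

In truly data-scarce domains the pool itself is small, which is exactly the limitation the authors flag: retrieval can only surface demonstrations that exist.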
👉 More information
🗞 SICL-AT: Another way to adapt Auditory LLM to low-resource task
🧠 ArXiv: https://arxiv.org/abs/2601.18904
