The ability to understand and generate audio with human-like flexibility remains a significant challenge for artificial intelligence, as current systems often require extensive task-specific training. Researchers Dong Zhang, Gang Wang, Jinlong Xue, Kai Fang, and the Xiaomi LLM-Core Team address this limitation with MiMo-Audio, a new approach to audio language modelling. They demonstrate that scaling pre-training data to over one hundred million hours unlocks remarkable few-shot learning capabilities, allowing the model to generalise to unseen challenges such as voice conversion and realistic speech continuation. The resulting MiMo-Audio-7B-Instruct model achieves state-of-the-art performance across benchmarks for speech intelligence, audio understanding, and spoken dialogue, rivalling closed-source systems and representing a substantial advance in open-source audio AI.
Recent advances in text language models suggest that similarly powerful capabilities could emerge in the audio domain given sufficient scale. Scaling pre-training data for MiMo-Audio to over one hundred million hours has unlocked few-shot learning abilities across a wide spectrum of audio tasks. Systematic evaluation demonstrates that MiMo-Audio-7B-Base achieves state-of-the-art performance on both speech intelligence and audio understanding benchmarks among openly available models. Beyond standard assessments, MiMo-Audio-7B-Base successfully generalises to tasks absent from its training data, including voice conversion, style transfer, and speech editing. Furthermore, the model exhibits remarkable speech continuation capabilities, generating highly realistic content such as talk shows, recitations, livestreaming broadcasts, and debates.
The team acknowledges contributions from several groups, including Xiaomi LLM-Plus, NGK, MiChat, Mify, the Data Platform team, and CloudML. Yongzhe He served as the corresponding author for this research.
MiMo-Audio Achieves Broad Audio Understanding
MiMo-Audio marks a significant advance in audio language modelling, demonstrating strong generalisation across a diverse set of audio tasks. The team scaled pre-training data to over one hundred million hours, revealing emergent capabilities previously unseen in open-source models. This work establishes a new standard for both speech intelligence and audio understanding, moving beyond the task-specific fine-tuning common in existing systems. Experiments reveal that MiMo-Audio-7B-Base achieves state-of-the-art performance on established benchmarks while also generalising, through few-shot prompting, to tasks not present in its original training data, including voice conversion, style transfer, and speech editing.
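As a rough illustration of how few-shot generalisation of this kind typically works, the sketch below assembles an in-context prompt for voice conversion from interleaved example pairs of audio tokens. The function name, separator token, and interface here are illustrative assumptions, not MiMo-Audio's actual API.

```python
# Hypothetical sketch: few-shot in-context prompting for voice conversion.
# Everything here (names, separator token, token layout) is an assumption
# for illustration and is not MiMo-Audio's actual interface.

from typing import List, Tuple

def build_fewshot_prompt(
    examples: List[Tuple[List[int], List[int]]],  # (source, converted) audio-token pairs
    query: List[int],                             # audio tokens to convert
    sep: int = 0,                                 # assumed separator token id
) -> List[int]:
    """Interleave example input/output pairs, then append the query.

    A sufficiently scaled audio language model is expected to continue the
    sequence with the converted audio, mirroring text-style in-context
    learning without any task-specific fine-tuning.
    """
    prompt: List[int] = []
    for source, converted in examples:
        prompt += source + [sep] + converted + [sep]
    return prompt + query + [sep]
```

The same prompt structure extends naturally to style transfer or speech editing by swapping in example pairs for those tasks.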
The model demonstrates powerful speech continuation abilities, successfully generating highly realistic content such as talk shows, recitations, livestreaming broadcasts, and debates. Tests show the system maintains coherence and naturalness over extended durations, a significant advance in speech synthesis. Further refinement through instruction-tuning produced MiMo-Audio-7B-Instruct, which achieves state-of-the-art results on audio understanding benchmarks including MMSU, MMAU, MMAR, and MMAU-Pro. The model also excels in spoken dialogue evaluations, such as Big Bench Audio and MultiChallenge Audio, and in instruct-TTS evaluations, approaching or surpassing the performance of closed-source models.
Measurements confirm that the system's 1.2-billion-parameter tokenizer achieves superior reconstruction quality and supports effective downstream language modelling, emitting 200 tokens per second at a 25 Hz frame rate (eight tokens per frame). The team's pre-training architecture, which combines a patch encoder, a large language model, and a patch decoder, enables efficient modelling of these high-token-rate sequences and bridges the sequence-length disparity between speech and text modalities.
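As a minimal sketch of the arithmetic these figures imply, the snippet below compares the sequence lengths each stage would see for a given clip. The eight-tokens-per-frame figure follows directly from 200 / 25; the patch size is an assumed value for illustration only.

```python
# Minimal sketch of the token-rate arithmetic described above.
# FRAME_RATE_HZ and TOKENS_PER_SECOND come from the reported figures;
# PATCH_SIZE is an assumed value for illustration only.

FRAME_RATE_HZ = 25                                     # tokenizer frame rate
TOKENS_PER_SECOND = 200                                # reported throughput
TOKENS_PER_FRAME = TOKENS_PER_SECOND // FRAME_RATE_HZ  # 200 / 25 = 8

PATCH_SIZE = 4  # assumed number of frames grouped per patch by the patch encoder

def sequence_lengths(audio_seconds: float) -> dict:
    """Sequence length at each stage of the pipeline for a clip."""
    frames = int(audio_seconds * FRAME_RATE_HZ)
    return {
        "tokenizer_tokens": frames * TOKENS_PER_FRAME,  # raw token stream
        "frames": frames,                               # 25 Hz frame sequence
        "llm_patches": frames // PATCH_SIZE,            # what the LLM models
    }

# A 60-second clip: 12,000 raw tokens collapse to 375 patches for the LLM,
# which is what makes long, high-token-rate audio tractable to model.
print(sequence_lengths(60.0))
```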
MiMo-Audio Generalises Across Diverse Audio Tasks
Researchers have developed MiMo-Audio, a new audio language model that demonstrates strong generalisation across a diverse range of audio tasks. By scaling pre-training data to over one hundred million hours and employing an architecture designed to preserve all speech information, the team achieved state-of-the-art performance on both speech intelligence and audio understanding benchmarks among open-source models. Notably, the model successfully tackles tasks it was not specifically trained for, including voice conversion, style transfer, and speech editing, and exhibits impressive capabilities in generating realistic speech continuations for scenarios like talk shows and debates. Further refinement through instruction-tuning created MiMo-Audio-7B-Instruct, which achieves leading results on audio understanding, spoken dialogue, and text-to-speech benchmarks, approaching the performance of closed-source models.
👉 More information
🗞 MiMo-Audio: Audio Language Models are Few-Shot Learners
🧠 ArXiv: https://arxiv.org/abs/2512.23808
