The challenge of effectively combining speech and text processing within a single model is addressed by Yuxuan Lou, Kai Yang, and Yang You from the National University of Singapore and Shanghai Jiao Tong University in their development of MoST, a Mixture of Speech and Text model. Their research introduces a Modality-Aware Mixture of Experts (MAMoE), which moves beyond conventional multimodal approaches by utilizing specialized pathways to process different input types. This allows the model to learn modality-specific nuances while facilitating cross-modal understanding, ultimately improving performance across a range of tasks. MoST distinguishes itself as the first fully open-source speech-text large language model built on a Mixture of Experts architecture, trained exclusively on publicly available datasets, and it demonstrably outperforms comparable models in automatic speech recognition, text-to-speech synthesis, audio language modeling, and spoken question answering. This work represents a significant step towards more efficient and accessible multimodal AI systems.
Mixture of Experts for Speech and Text
Scientists have developed MoST (Mixture of Speech and Text), a novel large language model that seamlessly integrates both speech and text processing. The research team achieved this by introducing a Modality-Aware Mixture of Experts (MAMoE) architecture, designed to address limitations in current multimodal models that often treat diverse data types with identical parameters. This innovative approach utilizes specialized routing pathways, directing input tokens to modality-appropriate experts based on whether they originate from speech or text, thereby enhancing both modality-specific learning and cross-modal understanding. The core of this breakthrough lies in the MAMoE architecture, which incorporates two key components: modality-specific expert groups and shared experts.
These modality-specific groups capture patterns unique to speech or text, while the shared experts facilitate crucial information transfer between the two modalities. Experiments demonstrate that this design allows the model to develop specialized processing capabilities for each input type, improving performance on tasks requiring integration of both speech and text data. Building on the MAMoE architecture, the team devised an efficient transformation pipeline that adapts a pre-trained Mixture of Experts language model through strategic post-training on readily available, open-source Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) datasets.
This pipeline first post-trains the model and then fine-tunes it on a carefully curated speech-text instruction dataset, resulting in a versatile and controllable system. Notably, the entire process relies exclusively on fully accessible, open-source data, distinguishing MoST from other prominent speech-text models that utilize proprietary datasets. This commitment to open science ensures reproducibility and wider accessibility for the research community. Comprehensive evaluations across ASR, TTS, audio language modeling, and spoken question answering benchmarks reveal that MoST consistently outperforms existing models with comparable parameter counts. Ablation studies confirm the significant contribution of the modality-specific routing mechanism and shared expert design to performance gains across all tested domains. To the best of the researchers’ knowledge, MoST represents the first fully open-source speech-text large language model built on a Mixture of Experts architecture, opening new avenues for research and development in multimodal artificial intelligence.
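The two-stage recipe just described can be summarized as a simple schedule. The sketch below is purely illustrative: the stage names, dataset identifiers, and purposes are hypothetical placeholders, since the article does not enumerate the specific open-source corpora used.

```python
# Hypothetical summary of MoST's two-stage adaptation recipe. Dataset names
# and stage details are illustrative placeholders, not reported values.
MOST_ADAPTATION_RECIPE = [
    {
        "stage": "speech-text post-training",
        "data": ["open-source ASR corpora (speech -> text pairs)",
                 "open-source TTS corpora (text -> speech pairs)"],
        "purpose": "adapt the pretrained MoE LLM to audio tokens via the modality-aware experts",
    },
    {
        "stage": "instruction fine-tuning",
        "data": ["curated speech-text instruction dataset"],
        "purpose": "make the model controllable across ASR, TTS, and spoken QA prompts",
    },
]

if __name__ == "__main__":
    for step, cfg in enumerate(MOST_ADAPTATION_RECIPE, start=1):
        print(f"Stage {step}: {cfg['stage']}")
        for source in cfg["data"]:
            print(f"  data: {source}")
        print(f"  purpose: {cfg['purpose']}")
```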
Modality-Aware Routing in Mixture of Experts
The research team engineered MoST (Mixture of Speech and Text), a novel multimodal large language model, to seamlessly integrate speech and text processing. This work pioneers a Modality-Aware Mixture of Experts (MAMoE) architecture, diverging from conventional multimodal models that apply identical parameters across diverse inputs. Instead, MoST employs specialized routing pathways, directing tokens to modality-appropriate experts determined by input type, thereby enhancing both modality-specific learning and cross-modal understanding. The system achieves this through modality-specific expert groups capturing domain-specific patterns and shared experts facilitating information transfer.
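To make this division of labor concrete, the following PyTorch sketch shows one way a layer with modality-specific expert groups and a shared expert could be wired together. The expert counts, dimensions, and toy top-1 dispatch are assumptions for illustration, not MoST's actual configuration.

```python
import torch
import torch.nn as nn

class ToyMAMoELayer(nn.Module):
    """Illustrative MAMoE-style layer: disjoint text/audio expert groups plus a
    shared expert that sees every token. Sizes and routing are placeholders."""

    def __init__(self, d_model=1024, d_ff=4096, n_text_experts=4, n_audio_experts=4):
        super().__init__()
        def make_expert():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.text_experts = nn.ModuleList([make_expert() for _ in range(n_text_experts)])
        self.audio_experts = nn.ModuleList([make_expert() for _ in range(n_audio_experts)])
        self.shared_expert = make_expert()                 # cross-modal pathway, all tokens
        self.router = nn.Linear(d_model, n_text_experts + n_audio_experts)

    def forward(self, tokens, is_audio):
        # tokens: (num_tokens, d_model); is_audio: (num_tokens,) bool, True for speech tokens
        shared_out = self.shared_expert(tokens)
        routed = []
        for token, audio_flag in zip(tokens, is_audio):
            group = self.audio_experts if audio_flag else self.text_experts
            offset = len(self.text_experts) if audio_flag else 0
            # Toy top-1 routing restricted to the token's own modality group
            local_logits = self.router(token)[offset:offset + len(group)]
            routed.append(group[int(local_logits.argmax())](token))
        return shared_out + torch.stack(routed)

# Example: 6 tokens, the first 3 text and the last 3 audio
layer = ToyMAMoELayer(d_model=32, d_ff=64)
x = torch.randn(6, 32)
mask = torch.tensor([False, False, False, True, True, True])
print(layer(x, mask).shape)  # torch.Size([6, 32])
```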
Scientists developed an efficient pipeline adapting a pretrained Mixture of Experts language model through strategic post-training on both Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) datasets. Following this, the model underwent fine-tuning using a carefully curated speech-text instruction dataset, ensuring data efficiency. A crucial aspect of this methodology is its exclusive reliance on fully accessible, open-source datasets, enabling reproducibility and wider accessibility for the research community. This approach allows MoST to consistently outperform existing models with comparable parameter counts across a range of benchmarks.
On the input side, the study’s audio processing component works directly on continuous audio waveforms using a frozen HuBERT encoder, whose features are projected to the model dimension; compared with discrete tokenization approaches, this preserves richer acoustic information. On the output side, audio waveforms are synthesized by a HifiGAN vocoder conditioned on predicted HuBERT tokens and speaker embeddings.
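As a rough illustration of this continuous front end, the snippet below extracts frozen HuBERT features with the Hugging Face transformers library and projects them to the language model's hidden size. The checkpoint name and projection width are assumptions, and the vocoder side (HiFi-GAN conditioned on predicted HuBERT units and a speaker embedding) is omitted; this is a sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoFeatureExtractor, HubertModel

# Assumed checkpoint and LLM width; MoST's actual choices may differ.
HUBERT_CHECKPOINT = "facebook/hubert-base-ls960"
LLM_HIDDEN_SIZE = 2048

feature_extractor = AutoFeatureExtractor.from_pretrained(HUBERT_CHECKPOINT)
hubert = HubertModel.from_pretrained(HUBERT_CHECKPOINT)
hubert.eval()
for p in hubert.parameters():        # the encoder stays frozen during training
    p.requires_grad = False

# Trainable projection from HuBERT's feature dimension to the LLM's hidden size
project = nn.Linear(hubert.config.hidden_size, LLM_HIDDEN_SIZE)

waveform = torch.randn(16000)        # stand-in for 1 second of 16 kHz audio
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    frames = hubert(inputs["input_values"]).last_hidden_state  # (1, num_frames, 768)
audio_tokens = project(frames)                                  # (1, num_frames, LLM_HIDDEN_SIZE)
print(audio_tokens.shape)
```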
The core of MoST’s innovation lies within the MAMoE layers, where the set of available experts is partitioned into disjoint modality-specific groups, text and audio, allowing specialized processing of each modality. To facilitate cross-modal understanding, the team incorporated shared experts as a parallel MLP block that processes all tokens and enables information exchange between modalities. The modality-aware router directs each token to appropriate experts based on content and modality, using a softmax function and a modality-specific mask to determine expert selection. Ablation results indicate that this routing mechanism, detailed in the paper’s Algorithm 1, ensures tokens are processed by the most relevant experts and contributes significantly to performance gains across all tested domains; together, these components make MoST the first fully open-source speech-text LLM built on a Mixture of Experts architecture.
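One plausible reading of this masked routing (a sketch, not a reproduction of the paper's Algorithm 1) is to compute router logits over all experts, suppress experts outside the token's modality group before the softmax, and keep the top-k surviving experts:

```python
import torch
import torch.nn.functional as F

def modality_aware_route(router_logits, modality_mask, top_k=2):
    """Sketch of modality-aware expert selection.

    router_logits: (num_tokens, num_experts) raw router scores.
    modality_mask: (num_tokens, num_experts) bool, True where the expert belongs
                   to the token's modality group. The top-k value and masking
                   details are assumptions, not MoST's exact algorithm.
    Returns (selected expert indices, normalized routing weights).
    """
    masked = router_logits.masked_fill(~modality_mask, float("-inf"))
    probs = F.softmax(masked, dim=-1)                       # masked experts get zero probability
    weights, indices = probs.topk(top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over selected experts
    return indices, weights

# Example: 2 tokens, 4 experts (0-1 text, 2-3 audio); token 0 is text, token 1 is audio
logits = torch.randn(2, 4)
mask = torch.tensor([[True, True, False, False],
                     [False, False, True, True]])
idx, w = modality_aware_route(logits, mask, top_k=2)
print(idx, w)
```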
MoST Achieves Superior Speech and Text Performance
Scientists achieved a significant breakthrough in multimodal large language models with the development of MoST (Mixture of Speech and Text), a novel system integrating speech and text processing. The core of this advancement lies in the Modality-Aware Mixture of Experts (MAMoE) architecture, which utilizes specialized routing pathways to direct input tokens to modality-appropriate experts, enhancing both modality-specific learning and cross-modal understanding. Experiments revealed that MoST consistently outperforms existing models with comparable parameter counts across a comprehensive suite of benchmarks, demonstrating its superior capabilities in handling both speech and text data. This work represents the first fully open-source speech-text LLM built on a Mixture of Experts architecture, with the model, training code, inference code, and training data all released for wider research.
Detailed evaluations across Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) tasks demonstrate MoST’s competitive performance, recording a Word Error Rate (WER) of 2.0% on the LibriSpeech-clean dataset and 3.7% on LibriSpeech-other. In more challenging cross-dataset scenarios, the team measured a WER of 6.2% on VoxPopuli-V1.0-en and 8.4% on Common Voice 15-en, showcasing strong generalization capabilities. Furthermore, MoST achieved state-of-the-art TTS performance with 6.0% WER on LibriSpeech-clean, 10.1% Character Error Rate (CER) on VoxPopuli-V1.0-en, and 11.5% CER on Common Voice 15-en, consistently surpassing all baseline models including MinMo and LLaMA-Omni2. Tests show that MoST’s effectiveness extends to audio language modeling, achieving an average accuracy of 71.94% across benchmarks including sWUGGY, sBLIMP, sTopic-StoryCloze, and sStoryCloze.
Notably, MoST attained state-of-the-art performance on sTopic-StoryCloze with an accuracy of 83.64%, demonstrating a nuanced understanding of audio-specific linguistic patterns. Measurements confirm strong performance on Spoken Question Answering (SQA) tasks, with scores of 74.8 (speech-to-text, S→T) and 62.6 (speech-to-speech, S→S) on Llama Q, and a particularly high 32.1 (S→S) on Trivia QA. MoST also delivers a best-in-class result on WebQ, achieving 58.2 (S→T) and 44.7 (S→S), significantly exceeding the performance of all other evaluated models.
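For context on the metrics above, word and character error rates compare a model's transcript (or a transcript of its synthesized speech) against a reference text. The open-source jiwer package computes both, as in this small illustration, which is not the authors' evaluation harness:

```python
import jiwer  # pip install jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER counts word-level substitutions, insertions, and deletions relative to the
# reference length; CER applies the same edit-distance measure at the character level.
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")
print(f"CER: {jiwer.cer(reference, hypothesis):.3f}")
```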
MoST Achieves Strong Multimodal Performance with MAMoE
The research presented introduces MoST, a novel multimodal large language model designed to integrate speech and text processing. Central to this work is the Modality-Aware Mixture of Experts architecture, or MAMoE, which employs specialized routing pathways to direct information to experts appropriate for each input modality. This approach allows for both enhanced modality-specific learning and improved cross-modal understanding through dedicated and shared expert groups. Evaluations across a range of benchmarks, including automatic speech recognition, text-to-speech synthesis, audio language modeling, and spoken question answering, demonstrate that MoST consistently achieves strong performance relative to existing models with similar parameter counts.
The model’s effectiveness is supported by ablation studies confirming the contribution of the modality-specific routing mechanism and the shared expert design. Importantly, MoST is the first fully open-source speech-text large language model built upon a Mixture of Experts foundation, with all code, model checkpoints and training data publicly available. The authors acknowledge that the initial expert partitioning strategy employed, an index-based 50% split, represents a relatively simple approach. Future work could explore more sophisticated methods, such as clustering or knowledge-preserving partitioning, to further refine the modality-aware Mixture of Experts architecture. The research team also highlights the potential for continued development and investigation into efficient and effective multimodal models, facilitated by the open-source release of MoST’s resources.
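The index-based split mentioned above amounts to assigning the first half of the pretrained experts to one modality and the second half to the other. A minimal sketch follows, with the expert count and the assignment of halves to modalities as assumed example values:

```python
def split_experts_by_index(num_experts: int):
    """Index-based 50% partition of pretrained experts into two modality groups.
    Which half serves text and which serves audio is an illustrative assumption."""
    half = num_experts // 2
    text_expert_ids = list(range(0, half))
    audio_expert_ids = list(range(half, num_experts))
    return text_expert_ids, audio_expert_ids

# Example with 8 experts per MoE layer (an assumed count, not MoST's reported size)
text_ids, audio_ids = split_experts_by_index(8)
print(text_ids)   # [0, 1, 2, 3]
print(audio_ids)  # [4, 5, 6, 7]
```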
👉 More information
🗞 MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts
🧠 ArXiv: https://arxiv.org/abs/2601.10272
