A new system integrating wav2vec 2.0 with the Phi-4 multimodal large language model achieves a root mean square error (RMSE) of 0.375 in spoken language assessment, securing second place in the Speak & Improve Challenge 2025. This performance surpasses both the official baseline (RMSE 0.444) and the third-ranked system (RMSE 0.384).
Automated assessment of spoken language proficiency represents a complex challenge, requiring systems to integrate both linguistic content and acoustic characteristics of speech. Researchers are increasingly employing neural network architectures, such as BERT and wav2vec 2.0, to address this need, though each approach possesses inherent limitations in fully capturing the nuances of oral competence. A team led by Hong-Yun Lin, Tien-Hong Lo, Yu-Hsuan Fang, Jhen-Ke Lin, Chung-Chun Wang, Hao-Chien Lu, and Berlin Chen, all from the Department of Computer Science and Information Engineering and the Institute of AI Interdisciplinary Applied Technology at National Taiwan Normal University, detail their approach to overcoming these challenges in the article, ‘The NTNU System at the S&I Challenge 2025 SLA Open Track’. Their system combines the acoustic strengths of wav2vec 2.0 with the multimodal capabilities of Phi-4, achieving a root mean square error of 0.375 and securing second place in the Speak & Improve Challenge 2025.
Acoustic-Semantic Fusion Enhances Automated Speaking Assessment
Automated speaking assessment (ASA) systems are increasingly employed to evaluate non-native language proficiency. Current approaches typically integrate acoustic and linguistic modalities, recognising the limitations of relying solely on either acoustic features or transcribed text. While models such as wav2vec 2.0 and BERT have shown promise, each possesses inherent drawbacks. BERT-based methods depend on automatic speech recognition (ASR) transcripts, potentially overlooking crucial prosodic information – the rhythmic and tonal aspects of speech. Conversely, wav2vec 2.0 excels at extracting acoustic features but lacks semantic understanding – the meaning conveyed by language. This work addresses these limitations by fusing representations from wav2vec 2.0 with the Phi-4 multimodal large language model (MLLM), creating a system that leverages the strengths of both acoustic and linguistic analysis.
The proposed system achieves a root mean square error (RMSE) of 0.375 on the Speak & Improve Challenge 2025 test set, positioning it competitively within the field: it outperforms both the official baseline (RMSE 0.444) and the third-ranked system (RMSE 0.384), and trails only the top-ranked system (RMSE 0.364).
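For context, RMSE is the standard metric here: the square root of the average squared difference between predicted and human-assigned scores, so lower values indicate more accurate grading.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}
$$

where $\hat{y}_i$ is the system's predicted proficiency score for response $i$ and $y_i$ is the corresponding human reference score.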
The score fusion strategy is critical to leveraging the complementary strengths of the two models: by integrating acoustic features with semantic understanding, the system delivers a more holistic evaluation of spoken language and a more accurate reflection of a learner's abilities.
Researchers initially explored various fusion techniques – including early fusion, late fusion, and decision-level fusion – before settling on a weighted averaging approach for score fusion. The weighting coefficients were optimised using a validation set, ensuring appropriately balanced contributions from each modality.
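As an illustration of that final strategy, the sketch below shows weighted score-level fusion with the weight chosen by a simple grid search on a validation set. The function names, the 0-1 weight grid, and the toy scores are assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def rmse(predictions, targets):
    """Root mean square error between predicted and reference scores."""
    return float(np.sqrt(np.mean((predictions - targets) ** 2)))

def fuse_scores(acoustic_scores, semantic_scores, weight):
    """Weighted average of the two graders' scores (weight applied to the acoustic branch)."""
    return weight * acoustic_scores + (1.0 - weight) * semantic_scores

def tune_fusion_weight(val_acoustic, val_semantic, val_labels,
                       grid=np.linspace(0.0, 1.0, 101)):
    """Pick the fusion weight that minimises RMSE on a held-out validation set."""
    best_weight, best_err = 0.5, float("inf")
    for w in grid:
        err = rmse(fuse_scores(val_acoustic, val_semantic, w), val_labels)
        if err < best_err:
            best_weight, best_err = w, err
    return best_weight, best_err

# Example usage with toy scores on a 1-6 proficiency scale (illustrative values only):
val_acoustic = np.array([3.2, 4.1, 2.8, 5.0])
val_semantic = np.array([3.6, 3.9, 3.1, 4.6])
val_labels   = np.array([3.5, 4.0, 3.0, 4.8])
w, err = tune_fusion_weight(val_acoustic, val_semantic, val_labels)
print(f"best weight = {w:.2f}, validation RMSE = {err:.3f}")
```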
The system processes audio input by first extracting acoustic features using wav2vec 2.0, capturing information about pronunciation, fluency, and prosody. Simultaneously, the audio is processed by an ASR module to generate a text transcript, which is then fed into the Phi-4 MLLM for semantic analysis. The MLLM assesses grammatical correctness, vocabulary usage, and overall coherence, providing a comprehensive understanding of the content.
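A minimal sketch of the two branches appears below, assuming a mean-pooled wav2vec 2.0 encoder with a linear regression head for the acoustic grader and a plain text prompt to a chat-style MLLM for the semantic grader. The checkpoint name, pooling choice, scoring scale, and prompt wording are illustrative assumptions rather than the system described in the paper.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class AcousticGrader(nn.Module):
    """Regresses a proficiency score from mean-pooled wav2vec 2.0 frame representations."""
    def __init__(self, checkpoint="facebook/wav2vec2-base-960h"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_values):
        # input_values: (batch, samples) raw waveform
        frames = self.encoder(input_values).last_hidden_state  # (batch, frames, hidden)
        pooled = frames.mean(dim=1)                            # utterance-level embedding
        return self.head(pooled).squeeze(-1)                   # one predicted score per utterance

def build_semantic_prompt(transcript):
    """Illustrative prompt asking an MLLM to grade grammar, vocabulary, and coherence."""
    return (
        "You are an examiner of spoken English. Rate the following learner response "
        "on a 1-6 scale, considering grammatical accuracy, vocabulary range, and coherence. "
        "Reply with a single number.\n\n"
        f"Transcript: {transcript}"
    )
```

In this sketch the ASR transcript would be passed through build_semantic_prompt to the MLLM, and the two resulting scores would then be combined with the score fusion step shown earlier.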
Phi-4’s ability to understand context and nuance proves particularly valuable in assessing speaking proficiency, allowing it to differentiate between subtle errors and natural variations in language use. This combined analysis allows the system to identify areas where the learner excels and areas requiring improvement.
The system’s performance was evaluated on a diverse dataset of non-native English speakers, representing a wide range of language backgrounds and proficiency levels. The dataset included both scripted and spontaneous speech samples, so that performance could be assessed across a range of speaking contexts. The researchers also analysed the results for potential biases to check that the system performs fairly across learner populations.
Further research should address the challenges of data scarcity and imbalance, particularly for under-represented language learner populations. Developing robust methods for mitigating adversarial attacks remains crucial for ensuring the reliability and security of ASA systems.
Ultimately, the goal is to create an ASA system that can provide personalised feedback to learners, helping them to improve their speaking skills and achieve their language learning goals. This requires a system that is not only accurate but also reliable, fair, and accessible.
The development of this acoustic-semantic fusion system represents a step forward in automated speaking assessment: by combining acoustic and linguistic analysis, it provides a more comprehensive and accurate evaluation of speaking proficiency than either modality alone.
More broadly, the research demonstrates the potential of combining modalities to build more capable AI systems. The same approach has implications beyond language learning, for applications such as speech recognition, natural language understanding, and human-computer interaction.
👉 More information
🗞 The NTNU System at the S&I Challenge 2025 SLA Open Track
🧠 DOI: https://doi.org/10.48550/arXiv.2506.05121
