Researchers are tackling the challenge of efficiently representing audio and text together, a crucial step for building powerful audio-language models. Gokul Karthik Kumar, Ludovick Lepauloux, and Hakim Hacid, all from the Technology Innovation Institute in Abu Dhabi, introduce WavLink, a new model that cleverly integrates the popular Whisper audio encoder with a learnable global token and a text encoder. This approach overcomes limitations in existing audio-text embedding models, which haven’t fully utilised Whisper’s capabilities, and delivers state-of-the-art performance with significantly smaller embeddings, up to eight times smaller, in fact, without sacrificing accuracy. Their work, detailed through rigorous experimentation with training methods and data, promises to advance the scalability and effectiveness of audio-language technologies, as demonstrated by strong results on the AIR-Bench benchmark.
The research team achieved state-of-the-art retrieval performance through a systematic exploration of key design choices, including pretrained text encoders, various loss functions, training methodologies, and diverse data mixtures. Furthermore, a novel two-stage training recipe, combined with Matryoshka-style supervision, dramatically improves scalability, enabling the creation of embeddings that are eight times smaller with minimal compromise in accuracy. This Matryoshka supervision trains embeddings to remain effective even when truncated to lower dimensions, facilitating multi-resolution representations and enhancing efficiency.
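To make the multi-resolution idea concrete, a Matryoshka-trained embedding can simply be cut down to its leading coordinates and re-normalised before similarity search. The sketch below illustrates that inference-time truncation; the dimensions and helper name are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def truncate_embedding(emb: torch.Tensor, dim: int) -> torch.Tensor:
    # Keep the first `dim` coordinates of a Matryoshka-trained embedding
    # and re-normalize so cosine similarity stays well defined.
    return F.normalize(emb[..., :dim], dim=-1)

# Illustrative usage: full 768-d embeddings truncated to 96-d (8x smaller).
audio_emb = F.normalize(torch.randn(4, 768), dim=-1)  # stand-in audio embeddings
text_emb = F.normalize(torch.randn(4, 768), dim=-1)   # stand-in text embeddings
scores = truncate_embedding(audio_emb, 96) @ truncate_embedding(text_emb, 96).T
```

Because the supervision described below is applied at every nested dimension during training, the truncated vectors remain meaningful for retrieval rather than being arbitrary slices of a full-resolution space.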
Experiments demonstrate WavLink’s competitive performance on the challenging AIR-Bench benchmark, excelling in both Multiple Choice Questions (MCQs) and zero-shot classification tasks. Specifically, WavLink surpasses prior CLAP variants in audio retrieval tasks using datasets like AudioCaps and Clotho, while simultaneously achieving accuracy comparable to much larger audio-LLMs, such as Qwen2-Audio and Falcon3-Audio, on complex question-answering scenarios. Notably, this work establishes that sub-100-dimensional embeddings, enabled by Matryoshka supervision, can maintain competitive performance levels, a significant breakthrough in compact audio-text representation.
How the researchers built WavLink
The research team addressed a methodological divide between audio-LLMs and embedding models by adapting Whisper, typically used for frame-level features in LLMs, for generating single, compact audio representations. Instead of the standard 1500 frame-level tokens produced by Whisper for a 30-second clip, WavLink outputs a single representation, significantly reducing storage and similarity search costs. The study pioneered a novel approach by appending a learnable classification token, a_cls, to the hidden states H_0 derived from log-Mel features X, processed by a convolutional front-end, and propagating this extended sequence through Whisper’s Transformer stack. The final state of this token then serves as the pooled audio representation, z_a, effectively condensing the audio information.
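As a rough illustration of this global-token pooling, the sketch below appends one learnable token to a sequence of frame-level features and reads off its final state as the pooled audio embedding. A generic PyTorch TransformerEncoder stands in for Whisper’s encoder stack here, and all hyperparameters are placeholders rather than the authors’ configuration.

```python
import torch
import torch.nn as nn

class GlobalTokenPooler(nn.Module):
    """Minimal sketch of the global-token idea: append a learnable token to the
    frame-level hidden states and use its final state as the pooled audio
    embedding. A generic encoder stands in for Whisper's Transformer stack."""

    def __init__(self, d_model: int = 768, n_layers: int = 2, n_heads: int = 8):
        super().__init__()
        self.a_cls = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)  # learnable global token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, frame_states: torch.Tensor) -> torch.Tensor:
        # frame_states: (batch, T, d_model), e.g. the post-conv hidden states H_0
        cls = self.a_cls.expand(frame_states.size(0), -1, -1)
        extended = torch.cat([frame_states, cls], dim=1)  # append the global token
        out = self.encoder(extended)
        return out[:, -1]  # final state of the appended token -> z_a

pooler = GlobalTokenPooler()
z_a = pooler(torch.randn(2, 1500, 768))  # 1500 frames for a 30-second clip
print(z_a.shape)                         # torch.Size([2, 768])
```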
Text inputs are encoded using either CLIP or ModernBERT, yielding pooled text features z_t, and both modalities are subsequently mapped to a shared embedding space via linear projectors followed by L2 normalization, creating u_a and u_t respectively. This architecture enables the creation of compact audio-text embeddings without sacrificing crucial audio information. The team further enhanced scalability through a two-stage training recipe incorporating Matryoshka-style supervision for multi-resolution embeddings, allowing for 8x smaller embeddings with minimal performance degradation. Researchers harnessed the power of both CLIP loss, utilizing InfoNCE with a learnable temperature τ, and SigLIP loss, a sigmoid-based variant applying binary cross-entropy, to optimize the embedding space. WavLink’s performance was rigorously evaluated across retrieval tasks using AudioCaps and Clotho, zero-shot classification with VGGSound, ESC-50, and US8K, and multiple-choice question answering using AIR-Bench, demonstrating competitive accuracy compared to larger audio-LLMs like Qwen2-Audio and Falcon3-Audio.
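The two objectives can be sketched roughly as follows, assuming L2-normalized embeddings u_a and u_t and the CLIP-style convention of a learnable log-temperature; the bias initialisation, batch size, and exact reductions are illustrative assumptions, not the paper’s settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def clip_loss(u_a, u_t, log_temp):
    # Symmetric InfoNCE over normalized audio/text embeddings with a
    # learnable temperature, in the spirit of CLIP (sketch, not the authors' code).
    logits = (u_a @ u_t.T) * log_temp.exp()
    targets = torch.arange(u_a.size(0), device=u_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

def siglip_loss(u_a, u_t, log_temp, bias):
    # Sigmoid-based variant: binary cross-entropy on every audio-text pair,
    # matched pairs labelled positive and all others negative.
    logits = (u_a @ u_t.T) * log_temp.exp() + bias
    labels = torch.eye(u_a.size(0), device=u_a.device)  # 1 on the diagonal, 0 elsewhere
    return F.binary_cross_entropy_with_logits(logits, labels)

# Illustrative usage with random, normalized embeddings.
u_a = F.normalize(torch.randn(8, 768), dim=-1)
u_t = F.normalize(torch.randn(8, 768), dim=-1)
log_temp = nn.Parameter(torch.tensor(2.3))   # learnable log-temperature (assumed init)
bias = nn.Parameter(torch.tensor(-10.0))     # SigLIP-style bias term (assumed init)
print(clip_loss(u_a, u_t, log_temp).item(), siglip_loss(u_a, u_t, log_temp, bias).item())
```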
WavLink achieves compact audio-text embeddings efficiently
The research addresses a methodological divide between audio-Large Language Models and embedding models, traditionally utilising different audio encoders despite advancements in both fields. Experiments revealed that WavLink successfully produces a single audio representation, significantly reducing storage and similarity search costs compared to the 1500 frame-level tokens typically generated by Whisper. Results demonstrate that WavLink surpasses prior CLAP variants in retrieval tasks, achieving state-of-the-art performance. Measurements confirm that WavLink achieves competitive accuracy on multiple-choice Question Answering using the AIR-Bench benchmark, even when compared to much larger audio-LLMs such as Qwen2-Audio and Falcon3-Audio.
The WavLink models come in several sizes: the Large model contains 761 million parameters (637M audio + 123M text) and supports embedding dimensions of 768, 384, 192, and 96, while the smaller Small (152M parameters) and Base (84M parameters) variants support dimensions of 512, 256, 128, and 64, demonstrating scalability. Tests show that sub-100-dimensional embeddings, enabled by Matryoshka supervision, can retain competitive performance, a groundbreaking result in the field. The Matryoshka loss adaptation trains embeddings to remain useful even when truncated to smaller dimensions, applying a multi-level contrastive loss across sliced embeddings at nested dimensions d_1, d_2, …, d_K to produce nested embeddings at multiple scales.
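A minimal sketch of that multi-level supervision is shown below, assuming a symmetric InfoNCE objective applied to each nested prefix of the Large model’s 768-dimensional embeddings and averaged with equal weights; the actual objective and weighting used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def matryoshka_clip_loss(u_a, u_t, dims=(768, 384, 192, 96), log_temp=None):
    # Apply the same contrastive objective at every nested dimension so that
    # truncated embeddings stay useful (sketch; equal weighting is an assumption).
    if log_temp is None:
        log_temp = torch.tensor(2.3)
    targets = torch.arange(u_a.size(0), device=u_a.device)
    total = 0.0
    for d in dims:
        a = F.normalize(u_a[:, :d], dim=-1)  # slice to the first d coordinates
        t = F.normalize(u_t[:, :d], dim=-1)  # and re-normalize before scoring
        logits = (a @ t.T) * log_temp.exp()
        total = total + 0.5 * (F.cross_entropy(logits, targets) +
                               F.cross_entropy(logits.T, targets))
    return total / len(dims)

loss = matryoshka_clip_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```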
WavLink achieves compact, state-of-the-art audio embeddings for versatile tasks
Through systematic investigation of various design choices, including pretrained text encoders, loss functions, training modes, and data mixtures, researchers identified configurations achieving state-of-the-art performance in audio retrieval. A two-stage training process, coupled with Matryoshka-style supervision, enhances scalability, allowing for 8x smaller embeddings with minimal impact on performance. WavLink demonstrates competitive results on benchmarks like AIR-Bench with multiple-choice questions and zero-shot classification tasks, alongside strong performance on VGGSound. The authors acknowledge limitations related to the scope of datasets used and suggest future work could extend WavLink to multilingual audio-text alignment and explore the global token mechanism for audio-language models to reduce computational cost and improve generalisation. These findings underscore the potential of Whisper beyond speech recognition, offering an efficient approach to representation learning in the audio-text domain.
👉 More information
🗞 WavLink: Compact Audio–Text Embeddings with a Global Whisper Token
🧠 ArXiv: https://arxiv.org/abs/2601.15118
