Respiratory sound analysis holds immense promise for objective and accurate screening of conditions like asthma, but current diagnostic methods often rely on subjective listening skills. Theodore Aptekarev, Vladimir Sokolovsky, and Gregory Furman, all from Ben Gurion University of the Negev, Israel, alongside Aptekarev’s affiliation with HSE University, Russia, present compelling research demonstrating the power of transformer architectures in this field. They adapted the Audio Spectrogram Transformer (AST) and developed a multimodal Vision-Language Model (VLM) to analyse respiratory sounds and integrate patient data, achieving remarkable results, with the AST reaching 97% accuracy in asthma detection and significantly surpassing existing benchmarks. This work not only confirms the effectiveness of self-attention mechanisms for acoustic screening, but also paves the way for more holistic and clinically informed diagnostic tools, potentially revolutionising pulmonary disease management.
The study establishes a new benchmark for acoustic screening, surpassing both internal CNN baselines and typical external benchmarks with remarkable performance. Specifically, the team fine-tuned the AST model, initialized from publicly available weights, on a medical dataset containing hundreds of recordings per diagnosis, resulting in approximately 97% accuracy, a 97% F1-score, and an impressive ROC AUC of 0.98 for asthma detection.
This achievement significantly outperforms their previously established CNN baseline, which utilized a DenseNet201 architecture, and highlights the effectiveness of self-attention mechanisms in processing acoustic signals. Furthermore, the researchers explored multimodal diagnosis by integrating spectrograms with structured patient metadata, including sex, age, and recording site, using a compact Moondream-type VLM. The VLM experiment yielded 86-87% accuracy, demonstrating comparable performance to the CNN baseline while uniquely showcasing the ability to incorporate crucial clinical context into the diagnostic process. This multimodal approach mirrors the holistic assessment performed by clinicians, synthesizing acoustic data with patient-specific information for a more comprehensive evaluation.
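As a rough illustration of the fine-tuning described above, the sketch below adapts a publicly released AST checkpoint to a two-class (healthy vs. asthma) head using the Hugging Face transformers library. The checkpoint name, learning rate, and training loop are assumptions for illustration, not the authors’ exact configuration.

```python
# Minimal sketch: fine-tuning a pretrained Audio Spectrogram Transformer (AST)
# for binary asthma-vs-healthy classification. Checkpoint, label set, and
# hyperparameters are illustrative assumptions, not the authors' exact setup.
import torch
from transformers import ASTFeatureExtractor, ASTForAudioClassification

CHECKPOINT = "MIT/ast-finetuned-audioset-10-10-0.4593"  # assumed public AST starting weights

feature_extractor = ASTFeatureExtractor.from_pretrained(CHECKPOINT)
model = ASTForAudioClassification.from_pretrained(
    CHECKPOINT,
    num_labels=2,                  # healthy vs. asthma
    ignore_mismatched_sizes=True,  # swap the AudioSet head for a fresh 2-class head
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(waveforms, labels):
    """One fine-tuning step on a batch of 16 kHz mono waveforms (list of numpy arrays)."""
    inputs = feature_extractor(waveforms, sampling_rate=16000, return_tensors="pt")
    outputs = model(input_values=inputs["input_values"], labels=torch.tensor(labels))
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```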
The work opens exciting possibilities for developing holistic diagnostic tools that can improve asthma screening, particularly in settings where access to specialized care is limited. Experiments confirm that the AST model’s performance stems from its ability to effectively process and interpret the complex patterns within respiratory sound spectrograms, while the VLM’s success underscores the value of combining acoustic analysis with patient demographics. Recordings were gathered from individuals aged 0 to 47 years, encompassing healthy volunteers and patients with confirmed respiratory diagnoses, notably asthma diagnosed according to GINA guidelines. AST, initialized from publicly available weights, underwent fine-tuning on the medical dataset, processing hundreds of recordings per diagnosis to optimise performance. The team then developed a multimodal VLM that integrates spectrograms with structured patient metadata, specifically sex, age, and recording site, to generate a JSON-formatted diagnosis. This VLM utilises a compact Moondream-type model, enabling simultaneous processing of spectrogram images and textual prompts.
During preprocessing, the original audio files, stored as WAV, MP3, or M4A, were harmonised to a consistent WAV format, with quality labels and automated defect flags used to exclude substandard recordings. Recordings, averaging 25 seconds in duration, were captured at four anatomical points (mouth, trachea, right second intercostal space, and right paravertebral area) using diverse hardware ranging from specialized systems with external microphones to mobile phones, all validated for amplitude-frequency linearity in the 100–3000 Hz range. The AST achieved approximately 97% accuracy, with an F1-score around 97% and a ROC AUC of 0.98. These measurements confirm the effectiveness of self-attention mechanisms for acoustic screening of respiratory conditions, surpassing both an internal CNN baseline and typical external benchmarks. The team evaluated the AST model’s performance on a strictly controlled dataset, ensuring a fair comparison and isolating the impact of the model architecture.
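A minimal sketch of this kind of preprocessing follows: mixed-format recordings are decoded, resampled to mono, written back as WAV, and converted to a log-mel spectrogram restricted to the validated 100–3000 Hz band. The sample rate and mel parameters are illustrative assumptions rather than the study’s exact settings.

```python
# Sketch of the harmonisation and spectrogram step described above; sample rate,
# mel settings, and band limits are assumptions for illustration.
import librosa
import numpy as np
import soundfile as sf

def harmonise_to_wav(path_in: str, path_out: str, sr: int = 16000) -> np.ndarray:
    """Decode WAV/MP3/M4A, resample to mono `sr`, and write a WAV copy."""
    y, _ = librosa.load(path_in, sr=sr, mono=True)  # librosa handles mixed input formats
    sf.write(path_out, y, sr)
    return y

def log_mel_spectrogram(y: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Log-mel spectrogram limited to the 100-3000 Hz band mentioned in the text."""
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=128,
        fmin=100, fmax=3000,
    )
    return librosa.power_to_db(mel, ref=np.max)
```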
Further research involved a multimodal Vision-Language Model (VLM) that integrates spectrograms with structured patient metadata, including sex, age, and recording site. The VLM reached 86-87% accuracy, performing comparably to the CNN baseline while uniquely demonstrating the capability to incorporate clinical context into the diagnostic process. This compact Moondream-type model outputs a JSON-formatted diagnosis, streamlining the interpretation of results and potentially aiding in clinical decision-making. Data shows that combining acoustic information with patient characteristics enhances the model’s ability to provide a holistic assessment.
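To make this multimodal step concrete, the sketch below assumes a generic image-plus-prompt interface: patient metadata is serialised into a text prompt, passed with the spectrogram image to the model, and the JSON reply is parsed. The `run_vlm` callable is a placeholder for whatever compact Moondream-type model is used; its real API is not shown here.

```python
# Sketch of prompt construction and JSON parsing for the multimodal VLM step.
# `run_vlm` is an assumed placeholder for the actual model call, not a real API.
import json
from PIL import Image

def build_prompt(sex: str, age: int, site: str) -> str:
    return (
        "You are given a respiratory sound spectrogram.\n"
        f"Patient metadata: sex={sex}, age={age} years, recording site={site}.\n"
        'Reply with JSON only, e.g. {"diagnosis": "asthma"} or {"diagnosis": "healthy"}.'
    )

def diagnose(run_vlm, spectrogram_png: str, sex: str, age: int, site: str) -> dict:
    """Query the VLM with image + prompt and parse the structured JSON diagnosis."""
    image = Image.open(spectrogram_png)
    raw = run_vlm(image, build_prompt(sex, age, site))  # assumed callable: (image, prompt) -> str
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"diagnosis": "unparseable", "raw": raw}
```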
Measurements confirm that the AST model’s high performance stems from its ability to effectively process and interpret audio spectrograms, visual representations of sound frequencies over time. The study utilized an anonymized database of 1371 subjects, ranging in age from 0 to 47 years, including both healthy individuals and patients with confirmed asthma diagnoses according to GINA guidelines. Researchers collected and annotated the data at the Regional Children’s Clinical Hospital of Perm Krai, adhering to strict ethical protocols and the Declaration of Helsinki. Results demonstrate the potential of these advanced deep learning techniques to revolutionize respiratory sound analysis, offering an objective and accurate tool for asthma screening and diagnosis. The breakthrough delivers a promising pathway towards improved telemedicine and access to specialized care, particularly in remote regions where traditional diagnostic methods may be limited. AST achieved approximately 97% accuracy and a 97% F1-score in detecting asthma, substantially exceeding existing benchmarks and the performance of a prior CNN baseline. Furthermore, the VLM, while achieving comparable accuracy (86-87%) to the CNN baseline, successfully incorporated clinical context (patient sex, age, and recording site) into the diagnostic process, outputting structured JSON diagnoses.
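For readers who want to reproduce this style of evaluation, the sketch below shows how accuracy, F1-score, and ROC AUC can be computed from held-out predictions with scikit-learn; the decision threshold and label encoding are assumptions, not details taken from the paper.

```python
# Sketch of computing the reported metrics from held-out predictions;
# threshold and label encoding (1 = asthma, 0 = healthy) are assumptions.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate(y_true: np.ndarray, asthma_probs: np.ndarray, threshold: float = 0.5) -> dict:
    """y_true: ground-truth labels; asthma_probs: model probability of asthma."""
    y_pred = (asthma_probs >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, asthma_probs),
    }
```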
This capability suggests potential for integration into clinical decision support systems, mirroring how clinicians utilise contextual information. The study confirms the effectiveness of self-attention mechanisms in capturing complex acoustic patterns indicative of asthmatic breathing, even from short audio clips. The key finding is that AST currently offers superior diagnostic accuracy for asthma screening via audio analysis. However, the successful implementation of the VLM highlights the potential of multimodal approaches for developing more comprehensive diagnostic tools. Authors acknowledge that the VLM’s performance did not surpass the CNN baseline in this instance, indicating further refinement is needed to fully leverage the benefits of multimodal integration. Future research could focus on optimising the VLM architecture and exploring additional clinical metadata to enhance its diagnostic capabilities, potentially leading to more nuanced and accurate assessments of respiratory conditions.
👉 More information
🗞 Transformer Architectures for Respiratory Sound Analysis and Multimodal Diagnosis
🧠 ArXiv: https://arxiv.org/abs/2601.14227
