The effective distillation of complex medical texts into concise summaries presents a considerable challenge for artificial intelligence. Recent advances utilising large language models (LLMs) demonstrate promise, yet performance frequently diminishes when encountering specialised terminology or novel concepts – instances where the model’s pre-existing vocabulary proves inadequate. Researchers at the Indian Institute of Technology Kharagpur – Gunjan Balde, Soumyadeep Roy, Mainack Mondal, and Niloy Ganguly – investigated this limitation, focusing on the impact of ‘out-of-vocabulary’ (OOV) words on LLM performance. Their work, detailed in ‘Evaluation of LLMs in Medical Text Summarization: The Role of Vocabulary Adaptation in High OOV Settings’, benchmarks LLMs under challenging conditions and explores vocabulary adaptation strategies to enhance summarisation accuracy and relevance, validated through both quantitative analysis and expert human evaluation.
Large language models (LLMs) exhibit increasing proficiency in natural language processing, yet their performance often declines when applied to specialised domains such as biomedicine. This limitation stems from the prevalence of out-of-vocabulary (OOV) words – terms the model hasn’t encountered during training – and the nuanced language characteristic of scientific literature, necessitating strategies for effective adaptation. Researchers investigated the combined benefits of continual pretraining and vocabulary adaptation to improve LLM performance on biomedical text summarisation tasks, revealing substantial gains in accuracy and relevance.
LLMs, trained on extensive corpora of general text, frequently encounter unfamiliar terminology when processing biomedical literature, hindering their ability to generate coherent and accurate summaries. To overcome this, researchers explore methods to expand the model’s vocabulary and refine its understanding of biomedical concepts, focusing on continual pretraining and vocabulary adaptation as key strategies. Continual pretraining involves further training the LLM on a corpus of biomedical text, allowing it to learn the specific language patterns and terminology prevalent in the domain.
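The fragmentation problem can be made concrete with a toy example. The sketch below (an illustration only, not any real model's tokenizer) uses a greedy longest-match subword tokenizer over a small hand-picked vocabulary: an unseen biomedical term shatters into short fragments, while adding it to the vocabulary restores a single meaningful token.

```python
# Toy illustration (not a real model's tokenizer): greedy longest-match
# subword segmentation over a small, hand-picked general-domain vocabulary.

def tokenize(word, vocab):
    """Greedily match the longest vocabulary entry at each position;
    unmatched characters pass through as single-character fallbacks."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # single-character fallback
            i += 1
    return pieces

general_vocab = {"hyper", "tension", "ne", "o", "nat", "al"}
print(tokenize("neonatal", general_vocab))                  # four fragments
print(tokenize("neonatal", general_vocab | {"neonatal"}))   # one token
```

The fragmented form forces the model to reassemble the term's meaning from pieces carrying little biomedical signal, which is precisely what vocabulary adaptation is meant to avoid.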
Vocabulary adaptation directly addresses the issue of OOV words by expanding the model’s vocabulary to include biomedical terms, enabling it to better process and understand specialised texts. The research focuses on the development and evaluation of a novel vocabulary adaptation method, Scaf-Fix, alongside an investigation into different continual pretraining strategies, including End-to-End and Two-Stage approaches. These techniques aim to enhance the model’s ability to capture critical information, reduce the tendency to extract sentences from the beginning of documents (known as LEAD-bias), and generate more accurate and relevant summaries of biomedical literature. The study employs established metrics, such as Rouge-L – a measure of overlap between generated and reference summaries – to quantitatively assess performance, and incorporates human evaluation by medical experts to validate practical benefits.
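To make the evaluation metric concrete, here is a minimal token-level Rouge-L (F-measure) implementation. It is a sketch for exposition; the paper presumably relies on an established implementation such as the `rouge-score` package rather than hand-rolled code.

```python
# Minimal Rouge-L: F1 of LCS-based precision and recall over word tokens.

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference):
    """Rouge-L F1: harmonic mean of LCS/len(candidate) (precision)
    and LCS/len(reference) (recall)."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(c), lcs / len(r)
    return 2 * p * rec / (p + rec)

print(round(rouge_l("the biomarker was elevated",
                    "the key biomarker was strongly elevated"), 3))  # 0.8
```

Because it rewards long in-order overlaps rather than exact n-gram matches, Rouge-L tolerates paraphrase to a degree, which is why it is a standard choice for summarisation.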
The research demonstrates that Scaf-Fix consistently achieves higher Rouge-L scores than baseline models on both the Evidence-Based Medicine (EBM) and PubMedQA datasets. The improvement is particularly pronounced for texts with a high concentration of OOV words, which reaches 17.72% and 47.09% in the two tested datasets. Scaf-Fix also mitigates LEAD-bias and improves the capture of crucial information, such as key biomarkers, yielding more informative and accurate summaries. The study highlights the importance of addressing vocabulary mismatch in biomedical text summarisation, demonstrating that expanding the model’s vocabulary to include specialised terms significantly improves performance.
Two continual pretraining strategies, End-to-End and Two-Stage, further refine model performance, optimising the model for the specific nuances of the biomedical domain. In the End-to-End approach, the model is trained on a combined corpus of general and biomedical text; in the Two-Stage approach, it is first pretrained on general text and then fine-tuned on biomedical text. The research confirms the value of continued training on new data, exploring different combinations of these techniques to maximise performance.
Human evaluation by medical experts validates the quantitative results, demonstrating that the proposed techniques genuinely improve the quality of biomedical text summarisation. The consistent alignment between quantitative and qualitative results strengthens the validity of the research findings.
Future research should explore the transferability of Scaf-Fix to other specialised domains beyond biomedicine, investigating its effectiveness in areas such as law, finance, and engineering. Exploring more advanced vocabulary adaptation strategies, such as subword tokenisation – breaking words into smaller units – and contextual embeddings – representing words based on their surrounding context – could further improve the model’s ability to handle OOV words and capture nuanced meanings.
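Since the authors point to subword tokenisation as a direction, a compact sketch of byte-pair-encoding (BPE) merge learning on a tiny toy corpus may be useful; real tokenizers (e.g. SentencePiece) add many refinements beyond this.

```python
# Sketch of BPE merge learning: repeatedly fuse the most frequent
# adjacent symbol pair across the corpus. Toy corpus is illustrative only.
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words."""
    corpus = Counter(tuple(w) for w in words)  # each word as a symbol tuple
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():  # apply the merge everywhere
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

corpus = ["cardia", "cardiac", "carditis", "cardio"]
print(learn_bpe(corpus, 3))  # successively builds up the shared stem "card"
```

Learned this way on biomedical text, frequent stems such as "card" become single vocabulary units, so new domain terms fragment far less even when the full word was never seen in training.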
Expanding human evaluation studies to encompass a wider range of medical experts and clinical scenarios would strengthen the validation of these techniques, ensuring that the proposed approach is effective in real-world applications. Developing automated methods for evaluating the quality of biomedical summaries, such as using knowledge graphs and ontologies, could reduce the reliance on human evaluation and accelerate the development of new text summarisation systems. The research contributes to the growing body of knowledge on natural language processing and its application to the biomedical domain, paving the way for more effective and efficient information retrieval and knowledge discovery in healthcare.
In conclusion, the research demonstrates the significant benefits of combining continual pretraining and vocabulary adaptation for biomedical text summarisation. The proposed techniques, particularly Scaf-Fix, consistently outperform baseline models, generating more accurate, relevant, and informative summaries. The findings highlight the importance of addressing vocabulary mismatch and adapting LLMs to the specific language patterns of specialised domains. The research provides a valuable framework for developing and evaluating text summarisation systems in biomedicine and other domains, ultimately contributing to the advancement of knowledge discovery and information retrieval in healthcare and beyond.
👉 More information
🗞 Evaluation of LLMs in Medical Text Summarization: The Role of Vocabulary Adaptation in High OOV Settings
🧠 DOI: https://doi.org/10.48550/arXiv.2505.21242
