Endotracheal suctioning (ES), a vital yet invasive clinical procedure, currently lacks robust automated training tools, particularly for skill development and risk mitigation in unsupervised environments. Researchers Hoang Khang Phan (Ho Chi Minh City University of Technology), Quang Vinh Dang (University of Massachusetts Amherst), and Noriyo Colley (Hokkaido University), together with colleagues, present a novel framework that leverages Large Language Models (LLMs) for video-based activity recognition and explainable feedback generation. This work is significant because it moves beyond simple recognition to provide trainees with natural language guidance, translating complex technical data into actionable insights. Their LLM-centered approach demonstrably outperforms conventional machine learning and deep learning models, achieving a 15-20% improvement in accuracy and F1 score, and establishes a scalable foundation for improved nursing education and patient safety.
This research addresses a significant gap in training and assessment, particularly in settings like home care and education where consistent expert supervision is limited. The team achieved this by creating a unified LLM-centered system capable of analysing video data to identify procedural steps and offer interpretable guidance to trainees. The core innovation lies in utilising an LLM not just for activity recognition, but also for explainable decision-making, translating complex technical assessments into accessible natural language feedback.
This work centres on a video-based approach where the LLM functions as the central reasoning module, performing both spatiotemporal activity recognition and detailed analysis of the procedure depicted in the video data. Researchers benchmarked this LLM-based system against conventional machine learning and deep learning methods, demonstrating a substantial performance improvement of approximately 15-20% in both accuracy and F1 score. Beyond simply identifying actions, the framework incorporates a pilot student-support module, built upon anomaly detection and Explainable AI (XAI) principles, which automatically highlights both correct actions and areas needing improvement. Experiments show the LLM effectively identifies the constituent steps of ES, such as preparation, catheter insertion, suction application, and withdrawal, by analysing skeletal keypoints derived from video footage.
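The paper's implementation is not reproduced here, but the core idea of using the LLM as the reasoning module can be pictured as follows. In this hypothetical Python sketch, a window of skeleton keypoints is serialised as text and sent to GPT-4o via the OpenAI client; the prompt wording, data format, and step labels are illustrative assumptions rather than the authors' actual pipeline.

```python
# Minimal sketch (not the authors' code): classifying an ES procedural step
# by passing a window of skeleton keypoints to an LLM as structured text.
# Model name, prompt wording, and data format are illustrative assumptions.
import json
from openai import OpenAI

ES_STEPS = ["preparation", "catheter insertion", "suction application", "withdrawal"]

def classify_step(keypoint_window, client=None):
    """keypoint_window: list of frames, each a dict of joint -> (x, y) coordinates."""
    client = client or OpenAI()
    prompt = (
        "You are analysing a nursing trainee performing endotracheal suctioning.\n"
        f"Possible steps: {', '.join(ES_STEPS)}.\n"
        "Given the following sequence of 2D skeleton keypoints (one line per frame), "
        "name the single most likely step and briefly justify your answer.\n\n"
        + "\n".join(json.dumps(frame) for frame in keypoint_window)
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```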
The system’s ability to provide interpretable feedback is a key advancement, offering targeted suggestions for skill refinement and enhancing training efficiency. This automated feedback mechanism moves beyond traditional, subjective human observation, offering a scalable and objective method for assessing procedural competence. Collectively, these contributions establish a scalable, interpretable, and data-driven foundation for advancing nursing education and improving patient safety. The research establishes a pathway towards automated skill assessment, data-driven clinical training, and real-time safety alerts designed to prevent procedural errors and mitigate patient harm. This LLM-based approach not only improves recognition accuracy but also enhances transparency, fostering trust in the system’s assessments and recommendations.

The research team engineered a system that employs video-based pose estimation to capture the kinematics of nursing staff during ES, analysing spatiotemporal features derived from skeleton keypoints to recognise procedural steps such as preparation, catheter insertion, suction application, and withdrawal. To address occluded body parts in the pose data, the study implemented interpolation techniques, mirroring approaches used by Ngo et al., to mitigate noise and missing values. This interpolation improved the F1 score from 42% with raw skeletal data to 46%, although the team recognised the need for further performance gains.
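The interpolation step is conceptually simple: joint coordinates lost to occlusion are filled in along the time axis before feature extraction. A minimal sketch, assuming keypoints are stored as a NumPy array with NaNs marking occluded joints (the paper does not specify the exact interpolation scheme):

```python
# Minimal sketch of the interpolation idea: fill occluded (missing) keypoints
# by interpolating each joint coordinate linearly across time.
# The data shape and interpolation settings are illustrative, not from the paper.
import numpy as np
import pandas as pd

def interpolate_keypoints(keypoints: np.ndarray) -> np.ndarray:
    """keypoints: array of shape (frames, joints, 2) with np.nan where a joint is occluded."""
    frames, joints, dims = keypoints.shape
    flat = pd.DataFrame(keypoints.reshape(frames, joints * dims))
    # Linear interpolation along the time axis; also fills leading/trailing gaps.
    filled = flat.interpolate(method="linear", limit_direction="both")
    return filled.to_numpy().reshape(frames, joints, dims)
```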
The study pioneered the use of LLMs not merely for recognition, but also for explainable decision analysis and natural language feedback generation, translating complex technical insights into accessible guidance for trainees. Researchers harnessed the LLM as a central reasoning module, achieving an approximate 15-20% improvement in both accuracy and F1 score compared to baseline models. Beyond simple recognition, the team constructed a student-support module based on anomaly detection and Explainable AI (XAI) principles, providing automated, interpretable feedback that highlights correct actions and suggests targeted improvements. To augment limited training data, the work explored techniques inspired by Dobhal et al., investigating the potential of LLMs, specifically GPT-4o, as data augmentation agents through prompt engineering.
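The augmentation idea can be sketched as a prompt-engineering loop in which the LLM is asked to produce label-preserving variants of an existing keypoint sequence. The snippet below is a rough illustration under assumed prompt wording and output format; it is not the authors' augmentation pipeline.

```python
# Minimal sketch of prompt-based data augmentation with an LLM. The prompt,
# output format, and parsing are assumptions made for illustration only.
import json
from openai import OpenAI

def augment_sequence(keypoint_window, step_label, n_variants=3, client=None):
    client = client or OpenAI()
    prompt = (
        f"The JSON below is a skeleton-keypoint sequence labelled '{step_label}' "
        "from an endotracheal suctioning recording. Generate "
        f"{n_variants} plausible variants of this sequence that preserve the label, "
        "introducing small, realistic changes in timing and joint positions. "
        "Return only a JSON list of sequences.\n\n"
        + json.dumps(keypoint_window)
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model returns bare JSON; real output may need extra cleaning.
    return json.loads(response.choices[0].message.content)
```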
This approach aimed to generate synthetic data to enhance model training, yielding a one-percentage-point increase in F1 score, from 55% with random sampling to 56%. Furthermore, acknowledging the limitations of single viewpoints, the team drew inspiration from multi-angle video acquisition strategies, achieving an F1 score of 61% compared with 51% for single-angle approaches, though this necessitated more complex multi-camera setups. The resulting LLM-based approach establishes a scalable, interpretable, and data-driven foundation for advancing nursing education and improving patient safety.
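The article does not detail how the viewpoints are combined, but one simple way to exploit multiple angles is to concatenate frame-synchronised skeleton features from each camera before classification, as in this illustrative sketch (the fusion scheme is an assumption, not the authors' method):

```python
# Minimal sketch of a simple multi-angle fusion strategy: concatenate per-frame
# skeleton features from two synchronised camera views before classification.
import numpy as np

def fuse_views(view_a: np.ndarray, view_b: np.ndarray) -> np.ndarray:
    """Each view: array of shape (frames, features). Returns (frames, 2 * features)."""
    assert view_a.shape[0] == view_b.shape[0], "views must be frame-synchronised"
    return np.concatenate([view_a, view_b], axis=1)
```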
LLM framework improves endotracheal suctioning recognition accuracy
Scientists have developed a new Large Language Model (LLM)-centered framework for video-based activity recognition, specifically targeting endotracheal suctioning (ES), an essential yet invasive clinical procedure. The research addresses the lack of automated training and feedback systems for ES, particularly in settings with limited supervision. Experiments revealed that the LLM-based approach significantly outperforms baseline models, achieving an improvement of approximately 15-20% in both accuracy and F1 score. This breakthrough delivers a scalable and interpretable foundation for advancing nursing education and enhancing training efficiency.
The system performs spatiotemporal activity recognition and explainable decision analysis from video data, with the LLM serving as the central reasoning module. Furthermore, the LLM verbalizes feedback in natural language, translating complex technical insights into accessible guidance for trainees. The data show the framework’s ability to accurately identify the constituent steps of ES, such as preparation, catheter insertion, suction application, and withdrawal, through analysis of skeleton keypoints derived from video footage. The system’s performance was quantified by improvements in both accuracy and the F1 score, critical metrics for evaluating activity recognition models.
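Both metrics are standard and straightforward to reproduce; for segment-level labels they can be computed directly with scikit-learn, as in this small illustrative example (the labels shown are invented for demonstration, not the study's data):

```python
# Minimal sketch of the evaluation metrics named in the article:
# accuracy and macro F1 over predicted versus true ES step labels.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["preparation", "insertion", "suction", "withdrawal", "suction"]
y_pred = ["preparation", "insertion", "suction", "suction", "suction"]

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"accuracy={accuracy:.2f}, macro F1={macro_f1:.2f}")
```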
Beyond simple recognition, the study incorporated a pilot student-support module leveraging anomaly detection and Explainable AI (XAI) principles. Tests show this module provides automated, interpretable feedback, highlighting both correct actions and areas for targeted improvement. Measurements confirm the system’s capacity to analyse activity execution patterns and generate meaningful feedback, supporting performance assessment and skill refinement. The framework’s ability to detect and interpret procedural nuances represents a significant step towards objective, continuous quality monitoring in clinical training.
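One plausible reading of the anomaly-detection component is a comparison of each observed step against reference statistics from expert demonstrations. The sketch below flags steps whose duration deviates strongly from hypothetical expert means; the thresholds and reference values are assumptions, not figures from the paper.

```python
# Minimal sketch of the anomaly-detection idea behind the student-support module:
# flag procedural steps whose duration deviates strongly from reference statistics.
REFERENCE = {  # mean and std of step duration in seconds (hypothetical values)
    "preparation": (30.0, 8.0),
    "catheter insertion": (6.0, 2.0),
    "suction application": (10.0, 3.0),
    "withdrawal": (5.0, 1.5),
}

def flag_anomalies(step_durations: dict, z_threshold: float = 2.0) -> list:
    """step_durations: step name -> observed duration in seconds."""
    findings = []
    for step, duration in step_durations.items():
        mean, std = REFERENCE[step]
        z = (duration - mean) / std
        if abs(z) > z_threshold:
            findings.append((step, duration, round(z, 1)))
    return findings
```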
Researchers employed video-based pose estimation to capture the fine-grained kinematics of nursing staff during the procedure, analysing spatiotemporal features from skeleton keypoints. Previous work by Ngo et al. achieved an F1 score of 42% using raw skeletal data, improving to 46% with interpolation, and Dobhal et al. demonstrated a one-percentage-point F1 gain, from 55% to 56%, using LLM-generated synthetic data. The current framework surpasses these results, outperforming conventional machine learning and deep learning baselines by approximately 15-20% in both accuracy and F1 score. The LLM functions as a central reasoning module, capable of identifying spatiotemporal activities and providing explainable decision analysis from video data. Furthermore, the system translates complex technical insights into accessible, natural language feedback for trainees, offering automated and interpretable guidance on correct actions and areas for improvement.
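The final verbalisation step can be pictured as a second LLM call that converts numeric findings, such as the anomaly flags sketched above, into plain-language coaching. The prompt and model choice below are again assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of feedback verbalisation: turning numeric findings into
# plain-language guidance for a trainee via an LLM call.
from openai import OpenAI

def verbalise_feedback(findings, client=None):
    client = client or OpenAI()
    summary = "\n".join(
        f"- step '{step}': observed {dur:.1f}s, z-score {z}" for step, dur, z in findings
    ) or "- no deviations detected"
    prompt = (
        "You are a nursing instructor reviewing an endotracheal suctioning attempt.\n"
        "Turn these technical findings into brief, encouraging feedback for the trainee, "
        "noting what was done correctly and what to improve:\n" + summary
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```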
This work represents a proof of concept for a new generation of Human Activity Recognition (HAR) systems, leveraging the contextual reasoning of LLMs for applications in human-computer interaction. By combining semantic understanding with visual data, the framework potentially enables zero-shot learning and nuanced interpretation of behaviours, with implications for personal healthcare, smart environments, and robotics. The authors acknowledge limitations in the scope of the pilot study and suggest that further research is needed to refine subsequent prototypes. However, the successful identification and explanation of student errors, coupled with the ability to verbalise technical metrics into understandable feedback, establishes a scalable and interpretable foundation for advancing nursing education and enhancing training efficiency, ultimately contributing to improved patient care.
👉 More information
🗞 A Unified XAI-LLM Approach for Endotracheal Suctioning Activity Recognition
🧠 ArXiv: https://arxiv.org/abs/2601.21802
