Large Vision-Language Models Achieve Surgical Tool Detection with Holistic Understanding

Surgical tool detection represents a critical challenge in developing effective artificial intelligence for operating theatre support. Nakul Poudel, Richard Simon, and Cristian A. Linte, all from the Rochester Institute of Technology, investigated whether large vision-language models (VLMs) could accurately identify instruments within complex surgical scenes. Their research addresses a significant gap in current AI systems, which often struggle with the multimodal nature of surgery and lack comprehensive scene understanding. By evaluating models such as Qwen2.5, LLaVA1.5, and InternVL3.5 on the GraSP robotic surgery dataset, and comparing their performance to the Grounding DINO baseline, the team demonstrates the potential of VLMs, particularly Qwen2.5, to advance surgical AI and improve both tool recognition and localisation.


Multimodal VLMs for surgical scene understanding are gaining traction

Scientists have increasingly recognised Artificial Intelligence (AI) as a transformative force in surgical guidance and decision-making. Surgical environments are inherently complex, high-risk, and cognitively demanding, and as procedures increasingly shift toward minimally invasive and robot-assisted techniques, AI systems can provide critical intraoperative guidance and decision support, enhancing both surgical precision and patient safety. However, the unimodal nature of most current AI systems limits their ability to achieve a holistic understanding of surgical workflows. Recent advances in large vision-language models (VLMs), which integrate multimodal data processing, offer strong potential for modeling surgical tasks and providing human-like scene reasoning and understanding. Despite their promise, systematic investigations of VLMs in surgical applications remain limited.
The authors report that, among the evaluated VLMs, Qwen2.5 consistently achieves superior detection performance in both the zero-shot and fine-tuned configurations. Compared with the open-set detection baseline Grounding DINO, Qwen2.5 exhibits stronger zero-shot generalization and comparable fine-tuned performance. Notably, Qwen2.5 shows superior instrument recognition, while Grounding DINO demonstrates stronger localization.

However, most current surgical AI systems are unimodal and task-specific, with limited modeling of interactions among surgical tasks. Developing such capabilities is important for achieving comprehensive scene understanding, augmenting entire surgical workflows, and enabling future autonomous robotic assistants [2, 3]. Recent advances in vision-language models (VLMs), which accept multimodal input, make this kind of joint reasoning over images and language possible. In this study, Grounding DINO, an open-set object detection model, serves as the baseline; in the zero-shot setting, LLaVA performed poorly.

VLM Evaluation for Surgical Tool Detection is crucial

This work pioneers a systematic analysis of VLMs specifically within the surgical domain, moving beyond general AI capabilities to focus on a vital clinical application. The study framed surgical tool detection as predicting a set of instrument instances O = {(c_i, b_i)}, i = 1, …, N, where c_i denotes the instrument category and b_i the corresponding bounding box, given an input surgical image I ∈ ℝ^(H×W×3). Frames from the GraSP dataset contained between 1 and 5 instrument instances, spanning seven categories: Bipolar Forcep, Prograsp Forcep, Large Needle Driver, Monopolar Curved Scissor, Suction Instrument, Clip Applier, and Laparoscopic Grasper, for a total of 9,031 instances across the training and test sets. Experiments employed two distinct settings: zero-shot inference, where off-the-shelf VLMs directly classified and localized instruments without task-specific training, and a fine-tuned setting utilising Rank-8 LoRA adaptation for 5 epochs.
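To make the zero-shot setting concrete, here is a minimal sketch of prompting a chat-style VLM for tool detection and parsing its reply into the (category, bounding box) pairs defined above. The prompt wording, the query_vlm helper, and the JSON reply format are illustrative assumptions, not the authors' exact protocol.

```python
import json

# The seven GraSP instrument categories evaluated in the paper.
CATEGORIES = [
    "Bipolar Forcep", "Prograsp Forcep", "Large Needle Driver",
    "Monopolar Curved Scissor", "Suction Instrument",
    "Clip Applier", "Laparoscopic Grasper",
]

# Hypothetical prompt asking the VLM to return detections as JSON.
PROMPT = (
    "Detect every surgical instrument in the image. "
    f"Choose categories only from: {', '.join(CATEGORIES)}. "
    'Answer with JSON: [{"category": str, "bbox": [x1, y1, x2, y2]}].'
)

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for a call to a VLM such as Qwen2.5, LLaVA1.5,
    or InternVL3.5; swap in the inference stack of your choice."""
    raise NotImplementedError

def parse_detections(raw_reply: str):
    """Convert the model's JSON reply into (category, bbox) pairs,
    discarding malformed items or unknown categories."""
    try:
        items = json.loads(raw_reply)
    except json.JSONDecodeError:
        return []
    return [
        (d["category"], d["bbox"])
        for d in items
        if d.get("category") in CATEGORIES and len(d.get("bbox", [])) == 4
    ]

# Example usage (zero-shot inference on one frame):
# reply = query_vlm("grasp_frame_0001.png", PROMPT)
# detections = parse_detections(reply)
```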

The team leveraged the Swift framework on an NVIDIA A100 (40GB) GPU to run these computationally intensive experiments efficiently. The study also addressed the challenge of evaluating models that do not emit confidence scores by adopting the TIDE framework, which decomposes errors into six interpretable categories: Classification Error (Cls), Localization Error (Loc), Both Classification and Localization Error (Cls and Loc), Duplicate Detection Error (Dup), Background Error (Bkg), and Missed GT Error (Miss). This provides a nuanced understanding of model behaviour beyond a single mAP score. Specifically, the team set the foreground and background IoU thresholds, t_f and t_b, to 0.5 and 0.1, respectively, to quantify the overlap between predicted and ground-truth bounding boxes, enabling a detailed error-level analysis that reveals the strengths and limitations of each VLM.
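As a rough illustration of how TIDE-style error bucketing works under these thresholds, the sketch below assigns each prediction to one of the six categories based on its IoU with the ground truth. It is a simplified reimplementation for intuition only (TIDE's official tooling handles matching and mAP bookkeeping more carefully); the helper names are mine, while t_f = 0.5 and t_b = 0.1 come from the study.

```python
T_F, T_B = 0.5, 0.1  # foreground / background IoU thresholds from the study

def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def categorize(pred_cls, pred_box, gts, matched):
    """Simplified TIDE-style bucket for one prediction.
    gts: list of (cls, box); matched: indices of GTs already claimed."""
    best_i, best = max(enumerate(iou(pred_box, g[1]) for g in gts),
                       key=lambda t: t[1], default=(None, 0.0))
    if best < T_B:
        return "Bkg"                       # barely overlaps any ground truth
    gt_cls = gts[best_i][0]
    if best >= T_F:
        if gt_cls != pred_cls:
            return "Cls"                   # well localized, wrong label
        # correct detection, or a duplicate of an already-matched GT
        return "Dup" if best_i in matched else "TP"
    # t_b <= IoU < t_f: localization is off
    return "Loc" if gt_cls == pred_cls else "Cls and Loc"

# Ground-truth instances never matched by any prediction count as "Miss".
```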

Qwen2.5 excels at surgical tool detection

Experiments revealed that Qwen2.5 consistently outperformed the other VLMs in detection performance across both configurations, demonstrating its superior capability for understanding surgical scenes. Qwen2.5 also exhibited stronger zero-shot generalization than the open-set detection baseline, Grounding DINO, while achieving comparable performance after fine-tuning. The team further measured instrument recognition rates, finding that Qwen2.5 excelled at accurately identifying instruments, whereas Grounding DINO showed greater proficiency in localising them within the surgical field. The dataset's seven instrument categories together account for 9,031 instances across the training and test sets.

The distribution of these instances varied, with Bipolar Forceps being the most prevalent at 2,503 instances and Clip Appliers the least common at only 102. The authors attribute Qwen2.5's superior performance to its ability to effectively integrate visual and textual information, enabling a more holistic understanding of the complex surgical environment. The work highlights the potential of these VLMs to support intraoperative guidance and decision-making, enhancing surgical precision and patient safety. Future research will focus on expanding the capabilities of these models to encompass a wider range of surgical tasks, including surgical phase recognition and action recognition, ultimately paving the way for autonomous robotic assistants in the operating room.

Qwen2.5 and Grounding DINO show complementary strengths

This research establishes that Qwen2.5 excels in instrument recognition, while Grounding DINO demonstrates superior tool localization accuracy. The authors acknowledge limitations related to the specific dataset used and the scope of the evaluated tasks, noting that further investigation is needed to assess performance across diverse surgical procedures and complexities. Future work should explore leveraging Qwen2.5 for broader surgical task analysis, including phase, action, and step recognition, potentially paving the way for comprehensive surgical workflow understanding.

👉 More information
🗞 Evaluating Large Vision-language Models for Surgical Tool Detection
🧠 ArXiv: https://arxiv.org/abs/2601.16895

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
