Scene Graphs in Surgery Reveal a Data Divide Between 2D Video and 4D Modelling

Scene graphs offer a powerful way to represent the complex relationships within surgical environments, and a new scoping review charts the rapid progress of this technology. Angelo Henriques, Korab Hoxha, Daniel Zapp, Peter C. Issa, and Nassir Navab, all from the Technical University of Munich, together with M. Ali Nasseri of the University of Alberta, systematically mapped the field, revealing both significant advancements and a critical gap in research approaches. Their analysis shows that while internal surgical views rely heavily on real-world video data, external-view modelling predominantly uses simulations, hindering translational progress. The work establishes scene graphs as a vital tool both for surgical analysis, including workflow recognition and safety monitoring, and for creating advanced surgical simulations, paving the way for intelligent systems that promise to enhance surgical safety, efficiency, and training.

Edmonton, Canada. Scene graphs provide structured relational representations crucial for decoding complex, dynamic surgical environments. This scoping review systematically maps the evolving landscape of scene graph research in surgery, charting its applications, methodological advancements, and future directions. The analysis reveals rapid growth, yet uncovers a critical ‘data divide’: research focusing on the surgical view almost exclusively uses real-world 2D video, while external-view modelling relies heavily on simulated data. Methodologically, the field has advanced from foundational graph neural networks to specialized foundation models, which now significantly outperform generalist large vision-language models in surgical contexts.
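To make the representation concrete, a scene graph is simply a set of entity nodes joined by typed relation edges. Below is a minimal Python sketch of that structure; the entity and relation labels are illustrative assumptions, not taken from any particular dataset in the review.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Entity:
    """A node: an instrument, anatomical structure, or staff member."""
    id: int
    label: str          # e.g. "grasper", "gallbladder" (illustrative)

@dataclass(frozen=True)
class Relation:
    """A directed, typed edge between two entities."""
    subject: Entity
    predicate: str      # e.g. "retracts", "dissects" (illustrative)
    obj: Entity

@dataclass
class SceneGraph:
    """One frame's structured description of the surgical scene."""
    entities: list[Entity] = field(default_factory=list)
    relations: list[Relation] = field(default_factory=list)

# Example: a single frame from an internal (endoscopic) view.
grasper = Entity(0, "grasper")
gallbladder = Entity(1, "gallbladder")
frame_graph = SceneGraph(
    entities=[grasper, gallbladder],
    relations=[Relation(grasper, "retracts", gallbladder)],
)
```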

Surgical Scene Graph Applications and Views

Scene graphs are increasingly utilized in surgery to represent scenes as interconnected networks of objects and their relationships. This allows artificial intelligence systems to understand the surgical environment and perform tasks such as identifying instruments, tissues, and actions during surgery, assessing surgical skill, providing real-time guidance, and analyzing the operating room environment. Research categorizes these applications into two main groups: internal view, focusing on the surgical field within the patient’s body, and external view, focusing on the broader operating room environment. Internal-view applications commonly utilize techniques like triplet detection to identify interactions between instruments and tissues, workflow recognition to understand the sequence of surgical actions, and safety assessment to evaluate critical views and potential errors.
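Triplet detection in particular frames interaction recognition as predicting ⟨instrument, verb, target⟩ triplets for each frame. The sketch below shows how dense per-frame model scores might be decoded into such triplets; the class vocabularies and threshold are illustrative assumptions, not the benchmarks' actual label sets.

```python
import numpy as np

# Illustrative class vocabularies (real benchmarks define their own).
INSTRUMENTS = ["grasper", "hook", "clipper", "scissors"]
VERBS = ["grasp", "retract", "dissect", "clip", "cut"]
TARGETS = ["gallbladder", "cystic_duct", "cystic_artery", "liver"]

def decode_triplets(scores: np.ndarray, threshold: float = 0.5):
    """Turn a (num_instruments, num_verbs, num_targets) score tensor
    into a list of <instrument, verb, target> triplets above threshold."""
    triplets = []
    for i, v, t in zip(*np.nonzero(scores > threshold)):
        triplets.append(
            (INSTRUMENTS[i], VERBS[v], TARGETS[t], float(scores[i, v, t]))
        )
    return triplets

# Example: fabricated scores for one frame with one confident interaction.
scores = np.zeros((4, 5, 4))
scores[0, 1, 0] = 0.92   # grasper retracts gallbladder
print(decode_triplets(scores))
# [('grasper', 'retract', 'gallbladder', 0.92)]
```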

Researchers frequently employ datasets such as Cholec80, CholecT45, and CholecT50, applying convolutional neural networks, graph convolutional networks, transformers, and diffusion models to improve accuracy and efficiency. Recent work explores large language models and multi-modal approaches, combining video with audio and other data sources, to enhance surgical understanding and to compress models for efficient processing. Generative frameworks are also being developed to synthesize realistic surgical videos. External-view applications, by contrast, focus on understanding the operating room environment, generating semantic scene graphs, and creating realistic simulations for training purposes.
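To illustrate the graph-network end of this toolbox, a single graph-convolution layer propagates features between connected scene entities. Below is a minimal message-passing layer in plain PyTorch; the layer sizes and mean aggregation are chosen for illustration rather than drawn from the reviewed models.

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """One round of message passing: each node averages its neighbours'
    features, mixes them with its own, and applies a shared linear map."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(2 * in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (num_nodes, in_dim) node features (e.g. detector embeddings)
        # adj: (num_nodes, num_nodes) adjacency matrix of the scene graph
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neighbour_mean = adj @ x / deg            # aggregate neighbours
        out = self.linear(torch.cat([x, neighbour_mean], dim=-1))
        return torch.relu(out)

# Example: 3 entities (two instruments, one tissue) with 16-d features.
x = torch.randn(3, 16)
adj = torch.tensor([[0., 0., 1.],
                    [0., 0., 1.],
                    [1., 1., 0.]])   # both instruments touch the tissue
layer = GraphConvLayer(16, 32)
print(layer(x, adj).shape)           # torch.Size([3, 32])
```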

Research in this area draws on datasets like 4D-OR and MM-OR, which incorporate multi-view RGB-D video and other data modalities. Large language models and foundation models are also being explored for zero-shot performance, enabling systems to interpret the operating room without task-specific training. The review additionally compiles a comprehensive list of the field's abbreviations to help readers navigate the terminology.
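A zero-shot pipeline of this kind might prompt a vision-language model to emit a scene graph directly. The sketch below is purely illustrative: `query_vlm` is a hypothetical placeholder for whichever model API is available, and the JSON schema is an assumption, not a standard from the reviewed systems.

```python
import json

def build_scene_graph_prompt(entity_hint: str = "operating-room") -> str:
    """Construct a prompt asking a VLM to emit a scene graph as JSON.
    The schema below is an illustrative assumption, not a standard."""
    return (
        f"List every {entity_hint} entity visible in the image, then every "
        "pairwise relation, as JSON: "
        '{"entities": [...], "relations": [["subject", "predicate", "object"], ...]}'
    )

def parse_scene_graph(vlm_output: str) -> dict:
    """Parse the model's JSON reply, tolerating surrounding prose."""
    start, end = vlm_output.find("{"), vlm_output.rfind("}") + 1
    return json.loads(vlm_output[start:end])

# `query_vlm(image, prompt)` is a hypothetical stand-in, not a real API:
# reply = query_vlm(or_frame, build_scene_graph_prompt())
# graph = parse_scene_graph(reply)
```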

Surgical Scene Graphs, Data Gaps, and Foundation Models

This work maps the rapidly evolving landscape of scene graph research within surgery, charting its applications and methodological advancements. The systematic mapping reveals a notable ‘data divide’: internal-view surgical analysis predominantly utilizes real-world 2D video, while external-view modelling relies heavily on simulated data, highlighting a translational research gap. The study also details a clear progression from foundational graph neural networks to specialized foundation models, which now significantly outperform generalist large vision-language models in surgical contexts. Together, these findings establish scene graphs as a cornerstone for both analytical tasks, such as workflow recognition and automated safety monitoring, and generative tasks, such as controllable surgical simulation. Data sources directly shape the complexity of the resulting scene graphs, with standard 2D RGB video from endoscopes and laparoscopes being the most common modality for internal-view analysis.

Increasingly, researchers are employing RGB-D data for applications requiring true 3D spatial understanding, enabling the construction of 4D (3D + time) scene graphs that track dynamic spatial interactions. The latest trend integrates diverse data streams into multimodal frameworks, exemplified by datasets like MM-OR, which fuse multi-view RGB-D video with audio, speech transcripts, and robotic system logs. Leveraging such complementary information yields more robust scene understanding. Progress in the field is intrinsically linked to the availability of specialized datasets, which have evolved from repurposed surgical videos to large, purpose-built multimodal benchmarks that provide resources for future research.
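One way to picture a 4D (3D + time) scene graph is as a time-indexed sequence of 3D graphs whose nodes carry spatial positions and stable track identities; following an entity through time then becomes a simple query. A minimal sketch, with field names and values that are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Node3D:
    """An entity with a 3D position (e.g. from multi-view RGB-D fusion)."""
    track_id: int       # stable identity across frames
    label: str          # e.g. "surgeon", "operating_table" (illustrative)
    xyz: tuple[float, float, float]

@dataclass
class FrameGraph:
    """The 3D scene graph of one time step."""
    timestamp: float
    nodes: list[Node3D]
    edges: list[tuple[int, str, int]]   # (track_id, predicate, track_id)

# A 4D scene graph: a time-ordered sequence of per-frame 3D graphs.
sequence: list[FrameGraph] = [
    FrameGraph(0.0,
               [Node3D(1, "surgeon", (0.2, 1.1, 0.0)),
                Node3D(2, "operating_table", (1.0, 0.0, 0.0))],
               edges=[]),
    FrameGraph(0.5,
               [Node3D(1, "surgeon", (0.6, 1.1, 0.0)),
                Node3D(2, "operating_table", (1.0, 0.0, 0.0))],
               edges=[(1, "close_to", 2)]),
]

def trajectory(seq: list[FrameGraph], track_id: int):
    """Follow one entity's 3D position through time."""
    return [(f.timestamp, n.xyz)
            for f in seq for n in f.nodes if n.track_id == track_id]

print(trajectory(sequence, 1))
```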

Surgical Scene Graphs, Data, and Future Directions

This scoping review charts the rapid emergence of scene graphs as a powerful tool for understanding complex surgical environments, tracing a shift from simple object detection towards sophisticated relational reasoning. Researchers have progressed from early graph neural networks to specialized foundation models, advancing surgical data science and enabling applications such as workflow recognition and surgical simulation. The core strength of these systems lies in their ability to encode relationships between surgical entities, offering an interpretable foundation for advanced artificial intelligence. At the same time, the ‘data divide’ persists: internal-view research relies predominantly on real-world video data, while external-view modelling frequently depends on simulated data, leaving a translational gap to close.

Despite ongoing challenges related to data scarcity and real-time performance, researchers are actively addressing these limitations through techniques like transfer learning and model distillation. To fully realize the potential of surgical scene graphs, future work should prioritize the development of actionable, closed-loop systems, focusing on clinical grounding and personalization through integration with patient-specific data. Further research should also explore causal reasoning and establish robust frameworks for clinical validation and the creation of interactive generative environments for surgical training and planning.
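Model distillation, one of the compression routes just mentioned, trains a compact student to match a large teacher's output distribution. Below is a minimal sketch of the standard distillation loss in PyTorch; the temperature and blending weight are common defaults assumed here for illustration, not values reported in the review.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend a soft KL term (match the teacher's softened distribution)
    with the usual hard-label cross-entropy, as in standard distillation."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Example: 4 frames, 10 relation classes.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))
```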

👉 More information
🗞 Decoding the Surgical Scene: A Scoping Review of Scene Graphs in Surgery
🧠 ArXiv: https://arxiv.org/abs/2509.20941

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
