NVIDIA’s Holoscan Platform Revolutionizes Healthcare with AI, Reducing Latency by 35%


Artificial Intelligence (AI) and Machine Learning (ML) technologies are revolutionizing healthcare diagnostics and treatments, but their integration into medical devices can lead to unpredictable latency due to GPU resource contention. NVIDIA’s Holoscan platform, a real-time AI system for streaming sensor data and images, offers a solution: it allows the creation of real-time pipelines for AI-based analysis and visualization of streaming data and medical images. However, the simultaneous execution of multiple AI applications can still pose challenges. A novel design approach combining CUDA MPS for spatial partitioning with a load-balancing technique is proposed to address these issues, showing significant performance improvements in empirical evaluations.

How is AI Revolutionizing Healthcare Diagnostics and Treatments?

Artificial Intelligence (AI) and Machine Learning (ML) technologies have been integrated into medical devices, driving a significant transformation in healthcare diagnostics and treatments. Medical device manufacturers are eager to maximize the benefits of AI and ML by consolidating multiple applications onto a single platform. These technologies have gained traction in recent years by enhancing medical procedures ranging from disease diagnostics to surgical interventions. They enable automated real-time monitoring and provide valuable insights to physicians, supporting earlier diagnosis of diseases and shorter patient recovery times.

However, the concurrent execution of several AI applications, each with its own visualization components, leads to unpredictable end-to-end latency, primarily due to GPU resource contention. To mitigate this, manufacturers typically deploy separate workstations for distinct AI applications, which increases financial, energy, and maintenance costs. This article addresses these challenges within the context of NVIDIA’s Holoscan platform, a real-time AI system for streaming sensor data and images.

What is NVIDIA’s Holoscan Platform?

Holoscan is NVIDIA’s scalable edge-computing platform for AI-enabled sensor processing. It empowers medical device manufacturers to create real-time pipelines for AI-based analysis and visualization of streaming data and medical images. Holoscan consists of optimized libraries for I/O and data processing, inference, and graphical rendering on NVIDIA GPUs. It provides a modular and intuitive dataflow programming model and is compatible with both ARM and x86-based hardware equipped with GPUs, giving medical device vendors the freedom to select their architecture.
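To give a feel for the dataflow programming model, the sketch below outlines a minimal two-operator pipeline using the Holoscan SDK’s Python bindings. The operator names and the counter payload are illustrative placeholders, not taken from the paper; the structure follows the SDK’s Application/Operator pattern.

```python
# Minimal sketch of a Holoscan dataflow pipeline (illustrative operator names).
# Assumes the Holoscan SDK Python bindings are installed.
from holoscan.conditions import CountCondition
from holoscan.core import Application, Operator, OperatorSpec


class SourceOp(Operator):
    """Emits a frame counter, standing in for a streaming sensor source."""

    def __init__(self, fragment, *args, **kwargs):
        self.count = 0
        super().__init__(fragment, *args, **kwargs)

    def setup(self, spec: OperatorSpec):
        spec.output("out")

    def compute(self, op_input, op_output, context):
        self.count += 1
        op_output.emit(self.count, "out")


class SinkOp(Operator):
    """Consumes frames; a real pipeline would run inference or rendering here."""

    def setup(self, spec: OperatorSpec):
        spec.input("in")

    def compute(self, op_input, op_output, context):
        frame = op_input.receive("in")
        print(f"processed frame {frame}")


class ToyPipeline(Application):
    def compose(self):
        # Stop the source after 10 ticks so the example terminates.
        src = SourceOp(self, CountCondition(self, 10), name="source")
        sink = SinkOp(self, name="sink")
        # Connect the operators into a dataflow graph.
        self.add_flow(src, sink)


if __name__ == "__main__":
    ToyPipeline().run()
```

In a production medical pipeline, the source would typically be a sensor or video ingest operator and the sink an inference or visualization operator, but the graph-composition pattern stays the same.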

As the field of AI continues its rapid evolution, developers aspire to harness the NVIDIA Holoscan SDK to integrate multiple AI applications with visualization capabilities into their systems. This is especially attractive in edge-computing domains like medical devices, where a single CUDA compute workload typically cannot exploit the full massive parallelism of today’s GPUs, making consolidation of several applications onto shared hardware appealing.

What are the Challenges in Performance Predictability?

Real-world constraints such as space limitations, power consumption, financial cost, maintenance, and regulatory requirements call for using as little hardware as possible, ideally a single workstation with GPUs. However, these constraints often pose challenges to performance predictability. The heterogeneous mix of simultaneous AI-based compute and graphical rendering workloads creates resource contention on a GPU, and the concurrent execution of multiple AI and visualization workloads drives up maximum latency. To avoid this issue, device manufacturers employ distinct workstations for separate AI applications, increasing the economic burden on themselves, hospitals, and ultimately patients.

How Does the Novel Design Approach Address These Challenges?

The article proposes a novel design approach that combines CUDA MPS (Multi-Process Service) for spatial partitioning between compute workloads with a load-balancing technique that isolates CUDA compute kernels and graphics tasks onto distinct GPUs. Additionally, an admission control policy prevents SM oversubscription by concurrent compute tasks. This pragmatic design is straightforward to implement and avoids heavy context-switch overheads, mitigating resource contention within a GPU for medical AI workloads.
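As a rough illustration of how such a deployment could be wired up, the launcher sketch below starts each AI application under CUDA MPS with a fixed share of SMs (via the standard CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable), pins compute and rendering processes to different GPUs with CUDA_VISIBLE_DEVICES, and refuses to admit an application once the requested SM shares would exceed 100%. The application commands, SM shares, and GPU indices are assumptions for illustration only; the paper implements its mechanism within its Holoscan-based design rather than as a standalone script.

```python
# Hedged sketch: MPS spatial partitioning, compute/graphics GPU placement,
# and a simple admission control check. CUDA_MPS_ACTIVE_THREAD_PERCENTAGE and
# CUDA_VISIBLE_DEVICES are standard NVIDIA environment variables; the app
# commands and SM shares below are illustrative placeholders.
import os
import subprocess

COMPUTE_GPU = "0"   # GPU reserved for CUDA compute kernels
RENDER_GPU = "1"    # GPU reserved for graphical rendering (load balancing)

admitted_sm_share = 0  # running total of SM percentage granted to compute apps


def launch_compute_app(cmd, sm_share):
    """Start one AI application under MPS with a capped share of SMs."""
    global admitted_sm_share
    # Admission control: reject the app if it would oversubscribe the SMs.
    if admitted_sm_share + sm_share > 100:
        raise RuntimeError(f"rejected {cmd!r}: SM budget exceeded")
    admitted_sm_share += sm_share

    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = COMPUTE_GPU
    # Spatial partitioning: MPS limits this client to sm_share% of the SMs.
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(sm_share)
    return subprocess.Popen(cmd, env=env)


def launch_render_app(cmd):
    """Place a visualization/rendering process on the dedicated render GPU."""
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = RENDER_GPU
    return subprocess.Popen(cmd, env=env)


if __name__ == "__main__":
    # Assumes the MPS control daemon is already running, e.g.:
    #   nvidia-cuda-mps-control -d
    procs = [
        launch_compute_app(["python", "tool_tracking_app.py"], sm_share=30),
        launch_compute_app(["python", "tool_tracking_app.py"], sm_share=30),
        launch_render_app(["python", "visualization_app.py"]),
    ]
    for p in procs:
        p.wait()
```

The key idea mirrored here is that each compute client gets a bounded slice of the SMs so that concurrent kernels do not contend unpredictably, while rendering work is kept off the compute GPU entirely.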

What are the Results of the Empirical Evaluation?

Empirical evaluation using a set of end-to-end latency determinism metrics reveals substantial performance improvements with the proposed design. For instance, the proposed design reduces maximum latency by 21-30% and improves latency distribution flatness by 17-25% for up to five concurrent endoscopy tool tracking AI applications compared to a single-GPU baseline. Against a default multi-GPU setup, the optimizations decrease maximum latency by 35% for up to six concurrent applications while improving GPU utilization by 42%. The paper provides clear design insights for AI applications in the edge-computing domain, including medical systems where performance predictability of concurrent and heterogeneous GPU workloads is a critical requirement.

Publication details: “Towards Deterministic End-to-end Latency for Medical AI Systems in NVIDIA Holoscan”
Publication Date: 2024-02-06
Authors: Soham Sinha, Shekhar Dwivedi and Mahdi Azizian
Source: arXiv (Cornell University)
DOI: https://doi.org/10.48550/arxiv.2402.04466