Large Language Models (LLMs) are rapidly becoming essential tools, yet their use in sensitive areas like healthcare and finance remains limited by security concerns, particularly when handling private data and valuable training sets. Marcin Chrapek, Marcin Copik, Etienne Mettaz, and Torsten Hoefler, all from ETH Zurich, address this challenge by investigating Trusted Execution Environments (TEEs) as a means of securing LLM operations. Their research comprehensively evaluates the performance and cost of running complete LLM inference pipelines within both CPU and GPU TEEs, utilising Intel’s TDX and SGX technologies alongside H100 Confidential Compute GPUs. The team’s findings reveal minimal performance impacts, under 10% throughput overhead and under 20% latency overhead, and demonstrate that CPU TEEs can, in certain scenarios, offer a more cost-effective or secure solution than GPUs, representing a significant step towards practical confidential LLMs.
Proprietary datasets and their heightened security requirements often hinder LLM adoption in privacy-sensitive sectors such as healthcare and finance. The researchers validated the practicality of TEE-based protection by running these compute-intensive workloads entirely within CPU and GPU TEEs. On the CPU side, an in-depth study ran full Llama2 inference pipelines, spanning the 7B, 13B, and 70B parameter models, inside Intel’s TDX and SGX, accelerated by Advanced Matrix Extensions (AMX); the experiments show under 10% throughput reduction and under 20% latency increase. From these experiments the team derived twelve key insights into confidential LLM hosting, covering CPU architectural considerations such as NUMA effects and the benefits of large page sizes, and providing practical guidelines for both users and cloud providers.
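As a practical aside, a deployer can check whether a host (or confidential VM) exposes the relevant hardware features by inspecting the CPU flags the Linux kernel reports. The sketch below is not from the paper; it is a minimal, hedged illustration that parses `/proc/cpuinfo`-style text for the flag names recent Linux kernels use for SGX, TDX guests, and AMX.

```python
def tee_features(cpuinfo_text: str) -> list[str]:
    """Return TEE/AMX-related CPU flags found in /proc/cpuinfo-style text.

    Flag names ("sgx", "tdx_guest", "amx_*") follow recent Linux kernels;
    availability and naming may vary by kernel version.
    """
    flags: set[str] = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # "flags : fpu sse sgx ..." -> take everything after the colon
            flags.update(line.split(":", 1)[1].split())
            break
    of_interest = {"sgx", "tdx_guest", "amx_tile", "amx_bf16", "amx_int8"}
    return sorted(flags & of_interest)


# Example usage on a real Linux host:
#     with open("/proc/cpuinfo") as f:
#         print(tee_features(f.read()))
```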
They also successfully implemented a Retrieval-Augmented Generation (RAG) pipeline within a TEE, demonstrating its operational feasibility. The findings indicate that TEEs offer a viable path toward protecting LLM inference, positioning them as a foundational component for future confidential AI systems.
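To make the RAG result concrete, the sketch below shows the shape of such a pipeline: retrieve the documents most similar to the query, then assemble them with the query into an LLM prompt. It is a TEE-agnostic toy, not the paper's implementation; a bag-of-words retriever stands in for a real embedding model, and all names are illustrative. The confidentiality point is that inside a TEE, both the document store and the query stay within encrypted memory throughout these steps.

```python
from collections import Counter
from math import sqrt


def vectorize(text: str) -> Counter:
    # Toy bag-of-words term counts (stand-in for a real embedding model)
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    # Rank documents by similarity to the query, return the top-k
    q = vectorize(query)
    ranked = sorted(documents, key=lambda d: cosine(q, vectorize(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, documents: list[str]) -> str:
    # Concatenate retrieved context with the user query into one prompt;
    # in a TEE deployment this all happens inside encrypted memory.
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


docs = [
    "Patient records must remain confidential under HIPAA.",
    "The quarterly earnings report showed strong growth.",
]
prompt = build_prompt("How are patient records protected?", docs)
```

In a real confidential deployment, the retriever and the generating LLM would both run inside the TEE so that neither the proprietary corpus nor user queries leave the protected boundary.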
👉 More information
🗞 Confidential LLM Inference: Performance and Cost Across CPU and GPU TEEs
🧠 ArXiv: https://arxiv.org/abs/2509.18886
