Large Language Models (LLMs) are rapidly becoming essential tools, yet their use in sensitive areas like healthcare and finance remains limited by security concerns, particularly when handling private data and valuable training sets. Marcin Chrapek, Marcin Copik, Etienne Mettaz, and Torsten Hoefler, all from ETH Zurich, address this challenge by investigating Trusted Execution Environments (TEEs) as a means of securing LLM operations. Their research comprehensively evaluates the performance and cost of running complete LLM inference pipelines within both CPU and GPU TEEs, utilising Intel’s TDX and SGX technologies alongside H100 Confidential Compute GPUs. The team’s findings reveal minimal performance impacts, under 10% throughput overhead and under 20% latency overhead, and demonstrate that CPU TEEs can, in certain scenarios, offer a more cost-effective or secure solution than GPUs, representing a significant step towards practical confidential LLMs.
Proprietary datasets and their heightened security requirements often hinder LLM adoption in privacy-sensitive sectors such as healthcare and finance. The researchers validated the practicality of TEE-based protection by running these compute-intensive workloads entirely within CPU and GPU TEEs. On the CPU side, an in-depth study ran full Llama2 inference pipelines, spanning the 7B, 13B, and 70B parameter models, inside Intel’s TDX and SGX, accelerated by Advanced Matrix Extensions (AMX); the experiments show under 10% throughput reduction and under 20% latency increase. From these experiments the team derived twelve key insights into confidential LLM hosting, covering CPU architectural considerations such as NUMA effects and the benefits of large page sizes, and providing practical guidelines for both users and cloud providers.
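As a practical aside, a deployer can check whether a host (or confidential VM) exposes the relevant hardware features by inspecting the CPU flags the Linux kernel reports. The sketch below is not from the paper; it is a minimal, hedged illustration that parses `/proc/cpuinfo`-style text for the flag names recent Linux kernels use for SGX, TDX guests, and AMX.

```python
def tee_features(cpuinfo_text: str) -> list[str]:
    """Return TEE/AMX-related CPU flags found in /proc/cpuinfo-style text.

    Flag names ("sgx", "tdx_guest", "amx_*") follow recent Linux kernels;
    availability and naming may vary by kernel version.
    """
    flags: set[str] = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # "flags : fpu sse sgx ..." -> take everything after the colon
            flags.update(line.split(":", 1)[1].split())
            break
    of_interest = {"sgx", "tdx_guest", "amx_tile", "amx_bf16", "amx_int8"}
    return sorted(flags & of_interest)


# Example usage on a real Linux host:
#     with open("/proc/cpuinfo") as f:
#         print(tee_features(f.read()))
```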
They also successfully implemented a Retrieval-Augmented Generation (RAG) pipeline within a TEE, demonstrating its operational feasibility. The findings indicate that TEEs offer a viable path toward protecting LLM inference, positioning them as a foundational component for future confidential AI systems.
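To make the RAG result concrete, the sketch below shows the shape of such a pipeline: retrieve the documents most similar to the query, then assemble them with the query into an LLM prompt. It is a TEE-agnostic toy, not the paper's implementation; a bag-of-words retriever stands in for a real embedding model, and all names are illustrative. The confidentiality point is that inside a TEE, both the document store and the query stay within encrypted memory throughout these steps.

```python
from collections import Counter
from math import sqrt


def vectorize(text: str) -> Counter:
    # Toy bag-of-words term counts (stand-in for a real embedding model)
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    # Rank documents by similarity to the query, return the top-k
    q = vectorize(query)
    ranked = sorted(documents, key=lambda d: cosine(q, vectorize(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, documents: list[str]) -> str:
    # Concatenate retrieved context with the user query into one prompt;
    # in a TEE deployment this all happens inside encrypted memory.
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


docs = [
    "Patient records must remain confidential under HIPAA.",
    "The quarterly earnings report showed strong growth.",
]
prompt = build_prompt("How are patient records protected?", docs)
```

In a real confidential deployment, the retriever and the generating LLM would both run inside the TEE so that neither the proprietary corpus nor user queries leave the protected boundary.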
👉 More information
🗞 Confidential LLM Inference: Performance and Cost Across CPU and GPU TEEs
🧠 ArXiv: https://arxiv.org/abs/2509.18886
