torchchat Enables LLMs Like Llama 3.1 to Run on Laptops, Desktops, and Mobile Devices

Researchers Ali Khosh, Jesse White, and Orion Reblitz-Richardson have introduced torchchat, a library for running large language models such as Llama 3 and 3.1 seamlessly and performantly on laptops, desktops, and mobile devices. The project builds on earlier work that used native PyTorch 2 to run language models with strong performance on CUDA.

torchchat expands that capability by supporting more target environments, models, and execution modes, while providing essential functions such as export, quantization, and deployment in an easy-to-understand way. The library is organized into three areas (Python, C++, and mobile devices) and uses technologies such as PyTorch's AOTInductor backend and ExecuTorch to enable on-device inference.

Initial benchmarks show strong performance for Llama 3 8B across several configurations, including an Apple MacBook Pro M1 Max and Linux x86 with CUDA. The researchers invite the community to explore torchchat's capabilities, provide feedback, and contribute to its development, with the aim of unlocking the full potential of generative AI and language models on any device.

Accelerating Local Inference of Large Language Models with torchchat

torchchat is an open-source library that enables seamless, performant execution of large language models (LLMs) such as Llama 3 and 3.1 across devices, including laptops, desktops, and mobile phones. It builds on previous work using native PyTorch 2 to run LLMs with strong CUDA performance, and expands on it by supporting more target environments, models, and execution modes, as well as providing essential functions such as export, quantization, and deployment in an easy-to-understand manner.
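To make the export and quantization workflow concrete, here is a sketch of a typical session based on the commands documented in the torchchat README. The model alias, quantization JSON, and flag names are taken from that README but may differ between releases, so treat this as illustrative rather than definitive.

```shell
# Fetch model weights (requires a Hugging Face account with Llama access).
python3 torchchat.py download llama3.1

# Run eager-mode generation directly in Python.
python3 torchchat.py generate llama3.1 --prompt "It was a dark and stormy night,"

# Export a desktop artifact via the AOT Inductor backend, with 4-bit quantization.
python3 torchchat.py export llama3.1 \
    --quantize '{"linear:int4": {"groupsize": 256}}' \
    --output-dso-path llama3_1.so

# Export a .pte file for on-device inference with ExecuTorch.
python3 torchchat.py export llama3.1 \
    --quantize '{"linear:int4": {"groupsize": 256}}' \
    --output-pte-path llama3_1.pte
```

The same `--quantize` JSON drives both export paths, which is what lets one quantization recipe target desktop and mobile alike.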

The project is organized into three main areas: Python, C++, and mobile devices. The Python component provides a REST API that can be accessed via a command-line interface (CLI) or through a web browser. The C++ component generates a desktop-friendly binary using PyTorch's Ahead-of-Time (AOT) Inductor backend. For mobile devices, torchchat uses ExecuTorch to export a .pte binary file for on-device inference.
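The REST API follows the OpenAI chat completions schema, so a request can be composed with nothing but the Python standard library. A minimal sketch, assuming a torchchat server already running locally: the `build_chat_request` helper, the port, and the model alias are illustrative assumptions, not part of torchchat itself.

```python
import json
import urllib.request

def build_chat_request(prompt, model="llama3.1", base_url="http://localhost:5000"):
    """Build an OpenAI-style chat completion request for a local torchchat server.

    The endpoint path follows the OpenAI chat completions convention; the
    default port is an assumption and may differ in your setup.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("What is PyTorch?")
# Once the server is up, the request can be sent with urllib.request.urlopen(req).
```

Because the schema is OpenAI-compatible, existing OpenAI client libraries can also be pointed at the local server by overriding the base URL.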

Performance Benchmarking of torchchat

torchchat's performance has been benchmarked for Llama 3 8B across several configurations on an Apple MacBook Pro M1 Max. The results are presented in the following table:

| Mode | DType | Llama 3 8B Tokens/Sec |
|---|---|---|
| Arm Compile | float16 | 5.84 |
| Arm Compile | int8 | 1.63 |
| Arm Compile | int4 | 3.99 |
| Arm AOTI | float16 | 4.05 |
| Arm AOTI | int8 | 1.05 |
| Arm AOTI | int4 | 3.28 |
| MPS Eager | float16 | 12.63 |
| MPS Eager | int8 | 16.9 |
| MPS Eager | int4 | 17.15 |
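One way to read the table: on the M1 Max, 4-bit quantization speeds up the MPS eager path but actually slows down the Arm compile path. The ratios below are computed directly from the figures above; no new measurements are introduced.

```python
# Tokens/sec for Llama 3 8B on an Apple MacBook Pro M1 Max, from the table above.
mps_eager = {"float16": 12.63, "int8": 16.9, "int4": 17.15}
arm_compile = {"float16": 5.84, "int8": 1.63, "int4": 3.99}

# int4 throughput relative to float16 on each path.
mps_speedup = mps_eager["int4"] / mps_eager["float16"]
arm_speedup = arm_compile["int4"] / arm_compile["float16"]

print(f"MPS eager int4 vs float16: {mps_speedup:.2f}x")    # ~1.36x faster
print(f"Arm compile int4 vs float16: {arm_speedup:.2f}x")  # ~0.68x, i.e. slower
```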

Additionally, torchchat has been evaluated on Linux x86 and CUDA platforms, using an Intel(R) Xeon(R) Platinum 8339HC CPU @ 1.80GHz with 180GB RAM and an A100 (80GB) GPU. The results are presented in the following table:

| Mode | DType | Llama 3 8B Tokens/Sec |
|---|---|---|
| x86 Compile | bfloat16 | 2.76 |
| x86 Compile | int8 | 3.15 |
| x86 Compile | int4 | 5.33 |
| CUDA Compile | bfloat16 | 83.23 |
| CUDA Compile | int8 | 118.17 |
| CUDA Compile | int4 | 135.16 |
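The same arithmetic applied to this table shows that int4 yields a larger relative speedup on the CPU path, while the A100 remains roughly 25x faster in absolute throughput. Again, these ratios are derived purely from the figures above:

```python
# Tokens/sec for Llama 3 8B from the Linux x86 / CUDA table above.
x86_compile = {"bfloat16": 2.76, "int8": 3.15, "int4": 5.33}
cuda_compile = {"bfloat16": 83.23, "int8": 118.17, "int4": 135.16}

cuda_int4_speedup = cuda_compile["int4"] / cuda_compile["bfloat16"]
x86_int4_speedup = x86_compile["int4"] / x86_compile["bfloat16"]
gpu_vs_cpu = cuda_compile["int4"] / x86_compile["int4"]

print(f"CUDA int4 vs bfloat16: {cuda_int4_speedup:.2f}x")  # ~1.62x
print(f"x86 int4 vs bfloat16: {x86_int4_speedup:.2f}x")    # ~1.93x
print(f"A100 int4 vs x86 int4: {gpu_vs_cpu:.1f}x")         # absolute gap
```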

torchchat also delivers strong performance for Llama 3 8B on mobile devices, on both iPhone and Android. This early on-device support was developed in collaboration with the ExecuTorch team.

Future Directions and Community Contributions

The developers of torchchat are excited about the library's potential to help the PyTorch community run LLMs locally and on constrained devices. They encourage users to clone the torchchat repository, explore its capabilities, and provide feedback as the project continues to iterate quickly. The community is invited to contribute across a broad range of areas, including additional models, target hardware support, new quantization schemes, and performance improvements.

In the near future, even stronger performance is expected through the Core ML, MPS, and HTP backends. The torchchat team looks forward to unlocking the full potential of generative AI and LLMs on any device.
