torchchat Enables LLMs Like Llama 3.1 to Run on Laptops, Desktops, and Mobile Devices

Researchers Ali Khosh, Jesse White, and Orion Reblitz-Richardson have introduced torchchat, a library for running large language models such as Llama 3 and 3.1 seamlessly and performantly on laptops, desktops, and mobile devices. The project builds on earlier work that used native PyTorch 2 to run language models with strong performance on CUDA.

torchchat expands that capability by supporting more target environments, models, and execution modes, while providing essential functions such as export, quantization, and deployment in an easy-to-understand way. The library is organized into three areas (Python, C++, and mobile devices) and uses technologies such as PyTorch's AOTInductor backend and ExecuTorch to enable on-device inference.

Initial benchmarks show strong performance for Llama 3 8B across several configurations, including an Apple MacBook Pro M1 Max and Linux x86 with CUDA. The researchers invite the community to explore torchchat's capabilities, provide feedback, and contribute to its development, with the aim of unlocking the full potential of generative AI and language models on any device.

Accelerating Local Inference of Large Language Models with torchchat

torchchat is an open-source library that enables seamless, performant execution of large language models (LLMs) such as Llama 3 and 3.1 across devices, including laptops, desktops, and mobile phones. It builds on previous work using native PyTorch 2 to run LLMs with strong CUDA performance, and expands on it by supporting more target environments, models, and execution modes, as well as providing essential functions such as export, quantization, and deployment in an easy-to-understand manner.
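To make the export and quantization workflow concrete, here is a sketch of a typical session based on the commands documented in the torchchat README. The model alias, quantization JSON, and flag names are taken from that README but may differ between releases, so treat this as illustrative rather than definitive.

```shell
# Fetch model weights (requires a Hugging Face account with Llama access).
python3 torchchat.py download llama3.1

# Run eager-mode generation directly in Python.
python3 torchchat.py generate llama3.1 --prompt "It was a dark and stormy night,"

# Export a desktop artifact via the AOT Inductor backend, with 4-bit quantization.
python3 torchchat.py export llama3.1 \
    --quantize '{"linear:int4": {"groupsize": 256}}' \
    --output-dso-path llama3_1.so

# Export a .pte file for on-device inference with ExecuTorch.
python3 torchchat.py export llama3.1 \
    --quantize '{"linear:int4": {"groupsize": 256}}' \
    --output-pte-path llama3_1.pte
```

The same `--quantize` JSON drives both export paths, which is what lets one quantization recipe target desktop and mobile alike.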

The project is organized into three main areas: Python, C++, and mobile devices. The Python component provides a REST API that can be accessed via a command-line interface (CLI) or through a web browser. The C++ component generates a desktop-friendly binary using PyTorch's Ahead-of-Time (AOT) Inductor backend. For mobile devices, torchchat uses ExecuTorch to export a .pte binary file for on-device inference.
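The REST API follows the OpenAI chat completions schema, so a request can be composed with nothing but the Python standard library. A minimal sketch, assuming a torchchat server already running locally: the `build_chat_request` helper, the port, and the model alias are illustrative assumptions, not part of torchchat itself.

```python
import json
import urllib.request

def build_chat_request(prompt, model="llama3.1", base_url="http://localhost:5000"):
    """Build an OpenAI-style chat completion request for a local torchchat server.

    The endpoint path follows the OpenAI chat completions convention; the
    default port is an assumption and may differ in your setup.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("What is PyTorch?")
# Once the server is up, the request can be sent with urllib.request.urlopen(req).
```

Because the schema is OpenAI-compatible, existing OpenAI client libraries can also be pointed at the local server by overriding the base URL.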

Performance Benchmarking of torchchat

torchchat's performance has been benchmarked for Llama 3 8B across several configurations on an Apple MacBook Pro M1 Max. The results are presented in the following table:

| Mode | DType | Llama 3 8B Tokens/Sec |
|---|---|---|
| Arm Compile | float16 | 5.84 |
| Arm Compile | int8 | 1.63 |
| Arm Compile | int4 | 3.99 |
| Arm AOTI | float16 | 4.05 |
| Arm AOTI | int8 | 1.05 |
| Arm AOTI | int4 | 3.28 |
| MPS Eager | float16 | 12.63 |
| MPS Eager | int8 | 16.9 |
| MPS Eager | int4 | 17.15 |
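One way to read the table: on the M1 Max, 4-bit quantization speeds up the MPS eager path but actually slows down the Arm compile path. The ratios below are computed directly from the figures above; no new measurements are introduced.

```python
# Tokens/sec for Llama 3 8B on an Apple MacBook Pro M1 Max, from the table above.
mps_eager = {"float16": 12.63, "int8": 16.9, "int4": 17.15}
arm_compile = {"float16": 5.84, "int8": 1.63, "int4": 3.99}

# int4 throughput relative to float16 on each path.
mps_speedup = mps_eager["int4"] / mps_eager["float16"]
arm_speedup = arm_compile["int4"] / arm_compile["float16"]

print(f"MPS eager int4 vs float16: {mps_speedup:.2f}x")    # ~1.36x faster
print(f"Arm compile int4 vs float16: {arm_speedup:.2f}x")  # ~0.68x, i.e. slower
```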

Additionally, torchchat has been evaluated on Linux x86 and CUDA platforms, using an Intel(R) Xeon(R) Platinum 8339HC CPU @ 1.80GHz with 180GB RAM and an A100 (80GB) GPU. The results are presented in the following table:

| Mode | DType | Llama 3 8B Tokens/Sec |
|---|---|---|
| x86 Compile | bfloat16 | 2.76 |
| x86 Compile | int8 | 3.15 |
| x86 Compile | int4 | 5.33 |
| CUDA Compile | bfloat16 | 83.23 |
| CUDA Compile | int8 | 118.17 |
| CUDA Compile | int4 | 135.16 |
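The same arithmetic applied to this table shows that int4 yields a larger relative speedup on the CPU path, while the A100 remains roughly 25x faster in absolute throughput. Again, these ratios are derived purely from the figures above:

```python
# Tokens/sec for Llama 3 8B from the Linux x86 / CUDA table above.
x86_compile = {"bfloat16": 2.76, "int8": 3.15, "int4": 5.33}
cuda_compile = {"bfloat16": 83.23, "int8": 118.17, "int4": 135.16}

cuda_int4_speedup = cuda_compile["int4"] / cuda_compile["bfloat16"]
x86_int4_speedup = x86_compile["int4"] / x86_compile["bfloat16"]
gpu_vs_cpu = cuda_compile["int4"] / x86_compile["int4"]

print(f"CUDA int4 vs bfloat16: {cuda_int4_speedup:.2f}x")  # ~1.62x
print(f"x86 int4 vs bfloat16: {x86_int4_speedup:.2f}x")    # ~1.93x
print(f"A100 int4 vs x86 int4: {gpu_vs_cpu:.1f}x")         # absolute gap
```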

torchchat also delivers strong performance for Llama 3 8B on mobile devices, on both iPhone and Android. This early on-device support was developed in collaboration with the ExecuTorch team.

Future Directions and Community Contributions

The developers of torchchat are excited about the library's potential to help the PyTorch community run LLMs locally and on constrained devices. They encourage users to clone the torchchat repository, explore its capabilities, and provide feedback as the project continues to iterate quickly. The community is invited to contribute across a broad range of areas, including additional models, target hardware support, new quantization schemes, and performance improvements.

In the near future, even stronger performance is expected through the Core ML, MPS, and HTP backends. The torchchat team looks forward to unlocking the full potential of generative AI and LLMs on any device.
