Google’s Gemma Optimized for NVIDIA GPUs: A Collaborative Effort
NVIDIA and Google have recently announced a joint effort to optimize Google’s new open language models, Gemma, across all NVIDIA AI platforms. Gemma is a family of state-of-the-art lightweight open models, available in 2-billion- and 7-billion-parameter versions, that can be run anywhere, reducing costs and speeding up innovative work for domain-specific use cases. The collaboration improves Gemma’s performance on NVIDIA GPUs across environments: data centers, the cloud, and local workstations with NVIDIA RTX GPUs or PCs with GeForce RTX GPUs.
Gemma was optimized with NVIDIA TensorRT-LLM, an open-source library for accelerating large language model inference on NVIDIA GPUs. The collaboration lets developers target the installed base of more than 100 million NVIDIA RTX GPUs in high-performance AI PCs worldwide.
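As a rough illustration of what TensorRT-LLM-accelerated inference looks like from Python, here is a minimal sketch using the library’s high-level LLM API. The `LLM` and `SamplingParams` classes, their arguments, and the `google/gemma-2b` model ID are assumptions based on recent TensorRT-LLM releases, not details from the announcement; other releases instead build an engine with checkpoint-conversion scripts and the `trtllm-build` tool.

```python
# Minimal sketch (not from the announcement): Gemma inference through
# TensorRT-LLM's high-level Python API. Assumes a recent tensorrt_llm release
# that exposes LLM/SamplingParams and access to the gated "google/gemma-2b"
# checkpoint on Hugging Face; exact names and arguments may vary by version.
from tensorrt_llm import LLM, SamplingParams

# Build (or load a cached) TensorRT engine for the model and set up the runtime.
llm = LLM(model="google/gemma-2b")

prompts = ["Explain in one sentence what TensorRT-LLM does."]
sampling = SamplingParams(temperature=0.8, max_tokens=64)

# Run optimized generation on the local NVIDIA GPU.
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```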
Gemma on NVIDIA GPUs: Cloud and Local Applications
Developers can run Gemma on NVIDIA GPUs not only locally but also in the cloud, including on Google Cloud’s A3 instances based on the H100 Tensor Core GPU and, soon, on NVIDIA’s H200 Tensor Core GPUs, which feature 141 GB of HBM3e memory at 4.8 terabytes per second and which Google will deploy this year.
Running Gemma locally has several advantages. Results arrive faster because the model runs directly on the device, and user data stays private: it remains on the device, never needs to be shared with a third party, and no internet connection is required.
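As a concrete, hedged example of local execution, the snippet below loads Gemma 2B on a local NVIDIA GPU with the Hugging Face Transformers library rather than TensorRT-LLM. The `google/gemma-2b` checkpoint is gated and requires accepting Google’s license terms before download.

```python
# Minimal sketch: running Gemma 2B locally on an NVIDIA GPU with Hugging Face
# Transformers (an alternative path to TensorRT-LLM). "google/gemma-2b" is a
# gated checkpoint, so a Hugging Face token with the license accepted is needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision fits comfortably on an RTX GPU
    device_map="cuda",
)

# Everything below runs on the local device; no data leaves the machine.
inputs = tokenizer("Explain retrieval-augmented generation in one sentence.",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```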
NVIDIA’s Ecosystem of Tools: Enhancing Gemma’s Performance
Enterprise developers can further enhance Gemma’s performance by leveraging NVIDIA’s rich ecosystem of tools. This includes NVIDIA AI Enterprise with the NeMo framework and TensorRT-LLM, which can be used to fine-tune Gemma and deploy the optimized model in their production applications.
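The announcement points to NeMo and TensorRT-LLM for this fine-tune-and-deploy workflow; as a lightweight stand-in (not NVIDIA’s tooling), the sketch below fine-tunes Gemma 2B with LoRA adapters using the Hugging Face PEFT library. The dataset, hyperparameters, and target module names are illustrative assumptions.

```python
# Hedged stand-in for the NeMo fine-tuning workflow: parameter-efficient LoRA
# fine-tuning of Gemma 2B with Hugging Face PEFT. Dataset, hyperparameters, and
# target module names are illustrative; bf16 assumes an Ampere-or-newer GPU.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "google/gemma-2b"  # gated checkpoint; requires accepted license
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda")

# Attach low-rank adapters; only these small matrices are trained.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"))

# A small instruction dataset, used purely as an example.
data = load_dataset("databricks/databricks-dolly-15k", split="train[:1000]")
data = data.map(
    lambda ex: tokenizer(ex["instruction"] + "\n" + ex["response"],
                         truncation=True, max_length=512),
    remove_columns=data.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="gemma-2b-lora", num_train_epochs=1,
                           per_device_train_batch_size=1, bf16=True),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

In a production setting, the tuned model would then be converted to an optimized TensorRT-LLM engine for serving, as described above.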
Additional resources are available to help developers rev up inference for Gemma, including several Gemma model checkpoints and an FP8-quantized version of the model, all optimized with TensorRT-LLM.
Gemma and Chat with RTX: A New User Experience
NVIDIA is also planning to add support for Gemma to its tech demo, Chat with RTX. This application uses retrieval-augmented generation and TensorRT-LLM software to provide users with generative AI capabilities on their local, RTX-powered Windows PCs.
Chat with RTX lets users personalize a chatbot with their own data by connecting local files on an RTX PC to a large language model (a toy sketch of the underlying retrieval-augmented pattern appears below). Because the model runs locally, the experience is fast, secure, and personalized. Users can also try Gemma 2B and Gemma 7B directly from their browser on the NVIDIA AI Playground.
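To make the retrieval-augmented generation idea concrete, here is a toy sketch of the pattern: relevant chunks from local text files are retrieved and prepended to the prompt before it is sent to a locally running model. This illustrates the general technique only, not Chat with RTX’s actual implementation; the file layout and the TF-IDF retriever are assumptions.

```python
# Toy sketch of retrieval-augmented generation over local files (NOT the
# Chat with RTX implementation): retrieve the most relevant local text chunks
# with TF-IDF and prepend them to the prompt for a locally running LLM.
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(question: str, doc_dir: str, top_k: int = 2) -> list[str]:
    """Return the local text chunks most relevant to the question."""
    chunks = [p.read_text() for p in Path(doc_dir).glob("*.txt")]
    vectorizer = TfidfVectorizer().fit(chunks + [question])
    scores = cosine_similarity(vectorizer.transform([question]),
                               vectorizer.transform(chunks))[0]
    return [chunks[i] for i in scores.argsort()[::-1][:top_k]]

def build_prompt(question: str, doc_dir: str) -> str:
    """Ground the question in the user's own documents."""
    context = "\n\n".join(retrieve(question, doc_dir))
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

# The resulting prompt would then be passed to a locally running Gemma model,
# for example via one of the snippets shown earlier; no data leaves the PC.
```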
