Argonne National Laboratory (ANL) and Intel are developing AuroraGPT, a massive AI model with one trillion parameters. The model is being trained on ANL’s Aurora supercomputer, powered by Intel’s Ponte Vecchio GPUs. The project, also known as “ScienceGPT,” will offer a chatbot interface for researchers and could be applied in fields such as biology, cancer research, and climate change. The training process, which could take several months, uses Microsoft’s Megatron-DeepSpeed framework to manage the model’s memory requirements. The project aims to accelerate scientific AI development and revolutionise scientific computing.
Argonne National Laboratory’s AI Project: AuroraGPT
ANL and Intel are working with other research institutions in the United States and around the world to advance the development of scientific AI. The goal is to integrate a vast amount of text, code, scientific findings, and research papers into a versatile model that can expedite scientific discoveries. The model is also expected to feature a chatbot interface, allowing researchers to ask questions and receive immediate responses.
Potential Applications of AuroraGPT
The potential applications of AuroraGPT are extensive, spanning various scientific fields such as biology, cancer research, and climate change. The integration of chatbots into the scientific research process could streamline and enhance the pursuit of knowledge across these crucial areas. However, the model’s potential for generating images and videos remains uncertain. The inference capability will also be crucial as scientists interact with the chatbot and continually input new information.
Training Process and Challenges
Training a complex model like AuroraGPT requires significant time and computing resources, just as with ChatGPT and the other LLMs (Large Language Models) that have made headlines around the world. ANL and Intel are currently in the early stages of hardware testing before initiating full-scale training. The training process, which could take several months, will start with 256 nodes and eventually scale up to all 10,000-plus nodes of the Aurora supercomputer.
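To put those node counts in perspective, a quick back-of-the-envelope sketch helps. The per-node figures below are Aurora’s published specifications (six Ponte Vecchio GPUs per node, each with 128 GB of HBM), not numbers released by the AuroraGPT team:

```python
# Rough scale of the training runs described above. Per-node figures
# are Aurora's published specs (6 Ponte Vecchio GPUs per node,
# 128 GB of HBM per GPU), not AuroraGPT-specific numbers.
GPUS_PER_NODE = 6
HBM_PER_GPU_GB = 128

for nodes in (64, 256, 10_000):
    gpus = nodes * GPUS_PER_NODE
    hbm_tb = gpus * HBM_PER_GPU_GB / 1024
    print(f"{nodes:>6} nodes -> {gpus:>6,} GPUs, ~{hbm_tb:,.0f} TB aggregate HBM")
```

Even the 256-node starting configuration already represents more than 1,500 GPUs working in concert.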
One of the significant challenges in training large language models is their memory requirements, which often necessitate distributing the model across multiple GPUs. To address this, the AuroraGPT team is using Megatron-DeepSpeed (Microsoft’s DeepSpeed library combined with NVIDIA’s Megatron-LM) to enable parallel training and optimise performance.
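To make the memory problem concrete: under the standard mixed-precision Adam accounting (roughly 16 bytes per parameter once fp32 master weights, gradients, and both optimiser moments are included, the estimate popularised by the ZeRO paper), a trillion-parameter model needs on the order of 15 TB for weights and optimiser state alone. A minimal sketch; the byte counts are standard assumptions, not figures from the AuroraGPT team:

```python
# Why a 1-trillion-parameter model cannot live on a single GPU.
# Byte counts follow the usual mixed-precision Adam accounting
# (as in the ZeRO paper): 2 B bf16 weights + 2 B bf16 gradients
# + 4 B fp32 master weights + 4 B momentum + 4 B variance
# = 16 bytes per parameter, before activations.
PARAMS = 1e12                    # one trillion parameters
BYTES_PER_PARAM = 16             # assumption: mixed-precision Adam
HBM_PER_GPU_GB = 128             # Ponte Vecchio HBM capacity

total_bytes = PARAMS * BYTES_PER_PARAM
print(f"model + optimizer state: ~{total_bytes / 1024**4:,.1f} TB")

min_gpus = total_bytes / (HBM_PER_GPU_GB * 1024**3)
print(f"ZeRO-style sharding needs at least {min_gpus:,.0f} GPUs, "
      "before activations and working buffers are counted")
```

Megatron-DeepSpeed attacks this by combining ZeRO-style sharding of optimiser state with tensor and pipeline parallelism, so no single GPU ever has to hold the full model.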
Initial Testing and Future Goals
Initial testing of the one-trillion-parameter model is being conducted on a cluster of 64 Aurora nodes. That node count is lower than is typical for large language models because each Aurora node packs six Ponte Vecchio GPUs, so fewer nodes are needed to reach the same GPU count. Intel has worked closely with Microsoft to fine-tune both software and hardware, with the ultimate goal of extending training to the entire 10,000-plus-node system. Linear scaling, in which throughput grows in proportion to the node count, is a key aspiration, as the sketch below illustrates.
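Linear scaling has a simple operational test: the ideal throughput at N nodes is the 64-node baseline multiplied by N/64, and the ratio of measured to ideal throughput is the scaling efficiency. A minimal sketch; all throughput values below are hypothetical placeholders, since no AuroraGPT figures have been published:

```python
# Scaling efficiency relative to the 64-node baseline.
# All throughput values are hypothetical placeholders,
# not published AuroraGPT measurements.
BASELINE_NODES = 64
BASELINE_TPS = 1.0               # throughput, normalised to 1.0

measured = {256: 3.8, 1024: 14.1, 10_000: 120.0}   # hypothetical
for nodes, tps in sorted(measured.items()):
    ideal = BASELINE_TPS * nodes / BASELINE_NODES
    efficiency = tps / ideal
    print(f"{nodes:>6} nodes: {efficiency:.0%} of ideal linear scaling")
```

Perfect linear scaling would read 100% at every row; in practice, communication overhead tends to erode efficiency as node counts climb.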
The Impact of AuroraGPT
The development of AuroraGPT by Argonne National Laboratory, in collaboration with Intel and global research institutions, represents a significant advancement in the world of scientific AI. This project has the potential to revolutionise research methodologies and accelerate scientific discoveries across multiple domains. It also presents promising opportunities for businesses and institutions involved in AI and high-performance computing markets.
“It combines all the text, codes, specific scientific results, papers, into the model that science can use to speed up research.” – Ogi Brkic, Intel’s Vice President and General Manager for Data Center and HPC Solutions.
Brkic also noted that Intel’s Ponte Vecchio GPUs have demonstrated superior performance compared to Nvidia’s A100 GPUs on Theta, another Argonne supercomputer, which has a peak performance of 11.7 petaflops.
- Argonne National Laboratory (ANL) is developing a large-scale AI model named AuroraGPT, which has one trillion parameters.
- The model is being trained on ANL’s Aurora supercomputer, which uses Intel’s Ponte Vecchio GPUs.
- Intel and ANL are working with global research labs to speed up scientific AI development.
- AuroraGPT, also referred to as “ScienceGPT,” will have a chatbot interface for researchers to use for insights and answers.
- The AI model could be used in various scientific fields, including biology, cancer research, and climate change.
- The training process, which could take several months, will scale from 256 to 10,000 nodes.
- Challenges such as the memory requirements of large language models are being addressed using Megatron-DeepSpeed.
- Intel is aiming for linear scaling to improve performance as the number of nodes increases.
- Ogi Brkic, Intel’s Vice President and General Manager for Data Center and HPC Solutions, has highlighted the model’s potential to speed up research.
- The project is still in the early stages, with full-scale training yet to begin.
