A lot can change in four years.
In January 2021, Advanced Research Computing, a unit within the Virginia Tech Division of Information Technology, introduced Infer, its first graphics processing unit (GPU) cluster capable of handling significant artificial intelligence (AI) inference tasks.
Through AI inference, scientists use a machine learning model, which has been trained on existing data, to infer something about a new data set. This method can be used to make predictions about upcoming events or even to interpret previously incomprehensible things such as what a cow is “saying” when it moos.
Those abilities were remarkable then, but over the past four years, research leveraging AI, including large language models and the enormous data sets they depend on, has rapidly expanded at Virginia Tech, calling for far more GPU power than Infer offers.
“The production of data, and more recently, the development of AI, has seen explosive growth and is driving new levels of synthesis and analytics across all domains,” said Alberto Cano, associate vice president for research computing.
Cano oversees Advanced Research Computing, which provides high-performance computing resources and expertise for the Virginia Tech research community. “While Infer provided a capable resource for handling moderate data sets for AI and machine learning computations, GPUs capable of handling large language models weren’t readily available in 2021.”
Falcon offers power, speed, and flexibility
Enter Falcon, Virginia Tech’s newest GPU cluster, which came online in December and is now available for use by faculty and students through Advanced Research Computing.
Falcon is composed of 52 nodes, each with four GPUs, for a total of 208 GPUs. To maximize flexibility in the types of research that can be conducted using Falcon, there are two types of GPUs within the cluster.
Thirty-two nodes are equipped with NVIDIA A30 GPUs with 24 gigabytes of memory per GPU. These support double-precision floating point, or FP64, calculations, which provide highly precise outputs with minimal rounding error and are used in scientific research applications such as fluid dynamics or product design. The other 20 nodes are equipped with NVIDIA L40S GPUs with 48 gigabytes of memory per GPU. This larger memory capacity is needed for working with large language models (LLMs).
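To make the point about rounding error concrete, here is a minimal sketch in Python. It is an illustration only, not code from Advanced Research Computing: it emulates single precision (FP32) on the CPU by rounding through the standard `struct` module, and the helper names `to_float32` and `accumulate` are invented for this example. It adds 0.1 together 100,000 times in each precision and compares both totals against the exact answer of 10,000.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step: float, count: int, single_precision: bool) -> float:
    """Add `step` to a running total `count` times.

    With single_precision=True, the step and every intermediate total
    are rounded to FP32, emulating single-precision accumulation.
    """
    if single_precision:
        step = to_float32(step)
    total = 0.0
    for _ in range(count):
        total += step
        if single_precision:
            total = to_float32(total)
    return total


COUNT = 100_000
EXACT = 10_000.0  # 0.1 * 100,000 in exact arithmetic

sum_fp64 = accumulate(0.1, COUNT, single_precision=False)
sum_fp32 = accumulate(0.1, COUNT, single_precision=True)

err_fp64 = abs(sum_fp64 - EXACT)
err_fp32 = abs(sum_fp32 - EXACT)

print(f"FP64 total: {sum_fp64!r}  (error {err_fp64:.3e})")
print(f"FP32 total: {sum_fp32!r}  (error {err_fp32:.3e})")
```

The FP32 total drifts visibly from 10,000, while the FP64 total stays accurate to many decimal places. Long-running simulations perform far more operations than this loop, which is why double precision matters for that kind of work.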
Falcon’s interconnect fabric, the technology that allows data to move within and between the GPUs, is also a significant upgrade in terms of increased speed and reduced latency. It is the first of Advanced Research Computing’s clusters to use next-generation data rate InfiniBand, a network communications technology that features very high data transfer speeds and very low latency. Each of Falcon’s nodes is equipped with double data rate connectors, which provide throughput speeds of 200 gigabits per second per node. That is up to 20 times faster than what was possible on Infer, a capability that greatly reduces the time it takes to load or transfer large data sets.
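As a rough sense of scale, the short Python sketch below estimates best-case transfer times at the quoted link speed. Several things here are assumptions rather than facts from Advanced Research Computing: the 1-terabyte data set is hypothetical, the Infer rate is simply what the "up to 20 times faster" figure implies (200 divided by 20), and the calculation ignores protocol overhead, storage speed, and network contention, so real transfers will take longer.

```python
def transfer_seconds(size_gigabytes: float, link_gigabits_per_s: float) -> float:
    """Best-case time to move a data set over a link at full line rate."""
    size_gigabits = size_gigabytes * 8  # 8 bits per byte
    return size_gigabits / link_gigabits_per_s


FALCON_GBPS = 200.0                      # per-node rate quoted in the article
QUOTED_SPEEDUP = 20.0                    # "up to 20 times faster" than Infer
IMPLIED_INFER_GBPS = FALCON_GBPS / QUOTED_SPEEDUP  # derived, not a published spec
DATASET_GB = 1000.0                      # hypothetical 1-terabyte data set

falcon_s = transfer_seconds(DATASET_GB, FALCON_GBPS)
infer_s = transfer_seconds(DATASET_GB, IMPLIED_INFER_GBPS)

print(f"Falcon (200 Gb/s): {falcon_s:.0f} seconds")
print(f"Infer (implied {IMPLIED_INFER_GBPS:.0f} Gb/s): {infer_s:.0f} seconds")
```

Under these idealized assumptions, the hypothetical data set moves in well under a minute on Falcon, compared with more than 13 minutes at the implied Infer rate.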
“Maintaining a cutting-edge HPC [high-performance computing] infrastructure is vital for advancing both our university’s mission and the broader scientific community. With the introduction of NVIDIA A30 and L40S GPUs, we are not just upgrading our systems, we are unlocking unprecedented opportunities for research in AI, data science, and complex simulations. This new generation of GPUs empowers our researchers to tackle larger, more intricate problems, accelerating discoveries and driving innovation to new heights,” said Cano.
All this translates to a powerful, versatile GPU cluster that is fit to meet the needs of research teams at Virginia Tech now and into the future.
Following Falcon’s installation this past summer, Advanced Research Computing worked with several research teams that served as early testers of the cluster. Several of these researchers are graduate students in computer science who are part of the Virginia Tech Learning on Graphs Lab (VLOG), which is directed by Dawei Zhou, assistant professor of computer science.
“Falcon’s computational power will be vital for my research on uncertainty quantification in large language models, enabling efficient experiments and analysis. As part of the VLOG Lab, I’ll leverage Falcon to enhance AI trustworthiness, advancing reliable methods for LLMs and graph neural networks,” Tuo Wang said.
“Falcon will significantly enhance my research in protein disorders by enabling high-performance computational simulations and large-scale analysis of disordered protein regions and interactions. This will allow for more detailed and accurate modeling, facilitating deeper insights into the dynamic behavior of intrinsically disordered proteins,” said Xinyue “Susan” Zeng.
Sina Mostafanejad, software scientist in the Department of Chemistry and for the Molecular Sciences Software Institute, also has been an early test user of Falcon. “The new NVIDIA L40S and A30 GPUs in the Falcon cluster will boost our research in training and testing foundation large language models for chemistry and will help accelerate new scientific discoveries in computational molecular science,” Mostafanejad said.
More information about Falcon, as well as other high-performance computing resources, can be found at arc.vt.edu.
By Kit Hayes