Few technologies have been adopted as quickly or as widely as artificial intelligence (AI) and machine learning (ML). In fact, in its recently released report, 2023 Generative AI in the Enterprise, technology publisher and training provider O’Reilly claimed, “We’ve never seen a technology adopted as fast as generative AI—it’s hard to believe that ChatGPT is barely a year old.”
This growth is primarily driven by AI’s promise to streamline workflows, increase efficiency, and improve operations across enterprises. Technologists anticipate that AI will help to eliminate redundant, repetitive tasks in offices, improve medical diagnosis, and even enable the self-driving revolution in transportation.
One of the companies spearheading this AI revolution is NVIDIA, which has pioneered many of the hardware and software technologies that make modern AI applications possible.
Jason Black (JB), the Director of Cloud and Edge Data Center Engineering at NVIDIA, will be the keynote speaker at an upcoming Thought Leadership Summit sponsored by Compu Dynamics and InterGlobix focused on demystifying AI infrastructure and its implications for the data center industry.
We recently sat down with JB to explore how NVIDIA is leading the advancement of AI, how data centers are evolving to meet the requirements of AI workloads, and what attendees can expect to learn at the upcoming Summit.
The Modern Data Center Journal (MDCJ): Can you tell our readers about the role of NVIDIA’s technologies in enabling advanced AI applications? How is NVIDIA driving the AI revolution?
JB: Over the past few years, NVIDIA has worked to identify new ways in which we can enable our customers and users to better advance their AI and ML strategies and programs. We’ve done that through the creation of end-to-end hardware and software solutions that facilitate the creation, training, and use of AI and ML applications.
We’ve built these solutions through our own innovation and through strategic acquisitions, including the acquisition of Mellanox and its InfiniBand technology. We’ve taken that InfiniBand solution and rolled it into our east-west compute fabric to ensure that the low-latency network needed for AI and ML is achievable and repeatable. We also acquired Cumulus Networks, which gave us a leading network operating system that can run on top of our switches.
Through the acquisitions of Mellanox and Cumulus Networks, we gained the ability to provide the east-west fabric as well as the north-south network and control plane, including management and storage traffic.
“…to adapt to the power and cooling footprint of new architectures, the industry is exploring ways to densify racks so that the footprint doesn’t have to expand horizontally. Instead, it can expand vertically within the rack. These efforts have driven the industry from 12 kW in a rack to 40 or 50 kW in a rack.” – Jason Black
When you combine these systems with the applications we provide, we can offer a very repeatable, high-performance AI and ML solution. This is truly an end-to-end system that incorporates every component, from the switching fabric to the compute clusters, which consist of highly tuned compute nodes – inclusive of the NICs, GPUs, DPUs, and CPUs.
For example, these components integrate into what we refer to as a “SuperPOD,” which offers repeatable, consistent performance for AI and ML applications.
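For readers who want a feel for what “repeatable” means in practice, here is a minimal sketch of a pod as a building block of identical nodes. All figures (GPUs per node, per-node power, and so on) are illustrative assumptions for this sketch, not an actual SuperPOD specification.

```python
from dataclasses import dataclass

@dataclass
class ComputeNode:
    """One hypothetical, highly tuned compute node.

    All figures are illustrative assumptions for this sketch,
    not an actual NVIDIA SuperPOD specification."""
    gpus: int = 8            # accelerators per node
    cpus: int = 2            # host CPUs per node
    nics: int = 8            # east-west fabric NICs per node
    dpus: int = 1            # DPU for north-south / control-plane traffic
    power_kw: float = 10.0   # assumed per-node power draw

def pod_summary(nodes: int, node: ComputeNode) -> dict:
    """Aggregate counts for a pod of identical nodes.

    Because the pod is a repeatable unit, scaling out simply
    multiplies the same building block, which is what makes
    performance predictable."""
    return {
        "nodes": nodes,
        "gpus": nodes * node.gpus,
        "fabric_ports": nodes * node.nics,
        "power_kw": nodes * node.power_kw,
    }

print(pod_summary(nodes=32, node=ComputeNode()))
# {'nodes': 32, 'gpus': 256, 'fabric_ports': 256, 'power_kw': 320.0}
```

The point of the sketch is the design choice, not the numbers: because every unit is identical, capacity planning and performance tuning done once carry over to every additional pod.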
MDCJ: NVIDIA doesn’t just make the hardware that powers AI solutions, it also provides AI software and AI consulting services. Can you tell our readers about these services and how they help organizations take the leap into AI?
JB: My focus is on the solutions that empower the applications and allow them to run, but I know we’ve partnered with enterprises and government agencies to help them create AI and ML applications that can improve their operations. For example, one exciting use case we helped develop was in conjunction with auto manufacturers and the broader auto industry.
Leveraging our AI and ML solutions, infrastructure, and knowledge, we were able to help auto manufacturers create digital twins of their assembly lines and manufacturing facilities. The resulting digital twins – or digital 3D models – allow the manufacturers to test changes to an assembly line in the digital space before rolling them out in the real world. They also allow them to identify deficiencies or problems with proposed assembly lines before physically constructing them.
This solution was built with auto manufacturers and is now offered as a tool for the broader automotive industry. As you can imagine, it saves manufacturers considerable time and money by eliminating rework and costly mistakes in the planning and development of manufacturing facilities.
MDCJ: NVIDIA’s hardware is becoming more advanced, sophisticated, and powerful. Does this impact the data centers in which this hardware runs? What is changing, and how does this impact the data center?
JB: To adapt to the power and cooling footprint of new architectures, the industry is exploring ways to densify racks so that the footprint doesn’t have to expand horizontally. Instead, it can expand vertically within the rack. These efforts have driven the industry from 12 kW in a rack to 40 or 50 kW in a rack.
“Liquid cooling is – without a doubt – the wave of the future. It’s the only way we will enable the power footprints required by the advancement we’ve seen in CPUs and GPUs.” – Jason Black
However, these increased power densities within the rack necessitate a shift in how we cool data centers. They’re forcing us to evolve away from air cooling, which is no longer capable of meeting the cooling needs of these increasingly dense racks. Instead, the industry is adopting liquid cooling, which is more effective, more efficient, and capable of meeting these greater heat loads.
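A back-of-envelope comparison shows why liquid wins at these densities. The sketch below applies the standard heat-transfer relation Q = ṁ·cp·ΔT to a single dense rack; the fluid properties are textbook values, while the rack power and temperature rises are illustrative assumptions, not figures from the interview.

```python
# Back-of-envelope coolant comparison for one dense rack.
# Q = mdot * cp * dT  ->  mdot = Q / (cp * dT)
# Fluid properties are textbook values; the rack power and
# temperature rises are illustrative assumptions.

RACK_POWER_W = 50_000  # a 50 kW rack, per the densities discussed above

# Air: cp ~ 1005 J/(kg*K), density ~ 1.2 kg/m^3, assume a 15 K rise
air_mdot = RACK_POWER_W / (1005 * 15)          # kg/s of air
air_flow_cfm = air_mdot / 1.2 * 2118.88        # m^3/s -> cubic feet per minute

# Water: cp ~ 4186 J/(kg*K), density ~ 997 kg/m^3, assume a 10 K rise
water_mdot = RACK_POWER_W / (4186 * 10)        # kg/s of water
water_flow_lpm = water_mdot / 997 * 1000 * 60  # liters per minute

print(f"Air:   {air_flow_cfm:,.0f} CFM through one rack")
print(f"Water: {water_flow_lpm:.0f} L/min through one rack")
# Air:   5,856 CFM through one rack
# Water: 72 L/min through one rack
```

Moving nearly 6,000 cubic feet of air per minute through a single rack is impractical; roughly 70 liters of water per minute is modest plumbing. That gap is the intuition behind the shift.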
We’re working with data center providers to ensure that what they’re building is in line with the requirements of our solutions and the demands of the end customer. Customers don’t want to spread their nodes out across a large footprint. They don’t want a single node in each rack; putting 128 nodes in 128 racks simply isn’t efficient. Instead, they want to use less space and increase the density within the rack – which makes liquid cooling essential.
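To put rough numbers on that point, here is a quick footprint calculation for a 128-node cluster. The ~10 kW per-node figure is an illustrative assumption (not from the interview); the per-rack limits echo the 12 kW and 50 kW densities mentioned earlier.

```python
# Rough footprint comparison for a 128-node cluster, assuming
# ~10 kW per node (an illustrative figure) and the per-rack
# power limits mentioned above.

NODES = 128
NODE_KW = 10.0

for rack_limit_kw in (12, 50):
    nodes_per_rack = max(1, int(rack_limit_kw // NODE_KW))
    racks = -(-NODES // nodes_per_rack)  # ceiling division
    print(f"{rack_limit_kw} kW racks: {nodes_per_rack} node(s) per rack "
          f"-> {racks} racks total")
# 12 kW racks: 1 node(s) per rack -> 128 racks total
# 50 kW racks: 5 node(s) per rack -> 26 racks total
```

Under these assumptions, air-cooled densities force one node per rack, while liquid-cooled densities fit the same cluster into about a fifth of the racks.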
It simply doesn’t make sense to limit what is possible within the envelope by failing to evolve from air cooling to liquid cooling.
MDCJ: You’re going to be speaking at an upcoming event focused on liquid cooling. Is it safe to say that liquid cooling is the future for data centers?
JB: Liquid cooling is – without a doubt – the wave of the future. It’s the only way we will enable the power footprints required by the advancement we’ve seen in CPUs and GPUs. Because of that, I don’t see air cooling being used in the future except for cooling simple CPU racks.
MDCJ: In your opinion, what considerations do data center owners and operators need to keep in mind when designing and constructing a liquid-cooled data center for AI applications?
JB: It needs to be a triangular approach, with all stakeholders working together to understand the requirements and what is needed in the data center. It can’t just be about what the hardware manufacturer is developing or the data center provider is building. There also needs to be a focus on the end users – the hyperscalers, the cloud giants, and the social media companies – and their future requirements.
Companies like NVIDIA and the colocation providers need to have discussions with the hyperscalers and social media giants to get a better understanding of their growth plans and how they’re using technology. Will their 32 nodes quickly expand to 128? What does that roadmap and timeframe look like?
“…one exciting use case we helped develop was in conjunction with auto manufacturers and the broader auto industry. Leveraging our AI and ML solutions, infrastructure, and knowledge, we were able to help auto manufacturers create digital twins of their assembly lines and manufacturing facilities.” – Jason Black
We need to know what they’re trying to achieve with the hardware they’re deploying and the space in which they’ll deploy it. “Day One” doesn’t always look like “Day N,” and we must plan for that. Roadmaps need to align between what the hardware vendor is supplying to the industry, what the customers think they’ll need to meet growth and scale, and the data center space made available by data center providers.
MDCJ: Who should attend this upcoming liquid cooling event? What can/should they expect to learn?
JB: Honestly, I would suggest that technologists looking to advance their capabilities through AI/ML, or looking to be more efficient, should attend. That includes anyone who wants to utilize this technology, get a better understanding of where it is today, and see where it’s going as it relates to space, power, and cooling.
AI is one of the fastest-growing technologies because it applies to all industries. AI is a horizontal line, and every industry is a vertical line that intersects with it. There are use cases in government and private enterprises – across all industries. What is discussed at this upcoming event will be interesting, relevant, and important for technologists from every industry and sector.