In the last article on The Modern Data Center, my associate Steve Altizer explored the rise of artificial intelligence (AI) and explained how AI has become one of the hottest technologies among IT solution providers, hyperscalers, and the business community. But, as Steve discussed, the rise of AI comes with questions, concerns, and repercussions. And one of the largest and most immediate impacts that AI will have is in the data center.
AI has massive benefits for its users – from generative AI writing code and marketing language to custom AI applications improving processes and streamlining operations. However, these AI applications need to run on servers in a data center, and the cabinets of servers that run advanced AI applications are not the same as the cabinets or racks found in traditional “production” data centers.
The racks in production data centers draw – on average – 8 kW to 10 kW of power. AI and high-performance computing (HPC) cabinets can draw 50 kW to 75 kW per cabinet or more. That’s because they need to be incredibly high-density to deliver the computing power necessary to run advanced AI applications. Unfortunately, this incredible density comes with a cost.
AI has no chill
The early 2000s saw the rise of virtualization – which was effectively the precursor to the cloud. During this time, the density of racks increased from approximately 1 kW or 2 kW to approximately 8 kW to 10 kW. A similar step function in rack densities is happening again – and it’s straining data center infrastructure the same way the rise of virtualization did two decades ago.
As rack densities rise, we see substantial increases in the power consumed by each rack and the amount of heat that each rack produces. As densities continue to climb with the broader adoption of AI applications and solutions, data center operators and providers will soon face a problem – air cooling will no longer be adequate to meet data center cooling requirements.
And there are two reasons for that. First, thermodynamics tells us that air can only absorb heat so efficiently and quickly. Second, running the compressors and fans necessary for data center air cooling takes a large amount of power itself – valuable power that could otherwise be used to run these increasingly dense racks and cabinets. A 2015 ACEEE whitepaper estimated that server fans alone can consume as much as 25 percent of the total power drawn by racks of servers, not including the power used by the room air conditioners.
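To put that overhead in perspective, here is a rough back-of-envelope sketch in Python using the figures cited above – the 8 kW to 10 kW and 50 kW to 75 kW rack draws and the ACEEE fan estimate. The exact numbers will vary by facility and workload; treat this as an illustration of scale, not a measurement.

```python
# Rough sketch of fan-energy overhead using the figures cited in this article.
# Assumptions (illustrative only): server fans can consume up to ~25% of the
# total power drawn by a rack (the 2015 ACEEE estimate), and room air
# conditioning is NOT included in that figure.

FAN_SHARE = 0.25  # assumed upper-bound share of rack power spent on fans

def fan_power_kw(total_rack_kw: float) -> float:
    """Estimate how much of a rack's total draw goes to its servers' fans."""
    return total_rack_kw * FAN_SHARE

for label, rack_kw in [("Traditional production rack", 10), ("AI/HPC rack", 75)]:
    print(f"{label}: ~{rack_kw} kW total draw, "
          f"up to ~{fan_power_kw(rack_kw):.1f} kW of it spent just moving air")
```

For a 75 kW AI rack, that upper bound works out to nearly 19 kW spent on airflow alone – power that liquid cooling can largely put back to work running servers.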
But if air cooling is no longer adequate for the needs of modern AI data centers, and has such a high energy overhead, what options and alternatives exist? And what would need to be done to embrace these alternative types of cooling in the data center?
Four flavors of liquid cooling
While air may not be able to remove heat as quickly and efficiently as AI data centers require, liquid can. Liquid cooling also requires far less power to operate, reducing the overall cost of data center operation and keeping cooling from drawing valuable power from the cabinets.
Liquid cooling is clearly superior for the AI data centers of the future, but the technology is still in its infancy. As a result, there are a number of disparate liquid cooling approaches and technologies that are being assessed by data center operators and providers. Let’s look at four of the most common approaches, and why – or why not – the industry may choose to standardize on them.
1) Rear door heat exchangers
In our experience, the most widely adopted of the four liquid cooling technologies that we’re going to explore is the rear door heat exchanger – a liquid cooling appliance that attaches to the back of each cabinet.
One reason why rear door heat exchangers are so widely utilized is that they can be deployed quickly, easily, and inexpensively in traditional data centers. The concept scales well in the production data centers that exist today – working well in large rooms with row upon row of cabinets. Also, since the exchangers – which look something like very high-tech radiators – attach to the back of the cabinet, there is no lengthy or invasive installation process.
However, just because a solution is widely adopted doesn’t mean it’s the best. Rear door heat exchangers have their drawbacks. Of the four liquid cooling approaches that we’ll explore in this article, they’re the least effective. They also still require fans to move hot air through the servers – away from the GPUs and CPUs and toward the heat exchanger at the back of the rack. This solution sometimes also involves attaching supplemental fans to the heat exchangers – driving up the energy overhead even more.
2) Direct-to-chip liquid cooling
If the problem with rear door heat exchangers is that the liquid sits too far from the GPUs and CPUs to remove heat effectively, direct-to-chip liquid cooling addresses it head-on. Instead of placing the liquid at the back of the cabinet and relying on moving air to carry heat from the CPUs and GPUs to the heat exchanger, this approach brings the liquid directly to the chips themselves.
Direct-to-chip liquid cooling involves replacing the traditional heat sinks installed on CPUs and GPUs with heat sinks that utilize pumped liquid for cooling. This places the liquid in direct proximity to the source of heat, eliminating much (but not all) of the need to move air. However, it also comes with challenges.
The installation of direct-to-chip liquid cooling requires an extensive network of piping to carry the liquid directly into the rack and to the chip. And since each and every CPU and GPU needs its own liquid-cooled heat sink, specially manufactured IT equipment must be purchased, as retrofitting off-the-shelf commercial equipment would be impractical.
3) Flooded chassis
While some equipment manufacturers are making solutions that bring liquid to the rack or the chip, the makers of flooded chassis solutions are thinking big. They’re daring to ask, “Why not just flood the whole server with liquid?”
The flooded chassis approach to liquid cooling employs a leak-proof aluminum case for server components that is subsequently “flooded” with a dielectric fluid. This effectively submerges the server components in a cooling liquid.
While this solution provides the upside of extremely efficient heat removal, there are drawbacks. The number of providers making flooded chassis solutions is limited – potentially creating bottlenecks. Also, this requires almost as much fluid piping as direct-to-chip liquid cooling, since liquid needs to be directed to each server.
4) Immersion cooling
If the manufacturers of flooded chassis solutions are thinking big, immersion cooling is a result of thinking even bigger. In this approach to liquid cooling, the servers are submerged in large tanks of dielectric fluid. This eliminates the need for fans to cool the IT equipment, greatly reduces the amount of requisite plumbing, and greatly reduces the hardware costs of embracing liquid cooling.
But thinking THIS big comes with its own set of challenges. First off, the design and construction of the data center need to change. Large tanks and the added weight of dielectric fluid need to be accounted for in the design and construction of the white spaces. Then, there’s the mess.
Data centers have always (ideally) been clean spaces. Immersion cooling makes it much more difficult to keep a data center clean. Every time a server needs maintenance or a component replaced, it has to be pulled out of the tank – bringing oily, slippery dielectric fluid with it and potentially making work areas quite messy. No doubt data center technicians will find practical ways to deal with this over time, but there will be a learning curve.
But which approach is superior? And what does that mean for the data center?
The jury is out…
In its newest incarnation, liquid cooling is still a relatively unfamiliar and emerging cooling solution for data centers, and the industry has yet to standardize on one of these four approaches. If pressed, I would say that rear door heat exchangers may remain the most widely adopted solution in the near future because they enable inexpensive and noninvasive liquid cooling retrofits of existing data centers.
However, the medium- to long-term future is much more wide open. The fact that rear door heat exchangers still require moving air and a lot of power to function could limit their use in the long term.
Personally, my money is on immersion cooling being the long-term solution for cooling the AI data centers of the future. The cost to implement is relatively low, there’s no need for moving air, and there is a massive energy savings benefit. The only real challenge is a logistical one – identifying a way to maintain servers without making a mess.
Air cooling cannot meet the intense cooling requirements of the AI data centers of the future. The rack densities and power draws of today’s advanced AI solutions simply demand more from their cooling systems. So, regardless of which approach the industry chooses as the winner of this race, the winning solution will invariably involve liquid cooling.