Last month, Compu Dynamics, in collaboration with InterGlobix, delivered a Thought Leadership Summit focused on artificial intelligence (AI), machine learning (ML), and the impact of these technologies on the data center and IT industries. This event brought together leaders from across the IT and data center industries to discuss the use cases and benefits of AI and ML, the advancements in technology driving its adoption, and the challenges these industries will face as the demand for and adoption of AI skyrocket.
What was abundantly clear in every panel discussion, keynote address, and fireside chat during the event was that AI is still an emerging technology – an iceberg whose tip we’re only beginning to see above the water. Every speaker and attendee agreed that we’ve only begun to scratch the surface of what AI is capable of and that the functionality and adoption of AI will only increase as technology advances.
“I think people are still trying to figure it out…it’s like the first pitch of the baseball game,” Jason Carolan of Flexential said. “I think what’s sort of cool now…with generative AI or conversational AI… it’s now much more accessible.”
The excitement over AI and ML is warranted. Attendees at the Summit believed AI could benefit society and make a real difference in our lives in numerous ways. Fireside chat participant Allen Meeks explained how his company, MOX Networks, was working to help cure disease using AI, Big Data, and data analytics. He also laid out three other impactful ways AI could solve large societal problems: “Number one is probably agriculture – to increase yields and feed more people. Next is improving manufacturing through automation, and [finally] finance for efficiency.”
However, as we push the boundaries of what is possible with AI and look to solve increasingly larger societal problems with the technology, the compute power needed to train and execute AI applications will only increase. We’re already seeing this in today’s data centers, with rack power densities rising exponentially to feed the massive compute requirements of modern AI and ML applications. As panelist Jim Julson of CoreWeave explained:
“If you had asked [me] two years ago about 40 kW racks, I would have said that’s insane. And now, just as we’re getting designs down for the physical layer and the deployment models for liquid cooling for a 40 kW rack, there’s a new deployment model for 80 kW. Then there’s a 110 kW rack. It’s moving faster than the people deploying it can keep up with or react to…”
“We [initially adopted] water cooling for efficiency, but [now] we’re doing water cooling out of necessity… because I need to capture absolutely everything I can just to give myself more power to run.” – Jason Black
Unfortunately, this growth and these rapidly rising rack power densities are accelerating the data center industry toward a proverbial brick wall. Numerous times during the Summit, panelists and speakers referenced massive bottlenecks that threaten to derail the continued expansion of the data center industry that powers modern AI applications. Worse, these bottlenecks involve critical infrastructure that the industry’s power players have little or no control over.
Watts the problem?
As Jim Julson’s comments illustrate, the power densities of AI racks have long since surpassed the 8 kW-10 kW of traditional production data centers. Today, companies investing in infrastructure to power advanced AI and ML applications are flirting with 40 kW-80 kW per rack, and the demand for more power is constantly rising.
As discussed previously on The Modern Data Center Journal, this has led to changes in how data centers are designed and cooled. It’s also driving up the power each data center campus consumes.
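To put those density figures in perspective, here is a rough back-of-the-envelope comparison (the rack count and densities below are illustrative assumptions, not figures cited at the Summit): the same 100-rack data hall jumps from roughly 1 MW of IT load at legacy densities to 8 MW at the upper end of today’s AI densities.

```python
# Illustrative back-of-the-envelope math (assumed values, not Summit figures):
# IT load of a 100-rack data hall at legacy vs. AI-era rack densities.

RACKS = 100

legacy_kw_per_rack = 10   # traditional production density (the 8 kW-10 kW range)
ai_kw_per_rack = 80       # upper end of the densities discussed above

legacy_it_load_mw = RACKS * legacy_kw_per_rack / 1000
ai_it_load_mw = RACKS * ai_kw_per_rack / 1000

print(f"Legacy hall IT load: {legacy_it_load_mw:.1f} MW")  # 1.0 MW
print(f"AI hall IT load:     {ai_it_load_mw:.1f} MW")      # 8.0 MW
print(f"Increase:            {ai_it_load_mw / legacy_it_load_mw:.0f}x")
```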
The seemingly insatiable appetite for AI is sparking the construction of more power-hungry data centers and data center campuses. As panelist Matthew Archibald of nVent pointed out, “…just drive up and down [State Route] 28. There are 15 data centers under construction [and] not one of them is small.”
The end result of this Augustus Gloop-esque power gluttony is overtaxed power grids and local municipal utilities that are unable to keep pace with the demand for power from data centers.
As Jason Carolan explained, “I was reading the other day [that] the U.S. will bring about 25 GW-30 GW [of power] online this year. [The data center industry] consumed 9 GW in Q2. In Q3 [they consumed] 7 GW, and I think Q4 was about 7 GW. When you add all that up, it becomes clear that power is a problem we need to solve…”
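Taking Jason’s figures at face value (they are quoted panel estimates, not audited numbers), the arithmetic is stark: roughly 23 GW of data center demand over three quarters, measured against 25 GW-30 GW of new generation for the entire year.

```python
# Rough comparison using the figures Jason Carolan quoted above
# (panel estimates, not audited data).

new_generation_gw = (25, 30)         # new U.S. capacity expected online this year
dc_demand_gw = [9, 7, 7]             # data center consumption cited for Q2-Q4

total_dc_demand = sum(dc_demand_gw)  # 23 GW over three quarters

for gen in new_generation_gw:
    share = total_dc_demand / gen * 100
    print(f"Data centers alone would absorb {share:.0f}% of {gen} GW of new capacity")
# -> roughly 77%-92% of a full year's new generation, consumed in nine months
```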
It wasn’t just Jason who shared this opinion. Practically every speaker, panelist, and attendee considered power to be one of the largest limiting factors for the growth of AI and the data center industry that powers it.
“…most of the long-haul networks that were built – networks that cover more than 400 miles and several regions – were done in the early 90s or the early 2000s.” – Allen Meeks
Benjamin Lundquist of CoreSite lamented that “…power is going to be – in the long term – the choke point for AI because we can only generate so much energy.” And Jason Carolan wondered if we’d need AI to figure out a solution to the power situation being caused by AI.
However, until AI comes up with a way to solve the power problem it is creating, the participants at the Summit suggested a few other options.
Multiple participants claimed that the power shortage facing the data center industry requires a fresh approach to how power is used within the data center. If there is only so much power available to run the data halls and the mechanical equipment that keeps the data center operating, then owners and operators need to find ways to minimize the power spent on things like cooling and maximize the energy available to the data hall.
As Jason “JB” Black of NVIDIA wondered, “…let’s say you have a 10 MW data center and spend 4 MW on cooling. If you’re power-capped at 10 MW and there’s no more pipe, how about we spend a megawatt and a half on cooling?” One way that data center owners and operators can accomplish this is by moving from air cooling to liquid cooling.
Multiple participants noted that liquid cooling, by its nature, simply uses less energy than traditional air cooling. It’s also more effective – capable of removing more heat from the data hall than air can. As JB explained, “We [initially adopted] water cooling for efficiency, but [now] we’re doing water cooling out of necessity… because I need to capture absolutely everything I can just to give myself more power to run.”
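A minimal sketch of the budget math behind JB’s hypothetical (his round numbers, plus the assumption that every megawatt not spent on cooling can be redirected to IT load within the same utility feed): trimming cooling from 4 MW to 1.5 MW inside a fixed 10 MW envelope frees up 2.5 MW for compute.

```python
# Budget math behind JB's hypothetical 10 MW site, assuming every MW saved
# on cooling can be redirected to IT load within the same utility feed.

site_cap_mw = 10.0          # fixed grid connection ("no more pipe")
air_cooling_mw = 4.0        # cooling overhead with traditional air cooling
liquid_cooling_mw = 1.5     # cooling overhead after the move to liquid cooling

it_load_air = site_cap_mw - air_cooling_mw        # 6.0 MW left for compute
it_load_liquid = site_cap_mw - liquid_cooling_mw  # 8.5 MW left for compute

print(f"IT load with air cooling:    {it_load_air:.1f} MW")
print(f"IT load with liquid cooling: {it_load_liquid:.1f} MW")
print(f"Compute capacity gained:     {it_load_liquid - it_load_air:.1f} MW "
      f"({(it_load_liquid / it_load_air - 1) * 100:.0f}% more)")
# Roughly 42% more compute inside the same 10 MW envelope.
```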
Another solution involved moving away from power-guzzling, monolithic data centers to smaller, disaggregated data centers that could split the compute load for modern AI applications. These smaller facilities could effectively divide the power requirements of an AI application across multiple power grids, reducing the strain that today’s “compute clusters” place on any one area’s power infrastructure.
“If you had asked [me] two years ago about 40 kW racks, I would have said that’s insane. And now, just as we’re getting designs down for the physical layer and the deployment models for liquid cooling for a 40 kW rack, there’s a new deployment model for 80 kW.” – Jim Julson
However, for this to be possible, data needs to be shared quickly across long distances, creating another infrastructure challenge for the data center industry.
Increasing our daily dose of fiber
Unfortunately, the long-haul fiber that would be necessary to connect geographically separated data centers is either unavailable or incapable of meeting the advanced requirements of modern AI applications. As Allen Meeks explained, “…most of the long-haul networks that were built – networks that cover more than 400 miles and several regions – were done in the early 90s or the early 2000s.”
And the advanced age of these long-haul fiber networks dramatically impacts their ability to carry large amounts of data over great distances. “…theoretically, I should be getting 800 Gig per carrier…but on a long-haul route, I max out at about 350-400 Gig,” Allen claimed. “The fiber is just old, and the [dB loss] per kilometer is too high for these systems to work. So, while the technology is getting better, the fiber is still old and getting older every day.”
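The physics behind Allen’s point can be sketched with a simple link-budget calculation (the attenuation figures and amplifier spacing here are illustrative assumptions, not values he quoted): higher loss per kilometer means more attenuation in every amplifier span, a worse optical signal-to-noise ratio at the receiver, and therefore less capacity per carrier.

```python
# Illustrative link-budget sketch. The attenuation values and amplifier
# spacing are assumptions for illustration, not figures quoted by Allen Meeks.

amplifier_span_km = 80         # typical spacing between in-line optical amplifiers

old_fiber_db_per_km = 0.25     # assumed loss of aging 1990s-era fiber
new_fiber_db_per_km = 0.18     # assumed loss of modern low-loss fiber

old_span_loss = amplifier_span_km * old_fiber_db_per_km   # 20.0 dB per span
new_span_loss = amplifier_span_km * new_fiber_db_per_km   # 14.4 dB per span

print(f"Aging fiber:  {old_span_loss:.1f} dB of loss per {amplifier_span_km} km span")
print(f"Modern fiber: {new_span_loss:.1f} dB of loss per {amplifier_span_km} km span")
print(f"Extra loss to overcome per span: {old_span_loss - new_span_loss:.1f} dB")
# Higher span loss degrades the optical signal-to-noise ratio, pushing
# transceivers down from 800G per carrier toward the 350-400G Allen described.
```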
While this seems like a relatively simple problem to fix – just run new long-haul fiber routes – it’s easier said than done. Few contractors remain in America that run networks over these long distances, and many of them face funding, zoning, and legislative challenges when trying to build out new long-haul fiber networks. As Allen explained:
“It’s very capital intensive. It takes a lot of time and labor. And, of course, we’re fiber optics, so we get lumped in with the tech businesses, so the financing offers that we get are three to five years. [We have to convince them that] we are a utility. They should think of us like a toll road and give us 100-year financing.”
Bumpy road ahead
While the Summit featured incredible conversations about AI use cases, the technologies driving AI, and the challenges facing AI adoption, no easy, large-scale solutions were offered that could fix these critical infrastructure shortcomings.
Power has been, and will remain, one of the largest challenges facing data center expansion and the construction of new data centers to power the AI revolution. The long-haul fiber networks needed to build data centers outside the traditional data center markets or disaggregate compute loads outside existing “compute clusters” simply aren’t up to the task.
These critical infrastructure concerns will require collaboration between industry and government leaders, as well as significant investment, to fix. However, should solutions be identified, there will be few roadblocks standing in the way of continued AI adoption and a new, brighter future powered by AI.