Episode XXII: AI Accelerators’ Performance vs Sustainability

Fatih Nar
Published in Open 5G HyperCore
Jun 7, 2024

Author: Fatih E. NAR

In recent years, AI accelerators have become the superheroes of artificial intelligence, saving the day in everything from sophisticated data centers to edge computing devices. With their parallel processing superpowers, they speed up the processing of large datasets and complex algorithms. But, like any hero, they have their kryptonite: performance degradation over time, especially under heavy usage.

Figure-1 Burned-End BBQ

Heat: The Arch-Nemesis

Heat is the villain in our AI accelerator saga. These components, when processing massive workloads, turn into tiny furnaces. If not properly managed, they face several deadly consequences:

>> Thermal Throttling: Many AI accelerators have built-in thermal throttling mechanisms to avoid a meltdown. When they get too hot, they slow down to cool off, like taking a breather during a workout, leading to a noticeable performance drop.

Figure-2 AI Accelerators in Texas Summer (Link)

>> Component Degradation: Persistent high temperatures can wear down semiconductor materials faster than you can say “silly-con”. Over time, this can slow processing speeds and increase the chance of failure.

>> Increased Power Consumption: High temperatures can cause leakage currents in transistors, leading to higher power consumption and even more heat. This creates a vicious cycle that is harder to escape than an on-demand TV show binge (Slow Horses is My Last Redemption 😎).
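The vicious cycle in the last bullet can be illustrated with a toy feedback model. This is only a sketch: the thermal resistance, ambient temperature, and the leakage curve (leakage doubling every `doubling_c` degrees) are assumed, illustrative values, not measurements of any real accelerator.

```python
def steady_state(dynamic_w, r_th_c_per_w=0.08, ambient_c=30.0,
                 leak_ref_w=40.0, leak_ref_c=60.0, doubling_c=25.0,
                 max_iters=200, tol=1e-6, runaway_c=150.0):
    """Iterate temperature -> leakage -> power -> temperature until it settles.

    All default parameters are hypothetical. Returns a tuple of
    (temperature_c, total_power_w, converged).
    """
    temp_c = ambient_c
    total_w = dynamic_w
    for _ in range(max_iters):
        # Assumed leakage model: doubles every `doubling_c` degrees Celsius.
        leak_w = leak_ref_w * 2 ** ((temp_c - leak_ref_c) / doubling_c)
        total_w = dynamic_w + leak_w
        # More power means more heat through the same cooling path.
        new_temp_c = ambient_c + r_th_c_per_w * total_w
        if new_temp_c > runaway_c:
            return new_temp_c, total_w, False  # the loop never settles
        if abs(new_temp_c - temp_c) < tol:
            return new_temp_c, total_w, True
        temp_c = new_temp_c
    return temp_c, total_w, False


for load_w in (400, 700, 1200):
    temp, power, ok = steady_state(load_w)
    status = "settles" if ok else "runs away"
    print(f"{load_w} W dynamic load: {status} near {temp:.0f} C, {power:.0f} W total")
```

With these assumed numbers, moderate loads settle at a stable temperature, while a heavy enough load never does: each pass adds more leakage, which adds more heat.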

Figure-3 Platform Observability Capabilities for GPU Observability (Link)

Side note: To get a picture of the utilization levels, power consumption, and temperatures of the accelerators, it is crucial to have a proper observability solution in place, not only to observe their state but also to feed those data points into an autonomous operational workflow for corrective actions.
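As a minimal sketch of that idea, assuming NVIDIA GPUs with the `nvidia-smi` CLI installed, the snippet below reads temperature, power draw, and utilization, then flags devices that cross a threshold. The 85 °C limit and the function names are hypothetical, and a real platform would more likely export these metrics to its monitoring stack than poll them like this.

```python
import subprocess

# Query fields supported by `nvidia-smi --query-gpu`.
QUERY_FIELDS = ["index", "temperature.gpu", "power.draw", "utilization.gpu"]


def read_gpu_stats():
    """Return raw CSV text from nvidia-smi (needs an NVIDIA driver on the host)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def _to_float(text):
    """Convert a CSV field to float; unsupported readings such as '[N/A]' become None."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_gpu_stats(csv_text):
    """Turn the CSV text into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        idx, temp, power, util = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(idx),
            "temp_c": _to_float(temp),
            "power_w": _to_float(power),
            "util_pct": _to_float(util),
        })
    return rows


def gpus_needing_action(rows, temp_limit_c=85.0):
    """Return indexes of GPUs at or above the (assumed) temperature limit."""
    return [r["index"] for r in rows
            if r["temp_c"] is not None and r["temp_c"] >= temp_limit_c]
```

On a GPU host, `gpus_needing_action(parse_gpu_stats(read_gpu_stats()))` would return the list of devices a corrective workflow could act on, for example by rescheduling jobs or lowering a power cap.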

Load-Driven Distortion: Stress on the Circuit Superheroes

AI accelerators handle immense computational loads, and this stress can lead to:

1. Microcracks: Constant thermal expansion, contraction, and heavy loads can cause tiny cracks in the silicon. These microcracks can mess up electrical pathways, causing performance degradation and potential failure.

2. Electromigration: High currents can cause electromigration, in which metal atoms in the interconnect decide to go on a road trip. This thins conductive pathways and increases resistance, eventually leading to circuit failure.

Higher currents in denser circuits accelerate electromigration, causing faster degradation of the chip’s interconnects.

3. Intra-Short Circuits: Stress and distortion can cause unintended conductive paths within the chip, leading to immediate failures or pesky intermittent issues that degrade performance over time.

4. Higher Power Density: Packing more transistors into a smaller space increases the heat dissipation challenge, impacting the accelerator’s lifespan. Smaller nodes like 4N and 4NP generate more heat per unit area, increasing thermal design power (TDP).

Side note: Thermal design power (TDP) is the maximum amount of heat a computer component, such as a processor or graphics card, is expected to generate under normal operating conditions, and therefore the amount its cooling system must be able to dissipate. TDP is also used to estimate the power the GPU will draw under typical high-load conditions. It’s not the absolute maximum power the GPU could ever draw but rather a guideline for system builders and consumers.

Figure-4 Setting TDP Cap for NVIDIA GPU
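On NVIDIA GPUs, a power cap is commonly applied with `nvidia-smi -pl <watts>`, which usually needs administrator rights. The sketch below only builds that command after clamping the request to the range the board allows. The helper name and the limits in the example are hypothetical; the real limits can be read with the `power.min_limit` and `power.max_limit` query fields.

```python
def build_power_cap_cmd(gpu_index, requested_w, min_limit_w, max_limit_w):
    """Build an `nvidia-smi` command that caps one GPU's power limit.

    The request is clamped to [min_limit_w, max_limit_w] so the command
    is not rejected for being out of range.
    """
    if min_limit_w > max_limit_w:
        raise ValueError("min_limit_w must not exceed max_limit_w")
    capped_w = max(min_limit_w, min(requested_w, max_limit_w))
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(int(capped_w))]


# Hypothetical board limits of 200 W to 700 W, capping GPU 0 at 300 W.
print(" ".join(build_power_cap_cmd(0, 300, 200, 700)))
```

Lowering the cap trades some peak throughput for less heat, which is exactly the lever the sections above argue for under sustained load.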

Increased density can create hot spots on the chip, leading to uneven thermal expansion and stress.

Figure-5 4N Evolution Brings Increase in Thermal Design Power (TDP)
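To put rough numbers on the electromigration item above, reliability engineers often use Black's equation, in which median time to failure scales as J^(-n) * exp(Ea / (k*T)) for current density J and absolute temperature T. The sketch below compares two operating points under that model. The activation energy (0.8 eV) and exponent (n = 2) are assumed, illustrative values; real ones depend on the interconnect material and process.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def relative_lifetime(temp_c, ref_temp_c, current_ratio=1.0,
                      activation_ev=0.8, exponent_n=2.0):
    """Lifetime at (temp_c, current_ratio) relative to the reference point.

    Based on Black's equation; `current_ratio` is the current density
    divided by the reference current density. Parameter defaults are
    assumptions for illustration only.
    """
    temp_k = temp_c + 273.15
    ref_temp_k = ref_temp_c + 273.15
    thermal = math.exp(
        activation_ev / BOLTZMANN_EV_PER_K * (1.0 / temp_k - 1.0 / ref_temp_k)
    )
    return current_ratio ** (-exponent_n) * thermal


print(f"90 C vs 70 C: {relative_lifetime(90, 70):.2f}x lifetime")
print(f"2x current density: {relative_lifetime(70, 70, current_ratio=2.0):.2f}x lifetime")
```

Under these assumed parameters, running 20 °C hotter cuts the modeled interconnect lifetime to under a quarter, and doubling the current density does about the same.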

Other Mischievous Factors

1. Manufacturing Variability: Like snowflakes, no two chips are exactly the same. Minor imperfections can become significant under heavy use, leading to performance variability and a shorter lifespan.

-> There is a reason “Founders Editions” are more popular than OEM ones.

2. Electrostatic Discharge (ESD): ESD events are like mini lightning strikes that can zap AI accelerators. Effective grounding and ESD protection measures are crucial to protecting these sensitive components.

3. Software Efficiency: Poorly optimized algorithms are the fast food of AI accelerators. They cause excessive strain on the hardware, speeding up wear and reducing lifespan. Endless epochs with unimaginable batch sizes and chunks of steps would torch your accelerators.

Longer training runs on bigger datasets do not necessarily make AI models better!
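One common way to avoid burning accelerator hours on epochs that no longer help is early stopping: end training once the validation loss stops improving. The article does not prescribe this technique; the sketch below is a generic, framework-free version, and the class name, patience value, and loss numbers are made up for illustration.

```python
class EarlyStopper:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            # Real improvement: remember it and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.59]

stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best validation loss {stopper.best}")
```

The trade-off is visible in the sample data: stopping saves the remaining epochs, but a late improvement (the final 0.59 here) is never seen, so patience has to be tuned per workload.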

Power Demand and Investment: The Bigger Picture

With the growing footprint of AI accelerators, the power they demand from the grid (to run and cool them) is climbing to unprecedented levels nationwide. As a result, global data center power demand is set to more than double by 2030.

This power demand growth will require utility investments of $50 billion in new power generation capacity, with a 60/40 split between gas and renewables, driving ~3.3 bcf/d incremental natural gas demand by 2030.

Figure-6 Datacenter Power Consumption Projections (Ref: Link)

Goldman Sachs forecasts a 15% CAGR in data center power demand from 2023–2030, driving data centers to make up 8% of total US power demand by 2030, up from about 3%.

This will necessitate approximately 47 GW of incremental power generation capacity, with about 60% gas and 40% renewable sources, requiring significant capital investment in US power generation capacity through 2030.
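As a quick sanity check on these figures, the arithmetic below compounds the quoted 15% CAGR over 2023 to 2030 and divides the quoted investment by the quoted capacity. It uses only numbers from the text above; the per-GW figure is a crude ratio, not a cost estimate from the source.

```python
def growth_multiple(cagr, years):
    """Total growth factor after compounding `cagr` for `years` years."""
    return (1 + cagr) ** years


# 15% CAGR in data center power demand from 2023 to 2030 (7 years).
multiple = growth_multiple(0.15, 2030 - 2023)
print(f"Demand multiple by 2030: {multiple:.2f}x")

# $50 billion of utility investment against ~47 GW of new capacity.
usd_per_gw = 50e9 / 47
print(f"Implied investment: ${usd_per_gw / 1e9:.2f} billion per GW")
```

Seven years at 15% compounds to roughly 2.66x, which is consistent with demand that "more than doubles" by 2030.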

Figure-7 Power Profiles & AI’s Role in Consumption (Ref: Link)

Final Thoughts: The Road -42- Ahead

Understanding the factors that impact the lifespan and performance of AI accelerators is crucial for businesses and our lives, which depend increasingly on AI enrichments. We may extend the life of these critical components by addressing heat issues, managing load-driven stress, and considering other factors.

Figure-8 Elon’s Prediction (Not all of them come true, as we all know)

Moreover, AI-driven data center growth highlights investment opportunities in utilities, renewable generation, and industrial sectors. Balancing the benefits of advanced node designs with their impact on lifespan will be crucial for the sustainable development of high-performance AI accelerators and the supporting power infrastructure.

So, while AI accelerators continue to revolutionize technology, making our lives easier and saving us time at work and in our daily lives, managing their performance and impact on the power grid will keep us all on our toes.

PS: Do not buy used GPUs from unknown sellers at low prices unless you want a hot summer with no power. 🔥
