The CUDA Advantage: How NVIDIA Came to Dominate AI And The Role of GPU Memory in Large-Scale Model Training

Aidan Pak
9 min read · Jun 28, 2024


With a commanding 85% share of the market and software-level gross margins, NVIDIA’s dominance in the AI chip industry is nothing short of remarkable. However, this success has not gone unnoticed. Last week, the Department of Justice and the Federal Trade Commission reached an agreement clearing the way for an antitrust investigation into NVIDIA’s dominant role. As the company continues its impressive run, it raises the question: what differentiates NVIDIA’s chips from the competition, and can the company maintain its competitive advantage in the future?

Background on Neural Networks and The Role of GPUs in Training

Neural networks are learned functions that model complex relationships between input and output data. They are trained by batch-feeding vast amounts of data through the network and iteratively tasking it with predicting the output. The prediction computed in each iteration is compared to the expected output to compute a loss, which measures the error in the model. This process of feeding data forward through the neural network to compute a loss is called the forward pass. The loss obtained from the forward pass is then propagated back through the network with the goal of minimizing the error by adjusting the model’s parameters. This is done through a process called the backward pass: gradients are computed with respect to each parameter, measuring how much that parameter contributes to the overall error, and an optimization algorithm (e.g., gradient descent) then updates each parameter to reduce the error.
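In a framework like PyTorch, one training iteration maps directly onto these two passes. The sketch below is a minimal, hypothetical example; the two-layer network, batch size, and learning rate are arbitrary illustrative choices, not details from any real training run:

```python
import torch
import torch.nn as nn

# A small, hypothetical network and a random batch of data (for illustration only).
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # gradient descent

inputs = torch.randn(64, 784)           # one batch of input data
targets = torch.randint(0, 10, (64,))   # the expected outputs

logits = model(inputs)                  # forward pass: compute predictions
loss = loss_fn(logits, targets)         # compare predictions to expected outputs

optimizer.zero_grad()
loss.backward()                         # backward pass: gradients w.r.t. each parameter
optimizer.step()                        # update parameters to reduce the error
```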


Training deep neural networks is a computationally demanding process due to the immense number of matrix operations involved. Calculating activations and optimizing the model’s parameters require computing dense matrix multiplications, which break down into a vast number of independent element-wise operations across the individual elements of the matrices. Moreover, the self-attention mechanism introduced in the transformer architecture significantly increased the number of matrix operations involved in training. For reference, training GPT-4, a 1.8-trillion-parameter transformer, required approximately 2.15 x 10²⁵ floating point operations (FLOPs). However, the beauty of matrix multiplication is that the element-wise operations within the larger computation can be performed concurrently, making matrix multiplication highly parallelizable. This means that training neural networks can be significantly accelerated by spreading these independent operations across many processors.
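To see why the computation parallelizes so well, note that each element of the output matrix depends only on one row of the first operand and one column of the second, so no output element depends on any other. The toy sketch below (plain Python/NumPy, purely illustrative) makes that independence explicit:

```python
import numpy as np

# Each element of C = A @ B depends only on one row of A and one column of B,
# so all m * n output elements can be computed independently of one another.
def matmul_naive(A, B):
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n))
    for i in range(m):          # these two loops are fully independent:
        for j in range(n):      # a GPU can assign each (i, j) to its own thread
            C[i, j] = np.dot(A[i, :], B[:, j])
    return C

A, B = np.random.rand(8, 4), np.random.rand(4, 6)
assert np.allclose(matmul_naive(A, B), A @ B)
```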


Unlike CPUs, GPUs are specialized chips equipped with thousands of cores, allowing them to execute many thousands of threads concurrently. Using networking technology like NVLink and InfiniBand, many GPUs can be “strung” together into clusters, increasing the number of available cores working on a given task. This allows clusters of GPUs to significantly accelerate the process of training deep neural networks. To train GPT-4, OpenAI deployed 25,000 NVIDIA A100 GPUs, reducing the training time to about 90–100 days. On a single A100 GPU, which has a peak FP32 throughput of 19.5 trillion floating point operations per second (19.5 TFLOPS), it would take approximately 35,000 years to train the entire model, even assuming perfect compute utilization (which is unattainable in practice).
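As a rough sanity check on that figure, using only the numbers quoted above and assuming perfect utilization:

```python
# Back-of-the-envelope check of the single-GPU training time quoted above.
total_flops = 2.15e25               # total training compute cited for GPT-4
a100_fp32_flops_per_sec = 19.5e12   # A100 peak FP32 throughput

seconds = total_flops / a100_fp32_flops_per_sec
years = seconds / (365 * 24 * 3600)
print(f"{years:,.0f} years")        # roughly 35,000 years at perfect utilization
```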

The Memory Wall

Thanks to the remarkable advancements in peak GPU computational throughput driven by Moore’s Law, raw matrix-multiplication throughput is no longer the primary bottleneck when training large-scale deep learning models. Instead, memory and networking constraints, introduced by the sheer size of today’s models and the need for frequent data transfers, leave GPUs relatively inefficient during training. Even with heavy optimizations, a Model FLOPs Utilization (MFU) of 35%, meaning only 35% of the hardware’s peak throughput goes toward useful model computation, is considered best-in-class for training trillion-parameter models.
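As a rough illustration of what 35% MFU means, the sketch below plugs in the GPT-4 figures quoted earlier, along with two assumptions not cited in this article: a 95-day run (the midpoint of the quoted 90–100 days) and NVIDIA’s published dense BF16 tensor-core peak of 312 TFLOPS for the A100:

```python
# Rough MFU estimate for the GPT-4 run described above.
total_flops = 2.15e25
num_gpus = 25_000
train_seconds = 95 * 24 * 3600   # assumed ~95 days of training
peak_flops_per_gpu = 312e12      # A100 dense BF16 tensor-core peak (assumption)

achieved_flops_per_gpu = total_flops / (train_seconds * num_gpus)  # ~1.05e14 FLOPs/s
mfu = achieved_flops_per_gpu / peak_flops_per_gpu
print(f"MFU ~ {mfu:.0%}")        # roughly 34%, in line with the ~35% best-in-class figure
```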

For large transformers, the model weights and optimizer states alone can occupy tens of terabytes of memory, far exceeding the capacity of on-chip memory such as SRAM and registers. As a result, model weights are distributed across GPUs during training and stored in a specialized form of off-chip memory known as High Bandwidth Memory (HBM). While HBM offers significantly higher bandwidth than conventional off-chip DRAM, transfers between HBM and the processing cores often leave GPU cores idle, waiting for data to move in and out of memory. GPUs are therefore constrained to low compute utilization, and much of the remaining performance headroom comes from minimizing data movement and maximizing data reuse during key operations.
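A back-of-the-envelope estimate shows where the “tens of terabytes” comes from. The byte counts below assume a common mixed-precision Adam setup (FP16/BF16 weights and gradients plus FP32 master weights and optimizer moments); the exact breakdown varies by framework and optimizer:

```python
# Rough memory footprint for a 1.8-trillion-parameter model under a common
# mixed-precision Adam setup (assumed breakdown, for illustration only).
params = 1.8e12

bytes_per_param = (
    2      # FP16/BF16 weights
    + 2    # FP16/BF16 gradients
    + 4    # FP32 master copy of the weights
    + 4    # Adam first moment (FP32)
    + 4    # Adam second moment (FP32)
)

total_terabytes = params * bytes_per_param / 1e12
print(f"~{total_terabytes:.0f} TB")  # ~29 TB, far beyond any single GPU's memory
```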

The NVIDIA Advantage

NVIDIA’s GPU monopoly is rooted in its parallel computing platform, CUDA, which allows NVIDIA GPUs to achieve higher compute utilization rates than competing architectures. CUDA is a robust programming interface that includes a compiler, driver, runtime environment, and a full toolkit, enabling developers to program GPUs to accelerate applications. With CUDA, developers can define functions for parallel execution and manage the memory hierarchy to optimize application performance.
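CUDA kernels are normally written in C/C++, but the same programming model, defining a function that thousands of threads execute in parallel and explicitly moving data between host and device memory, is also exposed to Python through libraries such as Numba. The sketch below uses Numba purely as an illustration of that model; it is not mentioned in the article:

```python
from numba import cuda
import numpy as np

# The kernel: a function that every GPU thread runs on a different element.
@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)               # this thread's global index
    if i < out.size:               # guard against out-of-range threads
        out[i] = a[i] + b[i]

n = 1_000_000
a = np.arange(n, dtype=np.float32)
b = 2 * np.arange(n, dtype=np.float32)

# Explicit memory management: copy inputs to device memory, allocate the output there.
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_out = cuda.device_array_like(a)

# Launch enough thread blocks to cover all n elements.
threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](d_a, d_b, d_out)

result = d_out.copy_to_host()      # copy the result back to host memory
```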


The Evolution of CUDA: How the Platform Became the Standard for Accelerated Computing

In the decade following the release of CUDA in 2006, NVIDIA worked hard to meet the growing demands for accelerated computing in various fields outside of graphics. This led NVIDIA to develop specialized libraries for these applications, including cuDNN for deep learning and cuBLAS for basic linear algebra. These libraries, which are still prominent today, are collections of operator kernels (i.e., functions) for common operations, such as matrix multiplication and convolutions. Writing these low-level functions can be extremely difficult because it requires the programmer to have a deep understanding of the underlying chip architecture and memory hierarchy to ensure that the kernel is well-optimized for high utilization while maintaining low latency.

In 2015 and 2016, excitement around the application of neural networks led to the emergence of the deep learning frameworks TensorFlow and PyTorch for defining, training, and deploying deep learning models. These frameworks introduced higher-level abstractions over operator kernels, allowing developers to focus on model architecture and training algorithms without needing to program the underlying operations to run efficiently in parallel on hardware.

Recognizing both the adoption within the AI community and the superior performance of NVIDIA GPUs for accelerating parallelizable workloads, PyTorch and TensorFlow provided extensive support for CUDA as the preferred backend for deploying deep neural networks. This meant that developers, in collaboration with NVIDIA, wrote optimized CUDA kernels so that each common deep learning operator could be mapped to a collection of highly optimized low-level functions that execute exclusively on NVIDIA GPUs.
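In practice, this dispatch is invisible to the developer. A couple of ordinary PyTorch calls on CUDA tensors, as sketched below, are routed to cuBLAS and cuDNN kernels under the hood (assuming an NVIDIA GPU is available):

```python
import torch
import torch.nn.functional as F

# On an NVIDIA GPU, these ordinary framework calls are dispatched to vendor kernels:
# matmul is routed to a cuBLAS GEMM, conv2d to a cuDNN convolution. The user never
# writes or sees any kernel code.
device = "cuda" if torch.cuda.is_available() else "cpu"

a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
c = a @ b                                  # cuBLAS-backed matrix multiplication

x = torch.randn(32, 3, 224, 224, device=device)
w = torch.randn(16, 3, 3, 3, device=device)
y = F.conv2d(x, w, padding=1)              # cuDNN-backed convolution
```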

As more model architectures and optimization techniques were introduced, these frameworks steadily implemented more operators, which translated into more refined CUDA kernels. This marked the start of a strong network effect: as more developers and organizations invested time and resources in the laborious work of writing CUDA kernels, the platform’s capabilities grew rapidly. By the time deep learning was gaining widespread adoption in the latter half of the 2010s, CUDA had firmly established itself as the standard for GPU acceleration, with each common operator mapped to a highly optimized set of kernels that could be compiled and executed on NVIDIA hardware. With the majority of the deep learning ecosystem explicitly optimized for CUDA, the proprietary nature of the software layer created a powerful form of vendor lock-in to NVIDIA GPUs.

Although deep learning frameworks offered support for competing programming platforms, such as AMD’s ROCm, these platforms failed to gain significant adoption, largely because of a chicken-and-egg problem: underperformance gave frameworks little incentive to write more kernels to support them, which in turn hampered further development and optimization efforts. While open platforms like OpenCL promised cross-platform portability across various hardware, they too faced this dilemma and consistently underperformed CUDA in real-world applications.

Beating NVIDIA?

Today, the strength of NVIDIA’s moat lies in the difficulty of writing kernels and the proprietary nature of CUDA, which together allow NVIDIA hardware to extract better performance than competing architectures. Even if competing chips offer comparable peak FLOPS, optimized CUDA kernels for deep learning enable NVIDIA GPUs to achieve higher FLOPS utilization rates and therefore greater overall performance. To beat NVIDIA, competitors must not only engineer superior hardware but also pair it with a software ecosystem that can match CUDA’s performance.

Beating NVIDIA is analogous to trying to build a video streaming platform to compete with YouTube. Over nearly two decades, YouTube’s value proposition has been strengthened as users posted enormous amounts of video content. To build a competitive service, a streaming platform must not only create a better user experience but also build out a comparable and ever-growing content library. For a would-be NVIDIA competitor, the task is even more daunting: it must not only build a better chip but also build out a similarly exhaustive and widely adopted software platform. Supporting a software ecosystem of this scale and flexibility is impossible for a single development team; it requires community adoption and consistent effort to ensure the software remains well-optimized for the wide-ranging and constantly evolving set of deep learning architectures.

Competition on the Horizon

In the current stage of the AI infrastructure build-out, GPUs have largely been a CAPEX line item for big tech companies, which are looking to massively scale their compute capabilities to clusters of over 100,000 GPUs. In 2024 alone, Amazon, Google, Microsoft, and Meta are expected to spend a combined $210 billion on CAPEX, with a large portion earmarked for NVIDIA GPUs. Meta plans to purchase 350,000 H100s, while Microsoft plans to amass some 1.8 million GPUs by the end of 2024 to train the next frontier model.


Despite the enormous projected spend on NVIDIA infrastructure, Big Tech has been actively exploring ways to reduce its dependency on NVIDIA as the sole supplier of AI hardware. One strategy the cloud hyperscalers have been exploring is designing their own custom ASICs (Application-Specific Integrated Circuits) for inference. These fixed-function chips are much cheaper than GPUs and offer additional performance by being optimized for particular AI tasks. ASICs such as AWS’s Inferentia chips have posted impressive benchmarks when accelerating AI inference, which consists primarily of one type of operation: matrix multiplication. However, ASICs have largely failed to gain broader adoption because they lack programmability: training and reinforcement learning require performant computation of a more diverse and evolving set of operations, which fixed-function hardware struggles to support.

As detailed in Semi-Analysis’s post, the most promising approach to overcoming the CUDA monopoly is innovation at the compiler level to democratize performance across hardware vendors. At the forefront of these efforts is PyTorch 2.0, which has integrated compiler technologies that translate high-level code into formats compatible with a wide range of AI accelerators. Specifically, PyTorch 2.0’s compiler stack can lower models into OpenAI Triton, a Python-based kernel language and compiler whose intermediate representation (IR) can be optimized for various GPUs. Triton bypasses the CUDA toolchain and makes GPU optimizations widely accessible by allowing developers to write GPU kernels in a Python-like syntax. While still in its early stages and lacking widespread adoption, OpenAI Triton and other open compiler technologies have the potential to democratize AI compute by making it possible for hardware vendors to achieve CUDA-level optimizations for popular deep learning frameworks.
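For a sense of what that Python-like syntax looks like, the sketch below adapts the canonical vector-addition example from Triton’s tutorials; the block size and tensor shapes are arbitrary illustrative choices:

```python
import torch
import triton
import triton.language as tl

# A Triton kernel is ordinary Python decorated with @triton.jit; each program
# instance processes one block of elements.
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                # guard the final partial block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)             # one program instance per 1024 elements
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")
assert torch.allclose(add(x, y), x + y)
```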


Summary

Over the years, the deep learning ecosystem has become almost exclusively optimized for CUDA, allowing NVIDIA hardware to differentiate itself from the competition through industry-leading optimizations. This competitive advantage has prevented new hardware vendors from effectively penetrating the AI ecosystem and has created a reinforcing network effect. However, NVIDIA’s dominant position and remarkable pricing power have spurred efforts within the AI community to disrupt NVIDIA and democratize AI compute.

The biggest threat to NVIDIA’s position going forward is open technology capable of translating high-level framework code into portable representations that run across a diverse set of hardware. While matching NVIDIA’s performance will take years, tools such as AI compilers, graph optimizers, and model converters that can deliver strong performance across hardware have the potential to challenge NVIDIA’s pole position. The questions that remain, though, are when these technologies will become competitive and how much of the AI infrastructure build-out will have already occurred by then.
