On AI, the GPU Shortage, and the Potential of Decentralised Inference and Training Networks

Marthe Naudts · Published in Venture Beyond · 18 min read · Feb 1, 2024

Training large language models requires powerful GPU microchips, particularly the A100s and H100s produced by Nvidia. Due to various supply constraints, these are nearly all reserved for large cloud providers and incumbent tech giants, leaving an underserved long tail of AI start-ups, developers, and researchers. At White Star Capital, particularly since releasing our industry report on Data-driven Transformation, we have received an influx of pitches from start-ups proposing to create marketplaces of distributed GPUs sourced from data centres with idle capacity. Most of these compete by 1) securing the most supply and/or 2) undercutting cloud provider pricing, sometimes through token incentives. This commoditisation makes it difficult to foresee a billion-dollar business winning in this space. In this piece, Part I explores this problem set, and Part II posits two better ways in which companies can compete by creating a new value layer: firstly, by more efficiently clustering weaker hardware to emulate the cutting-edge chips in short supply, and, secondly, by creating trust between the disparate and anonymous entities in the marketplace through confidential and verifiable compute mechanisms, and the ability to communicate them.

Part 1: The State of Play

Software may have eaten the world, but microchips digested it.

Otherwise known as semiconductors, chips are the grids of millions or even billions of transistors that process the 1s and 0s comprising that very software. Every iPhone, TV, email, photo, YouTube video, device and asset in our online world is powered by these tiny switches.

Fabricating and miniaturising semiconductors has been the single greatest engineering challenge of modern society. Both Moore’s Law and Rock’s Law, alongside ongoing debates about their limits, continue to drive the industry in its quest for ever more processing power.

Nvidia’s H100s and A100s are the industry’s latest and most powerful parallel-processing GPUs, and are the main chips powering the transformer models behind the recent AI boom. Faced with skyrocketing demand and supply bottlenecks, even hyperscalers like Amazon Web Services (AWS) and Google Cloud Platform (GCP) cannot keep up.

This leaves a multi-billion dollar opportunity for SaaS and marketplace businesses that connect disparate idle compute capacity, like CoreWeave and Lambda, or for more creative decentralised solutions such as Gensyn, Together.ai, and Akash.

Given the influx of start-ups trying to disrupt this space, this two-part piece is intended to explore the following question:

Can the challenges of securing, coordinating, and verifying disparate compute supply make this a winner-takes-all market, or is this one where many start-ups will emerge to form a redundant middle layer, competing to secure a small piece of the fundamentally commoditised and highly divisible GPU pie?

CPUs and the Von Neumann Bottleneck

For decades, as our appetite for computation exceeded what Moore’s Law could deliver, engineers considered their biggest challenge to be fabricating ever-smaller transistors to increase central processing unit (CPU) speeds. Today, cutting-edge chips have up to 5.3 trillion MOS transistors. Sixty years ago, that number was just four.

Figure 1: Semi-log plot of transistor counts for microprocessors against dates of introduction

But in recent years, even as the time it takes for information to travel between the CPU and memory chips has fallen, engineers find themselves facing a new bottleneck: the very architecture of the computing system.

Most general-purpose computers are based on Von Neumann design architecture, which, at its most simplistic, has the following components:

  • A processing unit (with an arithmetic logic unit and processor registers)
  • A control unit (with an instruction register and a program counter)
  • Memory (that stores data and instructions)
Figure 2: Simplified CPU Architecture
Figure 3: CPU Serial Operations

Computers with Von Neumann architectures are unable to do an instruction fetch and a data operation at the same time because they share a common ‘bus’. Doing an operation like 1+2 would first involve retrieving 1 from memory, then 2, and then performing the addition function on them, before storing the result in memory again. This is because the single bus can only access one of the two classes of memory at a time, so the CPU is continually forced to wait for needed data to move to or from memory. The throughput (data transfer rate) between the CPU and memory is therefore limited compared to the amount of memory.
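
To make the shared-bus cost concrete, here is a toy sketch in Python of the 1+2 example above. It is purely illustrative (real CPUs are not implemented this way): the point is simply that every instruction fetch, operand load, and result store is a separate, sequential trip over the same bus.

```python
# Toy model of the shared-bus behaviour described above: instructions and
# data share one bus, so a simple "1 + 2" costs several sequential memory trips.
memory = {"a": 1, "b": 2, "result": None}
program = [("LOAD", "a"), ("LOAD", "b"), ("ADD", None), ("STORE", "result")]

stack = []
bus_trips = 0
for op, addr in program:
    bus_trips += 1                      # fetching the instruction itself uses the bus
    if op == "LOAD":
        stack.append(memory[addr])      # ...and so does fetching the operand
        bus_trips += 1
    elif op == "ADD":
        stack.append(stack.pop() + stack.pop())   # pure compute, no memory trip
    elif op == "STORE":
        memory[addr] = stack.pop()      # writing the result back uses the bus again
        bus_trips += 1

print(memory["result"], bus_trips)      # 3, after 7 sequential bus trips
```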

CPU processing speed has increased much faster than the throughput between the CPU and memory, so this structural bottleneck has now become the primary problem. Moreover, as speeds and memory sizes continue to grow, the severity of the bottleneck will only increase with each new generation of CPU.

This is particularly problematic for Artificial Intelligence (AI), which requires training models with billions and even trillions of parameters. This could never feasibly be processed through our current CPU architecture.

GPUs and the Promise of Parallel Processing

GPUs circumvent the Von Neumann bottleneck by processing calculations in parallel. Each GPU consists of a large number of smaller processing units known as cores, each of which can execute its instructions independently of the others. Nvidia originally designed these for graphical workloads like 3D graphics and games: unlike Intel’s microprocessors or other general-purpose CPUs, they could render realistic images much more quickly by determining the shade for each pixel in parallel.
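
As a rough illustration of that per-pixel parallelism, the sketch below contrasts a serial, CPU-style loop with a vectorised, element-wise version of the same shading step. NumPy running on a CPU stands in here for what a GPU does across thousands of cores, and the image size and gamma value are arbitrary.

```python
import numpy as np

# A small stand-in "image" of raw pixel intensities
height, width = 256, 256
image = np.random.rand(height, width)

# Serial, CPU-style: shade one pixel per iteration
shaded_serial = np.empty_like(image)
for y in range(height):
    for x in range(width):
        shaded_serial[y, x] = image[y, x] ** 2.2   # a simple gamma curve

# Data-parallel, GPU-style: the same gamma curve applied to every pixel at once,
# which a GPU would spread across thousands of cores
shaded_parallel = image ** 2.2

assert np.allclose(shaded_serial, shaded_parallel)
```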

Today, whilst Intel and others have developed their own GPUs, Nvidia’s are particularly desirable due to their tensor cores, which are designed for numerical rather than graphical workloads.

In 2012, the University of Toronto researchers behind AlexNet, a convolutional neural network, realised they could use the parallel processing abilities of GPUs to train their machine learning model to label datasets of images. They entered and won the ImageNet Large Scale Visual Recognition Challenge, beating the runner-up’s error rate by more than ten percentage points and marking a turning point for the deep learning industry.

The watershed, however, was a 2017 paper titled ‘Attention is All You Need’, written by researchers at Google who wanted to improve the recurrent and convolutional neural network architectures behind text translation. Traditional machine learning (ML) models are trained on large sets of labelled data and then make predictions (inference) based on their understanding of that data, which works for applications where the data can be clearly defined and labelled (such as the image classification challenge that AlexNet was tackling). But translating from one language into another, where the word order doesn’t exactly match, requires a contextual understanding of the entire input.

Google’s researchers proposed a transformer model, which turns words and characters into tokens and then weights the relationships between them. Using an evolving set of mathematical techniques called attention, transformer models can attend to different areas of the input text to detect subtle meanings and relationships between even distant elements in a sequence, like the words in this sentence.
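
For the technically curious, the core attention step looks roughly like the minimal scaled dot-product attention sketch below, written in PyTorch. It omits the multi-head projections and masking of a full transformer, and the tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Minimal scaled dot-product attention: every token attends to every
    other token in the sequence, however far apart they are."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # pairwise token affinities
    weights = F.softmax(scores, dim=-1)             # attention weights sum to 1
    return weights @ v                              # weighted mix of the values

# Illustrative shapes: batch of 1, sequence of 8 tokens, embedding size 64
q = k = v = torch.randn(1, 8, 64)
out = scaled_dot_product_attention(q, k, v)         # shape: (1, 8, 64)
```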

Now, models were able to understand written text in a significantly more sophisticated way, and could learn new tasks without large labelled datasets. Today’s large language models, which need to learn the patterns and structures of an entire language to generate output, would not be possible using recurrent neural networks, which process tokens sequentially and would take an unfeasible amount of time and computational resources. The parallel processing in GPUs, conversely, makes these attention computations possible at scale.

And, as it turns out, without any changes to their structure, these transformer-based models scale very well with more data. Older models struggled to generate anything useful with fewer than 10m parameters; GPT-1 had roughly 117m parameters, which pales in comparison to GPT-2’s 1.5bn, GPT-3’s 175bn, and GPT-4’s rumoured 1.7tn parameters.

So, what’s powering this revolution?

Nvidia’s GPUs.

Whilst there are other GPUs and AI-focused chips designed by the likes of AMD, Nvidia faces little competition in the market for the high-performance GPUs used to train machine learning models and neural networks. Its latest and greatest GPU is the H100, which currently sells for c.£20–25k (and often significantly more on secondary markets). H100s weigh 70 pounds, are comprised of 35,000 parts, and require AI to design and robots to assemble. Generally speaking, companies buy them in boxes of eight (8-GPU HGX H100s with SXM cards), which cost approximately £200–250k. They are up to 9x faster for AI training than the next-fastest chip, Nvidia’s own A100. It is hard to overstate how much more advanced they are than any other chip on the market for LLM training and inference.

Figure 4: An H100 weighs 70 lbs

So, we’ve landed in an exciting but precarious position, in which tech giants and start-ups alike are scrambling for a very limited set of cutting-edge GPUs and the associated CUDA software, designed solely by Nvidia and produced solely by its manufacturing partner, Taiwan Semiconductor Manufacturing Co (TSMC).

Figure 5: TSMC’s Kumamoto fab in Japan

TSMC’s fabrication facilities (fabs) are the only ones in the world currently capable of manufacturing these chips, with most fab capacity reserved over 12 months in advance. Chips are designed and developed with a particular production process in mind, and it takes time to optimise that process to maximise the yield, so a chip is designed in parallel with the contract for its production at a fab. For now, most of those contracts are with TSMC, forming a natural bottleneck to scaling supply to meet booming demand, since doing so requires constructing additional fabs. This is no small feat: TSMC’s planned fab in Arizona, for example, encompasses a 1,000-acre site, requires 21,000 construction workers, and carries a price tag of over £30bn.

Nvidia as Kingmakers, Hyperscalers as Kings

Whilst the supply bottlenecks mean there are genuine shortages, the race to integrate AI into applications also means speed is of the essence. Cutting-edge chips are orders of magnitude faster than their predecessors and could make the difference between life and death for companies facing AI-powered competitors. Combined, these two dynamics have led to the perception that access to chip supply is a moat.

This places Nvidia in the position of kingmaker, often allocating chips based on its own preferences (e.g. who a provider’s end customers are, or whether the provider poses a potential competitive threat).

Nvidia’s supply predominantly goes to the biggest hyperscalers (Azure, GCP, AWS, Oracle), who offer cloud computing to the biggest or most advanced AI-powered companies (whose growth is in turn key to Nvidia’s long-term success). For example, Azure supplied OpenAI with 1,000 Nvidia V100 GPUs for model training, and AWS sourced and provided 4,000 Nvidia A100s to Stability AI. It is hard to gauge exact numbers, but it is estimated that Google Cloud has secured approximately 25k H100s, whilst Azure is likely to have between 10–40k.

In the meantime, long-tail players like AI start-ups and researchers are left out. Marketplaces like Lambda, Vast.ai, Gensyn, and Bittensor therefore argue that this supply bottleneck is a multi-billion dollar opportunity, one they can address by sourcing and coordinating idle GPU compute capacity from crypto mining facilities, independent data centres, and consumer GPUs.

These companies pitching to create marketplaces of idle cutting-edge chips must compete with the hyperscalers for supply by positioning themselves to Nvidia. There are a few angles from which they could do so:

  • Competitive safety: Many hyperscalers are developing their own chips and software. AWS, for example, is creating its own Titan models and Bedrock software running atop its homegrown Inferentia and Trainium chips. Meanwhile, Nvidia is also increasingly refocusing on its own data centre business, which generated £11.4bn in revenue in Q3 2023, a 41% sequential gain and a 279% year-on-year lift. As memory constraints shift the focus from the chip layer to clustering at the server layer, Nvidia has started describing the data centre as a whole as the new ‘supercomputer.’ Thus as the two partners become increasingly competitive, they may both turn to decentralised marketplaces as tech-enabled OEMs. These start-ups therefore need…
  • Novel distribution: Access to new customers that would not work with, or want to diversify their exposure to, hyperscalers. The physical locations of data centres are pertinent for decentralised offerings here — governmental bodies could prefer domestic hosts, notably for sensitive data; international bodies meanwhile might actively pursue maximal distribution across countries for geo-political neutrality. On the private end of the spectrum, high-frequency traders or emergency services may be optimising for latency and therefore want to access servers that are closest to their operators.

In reality, though, most end up competing on:

  • Price: often through tokenised payments. This normally takes the form of issuing tokens and requiring participants to buy and stake them to join the network, so the token price should fluctuate with demand, as in a true free market. This lets them avoid the cold-start problem of marketplaces: the tokens incentivise supply to join whilst passing the cost onto the token investor, and simultaneously offer better pricing to consumers, who pay less in tokens than they would on fiat-based networks.
  • Unfortunately, historical decentralised physical infrastructure networks (decentralised storage, rendering, or RPC networks) with payments denominated in cryptocurrency have nearly always collapsed: once retail and institutional token holders realise they are financing the discount, they sell, removing upward pressure on the price; suppliers then realise their token holdings will never cover their fiat-denominated hardware costs, and rationally leave the network.
  • We should therefore always seek out hybrid models in which the marketplace operates with standard fiat payments, and tokens are used only as an additional reward scheme for early suppliers, functioning as dividend payments.

Therefore, due to the commoditised and highly divisible nature of the GPU market, it is hard to see a billion-dollar player emerging here unless they have a novel and exclusive relationship with Nvidia or customers that Nvidia finds desirable; otherwise, they are left to compete only on pricing the supply they secure.

This may work in a blockchain-based network context if and only if the payment mechanism is strictly bifurcated into a fiat-based fixed payment and a cryptocurrency reward scheme that functions purely as an investment asset, receiving dividends as the marketplace becomes more successful. But even if this is the case, we should be wary of the implications, given the low margins of marketplace start-up models.

Instead, we will be on the lookout for technological innovation on the supply coordination side. In Part II, I will explain two such differentiation vectors: clustering distributed GPUs, and ensuring confidential and verifiable compute.

Part 2: Opportunities in Differentiation

So we finally come to the crux of what I want to explore in this two-part piece.

Can the challenges of securing, coordinating, and verifying disparate compute supply make this a winner-takes-all market, or is this one where many start-ups will emerge to form a redundant middle layer, competing to secure a small piece of the fundamentally commoditised and highly divisible GPU pie?

1. Clustering GPUs Across the Straggling Data Centres

Managed hosting data centres are large facilities that rent out rack space and bandwidth whilst taking on the operational and financial burden of server hosting, such as cooling infrastructure and the associated energy costs. Whilst cloud providers are designed to scale, the legacy data centres that house the servers face fluctuating demand and are therefore often sitting on underutilised capacity.

No one has suffered more than crypto mining data centres, particularly those dedicated to mining Ethereum. The September 2022 Merge, Ethereum’s transition away from proof-of-work consensus, left mining facilities redundant. Fortunately, unlike Bitcoin miners, which typically use ASICs, Ethereum miners use general-purpose GPUs, including those from Nvidia, which have a more liquid secondary market. Hive, for example, is an Ethereum miner which suffered huge losses in the quarters immediately after the Merge, and has since redirected its GPUs to support high-performance compute workloads through its HIVE Performance Cloud. Back-of-the-envelope maths suggests that at the time of the Merge, with the total Ethereum hash rate at 1.03 PH/s and the average Nvidia GeForce RTX 3090 Ti hash rate at 108.75 MH/s, dividing the two leaves approximately 9.5m GPU units becoming available.
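
The arithmetic behind that estimate, using the hash-rate figures quoted above (the true figure depends heavily on which cards miners actually ran):

```python
# Back-of-the-envelope check of the GPU-count estimate above.
# Assumed inputs: total Ethereum hash rate at the Merge and the hash rate
# of a single Nvidia GeForce RTX 3090 Ti, as quoted in the text.
total_hash_rate_mhs = 1.03e15 / 1e6      # 1.03 PH/s expressed in MH/s
gpu_hash_rate_mhs = 108.75               # one RTX 3090 Ti, in MH/s

gpu_equivalents = total_hash_rate_mhs / gpu_hash_rate_mhs
print(f"{gpu_equivalents:,.0f}")         # ~9,471,000, i.e. roughly 9.5m GPUs
```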

Ethereum Hash Rate Post-Merge

Marketplaces that propose to legacy data centres with idle GPUs that they can help reorchestrate servers and take over distribution to meet exploding AI demand will be met with open arms. These centres may not have secured the latest GPUs, but there are plenty of customers who do not need the latest and greatest. For an early-stage start-up, an academic researcher, or a public institution, speed is less important. In short, an old GPU will do the same job as an H100; it will just take much longer.

Clustering lower-performance GPUs to emulate the ability of a cutting-edge chip may be the best pitch to a) customers if they can secure cheaper access, b) suppliers if they can sell idle capacity which is a cost drain, and c) investors because it provides a much-needed technology layer that ensures product stickiness.

Distributed Computing Through Parallelising and Clustering Workloads

Due to memory (VRAM) constraints, even compute-intensive workloads like deep learning and hyperparameter tuning run on H100s will require distributing execution across multiple GPUs.
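
A rough worked example of why, assuming a 175bn-parameter model stored in 16-bit precision and 80 GB of memory per H100 (gradients and optimiser state during training add far more on top):

```python
# Rough illustration of why large models must be sharded across GPUs.
# Assumptions: 175bn parameters, 2 bytes per parameter (fp16/bf16) for the
# weights alone, and 80 GB of memory per H100.
params = 175e9
bytes_per_param = 2
h100_memory_gb = 80

weights_gb = params * bytes_per_param / 1e9
print(f"Weights alone: {weights_gb:.0f} GB")                                   # ~350 GB
print(f"Minimum H100s just to hold them: {weights_gb / h100_memory_gb:.1f}")   # ~4.4
```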

Suffice it to say that clustering chips and distributing workloads is an engineering challenge. A number of solutions have emerged for inter-server distribution, notably parallelising workloads by sharding model parameters across GPUs.

Some examples of the software and hardware needed for clustering include:

  • An interconnect solution, such as Ethernet or InfiniBand, to shuttle data between the nodes
  • A distributed training framework, such as PyTorch or TensorFlow. PyTorch is an open-source ML framework based on the Python programming language and the Torch library. It implements DistributedDataParallel (DDP), an algorithm that enables data-parallel training: every GPU across every machine gets a copy of the model and a subset of the data, each replica runs a forward and backward pass, and the gradients are then synced across all the GPUs, after which the optimiser on each GPU applies an identical weight update (a minimal sketch follows this list).
  • A clustering API for cluster management, such as the Message Passing Interface (MPI) or Ray’s APIs. Developed by Anyscale, Ray’s APIs parallelise any Python code and handle all aspects of distributed execution, including orchestration, scheduling, and auto-scaling
  • Processes to manage the influx of data to ensure the GPUs are continuously utilised, thus enhancing their efficiency. Read our Data-driven Transformation Report for more details on the future of data-mesh architecture and data lakes.
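
To make the DDP item above concrete, here is a minimal sketch of a data-parallel training loop, with a placeholder model and synthetic data standing in for a real workload:

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process it launches
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Every process gets its own replica of the model (a placeholder layer here)
    model = DDP(nn.Linear(1024, 1024).to(local_rank), device_ids=[local_rank])
    optimiser = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(10):
        # In practice a DistributedSampler would hand each rank its own data shard
        batch = torch.randn(32, 1024, device=local_rank)
        loss = model(batch).square().mean()   # forward pass on this rank's shard
        optimiser.zero_grad()
        loss.backward()                       # DDP all-reduces (syncs) the gradients here
        optimiser.step()                      # identical weight update on every rank

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

A launcher such as `torchrun --nproc_per_node=8 train.py` would start one such process per GPU, with DDP handling the gradient synchronisation behind the scenes.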

Start-ups with relationships with data centres and crypto miners could therefore focus on making this parallelisation as easy as possible, by developing or aggregating software on the front-end and verifying the correct hardware on the supplier side. If they can own this developer relationship, they can then expand into the entire DevOps tooling stack for distributed AI computing, which would be a very attractive moat. Our Data-driven Transformation report features in-depth analysis of the innovation needed in data infrastructure, management, and tooling to handle distributed AI workloads and datasets.

2. Building the Trust Layer

Businesses are only successful if they manage to attract users and retain them, while growing their lifetime value. And, for peer-to-peer and distributed marketplaces, winners like Uber, Vinted, and Deliveroo often succeeded by adding an extra value layer to their interface which increases trust between the two sides of the exchange. Trust that designer items are verified, that drivers are safe, that delivery times are accurate, and that payment is immediate, disputable, and refundable.

In the case of outsourcing compute power, there are three unique trust requirements that marketplace engineering, financial, and sales teams will need to address.

Verifiable compute

Outsourcing computational tasks from relatively weak devices to powerful computation services is very common; the entire internet runs on so-called cloud computing. That work is easy to verify either because ‘the proof is in the pudding’, so to speak, or because state-independent work can be split up and reproduced. But with training large language models, reproducing the work would defeat the entire purpose of saving computational effort. ML problems are state-dependent, with each layer in a deep learning model using the output of the previous layer, so validation by replication would mean doing all the work again. Moreover, it relies on trusting that the verifying party is itself honest, and verifying that would instigate an infinite chain of replication.

Blockchains can be used to create staking and slashing incentive games that force all participants to be honest out of rational self-interest. Tromero, for example, is creating ‘proof-of-work’ systems resembling Bitcoin mining and consensus models. Others like Gensyn have explored new methods of proof (such as probabilistic proof-of-learning and graph-based pinpoint protocols), which are then confirmed on-chain. Either way, because companies like Vast.ai, Lambda, and Fluidstack have managed to exist without this verification piece, these teams will win through the ability of their sales teams to communicate the viability and significance of these complicated mathematical proof solutions to their customers.

Confidential compute

Most GPU marketplace start-ups target ML researchers at academic institutions and AI start-ups as initial customers. But when presented with decentralised networks, almost all of the ML researchers I have spoken with expressed concerns about trusting unknown providers with their datasets and the containers of code used to remotely run their models. Models built for educational purposes are typically trained on publicly available or insensitive information, such as online content. But most AI start-ups are competing on novel datasets, which often means unique access to either private enterprise data or consumer data, including sensitive PII like health or financial records. Moreover, outsourcing training also involves sending a container of code outlining the rules and weights by which to train on that dataset. This is, fundamentally, intellectual property. If the customer is a start-up, this IP is all they have.

Marketplaces can therefore carve out a competitive edge by building out software that ensures data has not been tampered with, whether through encryption and proof methods for data at rest or in transit, or remote hardware monitoring for data in use. They can also be selective with their supply, only sourcing from reputable data centres verified through extensive KYC and AML checks on hardware providers.
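
As a minimal sketch of the simplest of those guarantees, integrity for data at rest or in transit, a customer could publish a keyed digest of their dataset and re-check it on the worker before training starts. The key exchange here is hypothetical, and this proves only that the bytes are unchanged, not that they remain confidential:

```python
import hashlib
import hmac

# Minimal integrity check: the customer computes a keyed digest of the dataset
# before upload, and the same check is re-run on the remote worker before the
# training job starts. (Illustrative only.)
SHARED_KEY = b"per-job secret agreed out of band"   # hypothetical key exchange

def dataset_digest(data: bytes, key: bytes = SHARED_KEY) -> str:
    return hmac.new(key, data, hashlib.sha256).hexdigest()

dataset = b"...training examples serialised to bytes..."
digest_before_upload = dataset_digest(dataset)

# ... dataset is shipped to the remote provider ...

received = dataset                                   # what the worker actually got
assert hmac.compare_digest(dataset_digest(received), digest_before_upload), \
    "dataset was modified in transit"
```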

Figure 6: Confidential Computing 101

The ultimate guarantee of security for data in use would be to use only fully homomorphic encryption (in which computers can run computations on encrypted data without ever decrypting it, at the cost of some latency), or hardware-based trusted execution environments (TEEs), in which confidential data is only released once the TEE is attested to be trustworthy. Luckily, Nvidia is implementing support for TEEs at the hardware layer in its latest chips, so we can expect this to become the default offering over time.

KYC and Liability

For all marketplaces, suppliers also need to make various trust assumptions about their consumers: mostly about their ability to pay, but also, in cases of physical services or an exchange of goods, that they are verified and safe. In the case of the decentralised compute networks that assure confidentiality, described above, the supplier has no idea what code it is running, and therefore needs to trust that malicious entities are not using its systems for harmful purposes like fraud, terrorism, or organised crime. Suppliers also want to know that they are not running a DDoS or other security attack from their systems. Marketplaces could therefore either use a third-party KYC provider, or assume liability themselves and compete on providing legal protection and insurance against this.

Final thoughts

In sum, due to the unique features of the massive and growing GPU market, there is a lot of potential for unmet demand and underutilised supply to be matched by decentralised and distributed marketplaces. However, the winners will be those that add a value layer beyond securing high-end H100/A100 chip supply and undercutting hyperscaler pricing. Cryptocurrencies will only work in this context if they are entirely distinct from the direct payments and used exclusively as a dividends-based rewards system.

There are two broad areas in which I think this competition will/should fall.

  • Firstly, cluster coordination: GPU marketplace start-ups should compete through superior clustering of weaker GPUs, both via an improved front-end UX/UI for ML developers and via resource orchestration for data centres that may not have the right interconnect hardware and software in place to reach this customer pool.
  • Secondly, security: due to the two parties being unknown to one another, companies can make meaningful differences to different customer and supplier pools by ensuring, insuring and assuring each party that they can trust the other. This applies to verifiable and confidential computing mechanisms, as well as simple KYC checks that may or may not use blockchains as proof systems.

This is an exciting engineering, cryptographic, and mathematical challenge, and we’re excited to see the role of blockchains in the various emerging solutions. If you are building in this space, or would like to discuss any of the above, please reach out to me at marthe@whitestarcapital.com.

References

This piece was written with the help of very appreciated conversations with over a dozen start-ups, operators, and AI/ML researchers from Oxford and King’s College London.

By far my biggest and most recommended sources are Chip War by Chris Miller, this article on Hacker News, and the Acquired podcast series on Nvidia. Other material is linked throughout the text.

The information provided here does not constitute investment advice, financial advice, trading advice, or any other sort of advice and you should not treat any of the website’s content as such. White Star Capital does not recommend that any asset or cryptocurrency should be bought, sold, or held by you. Do conduct your own due diligence and consult your financial advisor before making any investment decisions.
