How to democratize supercomputing? GPUs-as-a-service

Peter Odonovan
Dec 7, 2023

NVIDIA H100 GPUs have, in the last couple of years, become a cornerstone of the field of AI. Their architecture is exceptionally well-suited to the parallel processing demands of AI and machine learning algorithms: unlike traditional CPUs, which are designed for sequential processing, H100s can handle thousands of threads simultaneously, making them incredibly efficient at the fundamental workloads underpinning deep learning and neural network training. NVIDIA has leaned into this market opportunity of a lifetime, continually innovating and tailoring its GPUs to better accommodate AI workloads. The CUDA platform, for example, provides a software environment that allows developers to use C++ (among other languages) to write programs that perform computation in parallel on the GPU, significantly easing the process of coding AI algorithms and optimizing their performance.

However, the exceptional capabilities of the H100 have also made it highly sought after, leading to real scarcity in the market. Several factors contribute to this imbalance: research institutions and tech companies continuously vying for the latest and most powerful GPUs, cryptocurrency mining, the global semiconductor shortage, and scalping, to name a few. Surging demand against limited supply yet again appears to benefit those who already hold the power, i.e. the Googles and Microsofts of the world. Now, I am definitely one of the skeptics who believe that the vast majority of venture-backed AI companies founded in the last couple of years will soon be added to the heap of hype-cycle carcasses. But I do believe there should be a level playing field on which to stress-test that assumption, and it appears to me that the hardware AI depends upon is, at present, more of a locked door than an open window.
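To make that CUDA point concrete, here is a minimal sketch of the programming model, using a hypothetical vector-addition kernel (the grid and block sizes are illustrative, not tuned for any particular GPU). Each thread handles one array element, which is the same "thousands of threads at once" pattern that powers the matrix math inside neural network training:

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each GPU thread computes one element of the output --
// thousands of these run concurrently across the GPU's cores.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;  // one million elements
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n);

    // Allocate device memory and copy the inputs over.
    float *da, *db, *dc;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dc, n * sizeof(float));
    cudaMemcpy(da, ha.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    int threads = 256, blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(da, db, dc, n);

    // Copy the result back; cudaMemcpy implicitly waits for the kernel.
    cudaMemcpy(hc.data(), dc, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("c[0] = %.1f\n", hc[0]);  // 1.0 + 2.0 = 3.0
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```

Compiled with `nvcc` and run on any CUDA-capable GPU, this launches roughly 4,096 blocks of 256 threads each; scaling that pattern up from toy arithmetic to billion-parameter models is exactly the workload the H100 is built for.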

With this in mind, I believe the emergence of the GPU-as-a-service business model, or GaaS, will be one of the most necessary, and potentially super profitable, evolutions in our journey to separate the legit from the, ummm, not-so-legit venture-backed AI companies. Anyone who has followed tech for the last 15 years will have seen "as-a-service" gradually get appended to just about everything, and why shouldn't it? Providing services on an as-needed basis through the cloud gives customers choice and accessibility in volume and pricing, and markets have rewarded those providers accordingly. With GaaS I suspect it will be no different.

GaaS will let providers make the powerful technology described above accessible to a wider range of users, including small and medium-sized enterprises and startups that might not have the resources for such a significant upfront investment. This service model allows customers to leverage the H100's capabilities for tasks like large-scale AI training, advanced data analytics, and complex simulations, without the need for substantial capex. GaaS also offers enhanced flexibility and scalability, which are crucial in today's fast-paced and ever-evolving technological landscape: customers can scale their usage up or down based on current project needs and pay only for what they use, which is particularly beneficial for projects with variable computational requirements. This helps manage costs more effectively while retaining access to top-tier resources. Additionally, the provider's expertise in managing and maintaining these high-end GPUs means customers can enjoy the benefits of the latest technology without the associated challenges of hardware management, updates, and troubleshooting.

One of the big downsides to anything provided "as-a-service" is that the customer gives up control of the asset in exchange for lower costs and greater flexibility.
However, in this situation, assuming the primary GaaS providers will be those with 1) the most comprehensive cloud platforms and 2) the most H100s, i.e. public, highly regulated, and mature businesses like AWS, Azure, and Google Cloud, I think this might actually be a good thing.

All in all, GaaS is going to remove one of the primary bottlenecks that I see preventing AI startups, and companies without AI as a current core competency, from truly testing their respective investment theses behind the use of the technology. I don't know how you go about getting your hands on 100,000 NVIDIA H100s, but if someone can (let's be honest, it is definitely going to be Azure, AWS, or Google Cloud), then they had better do us all a favor and release the GaaS!
