GPU should be as effortless as electricity

Steve Golik
Published in Geek Culture
6 min read · Jun 10, 2021

Imagining accelerated computing as a simple, affordable, remote service

This article is the third in a series of three. Here are 1 and 2.

The world’s GPU capacity is mostly on standby — Photo by Jen Theodore on Unsplash

GPU-accelerated computing is a critical resource for rising technologies, but macro trends are pointing to a capacity shortfall. High prices naturally follow.

Generally, modern computing environments are built around flexibility — virtual machines, containers, network-attached storage, software-defined networks. But GPUs and other accelerators remain stubbornly rigid, and incorporating them into otherwise-dynamic infrastructure is cumbersome and limiting.

So despite GPUs being an increasingly costly component, these limitations keep overall utilization of the world’s GPU capacity around 10–15%.

This “hidden world of GPU inefficiency” is not a widely discussed topic.

Under-discussion of GPU under-utilization is under-standable

Why is this a near-secret?

Reason 1: The practical social dynamics within a business org, which Timothy Prickett Morgan mentions here:

“…[GPU] inefficiency has been tolerated even if it is not talked about much. Nobody tells the CEO or CFO that a supercomputer, having impressive peak theoretical performance, is actually under-utilized.”

Imagine how that conversation might go:

IT Director: You know how we’ve been spending a lot on GPU? We need more. In fact, we’ll need to keep growing our spend on GPU like this [points] over the next few years to support the business plan.

CFO: Ouch. There’s no way I can justify all of that. You can have half of what you’re asking for.

IT Director: [starting to walk away] By the way, our GPUs only run at 10–15% utilization. But there’s nothing I can do about that; it’s just how things work.

CFO: You’re fired.

Reason 2: The status quo of technical constraints, i.e. “there’s nothing I can do about that, it’s just how things work”.

We rarely spend mindshare on a problem that has no imaginable solution. This would be like early humans grumbling “it’s so dark at night” before we learned how to make fire. Darkness at night just was, and GPU underutilization just is — because a solution is only now emerging that promises to break us out of the status quo.

Questioning the status quo of tight coupling

To review from last time, the key limitation to GPU utilization is the short leash of PCIe. This physical tether — between the application space and the GPU accelerating that application — makes the potential resource balancing pool basically nil.

We all know that a car’s engine is one of its most valuable parts — and we know that at this moment there are probably thousands of car engines sitting unused within a mile of us. This is indisputably a colossal waste, but it is an unquestioned reality that a car’s engine is locked one-to-one with the rest of the car.

If I said “you should buy a car without an engine — there’s plenty of car-acceleration capacity just sitting around unused!” you would be highly skeptical at best.

So much underutilized acceleration capacity! — Photo by Carles Rabada on Unsplash

Breaking out of the one-to-one lock-in model for GPUs requires a similar leap of imagination.

But that is exactly what I am saying: the world — including you and your company — will indeed be running far more cars (GPU-hungry applications that drive business value) on fewer engines (GPUs).

How? We are going to make GPUs poolable, shareable, and easily accessible.

For this transformational step-change to be viable, our solution must do four things:

  1. Break free from the PCIe short leash
  2. Provide functionality indistinguishable from normal GPU usage
  3. Allow multiple workloads to use the same GPU
  4. Use software only

Here’s why these are must-dos:

  1. Breaking free from PCIe

Without removing the tight coupling between the application host and the GPU, we can’t support remote access over a network, and we can’t create viable (large) pools of capacity-providers and capacity-consumers to allow maximum resource-balancing — and thus high utilization.

2. Being a GPU from the application’s perspective

Imagine instead of behaving like a normal GPU, our technology came with gotchas: it only works for some APIs, it only works on some operating systems, it only works as part of a specific larger deployment environment, it requires complex setup and maintenance, applications that use it have to implement an SDK… you get the idea. Any of these gotchas would make it a restricted “point solution” and would condemn it to a narrower existence — without the universal applicability that would make it ubiquitous and high-value.

3. Oversubscribing a GPU with multiple workloads

If we were only extending the reach of GPU, we would still be solving an interesting problem by allowing “thin” clients to reach a remote GPU. But if we don’t allow many GPU-consuming clients to use the same GPU-providing server, we can’t raise utilization by overlaying spiky workloads — leaving inefficient whitespace in our utilization graph.
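The overlay effect is easy to see with a little arithmetic. Here is a minimal sketch (the numbers are illustrative, not measurements): simulate several bursty workloads that each need a GPU about 15% of the time, then compare giving each its own GPU against multiplexing them onto one shared GPU.

```python
import random

random.seed(0)  # deterministic for illustration

def spiky_demand(slots, duty_cycle):
    """One workload's GPU demand: busy in roughly duty_cycle of the time slots."""
    return [random.random() < duty_cycle for _ in range(slots)]

SLOTS, DUTY, CLIENTS = 10_000, 0.15, 8
demands = [spiky_demand(SLOTS, DUTY) for _ in range(CLIENTS)]

# Dedicated model: each workload gets its own GPU, which idles ~85% of the time.
dedicated_util = sum(sum(d) for d in demands) / (CLIENTS * SLOTS)

# Shared model: one GPU serves every client, busy whenever any of them needs it.
# (Simplified: we assume coinciding bursts can be absorbed or queued.)
shared_util = sum(any(col) for col in zip(*demands)) / SLOTS

print(f"dedicated GPUs: {dedicated_util:.0%} average utilization each")
print(f"one shared GPU: {shared_util:.0%} utilization")
```

Eight spiky 15% workloads leave one shared GPU busy most of the time — the whitespace in the utilization graph gets filled by someone else’s burst.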

4. Software only

If our solution were to involve hardware, it would not be easy to trial, deliver, or deploy. It would be far more likely to be limited to a certain device, operating system, environment, or chip vendor. As a product, it would be slower to develop and iterate. And finally, the cost of building and delivering each additional unit would be high, which would make it difficult to find everybody-wins price points for ourselves and our customers.

So we’ve identified the trends, the problems, and what an ideal solution must deliver — what do we actually build? Naturally I’m biased, but we feel like we’re on the right track at Juice Labs. Our solution:

Break free from PCIe:

Replace the PCIe connector with a software-defined elastic pipeline that uses standard networking, thus decoupling the client (the OS+application space where the GPU is needed) from the server (which supplies the GPU capacity).

To make our pipeline performant, we significantly lighten the load of data passing across it using compression and (where possible) caching.
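As a sketch of that idea (the call format, API name, and encoding here are hypothetical illustrations, not Juice’s actual protocol), a GPU call crossing the pipeline might be serialized and compressed like this:

```python
import json
import zlib

def encode_call(api, args):
    """Client side: serialize a hypothetical GPU API call and compress
    the payload before it crosses the network pipeline."""
    raw = json.dumps({"api": api, "args": args}).encode()
    return zlib.compress(raw)

def decode_call(packet):
    """Server side: decompress and reconstruct the call."""
    return json.loads(zlib.decompress(packet))

# A zero-filled buffer upload is highly compressible, so the wire payload
# is a small fraction of the raw serialized size.
call = encode_call("uploadBuffer", {"data": [0.0] * 4096})
print(len(call), "bytes on the wire")
print(decode_call(call)["api"])  # -> uploadBuffer
```

Real GPU traffic is binary rather than JSON, but the principle is the same: repetitive buffers and textures shrink dramatically, and anything already seen can be served from a cache instead of resent.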

“Be” a GPU:

Replace the usual GPU driver on the client — the piece of software that lets the OS+application space communicate with the GPU — with our own installable driver that “looks identical”, but instead routes workloads to our pipeline.

Support all common graphics and compute APIs in our driver and our pipeline that an application might use to communicate with a GPU.
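One way to picture that interception layer is a shim that exposes the same call surface an application expects, but forwards each call to a transport instead of local hardware. This is a toy Python sketch — real drivers are native code, and the API names below are made up, not Juice’s:

```python
class RemoteGPU:
    """Stand-in for the client-side driver shim: looks like a GPU API,
    but routes every call to a transport instead of PCIe."""

    def __init__(self, transport):
        self._transport = transport  # e.g. the network pipeline

    def __getattr__(self, api_name):
        # Any API call the application makes is captured by name...
        def forward(*args, **kwargs):
            # ...and shipped across the pipeline instead of to local hardware.
            return self._transport(api_name, args, kwargs)
        return forward

# A toy transport that just records what would cross the network.
log = []
gpu = RemoteGPU(lambda api, args, kwargs: log.append((api, args)) or "ok")

gpu.createBuffer(1024)        # application code is unchanged
gpu.dispatchCompute(8, 8, 1)
print(log)  # [('createBuffer', (1024,)), ('dispatchCompute', (8, 8, 1))]
```

The key property is on the application side: it calls what it believes is a normal GPU driver and never knows the hardware is elsewhere.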

Multiple clients per server:

Build logic into our server install that allows the server GPU to service multiple clients.
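That multiplexing can be sketched as a single worker that owns the GPU and drains a shared queue of client requests. This is a deliberate simplification — the names and scheduling here are illustrative, not Juice’s implementation:

```python
import queue
import threading

# One worker thread owns the (hypothetical) physical GPU and drains a shared
# queue, so many clients can submit work without stepping on each other.
work = queue.Queue()
results = {}

def gpu_worker():
    while True:
        client_id, job = work.get()
        if job is None:  # shutdown sentinel
            break
        # Stand-in for actually running the job on the GPU.
        results[client_id] = f"done:{job}"
        work.task_done()

server = threading.Thread(target=gpu_worker)
server.start()

# Several clients sharing the same GPU server.
for cid, job in [(1, "render"), (2, "train"), (3, "infer")]:
    work.put((cid, job))

work.join()             # wait until all submitted jobs are finished
work.put((None, None))  # stop the worker
server.join()
print(results)  # {1: 'done:render', 2: 'done:train', 3: 'done:infer'}
```

A production scheduler would add fairness, priorities, and memory isolation between clients, but the shape is the same: many consumers, one provider, no idle whitespace.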

Software only:

Notice we didn’t introduce any new hardware! Nothing up our sleeves here.

Delivering on this vision

Putting all these pieces together, we’re providing a drop-in, no-code, no-added-hardware solution that allows native applications to access the GPU they need from anywhere — on any operating system, in any deployment environment; for any device, computer, or data center; in the cloud, on-premise, or at the edge — while pooling and sharing capacity to raise utilization near 100%.

We believe this paradigm will become universal, not just for empowering companies to drive more business value in larger-scale deployments:

A wide pool of virtual or physical clients (blue) served dynamically by virtual or physical GPU servers (green)

…but literally anywhere accelerated computing is useful, like at the edge for mobile and IoT:

In this new world of unprecedented access and efficiency, the rigidity and expense disappear — and GPU capacity becomes an efficient, easily accessed, affordable utility.

Steve Golik is co-founder of Juice Labs, a startup with a vision to make computing power flow as easily as electricity.
