Disaggregated SmartNICs/DPUs/IPUs?

Jakub Kicinski
Mar 6, 2023


A little disclaimer first. I am not working on or with any IPU-like device, and haven’t done so for at least 3 years. All of the opinions below are based on pure speculation.

Second, terminology. I’ll use the term IPU (Infrastructure Processing Unit) going forward because it is the most honest, and least marketing-driven term.

History

I believe that the history of IPUs starts around 2013. Public cloud providers were looking for ways to move expensive portions of the hypervisor off the main CPU and into hardware. CPU cycles used by the hypervisor are CPU cycles which can’t be sold to the customer (in a well-designed system one can charge the customer for the hypervisor processing, but let’s not complicate the story). Network processing is a great offload target: it’s not particularly complex in principle, but it is expensive to do on the CPU at high packet rates.

NICs had supported virtualization and switching at that point already. The legendary Intel 82599, AKA Niantic, supported SR-IOV, VMDq and switching (as in an approximation of L2 forwarding) and began shipping in 2009. It was in fact used by AWS. While Niantic was designed in an earlier era and was clearly inadequate for overlay networks with high VM density, scaling it up and adding VxLAN/GENEVE support would not have been a huge challenge. Instead of iterating on SR-IOV+forwarding NICs, however, the industry took a turn towards general purpose CPUs inside the NIC. An IPU is a computer inside a computer.
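
To illustrate how little the host needs to do in that model, here is a minimal sketch of carving up an SR-IOV NIC on Linux (the device name, VF count, MAC and VLAN are placeholders, not anything a particular provider used):

```python
import subprocess

PF = "eth0"      # hypothetical physical function netdev
NUM_VFS = 8      # hypothetical number of virtual functions

# Carve the PF into virtual functions; each VF can be handed to a guest
# via PCI passthrough and shows up as its own PCIe device.
with open(f"/sys/class/net/{PF}/device/sriov_numvfs", "w") as f:
    f.write(str(NUM_VFS))

# Program the embedded "switch" (really an approximation of L2 forwarding)
# so VF 0's traffic is matched on a fixed MAC and VLAN.
subprocess.run(["ip", "link", "set", PF, "vf", "0",
                "mac", "02:00:00:00:00:01", "vlan", "100"],
               check=True)
```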

I have never worked for a public cloud provider, so I don’t know why the IPU won. Let me speculate; I’d love to hear where I’m completely wrong :)

First, let’s start with some human psychology. Reason #1: forward compatibility. The world was evolving fast and people assumed it would keep evolving. Standardization of VxLAN began in August 2011, Geneve in February 2014. The advent of overlay networking was a major shake-up. Adding layers to the stack that contemporary devices did not recognize, at the scale of public clouds, must have been a major challenge. Nobody knew how long it would take for the new protocol landscape to settle, or whether it would settle at all, which made flexibility very appealing.

Reason #2: software complexity. Software implementations are much more flexible and easier to innovate in, but let’s focus on the negative :) 2010 is also when Software Defined Networking took off. While the goal of the movement was (to my understanding) better control of hardware forwarding, the biggest success of SDN was Open vSwitch (OVS), which forwards packets in software. OVS was much faster than Linux bridging and supported all the new protocols. Naturally it became the go-to technology for many cloud stacks. While I don’t know if hyperscalers used OVS, they likely had similar technologies. This transformed the hardware requirements from the simple “SR-IOV with EVPN”, which would have worked, to “OVS offload”, because OpenStack uses OVS. OVS offload is far more complex: the datapath is very granular and the control path interacts with the forwarding plane constantly. A lift-and-shift of an OVS-like SDN software stack to an IPU is much more appealing than offloading just the datapath.
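
To see why “OVS offload” is a much harder target than plain L2 forwarding, here is a toy model of a flow-granular datapath (purely illustrative, not real OVS code or its actual flow key): every packet that misses the flow table has to go up to the control plane, which installs a per-flow rule before forwarding can continue.

```python
# Toy model of a flow-granular datapath, in the spirit of OVS (not real OVS code).
flow_table = {}        # exact-match flow key -> action, installed on demand

def controller_upcall(key):
    """Control plane decides what to do with an unknown flow and installs a rule."""
    action = ("encap_geneve", {"vni": 42, "remote": "10.0.0.2"})   # made-up policy
    flow_table[key] = action
    return action

def datapath_rx(pkt):
    key = (pkt["src_mac"], pkt["dst_mac"], pkt["src_ip"], pkt["dst_ip"], pkt["proto"])
    action = flow_table.get(key)
    if action is None:
        # Miss: the forwarding plane cannot make progress without the control plane.
        action = controller_upcall(key)
    return action

# Contrast with plain L2 forwarding: a static table, no upcalls, trivial to offload.
l2_fdb = {"02:00:00:00:00:01": "port1"}

print(datapath_rx({"src_mac": "a", "dst_mac": "b",
                   "src_ip": "10.1.0.5", "dst_ip": "10.1.0.9", "proto": 6}))
```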

Hardware offloads typically favour simplicity, so moving the control path to the device goes against engineering best practices. I certainly thought it was a bad idea when I first heard of it. What I was missing at the time is reason #3: isolation. Moving the control plane to the NIC separates it from the CPU on which customer code runs. This means that the NIC can protect itself and the network from customers escaping virtualization. Once we have a general purpose CPU on the NIC we can move other parts of the hypervisor / infrastructure code there, too, giving it the same protection. (This is why I like the term IPU: it is just a CPU where the infrastructure code runs.)

Reason #4: storage. I’m not a storage expert, but storage is often accessed over the network rather than by attaching large disks to each machine. When storage is accessed over the network, moving the storage processing to the IPU is fairly obvious. Modern IPUs can present themselves to the guest as network devices and storage devices. The guest thinks it is talking to a virtio or NVMe drive, while it is really talking to the IPU, which then fetches the data over the network.
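
A toy sketch of what the IPU’s storage personality amounts to (the backend address and the wire format are invented for illustration): the guest submits a block read, and the IPU turns it into a network request to the storage backend.

```python
import socket
import struct

STORAGE_BACKEND = ("198.51.100.10", 4420)   # hypothetical remote storage target

def handle_guest_read(lba, num_blocks, block_size=4096):
    """Emulated block-device read: fetch the data over the network, not a local disk."""
    with socket.create_connection(STORAGE_BACKEND) as s:
        # Made-up wire format: opcode, starting LBA, block count.
        s.sendall(struct.pack("!cQI", b"R", lba, num_blocks))
        data = b""
        while len(data) < num_blocks * block_size:
            chunk = s.recv(65536)
            if not chunk:
                break
            data += chunk
    return data   # completed back to the guest as if it came from a local NVMe drive
```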

Reason #5: bare metal instances. Once we offload all the infrastructure code from the main CPU, we can give control of the entire CPU to the customer. The customer can use their own virtualization stack or simply enjoy the full bare metal performance of the CPU!

Whether bare metal instances are a reason, the reason, or even the only reason with the others being a smoke screen, only cloud providers know. But this brings me to my final point. Reason #6: internal process (cough.. politics?). Lift and shift of existing software to a device which is essentially just another CPU is likely quite appealing (especially at the PowerPoint level), but more importantly it’s much easier to integrate into the processes of a software company. Delivering both competitive software and competitive hardware is hard; the management of software projects is very different from that of silicon projects. Pure hardware is also far more exclusionary: a team is either working on the device or not, while a software-heavy device will host code and have mind-share in far more teams. I’m not saying that the IPU won because it’s easier for management to stomach, but conversely I suspect some other internal/acquired hyperscaler NIC designs suffered due to the lack of the right enzymes.

Future

With Google deploying Intel IPUs, we can safely assume that IPUs are here to stay. I think that they must evolve, however, because in a way they are built on a lie.

Earlier, I mentioned the most commonly quoted advantage of an IPU over running the hypervisor on the main CPU: conserving the CPU cycles which we could otherwise sell. I can’t see how this argument survives close scrutiny. The IPU is just another CPU, and it’s not cheaper or less power hungry. Amazon sells ARM server instances which are (or at least used to be?) basically the same silicon as their Nitro IPU. Clearly, to them a core of the IPU was pretty close to a core they could rent out.

Put differently, I suspect that IPUs are a major source of stranded capacity. The IPU must be able to process the traffic of the most demanding application, so for the majority of workloads it will be underutilized. Both Intel’s E2000 and NVIDIA’s BlueField-3 have 16 high-performance ARM cores and 32/48 GB of DRAM, which translates to an average Xeon-D based micro-server connected to every server.
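
A back-of-the-envelope illustration of the stranded capacity, with every number below being an assumption rather than a measurement:

```python
# Back-of-the-envelope stranded-capacity estimate; every number here is an assumption.
server_cores = 64            # hypothetical host CPU
ipu_cores = 16               # E2000 / BlueField-3 class device
avg_ipu_utilization = 0.25   # assumed average load across the fleet

ipu_share = ipu_cores / (server_cores + ipu_cores)
idle_ipu_cores = ipu_cores * (1 - avg_ipu_utilization)

print(f"IPU share of the cores in the box: {ipu_share:.0%}")         # 20%
print(f"Idle IPU cores per server on average: {idle_ipu_cores:.0f}") # 12
```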

Keep in mind that server capacity differs, and changes over time. Cloud providers likely use a single IPU design across all servers, probably for 4 or 5 years. Even if the balance was perfect when the requirements were defined (2+ years before production), it will certainly not be at the tail end of the deployment.

Connecting multiple servers to a single IPU (like Meta does in its Yosemite microserver designs) is an option to average the workload out, but I doubt it’d scale beyond 4, maybe 8, hosts.

A better solution would be to disaggregate the IPU within a rack.

There are various ideas centred around the proximity of servers within a rack: Intel’s Rack Scale Design, disaggregated memory, and others sprinkled with CXL pixie dust. While most “rack scale” projects focus on pooling of resources, disaggregating IPUs makes a special kind of sense. Both sides of the equation are the same type of resource: a CPU.

Going back to the Amazon example for illustration: a Graviton/Nitro pair (respectively the ARM server and the IPU) is basically two identical CPUs connected over a PCIe bus, one controlled by the customer and one by the provider. The PCIe bus is internally a credit-controlled packet network, not too dissimilar to InfiniBand, for example. Instead of connecting the two one-to-one, we could connect an entire rack (50?) of them via a switch. We can then pick which of the 50 act as IPUs and rent the rest out to customers.
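
A sketch of the kind of role assignment this would enable (the node count and the 1-to-12 ratio are made up): with every CPU on the same switched fabric, the provider decides at deployment time, or even later, which ones act as IPUs and pairs each customer node with one of them.

```python
# Toy model of a rack of identical CPUs whose roles are assigned at deployment time.
# The node count and the 1 IPU : 12 customer nodes ratio are assumptions.
RACK_SIZE = 50
NODES_PER_IPU = 12

nodes = [f"node{i}" for i in range(RACK_SIZE)]
num_ipus = -(-RACK_SIZE // (NODES_PER_IPU + 1))    # ceiling division

ipus = nodes[:num_ipus]          # run the provider's infrastructure code
rentable = nodes[num_ipus:]      # rented out to customers

# Pair every customer node with one of the IPU nodes.
pairing = {node: ipus[i % num_ipus] for i, node in enumerate(rentable)}
print(f"{len(ipus)} IPUs serving {len(rentable)} customer nodes")
```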

The latency would increase from PCIe’s ~0.5 usec to a few usec, but that’s likely imperceptible to the majority of workloads. Each server would get a very simple NIC (let’s call it IOG, for “I/O Gateway”) to expose the device and its queues to the operating system, as well as to implement the security policies. To give the same security guarantees as an IPU, the IOG would make sure that a customer machine can talk exclusively to “its IPU” machine, while letting the provider machines communicate outside of the rack. We can go further and integrate the IOG into the CPU package as a chiplet.
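
A minimal sketch of the forwarding policy the IOG would enforce (the classification and node names are mine for illustration, not a spec):

```python
# Toy IOG admission check; roles and pairings come from the provider's control plane.
ROLE = {"node7": "customer", "node0": "provider-ipu"}    # illustrative only
PAIRED_IPU = {"node7": "node0"}    # each customer node has exactly one IPU node

def iog_allow(src, dst):
    """Return True if the IOG should forward traffic from src towards dst."""
    if ROLE.get(src) == "customer":
        # Customer machines may talk only to their own IPU node; everything else
        # (other tenants, the underlay, the outside world) must go via that IPU.
        return dst == PAIRED_IPU[src]
    # Provider-controlled machines may communicate outside of the rack.
    return True

assert iog_allow("node7", "node0")        # guest to its IPU: forwarded
assert not iog_allow("node7", "node3")    # guest to any other node: dropped
```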

As an optimization, the IOG may support simpler use cases (basic overlay forwarding) directly, without the need to hairpin via the companion CPU.
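
A sketch of what that optimization could look like (the table layout and values are invented): the IOG keeps a small encap table for flows it already knows and only punts misses to the companion CPU.

```python
# Toy IOG fast path: flows with a known encap entry are handled locally,
# everything else hairpins to the companion CPU acting as the IPU.
encap_table = {
    # (source IP, destination IP) -> (VNI, underlay tunnel endpoint); illustrative values.
    ("10.1.0.5", "10.1.0.9"): (4096, "192.0.2.17"),
}

def iog_tx(src_ip, dst_ip, payload):
    entry = encap_table.get((src_ip, dst_ip))
    if entry is not None:
        vni, tunnel_ep = entry
        return ("send_encapsulated", vni, tunnel_ep, payload)   # stays in the IOG
    return ("hairpin_to_ipu", payload)                          # let the IPU decide

print(iog_tx("10.1.0.5", "10.1.0.9", b"hello"))   # fast path
print(iog_tx("10.1.0.5", "10.9.9.9", b"hello"))   # punted to the IPU
```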

This design effectively removes the IPU from the picture. There is no special CPU type/device which we must support in the fleet. We can stick the IOG into any machine, and use real CPUs (designed by CPU design teams rather than NIC teams) to do the processing. Everybody wins. Well, maybe not NIC vendors.
