OpenAI SRE and scaling, explained simply

Prannay Hebbar
6 min read · Mar 13, 2023

OpenAI, one of the world’s leading AI research organizations, has announced that it has scaled a single Kubernetes cluster to 10,000 nodes, up from 2,500 nodes in 2019. This is a remarkable scaling effort, driven by rapidly growing demand for compute, and OpenAI has shared the lessons learned along the way so that others in the Kubernetes community can benefit from them.

Before diving into the details, it’s important to understand OpenAI’s workload. Their applications and hardware are different from those at a typical company. Large machine learning jobs run most efficiently when they have access to all of the hardware resources on each node. As a result, a single pod occupies an entire node, and resource contention isn’t a factor for scheduling. OpenAI’s clusters have full bisection bandwidth, so there are no rack or network topology considerations.
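
As a rough illustration of that one-pod-per-node pattern, a training pod can simply request every GPU on its node so nothing else gets co-scheduled alongside it. Below is a minimal sketch using the Python Kubernetes client; the namespace, image name, and 8-GPU node shape are illustrative assumptions, not OpenAI’s actual configuration.

```python
# Sketch: a pod that requests every GPU on a node, so the scheduler
# effectively dedicates the whole node to it (assumes 8-GPU nodes).
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="train-job-0", namespace="research"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="example.com/research/trainer:latest",  # hypothetical image
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(
                    # Requesting all 8 GPUs keeps any other GPU pod off this node.
                    requests={"nvidia.com/gpu": "8"},
                    limits={"nvidia.com/gpu": "8"},
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="research", body=pod)
```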

The workloads themselves are ever-changing, as OpenAI’s research teams iterate quickly. Thus, they need a sustainable system that also allows them to respond quickly when things change.

One of the biggest challenges in scaling their infrastructure was networking. As the number of nodes and pods increased, OpenAI found that Flannel had difficulty scaling up to the required throughput. They switched to the native pod networking of their cloud, using IP Configurations on Azure VMSSes together with the relevant CNI plugins, which gave their pods host-level network throughput.

Another reason for the switch was that, on their largest clusters, they could have approximately 200,000 IP addresses in use at any one time. When they tested route-based pod networking, they found significant limitations in the number of routes they could effectively use.

OpenAI uses iptables tagging on the host to track network resource usage per Namespace and pod, which lets researchers visualize their network usage patterns. An open-source Prometheus exporter called iptables-exporter then gets these counters tracked in their monitoring system.
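
Mechanically, the approach boils down to adding an iptables rule per pod whose packet and byte counters become the per-pod usage numbers, with iptables-exporter scraping those counters into Prometheus. The sketch below is a hedged approximation; the chain, comment format, and pod IP are illustrative assumptions rather than OpenAI’s exact rules.

```python
# Sketch: tag a pod's traffic with an iptables rule in the mangle table so its
# byte/packet counters can be read (and scraped by iptables-exporter).
import subprocess

def add_accounting_rule(pod_ip: str, namespace: str, pod: str) -> None:
    """Add a mangle-table rule whose counters track traffic from one pod (needs root)."""
    subprocess.run(
        [
            "iptables", "-t", "mangle", "-A", "FORWARD",  # chain choice is an assumption
            "-s", pod_ip,
            "-m", "comment", "--comment", f"ns={namespace},pod={pod}",
            "-j", "RETURN",  # no-op target: we only want the counters
        ],
        check=True,
    )

def read_counters() -> str:
    """Dump per-rule packet/byte counters; iptables-exporter parses similar output."""
    out = subprocess.run(
        ["iptables", "-t", "mangle", "-L", "FORWARD", "-v", "-x", "-n"],
        check=True, capture_output=True, text=True,
    )
    return out.stdout

if __name__ == "__main__":
    add_accounting_rule("10.244.3.17", "research", "train-job-0")  # hypothetical pod
    print(read_counters())
```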

One unique aspect of OpenAI’s network model is that they fully expose the node, pod, and service network CIDR ranges to their researchers. They have a hub-and-spoke network model, and use the native node and pod CIDR ranges to route that traffic. Researchers connect to the hub, and from there have access to any of the individual clusters (the spokes). But the clusters themselves cannot talk to one another, ensuring that clusters remain isolated with no cross-cluster dependencies that can break failure isolation.

Scaling a single Kubernetes cluster to this size is rarely done and requires some special care. However, the upside is a simple infrastructure that allows OpenAI’s machine learning research teams to move faster and scale up without changing their code. The lessons learned during this process are valuable to the Kubernetes community and can be applied to other organizations as well.

The Kubernetes API servers and etcd are two of the most important components of a healthy working cluster, so OpenAI pays special attention to the stress placed on them. They use the Grafana dashboards that ship with kube-prometheus, along with in-house dashboards, to alert on the rate of HTTP 429 (Too Many Requests) and 5xx (Server Error) responses from the API servers, which lets them quickly recognize and respond to any issues.
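
The signal behind such alerts can be approximated with a PromQL rate query over the API server request metrics that kube-prometheus collects. Here is a minimal sketch that queries Prometheus’s HTTP API directly; the Prometheus address and the alert threshold are assumptions, and in practice the alerting lives in Grafana or Alertmanager rules rather than a script like this.

```python
# Sketch: check the rate of 429 and 5xx responses from the Kubernetes API servers
# by querying Prometheus's HTTP API (kube-prometheus exposes apiserver_request_total).
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"  # assumed in-cluster address
QUERY = 'sum(rate(apiserver_request_total{code=~"429|5.."}[5m])) by (code)'

def apiserver_error_rates() -> dict:
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": QUERY},
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return {r["metric"]["code"]: float(r["value"][1]) for r in results}

if __name__ == "__main__":
    for code, rate in apiserver_error_rates().items():
        # 1 req/s is an arbitrary illustrative threshold, not OpenAI's.
        flag = "ALERT" if rate > 1.0 else "ok"
        print(f"{code}: {rate:.3f} req/s [{flag}]")
```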

OpenAI has always run API servers and etcd on their own dedicated nodes outside of the cluster, which spreads the load and minimizes the impact if one of them goes down. Their largest clusters run five API servers and five etcd nodes, further reducing the blast radius of any single failure. They have not had significant trouble with etcd since splitting Kubernetes Events into a separate etcd cluster, as discussed in their earlier blog post. API servers are stateless and generally easy to run in a self-healing instance group or scale set, but the company hasn’t yet tried to build self-healing automation for etcd clusters because incidents have been extremely rare.
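
Even without self-healing automation, a simple periodic health probe across the etcd members is cheap insurance. The sketch below shells out to etcdctl; the endpoint addresses and certificate paths are placeholders, not OpenAI’s actual topology.

```python
# Sketch: probe each etcd member with `etcdctl endpoint health` (etcd v3 API).
import subprocess

# Placeholder addresses for five dedicated etcd nodes.
ENDPOINTS = [f"https://etcd-{i}.internal:2379" for i in range(5)]

def etcd_healthy(endpoint: str) -> bool:
    result = subprocess.run(
        [
            "etcdctl", "--endpoints", endpoint,
            "--cacert", "/etc/etcd/ca.crt",   # placeholder cert paths
            "--cert", "/etc/etcd/client.crt",
            "--key", "/etc/etcd/client.key",
            "endpoint", "health",
        ],
        capture_output=True, text=True,
    )
    return result.returncode == 0

if __name__ == "__main__":
    for ep in ENDPOINTS:
        print(ep, "healthy" if etcd_healthy(ep) else "UNHEALTHY")
```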

OpenAI is very mindful of any API server requests that scale with the size of the cluster. They try to avoid having DaemonSets talk to the API server directly; where each node does need to watch for changes, they introduce an intermediary caching service, such as the Datadog Cluster Agent, to avoid a cluster-wide bottleneck. This matters because API server memory usage tends to grow roughly linearly with the number of nodes in the cluster, so per-node watch traffic becomes more expensive as the cluster grows.
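
The idea behind the intermediary is that one deployment holds the watch against the API server while every node-local agent reads from its cache, so API server load stays roughly constant as nodes are added. Below is a toy sketch of such a cache, assuming Endpoints are the resource being watched; OpenAI uses the Datadog Cluster Agent rather than anything hand-rolled like this.

```python
# Sketch: one process watches the API server and serves a cached snapshot over
# HTTP, so per-node agents poll this cache instead of the API server directly.
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

from kubernetes import client, config, watch

CACHE = {}          # endpoint name -> ready addresses (illustrative payload)
LOCK = threading.Lock()

def watch_endpoints(namespace: str = "default") -> None:
    """Single watch against the API server, shared by all consumers."""
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    for event in watch.Watch().stream(v1.list_namespaced_endpoints, namespace):
        ep = event["object"]
        addresses = [
            addr.ip
            for subset in (ep.subsets or [])
            for addr in (subset.addresses or [])
        ]
        with LOCK:
            CACHE[ep.metadata.name] = addresses

class CacheHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        with LOCK:
            body = json.dumps(CACHE).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    threading.Thread(target=watch_endpoints, daemon=True).start()
    ThreadingHTTPServer(("0.0.0.0", 8080), CacheHandler).serve_forever()
```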

OpenAI uses Prometheus to collect time-series metrics and Grafana for graphs, dashboards, and alerts. They started with a deployment of kube-prometheus, which collects a wide variety of metrics and ships good dashboards for visualization. Over time, they added many of their own dashboards, metrics, and alerts. As nodes were added, they struggled with the sheer volume of metrics Prometheus was collecting: kube-prometheus exposes a lot of useful data, but some of it they never actually looked at, and some was too granular to collect, store, and query effectively. They use Prometheus rules to “drop” some of these metrics from being ingested.
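
In Prometheus terms, that dropping is typically done with metric_relabel_configs entries using action: drop, applied before ingestion. The sketch below just emits such a fragment from Python; the metric-name pattern is an illustrative example, not OpenAI’s actual drop list.

```python
# Sketch: generate a metric_relabel_configs fragment that drops high-cardinality
# series by metric name before Prometheus ingests them.
import yaml

drop_rules = [
    {
        "source_labels": ["__name__"],
        # Illustrative pattern only; the real drop list isn't public in this post.
        "regex": "apiserver_request_duration_seconds_bucket|etcd_request_duration_seconds_bucket",
        "action": "drop",
    }
]

scrape_config_fragment = {"metric_relabel_configs": drop_rules}
print(yaml.safe_dump(scrape_config_fragment, sort_keys=False))
```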

Prometheus would often consume more and more memory until the container eventually crashed with an out-of-memory (OOM) error, even after enormous amounts of memory were thrown at it. Worse, when it did crash, it would spend many hours on startup replaying write-ahead-log files before it was usable again. OpenAI eventually tracked the OOMs down to an interaction between Grafana and Prometheus, in which Grafana issued a metrics-metadata query that Prometheus would answer with no bound on time or memory. They patched Prometheus to wrap that API in a Context to enforce a timeout, which fixed the problem entirely.
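
The actual fix was a patch inside Prometheus itself, but the broader lesson travels well: bound expensive metadata queries in both time range and wall-clock time. The sketch below shows only the client-side analogue of that idea, against an assumed Prometheus address.

```python
# Sketch: the client-side analogue of bounding an expensive metadata query --
# restrict /api/v1/series to a time window and enforce a wall-clock timeout.
import time
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"  # assumed address

def bounded_series_query(match: str, window_seconds: int = 3600) -> list:
    now = time.time()
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/series",
        params={
            "match[]": match,
            "start": now - window_seconds,  # don't ask for all history
            "end": now,
        },
        timeout=15,  # give up instead of letting the query run unbounded
    )
    resp.raise_for_status()
    return resp.json()["data"]

if __name__ == "__main__":
    # Illustrative matcher; broad, unqualified matchers are exactly the kind of
    # query that becomes ruinously expensive on a large Prometheus.
    print(len(bounded_series_query('up{job="apiserver"}')))
```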

OpenAI’s clusters are large, and they rely on automation to detect and remove misbehaving nodes. Over time they have built up a number of healthcheck systems that monitor basic node health: network reachability, bad or full disks, and GPU errors. GPUs exhibit problems in several ways, but a common and easy one to catch is an “Uncorrectable error,” which OpenAI monitors for.
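
A hedged sketch of what such a check might look like: sum the uncorrectable ECC counters reported by nvidia-smi and cordon the node if any are non-zero. The parsing and the cordon step are illustrative, not OpenAI’s actual health checker.

```python
# Sketch: flag a node whose GPUs report uncorrectable ECC errors, then cordon it.
import re
import socket
import subprocess

def uncorrectable_ecc_errors() -> int:
    """Sum counters from any `nvidia-smi -q` line mentioning 'Uncorrectable'."""
    out = subprocess.run(
        ["nvidia-smi", "-q"], capture_output=True, text=True, check=True
    ).stdout
    return sum(int(m) for m in re.findall(r"Uncorrectable\s*:\s*(\d+)", out))

def cordon(node_name: str) -> None:
    """Take the node out of scheduling; draining/remediation would follow."""
    subprocess.run(["kubectl", "cordon", node_name], check=True)

if __name__ == "__main__":
    errors = uncorrectable_ecc_errors()
    if errors > 0:
        print(f"{errors} uncorrectable GPU errors; cordoning node")
        cordon(socket.gethostname())  # assumes node name == hostname
    else:
        print("GPUs healthy")
```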

In conclusion, OpenAI’s approach to scaling a single Kubernetes cluster to such a large size is a testament to the company’s commitment to innovation and meeting user demand. The lessons learned from this process can benefit others in the Kubernetes community, especially those dealing with large machine learning workloads. While there are still challenges to be addressed, such as optimizing networking and improving job resiliency, the infrastructure currently in place allows OpenAI’s research teams to move faster and scale up without having to change their code. This is a remarkable achievement that showcases OpenAI’s expertise in the field of machine learning and their dedication to advancing the state of the art.
