Illustration by Chenyu Wang

The Petuum Data Center Operating System For AI and ML

By Aurick Qiao

In recent years, artificial intelligence (AI) and machine learning (ML) have become integral to a wide variety of sectors. AI and ML form the basis of many technologies we now take for granted: medical imaging systems that can identify abnormalities, predictive maintenance systems that can alert you to failures in manufacturing plants before they happen, and fraud detection systems that can comb through billions of financial transactions for criminal activity.

However, as these industries become more technologically complex, the problems we want to apply AI and ML to are also becoming more complex and require entire data centers of computational power. Harnessing that computational power often poses an insurmountable challenge for end-users. That’s why, at Petuum, we are building a data center operating system that bridges the gaps between data center hardware, AI/ML applications, and end-users.

We are frequently asked, “What is a data center operating system?” When the term “operating system” (or “OS”) is brought up, we often think of desktop operating systems like Windows and Linux, or mobile operating systems like Android and iOS. We use them every day on devices we are familiar with, like laptops and smartphones. On the other hand, not many of us directly interact with data centers. In this post, we will attempt to demystify the concept of a data center operating system and explain how it is similar to, and different from, traditional operating systems.

OS for a Data Center

One definition of an operating system is “software that manages hardware and software resources and provides common services for applications and users.” For example, Windows manages the hardware in a personal computer and supports desktop applications like web browsers or word processors. iOS manages the hardware in a smartphone and supports mobile applications like text messaging or GPS navigation. Both of these types of OS are typically used by the owner of the device (e.g. a smartphone), or by a few friends or family members (e.g. a desktop computer).

However, the definition of an operating system is more general than the familiar examples we use every day on our personal devices, and it naturally extends to a data center. Like any other OS, a data center OS is software that manages hardware resources so that they can be harnessed by applications and users. Of course, the hardware, applications, and users of a data center differ greatly from those of personal devices. We’ll focus our discussion on these three aspects, and how Petuum conceptualizes a data center OS for AI and ML.

Hardware Resources

The central responsibility of any OS is to manage hardware resources. For example, desktop operating systems control all hardware devices that can connect to a personal computer, such as hard drives, memory modules, and keyboards. Likewise, a data center OS controls all the hardware resources in a data center, such as CPUs, memory, and the network. While much of the individual management of the hardware on each server can be delegated to single-machine operating systems like Linux, data center operating systems are also concerned with the interplay between hardware on different machines. This includes:

  • Understanding the characteristics of hardware that does not reside on any single machine, like the local network. Knowing which servers belong on a single physical rack lets the data center OS schedule applications for better performance and reliability.
  • Managing hardware failures. Even if the hardware on one machine fails, the data center as a whole keeps running. The data center OS must detect failures, report them, and work around them so that the data center can continue to operate.
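
To make the two points above concrete, here is a minimal sketch of rack-aware placement. It is not Petuum's scheduler: the function name `place_replicas`, the `rack_of` mapping, and the `healthy` set (standing in for the output of some failure detector) are all hypothetical. The sketch skips servers that have been reported as failed and spreads the tasks of one application across racks, so that losing a single rack takes out as few of them as possible.

```python
from collections import defaultdict

def place_replicas(servers, rack_of, healthy, n_replicas):
    """Pick up to n_replicas healthy servers, spreading them across racks.

    servers:  list of server names (hypothetical inventory)
    rack_of:  dict mapping server name -> rack name
    healthy:  set of servers currently believed to be alive
    """
    # Group the healthy servers by the rack they sit in.
    by_rack = defaultdict(list)
    for server in servers:
        if server in healthy:
            by_rack[rack_of[server]].append(server)

    # Visit racks round-robin, taking one server per visit.
    racks = sorted(by_rack)
    chosen = []
    i = 0
    while len(chosen) < n_replicas and any(by_rack[r] for r in racks):
        rack = racks[i % len(racks)]
        if by_rack[rack]:
            chosen.append(by_rack[rack].pop(0))
        i += 1
    return chosen

servers = ["a1", "a2", "b1", "c1"]
rack_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
healthy = {"a1", "a2", "b1"}  # c1 has been reported as failed

placement = place_replicas(servers, rack_of, healthy, n_replicas=2)
```

In this example the two tasks land on different racks, and the failed server is never considered. A real data center OS would also weigh network distance, current load, and hardware type, none of which is modeled here.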

Furthermore, AI and ML applications may use hardware resources in ways that can be challenging for a data center OS to support. For example, training a model on a modern dataset can use terabytes of memory for weeks at a time, require specific hardware like GPUs, and exhibit complex network communication patterns. The data center OS must ensure that each competing AI/ML application gets a fair share of hardware resources, and that each application is able to adapt when its resources are re-allocated to a higher-priority application.
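
One common way to define a "fair share" is max-min fairness: small requests are granted in full, and whatever is left is split evenly among the applications that want more. The sketch below is only an illustration of that idea, not a description of how Petuum allocates resources. The function name `max_min_fair_share` and the job names are made up, and priorities and re-allocation over time are deliberately left out.

```python
def max_min_fair_share(capacity, demands):
    """Split `capacity` units of a resource among jobs using max-min fairness.

    demands: dict mapping job name -> units requested.
    Returns a dict mapping job name -> units granted.
    """
    allocation = {job: 0 for job in demands}
    remaining = capacity
    unsatisfied = dict(demands)

    while unsatisfied and remaining > 0:
        share = remaining / len(unsatisfied)
        # Jobs that want no more than an equal share get everything they asked for.
        satisfied = {job: d for job, d in unsatisfied.items() if d <= share}
        if not satisfied:
            # Everyone left wants more than an equal share: split evenly and stop.
            for job in unsatisfied:
                allocation[job] = share
            break
        for job, demand in satisfied.items():
            allocation[job] = demand
            remaining -= demand
            del unsatisfied[job]
    return allocation

# Three hypothetical training jobs competing for 10 GPUs.
grants = max_min_fair_share(10, {"small": 2, "medium": 4, "large": 10})
```

Here the two smaller jobs receive everything they asked for, and the largest job receives the 4 GPUs that remain. When a new application arrives, calling the function again with the updated demands yields a new allocation, which is exactly the moment at which running applications must be able to adapt.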

Application Support

Another core responsibility of an OS is to provide easy-to-use tools for application development. The hardware managed by the OS is abstracted out into programming models, letting developers fully take advantage of the hardware without getting bogged down with details. Developers all around the world use abstractions like these every day:

  • Threads that let you program CPUs without knowing exactly which core the application executes on.
  • Virtual memory that lets an application use memory as if it had a large, private address space of its own, regardless of how much physical memory is installed.
  • Processes that combine threads, virtual memory, and other abstractions to give the application a view of the entire machine.
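
As a small illustration of the first abstraction, the Python sketch below hands work to a pool of threads without ever naming a CPU core; deciding where each thread runs is left to the single-machine OS. The `word_count` function and the sample strings are placeholders chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def word_count(text):
    """Count whitespace-separated words in a string."""
    return len(text.split())

documents = [
    "a data center OS",
    "manages hardware",
    "for applications and users",
]

# The program only says "run these in threads"; it never picks a core.
with ThreadPoolExecutor(max_workers=3) as pool:
    counts = list(pool.map(word_count, documents))
```

The results come back in the same order as the inputs, whichever thread and core happened to process each one.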

Although such mature programming abstractions don’t exist in data center operating systems yet, they’re well on their way. A distributed computation can already be abstracted away as a job (the data center equivalent of a process) and decomposed into a number of tasks that can execute on different machines and communicate with each other through the network. At Petuum, we are creating a set of OS-level abstractions specifically targeted towards AI/ML, including:

  • Parameter server as a distributed memory abstraction for AI/ML.
  • Structure-aware scheduling as a distributed computation abstraction for AI/ML.
  • Storage abstractions for datasets, models, and other AI/ML-related objects.
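
To give a feel for the first item, here is a toy, single-process sketch of the parameter server idea: workers pull the current model parameters, compute an update on their own shard of the data, and push that update back to shared state. This is not Petuum's API. The class `ParameterServer`, its `pull` and `push` methods, and the tiny dataset are all invented for illustration, and the "workers" run one after another in a loop rather than on separate machines.

```python
class ParameterServer:
    """Toy in-process stand-in for a distributed store of model parameters."""

    def __init__(self, params, lr):
        self.params = dict(params)
        self.lr = lr  # learning rate applied to every pushed gradient

    def pull(self):
        """Return a copy of the current parameters."""
        return dict(self.params)

    def push(self, grads):
        """Apply a gradient-descent step for each pushed gradient."""
        for name, grad in grads.items():
            self.params[name] -= self.lr * grad

def worker_gradient(params, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    w = params["w"]
    return {"w": sum(2 * (w * x - y) * x for x, y in shard) / len(shard)}

# Two data shards drawn from y = 2x, one per hypothetical worker.
shards = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0)],
]

server = ParameterServer({"w": 0.0}, lr=0.01)
for _ in range(200):
    for shard in shards:
        server.push(worker_gradient(server.pull(), shard))

learned_w = server.params["w"]
```

After training, `learned_w` is very close to 2, the slope of the data. In a real deployment the server state would be partitioned across machines and workers would pull and push concurrently over the network, which is where most of the systems challenges lie.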

User Management

Lastly, an OS must provide an environment that allows us to use both the hardware and the applications. For example, multiple people can have accounts on the same OS, see their own files, and share the applications installed on it. However, the usage patterns of data centers are drastically different from those of personal devices because of:

  • More users. In a single organization, there can be thousands or tens of thousands of people. The data center OS must be able to handle a much greater load and ensure fairness of resource utilization in the data center.
  • Concurrent use. Many people can be logged into the data center, launching computation jobs and using applications at the same time. The real-time interactions between users must be properly facilitated by the OS.

For an organization to fully leverage AI and ML, the data center OS must support a diverse set of users. For example, a team of data scientists might run resource-intensive experiments on large datasets to incrementally fine-tune a predictive model, while another team of deployment engineers makes the model available for use. On the other hand, a business-oriented team might not care at all about the experimentation or deployment processes, but want to use their results to derive insights for their business. At Petuum, we believe that these different types of users are equally important in an organization and we are building our data center OS to serve AI/ML experts and non-experts alike.

Data center operating systems are true operating systems for data centers. They manage hardware devices and their interplay within a networked cluster of machines, provide programming abstractions for distributed applications, and support a wide variety of users in an organization. At Petuum, we are tackling the challenges of building a data center operating system for AI and ML. In future posts, we will discuss specific systems problems we are solving, including hardware resource management, adaptability in multi-tenant environments, and programming abstractions for distributed AI/ML.