Notes on Data-Intensive Systems, Serverless Compute, & Orchestration

Siddharth Sharma
12 min read · Jan 30, 2024


Musings of the Cloud (xkcd 908)

If a machine learning system is analogous to a car, then data is the fuel and compute is the engine. When I first began working in machine learning, I kept hearing that data is the most critical part of the workflow. It’s true to a degree — without quality, high-confidence data relevant to the task at hand, machine learning processes tend to run into roadblocks when it comes to performance. But rather than squeeze performance out of data alone, research scientists across academic institutions and industry labs have realized that there are significant benefits to training with scaled-up compute resources. What was once a hypothesis about how to improve capabilities has become a practice of throwing as much compute as possible at a task.

The performance gap between an NVIDIA A100 and an H100 is consequential for tasks like AI training and inference. Now imagine the difference between working with 100 GPUs versus 1,000 or 10,000; the gains equate to orders of magnitude. It comes as no surprise, then, that GPUs are in high demand. One could argue that if everyone had access to GPUs, we could unlock the ability to solve problems that currently lie in intractable search spaces. AlphaFold and AlphaCode are strong early examples, with impressive results achieved largely through the sheer brute force of optimized parallel computation.

The ultimate advantage of modern computing is the ability to perform statistical pattern recognition at an unprecedented scale. Rich Sutton’s famous essay “The Bitter Lesson” makes the same point: general methods that leverage computation tend to be superior to methods relying on human-crafted knowledge, especially in the long term. This disparity is largely due to the exponentially decreasing cost of computation, following Moore’s Law. Historical trends in AI, exemplified in fields like computer chess, Go, speech recognition, and computer vision, consistently show that approaches using massive computation, such as deep search and learning, eventually outperform those based on human knowledge. One great example is Monte Carlo Tree Search (MCTS) and its applications in game-playing: instead of depending on predefined strategies or expert knowledge, MCTS uses random sampling and statistical methods to make decisions that balance exploration and exploitation. This approach allows MCTS to discover novel strategies and gameplay tactics, often surpassing human intuition and expertise.

Rich Sutton, father of RL
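
To make the exploration-exploitation idea concrete, here is a minimal, self-contained sketch (my own illustration, not tied to any particular game engine) of the UCB1 rule that sits at the heart of most MCTS implementations:

```python
import math
import random

class Node:
    """A node in the search tree, tracking visit counts and accumulated reward."""
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0

    def ucb1(self, c=math.sqrt(2)):
        # Unvisited children are explored first.
        if self.visits == 0:
            return float("inf")
        exploitation = self.total_reward / self.visits          # average reward so far
        exploration = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploitation + exploration

def select(node):
    # Walk down the tree, always picking the child with the highest UCB1 score.
    while node.children:
        node = max(node.children, key=lambda child: child.ucb1())
    return node

def backpropagate(node, reward):
    # Push the simulated reward back up to the root, updating statistics.
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent

# Toy usage: a root with three candidate moves and a handful of random rollouts.
root = Node()
root.children = [Node(parent=root) for _ in range(3)]
for _ in range(100):
    leaf = select(root)
    reward = random.random()  # stand-in for a real game simulation
    backpropagate(leaf, reward)
```

The larger the exploration constant c, the more the search favors rarely visited branches over branches with high average reward; real MCTS implementations add expansion and game-specific simulation (rollout) steps on top of this skeleton.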

AI’s success in game-playing (rich solution spaces with well-structured rewards), taken together with Sutton’s ideas, emphasizes that while integrating human knowledge can yield short-term gains, it often plateaus and may hinder further progress. Breakthroughs in AI have predominantly come from scaling up computation through search and learning methods. In this article, we walk through the history of large-scale data-oriented systems, look at recent developments in the cloud with containerized applications, and then explore the future of serverless computing and orchestration.

Modern Challenges

Training large language models across multiple GPUs in a distributed system presents a host of intricate challenges, spanning both hardware constraints and software complexities. One of the primary hurdles is synchronization overhead, which is essential for keeping the model’s state consistent across all devices. For models with a very large number of parameters, this synchronization can significantly slow down training.

Additionally, memory bandwidth and transfer speeds are critical bottlenecks. The data transfer between GPUs, especially for models that demand substantial memory, can hinder efficiency. Technologies like NVIDIA’s NVLink or AMD’s Infinity Fabric play pivotal roles here, since their interconnect speeds determine how quickly data can move between devices. Limited bandwidth can leave GPUs idling as they wait for data, prolonging training time, while memory pressure surfaces as CUDA out-of-memory errors.

Effective load balancing is also paramount. Distributing the computational workload evenly across all GPUs is a complex task: it involves partitioning (sharding) not just the model but also the data in an efficient manner. An imbalance in this distribution can leave some GPUs underutilized while overburdening others, culminating in an inefficient use of resources. Precise allocation of workloads is thus a challenge in its own right.

The finite memory capacity of each GPU is another constraint. Given that large models or batch sizes can exceed this capacity, strategies like model parallelism or gradient checkpointing become necessary. However, these strategies introduce their own complexities and can hurt efficiency. Moreover, the software and frameworks facilitating distributed computing need to be optimized to leverage multi-GPU training effectively; any inefficiency in these tools can negate the benefits of the underlying powerful hardware.
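
As a rough sketch of how a few of these pieces fit together in practice, here is what data-parallel training with gradient checkpointing looks like in PyTorch. The model, sizes, and hyperparameters are placeholders, and a launcher such as torchrun is assumed to start one process per GPU:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.checkpoint import checkpoint

class TinyModel(torch.nn.Module):
    """Placeholder two-layer model; real transformer blocks would go here."""
    def __init__(self):
        super().__init__()
        self.block1 = torch.nn.Linear(1024, 1024)
        self.block2 = torch.nn.Linear(1024, 1024)

    def forward(self, x):
        # Gradient checkpointing trades compute for memory: block1's activations
        # are recomputed during the backward pass instead of being stored.
        x = checkpoint(self.block1, x, use_reentrant=False)
        return self.block2(x)

def main():
    # One process per GPU; NCCL handles the inter-GPU communication.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = TinyModel().cuda(local_rank)
    # DDP all-reduces gradients across ranks on every backward pass; this is
    # the synchronization overhead described above.
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        batch = torch.randn(32, 1024, device=local_rank)  # stand-in data
        loss = model(batch).sum()
        loss.backward()   # gradients are synchronized across GPUs here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Scaling beyond plain data parallelism typically layers sharding strategies (tensor or pipeline parallelism, ZeRO-style optimizer sharding) on top of this, which is where the load-balancing and communication challenges discussed here really start to bite.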

Communication and networking costs, especially in multi-node clusters, pose additional challenges. The time and resources required for inter-GPU communication, particularly between GPUs in different nodes, can be substantial. High communication costs can erode the advantages of parallel processing, leading to a situation where adding more GPUs offers diminishing returns. The scalability of the system is a further concern: as the number of GPUs grows, so does the complexity of managing the training process efficiently, often resulting in smaller gains than expected due to increased communication and synchronization overheads. Finally, the energy consumption of large-scale distributed training can be considerable, raising operational costs.

History of Large-Scale Data Processing

If we approach modern machine learning applications and deployment from an architecture lens, distributed systems for data processing and serving have become the paradigm for enabling ML workflows at scale. Whether it be data labeling or model inference, handling massive numbers of queries or examples requires flexible and scalable computing frameworks. From a first-principles view, distributed systems can be defined as large-scale groups of computers (a cluster) or even multi-cluster networks that execute operations in a fault-tolerant and resilient manner. As the earlier section argued, general-purpose computation methods are the ones to index on — distributed systems supply the raw scalability to employ such methods on vast amounts of data.

In the early 2000s, Google, grappling with the challenge of indexing an exponentially growing web, developed MapReduce. The architects of this system — Jeff Dean and Sanjay Ghemawat — are icons in Google lore and the tale of the internet. Their programming model revolutionized data processing by splitting tasks into two phases: ‘Map’ to process and convert data into key-value pairs, and ‘Reduce’ to aggregate these results. The scalability of MapReduce came from its ability to distribute these tasks across numerous computers, handling each chunk of data in parallel.

Two Google Legends: Jeff and Sanjay
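
To make the two phases concrete, here is a toy, single-machine word count in the MapReduce style (my own illustration; the real system shards the input and runs these phases across many workers, with a shuffle step in between):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # 'Map': emit a (key, value) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group values by key; this is the step that sits between Map and Reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # 'Reduce': aggregate all values emitted for a given key.
    return key, sum(values)

documents = ["the cloud is the computer", "the bitter lesson"]
mapped = list(chain.from_iterable(map_phase(doc) for doc in documents))
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'the': 3, 'cloud': 1, 'is': 1, ...}
```

Because each ‘Map’ call touches only its own chunk of input and each ‘Reduce’ call touches only one key, both phases parallelize naturally across machines.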

Hadoop, an open-source framework inspired by MapReduce and GFS (Google File System), emerged to bring these capabilities to the broader tech world. At Hadoop’s core was the Hadoop Distributed File System (HDFS), a scalable and fault-tolerant storage system designed to run on commodity hardware, alongside MapReduce for processing. Hadoop’s YARN (Yet Another Resource Negotiator) later evolved as a cluster management component, enhancing the system’s ability to handle diverse data processing tasks beyond just MapReduce, thereby improving resource utilization and scheduling.

A few years later, Apache Spark was developed in response to the limitations of Hadoop’s MapReduce, especially in terms of speed. Spark introduced an in-memory data processing capability, dramatically reducing the heavy reliance on disk I/O and improving speeds by orders of magnitude, particularly for iterative algorithms and interactive data exploration. Spark’s resilient distributed datasets (RDDs) and its ability to run computations in memory made it highly efficient for a range of applications, from machine learning to stream processing.
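
For a sense of what this looks like in practice, here is a minimal PySpark sketch (assuming a local Spark installation; the dataset and computation are placeholders) in which a cached RDD is reused across iterations instead of being re-read from disk each time:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

# Load numbers into a resilient distributed dataset and keep it in memory.
numbers = sc.parallelize(range(1_000_000)).cache()

# Each iteration reuses the cached partitions rather than hitting disk,
# which is where Spark's speedup over disk-bound MapReduce comes from.
for step in range(3):
    total = numbers.map(lambda x: x * step).reduce(lambda a, b: a + b)
    print(step, total)

spark.stop()
```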

In parallel, Apache Mesos emerged as a solution for more dynamic resource sharing across distributed applications. Unlike YARN, which was closely tied to Hadoop, Mesos provided a more generalized cluster management solution, capable of running Hadoop alongside other applications like Spark, thereby offering greater flexibility and resource efficiency. This technological evolution from MapReduce to Hadoop, and then to Spark and Mesos, represents a foundational shift in big data processing. It encapsulates the transition from single-server limitations to the expansive, distributed computing landscapes we see today, driven by the need to harness and extract insights from ever-growing data volumes in a scalable, efficient manner. With the scale demanded by large language models, there is now a need for data-intensive systems that provide an optimized user experience alongside heavy scalability. From the systems I’ve encountered, I’d argue that four key ideas maintain the integrity and robustness of a scalable big data system:

1. Strong performance (low latency, high throughput)
2. Reliability, availability, and flexibility (scale up and down)
3. Low cost and competitive pricing
4. Ease of use (as simple as can be)

The Cloud & Containerization

The Cloud Race

Following the advent of large-scale data processing systems like MapReduce and Hadoop in the 2000s, the 2010s introduced the era of the modern public cloud — built on the ability to provide services for data and SaaS applications at scale. The rise of the cloud, pioneered by AWS’s launch of Elastic Compute Cloud (EC2) in 2006, marked a significant shift in IT infrastructure management. AWS’s approach, born from the need to support its growing e-commerce platform, evolved into a highly scalable, shared infrastructure model, drastically reducing application build times and costs. Microsoft Azure entered the market later, in 2010; under Satya Nadella’s leadership, Microsoft then shifted to a cloud-first strategy, incorporating acquisitions like LinkedIn and GitHub to bolster its cloud offerings. Google Cloud Platform, starting with App Engine in 2008, has focused on industry-specific services and partnerships under Thomas Kurian’s direction. This competitive landscape among the three giants has accelerated cloud innovation and accessibility, profoundly impacting how businesses approach IT infrastructure.

The major questions with the cloud are vendor lock-in and which vendor to pick in the first place. At the same time, container technologies, particularly Docker and Kubernetes (K8s), began to play a pivotal role in the ecosystem of public cloud providers like AWS, Azure, and GCP. These platforms offer a standardized way to package and deploy applications across environments, ensuring consistency and efficiency. Cloud providers have integrated container technologies into their platforms, offering services like Amazon ECS and EKS, Azure Kubernetes Service, and Google Kubernetes Engine. This integration allows businesses to leverage the scalability and flexibility of the cloud while maintaining the portability and ease of management provided by containerization.

From a technical lens, Docker is a platform that utilizes OS-level virtualization to deliver software in packages called containers. These containers are isolated but can communicate with each other via well-defined channels. Kubernetes (K8s) is an orchestration system for managing containerized applications across a cluster of nodes. It automates deployment, scaling, and management of application containers, ensuring high availability and resource optimization. Both Docker and K8s emphasize modular, microservices-based architectures, promoting agility, scalability, and resilience in application deployment and management.
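
As a small illustration of the container workflow driven from code, here is a sketch using the Docker SDK for Python, assuming a local Docker daemon is running; the image and command are arbitrary:

```python
import docker

# Connect to the local Docker daemon using environment defaults.
client = docker.from_env()

# Run a short-lived, isolated container from a public image and capture its
# output; the container is removed once the command exits.
output = client.containers.run(
    "python:3.11-slim",
    ["python", "-c", "print('hello from a container')"],
    remove=True,
)
print(output.decode())

# List the containers currently running on this host.
for container in client.containers.list():
    print(container.short_id, container.image.tags)
```

Kubernetes then takes over at the cluster level, scheduling many such containers across nodes and restarting or scaling them according to declared desired state.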

The key lesson from the evolution of the modern public cloud and containerized applications is a trend towards a more serverless, automated computing environment — frameworks built with simplicity at the core. This progression emphasizes the importance of scalability, flexibility, and efficiency in resource utilization. The abstraction of underlying infrastructure complexities, as seen in container and serverless technologies, allows developers and organizations to focus more on application development rather than on managing servers. This shift is increasingly leading towards an era where workload allocation, resource optimization, and scaling up and down are inherently managed by the system, enabling more agile and cost-effective software development and deployment.

Looking to the Future: The Case for Serverless

I want to dive deeper into the ideas behind serverless computing and the more general idea of implicit orchestration of compute and application resources. With LLMs, we’re closer than ever to having models that can perform optimal decision-making and solve problems in resource allocation. Serverless computing, despite its name, still relies on servers but abstracts their management from the user. In this context, serverless often refers to Functions-as-a-Service (FaaS), where the focus is on individual functions handling specific tasks. These functions offer a simplified API and lifecycle compared to more complex structures like containers. This simplification aligns well with common web operations like responding to requests or server-side rendering. The transformative aspect lies in its invocation-based approach, rather than maintaining a continuous process. This model offers streamlined orchestration, scalability, and deployment, fitting the needs of businesses focused on rapid, scalable compute solutions. Serverless architectures also bring benefits like automatic multi-region deployment and easy failover, continuously evolving to enhance developer experience and computational efficiency:

Guillermo Rauch on X
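
For a feel of the programming model, here is a minimal sketch of a function in the shape used by AWS Lambda’s Python runtime (the payload fields are placeholders); the platform invokes it per request, so there is no long-running server process to provision or manage:

```python
import json

def handler(event, context):
    # 'event' carries the request payload; 'context' carries runtime metadata.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }

# Local smoke test; in production the cloud provider invokes the handler.
if __name__ == "__main__":
    print(handler({"name": "serverless"}, context=None))
```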

Vercel is one company advancing the state-of-the-art in serverless compute: they are building a frontend cloud that optimizes webpage rendering and performance while avoiding the clunkiness of typical container management. Pinecone, a vector DB provider, is another strong example of the case for serverless:

There are no pods, shards, replicas, sizes, or storage limits to think about or manage. Simply name your index, load your data, and start querying through the API or the client. There’s nothing more to it.

— Edo Liberty, CEO and Founder of Pinecone

Pinecone’s recent transition emphasizes the evolution of a serverless vector database, specifically tailored for GenAI applications. Stressing the significance of knowledge in AI, they highlight how access to extensive data improves AI application performance. The redesigned database offers up to 50 times lower costs, with a focus on ease of use, scalability, and efficient vector-search performance. Notable features include the separation of reads, writes, and storage, innovative indexing and retrieval algorithms, and a multi-tenant compute layer. This serverless solution lets developers focus more on application development than on infrastructure management. It may seem too good to be true, but Pinecone is not sacrificing optimization for ease of use: just like their existing pod-based indexes, the serverless implementation supports live index updates, metadata filtering, hybrid search, and namespaces, giving users full control over their data.

Performance is also preserved: when queries are more frequent (allowing greater local caching in multi-tenant workers), serverless indexes provide significantly lower latencies than pod-based indexes, with roughly the same level of recall.
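
Here’s roughly what that looks like from the client side, based on Pinecone’s serverless Python client at the time of writing (a hedged sketch: the index name, dimension, cloud/region, and vectors below are placeholders):

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")

# Create a serverless index: no pods, shards, or replicas to size or manage.
pc.create_index(
    name="example-index",
    dimension=4,                 # dimensionality of your embeddings
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-west-2"),
)

index = pc.Index("example-index")

# Upsert a couple of vectors into a namespace, then query by similarity.
index.upsert(
    vectors=[
        {"id": "a", "values": [0.1, 0.2, 0.3, 0.4]},
        {"id": "b", "values": [0.2, 0.1, 0.4, 0.3]},
    ],
    namespace="demo",
)
results = index.query(vector=[0.1, 0.2, 0.3, 0.4], top_k=2, namespace="demo")
print(results)
```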

The final piece: Orchestration

Serverless is just one piece of a timeless theme in computing: simple abstraction with robust optimization hidden away from the user. In the context of serverless and cloud computing, orchestration refers to the automated arrangement, coordination, and management of complex computer systems, middleware, and services. It’s about managing the interactions and dependencies between numerous cloud-based components and services: pods, containers, surrounding infrastructure, and so on. We’re trending towards a future where orchestration could become nearly fully automatic and driven by independent ML planning. This would involve systems that learn from their environment and user demands, dynamically adapting and optimizing workflows, resource allocation, and responses to changes in real time. Such advancements could significantly enhance efficiency, reduce manual oversight, and streamline cloud operations. The key challenge on the immediate horizon, and for 2024, is likely heterogeneity: connecting unaligned resources, whether that means different workload shapes and types, different hardware and GPUs, different networks and servers, or different tasks.

The ultimate beauty of compute and the cloud is that ease of use is not at odds with performance, as one might typically be led to think. The real tradeoff lies in finding the optimum between user control and abstraction: for some use cases, it is preferable to stay elastic and keep system-level fluctuations hidden underneath, while in others, users may prefer more control. Going back to the initial example of a car, it’s not in human nature to take our hands completely off the wheel at first, even in a fully autonomous vehicle. For those designing distributed systems to solve problems in ML, I encourage you to chase the frontier of simplicity and user-oriented abstraction.

“I don’t need a hard disk in my computer if I can get to the server faster… carrying around these non-connected computers is byzantine by comparison.”

— Steve Jobs, late chairman of Apple (1997)
