What does Serverless mean for Data Platforms?

Sandeep Uttamchandani
Published in Wrong AI
2 min read · Feb 15, 2018


Serverless is a new compute paradigm where the programmer specifies functions that need to run in response to events. This is in contrast to the existing paradigm of writing programs, and then deploying them to run either on bare metal or VMs or containers.

The key difference is the shift in the abstraction for the programmer — instead of worrying about how to run, when to run, where to run, and how to scale the program, the Serverless paradigm allows the programmer to focus purely on the business logic that runs in response to an event (such as a customer request, a new file upload, etc.). Under the covers, the platform is responsible for spinning up compute resources, processing the data, auto-scaling and load-balancing, and reclaiming the resources upon completion.
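The division of labor described above can be sketched in a few lines of Python. This is a minimal illustration, not a real serverless API: the names `handler` and `invoke` are assumptions chosen for clarity, with `invoke` standing in for everything the platform does invisibly.

```python
# A minimal sketch of the serverless contract (illustrative names only).
# The programmer writes just the handler — pure business logic that
# reacts to an event.
def handler(event):
    # e.g., respond to a new file upload
    return f"processed {event['file']}"

# Stand-in for the platform: in a real system, provisioning compute,
# auto-scaling, load-balancing, and reclaiming resources would all
# happen here, invisible to the programmer.
def invoke(fn, event):
    return fn(event)

print(invoke(handler, {"file": "report.csv"}))
```

The point of the sketch is what is *absent* from `handler`: no servers, no scaling logic, no deployment details — only the reaction to the event.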

The Serverless paradigm is the next logical step in the evolution of computing and builds on technology disruptions that have been in the making over the last 15–20 years. In the early 2000s (prior to VMware), compute was essentially bare-metal resources that took months to order and provision. Then came the virtualization era, where compute was treated as Virtual Machines (VMs): multiple VMs ran on the same physical server and could be spun up, spun down, and vMotion'ed between servers. Over the last few years, the idea of containers has been gaining popularity, where processes running within a traditional OS are isolated and managed as independent compute resources. Containers are significantly more lightweight than VMs, allowing higher density on a single server as well as faster spin-up and spin-down. Today, the use-cases for Serverless are mainly limited to short-running processes that are relatively stateless.

So what does it mean when a data platform is referred to as being serverless? Here, serverless takes on a broader meaning: the data architect does not have to define the physical aspects of the infrastructure deployment — namely the number of nodes in the cluster, the data layout, scaling policies, the types of compute and storage instances, etc.

Instead, the architect specifies the SLAs, and the system figures out the underlying configuration details under the covers. The ideal goal is to eliminate guesswork with respect to storage/compute resource provisioning, database configuration (such as the number of connections), and query-specific tuning (indexes, data layout, etc.).
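The shift from physical configuration to SLAs can be made concrete with a toy sketch. Everything here is a hypothetical illustration — the SLA fields, the `plan_cluster` planner, and the `qps_per_node` figure are assumptions, not any real product's interface:

```python
# Hypothetical: the architect declares only service-level objectives.
sla_spec = {
    "p99_latency_ms": 100,    # desired tail latency
    "throughput_qps": 5000,   # sustained queries per second
}

def plan_cluster(sla, qps_per_node=800):
    """Toy stand-in for the platform: derive physical details
    (node count, instance types) from the declared SLA."""
    nodes = -(-sla["throughput_qps"] // qps_per_node)  # ceiling division
    return {"nodes": nodes, "instance_type": "auto-selected"}

print(plan_cluster(sla_spec))
```

In a real serverless data platform, a planner like this would be far more sophisticated — continuously re-deriving and adjusting the physical configuration as workload characteristics change — but the contract is the same: SLAs in, infrastructure decisions out.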

Serverless implementations today in the context of data platforms are essentially souped-up versions of "auto-scaling." While that is true of the current phase of the technology's evolution, serverless is much broader than just scaling up and down: it encapsulates expert data layout and cluster tuning as a programmatic capability that is transparent to the data users.

A good example is AWS Aurora Serverless: a relational database service where the data engineer does not manage cluster scale or hardware configuration. The database automatically scales up and down under the covers in units of ACUs (Aurora Capacity Units). An ACU represents 2 GB of memory with corresponding CPU and network bandwidth; in the current implementation, scaling ranges from 1 to 256 ACUs.
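Aurora Serverless (v1) scales in discrete capacity steps that roughly double from 1 to 256 ACUs. The helper below is a sketch of that arithmetic, assuming the doubling step sequence and the 2 GB-per-ACU figure stated above; it is not part of any AWS API:

```python
# Assumed Aurora Serverless v1 capacity steps (doubling from 1 to 256)
# and the ~2 GB of memory that each ACU represents.
ACU_STEPS = [1, 2, 4, 8, 16, 32, 64, 128, 256]
GB_PER_ACU = 2

def acus_for_memory(required_gb):
    """Smallest capacity step providing at least `required_gb` of memory."""
    for acu in ACU_STEPS:
        if acu * GB_PER_ACU >= required_gb:
            return acu
    return ACU_STEPS[-1]  # clamp at the 256-ACU maximum

print(acus_for_memory(50))  # a 50 GB working set lands on the 32-ACU step
```

The discrete steps matter in practice: a workload that needs slightly more memory than one step provides pays for the entire next step, so capacity planning (even when automated) still rounds up.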
