Scaling Skip: Serverless Architectures

TLDR: Serverless has made it possible to scale Skip with a small team of engineers. It’s also given us a programming model that lets us tackle complexity early on, and gives us the ability to view our platform as a set of fine-grained services we can spread across agile teams.

In 2013, a seminal paper was published on the topic of managing large-scale internet services.

Your Server as a Function, by Marius Eriksen at Twitter, examined how hard it is, even for the most experienced programmers, to manage concurrency in a stateful environment. Eriksen proposed an abstraction that enabled shipping scalable code faster, while encouraging modularity between services (and ultimately teams). This abstraction was the foundation for Finagle, the framework that today powers Twitter’s real-time systems.

At Skip, we were inspired by Eriksen’s work and took it a step further: extending his abstraction to how we think about our entire platform. Since day one, we built Skip as a set of functions backed by the power of Google Cloud Platform (GCP), a highly available managed infrastructure by our friends at Google.

Every single service at Skip is powered by Google Cloud Functions. What’s most apparent about Skip’s infrastructure is what’s missing: no Kubernetes, no orchestration, no containers, no middleware. And thanks to auto-scaling, no load balancers either.

In this blog post, we’d like to walk you through how thinking in terms of serverless architectures has enabled us to scale to millions of users with just a handful of engineers.

Simpler by default

Switching one’s mindset from a container-based web services architecture to the serverless world can take a bit of getting used to, because serverless comes with constraints. At Skip, we view these constraints as boons that ultimately make service correctness easier to reason about, much like static type checking does for program correctness.

Expressing our platform as a set of stateless cloud functions mandates several design decisions.

For example, Google explicitly discourages sharing global state across cloud function calls, as the underlying function instance may be shut down or replaced at any time.

  • Cloud Functions implements the serverless paradigm, in which you just run your code without worrying about the underlying infrastructure, such as servers or virtual machines. To allow Google to automatically manage and scale the functions, they must be stateless — one function invocation should not rely on in-memory state set by a previous invocation.
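
To make the statelessness rule concrete, here is a minimal sketch (not Skip's actual code) of a function that keeps its state in an external store instead of in module-level variables. The `InMemoryStore` class and the `record_ride` function are hypothetical names invented for this example; the in-memory class only stands in for a real external service such as Cloud Firestore so that the snippet runs locally.

```python
from typing import Dict, Optional


class InMemoryStore:
    """Hypothetical stand-in for an external store such as Cloud Firestore."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}

    def get(self, key: str) -> Optional[int]:
        return self._data.get(key)

    def put(self, key: str, value: int) -> None:
        self._data[key] = value


def record_ride(store: InMemoryStore, rider_id: str) -> int:
    """Increment and return a rider's ride count, keeping all state in `store`."""
    # Nothing is cached in module globals, so any instance can serve any call.
    # Read-then-write is not atomic; a real store would need a transaction here.
    count = (store.get(rider_id) or 0) + 1
    store.put(rider_id, count)
    return count
```

Because the count lives in the store rather than in the function instance, it survives whichever instance happens to handle the next invocation.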

Similarly, for functions that are triggered by messages on a Pub/Sub queue, Google guarantees “at-least-once” message delivery, so every Cloud Function must be safe to re-run. Finally, Cloud Functions must finish within a given amount of time, or they are interrupted early.

These design constraints force our engineers to confront bugs like duplicate requests and timeouts early in the development cycle. The result is services that are simpler by default. State is stored explicitly in the service best suited to the type of data we are saving. At-least-once delivery pushes us to write handlers that are safe to retry. Timeout limits encourage the team to decompose work into small functions rather than build monolithic ones.
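
One common way to make a handler safe to re-run is to record the ID of each message already processed and skip duplicates. The sketch below illustrates that idea under stated assumptions: `Ledger` and `apply_charge` are hypothetical names, and the in-memory `Ledger` stands in for a durable store. In a real system the duplicate check and the write would need to happen in a single transaction.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Ledger:
    """Hypothetical stand-in for a durable store shared by all invocations."""
    processed_ids: Set[str] = field(default_factory=set)
    balances: Dict[str, int] = field(default_factory=dict)


def apply_charge(ledger: Ledger, message_id: str, rider_id: str, amount_cents: int) -> bool:
    """Apply a charge once per message ID; return False for a redelivery."""
    if message_id in ledger.processed_ids:
        # Duplicate delivery: the charge was already applied, so do nothing.
        return False
    ledger.balances[rider_id] = ledger.balances.get(rider_id, 0) + amount_cents
    ledger.processed_ids.add(message_id)
    return True
```

Delivering the same message twice leaves the balance unchanged after the first call, which is the property at-least-once delivery demands.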

Although there’s a learning curve, once you get used to thinking serverless, the simplicity and correctness gains make it hard to go back.

To infinity and beyond

By leveraging the stateless nature of functions, scaling Skip becomes partly a workload scheduling optimization for Google’s data center software. Crucially, it’s an optimization problem we don’t have to solve ourselves.

  • Cloud Functions handles incoming requests by assigning them to instances of your function. Depending on the volume of requests, as well as the number of existing function instances, Cloud Functions may assign a request to an existing instance or create a new one.

Beyond automatically scaling instances up and down with our traffic patterns, Cloud Functions gives us the power to leverage the broader GCP.

Cloud Functions integrates automatically with Stackdriver, giving us centralized logging, visibility and error reporting for every endpoint. We can then push these logs to BigQuery for further analysis.

But most importantly, Cloud Functions can respond to events from other GCP services like Pub/Sub (scalable messaging) and Cloud Firestore (Spanner-backed NoSQL database). This allows our engineers to build cloud functions that respond to external events and data changes with the same unified serverless programming model.

By leveraging Cloud Functions with Pub/Sub and Cloud Firestore, we take the “server as a function” analogy all the way through our stack, including in many of our task and storage APIs.
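
As an illustration of the event-driven model, here is a minimal sketch of a Pub/Sub-triggered function in Python. It assumes the background-function convention in which the message body arrives base64-encoded under the event's `data` key. The function name `on_ride_event`, the helper `make_pubsub_event`, and the JSON fields `ride_id` and `status` are hypothetical and exist only for this example.

```python
import base64
import json
from typing import Any, Dict, Optional


def on_ride_event(event: Dict[str, Any], context: Optional[Any] = None) -> str:
    """Decode a Pub/Sub message and describe the ride update it carries."""
    # The message body is base64-encoded; the JSON shape is an assumption.
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    return f"ride {payload['ride_id']} is now {payload['status']}"


def make_pubsub_event(payload: Dict[str, Any]) -> Dict[str, str]:
    """Build an event dict shaped like a Pub/Sub trigger, for local testing."""
    encoded = base64.b64encode(json.dumps(payload).encode("utf-8"))
    return {"data": encoded.decode("ascii")}
```

Calling `on_ride_event(make_pubsub_event({"ride_id": "abc", "status": "ended"}))` exercises the handler locally without any GCP dependency.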

Your team as a group of functions

As our engineering team has grown, so too has the number of our functions, now in the hundreds.

A happy accident of composing our system in terms of functions rather than servers is that we can combine cloud functions logically into sets of APIs that each developer or team can be collectively responsible for.

Furthermore, we can assign service level objectives (SLOs) to different teams based on the functions they manage. For example, the functions that make up the operations stack for Skip’s Rangers and operations teams may have different SLO requirements for error rates, tail latencies and on-call response than other parts of our platform, given how mission-critical those functions are to our city operations.
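
A rough sketch of how per-group SLOs might be checked is shown below. It is an illustration, not a description of Skip's tooling: the `Slo` and `FunctionStats` types, the group names, and the error-rate thresholds are all hypothetical. The idea is simply to aggregate per-function counts by owning group and compare each group's error rate against its own objective.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Slo:
    max_error_rate: float  # e.g. 0.001 means at most 0.1% of calls may fail


@dataclass(frozen=True)
class FunctionStats:
    name: str
    group: str  # the team or function group that owns this function
    invocations: int
    errors: int


def groups_violating_slo(stats: Iterable[FunctionStats], slos: Dict[str, Slo]) -> List[str]:
    """Return the sorted names of groups whose error rate exceeds their SLO."""
    totals: Dict[str, Tuple[int, int]] = {}
    for s in stats:
        invocations, errors = totals.get(s.group, (0, 0))
        totals[s.group] = (invocations + s.invocations, errors + s.errors)
    return sorted(
        group
        for group, (invocations, errors) in totals.items()
        if invocations > 0 and errors / invocations > slos[group].max_error_rate
    )
```

Giving a mission-critical group a tighter `max_error_rate` than the rest of the platform expresses the different requirements described above in one place.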

Finally, by having ‘function groups’ as an atomic unit of a service, we can also easily track how complex services are and how to best allocate engineers to them over time.

Known limits

Like any new technology, there are limitations. What are things that we wish we could do that are either difficult or impossible in a serverless world?

A common client/server pattern is the long-lived socket connection, which allows the server to push data to clients rather than having clients poll the server periodically. This is virtually impossible with serverless. At Skip, we are big fans of Cloud Firestore, which lets mobile devices receive live updates when data changes in the database. This is how we currently tackle push. However, using a database is just a temporary solution, and in the coming months we’ll be examining how best to push data through HTTPS.

Another limitation is the lack of local network connections to services like Redis and StatsD. At the time of writing, the GCP team has added bridging to Google Compute Engine directly from Cloud Functions with VPC Beta. We’re looking forward to this capability and are eager to spin up Cloud Memorystore, GCP’s managed Redis service.

There are other similar rough edges, but so far we’ve found the GCP teams have been responsive and helpful along the way as we scale.

Looking ahead

Serverless has made it possible to scale Skip with a small team of engineers. It’s also given us a programming model that lets us tackle complexity early on, and gives us the ability to view our platform as a set of fine-grained services we can spread across agile teams.

As our business and industry continue to evolve, we’ll be sharing more on how we use other GCP products to power Skip’s core infrastructure.

If solving big engineering problems with elegant solutions gets you excited too, we’re hiring!