Mesos at Strava

Over the past few years at Strava, server side development has transitioned from our monolithic Ruby on Rails app (The Monorail) to a service oriented architecture with services written in Scala, Ruby, and golang.

Top: Commits to our monolithic rails app. Bottom: Total private repositories at Strava as a proxy for the number of services.

Initially, services at Strava were reserved for things that were simply not possible to implement in Rails (for example, see my previous blog post on Routemaster). It was difficult to bring up a new service, and it required some combination of prior knowledge and trailblazing.

At some point, it became clear that services had become the preferred way of development at Strava. Service oriented architecture (SOA) had triumphed over the Monorail, however, all was not yet well. If done hastily, introducing dozens of services can cause havoc. We had figured out that we liked building services, but we didn’t really have a plan for scaling the number of them.

This post covers the infrastructure that has supported smooth adoption of services at Strava. Today we run over a hundred services deployed by over a dozen engineers. Engineers are able to fully implement and deploy new services quickly and with minimal guidance, having almost no prior infrastructure experience. Services are deployed extremely quickly (< 30 seconds) in a consistent and reliable way. They are monitored, logged, and always kept running.

If you are interested in working with our team on the infrastructure described in this post, we’re hiring! Check out our open Platform Engineer role.

Mesos

Mesos is an abstraction layer that removes the need to think about individual servers in a datacenter. Mesos represents a collection of servers as a pool of “resources” (CPU Cores, Memory, Disk Space). Users of a Mesos cluster (called “frameworks”) schedule “tasks” on resources provided by Mesos. Machines providing resources to Mesos are called “Agents”. Agents can run as many tasks as they have free resources to support. If the underlying hardware running an Agent fails, frameworks are notified and can simply reschedule their tasks elsewhere.

The idea of Mesos at Strava was originally a running joke about hiding experimental services from AWS spend tracking in a “Rogue Mesos Cluster”. It was also something we genuinely wanted to learn about, but initially seemed too lofty and abstract of a concept to provide tangible benefits for a small organization like ours.

Actual adoption took a while and consisted of a long experimental underground phase, another long internal persuasion phase, followed by yet another phase of improving tooling and documentation to the point where engineers without prior experience could use it effectively. Today, we have been using Mesos in production for about a year.

Currently, we use the following Mesos Frameworks at Strava:

  • Marathon (keeps instances of services as long running tasks)
  • Storm (a streaming computation framework)
  • Spark (a data processing framework)

We really like how Mesos is rapidly becoming a standard way to run distributed computing projects. The scope of the problem Mesos solves is very cleverly chosen, and the execution is great. We anticipate moving more of our infrastructure to Mesos using these frameworks and new ones in the future.

Docker

Docker is a containerization tool. Containers function as lightweight and low overhead virtual machines. A Docker image is a file that forms the initial state of a container. It is immutable and carries no state, so containers run from the same image are initially identical.

Using containers allow developers to use whatever language or framework they wish. A service developer simply needs to produce a Docker image that implements their service. They don’t have to configure any servers.

It is a good idea to use containers with Mesos because a Mesos Agent runs a heterogeneous workload colocated on a single machine. Inside of a container, services don’t care at all what the configuration of the Mesos Agent looks like. Without containers, heterogeneous services would constantly be running into dependency conflicts, and developers would be opinionated about the configuration of Mesos Agents.

A powerful feature of Docker is layers. A Docker build file is a series of command that starts from a base image, and each command produces an intermediate image. When updating an Docker image, only layers that have changed need to be built. Layers of a Docker image for a Scala service might by sized and ordered like the following:

  • (200 mb) Base Linux system (Docker ubuntu image)
  • (100 mb) System dependencies (apt-get install of java, scala, etc)
  • (50 mb) Application/Library dependencies (jars of 3rd party dependencies)
  • (1 mb) Compiled application (A single jar containing just your own code)

If a developer only edits application code, everything but the last layer will remain the same. Only the last layer will need to be built and pushed to the remote Docker registry. The unchanged layers will also likely be cached on Mesos Agents as well. This speeds up deployment significantly.

Marathon

Marathon is a Mesos framework for long running tasks, analogous to upstart/systemd. A Marathon application declares what Docker image to run, how many replicas to run, and how much CPU and Memory to provision for each instance. Marathon then uses Mesos to make it so. Marathon also provides features such as health checks and deploy roll-out semantics.

We choose to use Marathon somewhat early on. At the time, Aurora seemed a bit complicated for our needs, and Marathon looked more actively developed than Singularity. We also are fond of tools written in Scala. There are also competing, non Mesos based “orchestration layers” that have the function of running containers (Kubernetes, Docker Swarm, CoreOS/Fleet, AWS EC2 Container Service (ECS)).

About a year ago we committed to Marathon as our preferred way to deploy services. We have seen a nearly linear growth of services without any show stopping issues. We have been very happy with Marathon so far.

It certainly was not necessary to use Mesos to deploy services, however there are undeniable benefits. At the very least, Mesos eliminates many of the drawbacks that the fragmented service oriented architecture introduces. Some particular topics warrant additional discussion:

Developer Happiness and Efficiency

Our original infrastructure for deploying services caused much frustration and was far from ideal. It took many long running steps with manual verification of success between steps. It was also difficult for a new engineer to develop a new service on their own, because the process was not simple and not well documented. The process looked something like this:

  • Build a Debian (.deb) package
  • Push deb package to our apt server (1 minute)
  • Wait for apt server to have deb package ready (5 minutes)
  • Build step:
  • Boot a new AWS instance (1–2 minutes)
  • Run puppet on the instance, installing the deb (2–5 minutes)
  • Turn the instance into an AMI image (2–3 minutes)
  • Boot new AWS instances using new AMI (1–2 minutes)
  • Terminate old AWS instances

It would easily take 30 minutes for a single deploy even under ideal conditions where nothing broke.

A fast deploy cycle is a critical component of a service oriented architecture. More services means more deploys, and no one wants to spend time deploying. With Mesos/Marathon/Docker our deploy process now looks like this:

  • Build and push Docker image (< 1 minute, can run locally or via CI)
  • Deploy with Marathon (~1 min to complete a rolling restart, but a failure is known sooner)

A simple service might only take 20 seconds to fully deploy under ideal conditions.

There are benefits that span beyond direct time savings. Rapidly deploying and observing multiple service configurations allows for testing and optimization that otherwise would be too costly. Developers learn to deploy a single change at a time, this makes tracking where bugs were introduced trivial. Fast rollbacks also reduce the impact of a broken deploy, as well as the pressure on a developer to not make mistakes.

We have also realized tremendous benefits from adoption of Docker. Before, a reference production build could only be generated by doing a real deploy on AWS. With Docker, a developer can do this all locally. The surface area of difference between local development and production is tiny. We also had almost universally negative experiences with puppet and are happy to move away from that technology.

Infrastructure Simplicity

All Mesos tasks run on a pool of servers configured in the same way. Our infrastructure is very simple. Management of our AWS billing and selection of instances to reserve is easier. Previously, services would reserve different instance types according to their needs. Juggling instance reservations in this environment was challenging and wasteful.

A service process always restarts on a clean Docker image and is stateless. A service can be restarted without concern for losing state. An on-caller can restart a service that looks like it is in a bad state without fear that some pet server will be lost. Since logs are collected centrally there is a record of what happened as well.

Since the Mesos Agent machine itself is stateless and all of the underlying services are tolerant of failure, any single instance can be terminated without worry if a problem arises. Containers can’t run out of disk space, if they use too much disk they will be rescheduled.

The barrier to redeploy all of our services is quite low. They could all be redeployed without any site impact in less than an hour by a single engineer. This gives tremendous flexibility in emergencies; and allows painless roll-outs of tiny changes affecting multiple services.

Resource Efficiency

Many of our services require very few resources, but still need to be highly available. As an example, our internal authorization service is accessed for every web request (~1000 QPS), but consumes less than 1 CPU core. Mesos/Marathon allows us to schedule 3 seperate instances of this critical service, each running on a separate Mesos Agent and provisioned with significantly less than one full CPU core. In aggregate, across dozens of services, the cost savings add up.

Mesos also gives the freedom for a service to change its resources in a single deploy. If you realize your service performs better with a tiny bit more memory, it is super simple to make that change.

The savings for staging environments are even higher, because staging services can remain online while consuming hardly any resources at all. In a less efficient environment, staging services are often entirely neglected in order to cut costs.

Since deciding to seriously focus on getting Mesos/Marathon/Docker into production about a year ago, we found numerous small missing pieces of functionality that we took the liberty to build or configure ourselves. In the meantime, many other projects have implemented overlapping solutions, so our solutions are no longer necessarily the best tools.

Marathon Deploy script

Marathon ships with an HTTP JSON API. The API is a fantastic foundation, but we wanted a little bit more. In particular, we wanted the following main features:

  • Use HOCON for Marathon configuration, because most of our (scala) services already used HOCON for their internal configuration. We also feel that the features of HOCON are fantastic.
  • All production configurations need to be checked into a single repository.
  • While the Marathon UI is useful, most developers wanted to use command line. However, deploying via command line with a raw HTTP API was a bit daunting. We also didn’t want developers using the Marathon UI to create apps.

Along the way, we continually found new features to add to our Marathon deploy script:

  • Command line arguments can override configuration variables
  • Blocks until the deploy completes or times out, while printing the status
  • Subscribes to Marathon event stream and outputs information task status information
  • Immediately fetches and prints logs from any tasks in the new deployment that fail
  • Adds a label to the app with the username of the deployer
  • Verifies that the target Docker image/tag actually exists (a surprisingly common mistake that is difficult to debug)
  • Extends the configuration spec to support our own Marathon tools to support things like autoscaling, autodeploying, and external traffic routing
  • Supports inline configuration of the underlying service in a standardized way

Autoscaling Marathon Apps

Marathon has no built in support for automatically changing the number of instances running for a single app. We wrote a simple tool for this task.

Our Marathon deploy script reads configuration from an autoscale section, validates and injects this config as the value of an autoscale app label. The configuration defines a graphite metric as well as bounds for that metric. When the metric is above the upper bound, instances will be added, and when it is below the bound instances will be removed.

autoscale = { minInstances = 3 maxInstances = 9 cooldown = 900 graphite = { target = mesos.marathon.${id}.cpu_util down = 0.4 up = 0.7 } }

We also modified our deploy script to maintain the current number of instances when deploying an app that is autoscaled.

Marathon instance metrics for a single Marathon app. The red line tracks CPU Cores allocated to this app. The yellow line indicates aggregate CPU utilization out of this quota (on a separate axis). Autoscaling stabilizes CPU utilization of this app to remain at a optimal ~60%.

Autoscaling on Marathon has been great, and can occur almost an order of magnitude faster relative to autoscaling of AWS’s Auto Scaling Groups.

Autoscaling of Mesos Agents

Once individual Marathon apps are able to autoscale, it is necessary to scale the number of Mesos Agents in the cluster to satisfy demand for resources and to save money during non-peak hours. We export metrics about resource utilization from Mesos to Cloudformation, and use an Autoscaling rule that maintains a ~20% overhead of resources at all times. We also configured graceful shutdown of Mesos Agents — containers running are given time to cleanly exit and be rescheduled.

Total CPU Cores available to Mesos, as well as cores actually used by Mesos tasks over a three day period. The number of Mesos Agents changes to satisfy demand.

Adding autoscaling to the cluster has some side effects (both good and bad) that influenced our Mesos usage:

  • Forces a high level of confidence in our Mesos infrastructure — Machines running some part of critical services are regularly terminated throughout the day.
  • Truly forces service developers to write applications that handle failure.
  • Greater risk of encountering bin packing issues, as tasks are rescheduled frequently and without supervision.
  • Problems like disk space leaks or slow memory leaks can go unnoticed, because tasks will be frequently restarted.
  • Has hindered the use of truly long term tasks on the Mesos cluster: Retaining the ability to terminate Mesos Agents without notifying service owners is very valuable as a cluster administrator.

These side effects warrant adopting a hybrid approach: Configure two pools of Mesos Agents, a “stable” set that are fixed and long running, and a “scaling” set that are ephemeral and autoscaling. Using Agent attributes, Mesos tasks can declare if they want to be placed in the “stable” Mesos Agents pool, or the “scaling” Agent pool.

Automatic Deploying

Having only two steps required to deploy a service is great, but doing it in one step is even better. The Automatic Deploying service allows for Marathon apps to be deployed automatically when a Docker image is pushed to our private registry.

The automated deployment service listens to Docker registry events. When a Docker image is pushed to the registry, it looks for Marathon applications with a special autodeploy label in their configuration. The value of this label is an expression that is matched against the Docker image name and tag. For example, “foo” service might be configured with autodeploy = strava/foo-service:master**. This would match against any Docker image tags for strava/foo-service beginning with master*. These tags might then be automatically build by out continuous integration after any push to the master branch of git, or built locally from a developer's machine. The autodeployment services a deployment with the container.docker.image attribute of the matched Marathon app.

We mostly use automatic deploying carefully for staging services, or for non-critical production services.

Docker exec helper script

This is a tool to facilitate debugging of production containers in the simplest possible way. The script allows docker exec command to instead take a Marathon app ID and run Docker commands on a Marathon managed container on a remote Mesos Agent. A developer can get a shell or exec and command on an instance of their service with a single command from their local machine.

Internally, the script consults the Marathon API to get the Mesos Agent host that is running one of the app tasks, and the Mesos API to get the specific Docker container ID. Finally, it runs the docker command with DOCKER_HOST set to the Mesos Agent's docker socket, and the container ID of the remote container. The Mesos Agent's Docker daemon socket port is exposed to a network that only engineers can access.

Service Discovery, Load Balancing, Network Restrictions

Mesos also manages ports as a resource. A Marathon task thus gets some random ephemeral port on the Agent where it is run. Most of our services use finagle which can use zookeeper server sets for service discovery. Finagle clients know how to consult zookeeper to resolve a static service path into a dynamic set of ephemeral host/ports.

For HTTP load balancing, we so far have used Bamboo plus Amazon ELBs for edge traffic. We have multiple ELB/Bamboo stacks, each with different security group rules (External Traffic, Internal Traffic, Internal Admin Traffic). Marathon apps configure which ELB/Bamboo they would like to use, as well as a hostname. Inbound traffic is routed by the host HTTP header to the correct set of instances.

Service discovery, load balancing, and networking is by far the most fragmented part of the Mesos Stack. There are numerous other tools related to networking that we would like to explore in the future as our needs grow:

We don’t yet have a solution for network restrictions between containers. Our legacy deploy system relied on AWS Security Groups for access rules between different services. Currently, all Mesos tasks run in the same security group. This is a bit unfortunate but has worked for us so far.

Centralized Logging

Strava uses a Elastic Search / Logstash / Kibana (ELK) stack for centralized logging. Each Mesos Agent runs a Logspout container which pulls all logs from colocated containers and ships them to the ELK stack. Log messages are annotated with metadata including the Agent host, container ID, and Marathon app name.

Metrics Collection

Strava uses graphite for metric storage. A collection of simple scripts were written to pull metrics into graphite. Metric sources include:

  • Docker stats API
  • Mesos Master and Agent API
  • Marathon API

Our metric viewing frontend has a template to give all aggregate stats for a single Marathon app.

IAM Policies

IAM Policies regulate access to AWS APIs for resources such as s3, SQS, etc. We found a brilliant project called ec2metaproxy. This proxy runs on each Mesos Agent and basically intercepts all calls to the AWS API made by any container. In particular, it intercepts the call to the AWS api made when a container asks for temporary security credentials. Instead of returning security credentials for the Agent’s IAM Role, it first looks at what IAM Policy the container is configured to use, and instead requests credentials only for that role, using the role of the Mesos Agent itself. The end result is that a developer can configure the IAM Policy for their app directly as part of the Marathon configuration.

Volume Support

We configured REX-Ray as a Docker volume driver and use AWS EBS volumes for a few Marathon apps. There are numerous limitations to this approach, but it has been helpful for a few particular cases.

Slack integration

We also wrote a simple service that sends Docker image push notifications and Marathon deploy status updates to a company slack channel. We will soon have this also provide configuration diffs for all deployments as well.

Marathon Inception

We thought Marathon was such a good way to deploy things, we deploy Marathon on Marathon. We do however run dedicated Mesos Master machines, as well as dedicated Zookeeper machines. Aside from those, the only type of machine needed to run is the Mesos Agent.

Mesos now provides obvious benefits to our infrastructure, but getting there was not without uncertainty and investment. Timing combined with a small team dedicated to the project (mostly only myself) made us an early adopter. There were numerous time consuming issues to figure out and even help solve upstream. However, the ecosystem around Mesos/Marathon/Docker has rapidly improved and continues to evolve. An organization like ours starting with Mesos today would have significantly less work cut out for them, and the benefits would be much more immediate.