Cluster Management at Chartbeat — Part 1

Rick Mangi
Chartbeat Engineering
Aug 2, 2017

Cluster Management at Chartbeat is a series of posts about how we deploy and manage services running on Apache Mesos and Aurora. The series covers a wide variety of subjects including Pants, HAProxy, Kafka, Puppet and OpenTSDB.

Introduction

Chartbeat handles massive amounts of traffic and we crunch a ton of data. Just over a year ago, as our EC2 instance count approached 1,000, we decided to migrate as much of our infrastructure as we could to Apache Mesos, using Apache Aurora as our management framework and Pants for managing our Python deployments. Today, the project is nearly complete, and it has been a huge success in terms of improving engineering efficiency, reducing our server footprint and providing faster and more reliable services to our customers.

Larry Wall once said, “The three chief virtues of a programmer are: Laziness, Impatience and Hubris.” We are a stubborn and opinionated group of engineers at Chartbeat; we like to automate things and we don’t like to wait. Laziness in this quote does not imply a lack of work ethic, quite the contrary. Once we chose our solution, we spent a lot of time operationalizing our cluster with custom tools that make our lives better by automating our workflow and providing the insights we need to keep things running. Anyone who has spent time in the DevOps world knows that the engineering required to maintain a large and complex environment is demanding, and the solutions can be quite creative. The tools and techniques we have adopted or developed are the focus of this series, and at the end of this introductory post I will outline what’s to come.

While I am extremely proud of the work we’ve done, by no means do we claim to have all the answers for anyone else embarking on a similar project. We believe in openness and sharing work in order to inspire others and give back some of our learnings to the community. A large part of programming is building on the work of others, which is why we believe in open source and sharing learnings, both good and bad. While this series will definitely get deep into some code, it’s more about the architecture and the approach we take to solving DevOps problems.

The Project

Chartbeat was around six years old, a point where most successful startups find themselves with a significant amount of tech debt and start considering newer techniques for optimizing their DevOps. We’re not a large team: our entire engineering staff is around 25 people, and of those we have five on our “platform” team, which includes DevOps. For us, the problem was that while we had a well-architected system for deploying our array of services to their respective Puppet-managed instance types, every new product we built was deployed to its own set of servers. That was leading to wasted CPUs and RAM, a steady stream of Nagios alerts for random things needing to be rebooted, and a sprawl of Puppet code that was becoming hard to maintain. Additionally, while we adore Python, managing package dependencies is a real headache, and we knew that anything we did would have to allow isolation of Python dependencies at a per-job level.

Based on our existing problem set and our (rather opinionated) engineering mores our project goals were essentially:

  1. Make life better for our engineering team
  2. Reduce server footprint and overall costs
  3. Provide faster and more reliable services to our customers

Additionally, whatever third-party product we chose had to:

  1. Solve Python dependency management and isolation
  2. Play nicely with our current workflow and be hackable
  3. Be Open Source and supported by an active user community using it in production
  4. Allow us to migrate our services over time in a safe manner with no downtime

Background

Before diving into our solution, it’s important to understand what the servers we wanted to migrate were doing. Our users are the editors of major newspapers and blogs. We provide them dashboards and APIs that allow them to know — in real-time — how well every article is performing. Our metrics are “concurrent users” and “engaged time”.

I’m not going to get into Chartbeat as a product; it’s not really important for understanding our solution beyond knowing that the vast majority of our data comes from a JavaScript beacon that sends us “pings” with data about which page a site visitor is reading and where they are on the page. When they leave the page, we’re notified and we can call that “session” complete. These pings come in at a rate of around 275K/second on an average day. If you’re curious what this looks like to our users, you can see it here.
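
To give a rough sense of the kind of data a ping carries, here’s a hypothetical payload; the field names are illustrative and not our actual beacon schema.

    # Hypothetical ping payload -- field names are illustrative, not our real schema.
    ping = {
        "host": "example-news-site.com",
        "path": "/2017/08/02/some-article",
        "visitor": "anon-token-1234",   # anonymous visitor identifier
        "engaged": True,                # was the visitor active since the last ping?
        "scroll_depth": 0.62,           # rough position on the page (0.0 to 1.0)
        "ts": 1501689600,               # unix timestamp
    }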

There are many parts to the various products we’ve built around this data. We use a number of databases, including memoryfly (our proprietary real-time database), Mongo, Amazon Redshift, Postgres, Riak, Apache HBase and Amazon Athena. When pings come in, they are bifurcated into two pipelines. One pipeline sends pings into memoryfly, which, as the name suggests, is an in-memory database written in C that answers (most) real-time queries through an embedded Lua interpreter. This database is the oldest piece of code at Chartbeat, and while we do make changes to it, it can be considered “steady-state” and it serves us well. We decided early on that it would not be part of phase 1 of our migration.

The second pipeline starts with pings being fed into Apache Kafka. This is where most of our product engineering happens. The general pattern for these products is that a consumer (usually written in Clojure) reads pings off the ping topic (where every ping starts), does some transformation or aggregation, and writes data to a database and/or another Kafka topic. Additional consumers may process the data in subsequent topics. Two of our Kafka pipelines are on the order of 3–4 consumers deep, while the others are much simpler but still follow the same db/api pattern. We don’t currently use Spark, Storm, Flink, Samza or any other streaming framework, although we’ve built prototypes with most of them. I’m sure we will at some point, but so far we haven’t found the need. A typical pipeline looks something like this.

A typical Kafka pipeline at Chartbeat
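
To make that consumer pattern concrete, here’s a minimal sketch in Python using the kafka-python client. Our real consumers are mostly Clojure, and the topic names, aggregation and downstream writes here are placeholders, not our actual code.

    import json
    from collections import Counter
    from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

    consumer = KafkaConsumer(
        "pings",                                    # placeholder topic name
        bootstrap_servers=["kafka01:9092"],         # placeholder broker
        group_id="pageview-rollup",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    producer = KafkaProducer(
        bootstrap_servers=["kafka01:9092"],
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    counts = Counter()
    for msg in consumer:
        ping = msg.value
        counts[(ping["host"], ping["path"])] += 1   # toy aggregation
        if sum(counts.values()) >= 1000:            # flush periodically
            for (host, path), n in counts.items():
                # write to a database and/or a downstream topic
                producer.send("pageview-counts", {"host": host, "path": path, "n": n})
            counts.clear()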

Our API servers (usually written in Python) read from the various databases and, for the most part, just do simple combining and formatting of data, along with access control checks. Some of these respond to requests for real-time data, some serve historical data, and some are called internally by one another. You might call them micro-services; I hesitate to use that term, but I might just not like assigning labels to things.
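
As a sketch of what one of these API services looks like (with made-up endpoints, auth and data-access helpers standing in for our real ones), a thin Flask app is close enough in spirit:

    from flask import Flask, abort, jsonify, request

    app = Flask(__name__)

    # Placeholder helpers -- in reality these would check our auth system
    # and query our databases.
    def key_can_read(api_key, host):
        return api_key == "demo-key"

    def realtime_top_pages(host):
        return [{"path": "/a-big-story", "concurrents": 1234}]

    def historical_top_pages(host):
        return [{"path": "/a-big-story", "pageviews": 98765}]

    @app.route("/v1/<host>/top_pages")
    def top_pages(host):
        # Access control check first.
        if not key_can_read(request.args.get("apikey", ""), host):
            abort(403)
        # Simple combining and formatting of data from two sources.
        return jsonify({
            "host": host,
            "realtime": realtime_top_pages(host),
            "historical": historical_top_pages(host),
        })

    if __name__ == "__main__":
        app.run(port=8080)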

Finally, we have what we call “workers”. These can mostly be thought of as ETL jobs. They either run as crons or continuously read from a RabbitMQ queue. They do things like roll up data every 15 minutes and insert it into a different database, or read from two databases and write the result to a third.
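
A typical cron-style worker looks roughly like the sketch below; the window math is real, but the source and destination functions are placeholders for our actual database clients.

    from collections import defaultdict
    from datetime import datetime, timedelta

    def fetch_raw_pageviews(start, end):
        # Placeholder: in reality this queries one of our databases.
        return [{"host": "example.com", "path": "/story", "views": 3}]

    def write_rollup(rows):
        # Placeholder: in reality this inserts into a different database.
        print("writing %d rollup rows" % len(rows))

    def run_15_minute_rollup(now=None):
        now = now or datetime.utcnow()
        end = now.replace(minute=(now.minute // 15) * 15, second=0, microsecond=0)
        start = end - timedelta(minutes=15)

        totals = defaultdict(int)
        for row in fetch_raw_pageviews(start, end):
            totals[(row["host"], row["path"])] += row["views"]

        write_rollup([
            {"host": h, "path": p, "window_start": start.isoformat(), "views": v}
            for (h, p), v in totals.items()
        ])

    if __name__ == "__main__":   # run by cron every 15 minutes
        run_15_minute_rollup()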

In our old setup, a single “product” consisting of a three-step Kafka consumer pipeline writing to two databases along the way and offering an API would have required adding 50–75 servers to our fleet (plus any new databases). While EC2 makes it easy to spin them up, and Puppet does a reasonable job of managing them, maintaining a large fleet is inherently difficult, wasteful and time-consuming. Iterative development also becomes slow when it takes 15 minutes to replace a server (ironic coming from someone who used to rack his own boxes).

It should be obvious at this point that we have a very complex system. I haven’t even discussed code builds and deployment, routing, monitoring or security. All of these were top of mind, and actually a larger concern than running our pipelines.

Choosing a Solution

In terms of options for cluster management, there were essentially three (and apologies to any we didn’t really consider): Mesos, Kubernetes, and deeper AWS integration with ECS, Lambda, etc. We didn’t want to dive further into AWS’s stack because we don’t like black boxes, so while we did prototype deploying a service to ECS, and we do use Lambda for a few things, we wrote this off as a choice early on.

Kubernetes and Apache Mesos were both relatively new when we started the evaluation, but had been under development for a number of years. Kubernetes was developed at Google, based on their Borg system. Mesos began as a research project at UC Berkeley and eventually became an Apache project. While both attempt to solve the general problem of cluster management, their approaches differ and each has its own pros and cons. After a few days of reading docs and whiteboarding possible issues with each, we decided that Mesos was more flexible and, for better or for worse, its future seemed more certain.

At the time, the future of Kubernetes wasn’t as clear as that of Mesos. Stewardship of Kubernetes was being transferred to the Linux Foundation, and we weren’t convinced that it would continue moving ahead quickly (although it has). Mesos, on the other hand, was being used in production by a number of large companies, already had several frameworks available, had books written about it, and a new company (Mesosphere) had just been formed to maintain it.

Mesos has a number of job scheduling “frameworks”; the two most generic ones at the time were Marathon and Apache Aurora. It may have been bad timing, but Mesosphere had just announced DC/OS (presumably to replace Marathon), and we didn’t think it made sense to jump into a product that was about to be transformed into something new. At the time Aurora had, and continues to have, a strong community of engineers using and maintaining it at companies like Twitter (which built it) and Airbnb.

I would be remiss if I didn’t mention another major technology we adopted for this project. Deploying code to a Mesos framework requires a way of bundling that code into a container of some sort. We have yet to adopt Docker for code deployments, so it didn’t bother us that Aurora didn’t yet support Docker (it does now), but we still needed a way to deploy. Pants (also from Twitter), a great dependency management and build framework for Python, bundles code and dependencies into PEX files, which are essentially executable archives. This allows us to ship a single file with all of our code and dependencies for a service and run it from the command line. You can think of a PEX as an uberjar for Python.
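
For a rough idea of what that looks like (the target names and repo layout here are invented, and the syntax is the Pants 1.x style we were using at the time), a service’s BUILD file might be:

    # src/python/myservice/BUILD -- hypothetical target, Pants 1.x style
    python_binary(
        name='myservice',
        source='main.py',
        dependencies=[
            '3rdparty/python:flask',            # third-party requirement target
            'src/python/chartbeat/common:lib',  # an internal library target
        ],
    )

Running something like ./pants binary src/python/myservice:myservice then drops a single myservice.pex into dist/, and that one file is what we ship and execute.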

Refining the Scope

Clearly, moving our entire infrastructure into Mesos wasn’t going to happen in a single year, and anything a six-year-old company decides to do should provide value in that timeframe. We decided to leave databases out of the migration, along with memoryfly. Additionally, we left two other large, fairly static clusters, including our Varnish servers, out of the first phase.

There are various reasons for leaving databases out; they are special-purpose with particular requirements, and in general we maximize the capacity of the servers they run on, so there isn’t likely to be much ROI in terms of cost savings. They’re also too important to muck with, and the risks involved in moving them seemed too high, at least for now. While some people do run databases in Mesos, Aurora doesn’t provide consistent on-disk state for jobs that are restarted, which means a database rescheduled to a new server would have to pull its data from another location, causing very slow restarts.

This left about a third of our servers (just over 300) as candidates for migrating.

Where We Are Now

Currently, of the 300 or so servers we wanted to migrate, we have fewer than 50 to go, most of them running our top-level Kafka ping consumer, which we’ve had trouble migrating. In addition, we’ve wound up moving a lot of our engineering tools (TSDB, Burrow, Grafana…). Our Mesos cluster is only 40 servers (most of them quite large), and we’ve reduced costs and improved engineering quality of life significantly. We get far fewer random alerts for servers needing attention, and the environment is much easier to maintain.

The largest win for engineers was how much easier deployments became. In our previous environment, a new system would require several complex new Puppet files, start scripts, virtual environment wrappers and modifications to multiple configuration files. New engineers would take days to get a first deployment out, and we had to do detailed PRs for every change. Now any engineer can deploy new code in literally a matter of minutes. As evidence, the first non-platform engineer to attempt an Aurora deployment did it in one hour just by looking at our examples.

The second-biggest win was reducing the flood of on-call pages for simple server operations (like full disks or “wedged” services). Thanks to Aurora’s disk space allocation and health check features, the vast majority of those issues are now handled automatically.
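
I’ll cover our Aurora setup in detail in the next post, but as a taste, here’s a minimal sketch of a job definition with health checking enabled; the cluster, role, resources and port wiring are placeholders rather than one of our real jobs.

    # myservice.aurora -- illustrative only; names and numbers are made up.
    main = Process(
        name='myservice',
        cmdline='./myservice.pex --port {{thermos.ports[health]}}',
    )

    task = Task(
        processes=[main],
        resources=Resources(cpu=1.0, ram=512 * MB, disk=1 * GB),
    )

    jobs = [
        Service(
            cluster='chartbeat',      # placeholder cluster name
            role='www-data',
            environment='prod',
            name='myservice',
            instances=3,
            task=task,
            # Aurora's health checker polls the service and restarts instances
            # that fail too many checks in a row.
            health_check_config=HealthCheckConfig(
                initial_interval_secs=15,
                interval_secs=10,
                max_consecutive_failures=3,
            ),
        ),
    ]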

Proof of Concept and Challenges

The first step in any project of this scale is a proof-of-concept using a “steel thread” test. The idea is to deploy something that hits every point in your stack so you can expose any difficult integration issues first. Needless to say, just whiteboarding what a deployment would look like surfaced numerous integration challenges that needed to be addressed, and we set out to solve them. The reality of migrating a production environment is that every part of our workflow, from build systems through routing and server deployments, needed to be integrated (at least to some degree) before we could declare our POC complete and even consider pointing production traffic at it. While we certainly had the freedom to make massive changes, some approaches we would have taken if we were starting from scratch were off the table for various reasons.

The solutions we settled on for each integration challenge turned out to be the foundation of our cluster deployment, and future posts will explore more deeply how we solved them. The first step to everything, however, was determining how we would work with Aurora, the subject of my next post.

Part II — How we use Aurora

Blog Post Backlog (subject to grooming)

Following is the list of articles we’re currently working on in this series with the smallest hint of what to expect. Look for them soon! I will link back to them here when they’re published.

Our approach to Aurora: (published 8/28/17) Aurora is a very flexible product which allowed us to create a set of reusable components for deploying services that fit in with our workflow. We didn’t just adopt Aurora’s way of doing things, we made it our own with a set of custom tools and templates.

Version and Build Management: Pants, Git and a LittleFinger. No complex system can get by without a strict process for deploying code. Pants brought order to our Python chaos and has proven to be the core technology not only for bundling our Python code but also for building the tools our engineers use to interact with our cluster.

Request Routing: Integrating with Synapse and HAProxy. Synapse is a clever bit of code that dynamically rewrites HAProxy configs using Aurora’s job information in real time. By combining this with Puppet, we were able to migrate our services slowly, and it now underpins our ability to move routes between backend services running in our cluster.

Monitoring and Healthchecking: A Lot to Say about a Mundane Topic. We might actually have enough for a few posts here. We like our graphs, and they’re built with TSDB and Grafana. We also like to be alerted when things change, and we use Nagios for that. We’ve written a few tools to help us here and will discuss them. We’ve also learned a few things about automating healthchecks for jobs running in Aurora.

Log Shipping/Analysis: Dealing with log files is actually a very difficult problem… assuming you want to read them. We’ve tried a few things and have settled on shipping them to S3 and analysing them with Athena for historical forensics. Additionally, we’ve written a tool for combining the real-time logs for multiple instances of a job running in Aurora into a single console stream.
