Cluster Management at Chartbeat — Part 2

Rick Mangi
Chartbeat Engineering
23 min read · Aug 28, 2017

How we use Apache Aurora

Cluster Management at Chartbeat is a series of posts about how we deploy and manage services running on Apache Mesos and Aurora. The series covers a wide variety of subjects including Pants, HAProxy, Kafka, Puppet, AWS and OpenTSDB. You may want to start at the beginning.

Note: Our cluster is running Apache Aurora on Mesos. If you’re reading this because you’re just curious about the Mesos world (or interesting engineering stories), I encourage you to read at least the background on their website before continuing. Some familiarity with the way Aurora works will definitely make this a more enjoyable read.

Introduction

Note: all of this is based on our existing Aurora 0.15 cluster. We just started kicking the tires on the recently released Aurora 0.18 (so far so good!) and we’re excited to take advantage of the improvements. Fundamentally we would not have changed much about our approach, but it’s important to make this clear from the beginning since some of our decisions were based on functionality that was missing or not quite working yet.

This post covers how we went about integrating Aurora into our environment. Rather than simply migrating all of our existing deployment systems and processes to Aurora’s workflow, we spent a lot of time creating a wrapper around the Aurora client and a set of templates and tools that adapted it to ours. As a result, we’ve dramatically improved the time it takes engineers to deploy new services while improving the reliability and monitoring of our services at the same time.

As a “platform” team, our main customers are the rest of the engineers at Chartbeat. We define our team mission as — To build an efficient, effective, and secure development platform for Chartbeat engineers… We believe an efficient and effective development platform leads to fast execution. Execution time, in this context, refers to the ergonomics of deploying code. A smooth workflow is incredibly important for rapid development. Nobody wants to work in an environment where it takes 10, or even 5, minutes to push a change to production. Forcing product engineers to spend hours wrangling Puppet to configure a set of servers for their new project was starting to become a major pain point.

The ergonomics and workflow of programming are extremely important, especially in DevOps. I cooked in a restaurant in college so I love to make analogies to that industry. Writing code as a product engineer is sort of like being a prep cook. You can put on some music and chop away by yourself at a comfortable pace. There’s very little pressure to do things right now (but there’s plenty of other pressure). DevOps, on the other hand, is a dinner rush. You need an organized workspace, a sharp knife and focus. It feels more like a contact sport. There’s a reason why every restaurant maps out exactly which cooler bin contains what ingredient. Chefs don’t have time to try to find the parsley, it’s always in the same place. They also have to trust that if a pan is sitting on a shelf it’s not going to burn them when they grab it.

The way this plays out in organizing the ergonomics of your workflow as a DevOps engineer can be nuanced and personal, but painfully obvious when done wrong. Having to type extra parameters for commonly typed commands or needing to take multiple steps to deploy fresh builds are the types of things that slow operations down and cause grief. Bigger decisions, like whether to store job configurations in a database, git or a 3rd party tool can be life changing and need to be thought through. In a world where product engineers push code to production it’s even more important to make these steps bullet-proof. We spent a lot of time just whiteboarding what it would look like to launch a job in an Aurora world and what it would mean to our engineers’ day to day lives.

There are always tradeoffs to be made when adopting a major piece of technology. A core decision is whether to do things the way the new tool prescribes or the way you have been doing them in the past. Every framework has a set of “patterns” it wants you to follow, and Aurora is no different. In general we were very happy with the Aurora “way”, but we have a way of doing things that we did not want to simply throw away, and we’re quite stubborn. That’s not to say we weren’t willing to change; change is inevitable and clearly we were looking for improvements. We just wanted to adhere to some of the core principles of our workflow.

Our plan for integrating with Aurora was to build a layer between our engineers and the Aurora CLI that imposed a rigid workflow.

It may seem counter-intuitive to choose a tool based on its flexibility and then give our users a strict framework, but it’s actually quite fundamental to good DevOps and not as dictatorial as it may sound. One selling point for us with Aurora was the templating features which allowed us to essentially create our own DSL and CLI for configuring and managing jobs.

Side note — one of our engineers joked during this project that we were reinventing Puppet in Aurora. There are a lot of things we love about Puppet and we continue to use it to manage our fleet so this comment was definitely based in truth.

Background

There is never a one-size-fits-all solution in software engineering. Different tools make sense for different scenarios. What makes sense for Facebook doesn’t necessarily make sense for a five person startup. The software you’re building is only one factor in the decision making process. Team size, skills, work habits and culture all need to be considered along with the technical product requirements. Most organizations develop a workflow over time that finds a (constantly evolving) happy medium between engineering culture and whatever DevOps technology was adopted or built over the years to support them.

Our Team

We divide our engineers into two groups — product engineers who work on new customer facing projects and platform engineers who handle DevOps, architecture, R&D and support legacy products. Our platform team has five engineers and product engineering is around 20. Product engineering includes several groups of engineers (backend, frontend, data science, etc.). For the sake of discussion it’s everybody who pushes code and is not a platform engineer.

Our Workflow

Our software development workflow is a fairly common one, it’s based around a few principles:

  1. Everything lives in a single git repo, master is the source of truth
  2. Deployment of services (api servers, kafka consumers, workers, whatever) can be done by a product engineer to dev or production from a specific githash
  3. Everything can be done from a command line prompt
  4. Whatever can be scripted/automated should be scripted/automated
  5. Code should be tested by the CI server before it gets merged and deployed
  6. Collect metrics on everything and alert on problems immediately

In practice, we push code frequently. Generally (pre-Aurora) engineers work on either their local Mac or a Vagrant VM running Ubuntu. Most code changes flow from local -> dev -> prod. Engineers push to git, get a PR approved and then run a fab command to deploy their githash to production.

Right now, our workflow is essentially the same. It looks like the diagram below. An engineer pushes code to GitHub, which triggers a build in Jenkins. If the build succeeds, its artifact (usually a pex or an uberjar) is pushed to an S3 bucket and a record of the build is entered in our build database (littlefinger, the master of coin). When a user wants to run a job in Aurora they run a command which checks that the build exists, parses our custom aurora configs and then calls the Aurora client with the resulting environment. When a job is scheduled by Aurora, the thermos executor fetches the artifacts from S3 and installs them into the local sandbox.
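
To make that concrete, here is a rough sketch of the wrapper’s pre-flight flow. The helper import, the variable names beyond the CB_* ones shown later in this post, and the file paths are assumptions standing in for our internal code, not the real thing.

# Rough sketch of the wrapper's pre-flight flow (names and paths are
# illustrative, not our actual internal modules).
import os
import subprocess
import sys

import yaml

from cb.builds import littlefinger_has_build  # hypothetical helper, sketched later in this post


def create_job(job, stage, githash):
    # each job has exactly one yaml config checked into the repo
    config = yaml.safe_load(open('aurora/configs/%s.yaml' % job))

    # refuse to launch anything littlefinger doesn't know about
    if not littlefinger_has_build(config['buildname'], githash):
        sys.exit('no successful build of %s for githash %s'
                 % (config['buildname'], githash))

    # hand the resolved settings to the stock aurora client via the environment
    env = dict(os.environ, CB_BUILDNAME=config['buildname'],
               CB_GITHASH=githash, CB_AURORA_STAGE=stage)
    jobkey = 'bb/%s/%s/%s' % (config['user'], stage, job)
    subprocess.check_call(
        ['aurora', 'job', 'create', jobkey, 'aurora/%s.aurora' % config['file']],
        env=env)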

Adapting Aurora

Aurora includes a scheduler and a client. The client is a python executable (a pex file) that parses a given .aurora file, finds a specified job and calls the scheduler via thrift to request whatever command the user asked for.

For example (from the Aurora Hello World docs; please go read them):

aurora job create dev/www-data/devel/hello_world hello_world.aurora

This will create the hello_world job in the devel stage, running as www-data on the mesos cluster named dev. The job information is all specified in hello_world.aurora. The aurora client accepts other commands to perform basic CRUD operations on jobs (update, inspect, kill, etc.).

The .aurora file is python, but it’s parsed by the client. The templates are based on pystachio, which provides type checked structs that are evaluated at execution time by either the aurora client or the scheduler (depending on the field). The idea is that you may know that you need a port to bind a webserver to in your job, but the port number is assigned by Aurora when the job starts running so it can’t be resolved until then. Similarly, you might have options like database connection strings that differ between stages that you want to assign programmatically before the job starts running.

These properties can be used in creating .aurora files with mustache style templates. For example, these two entries will be replaced with the http port assigned to the job instance by Aurora at runtime and some sort of hbase_master url specified in a struct called myservices which will be evaluated by the Aurora client.

{{thermos.ports[http]}}
{{myservices[hbase_master]}}
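
Under the hood this is plain pystachio. A tiny standalone example (nothing Chartbeat-specific) of how these templates behave:

# Standalone pystachio example: templates stay symbolic until bound.
from pystachio import String

tmpl = String('connect to {{myservices[hbase_master]}} '
              'on port {{thermos.ports[http]}}')

# interpolate() returns the value plus any refs that are still unresolved
value, unbound = tmpl.interpolate()
print(unbound)   # both refs are still unresolved at this point

# bind() fills in what the client knows now; Aurora binds the rest at runtime
partly_bound = tmpl.bind(myservices={'hbase_master': 'hbase01:9000'})
print(partly_bound)   # only the thermos port ref remains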

My first thought when looking at Aurora templates for creating jobs was “wow, this is cool!” My second thought was “this could get messy real quick.”

The last post described the main types of programs we run, with the vast majority falling into three groups: workers (python crons or queue consumers), api servers (usually python) and kafka consumers (clojure). Software engineering is about finding patterns and abstracting solutions to fit those patterns. DevOps is no different: give people an easy-to-use template to follow or they will invent their own time and time again, leaving you with endless piles of tech-debt.

We had great success in providing puppet classes for repeatable tasks like installing runit scripts to manage an apiserver along with the nginx routes, authentication proxy, log file slurper and stats collector that run alongside it. When you have templates like this, creating a new service is fairly easy, and so is scaling. Creating 20 apiservers running different code is no big deal; they all work the same way, just with different parameters. Try doing this without a set of generic puppet classes and you’ll quickly find yourself with a dozen differently configured servers and a maintenance nightmare.

One of the problems with a puppet based environment, even with a set of custom templates, is that puppet is incredibly complex for non-DevOps savvy engineers. Not only is the language complicated, it requires a somewhat sophisticated knowledge of the operating system to do much more than simple tasks since you’re actually operating at that layer. This is one of the main problems we solved. Every engineer at Chartbeat feels completely comfortable writing, launching and managing Aurora jobs — a huge shift from where we were a year ago.

Knowing that taking the time upfront to create our scaffolding was well worth the effort, we did just that. We approached this migration as a tech-debt project which gave us the freedom to throw away a lot of old work and essentially build a 2.0 infrastructure that fit our needs based on what we liked about 1.0.

First Migration

The first thing we tried running in Aurora was an apiserver called eightball. It’s a python process and is fairly typical of our apis. It handles four endpoints, which fetch data from a mongo server, do some formatting and return JSON. It gets fairly high traffic, around 15K rpm.

Setting up the endpoint required the following:

  1. Fetch a pex file (we’ll talk pants/pex in another post) from s3 and put it in the sandbox
  2. Install “authproxy”, a proxy we use for managing access rights to endpoints (also a pex file stored in s3). This proxies every request
  3. Install and/or configure an aurora healthcheck
  4. Configure haproxy to route to this endpoint
  5. Configure the apiserver for the correct database, etc.
  6. Determine required resources to handle the traffic
  7. Ensure that stats were being reported correctly and monitoring was in place

Needless to say, we need prod/dev versions of most endpoints and we want to be able to canary changes (send some live traffic to a new version for testing and make sure it doesn’t die).

Getting this working was not hard at all; once we implemented Synapse for managing our haproxy configs (the subject of a future blog post) it was much easier than we expected.

Making it Easy

Looking back at the list of our workflow principles and applying them to what we learned deploying eightball we decided to do a few things:

  1. Put all aurora configs into a single directory in our repo
  2. Create a yaml file for each job extracting out mutable configurations without changing the .aurora file (more on this in a minute)
  3. Wrap the Aurora client in our own CLI offering a sub/super-set of aurora commands with our own grammar to support our workflow
  4. Create a library of Aurora templates (literally a python module) for everything we possibly can so that end users are all doing things the same way and we wind up with .aurora files that are all similar
  5. Create a database backed service that can verify builds before they are deployed to production and allow the tagging of specific builds with an identifier that can be deployed quickly
  6. Use aurora’s notion of “role” to map to a unix user per product scheme for quotas and reporting
  7. Maintain 2 clusters, aa and bb for testing out and deploying new versions of Aurora (clearly since we’ve been on 0.15 for a year this didn’t get much attention)

This has gone through quite a few iterations, but we have settled on a very nice workflow that Works for Us. We set a goal of migrating half of our jobs to Mesos over the subsequent year and showing a significant reduction in our server footprint.

There are a few major changes from a design standpoint between what we have and what a user gets off the shelf with Aurora. The most important is that we have a 1 to 1 mapping between Aurora job and a yaml file. Aurora has a many-to-one mapping between a .aurora file and an Aurora job. That is to say, in the Aurora world you can define multiple jobs in a single .aurora file. When you create a job you specify both a file and a job:

aurora job create cluster/foo/devel/hello_world hello_world.aurora

There are some great things about this: it logically groups related jobs together. For ergonomic and long term configuration management reasons, we didn’t like it. We currently have 270 jobs defined, any of which can run in prod or devel. I know this because:

± |master {18} ?:24 ✗| → ls aurora/configs/*.yaml | wc -l
270

This makes it very easy to write scripts that iterate over every job in the repo to find the job you’re looking for. It also makes it easy to make global configuration changes, like adding resources to all the jobs in a given list.
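
For example, a throwaway script to bump prod RAM for a list of jobs is just a loop over those files. A sketch (hypothetical job names, and assuming the per-job yaml layout shown a bit further down):

# Sketch: bump prod RAM for a handful of jobs by editing their yaml configs.
import glob
import os

import yaml

JOBS_TO_BUMP = {'eightball', 'some_other_api'}   # hypothetical job names
RAM_BUMP_MB = 100

for path in glob.glob('aurora/configs/*.yaml'):
    job = os.path.splitext(os.path.basename(path))[0]   # yaml name == job name
    if job not in JOBS_TO_BUMP:
        continue
    cfg = yaml.safe_load(open(path))
    prod = cfg.setdefault('envs', {}).setdefault('prod', {})
    prod['ram'] = prod.get('ram', cfg['config']['ram']) + RAM_BUMP_MB
    with open(path, 'w') as f:
        yaml.safe_dump(cfg, f, default_flow_style=False)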

Additionally, grouping multiple jobs into a single file naturally means that the file has the potential to be large and complex. One of the main reasons we wanted to get away from puppet for this class of work was to make things simple. By making the decision that a job was always defined by a single yaml file we made it very easy to bump the RAM on a job that’s flapping in the middle of the night. It’s right there in the file with the same name as the job that’s paging. By deciding that an aurora file is only defining a single job we’ve limited the potential complexity of the file.

Just to be clear — that doesn’t mean that multiple processes can’t be defined in a job. See Jobs, Tasks and Processes on the Aurora website for the nomenclature.

This is what our aurora file looks like for eightball today:

PROFILE = make_profile()
PEX_PROFILE = make_pexprofile('eightball')
SERVICES = get_service_struct()

install_eightball = pex_install_template

eightball_server = Process(
    name='eightball',
    cmdline=('./{{pex.pexfile}} --port={{thermos.ports[private]}} '
             '--memcache_servers={{services.services[memcache_servers]}} '
             '--series_daily_host={{services.services[series_daily_replicaset]}} '
             '--cellar_host={{services.services[cellar_replicaset]}} '
             '--history_host={{services.services[cellar_replicaset]}} '
             '--alert_host={{services.services[cellar_replicaset]}} '
             '--workers={{profile.taskargs[CB_TASK_WORKERS]}} '
             '--console '
             '--logstash_format=True '
             '--logstash_tags=eight_ball ')
)

auth_proxy_processes = get_authproxy_processes()

health_check_processes = get_proxy_healthcheck_processes(
    url="/private/stats/", port_name='private')

MAIN = make_main_template(
    ([install_eightball, eightball_server],
     auth_proxy_processes, health_check_processes,),
    res=resources_template)

jobs = [
    job_template(
        task=MAIN,
        health_check_config=healthcheck_template(
            timeout_secs=5,
            interval_secs=5,
            max_consecutive_failures=5
        ),
        update_config=get_update_config(t='fast')(watch_secs=41),
    ).bind(pex=PEX_PROFILE, profile=PROFILE, services=SERVICES)
]

And it has a corresponding yaml file that clearly lays out most things a user might want to change once this service is live.

file: eightball
user: cbe
buildname: eightball
hashtype: git
arch: trusty
config:
  cpu: 0.25
  num_instances: 1
  ram: 300
  disk: 5000
  taskargs:
    workers: 10
envs:
  prod:
    cpu: 1.5
    num_instances: 12
    taskargs:
      workers: 34
    githash: 60a43FOO363b
  devel:
    githash: 60a43BARdbdb

When this is all evaluated, the resulting job configuration looks like this (edited slightly for clarity and to obfuscate server names):

Job level information
  name:       'eightball'
  role:       'cbe'
  contact:    '<class 'pystachio.composite.Empty'>'
  cluster:    'bb'
  instances:  '12'
  service:    True
  production: True
Task level information
  name: 'envvar_setup'
  constraints:
    envvar_setup
    pexinstall < eightball
    authproxy_install < authproxy
    proxy_healthcheck_install < proxy-healthcheck
Process 'envvar_setup':
cmdline: mkdir ./tmp && echo 'export TMPDIR=./tmp
export CB_AURORA_INSTANCE={{mesos.instance}}
export CB_AURORA_HOSTNAME={{mesos.hostname}}
export CB_AURORA_TASK_ID={{thermos.task_id}}
export CB_AURORA_STAGE=prod' > .thermos_profile; echo '{}' > service_overrides.yaml
Process 'pexinstall':
cmdline: s3cache https://<our-repo>/pex/eightball-SOMEGITHASH-trusty_x86_64.pex --cache-dir=/mnt/s3cache --copy-to=eightball-SOMEGITHASH-trusty_x86_64.pex --console --loglevel=debug --logstash_format && chmod 700 eightball-SOMEGITHASH-trusty_x86_64.pex
Process 'eightball':
cmdline: ./eightball-SOMEGITHASH-trusty_x86_64.pex --port={{thermos.ports[private]}} --memcache_servers=memcache-main.0001.use1.cache.amazonaws.com:11211,memcache-main.0002.use1.cache.amazonaws.com:11211,memcache-main.0003.use1.cache.amazonaws.com:11211,memcache-main.0004.use1.cache.amazonaws.com:11211,memcache-main.0005.use1.cache.amazonaws.com:11211,memcache-...etc --series_daily_host=seriesdaily05,seriesdaily06,seriesdaily07,seriesdaily08 --cellar_host=cellarreplicaset03,cellarreplicaset06,cellarreplicaset07,cellarreplicaset08 --history_host=cellarreplicaset03,cellarreplicaset06,cellarreplicaset07,cellarreplicaset08 --alert_host=cellarreplicaset03,cellarreplicaset06,cellarreplicaset07,cellarreplicaset08 --workers=34 --console --logstash_format=True --logstash_tags=eight_ball
Process 'authproxy_install':
cmdline: s3cache https://<our-repo>:7443/pex/authproxy_server-ANOTHERGITHASH-trusty_x86_64.pex --cache-dir=/mnt/s3cache --copy-to=authproxy_server-ANOTHERGITHASH-trusty_x86_64.pex --console --loglevel=debug --logstash_format && chmod 700 authproxy_server-ANOTHERGITHASH-trusty_x86_64.pex
Process 'authproxy':
cmdline: ./authproxy_server-ANOTHERGITHASH-trusty_x86_64.pex --port={{thermos.ports[http]}} --memcache_servers=memcache-main.0001.use1.cache.amazonaws.com:11211,memcache-main.0002.use1.cache.amazonaws.com:11211,memcache-main.0003.use1.cache.amazonaws.com:11211,memcache-main.0004.use1.cache.amazonaws.com:11211,memcache-main.0005.use1.cache.amazonaws.com:11211,memcache-...etc --cache=1 --cache_prefix=eightball --timeout=15 --console=True --logstash_format=True --server=http://localhost:{{thermos.ports[private]}} --right=all --use_liveconf_aclserver
Process 'proxy_healthcheck_install':
cmdline:
s3cache https://<our-repo>:7443/pex/aurora-proxy-healthcheck-ATHIRDGITHASH-trusty_x86_64.pex --cache-dir=/mnt/s3cache --copy-to=aurora-proxy-healthcheck-ATHIRDGITHASH-trusty_x86_64.pex --console --loglevel=debug --logstash_format && chmod 700 aurora-proxy-healthcheck-ATHIRDGITHASH-trusty_x86_64.pex
Process 'proxy-healthcheck':
cmdline: ./aurora-proxy-healthcheck-ATHIRDGITHASH-trusty_x86_64.pex --port={{thermos.ports[health]}} --url="/private/stats/" --test_port={{thermos.ports[private]}}

Starting, updating, restarting eightball

aurora-manage create eightball --stage=prod
aurora-manage update start eightball --stage=devel
aurora-manage restart eightball --stage=prod --batch-size=6

Most of the commands are mirrors of the aurora client, but we’re adding some sugar on top to do things our way. As these commands show, one feature is being able to easily specify dev or prod for deployments (and we default to dev). What you don’t see are some safety checks, like verifying that the user is at the head of master in the repo and that the githash being operated on has successfully passed CI tests.
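
The “head of master” check is just git plumbing; the CI check is the same littlefinger lookup used at deploy time. A sketch of the former:

# Sketch of the "at head of master" safety check (plain git plumbing).
import subprocess


def git(*args):
    return subprocess.check_output(('git',) + args).decode().strip()


def assert_at_head_of_master():
    if git('rev-parse', '--abbrev-ref', 'HEAD') != 'master':
        raise SystemExit('deploys must be run from master')
    git('fetch', 'origin', 'master')
    if git('rev-parse', 'HEAD') != git('rev-parse', 'origin/master'):
        raise SystemExit('local master is not at origin/master; pull first')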

The Yaml file…

Starting at the bottom and working our way up… If you’re familiar with Aurora, the first thing you notice from those commands is that we don’t specify job names. The create command will start /bb/cbe/prod/eightball. If I had specified --hash=devel it would have created /bb/cbe/devel/eightball. There is no way to change cbe (chartbeat for everyone) to cbp (chartbeat pro) or cbops (chartbeat operations) without editing the yaml file. This is by design: we don’t want people changing the names of jobs by mistake.

The second thing you probably notice is that we have moved the resource requirements into a ‘config’ map in the yaml file which can be overridden on a per-stage basis. Since we really only have dev and prod, a very common need is to have a dev instance that’s exactly like prod, just smaller. There’s also a ‘taskargs’ map which can be used to specify command-line args that wind up being applied to the binary being run. This allows users to tweak these configs quickly.
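
Resolving the effective config for a stage is essentially an overlay of the per-stage map on the base config map; something like this sketch (assumed layout, matching the yaml above):

# Sketch: overlay a stage's overrides on the base 'config' map from the yaml.
def resolve_config(job_yaml, stage):
    resolved = dict(job_yaml.get('config', {}))
    overrides = dict(job_yaml.get('envs', {}).get(stage, {}))
    # merge taskargs rather than replace, so a stage only overrides the
    # args it actually cares about
    taskargs = dict(resolved.get('taskargs', {}))
    taskargs.update(overrides.pop('taskargs', {}))
    resolved.update(overrides)
    resolved['taskargs'] = taskargs
    return resolved

For eightball, resolving ‘prod’ against the yaml above yields 1.5 CPU, 12 instances and workers=34, while everything else falls through from the base config.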

At the top of the yaml file you’ll see a bunch of metadata about the job:

file: eightball
user: cbe
buildname: eightball
hashtype: git
arch: trusty

File is the name of the .aurora file this .yaml file corresponds to. When a user types aurora-manage <job> it’s referring to the name of a yaml file. It’s entirely possible for multiple yaml files to point to the same aurora file. Our goal is actually to minimize the number of .aurora files through templates. As evidence of this, almost all of our workers point to the same aurora code.

User specifies which user (role) to run as. Originally the idea was to be able to tell the business folks how much each of our products was costing (from a server standpoint). Nobody has ever really asked us for the answer, which is probably for the best since most of the jobs are run as cbp. This really doesn’t do much more than help with navigating the aurora web ui. If we were to revisit it we would definitely get more fine-grained for the purposes of that reporting, probably to the project level (grouping consumers, api servers and workers from the same project as a unique user).

Buildname is the name of the pex file (or jar file) being run. When a user tries to run a job we look in a mysql db (called littlefinger) and make sure that the build for the given githash exists before continuing. This prevents jobs from being launched that die due to a failed install on the cluster. In the case of jar files there will be a version instead of a githash.
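
That lookup is a single query against the builds table; roughly (hypothetical schema, host and credentials):

# Sketch of the littlefinger build check (table and column names are made up).
import pymysql


def littlefinger_has_build(buildname, githash):
    conn = pymysql.connect(host='littlefinger.internal', user='reader',
                           password='...', db='builds')
    try:
        with conn.cursor() as cur:
            cur.execute(
                'SELECT 1 FROM builds '
                'WHERE buildname = %s AND githash = %s AND status = %s '
                'LIMIT 1',
                (buildname, githash, 'success'))
            return cur.fetchone() is not None
    finally:
        conn.close()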

Hashtype can be one of git, version or static, with a corresponding Githash/Version field. The intention is that version is a java style semantic version number and static means that whatever is being run has no versioning (we use this for a lot of devops tools where we’re running third party things that are not really versioned). In the case of git a githash is also supplied, and in the case of version a version number is supplied.

Arch. I guess I haven’t mentioned that besides adopting Aurora and Pants, we were also in the midst of upgrading from Ubuntu precise to trusty, and we support builds for both. We’re leaving this here for our next LTS upgrade, but currently this is always trusty.

All of the data in the yaml file is loaded in the aurora template environment into a pystachio struct which is the basis of all of our jobs. Once this is loaded, making higher level templates for our users becomes easy.

class CBProfile(Struct):
    bucket = Default(String, PEX_BUCKET)
    cluster = Default(String, 'bb')
    cpu = Required(Float)
    ram = Required(Integer)
    disk = Required(Integer)
    githash = Default(String, '')
    role = Default(String, 'ubuntu')
    stage = Default(String, 'test')
    name = Required(String)
    is_production = Default(Boolean, True)
    num_instances = Default(Integer, 1)
    taskargs = Map(String, String)
    service_overrides = Map(String, String)
    cron_schedule = Default(String, None)
    buildname = Default(String, None)
    cron_collision_policy = Default(String, 'KILL_EXISTING')
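
make_profile() is then little more than reading the environment the wrapper set up and filling in this struct. A sketch of the idea (the CB_* variable names other than CB_BUILDNAME, CB_GITHASH and CB_AURORA_STAGE are assumptions):

# Sketch of make_profile(): rebuild CBProfile from the CB_* environment
# variables exported by our wrapper (most variable names here are assumed).
import json
import os


def make_profile(**kwargs):
    env = os.environ
    args = {
        'name': env['CB_JOBNAME'],                       # assumed name
        'stage': env.get('CB_AURORA_STAGE', 'test'),
        'cpu': float(env['CB_CPU']),                     # assumed name
        'ram': int(env['CB_RAM']),                       # assumed name
        'disk': int(env['CB_DISK']),                     # assumed name
        'num_instances': int(env.get('CB_NUM_INSTANCES', 1)),
        'githash': env.get('CB_GITHASH', ''),
        'buildname': env.get('CB_BUILDNAME'),
        'taskargs': json.loads(env.get('CB_TASKARGS', '{}')),
    }
    args.update(kwargs)
    return CBProfile(args)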

The .aurora file

In the spirit of reuse for simplicity and consistency, we have a library of template code that our product engineers use to compose their jobs. For example, probably the most common thing we do is fetch an executable from s3 and install it in the sandbox. Let’s look at these lines:

from aurora.cbaurora.base import make_profile, make_pexprofile, pex_install_template

PEX_PROFILE = make_pexprofile('eightball')
install_eightball = pex_install_template

The python code these call is the following:

def make_pexprofile(appname=None, **kwargs):
    env = os.environ
    if not appname:
        appname = env['CB_BUILDNAME']
    args = {'appname': appname,
            'githash': env['CB_GITHASH']}
    args.update(kwargs)
    return PEXProfile(args)


class PEXProfile(Struct):
    bucket = Default(String, PEX_BUCKET)
    appname = Required(String)
    githash = Required(String)
    arch = Default(String, 'trusty_x86_64')
    pexfile = Default(String,
                      '{{appname}}-{{githash}}-{{arch}}.pex')


pex_install_template = Process(
    name='pexinstall',
    cmdline='s3cache {{pex.bucket}}/{{pex.pexfile}} '
            '--cache-dir=/mnt/s3cache '
            '--copy-to={{pex.pexfile}} '
            '&& chmod 700 {{pex.pexfile}}')

Most of our api servers also run what we call authproxy. This is a python app that proxies all requests coming from the outside world and checks the apikey of the requestor against our permissions engine. Authproxy is generally bound to what we call the “public” http port and the actual api server is bound to “private”. Internal requests from other processes can just connect directly to the private port. This will be covered in detail when we talk about how we integrate haproxy.

Installing authproxy is such a common requirement we made it just a few lines:

auth_proxy_processes = get_authproxy_processes()

Similarly, some older apiservers don’t bind to multiple ports, so we have a health check proxy that listens on Aurora’s health endpoints:

health_check_processes = get_proxy_healthcheck_processes(
    url="/private/stats/", port_name='private')

Both of these return an install process and a task process as a list.

Generally the only slightly complicated piece of creating a new aurora job for us is configuring the command line args to run a job.

eightball_server = Process(
    name='eightball',
    cmdline=('./{{pex.pexfile}} --port={{thermos.ports[private]}} '
             '--memcache_servers={{services.services[memcache_servers]}} '
             '--series_daily_host={{services.services[series_daily_replicaset]}} '
             '--cellar_host={{services.services[cellar_replicaset]}} '
             '--history_host={{services.services[cellar_replicaset]}} '
             '--alert_host={{services.services[cellar_replicaset]}} '
             '--workers={{profile.taskargs[CB_TASK_WORKERS]}} '
             '--console '
             '--logstash_format=True '
             '--logstash_tags=eight_ball ')
)

You’ll notice that there are three structs that will eventually be bound to this snippet. The thermos struct is bound by Aurora at the end; profile is our main CBProfile struct, which contains a map called taskargs for the purpose of passing arbitrary arguments as shown. The services struct is how we keep track of databases and other shared services. It binds a big map and allows us to override dev vs. prod based on where the job is being deployed, so a dev job will be connected to the dev kafka cluster. In this case, the replicasets are mongo databases.
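
get_service_struct() is essentially one big map of shared services with per-stage overrides applied before bind(). A sketch of the shape (the hostnames and override mechanism here are illustrative, not our real service map):

# Sketch of the services struct idea: shared service addresses with
# dev overrides applied before binding (hostnames are made up).
import os

from pystachio import Map, String, Struct


class CBServices(Struct):
    services = Map(String, String)


PROD_SERVICES = {
    'hbase_master': 'hbase01:9000',
    'memcache_servers': 'memcache-main.0001:11211,memcache-main.0002:11211',
    'cellar_replicaset': 'cellarreplicaset03,cellarreplicaset06',
}

DEVEL_OVERRIDES = {
    'hbase_master': 'hbase-dev01:9000',
    'cellar_replicaset': 'cellar-dev01',
}


def get_service_struct():
    stage = os.environ.get('CB_AURORA_STAGE', 'devel')
    services = dict(PROD_SERVICES)
    if stage != 'prod':
        services.update(DEVEL_OVERRIDES)
    return CBServices(services=services)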

We have similar templates for installing jar files, configuring jmx parameters and various healthchecks that we’ve written. Ultimately, everything is assembled here:

MAIN = make_main_template(
    ([install_eightball, eightball_server],
     auth_proxy_processes, health_check_processes,),
    res=resources_template)

jobs = [
    job_template(
        task=MAIN,
        health_check_config=healthcheck_template(
            timeout_secs=5,
            interval_secs=5,
            max_consecutive_failures=5
        ),
        update_config=get_update_config(t='fast')(watch_secs=41),
    ).bind(pex=PEX_PROFILE, profile=PROFILE, services=SERVICES)
]

The first function, make_main_template(), returns a sequential task. Each list passed to it runs in sequence (auth_proxy_processes and health_check_processes are lists), so the installers run, then the servers. The function also sets up some environment variables in .thermos_profile (which is sourced by every process in the sandbox). You can actually pass additional vars to be set in .thermos_profile as well, which comes in very handy for 3rd party apps that expect to read configuration from environment variables.
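
For reference, make_main_template() conceptually looks something like the sketch below. This is a simplification, not our actual template code: Process, Task and order() are provided by the Aurora config DSL, and the real version also writes the service_overrides file and fills in resource defaults.

# Simplified sketch of make_main_template(): prepend the env-setup process,
# then chain every list of processes with ordering constraints.
def make_main_template(process_lists, res, extra_env=None):
    exports = {'CB_AURORA_INSTANCE': '{{mesos.instance}}',
               'CB_AURORA_HOSTNAME': '{{mesos.hostname}}',
               'CB_AURORA_TASK_ID': '{{thermos.task_id}}'}
    exports.update(extra_env or {})   # extra vars for 3rd party apps
    envvar_setup = Process(
        name='envvar_setup',
        cmdline="echo '%s' > .thermos_profile" % '\n'.join(
            'export %s=%s' % item for item in sorted(exports.items())))

    processes, constraints = [envvar_setup], []
    for plist in process_lists:
        # every process waits for env setup; each list runs in sequence
        constraints.extend(order(envvar_setup, plist[0]))
        constraints.extend(order(*plist))
        processes.extend(plist)

    return Task(name='main', processes=processes,
                constraints=constraints, resources=res)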

Operations

One of the features of Aurora that we embraced wholeheartedly after a bit of trepidation is allowing jobs to fail and restart.

Prior to Aurora, a lot of our on-call pages were for jobs that had failed for whatever reason and just needed to be restarted. A common experience for an engineer (all of our engineers participate in on-call rotations, not just platform engineers) used to be that they would get a page for some service not responding, they would log onto the machine, verify that yes, the service died for whatever reason and then restart everything. Usually this could be done via a fab command, but there were always case by case caveats so usually they would have to look through our docs to make sure they were doing the right thing if they weren’t familiar with whatever was down. Most of the time, a simple restart solved the problem. The rest of the time, the problem was usually a full disk (ugh, logs), bad server or some downstream problem that needed further investigation.

This does not happen anymore. Clearly you don’t want your software to be so brittle that it dies all the time, but if it’s going to happen sometimes it should be a non-event and nobody should need to be woken up to handle it.

There are several reasons why Aurora will kill a running job:

  1. The job exits (this isn’t Aurora killing the job…). In python calling exit(0) or not catching an exception will cause this to happen.
  2. The job exceeds CPU / RAM allocations
  3. The job exceeds Disk allocation
  4. The job fails a healthcheck

Of these events, #2 is the only one that requires our intervention. We don’t just ignore the others; we actively embrace them. This took a leap of faith for us, but it’s paid dividends.

Log Files

Managing application log files on production servers at scale is actually a very difficult problem. We tried various tools and struggled with file handles not being closed and disks filling up with the sheer volume of data before finally settling on having Flume ship our logs to s3 where we do adhoc querying with Athena (the subject of a future post).

What we don’t do with our log files anymore is rotate them. Instead, we allow Aurora to kill the jobs when they fill up their disk limit (anywhere from a few days to a month depending on the job). When this happens, the job is immediately restarted with a fresh sandbox and the old job is archived. Aurora periodically prunes old copies of jobs as the server disk fills up. These events have the added benefit of forcing some movement in our jobs among servers which helps ensure good utilization of our cluster’s resources.

Exceptions

Usually an unexpected runtime exception is a bad thing, and we definitely try to track down the source and fix them when they happen in production. One issue that we kept running into is a long-standing bug in our mongo client that manifests itself as a NoneType error response when it catches (some) servers in the middle of doing a periodic roll-off of data. We’ve never really been able to track down a good fix where the client recovers, so we’ve implemented what we call the seppuku handler. As the name suggests, when it catches this error the handler just calls exit(0), killing the process. Not the most elegant solution, but it beats an engineer being woken up just to do the same thing. Actually, when this first started happening it paged people every day at 9:05 AM when most of our engineers are on a subway or bike heading into the office — it became known as the 9:05 bug.
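
A minimal sketch of the idea (the actual trigger is specific to that mongo client bug; the symptom check below is illustrative):

# Sketch of the seppuku handler: on the known-unrecoverable client error,
# exit cleanly and let Aurora restart the process with a fresh client.
import logging
import sys


def seppuku(reason):
    logging.warning('seppuku: exiting so Aurora restarts us (%s)', reason)
    sys.exit(0)


def query_with_seppuku(collection, query):
    try:
        return collection.find_one(query)
    except TypeError as exc:
        # illustrative check for the NoneType symptom of the wedged client
        if 'NoneType' in str(exc):
            seppuku('mongo client wedged: %s' % exc)
        raise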

Sometimes fixes aren’t pretty, but the internet is built on pragmatic hacks. Failing a single request — especially one that will be retried — is always better than a 15 minute outage.

Healthchecks

Healthchecks are a first class feature of Aurora and we insist that all jobs have one if possible. Most of our api servers have at least simple “i’m alive” healthchecks and our Kafka consumer jobs run a healthcheck that checks for status lines in their logfile. Again, when a job fails its check it’s just restarted by Aurora and nobody gets woken up.

Many of our api servers run a custom tornado framework we call sharknado3. For these servers we have a drop-in healthcheck handler that implements Aurora’s health endpoints, by default just returning “ok” as long as the server is up.
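
Aurora’s default HTTP health checker hits /health on the assigned health port and expects an “ok” body, so the handler really is tiny. A plain-Tornado version of the idea (not the actual sharknado3 code):

# Plain-Tornado version of the drop-in health handler idea (not the actual
# sharknado3 code): answer Aurora's GET /health with "ok" while we're up.
import tornado.ioloop
import tornado.web


class HealthHandler(tornado.web.RequestHandler):
    def get(self):
        # return something other than "ok" here if the app knows it's unhealthy
        self.write('ok')


def make_app():
    return tornado.web.Application([(r'/health', HealthHandler)])


if __name__ == '__main__':
    make_app().listen(8080)   # in Aurora this would be {{thermos.ports[health]}}
    tornado.ioloop.IOLoop.current().start()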

For other jobs we’ve implemented several custom healthcheckers which handle checking logfiles or binding to the health port and proxying requests to the api server. All of these are simple aurora templates which can be included in job configuration files. The following snippet checks for any line beginning with ‘{‘ being written to the json encoded logfile. The initial interval of 2 minutes gives the kafka consumer time to install and start reading. We like to give these processes a chance to recover on their own. On some of our older kafka 9 clients, when a single consumer restarts it can cause what we call “the zookeeper shuffle”, a rebalance of all of them. It’s best to leave them alone; they usually sort it out. Restarting all of the consumers at once can exacerbate the issue. We’ve been moving to static partition assignment with our kafka 10 consumers and they’re much more stable.

HEALTH_CHECK_SETUP = check_log_health.get_setup_process()

health_check_config = check_log_health.get_config(
    app_directory=processname,
    line_pattern='{',
    freshness=100,
)(
    timeout_secs=3,
    initial_interval_secs=120,
    interval_secs=60,
    max_consecutive_failures=3,
)

Resource Constraints

CPU and RAM throttling require intervention. We limit CPU with cgroups, so we don’t see jobs die due to high CPU, but we do see them die because they’re being throttled and start to fail their healthchecks, especially at job startup, which is frequently more CPU intensive than the job in steady state. This can be hard to debug because the error is usually unrelated to CPU; an installer might time out, for example. Since we graph job CPU/RAM usage we can usually see right away if a job is being limited by cgroups. Fortunately this can usually be fixed by extending the initial healthcheck period.

RAM doesn’t fix itself, especially in the python world. If a job uses more RAM than it requested it will be killed. For java programs this usually means that someone forgot to take into account other tasks running alongside their job when setting their JVM max heap. For python it usually means a dataset was too big. We run our python programs extremely lean (the typical API is configured for 0.2 CPU and around 300MB of RAM), but our workers can be very heavy, sometimes on the order of 10–20GB of RAM. We don’t like to allocate that much unless we need to, and in the cases where a job fails someone has to go and tweak the job’s yaml file to allocate enough memory.

Future Work

Our ultimate goal (which we are confident is very doable) is to eliminate all but a few of our .aurora files and really just define jobs in nicely structured yaml. Since each yaml file specifies the aurora file it operates on, we support groups of jobs each differing only by resource requirements and command-line arguments. We actually have several dozen workers all sharing a single generic .aurora worker config. Right now we could do the same with many of our existing jobs, but we also want to integrate this with routing and monitoring, a much more complicated beast. In the meantime, as we migrate to Aurora 0.18 we’re going to refactor our templates to make them cleaner and start grouping similar .aurora files into generic configs.

We recently made a big change to the way we call the Aurora client which will allow us to support more complex interactions. Previously, our client forked a call to the Aurora client after setting a bunch of environment variables. Now we’re importing the client module and building it all as a single pex which lets us skip the environment calls and stay in python.
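
Conceptually it’s the difference between forking the aurora binary and calling its command line entry point directly; something like the sketch below, though the exact module path and entry point depend on the Aurora version, so treat the import as illustrative.

# Illustrative only: drive the Aurora client in-process instead of forking.
# The module path / entry point varies by Aurora version.
from apache.aurora.client.cli.client import AuroraCommandLine


def run_aurora(argv):
    # argv is the same argument list we previously passed to the forked
    # `aurora` binary, e.g.
    # ['job', 'create', 'bb/cbe/prod/eightball', 'aurora/eightball.aurora']
    return AuroraCommandLine().execute(argv)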

Conclusion

Aurora is a fantastic tool for running the types of jobs we have at Chartbeat. We’ve embraced it wholeheartedly and it’s completely changed the way we work. Fortunately, the flexibility it provides allowed us, with some creative engineering and hard work, to pick the types of change we wanted without having to compromise in ways that didn’t work for us.

Up Next

We still have several topics to cover in this series. I’m not sure what’s coming next but we will definitely be discussing Pants, HAProxy with Synapse and some of the log analysis work we’ve done… soon.
