Making Breakfast: Chef at Airbnb

By Igor Serebryany & Martin Rhoads

Airbnb is a dynamic code environment. As we build an SOA to cope with our rapid growth, we have dozens of people creating, launching, and wiring together the services which create the user experience you see on our site.

All of these services have to run on some machines — in our case, virtual machines in Amazon EC2. We’ve gone through several approaches to launching and configuring these machines. Last year, we open-sourced the tool we were using at the time, called CloudMaker. It was a major improvement over our previous manual approach, but it didn’t scale for us. The YAML configs weren’t expressive enough to specify all of the configuration we needed, so we grew complicated trees of versioned shell scripts, stored in S3, that did the bulk of the work and were completely unmaintainable.

In December of 2012, we began a migration to Chef. We started out using Opscode’s excellent hosted Chef, which enabled us to hit the ground running. Within a few weeks, we had all our common configuration ported over and were running Chef-enabled services in production. As we added more Chef users inside the company, however, we realized that our Chef server workflow was not scaling. As users worked on different parts of the system, they would push different cookbooks, roles, and data bags to the server, sometimes clobbering code other users had pushed previously. We needed a way to let all of our users work in parallel without affecting each other’s work, or the deployed code in production.

So, we ended up moving over to Chef Solo, and thus arrived where we are today. We have a set of tools and a workflow that make machine administration a breeze. We would like to share our approach with the community, in the hope that our lessons come in handy in other organizations.

Git-Based Workflow

We use a single git repo for all of our configuration management. All of our own cookbooks as well as any cookbooks we require as dependencies are imported into this repo.

A single branch, called production, contains the authoritative configuration code which configures production instances. The SRE team reviews all merge requests into the production branch, acting as gatekeepers of the infrastructure.

People who work on features for their services, or on changes to the common configuration, work in branches. For testing, our engineers spin up Vagrant boxes or EC2 instances using the Chef code in their branch, and submit pull requests once they’ve tested their changes.

Democratic approach

Almost all of our engineers now have some exposure to Chef. Many have gone through internal training sessions. We also have very extensive internal documentation, covering everything from the basics of Chef to the nitty-gritty of cookbook testing at Airbnb.

Any team which maintains a service running on its own machines also maintains its own cookbook for that service. Our Site Reliability Engineering (SRE) team is responsible for the common cookbooks that set up the base system, as well as some shared cookbooks, such as the ones that install Java or configure nginx. We’ve also created internal abstraction definitions, such as java_service and rails_service. Using these, it is often possible to set up a service with a recipe containing just that one resource.

Box Attributes

We consider three primary attributes which define how a machine is going to be configured:

  • environment
  • role
  • branch

Our environments are fairly minimal; they contain overarching configuration, such as the addresses of the ZooKeeper servers in that environment. Cookbooks also usually use the environment to determine which credentials files to load during a run; machines in the development environment never get production credentials.

We always assign a single role to a box. We occasionally have SPOFs — roles which are only used by a single machine — but usually a role will be shared by a group of machines. A role is the main identifier for machines. Some examples: mobile-web-worker, sphinx, pricing.

Finally, the branch attribute of a machine determines which chef repo branch will be used to configure the machine during a chef-solo run. Most of our boxes are on the production branch, but people will run boxes on branches when they’re developing their cookbooks.

The ability to run boxes on git branches is an amazingly powerful thing. Testing chef code means pushing to a git branch — no knife commands! Engineers working on cookbooks can easily bring up machines running their new code, without any fear that this code will affect production machines in any way. While running on a branch, they can be sure that nothing else is changing in their machine configs except for the changes they themselves are making. We can do all of this even while avoiding strict and painful versioning of cookbooks and specialized tools like knife-spork — all you need to know is git.

Because all of our service owners want to stay current with the changes that SRE is making in the common cookbooks, they have an incentive to get their code merged into production. But they don’t have to do it until they’re ready; running on a branch is just as good most of the time, and if there are some critical updates they need they can always rebase or cherry-pick.

We store the Chef attributes in three files: /etc/chef/environment, /etc/chef/role, and /etc/chef/branch. Grabbing a machine from production to test a new configuration is as simple as SSHing in and editing /etc/chef/branch.
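As a sketch, here is what those attribute files look like and how a branch switch works. A temp directory stands in for /etc/chef so this runs without root, and the values are illustrative:

```shell
# Sketch of the three node attribute files; a temp dir stands in for
# /etc/chef so this can be run without root. Values are illustrative.
CHEF_DIR=$(mktemp -d)

echo "production" > "$CHEF_DIR/environment"
echo "web-worker" > "$CHEF_DIR/role"
echo "production" > "$CHEF_DIR/branch"

# Grabbing this box to test a branch is a one-line edit:
echo "igor-test" > "$CHEF_DIR/branch"

cat "$CHEF_DIR/branch"   # prints igor-test
```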

Converging

Every one of our production instances has a converge command, a simple Bash wrapper around chef-solo. The converge command first pulls down the latest copy of our Chef repo and checks out the branch listed in /etc/chef/branch. Then, using the environment and role specified in /etc/chef/environment and /etc/chef/role, it generates a JSON attribute hash and a run list, and passes them to chef-solo.
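A rough sketch of that flow is below. The real wrapper is internal to Airbnb; the build_node_json helper, the paths, the JSON shape, and the run-list format here are all assumptions, with the git and chef-solo steps shown as comments:

```shell
# Hypothetical sketch of a converge-style wrapper around chef-solo.
# Only the JSON/run-list generation actually runs here.
build_node_json() {
  local environment=$1 role=$2 branch=$3
  printf '{"environment":"%s","branch":"%s","run_list":["role[%s]"]}\n' \
    "$environment" "$branch" "$role"
}

# 1. Update the repo and check out the branch this box is pinned to, e.g.:
#      git fetch origin && git checkout "$(cat /etc/chef/branch)"
# 2. Build the attribute hash and run list from the attribute files:
build_node_json production web-worker production > /tmp/node.json
cat /tmp/node.json
# 3. Hand both to chef-solo, e.g.:
#      chef-solo -c /etc/chef/solo.rb -j /tmp/node.json
```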

Bootstrapping

We loved being able to easily bring up instances using knife ec2. When we moved over to Chef Solo, we implemented our own version of this, which we call stemcell.

Stemcell calls the AWS API to create the correct kind of instance, sets the proper tags, and then bootstraps the box to run Chef. Initially, we passed all information about launching instances to stemcell by hand on the command line — or created little bash wrapper scripts to properly invoke the command. Now, we keep role metadata in the role files themselves, and allow stemcell to parse the role information to determine what to do. Most of our roles have an attribute hash that looks like this:

default_attributes({
  ...
  "instance_metadata" => {
    "instance_type" => "c1.xlarge",
    "backing_store" => "instance_store",
    "security_groups" => [
      "Web Server"
    ],
    "tags" => {
      "Name" => "webworker",
      "Group" => "monorail",
    },
  },
  ...
})

Stemcell works with Ubuntu’s cloud-init for the initial configuration of instances. The script we place in the AWS machine’s user data goes through the following steps:

  • run an initial apt-get update
  • install some fundamentally necessary software — curl, git, and Chef
  • create the node attribute files like /etc/chef/branch
  • pull down a copy of our chef repo onto the box
  • run an initial converge
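The steps above can be sketched as a user-data script in dry-run form. The run helper just prints each command instead of executing it, and the repo URL is a placeholder, not our real repo:

```shell
# Dry-run sketch of the bootstrap steps above. `run` prints each command
# rather than executing it, so the sequence is inspectable without root.
run() { echo "+ $*"; }

REPO_URL="https://git.example.com/chef-repo.git"  # placeholder URL

run apt-get update                                # initial apt-get update
run apt-get install -y curl git chef              # fundamental software
run mkdir -p /etc/chef                            # node attribute files
run "echo production > /etc/chef/branch"
run git clone "$REPO_URL" /opt/chef-repo          # copy of the chef repo
run converge                                      # initial converge
```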

Tracking our infrastructure

Chef server provides the ability to list all of your nodes, via the web UI or via a knife command. We wanted similar functionality, so we came up with a simple service we call optica.

At the end of every Chef run, our custom optica handler reports to the service over HTTP: the resources that changed, the success or failure of the run, and any custom parameters we would like to record. Optica also supports a convenient query API. For instance, to get all of our web worker nodes on the igor-test branch, we could run curl -sf 'http://optica/?role=web-worker&branch=igor-test' (the quotes keep the shell from interpreting the &).

Optica expects and returns JSON. For routine operations, our engineers use optica with jq on the command line to learn about our infrastructure.

For example:

$ curl -sf 'http://optica/?role=optica'|jq -r '.nodes[]|.az'
us-east-1c
us-east-1e
us-east-1b

We also enjoyed using knife-ssh to get information or to force a converge on a particular role. To replicate this functionality, we chose Fabric. Our fabfile automatically populates the role list by querying optica, so we can do something like:

~/chef $ fab -R web-workers uptime

The fabfile takes care of the rest! The optica repo has more information, along with all of the necessary scripts and tools to get started.

Workflow

Currently, our chef repo has over 50 individual contributors — the bulk of our engineering team. Most of the contributors work on cookbooks responsible for deploying their own services. For instance, engineers from our operations team work on the lantern cookbook, which runs the internal tool for our customer service team.

Engineers are free to develop and test their own code in branches. However, it goes through a rigorous peer review with a member of the SRE team before it can be merged into the production branch. This keeps code quality consistently high across all of our cookbooks, and spreads knowledge of the configuration throughout the team.

Future Improvements

Although our workflow right now is solid and works very well for us, we recognize that there is a lot of room for improvement. We’re working on a number of changes at the moment.

We are deprecating stemcell internally in favor of a service that handles instance launching. We would like to make launching more consistent and reliable, and to minimize the number of people on our team holding credentials to launch instances.

Most of our testing is manual at the moment. We provide basic Vagrant configs that enable sanity-testing cookbooks locally, and it’s easy to test in EC2 using branches. However, we are developing a comprehensive testing tool built on top of Docker, which allows testing changes quickly without running through the entire run list. We’ll say more about this tool (internally called garcon at the moment) in a future blog post.

Finally, we are always interested in hearing from you about how we could improve. One of the reasons we chose Chef is its consistently excellent, creative, and dedicated user community, which has made Chef the amazing tool it is today. We’re paying attention on the Chef mailing list and on our repos in GitHub — drop us a line!


Check out all of our open source projects over at airbnb.io and follow us on Twitter: @AirbnbEng + @AirbnbData


Originally published at nerds.airbnb.com on October 15, 2013.