From Zero to Staging and Back

When I joined Jimdo’s Werkzeugschmiede team in August 2015, the first major task I took on was building a staging environment for Wonderland, our in-house PaaS for microservices. We’re an internal service provider offering other teams a platform for running production services. As such, we take a great interest in the uptime of our infrastructure.

From the very beginning of Wonderland, we knew that an isolated test environment matching production as closely as possible would give us more confidence to experiment, fix bugs, and implement new features. And indeed, having a pre-production environment for testing before deploying to production turned out to be invaluable: a safety net that makes the whole deployment process a lot less scary.

What follows is a detailed account of Wonderland’s staging environment: what it looks like, how we built it, and what we’ve learned since then.

Pair programming

Early on we decided to create the much-needed staging environment by pair programming. For me, working on a setup that is supposed to mirror production, and doing this with a coworker who knows Wonderland inside out, was an excellent way to learn about the platform and its different components.

I was able to ask questions when something was unclear and, at the same time, contribute my own ideas whenever I felt like it. This way, we created a fast feedback loop that not only helped me find my way around Wonderland but also taught me more about its creators (my new colleagues) and their modus operandi.

All in all, I highly recommend pairing for onboarding new team members, even if you don’t have the luxury of building a production-like environment from scratch.

One account per environment

Wonderland’s infrastructure runs on AWS. Rather than using a single VPC for both production and staging, we agreed to operate a dedicated AWS account per environment. This setup effectively isolates environments from one another. Most importantly, it prevents changes made in staging, whether intentionally or by mistake, from affecting production.

Other advantages of having one AWS account per environment include:

  • An easier understanding of the (cloned) infrastructure
  • Simpler automation code with fewer environment-specific exceptions
  • Finer access control at the VPC level (no more fiddling around with subnets)
  • No naming collisions of non-VPC resources
  • Effortless tracking of costs per environment
  • Ability to opt in to AWS features outside of prod first

On the downside, working with multiple AWS accounts makes credential management a bit more involved. To make up for this, we’ve been using awsenv and LastPass to quickly switch between accounts. (Of course, it’s all for naught if one forgets to use those tools…)
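
One way to guard against forgetting to switch accounts is to have automation refuse to run when the active credentials belong to the wrong account. The following is only a minimal sketch of that idea, not something taken from Wonderland: the account IDs, the EXPECTED_ACCOUNTS mapping, and the assert_account helper are all hypothetical. In a real script, the current account ID could come from the AWS STS GetCallerIdentity call.

```python
# Hypothetical mapping of environment name to AWS account ID (placeholder values).
EXPECTED_ACCOUNTS = {
    "stage": "111111111111",
    "prod": "222222222222",
}


class WrongAccountError(RuntimeError):
    """Raised when the active credentials do not match the target environment."""


def assert_account(environment: str, current_account_id: str) -> None:
    """Abort unless the active AWS account is the one expected for `environment`.

    `current_account_id` is passed in so the check stays easy to test; a real
    script might obtain it from STS GetCallerIdentity.
    """
    try:
        expected = EXPECTED_ACCOUNTS[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None
    if current_account_id != expected:
        raise WrongAccountError(
            f"refusing to touch {environment}: active account is "
            f"{current_account_id}, expected {expected}"
        )
```

Calling such a check at the top of every provisioning or teardown script would turn a forgotten account switch into an immediate error instead of an accident.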

We actually took this separation one step further and also created additional “stage” accounts for all hosted services we rely on every day, such as Papertrail and Quay. The overhead has been worth it.

Automate all the things

We spent a lot of time automating the setup of our staging environment. We managed to get to the point where we could run make stage in our github.com/Jimdo/wonderland repository and Ansible would take care of everything, from bootstrapping our ECS cluster to provisioning Jenkins — our central state enforcer — to starting essential microservices of our PaaS.

To achieve that, we took the existing Ansible playbooks and CloudFormation templates for production and adapted them for use in staging. This meant we had to:

  • Replace hardcoded parameters like URLs and secrets
  • Implement missing automation steps (some prod resources had been created by hand in the AWS console)
  • Address any issues that came up along the way (two words: eventual consistency)
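
A common way to cope with eventual consistency is to poll until a freshly created resource actually becomes visible before moving on to the next step. The sketch below shows the general pattern in Python; it is not our Ansible code, and the wait_until helper with its parameters is a hypothetical name chosen for illustration.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout: float = 60.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `check` until it returns True or `timeout` seconds have passed.

    Returns True if the condition was met, False if the deadline expired.
    `sleep` and `clock` are injectable so the helper can be tested quickly.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A provisioning step could then wrap a "does this resource exist yet?" lookup in wait_until and fail loudly when it returns False.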

It was also at this point that we decided to leverage standard make targets like “stage” or “prod” across projects. The following paragraph from How to build stable systems sums it up very well:

All projects, language notwithstanding, use the same tool for configuring and building themselves: make(1). Make can call into the given languages choice of build tool, but the common language for continuous integration and deployment is make(1). Use the same make targets for all projects in the organization. This makes it easy to onboard new people and they can just replay the work the CI tool is doing. It also “documents” how to build the software.

Even after automating all the things, there’s only one way to find out if our code works — and continues to work — as expected: creating staging from scratch, again and again.

Destroy all the things

In addition to make stage, we also implemented the inverse operation, make destroy-stage, to deprovision staging completely. This process boils down to deleting all CloudFormation stacks and other resources created by Ansible — in reverse order of creation.
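
The core of that teardown logic can be sketched in a few lines of Python. This is an illustration only: the stack names in CREATION_ORDER and the destroy_all helper are hypothetical, and the delete callback stands in for whatever actually removes a stack or resource.

```python
from typing import Callable, Iterable, List

# Hypothetical stack names, listed in the order they would be created.
CREATION_ORDER = ["network", "cluster", "jenkins", "services"]


def destroy_all(
    created: Iterable[str],
    delete: Callable[[str], None],
) -> List[str]:
    """Delete resources in reverse order of creation; return the deletion order."""
    deleted = []
    for name in reversed(list(created)):
        delete(name)  # e.g. trigger a stack deletion and wait for it to finish
        deleted.append(name)
    return deleted
```

Deleting in reverse order means that whatever was built last, and therefore depends on everything before it, goes away first.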

Tearing down CloudFormation stacks is usually straightforward. However, we sometimes have to create resources outside of CloudFormation by shelling out to the AWS CLI, because CloudFormation is slow to add support for new AWS resources. Resources created this way can leave behind dependencies that are hard to remove. And even when Ansible does provide a particular AWS module, there’s no guarantee that the “absent” state is implemented correctly, making the dreaded CLI our only option.

Once make destroy-stage did the trick, we were able to bootstrap staging from scratch, which in turn allowed us to verify that our infrastructure code does the right thing when starting from a blank slate.

To further automate things, we created a Jenkins job in prod to destroy staging every Friday night and another one to rebuild it on Monday morning.

Drawbacks and improvements

While I’m happy with what we’ve achieved so far, there’s still room for improvement. Here are some of the challenges we’ve seen:

  • Having a single staging environment for all four members of our team means that we need to coordinate testing in a few cases. Nobody likes to wait, especially not me. One solution would be to split up the Jenkins jobs involved and/or the backend systems they target.
  • It currently takes at least 5 hours to rebuild staging. We’ve already outsourced the building and storage of Docker images to Travis and Quay. We could further accelerate the process, for example by baking AMIs of our cluster instances. Again, nobody likes to wait.
  • Unsurprisingly, we experienced a couple of problems caused by broken/missing external dependencies. One that comes to mind is Docker’s APT repository. Yes, mirrors would certainly help. We started using GitHub releases for hosting artifacts whenever possible.
  • Allspaw is right when he says that “[testing outside production] is incomplete because some behaviors can be seen only in production, no matter how identical a staging environment can be made”. We’ve learned this the hard way. Staging is a safety measure — no more, no less.
  • To be honest, the automated weekly rebuild of staging has caused us a lot of trouble and extra work lately. Sometimes the Jenkins job fails after running into API limits. Other times the job orchestration goes wrong due to network issues (or because Jenkins happens to have a bad day?). In any case, we need to make the process more reliable again.
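
For the API limit failures mentioned in the last point, one commonly used remedy is to retry throttled calls with exponential backoff and jitter. The following Python sketch shows the pattern under stated assumptions: ThrottledError is a hypothetical stand-in for whatever rate-limit error a client raises, and with_backoff is an illustrative helper, not part of our Jenkins setup.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Hypothetical stand-in for an API rate-limit error."""


def with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on ThrottledError with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The upper bound doubles with each attempt, capped at max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
    raise ValueError("max_attempts must be at least 1")
```

The random jitter spreads retries out over time, so that several jobs hitting the same limit do not all retry at the same moment.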

For more on this topic, I recommend reading my other post: If it hurts, do it more often.

P.S. This article first appeared on my Production Ready mailing list.