Coded Infrastructure

“Use the Force”

At ACL, we have a special team: the J.E.D.I. team (Jedi Engineers Developing Infrastructure). We are the experts called in for complex infrastructure projects, security enhancements, and innovation that requires going beyond application code. We have a single mandate:

Increase the resilience, availability, performance, and security of our infrastructure on Amazon Web Services (AWS).

That’s it. No additional detail and nobody specifying exactly which features we need to add. Simply go out and make it happen. Both exciting…

…and frightening.

The force is strong with the J.E.D.I. Team (Source)

You see, ACL’s SaaS platform (known as ACL GRC) uses Ruby on Rails for many of its web applications, and traditional Ruby on Rails applications deploy onto existing servers via tools such as Capistrano. This is adequate when starting off, but it’s limited. It does not tap into AWS’s elasticity and on-demand scalability. It doesn’t address concerns outside of deployment, such as firewall rules, OS hardening, and disaster recovery. It’s not a workflow that helps us flow change through our company. In short, it does not address the full scope of our mandate. So how do we successfully address our mandate?

In a series of blog posts, I hope to share with you how we tackled our mandate and the challenges we faced along the way. I’ll start off by sharing our vision, the challenges we faced with our vision, and how we overcame those challenges with the implementation we chose. By the end of the series, you should have an understanding of what it takes to codify your infrastructure and the benefits gained. Sound good?

Perfect. Let’s get started!

The Vision

We started off by asking ourselves how we could meet our mandate. We grouped our experiences together and came up with the following goals:

  • We didn’t want a fixed set of servers we kept deploying to; we wanted an infrastructure where any change would be done on entirely new servers.
  • We didn’t want a faulty code change to cause downtime; we wanted the ability to deploy our code, test it in a hidden state, then seamlessly promote it only once we’re satisfied.
  • We didn’t want to distinguish between different types of code change; we wanted all changes, whether kernel patches, application code changes, or anything else, to go through the same workflow.
  • We didn’t want to document how to manually provision servers and thus risk the documentation becoming stale; we wanted to automate our server creation and have assurance that we can quickly perform disaster recovery by always creating and deploying to new servers.
  • We didn’t want our servers to rot; we wanted to constantly use new servers, so there would be no accumulation of hotfixes or undocumented configuration changes over time.
  • Lastly, we didn’t want to settle on a sub-par solution just because it was easy; we wanted the right solution. If that solution posed a challenge, then all the better. Smart engineers want a challenge, and that’d also help us recruit top talent going forward.

We believed that if we achieved these goals, we’d be in a better position to reason about, control, and modify our infrastructure going forward, thus enabling us to meet our mandate.

Unfortunately for us, there was either too little information or too much conflicting information on the Internet for us to figure out how we could meet our goals. So we decided to step back and ask ourselves: what does success look like?

It turns out, success looks like code.

In fact, code similar to the following pseudo-code:
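A minimal sketch of that idea in Ruby might look like the following. The `Infrastructure` class and its methods are hypothetical names chosen for illustration, not a real library, and an in-memory Hash stands in for the actual AWS calls so the snippet runs on its own:

```ruby
# Hypothetical sketch: every class and method name here is invented for
# illustration, and a Hash stands in for real AWS resources.
class Infrastructure
  attr_reader :resources

  def initialize
    @resources = {}
  end

  # Creates the network layer (VPC, subnets, firewall rules) from nothing.
  def create_network(name)
    @resources[name] = { type: :network, hardened: true }
    name
  end

  # Launches brand-new servers running the given application version.
  def create_application(name, version:, network:)
    raise ArgumentError, "unknown network #{network}" unless @resources.key?(network)

    @resources[name] = { type: :application, version: version, network: network }
    name
  end
end

infra = Infrastructure.new
network = infra.create_network("production")
infra.create_application("app-2015-09-20", version: "2015-09-20", network: network)
```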

As a result of executing that pseudo-code, we wanted to have gone from having nothing in AWS, to having a secure infrastructure up-and-running with our application’s codebase from 2015-09-20.

Furthermore, as the application developed and needed to be updated, the code to push out the updated application would be the following:
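Again as a sketch only: the `Deployment` class, its method names, and the second version date are hypothetical, and a Hash replaces the real AWS calls. The point is the shape of the workflow, where a new version comes up hidden, gets checked, is promoted, and only then is the old version removed:

```ruby
# Hypothetical sketch: names are invented and a Hash replaces real AWS calls.
class Deployment
  attr_reader :stacks, :live

  def initialize
    @stacks = {} # version => state of the servers running that version
    @live = nil  # version currently receiving user traffic
  end

  # Brings up the version on entirely new servers, hidden from users.
  def create(version)
    @stacks[version] = :hidden
  end

  # Stand-in for smoke tests run against the hidden servers.
  def healthy?(version)
    @stacks[version] == :hidden
  end

  # Switches traffic to the new version and returns the previous one.
  def promote(version)
    raise "#{version} failed its checks" unless healthy?(version)

    previous = @live
    @stacks[previous] = :retired if previous
    @stacks[version] = :live
    @live = version
    previous
  end

  # Removes the servers of a version that no longer takes traffic.
  def destroy(version)
    raise "refusing to destroy the live version" if version == @live

    @stacks.delete(version)
  end
end

deploy = Deployment.new
deploy.create("2015-09-20")
deploy.promote("2015-09-20")

# Later: roll out an updated application (the date is made up).
deploy.create("2015-10-05")
old_version = deploy.promote("2015-10-05")
deploy.destroy(old_version) if old_version
```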

Voilà! Bring up a new version of the application on new servers, test that it works fine, and if it does, remove the old version. This would effectively meet the majority of our goals and allow us to recover from any disaster quickly! But…

…at what cost?

Benefits Gained

Aren’t there alternative options to coding that can meet our goals? Are we arbitrarily choosing a code-centric solution because, as software engineers, that’s our comfort zone?

No.

Rather than investing time into learning tools, DSLs, or worse, graphical interfaces that might achieve our goals, we can tap into our development skills, which we know can achieve them. There will be no limitation that can block our ambitious vision down the road. We gain other benefits as well, such as:

  • We can write automated tests for our infrastructure. Our disaster recovery practice will no longer be a once-in-a-blue-moon ordeal, but can instead be done regularly since it’s simply replaying our infrastructure code and confirming its results.
  • Our code and tests act as our documentation going forward. Documentation quickly goes stale, and people leave companies on a frequent basis, so how can we ensure our day-to-day mechanisms and infrastructure will be well understood when a new person comes onboard? By having our tasks and infrastructure as code. Not only will our code document nearly 100% of our infrastructure, compared to documentation that may miss key parts, it will also be up-to-date since we use it daily.
  • We can version control our infrastructure. We can now track our infrastructure’s development history and look to the past to determine why certain decisions were made.
  • Our team’s coding skills will allow intra-departmental collaboration. J.E.D.I. developers can communicate fluently with application developers and even swap positions once in a while. The same skill set can be used to create innovative tools, such as our internal Slack bot that automates common tasks and requests.
  • We increase our employee engagement and improve our chances of recruiting talent. Talented engineers want challenging problems, and nothing is more discouraging than asking them to do repetitive tasks that do not tap into their software engineering skills. With a coded infrastructure, you significantly mitigate this concern.
  • And so forth; although I may be biased. ;)

In contrast, if we chose a less automated approach, we’d quickly become overwhelmed (and bored) with the numerous routine tasks involved. We’d also have little assurance regarding the reproducibility of our infrastructure, since our servers would persist for long periods of time and accumulate too many one-off hand-crafted changes. Clearly, a coded infrastructure was the better choice.

We knew we were on the right track, but exactly how would we go about coding our infrastructure?

Implementing the Vision

Where should we start with codifying our infrastructure? Should we leverage a 3rd-party library that builds on AWS? Should we use a tool that abstracts away the complexity of AWS with an easy-to-use DSL?

No. We didn’t want to limit ourselves. We wanted access to the full power of AWS. Thus, we decided to directly use AWS’s Ruby SDK.

Unknown to us, however, were the many challenges we would face along the way. From rate throttling, to the required wait-until-complete logic, to random failures due to “eventual consistency” issues, it was a challenge to get up and running. It took a little while before we got the required abstractions in place to help us leverage AWS consistently.
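To give a flavour of what such abstractions can look like, here is a minimal sketch of two generic helpers: one retries a call with exponential backoff, the other polls until a condition holds. The error classes and helper names are hypothetical stand-ins, not the SDK’s own, and this is not necessarily how we implemented it:

```ruby
# Hypothetical error types standing in for throttling and
# "resource not visible yet" failures; real SDK class names will differ.
class ThrottledError < StandardError; end
class NotFoundYetError < StandardError; end

RETRYABLE = [ThrottledError, NotFoundYetError].freeze

# Runs the block, retrying retryable errors with exponential backoff.
# The sleeper is injectable so the delay can be skipped in tests.
def with_retries(max_attempts: 5, base_delay: 0.5, sleeper: ->(s) { sleep(s) })
  attempt = 0
  begin
    attempt += 1
    yield
  rescue *RETRYABLE
    raise if attempt >= max_attempts

    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Polls until the block returns a truthy value or the attempts run out.
def wait_until(max_attempts: 20, delay: 1, sleeper: ->(s) { sleep(s) })
  max_attempts.times do
    result = yield
    return result if result

    sleeper.call(delay)
  end
  raise "timed out waiting for condition"
end

# Simulated flaky call: throttled twice, then succeeds.
demo_calls = 0
demo_result = with_retries(base_delay: 0.01) do
  demo_calls += 1
  raise ThrottledError if demo_calls < 3

  :created
end
puts "#{demo_result} after #{demo_calls} attempts"
```

Wrapping every SDK call in helpers like these keeps the retry and polling policy in one place instead of scattering it through the infrastructure code.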

In the next blog post, I will share in depth the challenges we had with AWS’s Ruby SDK and how we overcame them.

Stay tuned!