Building, automating and deploying serverless microservice infrastructures
A couple years ago, we at Fender were tasked with setting up an entirely new architecture to support the APIs required for our new product — Fender Play. Now, starting from scratch is not something we get to do most of the time in the course of our work, but when we do get the chance, why not go big?
So, the architecture entailed splitting up the API into multiple microservices, using Golang as the language of choice for development, and leveraging multiple AWS technologies and services to help us along the way, such as API Gateway, DynamoDB and Lambda functions. Serverless, basically, even though servers do exist somewhere along the line, like the proverbial turtle our planet stands on.
However, since I’m just a grubby little DevOps engineer, this article is going to focus on a small part of the whole — the build process — and how we managed, through trial and error, to get our deployments to multiple lambda functions down from tens of minutes to just a couple of minutes.
At Fender, we realized early on that while serverless makes it look like there’s very little administration required, the wiring between the various components, and keeping track of changes, could pose quite a challenge if not automated well.
After experimenting with AWS’s own tooling (SAM, which builds on CloudFormation) and Chef Provisioning, we realized that we needed something closer to an actual programming language, one that kept pace with changes to the cloud services we were going to be using. HashiCorp’s Terraform appeared to fit the bill perfectly. We also threw in some Ansible for legacy services on EC2 instances, and to manage a few aspects of the serverless infrastructure that we’ll talk about later in this article.
This is what our current setup looks like, with respect to tooling:
- Infrastructure management: Terraform
- Configuration management: Ansible
- CI/CD system: CircleCI (we’re evaluating Drone as a potential replacement)
- Custom Docker images are used to speed up build dependency setup
The initial build process
While learning our way around the new, magical serverless world full of bleeding edges, we also had deadlines to meet. This meant that we needed to keep the pace hot while not getting too distracted trying to get the “perfect” setup, at least initially. Now that I’m done with that disclaimer, this is what our build process initially looked like:
- Terraform: Contained in a separate repository. All the infrastructure definitions were stored here, and applied manually as needed. A single state file was used to store all the infrastructure data for each environment.
- Ansible: Contained in the same repository, alongside Terraform. Used to store the environment variables that lambda functions were configured with, as well as DynamoDB capacity settings. Playbooks generated scripts that used the AWS CLI to perform and validate the requisite changes.
- CircleCI: Compiled the Golang code, and uploaded it to the lambda functions created/configured earlier using Terraform and Ansible.
After a few months of deploying changes using this setup, we started noticing a few things that were getting in our way:
- The blast radius was pretty large for Terraform code changes, since all the microservice architecture was stored in a single state file. Even though the code was reasonably well-organized into reusable modules and we had a lot of common code, an engineer making a change in a shared resource could potentially end up impacting parts of the architecture they had no intention of touching.
- Developers were unable to self-service configuration changes to their applications, because those settings were stored in the Terraform/Ansible repository, and changes there could have far-ranging implications a developer should not have to worry about.
- CircleCI builds were taking FOREVER — a few of our apps had more than 50 lambda functions, and since “go build” uses all available CPU resources, the build process had to build each lambda binary sequentially. Parallelizing the builds wouldn’t have sped them up, because they were bottlenecked by the CPU resources available on the CircleCI build container.
- Even if we managed to cache binaries built in earlier builds, if any change was made to common libraries, every binary would have to be rebuilt — and this was by no means an uncommon occurrence. Also, uploading and downloading large caches took time during builds.
After a year or so in development, most of our microservice applications were seeing build times exceeding 20 minutes. While that provides valuable justification for coffee breaks, it’s also the best way to get out of the “zone” that we all aspire to forever be in. Productivity was taking a nosedive.
The improved build process
- Blast-radius issues: We decided that, in order to ensure each microservice was completely independent of the others (infrastructurally as well as in code), the Terraform definitions for each microservice should live within that service’s codebase/repository, with its own separate Terraform state file. If any change was made to that microservice, it would only affect that service, and not the rest of the infrastructure.
Of course, there are still common infrastructure elements, such as VPCs, bastion servers used to provide access to production systems, etc. — these were still stored in the separate Terraform repository and managed separately from the microservices. However, the resources that were not shared across multiple services now lived in Terraform/Ansible code stored in the same repository as the microservice code.
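Per-service state comes down to giving each repository its own Terraform backend configuration. A minimal sketch, with the bucket name and key layout assumed for illustration:

```hcl
terraform {
  backend "s3" {
    bucket = "fender-terraform-state"                 # shared state bucket (name assumed)
    key    = "services/play-users/terraform.tfstate"  # unique key per microservice
    region = "us-west-2"
  }
}
```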
This change would now allow developers to make any changes they wanted to the actual Terraform infrastructure code, which could potentially be dangerous. We chose to mitigate this using GitHub’s great CODEOWNERS file, which sits in the “.github” directory at the root of our microservice repositories, and ensures that DevOps eyes see any infrastructural changes before they are merged in and applied.
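A CODEOWNERS file along these lines (the paths and team name here are illustrative) requires a review from the named team on any change to the infrastructure code:

```
# .github/CODEOWNERS
terraform/* @fender/devops
ansible/*   @fender/devops
```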
- Self-service for Developers: Developers mainly needed to be able to edit environment variables configured on their microservices’ lambda functions, and this requirement was quite frequent, especially since we were still in the early- to mid-stages of product development and changes were coming fast and furious.
Moving the Ansible configuration variables into microservice repositories, while retaining the actual Ansible roles in the main Terraform repository, took care of this requirement. The Ansible role that configured lambda functions and DynamoDB tables was generic, and was informed by the variables that developers now had access to. So the damage would be limited to misconfiguration (which could be caught by QA in lower environments), and they did not have the ability to cause infrastructural damage by any changes to the variable files.
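A variables file of that shape might look something like this (names and values are purely illustrative, not the real role’s interface):

```yaml
# Per-microservice vars, consumed by the shared Ansible role
lambda_environment:
  LOG_LEVEL: info
  USERS_TABLE_NAME: play-users-staging
dynamodb_tables:
  play-users-staging:
    read_capacity: 25
    write_capacity: 10
```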
- Build speed: This was the big one, causing us the greatest loss of productivity and the most annoyance. And there wasn’t an existing, obvious, “managed-service” way to solve it.
What we needed was a way to parallelize building lambda binaries, without having to start up a new beefy EC2 instance with the CPU resources to handle that.
Of course, the solution was staring us in the face — lambda functions! Each function invocation is a separate event, and executes independently of other invocations, in parallel. All we needed was a function to build code to be deployed to other functions.
So we wrote it — a very simple Python script in a lambda that did the following:
- Install Go
- Install Apex (we were using it in some services before AWS released native Golang support)
- Check out the app code from Github
- Run unit tests
- Build the binary
- Zip up the binary and upload to S3
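A stripped-down sketch of such a builder follows. The event shape, repository layout, and S3 key scheme are all assumptions for illustration; the Apex step, the actual Go install, and error handling are omitted:

```python
import os
import subprocess
import zipfile

GO_ROOT = "/tmp/go"  # Lambda functions may only write under /tmp


def install_go():
    """Download and unpack a Go toolchain under /tmp (details omitted here)."""
    raise NotImplementedError


def artifact_key(service, sha, target):
    """S3 key for a built binary; this layout is an assumption of the sketch."""
    return f"builds/{service}/{sha}/{target}.zip"


def handler(event, context):
    # Skip the Go install when a warm container already has it from a prior run
    if not os.path.exists(f"{GO_ROOT}/bin/go"):
        install_go()

    # Check out the app code from GitHub
    workdir = f"/tmp/src/{event['service']}"
    subprocess.run(
        ["git", "clone", "--depth=1", "--branch", event["ref"],
         event["repo_url"], workdir],
        check=True,
    )

    # Run unit tests, then build the single target binary
    env = {**os.environ, "GOROOT": GO_ROOT, "GOPATH": "/tmp/gopath"}
    subprocess.run([f"{GO_ROOT}/bin/go", "test", "./..."],
                   cwd=workdir, env=env, check=True)
    subprocess.run([f"{GO_ROOT}/bin/go", "build", "-o", "/tmp/main",
                    f"./cmd/{event['target']}"],
                   cwd=workdir, env=env, check=True)

    # Zip the binary and upload it to S3 for the CI pipeline to deploy
    with zipfile.ZipFile("/tmp/main.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write("/tmp/main", "main")
    key = artifact_key(event["service"], event["sha"], event["target"])
    # boto3.client("s3").upload_file("/tmp/main.zip", BUILD_BUCKET, key)
    return {"key": key}
```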
Of course, there’s a little more complexity to it (but not much)…
- Lambda functions are “cold” until the first invocation, and then they are considered to be “warm”. This lasts ~10 minutes, and if no further invocations of the lambda are made, then it goes “cold” again, because the container that actually executes the code in the lambda is de-allocated by AWS.
This led to an additional optimization — we could now check if Golang was already installed and skip installing it if so, because an earlier invocation had already installed it, and the lambda was still “warm”. Basically, this meant that even for the first build of the day, only the first invocation would need to install Golang — the rest could use the prior invocation’s installation to perform the rest of the tasks. Well, pretty much — there’s no visibility into exactly how many containers are spawned when multiple invocations are performed — but we’ve seen good results with secondary invocations all round.
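That warm-container check reduces to a tiny helper, sketched here with a generic marker path rather than the real install logic:

```python
import os


def ensure_tool(marker_path, install):
    """Run `install` only when `marker_path` is missing, i.e. on a cold start.

    Warm containers keep /tmp between invocations, so a marker left behind by
    an earlier invocation lets us skip the installation entirely.
    """
    if not os.path.exists(marker_path):
        install()
        return True   # cold start: we had to install
    return False      # warm container: reuse the earlier install
```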
- One can only write to “/tmp” within lambda functions, and this space is (currently) limited to 512 MB. This, however, suffices for the purpose of installing Go and dependencies, and building a single Golang binary.
We chose not to deploy the built functions using the builder function, but instead to upload the built, zipped binary to S3, and deploy that package to the microservice’s lambda function(s) from within the CircleCI pipeline.
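Put together, the CircleCI side of the fan-out can be sketched like this; the payload shape, S3 key layout, and builder wiring are assumptions, and the builder invocation itself is stubbed out in a comment:

```python
from concurrent.futures import ThreadPoolExecutor


def build_payloads(repo, sha, targets, bucket):
    """One builder invocation per Lambda binary (payload shape is assumed)."""
    return [{"repo": repo, "sha": sha, "target": t, "bucket": bucket}
            for t in targets]


def run_pipeline(lambda_client, repo, sha, targets, bucket, workers=16):
    """Fan out to the builder Lambda in parallel, then point each service
    function at the package the builder uploaded to S3."""
    def build_and_deploy(payload):
        # Real code would first invoke the builder, roughly:
        #   lambda_client.invoke(FunctionName="go-builder",
        #                        Payload=json.dumps(payload))
        # and parse the returned S3 key; here we assume the key layout.
        key = f"builds/{payload['repo']}/{payload['sha']}/{payload['target']}.zip"
        return lambda_client.update_function_code(
            FunctionName=payload["target"],
            S3Bucket=payload["bucket"],
            S3Key=key,
        )

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(build_and_deploy,
                             build_payloads(repo, sha, targets, bucket)))
```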
The above optimizations, for a microservice with 50 lambda functions, reduced build times from ~20 minutes to ~3 minutes.
We hope you’ve gleaned some useful information from our experience setting up our Golang serverless build pipeline. Do you do something similar at your company? Do you have any ideas that solve the issues we faced more elegantly? Please let us know — we’d be more than glad to talk about it!