CI/CD at Scale

Mateusz Trojak
Mar 19, 2018 · 9 min read

Brainly’s DevOps team describes its best practices and favourite tools for maintaining continuous integration and continuous delivery at an immense global scale.

Brainly serves a staggering global user-base, with tens of millions of students visiting the platform each month in dozens of markets worldwide. At that scale, delivering new code as rapidly and seamlessly as possible poses significant challenges to the entire production team. For Brainly’s developers, the key challenge is to release new code multiple times each day across two development pipelines. Except when new features or improvements are launched, each and every deployment must occur entirely in the background, without any of Brainly’s 100 million users noticing.

Fig 1. Number of Nginx requests across all Brainly markets. At this level of traffic, a reliable CI/CD solution is a must.

A Two-Stack Challenge

The current Brainly stack consists of two main sub-stacks, and because of differences in language, purpose, and physical location, each of these stacks requires its own deployment pipeline.

  1. The Monolithic Stack contains frontend and backend applications written in PHP, Node.js, and React.
  2. The Microservices Stack contains services running on Mesos and orchestrated by Marathon.

In the first part of this post, we’ll take a look at how we manage the monolithic or applications stack, and in the second part we’ll drill into the details of maintaining CI and CD in the microservices stack.

The Monolithic Stack

TeamCity: Our Platform of Choice

We chose TeamCity as our CI/CD orchestration platform. It has exceptional configuration flexibility, offers just enough plugins for our needs, and allows us both to integrate with external authentication tools and to predefine Meta Runners (custom build definitions). Most of the jobs running on TeamCity Agents run inside dedicated Docker images, allowing us to maintain discrete sub-environments for each build type. This build isolation frees us from the concern of affecting other builds that might have different requirements. Since the entire provisioning process for the TeamCity Server and Agents is handled by Ansible, we can add new Agents quickly, simply, and with full automation; all that an Agent requires is Docker installed on it.
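As a rough sketch of what that provisioning amounts to (this is not our actual Ansible role; the host group, package name, image tag, and server URL below are illustrative assumptions), adding an Agent looks something like this:

    # Sketch only: make sure Docker is present, then run a TeamCity agent
    # as a container. Values such as the package name and SERVER_URL are
    # illustrative assumptions.
    - hosts: new_teamcity_agents
      become: true
      tasks:
        - name: Install Docker
          package:
            name: docker.io
            state: present

        - name: Run the TeamCity agent in a container
          docker_container:
            name: teamcity-agent
            image: jetbrains/teamcity-agent:latest
            restart_policy: always
            env:
              SERVER_URL: "https://teamcity.example.internal"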

Continuous Integration in 4 Steps

We stick to a simple and effective four-step process to maintain CI across Brainly’s applications.

  1. Developers write code and push it to a feature branch on GitHub
  2. Several Commit Checkers are defined as GitHub webhooks, with checks executed internally on TeamCity
  3. After all checks pass, we initiate Code Review
  4. When Code Review passes, the branch is merged into master

Continuous Delivery in 4 Solutions

We spent months testing a range of setups before migrating from bare-metal servers to AWS. Ultimately, we chose t2.micro instances to run the Brainly Applications Stack, and gave each application its own dedicated AWS Autoscaling Group (ASG), allowing us to manage the size of the application automatically and to maintain a dynamic infrastructure based on the number of requests. In general, a larger number of requests correlates with higher average CPU usage in the ASG and a higher number of running EC2 (AWS Elastic Compute Cloud) instances. In our case, an ASG can be treated as a dedicated process, with each EC2 instance treated as a thread of that process.

Fig 2. Brainly apps size, as managed by AWS ASG for PL market (7 days).
Fig 3. Brainly apps size, as managed by AWS ASG for all markets (7 days).
Fig 4. Brainly apps size, as managed by AWS ASG for all markets with deploy information (7 days).

We’ve mapped out some of our sizing data above, in Fig. 2, 3, and 4. As Fig. 2 demonstrates, the number of ASG instances is directly related to traffic, with dips in traffic occurring on weekends, when fewer students are using the Brainly portal. Fig. 3 shows similar effects over the same time range, and includes ASG activity across all markets. In Fig. 4, we’ve included information about application deployments, with about 70 occurring each week.

Each production environment has both frontend and backend application stack elements. As developers work, they face more than a thousand ASG instances, giving rise to a pressing need for a reliable deployment mechanism capable of fulfilling 4 core requirements.

  1. Deploy code across all ASG instances as rapidly as possible
  2. Ensure that each instance serves the latest version of the application
  3. Deploy code discretely, with no errors, downtimes, or user interruption
  4. Allow code to revert rapidly to a previous version in the event of any problems

1. Deploying Code Rapidly

The most pressing challenge in meeting these requirements was to maintain the ability to deploy rapidly to hundreds of EC2 instances. First, we needed to know where exactly to deploy. Although we use Terraform to provision the entire platform, we’ve chosen not to use it during deployment itself. Terraform does not provide a list of ASG instances directly, so we use Ansible in these cases, since it provides information about all running EC2 instances through an EC2 external inventory script.¹ The list contains all ASG groups with running instances, allowing us to extract needed data from it easily and quickly. In just a few seconds, we’re able to obtain information about all running instances across three zones in one AWS region.
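As an illustration of the idea (the group and variable names below follow the tag-based naming the ec2.py inventory script generates, but the exact names for our ASGs are assumptions), a play targeting one application’s ASG could look like this:

    # Sketch: run with `ansible-playbook -i ec2.py list_targets.yml`.
    # The group name reflects the aws:autoscaling:groupName tag and is
    # illustrative only.
    - hosts: tag_aws_autoscaling_groupName_brainly_app_pl
      gather_facts: false
      tasks:
        - name: Print each instance's private IP to build the deploy list
          debug:
            var: ec2_private_ip_address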

2. Ensuring that the Latest Version of the Application is Served

The next challenge was to design a rapid deployment process. Having hundreds of running EC2 instances requires responding to any given problem with an immediate solution. Fortunately, Ian Barfield from AddThis offers a readymade solution to this challenge through ssync, a recursive wrapper around the rsync unix tool. He describes ssync as a “divide-and-conquer file copying tool to multiple destination hosts,” capable of “transfers to N remote machines in log N iterations” (see Fig. 5, below).²

Figure 5. ssync’s divide-and-conquer file copying tool

Thanks to the ssync tool, copying application code across hundreds of instances takes no more than a few minutes. (For more information on ssync, check out this blog post.)

Fig 6. Average Applications Deploy Time

3. Maintaining the Latest Version of the Application

The next requirement was maintaining the newest version of the application on each EC2 instance. Each application serves a health check while running, and during deployment the health-check endpoint used by the AWS Elastic Load Balancer is updated to the most recently deployed app version. As a result, if even one ASG instance fails to serve the most recent code version, ELB will terminate that instance and begin again with a clean one. When the new instance starts, it downloads the newest application version from a dedicated S3 bucket and sets up its environment automatically. This entire process takes about 20 seconds. Once the health check implemented by the app ELB passes, the instance is marked as running and can begin fielding requests.

4. Ensuring Prompt Reversion

The final requirement was to enable rapid rollback in the event of any problems. For each application, a few previous app versions are always stored directly on the EC2 instances and in S3. During deployment, code is also pushed to the front instances. Each app version is stored in a folder named after the version, with a symlink pointing to the newest one. If a critical issue occurs and there is no time for redeployment, a revert build is initiated manually. This build sets the symlink to the previous version, or to a version chosen by the developer who initiated the build. The ELB health check is also switched back to the desired version. This protocol allows us to avoid wasting time pushing code to all instances. The entire process takes about 20 seconds.
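Conceptually, the revert is little more than re-pointing a symlink on every instance of the app’s ASG; a minimal sketch, assuming hypothetical paths, group, and variable names rather than our actual layout:

    # Sketch: switch the "current" symlink back to a previously deployed
    # version on every instance (paths, group, and variable are assumed).
    - hosts: tag_aws_autoscaling_groupName_brainly_app_pl
      gather_facts: false
      tasks:
        - name: Point the current symlink at the chosen previous release
          file:
            src: "/srv/app/releases/{{ rollback_version }}"
            dest: /srv/app/current
            state: link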

Figure 7. Average Applications Redeploy Time

The Application Deployment Pipeline

Once a PR has been merged to master, TeamCity triggers a final set of code tests. At that point, the package is built and sent to several staging environments, initiating a sequence of Functional Tests on Staging. If the tests pass, we begin sending code to all markets and updating the contents of the dedicated S3 bucket with the newest application package. During rush hours, deployment to 12 markets takes less than 10 minutes.

Figure 8. Brainly Applications Deploy Pipeline

The Microservices Stack

Maintaining Continuous Integration in the Microservices Stack

Brainly has over 100 microservices, each of which requires an individual approach. With only 3 applications, it’s easy enough to set up CI for them on the internal TeamCity; for the microservices, however, we decided to use Travis for all CI requirements, so that every PR must pass a Travis Check/Build in addition to Code Review before being merged to master.

Maintaining Continuous Delivery in the Microservices Stack

We deploy microservices to production about 10 times each day, just as we do with the Brainly Applications stack. We’ve vastly simplified the deployment and scaling processes by packaging all microservices as Docker containers. Orchestration automation, which we achieve with Marathon, is also relatively simple.

Figure 9. Brainly Microservices Deploy Pipeline

The first challenge was to automate the process of deploying the apps. Since the number of Build Configurations in TeamCity is limited, we decided to set up only one Build for all microservices. This build is orchestrated by Brainly’s internal microservices stack and is integrated with each microservice repository through a GitHub webhook. Once a commit lands on a service’s master branch, the Build in TeamCity is triggered. At that point, two repositories are cloned: ‘ci-declaration’ and Infrastructure as Code. Together they contain the logic behind deploying code and the set of variables needed by a given microservice, e.g. database addresses, lists of Memcached instances, and so forth. The general idea behind ci-declaration is to parse the .deploy.yml file, which is available in every microservice. Once the build has begun, the microservice repository is also cloned.

Figure 10. Average Microservice Deploy Time

We use a custom YAML file, .deploy.yml, to deploy microservices with Travis-like syntax.

For all microservices, we had to create a dedicated auto-deployment mechanism with the possibility of predefining custom deploy steps.

Build steps are defined as top-level elements (such as build, deploy, or after_build) and must always contain a valid configuration:
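A minimal configuration might look something like the sketch below; apart from the top-level build, deploy, and after_build elements, the key names and values are illustrative rather than the exact syntax our parser expects.

    # Illustrative .deploy.yml sketch, not the exact Brainly schema.
    build:
      - make test
      - make build
    deploy:
      marathon:
        instances: 4
        cpu: 0.5
        mem: 512
        healthcheck: /health
        public: false
    after_build:
      - ./scripts/notify-deploy-channel.sh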

The above example is a simple CI configuration with customized steps to build and deploy the application to Marathon. Thanks to .deploy.yml, we’re able to tweak microservices extensively. In a basic configuration we can set up any or all of the following:

  • Number of instances
  • Whether the number of instances is fixed or autoscaled
  • Amount of CPU and RAM per instance
  • How the service should start
  • Whether the service is publicly available
  • Nginx timeouts per microservice
  • Health-check endpoint of the microservice used by Mesos
  • Additional env variables
  • Custom domain name for the microservice
  • Placement in the Mesos cluster or Marathon group

This flexibility allows our development team to initiate and modify microservices without ever touching Mesos or Marathon. Take, for instance, this example of Grafana running as a microservice:
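In spirit, such a definition just fills in the options listed above; the sketch below is illustrative (the sizes, domain, and key names are assumptions, though grafana/grafana and /api/health are Grafana’s real image and health endpoint):

    # Illustrative sketch of running Grafana through .deploy.yml.
    deploy:
      marathon:
        image: grafana/grafana
        instances: 1
        cpu: 0.5
        mem: 1024
        healthcheck: /api/health
        public: false
        domain: grafana.example.internal
        env:
          GF_SERVER_ROOT_URL: "https://grafana.example.internal"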

Once .deploy.yml is parsed, ci-declaration triggers a Docker build for the microservice. After the Docker image is created locally on the TeamCity agent, it’s sent to the Brainly Internal Docker Repository. Once uploaded, .deploy.yml is translated into the Marathon API format, and the job signals the presence of either a new microservice or a new version of an existing one. Marathon then orchestrates the Mesos Agents to download the newest Docker image and start the microservice. At the same time, information is sent to an open-source application called Ołówek.
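Schematically, the app definition that ends up at Marathon’s /v2/apps endpoint carries the same information; it is shown here as YAML for readability (Marathon itself accepts JSON), and the service name, image, and sizes are illustrative:

    # Illustrative Marathon app definition derived from a .deploy.yml.
    id: /answers-service
    instances: 4
    cpus: 0.5
    mem: 512
    container:
      type: DOCKER
      docker:
        image: registry.example.internal/answers-service:1.4.2
        network: BRIDGE
    healthChecks:
      - protocol: HTTP
        path: /health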

Ołówek³ [ɔˈwuvɛk] is a Brainly-created application designed to configure Nginx with the applications deployed on Mesos and Marathon. Ołówek works very similarly to consul-template, but since our microservices stack is exceptionally dynamic, we required something faster and more customized to our stack.

After deployment, the service is propagated automatically by Marathon to specific Mesos clusters. As with the Brainly Applications, we decided to provide an auto-scaling mechanism for microservices, so that when traffic is lower there are fewer running microservice instances. This, in turn, allowed us to set up AWS Autoscaling for the Mesos Agents, reducing the resources needed to handle lower traffic and resulting in meaningful cost reductions.

Figure 11. Number of running microservice instances (left Y axis) and corresponding number of Mesos Agents (right Y axis).

Conclusion

Operating at scale comes with immense responsibility and tends to require labor-intensive, customized solutions, since off-the-shelf tools rarely address all of the challenges that arise when deploying at this scale. At Brainly, we’ve learned that in about 80% of cases it’s best to build specific features internally, even if it takes longer. In-house tooling simplifies any future customization, and at Brainly’s global scale that simplicity is an outright necessity.
