Revamping Continuous Integration and Delivery at Conductor
As companies, products and teams evolve, so do development practices. At Conductor we frequently identify areas of opportunity when working on Searchlight. Over the years we have made great strides in the way we develop, test and deliver high quality software to our customers. Rewind to 2015: our customer base was growing rapidly, as was the number of features and enhancements we were building. We had robust Continuous Integration and Delivery processes in place, but they were hard to manage and we knew we could do better. We went back to square one and defined exactly what our goals were for CI and CD:
- Minimize the duration and effort required by each integration episode
- Be able to deliver a product version suitable for release at any moment
We asked ourselves, what can we do to be more successful in achieving these goals? We gathered the team together, collected feedback and came up with specific milestones that we could organize around:
- Adopt a new CI tool that better meets our needs
- Build a testing environment that is easier to manage
- Generate and collect meaningful statistics and metrics
- Create a CI and CD solution for microservices
Adopt a new Continuous Integration tool
We already had a CI tool in place but were ready to adopt a new one that better suited our needs. In a world where a new tool seemingly comes out every week, where should we begin? We started by identifying the requirements, features and functionality that we were interested in. This boiled down to the following:
- Support LDAP and have robust role based access controls that suit the needs of the team
- Must have an on-premise offering
- Integrate with GitHub
- Automatically trigger jobs via SCM hooks
- Display branch changes
- Support for Maven, Cucumber, Rake, etc.
Integration with existing tools
- Integrate with JIRA and Slack
Metrics and monitoring
- Report on test and build history
- Generate custom reports
After evaluating a number of different options, we ended up deciding to go with TeamCity, the CI solution offered by JetBrains. TeamCity offers a ton of functionality out of the box and met all of the requirements we came up with. It was easy to set up and since we already had our quality gates defined we were up and running in no time. Next up, revamping the environments we use for testing.
Build a new test environment
The environments we used at the time were complicated, hard to manage and expensive. They consisted of long-standing reserved instances in AWS, and maintaining them was difficult. Each feature team had their own repository in GitHub tied to one of these environments. When someone opened a pull request, the environment was unavailable while tests ran, and any issues that popped up affected the entire team. This is a very bad thing. Here is an example of what those environments looked like at the time:
We instead decided to containerize everything using Docker, and it was at this point that we coined the term “push button environment”. A PBE is a spot instance that hosts all of the Docker images that make up Searchlight. PBEs make it possible to spin up a production-like environment anywhere, anytime, at a fraction of the cost in both money and time. We built images for each of the components (ZooKeeper, WireMock, Thrift, MySQL) and created tooling to build a containerized version of Searchlight on demand. While it wasn’t an easy feat, the gains were massive. Here is what a PBE looks like:
To further realize their benefits, we consolidated all of the team repositories into one repository and set up a single Continuous Integration pipeline that everyone on the engineering team could use. This greatly simplified things and made it easier to understand the health of the build and troubleshoot issues as they came up. At this point our CI infrastructure looked like this:
And our CI pipeline looked like this:
Generate meaningful statistics and metrics
With the entire engineering team now using TeamCity and revamped environments for pull requests, everything was humming along smoothly. Until, of course, it wasn’t. Consolidating our tooling and infrastructure made it easier for people to work, but when there was an issue, everyone experienced it and wanted it fixed as soon as possible. Issues popped up randomly and we were playing whack-a-mole in an attempt to get things under control.
To help stay ahead of the curve, we set out on a mission to figure out the best way to anticipate issues before they happened and position ourselves to fix things as quickly as possible when they did come up. After some investigation, we realized that pull requests failed for one of three reasons:
For each problem we asked ourselves: what can we do to gain greater visibility into the problem, and what can we do to fix it as soon as possible? We started by creating tools to help us understand how often we were hitting each of these problems. This would let us both establish a baseline and gauge our success at improving things as we moved forward.
To do that we wrote scripts to correlate build failures in TeamCity with pull request and commit history in GitHub and then graph that data over a period of time. By doing this we were able to classify each pull request failure into one of the three issues above. Learning more about the types of failures we were having made it clear where we should be spending our time to have the largest impact.
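The scripts themselves are not shown here, so the following is only a minimal Python sketch of the classification idea. The category names, the log markers in `RULES`, and the `BuildFailure` record are all hypothetical; a real version would pull build logs from TeamCity and pull request data from GitHub rather than take them as plain strings.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical categories and log markers, chosen for illustration only.
RULES = {
    "infrastructure": ("spot request", "instance terminated"),
    "test_framework": ("timed out waiting for element",),
}


@dataclass(frozen=True)
class BuildFailure:
    pull_request: int  # pull request number
    day: str           # ISO date of the failed build
    log_excerpt: str   # failure text taken from the build log


def classify(failure):
    """Return the first category whose marker appears in the log excerpt."""
    text = failure.log_excerpt.lower()
    for category, markers in RULES.items():
        if any(marker in text for marker in markers):
            return category
    return "unclassified"


def failures_per_day(failures):
    """Count failures by (day, category) so they can be graphed over time."""
    counts = Counter()
    for failure in failures:
        counts[(failure.day, classify(failure))] += 1
    return counts
```

Keeping the rules in one table makes it cheap to add a marker whenever an "unclassified" failure turns out to belong to a known category.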
Having that information spread out over a period of time helps to identify how long a particular type of issue has been present and is useful to help correlate possible causes as well. We also set up a simple chart that graphs average PR build time for the last week. Since we want these builds to run as quickly as possible, it is useful to know if that changes. Here’s an example of the graphs we use to keep this in check:
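The weekly average behind a chart like that is simple to compute. This is a sketch under assumptions: the function name and the `(finished_at, duration_minutes)` tuple shape are hypothetical, and the build records would in practice come from the CI server's build history.

```python
from datetime import datetime, timedelta


def average_build_minutes(builds, now, window_days=7):
    """Average duration of builds that finished within the last window_days.

    builds: iterable of (finished_at: datetime, duration_minutes: float).
    Returns None when no build finished inside the window.
    """
    cutoff = now - timedelta(days=window_days)
    recent = [minutes for finished_at, minutes in builds
              if cutoff <= finished_at <= now]
    if not recent:
        return None
    return sum(recent) / len(recent)
```

Plotting this value once a day gives a trend line; a sudden jump is a prompt to look at what changed in the pipeline or the test suite.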
Once we knew how often things were failing and why, we set off to create solutions to help mitigate and solve issues that popped up.
Issues in our test framework were largely the result of flaky UI tests. These tests pass most of the time, but on occasion they fail while waiting for an element to appear or something similar. This was a big time sink for the team, since each pull request takes approximately 1.25 hours to run and more often than not the solution was simply to rerun the test.
To relieve the pressure, we created a new pull request workflow that reruns only the tests that failed; if they succeed on the rerun, the pull request is given a green check in GitHub. In addition, we placed a larger emphasis on tracking and fixing flaky tests. These simple steps alleviated most of the test problems we encountered.
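The core of a rerun-only-failures workflow can be sketched in a few lines of Python. This is an illustration, not the actual workflow: `rerun_failed_tests` and the `run_test` callback are hypothetical names, and the callback stands in for whatever hook the real test runner provides.

```python
def rerun_failed_tests(tests, run_test, max_reruns=1):
    """Run every test once, then rerun only the ones that failed.

    tests: list of test names.
    run_test: callable taking a test name and returning True on success
              (a hypothetical hook into the real test runner).
    Returns (green, flaky, failed): green is True when nothing is still
    failing, flaky lists tests that passed only on a rerun, and failed
    lists tests that never passed.
    """
    failed = [name for name in tests if not run_test(name)]
    flaky = []
    for _ in range(max_reruns):
        if not failed:
            break
        still_failing = []
        for name in failed:
            if run_test(name):
                flaky.append(name)  # passed on rerun: record it for follow-up
            else:
                still_failing.append(name)
        failed = still_failing
    return (not failed, flaky, failed)
```

Returning the flaky list separately matters: the pull request can go green, while the tests that needed a second attempt are still recorded so they can be tracked and fixed.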
Running our test infrastructure on AWS spot instances is very cheap compared to reserved instances (each pull request costs a few dollars) but can be problematic if AWS has an outage, or if the spot market is volatile and we are outbid for instances. It wasn’t uncommon to be outbid during the initial request or after the request was fulfilled and while tests were running. In either case the pull request workflow would fail and teams would have to rerun jobs in TeamCity. We often resorted to increasing the bid price or attempting to provision in different regions. This usually worked in a pinch but wasn’t sustainable.
We came up with two solutions to this problem. The first was to use spot fleets instead of spot instances. Requesting a spot fleet allows you to specify multiple instance types and a bid price, and AWS will then provision an instance of the cheapest type for you automatically. This is far more effective than constantly requesting one instance type or trying to open multiple requests in different regions. Simply switching to fleets greatly improved the stability of our environments.
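To show what "multiple instance types and a bid price" looks like in practice, here is a minimal Python sketch that builds a spot fleet request body. The helper name `build_spot_fleet_config` is hypothetical, and the AMI ID, instance types and role ARN in the usage note are placeholders. The field names follow the EC2 `RequestSpotFleet` API as I understand it, so check them against the current AWS documentation before relying on them.

```python
def build_spot_fleet_config(instance_types, max_price, image_id, fleet_role_arn):
    """Build a spot fleet request with one launch specification per type.

    With the "lowestPrice" allocation strategy, the fleet is fulfilled
    from whichever listed instance type is currently cheapest.
    """
    return {
        "IamFleetRole": fleet_role_arn,
        "TargetCapacity": 1,           # one instance per environment (assumption)
        "SpotPrice": str(max_price),   # the API expects the bid as a string
        "AllocationStrategy": "lowestPrice",
        "LaunchSpecifications": [
            {"ImageId": image_id, "InstanceType": instance_type}
            for instance_type in instance_types
        ],
    }
```

With boto3 installed, the resulting dictionary would be passed as `SpotFleetRequestConfig` to the EC2 client's `request_spot_fleet` call, for example with instance types such as `["m4.large", "c4.large"]` and a placeholder AMI ID.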
The second solution was to build a cloud-agnostic deployment solution to reduce our reliance on AWS. We were already using Kubernetes in our lab at the time and it seemed like a sensible choice. Kubernetes is an open source platform that automates container operations across a cluster of hosts. It allows us to easily deploy and manage containers for pull requests without the overhead of requesting hardware from AWS. It took a while to get everything running, but we now have another option for pull requests if we need it. Doing this work was also a precursor to building a test environment for microservices, which is what we focused on next.
Create a solution for microservices
Once we had resolved many of the day to day low level issues that were affecting our engineering team, it was time to think bigger about improvements that we could make to help us move faster. Naturally, microservices seemed the next sensible choice to empower the team. There is a lot of buzz around this topic right now, but are microservices everything they’re cracked up to be? There are various articles that explain the benefits and pitfalls, but there were clear reasons why it made sense for us.
The Searchlight web application is a large, monolithic application. It is complicated to develop, test and deploy largely because of the dependencies it creates. For example, if a developer makes a change in one part of the application we must test the entire application to ensure that everything is working properly. In addition, monolithic applications can be hard to scale vertically and horizontally. For example, attempting to increase performance in one feature usually requires us to think about the entire application as a whole. These issues affect the overall velocity and speed of our engineers, which in turn slows our ability to roll out new features, enhancements and bug fixes to our customers.
Great, now what?
We decided to use lessons learned from developing Continuous Integration and Delivery for the monolith to help guide our effort and focus on three main principles:
- Define organizational standards for designing, developing and deploying microservices. Defining as much of this as you can up front will help in the long run. For example — should teams be able to use any language they want or will there be one standard? Will feature teams manage CI themselves or will this be a centrally managed function? This is easier said than done and is an ongoing process — start small, see what works and review.
- Build new features as microservices, pull existing features out on an as needed basis. Rather than dive right in and start pulling apart the monolith, we decided to only build things as microservices that make sense outside of the monolith. If while working on a new feature we identify a part of the monolith that would be better served as a microservice, we will go ahead and break that out.
- Use out of the box solutions as much as possible. Open source and paid options generally provide community support whereas custom solutions must be maintained by the team. It’s important to find the right mix that suits the makeup of the team, budget and technologies you are working with.
With those principles in mind we came up with the following plan to develop microservices:
First off, you’ll notice we picked a different CI tool. Not all CI tools are equal, and each is well suited to certain situations. For us, Jenkins just made more sense for microservices, and TeamCity continues to make more sense for monolith development. You will also notice that we are running contract tests. What are those? We want to keep the tests that are run for CI (the left side of the diagram) as low level as possible, and contract tests are intended to verify only that the “contracts” between your microservice and its dependencies are upheld. This can be done using WireMock or by deploying those dependencies and running tests against them. Once the two Jenkins jobs have completed, we end up with a Docker image that is ready to be deployed.
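As a rough illustration of what a contract test checks, here is a minimal Python sketch. It is not the test framework described above: `contract_violations` is a hypothetical helper, and the field names in the usage note are made up. The response dictionary would come from a stubbed dependency (for example a WireMock stub) or from a deployed instance of it.

```python
def contract_violations(response, contract):
    """Compare a dependency's response against the fields a service relies on.

    response: dict parsed from the dependency's reply.
    contract: dict mapping each required field name to its expected type.
    Returns a list of human-readable violations; empty means the contract holds.
    """
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            actual = type(response[field]).__name__
            problems.append(
                f"{field}: expected {expected_type.__name__}, got {actual}"
            )
    return problems
```

A test would then assert that `contract_violations(reply, {"account_id": int, "name": str})` returns an empty list. Extra fields in the reply are deliberately ignored, since the contract covers only what the consuming service depends on.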
As I mentioned earlier, we were already experimenting with Kubernetes for container orchestration, and it made sense for us to use it for microservices as well. We are evaluating tools to help orchestrate continuous deployments (Spinnaker and OpenShift, to name a few), so in the meantime we are using Jenkins in a semi-manual process to deploy to production. The long-term goal is to do fully automated deployments to production in a blue/green fashion, where we run end-to-end, performance and scale tests. We are making progress in this space every day. It has been a challenging journey, but we are already reaping the benefits and learning a lot along the way.
Where we are today
All of the improvements we have made enable us to develop, test and deliver high quality software to our customers faster and more efficiently. We now deploy monolith code on a daily basis and can deploy microservices multiple times per day if we need to. We have streamlined our development process, enabling our software engineers to work faster and with fewer issues. We have checks and balances in place to help anticipate issues before they pop up, understand the effect they have on the team when they are present and gauge our success at resolving them. This empowers our engineering team to spend more time writing code and less time worrying about the infrastructure and tooling that supports it.
The important takeaway from this experience is that internal processes and tooling require frequent inspection and review to make sure the needs of the team, our customers and the company are being met. Software development is an iterative process and it’s important to make sure that as the world around you changes, you adapt to stay ahead of the curve.