The road to continuous web deployment at PlanGrid

Alexei Schiopu
PlanGrid Technology
Apr 7, 2017 · 7 min read

Over the past year, we’ve transformed our web releases from a burdensome, manual process to an efficient, scalable, and stable continuous integration (CI) and deployment (CD) pipeline. Our CI/CD pipeline has enabled us to successfully ship over six hundred releases.

Early 2016
At the beginning of 2016, our web team had little continuous integration, which led to painful releases, declining product quality, and poor development habits. Our development process at the time looked something like this:

  1. The development team creates a feature branch.
  2. The feature is developed.
  3. The feature branch is merged into our development branch along with other new features.
  4. Once a few features are merged, we deploy a release candidate to a testing environment.
  5. Our QA team manually verifies existing functionality is not broken.
  6. Our release team coordinates bug fixes with the appropriate teams.
  7. Another release candidate is cut, and we repeat steps (5) and (6) until we are comfortable going to production.

The first problem with this process: our release dates were unpredictable. The release schedule is critical to our business because other departments, including marketing, sales, and support, rely on it to operate effectively. Having to coordinate manual testing and bug triage made giving accurate estimates even more burdensome.

The releases themselves were also a big risk and were often followed by hot patches. The pressure of completing features coupled with immediate hot patches added a lot of stress to the product team. It also meant that we regularly broke users’ workflows, which is unacceptable.

All of these factors created a sense of fear and led to fewer, larger releases, which in turn meant simple bug fixes didn’t land on production for weeks. Infrequent releases also meant that a developer’s feedback loop between writing code and shipping it to production was essentially nonexistent. Without this loop, it’s difficult to learn and grow as a developer. Not having context around issues on production was problematic because by the time a developer’s change landed in production, they had already spent a month and a half working on something else.

You might wonder, “Didn’t y’all write tests?” We did have integration tests, but failing tests were often false positives, which made the suite completely unreliable. Worst of all, those tests eroded our confidence in automated testing and increased our dependence on manual QA. Confronted with these issues, we acknowledged that something was wrong and started to improve our development and release process in three steps:

  1. Building a weekly release cadence.
  2. Adding integration tests.
  3. Adding end-to-end testing.

Cadence
First, we established a weekly release cadence. Releasing weekly reduced the amount of code we shipped in each release. This matters because every changed line of code carries some risk that something may go wrong; by decreasing the number of lines changed per release, we decreased that risk.

We made some operational changes to make this a reality. Coordinating bug fixes became an urgent task in order to maintain our rigid schedule, and it was all hands on deck to get fixes shipped. We also had no time to run the entire manual QA script, so instead we only tested the areas of the app that had changed that week. To mitigate the increased risk of less testing, we required feature teams to test their changes before merging into the development branch. This kept the development branch stable, meant the release team found fewer bugs during the release cycle, and allowed us to better predict feature release dates.

Integration tests
Automated testing is crucial to quick, successful releases, but our end-to-end testing suite was almost always failing or broken. It spun up the services needed to run the website locally, but since those services were maintained by different teams, it was hard to keep the suite up to date with their frequent changes. The tests were also slow: a suite of 30 tests took at least 30 minutes, most of it spent setting up and tearing down each test case.

To address these issues, we used factories to mock out internal services and remove our dependencies on them (a rough sketch of the factory approach follows the list below). With this change, setting up projects took less than a second and most of the timing issues went away. With speedy tests, we could afford to add more, and we backfilled every feature with integration tests. Now, instead of running 30 tests in 30 minutes, we run more than 170 tests in less than five minutes. With the suite in place, we realized the following benefits:

  • We had a stable way of doing integration tests and began to trust our test suite again.
  • We could do application-wide changes confidently.
  • Developers got quick feedback when we blocked merges for failing tests.
  • Our development environment became much more stable and reliable.
  • QA no longer ran the full test script and instead did a one-hour smoke test prior to release.
  • Release engineers and QA no longer had to spend the entire week shepherding releases and were able to do other work.
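
The post doesn’t show what these factories looked like, so here is a minimal sketch of the idea, assuming a Python stack with pytest and factory_boy; the Project model, its fields, and the test are hypothetical stand-ins rather than our actual code.

```python
import dataclasses

import factory  # factory_boy
import pytest


@dataclasses.dataclass
class Project:
    """Stand-in for a record that would normally come from an internal projects service."""
    id: int
    name: str
    owner_email: str


class ProjectFactory(factory.Factory):
    """Builds Project objects in memory, so tests never call the real service."""
    class Meta:
        model = Project

    id = factory.Sequence(lambda n: n)
    name = factory.Sequence(lambda n: f"Jobsite {n}")
    owner_email = "builder@example.com"


@pytest.fixture
def project():
    # Setup takes microseconds instead of provisioning real backend services.
    return ProjectFactory()


def test_rename_project(project):
    # Exercise application logic against the factory-built object.
    project.name = "Renamed Jobsite"
    assert project.name == "Renamed Jobsite"
```

Because the factories build everything in memory, each test starts from a fresh, fully specified state, which is what let per-test setup drop to well under a second.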

Even with these improvements, the release process still wasn’t 100% automated: QA still ran a one-hour smoke test covering the integration between the web app and other internal services.

End-to-end testing
We wanted to replace this one-hour smoke test with automated end-to-end testing. This was probably one of the hardest pieces to achieve, and I’ll admit we failed at our first attempt. The original idea was to create a repository that would launch all of PlanGrid’s services via Docker Compose. Once everything was up and running, we would run a small set of tests (about six of them) covering the most critical areas of the product. For example, one of the tests was to log in to PlanGrid and set up a project.
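
To make the plan concrete, the orchestration we had in mind looked roughly like the sketch below: bring everything up with Docker Compose, wait for the web app to respond, then run the small end-to-end suite. The compose commands, health-check URL, and test path are assumptions for illustration, not our actual tooling.

```python
import subprocess
import time
import urllib.request


def wait_for(url, timeout=120):
    """Poll a URL until it responds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            urllib.request.urlopen(url, timeout=5)
            return
        except OSError:
            time.sleep(2)
    raise RuntimeError(f"{url} never became healthy")


def main():
    # Launch every service defined in the repository's docker-compose.yml.
    subprocess.run(["docker-compose", "up", "-d"], check=True)
    try:
        wait_for("http://localhost:8080/health")  # assumed web entry point
        # Run the handful of critical-path tests (log in, set up a project, ...).
        subprocess.run(["pytest", "e2e/"], check=True)
    finally:
        subprocess.run(["docker-compose", "down"], check=True)


if __name__ == "__main__":
    main()
```

Keeping the compose definitions in sync with services owned by other teams is exactly where this approach broke down, as described next.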

Our first roadblock was that not all services had Docker images, so we had to bootstrap the missing ones ourselves. The task itself was easy, but outside of end-to-end testing those images weren’t used for development or deployment, which meant we had to update them every time the services changed. Keeping up with all of those changes simply wasn’t feasible.

The second failure point was trying to run or replicate S3 and SQS locally. Our initial effort was unsuccessful, and it was unclear how much more work it would take; we also weren’t sure we wanted to go through this again every time we added another AWS service to our stack. After spending many hours on the project, we decided to cut our losses and take an easier approach: we created a separate web application that used all of the production services underneath. Prior to deploying to production, we’d deploy the changes to this production-like application and run a couple of Selenium tests. This wasn’t the ideal solution, because the tests were affected by production issues and an additional web application had to be kept in sync with production. Despite these drawbacks, however, we eliminated the one-hour smoke test, the last manual piece in the release pipeline. With end-to-end tests in place, the CI/CD pipeline could release a change to production in about 30 minutes instead of weeks.
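
For reference, one of those Selenium checks against the production-like application might look something like the sketch below; the base URL, element selectors, and environment variables are placeholders for illustration, not real markup or endpoints.

```python
import os

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Placeholder base URL for the production-like application.
BASE_URL = os.environ.get("E2E_BASE_URL", "https://web-e2e.example.com")


def test_login_and_create_project():
    driver = webdriver.Chrome()
    try:
        # Log in with a dedicated test account.
        driver.get(BASE_URL + "/login")
        driver.find_element(By.NAME, "email").send_keys(os.environ["E2E_USER"])
        driver.find_element(By.NAME, "password").send_keys(os.environ["E2E_PASSWORD"])
        driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()

        # Create a project and wait for it to appear in the project list.
        WebDriverWait(driver, 30).until(
            EC.element_to_be_clickable((By.ID, "new-project"))
        ).click()
        driver.find_element(By.NAME, "project_name").send_keys("E2E smoke project")
        driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()

        WebDriverWait(driver, 30).until(
            EC.text_to_be_present_in_element((By.ID, "project-list"), "E2E smoke project")
        )
    finally:
        driver.quit()
```

Because these checks run against real services, they cover the same web-to-backend integration that the one-hour manual smoke test used to verify.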

Outcome
To date, we’ve shipped over 600 releases using this pipeline, and here are some of the positive changes we’ve seen:

  • Product quality increased as we reduced (and continue to reduce) the potential harm a release can do. With these safety checks in place, it is no longer possible to take down the entire app by accident. It also gives us the opportunity to make the web client more robust in the future by adding more automated checks, further reducing the risk that comes with releasing changes.
  • Automating the release means we no longer need a dedicated team responsible for releasing. We can keep pace with a growing product team simply by adding more machines to accommodate more releases.
  • We were able to shorten and close the feedback loop between the developer and code in production. Developers now receive immediate feedback if their changes will break other parts of the application.
  • Troubleshooting production issues is easier and often leads to faster resolution. When an issue arises on production, it’s frequently caused by the latest release, and debugging is far easier when that release contains one small change as opposed to many months’ worth of changes.
  • We reduced the cost of mistakes and unforeseen project delays. Previously, missing a release could delay a feature by weeks or even months.
  • Hot patches no longer circumvent the release process. A hot patch is just a regular release: it runs through all of our automated testing, so we can be confident it won’t cause a cascade of further hot patches.
  • Agile and iterative development became possible and unlocked rapid experimentation and iteration of product ideas.

We’ve come a long way and made huge improvements, but we still have challenges; maintaining the CI/CD pipeline is itself a time-consuming effort. Despite that, we’re using our limited resources more efficiently, which sets us up for future growth by making our releases more stable and scalable.
