EXPEDIA GROUP TECHNOLOGY — SOFTWARE

Automating Performance Standards

Or how I learned to stop worrying and love canary releases

Heena Gupta
Expedia Group Technology


“Canary releases: releasing software is too often art, it should be an engineering discipline” — David Farley, Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation

The term canary comes from an old coal-mining practice in which canaries were used for the early detection of airborne toxins. Similarly, a canary release makes it possible to capacity-test a new release in the production environment, with a safe rollback strategy if issues are found.
By slowly ramping up the load, one can monitor and capture metrics about how the new version impacts production. This is an alternative to creating an entirely separate capacity testing environment, and the environment is as production-like as it could be because it is, in fact, production!

Overhead view of tables and decorative fountains
Photo by Abdullah Öğük on Unsplash

Why automate performance monitoring?

Recently, I had an opportunity to build and standardize performance tooling across the Expedia Group™️ Flights team, and it was an awesome learning experience. My goal was to automate the release pipeline and provide continuous reporting that would enable committers to fix defects before a production release. This helps establish performance standards by ensuring that the version deployed to production causes neither degradation nor unexpected improvement. Interestingly, unexpected improvements often point to a defect in how the performance metrics are measured. It also saved the manual effort usually spent on monitoring during the release process and on investigating defects after a production release.

Now, let us discuss the implementation!

“Coding” a release pipeline is the key to controlled and automated releases

For efficient release management, it is crucial to make the release pipeline as flexible and as automated as possible. To accomplish this, the release pipeline is refactored into steps linked to each other. This allows the steps to be re-used and makes the release process both flexible and readable.

The main pipeline could be broken into categories such as:

  1. Build and test: ensures that the current version builds successfully. It involves stages such as unit tests, stress tests, automation and regression tests, and Lighthouse tests to ensure the current version is stable. The stages in this pipeline could be automated.
  2. Deploy to integration: deploys and releases to an integration environment so that changes can be verified in a production-like environment. This pipeline could be triggered once the “Build and test” pipeline has executed successfully. The stages in this pipeline could be automated.
  3. Canary pre-ramp-up: once the “Deploy to integration” pipeline is complete and the changes are verified on integration, the stages needed for canary pre-ramp-up could be triggered manually. This pipeline could involve stages such as deploying and releasing changes on a canary, and deleting or deactivating older stacks used by the previous canary version to save costs.
  4. Canary ramp-up at x%: once the “Canary pre-ramp-up” pipeline completes successfully, changes could be ramped up on the canary at x%. The value of x could be determined from the sample count received by the application, so that the canary gathers enough samples without affecting production traffic adversely.
  5. Canary ramp-up at 2x%: once the canary has been ramped up at x% and monitored successfully, traffic is ramped up further to 2x% and monitoring continues.
  6. Deploy to production: involves full deployment and release to all AWS regions, running smoke tests for each region, and publishing release notes.

Sample main pipeline: each step is itself a pipeline, linked to the next as per the release flow
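
To make the idea of linked, reusable steps concrete, here is a minimal sketch in Python. It is not the tooling used by the team: the `Pipeline` class, the `link` and `release` helpers, and the step names are all hypothetical, and every step simply reports success. The point is only to show how a manual gate (the canary pre-ramp-up) and a failure both stop the chain.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class Pipeline:
    """One reusable step of the release flow (hypothetical model, not a real CI API)."""
    name: str
    run: Callable[[str], bool]            # receives the version, returns True on success
    manual_trigger: bool = False          # True if a person has to start this step
    next_step: Optional["Pipeline"] = None


def link(*steps: Pipeline) -> Pipeline:
    """Chain the steps in the given order and return the first one."""
    for current, following in zip(steps, steps[1:]):
        current.next_step = following
    return steps[0]


def release(first: Pipeline, version: str, approved: Set[str]) -> List[str]:
    """Walk the chain; stop at the first failure or at a manual step nobody approved."""
    completed: List[str] = []
    step: Optional[Pipeline] = first
    while step is not None:
        if step.manual_trigger and step.name not in approved:
            break
        if not step.run(version):
            break
        completed.append(step.name)
        step = step.next_step
    return completed


def always_ok(version: str) -> bool:
    """Placeholder step body (assumption): a real step would build, deploy, or ramp up."""
    return True


main = link(
    Pipeline("build-and-test", always_ok),
    Pipeline("deploy-to-integration", always_ok),
    Pipeline("canary-pre-ramp-up", always_ok, manual_trigger=True),
    Pipeline("canary-ramp-up-x", always_ok),
    Pipeline("canary-ramp-up-2x", always_ok),
    Pipeline("deploy-to-production", always_ok),
)
```

Calling `release(main, "1.2.3", approved=set())` stops after the integration deploy, because nobody has triggered the pre-ramp-up. Passing `approved={"canary-pre-ramp-up"}` lets the chain run through to production.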

Scheduled canary ramp-ups

To gather performance trends every y hours through automated ramp-ups, I added a cron trigger to the main pipeline via a Jenkins™️ job. This Jenkins job identifies the version to pass to the “Canary ramp-up at x%” pipeline and specifies the cron schedule for the trigger.
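
The two responsibilities of that job, choosing a version and supplying a schedule, could be sketched as follows. This is an illustration under assumptions, not the actual Jenkins job: the `Build` record, the rule "latest build verified on integration", and the shape of the returned parameters are all hypothetical. The schedule is a standard five-field cron expression; the sketch only accepts values of y that divide 24, because `*/y` in the hour field is evenly spaced only in that case.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Build:
    """Hypothetical record of a build that went through the earlier pipelines."""
    version: str
    finished_at: datetime
    verified_on_integration: bool


def cron_every_hours(y: int) -> str:
    """Return a five-field cron expression that fires at minute 0 every y hours."""
    if not 1 <= y <= 23 or 24 % y != 0:
        raise ValueError("y must divide 24 so that the runs are evenly spaced")
    return f"0 */{y} * * *"


def pick_version(builds: List[Build]) -> Optional[str]:
    """Assumed rule: ramp up the most recent build that was verified on integration."""
    candidates = [b for b in builds if b.verified_on_integration]
    if not candidates:
        return None
    return max(candidates, key=lambda b: b.finished_at).version


def trigger_parameters(builds: List[Build], y: int) -> Optional[Dict[str, str]]:
    """Bundle what the scheduled trigger needs; None means there is nothing to ramp up."""
    version = pick_version(builds)
    if version is None:
        return None
    return {
        "pipeline": "Canary ramp-up at x%",
        "version": version,
        "cron": cron_every_hours(y),
    }
```

For example, with builds `1.0.0` and `1.0.1` verified and `1.0.2` still unverified, `trigger_parameters(builds, 4)` selects `1.0.1` and the schedule `0 */4 * * *`.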

Monitoring

Once the version is ramped up on canary at x% traffic and has a significant sample count, it is crucial to monitor the performance trends in monitoring tools such as Splunk and Grafana. Create a monitoring dashboard so that ramp-ups can be followed at a glance. To automate monitoring, add alerts in the monitoring tools so that the team is told about possible issues as they appear.
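
Whether a ramp-up reaches a significant sample count depends on how much traffic the application receives. As a rough sketch, the smallest x that yields a target number of samples within one monitoring window could be computed as below. The function name, its parameters, and the 25% safety cap are assumptions made for illustration; the right target and cap depend on the application.

```python
def min_ramp_percent(requests_per_hour: int, window_hours: int,
                     target_samples: int, max_percent: int = 25) -> int:
    """Smallest whole percentage of traffic that gives target_samples in the window.

    All inputs are assumptions supplied by the caller: expected production traffic,
    the length of the monitoring window, and the sample count considered significant.
    """
    total_requests = requests_per_hour * window_hours
    if total_requests <= 0 or target_samples <= 0:
        raise ValueError("traffic, window and target must all be positive")
    # Ceiling division with integers: ceil(100 * target / total), at least 1%.
    percent = max(1, -(-100 * target_samples // total_requests))
    if percent > max_percent:
        raise ValueError(
            f"{percent}% of traffic would be needed, above the {max_percent}% cap; "
            "use a longer window or a smaller target"
        )
    return percent
```

With 50,000 requests per hour, a 4-hour window, and a target of 10,000 samples, `min_ramp_percent(50_000, 4, 10_000)` returns 5. If the required share exceeds the cap, the function raises instead of silently exposing more production traffic to the canary.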

It is important to monitor the trends that the application's performance depends on. The performance metric could be a custom metric or a Lighthouse performance metric such as First Contentful Paint, depending on the project's analysis and requirements. To ensure that the monitored trends are reliable, one could also track the sample count rate to confirm that the monitored traffic is consistent.
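
A check along these lines could sit behind an alert. The sketch below compares the median of a metric (for example First Contentful Paint in milliseconds) between the current production version and the canary, and flags both degradation and unexpected improvement. It first confirms that the sample count per interval is steady. The function names, the 5% tolerance, and the 20% spread allowed in sample counts are illustrative assumptions, not values from the team's dashboards.

```python
from statistics import median
from typing import Sequence


def relative_change(baseline: Sequence[float], canary: Sequence[float]) -> float:
    """Change of the canary median relative to the baseline median (0.10 means +10%)."""
    base = median(baseline)
    return (median(canary) - base) / base


def sample_rate_is_consistent(counts_per_interval: Sequence[int],
                              max_spread: float = 0.2) -> bool:
    """True if every interval's sample count is within max_spread of the mean count."""
    mean = sum(counts_per_interval) / len(counts_per_interval)
    return all(abs(c - mean) <= max_spread * mean for c in counts_per_interval)


def verdict(baseline: Sequence[float], canary: Sequence[float],
            counts_per_interval: Sequence[int], tolerance: float = 0.05) -> str:
    """Classify the canary for a metric where lower values are better."""
    if not sample_rate_is_consistent(counts_per_interval):
        return "unreliable"               # traffic too uneven to trust the trend
    change = relative_change(baseline, canary)
    if change > tolerance:
        return "degradation"
    if change < -tolerance:
        return "unexpected improvement"   # often a measurement defect, worth a look
    return "ok"
```

For instance, `verdict([1000, 1100, 1200], [1300, 1350, 1400], [100, 105, 98])` reports a degradation, while the same call with sample counts of `[100, 20, 150]` reports the trend as unreliable.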

Sample canary performance dashboard for monitoring the application during a canary release

Notifications

Add notifications to keep teams updated about ongoing releases and monitoring status. Make sure to cover pipeline completion, pipeline failure, Jenkins job success, and alerts.
Always remember, communication is the key!
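
One way to keep these four notification types consistent is to route them through a single small helper. The sketch below is hypothetical: the `Event` names mirror the list above, and the `send` callable stands in for whatever chat or e-mail integration the team uses, which is not specified here.

```python
from enum import Enum
from typing import Callable


class Event(Enum):
    """The notification types named above."""
    PIPELINE_COMPLETED = "pipeline completed"
    PIPELINE_FAILED = "pipeline failed"
    JENKINS_JOB_SUCCEEDED = "Jenkins job succeeded"
    ALERT_FIRED = "alert fired"


def format_message(event: Event, subject: str, version: str) -> str:
    """Build one uniform line so every notification reads the same way."""
    return f"[{event.value}] {subject} (version {version})"


class Notifier:
    """Thin wrapper around a send function (assumption: any callable taking a string)."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send

    def notify(self, event: Event, subject: str, version: str) -> None:
        self._send(format_message(event, subject, version))
```

In a test, `send` can simply be `list.append` on an empty list, for example `Notifier(sent.append)`. In a real pipeline it would post to the team's channel.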

Further improvements

This process could be improved further by performing a canary ramp-up on each version and then monitoring its metrics, instead of scheduling canary ramp-ups in batches every y hours. The trends would then be attributable to each individual version.

The addition of automated alerts and automated ramp-ups for continuous monitoring leads to a reduced error rate during reporting. This leads to a more reliable monitoring process for production releases.

Happy automation and alerting!!

Do share your experience with automation on release pipelines for performance monitoring in the comments section!
