How We Automated Canary Analysis for Deployments

Martin Ehmke
Nov 19 · 6 min read

Here’s a look at how we integrated Kayenta into our existing deployment infrastructure, greatly improving the reliability of our deployment pipeline and setting the stage for automated production deployments.

Introduction

Deploying microservices across multiple regions and cloud providers is time-consuming when it is not properly automated. While most companies automate deployment and testing extensively, monitoring during canary deployments is often still a manual task.

In 2018, Kayenta, an open source tool for automated canary analysis, became available as part of the Spinnaker project¹. In this article we describe how we integrated Kayenta into our existing deployment infrastructure, greatly improving the reliability of our deployment pipeline and creating the preconditions for automated production deployments. In addition, to get a more comprehensive picture, we augmented Kayenta with a machine learning-based approach to detect anomalies in error log files.

In a nutshell, Kayenta compares two time series and alerts when the deviation exceeds a certain threshold. We use New Relic Insights as the source for these application metrics, and we contributed the code for this backend to Kayenta as open source³. Support for other backends (Datadog, Prometheus, etc.) is available, too. For general information about automated canary analysis, please refer to the excellent articles by Google¹ and Netflix².

Kayenta integrates with multiple backends
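
To make the idea of "compare two time series and alert on a threshold" concrete, here is a minimal sketch. It is a deliberate simplification and not Kayenta's actual judge, which is more sophisticated; the function names, the sample values, and the 20% threshold are assumptions chosen for illustration.

```python
from statistics import mean


def relative_deviation(baseline, canary):
    """Relative difference between the canary mean and the baseline mean."""
    base_mean = mean(baseline)
    if base_mean == 0:
        # Avoid dividing by zero: identical zero series do not deviate,
        # any non-zero canary mean counts as an unbounded deviation.
        return 0.0 if mean(canary) == 0 else float("inf")
    return (mean(canary) - base_mean) / abs(base_mean)


def judge(baseline, canary, threshold=0.2):
    """Return "pass" if the canary stays within `threshold` of the baseline."""
    deviation = abs(relative_deviation(baseline, canary))
    return "pass" if deviation <= threshold else "fail"


# Hypothetical error rates sampled once per minute from each cluster.
baseline_error_rate = [0.010, 0.012, 0.011, 0.009]
canary_error_rate = [0.011, 0.030, 0.028, 0.031]

print(judge(baseline_error_rate, canary_error_rate))
```

In a real setup the two series would come from the metrics backend, and each metric would typically get its own threshold and direction (for example, only an increase in error rate should fail the canary).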

From CI to CD

At Adobe we have been practicing automation for several years. Besides test automation, automated deployments are one pillar of keeping maintenance overhead under control and maintaining a high level of stability. Over the years our service landscape has evolved into a multi-region, multi-cloud distributed system. Despite our one-button deployments, supervising the careful rollout of a single component to 6+ production data centers can keep an engineer busy for a couple of hours when monitoring has to be done manually. In our most recent optimization initiative, we therefore focused on the deployment and monitoring phase. The public release of Kayenta came at just the right time.

Benefits

As explained in the Google Cloud Blog, adding automated canary analysis has several benefits. For us, the following benefits are the most important ones:

  • Reducing the overhead and eventually increasing the operational velocity of our organization were the initial triggers of our work.

Our solution

Before adding Kayenta to our deployment pipeline we evaluated several options. We eventually decided to integrate standalone Kayenta, because switching to a different deployment tool such as Spinnaker was not an option in our enterprise setup.

Automated canary analysis using blue-green deployment

On our staging and production systems we run a traditional blue-green deployment with canary support. As an incremental approach, we simply replaced the manual monitoring step with automation (step five in the list below):

  1. One cluster of live instances is running the currently active version, referred to below as the baseline.

Canary log analyzer

In addition to time-series analytics, we have configured one of our own custom-built components, the canary log analyzer, as one of the monitoring backends in Kayenta (alongside New Relic). It detects anomalies based on error log output and reports them as metrics. Error log messages that have not occurred during the preceding several weeks are considered anomalies. As with the usual time series, Kayenta compares the anomalies coming from the baseline with the ones coming from the experiment.
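
The core rule described above (a message is an anomaly if it was not seen in the look-back window) can be sketched as follows. This is not the actual canary log analyzer, which is machine learning-based; it is a plain set-membership illustration, and the normalization rule, function names, and log lines are all assumptions.

```python
import re
from collections import Counter


def signature(message):
    """Normalize a message so variable parts (numbers, hex ids) do not
    make an otherwise known message look new."""
    return re.sub(r"\b(0x[0-9a-fA-F]+|\d+)\b", "<n>", message).strip().lower()


def count_anomalies(history, recent):
    """Count recent messages whose signature never occurs in `history`."""
    known = {signature(m) for m in history}
    return Counter(signature(m) for m in recent if signature(m) not in known)


# Hypothetical error log lines from the look-back window and from both clusters.
history = ["Timeout after 3000 ms", "Connection reset by peer"]
baseline_logs = ["Timeout after 2500 ms"]
experiment_logs = [
    "Timeout after 2700 ms",
    "NullPointerException in OrderHandler",
    "NullPointerException in OrderHandler",
]

baseline_anomalies = sum(count_anomalies(history, baseline_logs).values())
experiment_anomalies = sum(count_anomalies(history, experiment_logs).values())
print(baseline_anomalies, experiment_anomalies)
```

The two counts are what would be reported as metrics, so that the baseline and experiment values can be compared like any other time series.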

Learnings

Despite the Google/Netflix recommendation to use dedicated experiment and baseline instances alongside the main production cluster, we decided to use our existing blue-green setup and add the monitoring automation on top of it. This allowed us to quickly evaluate whether the approach would work for us without reengineering our deployment process. Clearly, we underestimated the importance of dedicated monitoring instances. Let us look at what we learned in detail:

  • Having individual experiment and baseline clusters prevents metric degradation due to warm-up when new instances are scaled up as part of the cut-over process. This helps because the set of monitored instances will not change.

A metric being zero most of the time

Additional learnings

  • Statistical relevance: As always when analyzing metrics, it is important to have enough relevant traffic. On our staging environments we were not able to use metrics with a high standard deviation, such as database response time. Only the more general metrics, such as error rate and transaction time, produced reliable results. Clearly, running a small number of tests against the platform was not enough to get accurate results. We had the best results on our production systems, with several thousand requests per minute per instance.
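
One way to reason about the traffic requirement above is the standard error of the mean, which shrinks with the square root of the sample count. The sketch below flags a metric as unreliable when its relative standard error is too high. The 5% cutoff, the function names, and the sample values are assumptions for illustration, not something we configured in Kayenta.

```python
from math import sqrt
from statistics import mean, stdev


def relative_standard_error(samples):
    """Standard error of the mean, expressed as a fraction of the mean."""
    if len(samples) < 2:
        return float("inf")  # a single sample says nothing about spread
    m = mean(samples)
    if m == 0:
        return float("inf")  # relative error is undefined for a zero mean
    return stdev(samples) / sqrt(len(samples)) / abs(m)


def is_reliable(samples, max_rse=0.05):
    """Treat a metric as usable only if its mean is estimated precisely enough."""
    return relative_standard_error(samples) <= max_rse


# Hypothetical database response times (ms) from a handful of staging tests.
staging_db_time = [12, 80, 15, 200, 9]
# Hypothetical transaction times (ms) from a busy production instance.
production_tx_time = [100, 102, 98, 101, 99] * 20

print(is_reliable(staging_db_time), is_reliable(production_tx_time))
```

With few, widely scattered samples the estimate of the mean is too noisy to compare two clusters, which matches what we observed on staging.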

Conclusion

Out of all the technical learnings, the most important one is the need for dedicated monitoring instances for the experiment cluster and baseline cluster. Kayenta is designed to monitor metrics quite sensitively and therefore will notice minimal timing deviations in either cluster.

Investing in automated canary analysis was a risk, with no guarantee that it would pay off in our scenario. One year later we can call it a success. It enabled us to confidently automate deployments all the way to production, with more consistent and more detailed monitoring results.


Adobe Tech Blog

News, updates, and thoughts related to Adobe, developers, and technology.

Written by Martin Ehmke, Software Engineer @Adobe, Hamburg, Germany.