Here’s a look at how we integrated Kayenta into our existing deployment infrastructure, greatly improving the reliability of our deployment pipeline and setting the stage for automated production deployments.
Deploying microservices spread across multiple regions and cloud providers is a time-consuming task when not automated properly. While most companies practice deployment and test automation extensively, monitoring during canary deployments often remains a manual task.
In 2018 Kayenta, an open source tool for automated canary analysis, became available as part of the Spinnaker project¹. In this article we describe how we integrated Kayenta into our existing deployment infrastructure, greatly improving the reliability of our deployment pipeline and creating the preconditions for automated production deployments. Moreover, to get a more comprehensive picture, we augmented Kayenta with a machine learning-based approach to detecting anomalies in error log files.
In a nutshell, Kayenta compares two time series and alerts when the deviation between them exceeds a certain threshold. As the source for these application metrics we use New Relic Insights; we contributed the code for this backend to Kayenta as open source³. Support for other backends (Datadog, Prometheus, etc.) is available as well. For general information about automated canary analysis, please refer to the excellent articles by Google¹ and Netflix².
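Kayenta's default judge uses a nonparametric Mann-Whitney U test to decide whether the experiment's metric distribution differs from the baseline's. The sketch below illustrates the core idea only; the thresholds and the "Pass/High/Low" classification are simplified for this example and are not Kayenta's actual configuration values.

```python
def mann_whitney_u(baseline, experiment):
    # Count, for every (experiment, baseline) pair, whether the experiment
    # value exceeds the baseline value; ties count as 0.5.
    u = 0.0
    for e in experiment:
        for b in baseline:
            if e > b:
                u += 1.0
            elif e == b:
                u += 0.5
    return u

def classify(baseline, experiment, low=0.25, high=0.75):
    """Classify the experiment series relative to the baseline.

    low/high are illustrative cut-offs on the normalized U statistic,
    not values taken from Kayenta.
    """
    u = mann_whitney_u(baseline, experiment)
    ratio = u / (len(baseline) * len(experiment))  # normalized to [0, 1]
    if ratio > high:
        return "High"   # experiment metric significantly higher
    if ratio < low:
        return "Low"    # experiment metric significantly lower
    return "Pass"

print(classify([1, 2, 3, 4, 5], [10, 11, 12, 13, 14]))  # -> High
print(classify([1, 2, 3], [1, 2, 3]))                    # -> Pass
```

For a metric like error rate, a "High" result on the experiment side would count against the canary, while "Pass" means the two clusters are statistically indistinguishable for that metric.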
From CI to CD
At Adobe we have been practicing automation for several years. Besides test automation, automated deployments are one pillar of keeping maintenance overhead under control and maintaining a high level of stability. Over the years our service landscape has evolved into a multi-region, multi-cloud distributed system. Despite our one-button deployments, supervising the careful rollout of a single component to 6+ production data centers can keep an engineer busy for a couple of hours when monitoring has to be done manually. In our most recent optimization initiative, we therefore focused on the deployment and monitoring phase. The public release of Kayenta came just at the right time.
As explained in the Google Cloud Blog, automated canary analysis brings a number of benefits. For us, the following are the most important:
- Reducing overhead and ultimately increasing the operational velocity of our organization were the initial triggers of our work.
- Accounting for human error added another dimension for us: the complexity of mature services that have been in production for years requires engineers who understand the many facets of a service during deployments. Similar to automated tests, monitoring automation gives developers a tool to “encode their expectations” about the statistical variance of certain metrics. In our setup, the Kayenta configuration is kept as part of the code in the Git repositories, which keeps everything easily maintainable in one place. This also fits our strategy of enabling development teams to own their services comprehensively.
- Automating the decision to promote or roll back allows advanced metrics to be considered; going through them manually in every deployment would simply be too tedious. Likewise, decisions are clearly defined and documented by the metrics and analytics they are based on.
- In addition to the time series supported by Kayenta, we have added an error log analyzer, which looks for log file anomalies in the canary. It serves to detect error messages that did not appear before. This could not be done earlier because the log volume is simply too high for manual browsing.
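Since our Kayenta configuration lives next to the service code in Git, it is reviewed and versioned like any other change. The fragment below sketches what such a canary configuration might look like; the structure is loosely inspired by Kayenta's canary-config JSON, but the metric name, query, and weights shown here are made up for illustration.

```python
import json

# Illustrative canary config kept in the service's Git repository.
# Field names are simplified; consult Kayenta's canary-config schema
# for the real structure.
canary_config = {
    "name": "my-service-canary",          # hypothetical service name
    "metrics": [
        {
            "name": "error_rate",
            "query": {
                "type": "newrelic",
                # Placeholder query string, not real NRQL:
                "q": "errors per minute for my-service",
            },
            "groups": ["errors"],
        },
    ],
    "classifier": {
        "groupWeights": {"errors": 100},  # all weight on the errors group
    },
}

# Serializing to JSON keeps the config diff-able and reviewable in Git.
print(json.dumps(canary_config, indent=2))
```

Because the configuration travels with the code, a pull request that changes a service's performance profile can adjust the canary thresholds in the same review.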
Before adding Kayenta to our deployment pipeline we evaluated several options. We eventually decided to integrate standalone Kayenta, because switching to different deployment tooling such as Spinnaker was not an option in our enterprise setup.
Automated canary analysis using blue-green deployment
On our staging and production systems we run a traditional blue-green deployment with canary support. As an incremental approach, we simply replaced the manual monitoring step with automation (step five in the list below):
- One cluster of live instances runs the currently active version, referred to below as the baseline.
- A second cluster is scaled up with a small number of instances, referred to below as the experiment.
- After the first small batch of instances is up, we run a sanity test suite. These tests are intended to cover our critical paths.
- If the tests pass, traffic is shifted to the experiment instances.
- During the traffic shift, Kayenta regularly analyzes New Relic time series and error logs.
- If the experiment performs well, traffic on the experiment cluster is increased step by step and finally cut over completely.
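The steps above can be sketched as a simple rollout loop. All helper functions and the traffic-step percentages below are hypothetical stand-ins for our internal tooling, not real APIs.

```python
# Placeholder hooks; real implementations would call the cloud provider,
# the test runner, and the Kayenta REST API.
def scale_up(cluster, version, instances):
    pass

def run_sanity_tests(cluster):
    return True          # stubbed: assume the sanity suite passes

def shift_traffic(experiment_pct):
    pass

def run_canary_analysis(baseline, experiment):
    return "Pass"        # stubbed: assume Kayenta judges the canary healthy

def rollback():
    pass

def promote():
    pass

# Illustrative traffic ramp for the experiment cluster (percent).
TRAFFIC_STEPS = [5, 25, 50, 100]

def deploy(new_version):
    scale_up("experiment", new_version, instances=2)   # step 2
    if not run_sanity_tests("experiment"):             # step 3
        rollback()
        return False
    for pct in TRAFFIC_STEPS:                          # steps 4 and 6
        shift_traffic(experiment_pct=pct)
        # Step 5: automated canary analysis replaces manual monitoring.
        if run_canary_analysis("baseline", "experiment") != "Pass":
            rollback()
            return False
    promote()
    return True

print(deploy("v2"))  # -> True with the stubbed hooks above
```

The key change from the manual process is inside the loop: the engineer's dashboard-watching is replaced by a canary-analysis call whose result gates each traffic increase.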
Canary log analyzer
In addition to the time-series analytics, we have configured one of our own custom-built components, the canary log analyzer, as a monitoring backend in Kayenta (besides New Relic). It essentially detects anomalies in the error log output and reports them as metrics. Error log messages that did not occur during the preceding several weeks are considered anomalies. As with the usual time series, Kayenta compares anomalies coming from the baseline with those coming from the experiment.
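The core of such a log analyzer can be sketched in a few lines: normalize each error line into a stable signature, then flag any canary signature that never occurred in the historical window. The normalization rules and sample messages below are illustrative, not our production implementation.

```python
import re

def signature(line):
    # Strip volatile parts (hex ids, numbers) so recurring errors with
    # different timestamps, counts, or ids map to one stable signature.
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    line = re.sub(r"\d+", "<N>", line)
    return line.strip()

def find_anomalies(historical_lines, canary_lines):
    # Anything whose signature was never seen historically is an anomaly.
    known = {signature(line) for line in historical_lines}
    return [line for line in canary_lines if signature(line) not in known]

history = [
    "Timeout after 5000 ms on shard 3",
    "Connection reset by peer",
]
canary = [
    "Timeout after 7000 ms on shard 1",       # known pattern, ignored
    "NullPointerException in OrderHandler",   # new pattern -> anomaly
]
print(find_anomalies(history, canary))
# -> ['NullPointerException in OrderHandler']
```

Reporting the anomaly count per interval as a metric lets Kayenta compare baseline and experiment with the same machinery it uses for New Relic time series.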
Despite the Google/Netflix recommendation to use dedicated experiment and baseline instances alongside the main production cluster, we decided to simply use our blue-green setup and add the monitoring automation on top of it. This allowed us to quickly find out whether the approach would work for us without reengineering our deployment process. Clearly, we underestimated the importance of dedicated monitoring instances. Let us look at the learnings in detail:
- Having dedicated experiment and baseline clusters prevents metric degradation caused by warm-up when new instances are scaled up as part of the cut-over process, because the set of monitored instances does not change.
- Availability zone (AZ) affinity on AWS: In setups deployed across several AZs, it is important to pick baseline and experiment instances from the same AZs. While AZ-related timing differences may not be visible to a human in a plain dashboard, Kayenta inspects the time series in much more detail and will notice them.
- Low-frequency metrics: Comparing low-frequency metrics between two very differently sized sets of hosts corrupts comparability, specifically when the chance of seeing a non-zero number in a given time bucket is low on the smaller set. For example, assume 100 requests per second across all instances with an error rate of one percent. When monitoring a blue-green deployment with 99 baseline instances and one canary instance in one-second intervals, we would expect a continuous error rate of one percent on the baseline side but only one error every 100 seconds on the experiment side, so the error rate on the canary side would be zero in most one-second intervals. In contrast, with two equally sized sets of monitored instances we could simply compare absolute error counts.
- Statistical relevance: As always when analyzing metrics, it is important to have relevant traffic. On our staging environments we were not able to use metrics with a high standard deviation, such as database response time; only the more general metrics like error rate and transaction time produced reliable results. Clearly, running a small number of tests against the platform was not enough to get accurate results. We had the best results on our production systems, with several thousand requests per minute per instance.
- Tweaking metrics is time-consuming. There are, of course, certain metrics to start with (error rate, New Relic Apdex, transaction time), but Kayenta plays to its strengths when development teams actively contribute to the metrics configurations and maintain them themselves.
- Operating Kayenta as a standalone service works very well once you understand how it works internally.
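The low-frequency arithmetic in the list above is worth making explicit: with 100 requests per second spread evenly across 100 instances and a one-percent error rate, the 99-instance baseline produces roughly one error per second while the single canary instance produces one error per 100 seconds.

```python
# Worked example for the low-frequency-metrics learning: 100 req/s across
# the fleet, 1% error rate, 99 baseline instances vs. 1 canary instance.
total_rps = 100.0
error_rate = 0.01
baseline_instances, canary_instances = 99, 1

# With even load balancing, each instance serves 1 req/s.
per_instance_rps = total_rps / (baseline_instances + canary_instances)

baseline_errors_per_sec = per_instance_rps * baseline_instances * error_rate
canary_errors_per_sec = per_instance_rps * canary_instances * error_rate

print(round(baseline_errors_per_sec, 2))  # 0.99 -> ~one error per second
print(round(canary_errors_per_sec, 2))    # 0.01 -> one error per 100 seconds
```

In one-second buckets the canary therefore reports zero errors about 99% of the time, which is why the error-rate series of a single canary instance cannot be compared directly against a 99-instance baseline.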
Of all the technical learnings, the most important is the need for dedicated monitoring instances for the experiment and baseline clusters. Kayenta is designed to monitor metrics quite sensitively and will therefore notice even minimal timing deviations in either cluster.
Investing in automated canary analysis was a risk, with no guarantee that it would pay off in our scenario. One year later, we can call it a success: it enabled us to confidently automate deployments all the way to production, with more consistent and more detailed monitoring results.