Canary release with Cloudflare Workers

Published in Boozt Tech · Sep 16, 2020

By Vytautas Verseckas, Back End Developer

We at Boozt aim to ensure the best shopping experience for our customers 24/7. This includes providing (almost) immediate access to any improvements and updates our team is working on daily.

We ensure this by following Continuous Integration and Continuous Deployment (CI/CD) best practices.

Our developers merge changes to the main branch as often as possible, and our build process relies on automated tests to make sure the application is not broken. Further down the line, builds are assigned a version number, and our deployment process ensures that every build is successfully deployed to our web servers and served to customers as soon as possible.

Nevertheless, deploying to production is risky. Despite all the strategies we apply, a new feature can easily introduce unforeseen bugs and, in turn, affect the user experience in negative and even critical ways.

Why Canary release?

The term canary release actually relates to the saying “like a canary in the coal mine”, an allusion to the caged canaries (birds) that miners would carry down into the mine tunnels with them. The technique was in use from the beginning of the 20th century and was replaced by electronic detectors only around 30 years ago. Canaries are more sensitive to toxic gases than humans, so miners used them as early detectors: one bird falling ill due to poisoning could save multiple human lives.

[image: https://upload.wikimedia.org/wikipedia/commons/6/6a/Revival_cage.jpg, source https://commons.wikimedia.org/wiki/File:Revival_cage.jpg]

A similar approach, used in canary release deployment, helps us detect potential bugs and issues before they disrupt our service for all visitors. This greatly reduces the risk of introducing a new version and is critical in a fast-paced CI/CD environment, where tens of releases every day are the norm.

How should it be done?

The principle of canary release deployment is pretty straightforward and comes down to several steps applied to our traffic during the time when a new version of our web application is being released:

1. Before a new release is deployed, all of our visitors are served one version of the web application; for the sake of simplicity, let’s call it v1.

2. A new version, let’s call it v2, is built and deployed by our CI/CD pipeline, which triggers the canary release process.

3. During the canary release process, we select a small portion of our visitors and serve them the newest version (v2) of our web application, while the rest of our visitors are served the previous version (v1).

4. After the canary release process, we start serving v2 to all of our visitors; in case of critical issues, we have the option to immediately roll back to v1 at any time during the canary process.

Before we implement this process we need to make several small but crucial decisions which affect the design of our canary release solution.

First and foremost: don’t do canary releases if you are not confident in your system’s monitoring. You must trust your monitoring fully and make sure it is sensitive enough to track and analyze new exceptions and warnings in real time.

Second, we need to decide the selection criteria for our canary visitors. This decision should be based on the specifics of the system and can range from serving a specific focus group to randomly selecting a percentage of visitors, as we normally do. In the latter case, the canary percentage should be evaluated against your typical visitor load; we evaluated 10% to be optimal for us.

Third, we need to decide the duration of the canary release process. This decision also depends on your system monitoring and CI/CD pipeline specifics. In our case, we estimated that 10 minutes, plus 2 minutes extra, should be sufficient to identify whether a system-breaking change was introduced by the deployed version and to roll all our users back to the previous version in case of failure.

Last but not least — we need to choose the right tools for our infrastructure.

Release versioning in the backend

As already mentioned, our CI/CD pipeline builds our web application whenever a new commit appears on the master branch in git. This process is triggered “manually” by a developer’s commit, but the rest happens automatically. During this process, the build is assigned a version number and distributed to our web servers.

Our web servers host several versions of the application side by side in a path containing the version number. The newest version is always symlinked as the current version.

This allows us to employ a nice feature of nginx web server configuration: we can use the value of an HTTP header to serve the specified version for each request.

This, however, means that our web server expects to receive this header, so in turn we need to make sure that every request carries the X-Release-Version header.
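As an illustration, the relevant nginx configuration could look roughly like the sketch below. The paths, the release directory layout and the map rule are assumptions, not our exact configuration:

```nginx
# Hypothetical sketch: pick the document root based on the
# X-Release-Version header; fall back to the "current" symlink.
map $http_x_release_version $release_root {
    default     /var/www/app/current;
    "~^v\d+$"   /var/www/app/releases/$http_x_release_version;
}

server {
    listen 80;
    server_name example.com;
    root $release_root;
}
```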

Canary release with Cloudflare Workers

We use Cloudflare services in our infrastructure for DDoS protection, DNS and caching. One of the nice and useful products in the Cloudflare stack is their serverless application platform, Cloudflare Workers. Workers let us put JavaScript-based service workers between the visitor and our application and implement all kinds of additional logic for processing our visitors’ requests. The main benefit of Workers for us is the ability to control the caching of our application’s responses with programmable logic. But in general, Workers are very handy for modifying both requests and responses, which makes them a good candidate for implementing the canary release process as well.

Implementing the process requires us to be able to identify the current state and add a header with the appropriate application version at any stage of the canary process: before, during and after.

To enable that, we need to persist enough version metadata to identify the current and previous release versions, and their release times, in order to handle the state of the canary process over time. This could be achieved in more than one way, but since we are doing it in Cloudflare Workers, it should preferably be a data storage mechanism “native” to the Workers runtime environment, to avoid any potential access and latency issues.

Cloudflare offers a key-value store called Workers KV specifically for that — to be used as a low-latency data store for Workers. This seemed like an obvious choice for our needs.

A nice thing about Workers KV is that we can interact with it via the Cloudflare API from other parts of our infrastructure. This allows us to keep an up-to-date list of our application version metadata in Workers KV, push version metadata to the list as a new version is deployed, and remove a version from the list in case of a rollback.
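For illustration, a deployment step could push the updated version list with a single call to the Workers KV API. The sketch below is an assumption of how that might look; the VERSIONS key, the metadata shape and the environment variable names are hypothetical, and a Node 18+ style global fetch is assumed:

```javascript
// Hypothetical CI step: push the updated version list to Workers KV.
// ACCOUNT_ID, NAMESPACE_ID, CF_API_TOKEN and the "VERSIONS" key are assumptions.
const versions = [
  { version: "v42", releasedAt: "2020-09-16T10:00:00Z" }, // newest first
  { version: "v41", releasedAt: "2020-09-16T08:30:00Z" },
];

async function pushVersionList() {
  const url =
    `https://api.cloudflare.com/client/v4/accounts/${process.env.ACCOUNT_ID}` +
    `/storage/kv/namespaces/${process.env.NAMESPACE_ID}/values/VERSIONS`;

  const response = await fetch(url, {
    method: "PUT",
    headers: {
      Authorization: `Bearer ${process.env.CF_API_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(versions),
  });

  if (!response.ok) {
    throw new Error(`KV write failed: ${response.status}`);
  }
}

pushVersionList().catch((err) => {
  console.error(err);
  process.exit(1);
});
```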

Having the version metadata list immediately available to our Cloudflare Worker lets us easily identify what the current release version is and, since we know its release timestamp, whether we are currently in the “during” state of the canary process. This allows us to add the proper release version header to every request at any time.

During the canary release time, if a visitor doesn’t have a cookie, we generate a random integer between 0 and 100 and check whether it is less than or equal to 10, which over many requests assigns approximately 10% of our visitors. Depending on the outcome, we set a short-lived cookie with the assigned version number and an expiration time equal to our planned canary time. After that first request, all of the user’s subsequent requests carry the assigned version number in the cookie.
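Putting it together, the routing logic inside the Worker could look roughly like this sketch. The VERSIONS KV binding, the cookie name, the 12-minute window and the metadata shape are assumptions based on the description above, not our exact code:

```javascript
// Minimal sketch of the canary routing described above.
const CANARY_PERCENT = 10;
const CANARY_DURATION_MS = 12 * 60 * 1000; // 10 min canary + 2 min rollback buffer
const COOKIE_NAME = "release_version";

addEventListener("fetch", (event) => {
  event.respondWith(handleRequest(event.request));
});

async function handleRequest(request) {
  // Version metadata list, newest first: [{ version, releasedAt }, ...]
  const stored = await VERSIONS.get("VERSIONS");
  const [current, previous] = stored ? JSON.parse(stored) : [];
  if (!current) return fetch(request); // no metadata: pass the request through

  const elapsed = Date.now() - Date.parse(current.releasedAt);
  const inCanary = previous && elapsed < CANARY_DURATION_MS;

  let version = current.version;
  let setCookie = false;

  if (inCanary) {
    const match = (request.headers.get("Cookie") || "").match(
      new RegExp(`${COOKIE_NAME}=([^;]+)`)
    );
    if (match) {
      version = match[1];
    } else {
      // As described above: roughly 10% of first-time visitors get the new version.
      const roll = Math.floor(Math.random() * 101); // integer between 0 and 100
      version = roll <= CANARY_PERCENT ? current.version : previous.version;
      setCookie = true;
    }
  }

  const originRequest = new Request(request);
  originRequest.headers.set("X-Release-Version", version);
  const response = await fetch(originRequest);

  if (setCookie) {
    const withCookie = new Response(response.body, response);
    withCookie.headers.append(
      "Set-Cookie",
      `${COOKIE_NAME}=${version}; Max-Age=${CANARY_DURATION_MS / 1000}; Path=/`
    );
    return withCookie;
  }
  return response;
}
```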

Before and after the canary release time, our time condition fails, so we always fetch requests with the version that is current at that time. Once this is implemented, we can visualize our traffic, broken down by version, in our monitoring system.

Unfortunately, as you can see, we expect 10% of the traffic to use the current version of our application, but in reality up to 80% of the traffic was served the current version during the canary time.

This left us deeply puzzled.

After spending quite some time trying to figure out the issue, we eventually introduced additional logging functionality in our Cloudflare Worker, so we could push log information to Google Cloud Logging. Even after spending time cleaning up various exceptions in other parts of our Worker, we were still logging a large number of exceptions, which we eventually pinpointed to none other than reading the version list from Workers KV.

To our big surprise, we were getting a cryptic error message saying “Error: HTTP GET request failed: 403 Forbidden”.

Since we couldn’t investigate any deeper into Cloudflare Workers runtime, we started looking for alternatives.

Secret variables in Cloudflare Workers

Initially, we started to contemplate how we could use Environment variables to “cache” current and previous version metadata for our canary versioning logic to use in our Worker.

Environment variables are very straightforward to use: once they have been defined in the Worker configuration, you can use them as global constants inside the Worker scope. However, this would require us to dynamically generate configuration files and re-deploy our Worker project on every new version released. Not only did the generation of configuration files seem inconvenient, it was also potentially risky: someone could overwrite the correct version values while deploying the Worker.
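For reference, this is roughly how such plain-text variables are declared in a Worker’s wrangler.toml; the variable names and values below are purely illustrative:

```toml
# Hypothetical wrangler.toml excerpt: plain-text variables become
# global constants (CURRENT_VERSION, PREVIOUS_VERSION) inside the Worker.
[vars]
CURRENT_VERSION  = '{"version":"v42","releasedAt":"2020-09-16T10:00:00Z"}'
PREVIOUS_VERSION = '{"version":"v41","releasedAt":"2020-09-16T08:30:00Z"}'
```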

However, we noticed that, in contrast to Environment variables, another class of variables, so-called Secrets, was available to Workers. We found out that they work similarly to Environment variables, but instead of setting them via configuration files, we could set their values via a Cloudflare API call. This was convincing enough for us to try this option out.
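A deployment step could then update these Secrets through the Workers secrets endpoint. The sketch below is an assumption of how that call might look; the script name, variable names and token handling are illustrative, with a Node 18+ style global fetch assumed:

```javascript
// Hypothetical CI step: store current/previous version metadata as Worker Secrets.
// ACCOUNT_ID, SCRIPT_NAME and CF_API_TOKEN are assumptions.
async function putSecret(name, value) {
  const url =
    `https://api.cloudflare.com/client/v4/accounts/${process.env.ACCOUNT_ID}` +
    `/workers/scripts/${process.env.SCRIPT_NAME}/secrets`;

  const response = await fetch(url, {
    method: "PUT",
    headers: {
      Authorization: `Bearer ${process.env.CF_API_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ name, text: value, type: "secret_text" }),
  });

  if (!response.ok) {
    throw new Error(`Secret update failed: ${response.status}`);
  }
}

async function main() {
  await putSecret(
    "CURRENT_VERSION",
    JSON.stringify({ version: "v42", releasedAt: "2020-09-16T10:00:00Z" })
  );
  await putSecret(
    "PREVIOUS_VERSION",
    JSON.stringify({ version: "v41", releasedAt: "2020-09-16T08:30:00Z" })
  );
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```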

We decided to keep using Workers KV to store and manage the list of versions, but additionally we “cache” current and previous version metadata into Secret variables on every release in our CI/CD pipeline.

Of course, we had to adapt our Worker logic to use the Secret variables and fall back to Workers KV in case those variables did not exist for one reason or another.
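In the Worker, that adaptation can be as simple as preferring the Secret-backed globals and only falling back to KV when they are missing. A minimal sketch, using the same hypothetical names as above:

```javascript
// Sketch: prefer version metadata "cached" in Secret variables,
// fall back to the Workers KV list if the Secrets are unavailable.
async function getVersionMetadata() {
  if (
    typeof CURRENT_VERSION !== "undefined" &&
    typeof PREVIOUS_VERSION !== "undefined"
  ) {
    return {
      current: JSON.parse(CURRENT_VERSION),
      previous: JSON.parse(PREVIOUS_VERSION),
    };
  }

  // Fallback: read the full version list from Workers KV.
  const stored = await VERSIONS.get("VERSIONS");
  const [current, previous] = stored ? JSON.parse(stored) : [];
  return { current, previous };
}
```

The handleRequest sketch shown earlier would then call getVersionMetadata() instead of reading the list from KV directly.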

Luckily for us, this immediately made our canary release stable, serving the newest version to 10% of our visitors. Unfortunately, this was true for one release only.

Whenever the next release happened in our CI/CD pipeline, our Secret variables were pushed to Cloudflare successfully, but according to our monitoring the versions wouldn’t change. As soon as we started debugging by adding a couple of logging statements and deployed our Worker to production, the canary process immediately triggered and started working.

This could only mean one thing: fresh Secret variable values are loaded into the Cloudflare Workers runtime only on cold start, when the Worker is loaded for execution for the first time. This forces us to redeploy our Worker every time a new application version is released.
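In the pipeline, that simply means one more step at the end of each release; at the time of writing, wrangler’s publish command was the straightforward way to do it (the snippet below is a hypothetical CI step, not our exact setup):

```bash
# Hypothetical CI step: redeploy the Worker after pushing new Secret values,
# forcing a cold start so the runtime picks up the fresh Secrets.
wrangler publish
```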

Conclusions

Even though Cloudflare Workers KV is advertised as a low-latency global key-value store, its eventual consistency and apparent instability were causing errors in our canary versioning logic, and a large chunk of our visitor traffic ended up defaulting to the current version. This is the only hypothesis we found believable enough to explain why up to 80% of the traffic hit the current version during the canary time, as opposed to the rather stable 10% when using Secret variables.

Although the primary purpose of Secret variables is different by design, we found them to be a very handy option for our case. This means that Cloudflare Workers is a viable option for implementing canary releases, especially if you already use Workers for other purposes.

If you enjoyed this article, and want to read more great stories from the Boozt platform team, be sure to subscribe to the Boozt Tech publication!

Or perhaps you’re interested in joining our platform team? Then check out our careers page.
