Risk-free feature toggling with Unleash and Keptn

Jürgen Etzlstorfer
Published in keptn · 8 min read · Mar 3, 2020

When it comes to bringing new features to your users, you spend your time on designing, developing, and testing the features. However, customer value is only created once those features are released to your end-users. Naturally, you want to avoid any issues when releasing new features. Therefore, the principles of progressive delivery call for deployment and release methods such as blue/green deployments and canary releases. An even faster approach is feature toggles, which enable you to deploy new code into production while hiding it from users until the point in time when you want to release it. However, each new feature still brings the risk of unintended side-effects that can break production.

With Keptn we are solving this challenge by managing the safe release of feature toggles through observability. When Keptn is notified about an issue with a toggled feature flag, it automatically toggles the feature flag to mitigate the impact of the issue. Read on to learn more about feature toggle use cases and how Keptn integrates with tools like Unleash to make feature toggling risk-free!

What is a feature toggle?

A feature toggle is a simple way of decoupling the process of deploying code from the releasing of a new feature. Feature toggles are essentially a sophisticated “if” statement in your code that can be externally controlled. With feature toggles it’s possible to have a new feature included in your code base while hiding it from select end-users. For example, you might enable a feature only for your internal users. You can additionally control exactly when you release a new feature, and a new deployment of the entire application isn’t required as the code has already been shipped, but not yet activated.
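To make the "externally controlled if statement" idea concrete, here is a minimal sketch in Python. The `TOGGLES` dictionary, the `is_enabled` helper, and the `new-checkout` toggle name are all hypothetical stand-ins for a real toggle server and are not part of any particular framework:

```python
# Hypothetical in-memory toggle store standing in for a real toggle server.
TOGGLES = {"new-checkout": False}


def is_enabled(name: str, default: bool = False) -> bool:
    """Return the current state of a toggle, falling back to a default."""
    return TOGGLES.get(name, default)


def checkout() -> str:
    # The new code path is already deployed but stays hidden until the toggle is on.
    if is_enabled("new-checkout"):
        return "new checkout flow"
    return "classic checkout flow"


print(checkout())               # toggle off: the classic flow is served
TOGGLES["new-checkout"] = True  # "release" the feature without a new deployment
print(checkout())               # toggle on: the new flow is served
```

The point of the sketch is that the release decision lives in data outside the function, so flipping it does not require shipping new code.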

Feature flags improve developer experience

On top of this, feature toggles come in handy during the development process itself. Instead of working on long-lived branches in your Git repository, developers can work directly on the master branch by hiding new features behind feature toggles. With this approach, developers avoid the pain of merging long-lived branches back into the master branch, which can lead to severe merge conflicts. Keeping new-feature code in the master code base that is shared between all developers mitigates this pain.

In addition to implementing feature toggles in your code, you need a feature toggle server that stores the configuration of all your feature toggles and lets you manage it. One such feature toggle solution is Unleash. As you can see in the figure below, the Unleash server on the right-hand side holds all feature toggle configurations and provides access to an API as well as the administration UI. Applications on the client side can be implemented in a variety of languages and connect to the Unleash server to fetch the configuration, which they then evaluate locally. Because the evaluation happens on the client, no user data has to be shared with the server, which protects privacy. Client-side evaluation also improves the performance and resilience of the applications.
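The client-side evaluation described above can be sketched roughly as follows. This is not the Unleash SDK; `ToggleConfig`, `ToggleClient`, and `fake_server` are hypothetical names, and the allow-list rule is just one assumed example of a targeting strategy. The sketch only illustrates two properties: the user ID never leaves the client, and the last known configuration keeps working when the server cannot be reached.

```python
from dataclasses import dataclass, field


@dataclass
class ToggleConfig:
    enabled: bool
    # Assumed targeting rule: an empty set means "everyone".
    allowed_user_ids: set = field(default_factory=set)


class ToggleClient:
    """Hypothetical client that fetches configuration and evaluates it locally."""

    def __init__(self, fetch_config):
        self._fetch_config = fetch_config  # callable returning {name: ToggleConfig}
        self._cache = {}

    def refresh(self) -> None:
        try:
            self._cache = self._fetch_config()
        except ConnectionError:
            pass  # server unreachable: keep serving the last known configuration

    def is_enabled(self, name: str, user_id: str) -> bool:
        # Evaluation happens here, so the user ID is never sent to the server.
        config = self._cache.get(name)
        if config is None or not config.enabled:
            return False
        return not config.allowed_user_ids or user_id in config.allowed_user_ids


def fake_server():
    # Stand-in for the HTTP call a real client would make.
    return {"beta-search": ToggleConfig(True, {"alice"})}


client = ToggleClient(fake_server)
client.refresh()
print(client.is_enabled("beta-search", "alice"))
print(client.is_enabled("beta-search", "bob"))
```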

High-level architecture of Unleash feature toggle framework

When to use a feature toggle?

Not every line of code you write must be hidden behind a feature flag, but if you want to use feature toggles, there are basically three categories your use case may fall into:

- Releases: as described earlier, this is mainly related to hiding a new feature from end-users until the decision is made by the product owner to release the feature. Other use cases include marketing campaigns that require a new feature to be available at a specific date and time; a new full deployment of an application can be difficult to sync with a specific marketing campaign launch.

- Experiments: to figure out which version of a new feature is best accepted by your users (i.e., A/B testing). You can come up with two variants of a feature and have some of your users use variant A while the others use variant B.

- Operations: the purpose of operational feature toggles is to enable a safety mechanism that can be employed at runtime. A well-known pattern for this is a “kill switch” that turns off experimental features and/or enables a safety mechanism in your application to keep it alive without the need to roll back a deployment to a previous version.
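The experiment and operations categories can be illustrated with one small sketch. It assumes a hash-based split, which is a common way to assign users to variants deterministically but is not necessarily how any specific toggle framework does it; `variant_for` and its parameters are hypothetical names.

```python
import hashlib


def variant_for(user_id: str, experiment: str, percent_b: int = 50) -> str:
    """Deterministically assign a user to variant "A" or "B"."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return "B" if bucket < percent_b else "A"


# The same user always lands in the same variant of a given experiment.
print(variant_for("user-42", "menu-redesign"))

# Setting the share of variant B to 0 acts as a kill switch: everyone gets A.
print(variant_for("user-42", "menu-redesign", percent_b=0))
```

Because the assignment depends only on the user and experiment names, no per-user state has to be stored, and turning the experimental variant off is a pure configuration change.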

What if something breaks?

As feature toggles provide a convenient way of releasing new features to your end users, they also provide the possibility of reverting a feature release and switching back to the previous behavior by turning off the feature flag. This is a practice some refer to as “testing in production.” It provides a safety net by turning off a feature if you find out that something is broken.

This is great, as it can speed up mean-time-to-remediation (MTTR) tremendously, as shown in this figure:

Source: https://medium.com/@sashman90/ops-mitigation-triangle-300c81d97df6

This figure shows that having feature flags as part of your deployment and release process will give you the fastest reaction time when releasing or rolling back features, even faster than blue/green switches and obviously faster than re-deploying an application.

Determining how many users are affected by a feature flag is crucial to knowing its impact.

Unleash gives you the ability to see how many of your users have been exposed to a specific setting of a feature toggle as part of their administration UI, as shown in this screenshot:

Distribution of enabled/disabled state of feature toggle in Unleash

As you can see, 51% of users had the feature toggle enabled in the last minute, while over the last hour 12% of users were exposed to the active feature toggle.

This is great for seeing the distribution of feature toggles, but it does not tell you about the actual impact on the user experience.

Capturing the actual impact is indispensable for deciding whether a feature flag should be kept on or turned off. This is where tools such as Dynatrace are needed (or where developers have to invest time into monitoring with open source tools such as Prometheus). Metrics such as response time and failure rate indicate whether a feature toggle should stay enabled. You can differentiate between technical and business metrics, and usually only the combination of both tells you if the system is working. For example, the response time for a service might be fine, but the conversion rate might be going down because additional steps were introduced or a button was removed.
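As a rough sketch of combining technical and business metrics into one decision, consider the following. The `Metrics` fields, the function name, and all thresholds are illustrative assumptions, not values taken from Dynatrace, Prometheus, or Keptn:

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    response_time_ms: float
    failure_rate: float     # fraction of failed requests, 0.0..1.0
    conversion_rate: float  # fraction of sessions that convert, 0.0..1.0


def should_keep_enabled(baseline: Metrics, current: Metrics,
                        max_slowdown: float = 1.2,
                        max_failure_increase: float = 0.01,
                        max_conversion_drop: float = 0.05) -> bool:
    """Keep the flag on only if technical AND business metrics look healthy."""
    technical_ok = (
        current.response_time_ms <= baseline.response_time_ms * max_slowdown
        and current.failure_rate <= baseline.failure_rate + max_failure_increase
    )
    business_ok = (
        current.conversion_rate >= baseline.conversion_rate * (1 - max_conversion_drop)
    )
    return technical_ok and business_ok


baseline = Metrics(response_time_ms=200, failure_rate=0.01, conversion_rate=0.10)
# Response time is fine, but conversion dropped noticeably: turn the flag off.
print(should_keep_enabled(baseline, Metrics(190, 0.01, 0.07)))
```

The example mirrors the case from the text: a service that responds quickly can still fail the check because the business metric degraded.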

As a more concrete example, take a look at the screenshot below that shows two versions of a conversion funnel in a funnel analysis. In the top part the new feature is still hidden, i.e., the feature flag “Load menu data” is not enabled. In the lower chart, the feature is released, since we have enabled the feature flag “Load menu data.” Technically, this adds an extra step to the conversion funnel, and you can see the negative impact on the overall conversion rate. Having this comparison and the ability to change the behavior using a simple feature flag enables you to quickly revert to the previous version, which results in better conversion.

Funnel charts show the negative impact on conversion rate of a feature flag exposed to a specific group

Auto-remediating feature flags with negative impact

Keptn, as the control plane for continuous delivery and automated operations, provides built-in support for setting your feature toggles or kill switches automatically in response to any production issues that are reported. This means no more worries when releasing a new feature, starting a marketing campaign, or enabling an A/B test for your users in production: if issues arise, Keptn will contact the feature toggle server and set the feature flags to a stable state.

Here’s another concrete example: a shopping cart service for our Sockshop application has a feature toggle for enabling a promotional campaign that adds gift items to every third user’s shopping cart. Looking at the monitoring data, we can see that the failure rate increased once the feature flag was enabled. Once the monitoring platform informs Keptn about the increased failure rate, Keptn contacts Unleash to turn off the problematic feature flag and restore a frictionless user experience! Having Unleash + Keptn in place gives you risk-free toggling of feature flags thanks to Keptn’s self-healing capabilities. This reduces mean-time-to-remediation (MTTR) to just a couple of minutes by automatically turning off the problematic feature flag. Moreover, this automation reduces context switches for developers since they don’t need to remediate issues themselves. Instead, they can be informed at a later point in time and stay focused on the task at hand, since the issue has already been mitigated.

Process for Auto-Remediation of Feature Flags

How to set up self-healing feature flags?

Keptn provides a simple way to define self-healing, or remediation, actions declaratively: you configure what has to be done instead of writing a script that spells out how it has to be done.

A simple remediation file for disabling a promotion campaign if errors are reported looks like this:

remediations:
- name: "Failure rate increase"
  actions:
  - action: featuretoggle
    value: PromotionCampaign:off

The file defines a set of remediations in response to an issue. In this example, one remediation is defined in response to a “failure rate increase” problem. Each remediation can have several actions attached that will be executed consecutively. In our example there is one action defined that changes a feature toggle. The name and state of the feature toggle to be changed are defined in the value property of the action (i.e., give the “PromotionCampaign” feature toggle the value “off”). Note that thanks to an abstraction mechanism called the Keptn uniform, tools, URLs, and tokens are not part of this description. This file only holds declarative instructions for what to do. If the feature toggle framework is changed, this file won’t need to be changed; only the Keptn uniform must be updated.
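To illustrate the separation between what to do and how to do it, here is a minimal Python sketch of how such a file could be interpreted. This is not Keptn's implementation: the remediation data is shown as an already-parsed structure, `toggle_server` is a hypothetical in-memory stand-in for the feature toggle server, and matching the problem title case-insensitively is an assumption made for the example.

```python
# The remediation file from the text, shown as already-parsed data.
REMEDIATIONS = [
    {
        "name": "Failure rate increase",
        "actions": [{"action": "featuretoggle", "value": "PromotionCampaign:off"}],
    }
]

# Hypothetical in-memory stand-in for the feature toggle server.
toggle_server = {"PromotionCampaign": True}


def handle_featuretoggle(value: str) -> None:
    """Apply a "<toggle name>:<on|off>" value to the toggle store."""
    name, _, state = value.partition(":")
    if state not in ("on", "off"):
        raise ValueError(f"unsupported toggle state: {state!r}")
    toggle_server[name] = state == "on"


# Maps a declarative action name to the code that knows how to perform it.
# Swapping the toggle framework would only mean swapping this handler.
ACTION_HANDLERS = {"featuretoggle": handle_featuretoggle}


def remediate(problem_title: str) -> int:
    """Run the actions of all matching remediations in order; return the count."""
    executed = 0
    for remediation in REMEDIATIONS:
        if remediation["name"].lower() != problem_title.lower():
            continue  # assumption: match on the problem title, ignoring case
        for step in remediation["actions"]:
            ACTION_HANDLERS[step["action"]](step["value"])
            executed += 1
    return executed


remediate("failure rate increase")
print(toggle_server)
```

Only the handler registry knows about the toggle store, which mirrors the idea that the remediation file itself stays free of tool-specific details.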

On top of this you might have noticed that the remediation file does not include information about which service or application the remediation is for. Following a GitOps approach, Keptn stores this information in its Git repository at the service level, enabling maximum reuse of remediation files across services.

Self-healing in action

Have a look at this short video to see how Keptn and Unleash work together to auto-remediate feature flags that cause issues in your production environment:

Self-healing feature flags with Unleash and Keptn (youtube video)

Get started!

If you want to start integrating feature flags into your application with Unleash, it is quite easy because Unleash provides SDKs for most of the major programming languages, such as Java, Node.js, Go, and .NET, to name just a few. For hosting the Unleash server there are two options: a self-hosted open-source variant and a managed hosted version.

For managing feature flags with Keptn, a running Keptn installation is needed. Everything else is just a matter of configuration, no coding required. The easiest approach is to follow the provided tutorials for setting up self-healing feature flags with Unleash and Keptn.

Get in touch!

Join the Keptn community and tell us about your self-healing stories in our Slack workspace. Also, make sure to follow us on Twitter via @keptnProject and don’t miss any of our updates!
