Introducing Flagr: A robust, high-performance service for feature flagging and A/B testing

Zhuojie Zhou · Checkr Engineering · Dec 18, 2017

Rolling out a new feature is not an easy task. We may want to collect data, identify goals, generate a hypothesis, create variations, run experiments, and analyze results. Dark launching a feature is appealing, but has the following drawbacks:

  • There’s no easy rollout process. Developers need to write rollout functions for every new feature.
  • There’s no A/B testing. A/B testing is recommended for generating rigorous metrics that can be monitored to make data-driven decisions.
  • There’s no dynamic configuration for features. Changing configuration related to a feature may require a new deploy.
  • There’s no user segmentation, and it’s hard to target a small set of users.
  • There’s no easy way to collect analytics at the feature level.

At Checkr, we are constantly delivering features to improve the user experience of running background checks. We leverage machine learning to match screening records, deploy finite state machines to automate the report lifecycle, and iterate on analytics dashboards to gather insights. We are moving fast, and we need a low-risk tool to guide the rollout process and our decisions. This is where the open-source microservice Flagr steps in.

  • Flagr makes the rollout process as easy as clicking buttons.
  • Flagr supports feature flagging, A/B testing, and dynamic configuration. All of them are first-class citizens in Flagr. Moreover, Flagr can run multi-variant experiments, so you are not limited to binary on/off toggles.
  • Flagr can target any audience. It uses rich constraints to define user segmentation. Its scope is broader than traditional web feature toggles, and it's flexible enough to flag requests from mobile apps, frontends, and backend systems.
  • Flagr logs data records and impressions, so it’s easy to build your own analytics at the feature level. It decouples the metrics computation from flag evaluation, and you own 100% of your data.

Once we launched Flagr at Checkr, every engineering team quickly adopted it as the go-to solution for feature deployment. For example, we've used Flagr to progressively migrate our message queues from RabbitMQ to Kafka, to control the threshold of our probabilistic identity matcher, and even to run scheduled feature deployments via timestamp constraints.

You can find out more in the Flagr GitHub repository.

Features

Flagr has Swagger REST APIs for flag management and evaluation. For more details, see Flagr Overview.

The philosophy behind Flagr is that feature flagging and A/B testing should be dead simple, performant, reliable, debuggable, and configurable. The following are some selected features of Flagr.

  • REST API, built with go-swagger
  • Rules engine and user segmentation
  • Debug console for testing flag evaluation
  • Editing history
  • Kafka data logging and JWT auth user login (via env variable configuration)
  • Support for MySQL, PostgreSQL, and SQLite3 via GORM
  • Client SDKs

Use cases

Feature Flagging

A common pattern for feature flagging is a binary on/off toggle. Most toggles are kill switches, and some target a specific audience. The following pseudocode shows the idea: given an entity (a user, a request, or a web cookie), Flagr evaluates the entity according to the flag's settings.

evaluation_result = flagr.post_evaluation( entity )
if (evaluation_result.variant_id == new_feature_on) {
  // do something new and amazing here.
} else {
  // do the current boring stuff.
}
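For concreteness, here is a minimal Go sketch of the same check against Flagr's evaluation endpoint. The endpoint path and field names reflect our reading of the Swagger spec bundled with the repository, and the flag ID, entity values, and variant key are placeholders; check the spec for the exact contract.

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

// evalRequest and evalResult mirror the request/response shapes we assume
// the evaluation endpoint uses; verify against the Swagger spec.
type evalRequest struct {
    EntityID      string                 `json:"entityID"`
    EntityType    string                 `json:"entityType"`
    EntityContext map[string]interface{} `json:"entityContext"`
    FlagID        int64                  `json:"flagID"`
}

type evalResult struct {
    VariantKey        string                 `json:"variantKey"`
    VariantAttachment map[string]interface{} `json:"variantAttachment"`
}

func main() {
    body, _ := json.Marshal(evalRequest{
        EntityID:      "user_1234",
        EntityType:    "user",
        EntityContext: map[string]interface{}{"state": "CA"},
        FlagID:        1, // placeholder flag ID
    })

    resp, err := http.Post("http://localhost:18000/api/v1/evaluation",
        "application/json", bytes.NewReader(body))
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    var result evalResult
    if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
        log.Fatal(err)
    }

    if result.VariantKey == "on" {
        fmt.Println("do something new and amazing here")
    } else {
        fmt.Println("do the current boring stuff")
    }
}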

For instance, a feature flag can be configured from the Flagr UI:

Variants
- on
- off
Segment
- Constraints (e.g. state == "CA")
- Rollout Percent: 100%
- Distribution
- on: 100%
- off: 0%

Experimentation: A/B testing

If you want to run A/B testing experiments on several variants with a targeted audience, you can instrument your code with Flagr like the following.

evaluation_result = flagr.post_evaluation( entity )
switch (evaluation_result.variant_id) {
  case treatment1:
    // do the treatment 1 experience
  case treatment2:
    // do the treatment 2 experience
  case treatment3:
    // do the treatment 3 experience
  default:
    // do the control experience
}

A typical A/B testing experiment can be configured from the Flagr UI like the following. The rollout percent controls how many matching entities enter the experiment, and the distribution splits them across variants: in the first segment below, each variant ends up with roughly 5% (20% × 25%) of matching CA entities.

Variants
- control
- treatment1
- treatment2
- treatment3

Segment
- Constraints (state == "CA")
- Rollout Percent: 20%
- Distribution
- control: 25%
- treatment1: 25%
- treatment2: 25%
- treatment3: 25%
Segment
- Constraints (state == "NY" AND age >= 21)
- Rollout Percent: 100%
- Distribution
- control: 50%
- treatment1: 0%
- treatment2: 25%
- treatment3: 25%


Dynamic Configuration

One can also leverage the Variant Attachment to run dynamic configuration. For example, the color_hex of the green variant can be dynamically configured (a small Go sketch of reading the attachment follows the setting example below).

evaluation_result = flagr.post_evaluation( entity )
green_color_hex = evaluation_result.variantAttachment["color_hex"]

Setting example:

Variants
- green
- attachment: {"color_hex": "#42b983"} OR {"color_hex": "#008000"}
- red
- attachment: {"color_hex": "#ff0000"}

Segment
- Constraints: null
- Rollout Percent: 100%
- Distribution
- green: 100%
- red: 0%
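Building on the evaluation sketch above, here is a small, self-contained Go example of turning that attachment into typed configuration. The themeConfig type and themeFromAttachment helper are hypothetical names for illustration; only the attachment shape comes from the setting example.

package main

import (
    "encoding/json"
    "fmt"
)

// themeConfig is a hypothetical typed view of the attachment shown above.
type themeConfig struct {
    ColorHex string `json:"color_hex"`
}

// themeFromAttachment round-trips the attachment through JSON so we don't
// hand-roll type assertions on map[string]interface{}.
func themeFromAttachment(attachment map[string]interface{}) (themeConfig, error) {
    raw, err := json.Marshal(attachment)
    if err != nil {
        return themeConfig{}, err
    }
    var cfg themeConfig
    err = json.Unmarshal(raw, &cfg)
    return cfg, err
}

func main() {
    // In practice this map would come from evaluation_result.variantAttachment.
    attachment := map[string]interface{}{"color_hex": "#42b983"}
    cfg, _ := themeFromAttachment(attachment)
    fmt.Println(cfg.ColorHex) // "#42b983" for the green variant
}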

Architecture

There are three components: Flagr Evaluator, Flagr Manager, and Flagr Metrics.

Flagr Evaluator

The Flagr Evaluator evaluates incoming requests: given a request context, it determines which flag variant applies. It's the core of Flagr's high-performance path, and we achieve speed and availability through the following:

  • Flagr Evaluators scale horizontally with request volume. Adding more hosts behind the load balancer is all it takes, and it doesn't change how we write to or read from the database.
  • All the flags are loaded in memory in each Flagr Evaluator. The memory footprint of a flag is small (100–200 bytes), so each evaluator can keep every flag in memory and refresh them from the database with a periodic background goroutine.
  • We picked CRC32, a fast hash function with a uniform distribution, as the deterministic random function for bucketing entities; a minimal sketch of the idea follows this list. Michiel Buddingh wrote a thorough comparison of uniform-distribution hash functions.
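Here is a minimal Go sketch of the consistent-bucketing idea: hash the entity with a per-flag salt, map it into a fixed number of buckets, and let the rollout percent and distribution carve up those buckets. The salt scheme, bucket count, and thresholds below are illustrative, not Flagr's exact implementation.

package main

import (
    "fmt"
    "hash/crc32"
)

// bucket deterministically maps an entity to one of 1000 buckets for a given
// flag, using CRC32 as the uniform hash. The per-flag salt keeps assignments
// independent across flags.
func bucket(flagSalt, entityID string) uint32 {
    return crc32.ChecksumIEEE([]byte(flagSalt+entityID)) % 1000
}

func main() {
    b := bucket("flag_42", "user_1234")

    // Rollout percent 20%: only entities in the first 200 buckets enter the experiment.
    if b >= 200 {
        fmt.Println("not in experiment; serve the default experience")
        return
    }

    // Distribution 25/25/25/25: split the admitted range evenly across variants.
    switch {
    case b < 50:
        fmt.Println("control")
    case b < 100:
        fmt.Println("treatment1")
    case b < 150:
        fmt.Println("treatment2")
    default:
        fmt.Println("treatment3")
    }
}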

Flagr Manager

The Flagr Manager serves as the CRUD interface, defining all operations for managing flags. Features include:

  • The API is written with go-swagger. We considered gRPC, Thrift, gRPC-gateway, and plain REST, and the Go implementation of Swagger stood out for the ubiquity of REST APIs and its client code generation support.
  • All the client SDKs are auto-generated from the Swagger spec. For example, Flagr Ruby SDK https://github.com/checkr/rbflagr.

Flagr Metrics

Last but not least, we have Flagr metrics, the data pipeline for collecting evaluation results. Metrics data is the foundation of building analytics at the feature level. Currently, Flagr supports Kafka as the data logging pipeline.

Data privacy is critical. You can encrypt your data in the data pipeline by turning on the option in ENV variables. And you control 100% of the data generated from Flagr. By design, metrics can be fed into any ETL pipeline and consumed for computation. At Checkr, we use Presto’s Kafka connector and Airflow to build pipelines. A simple example might be to join a users table with Flagr records, grouping by flag variants to calculate user-based metrics. The data is available — what you decide to do with it is completely up to you.
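To make the consumer side concrete, here is a small Go sketch that parses one logged evaluation record. The dataRecord fields are an illustrative shape, not Flagr's exact data record schema; check the repository for the actual format your Kafka consumers should expect.

package main

import (
    "encoding/json"
    "fmt"
    "time"
)

// dataRecord is an assumed shape for one evaluation result pulled off the
// Kafka topic; the real schema is defined by Flagr, not by this sketch.
type dataRecord struct {
    FlagID     int64     `json:"flagID"`
    VariantKey string    `json:"variantKey"`
    EntityID   string    `json:"entityID"`
    EntityType string    `json:"entityType"`
    Timestamp  time.Time `json:"timestamp"`
}

func main() {
    // A downstream consumer (Presto, an Airflow task, etc.) would read records
    // like this from Kafka and join them against, say, a users table.
    raw := []byte(`{"flagID":1,"variantKey":"treatment1","entityID":"user_1234","entityType":"user","timestamp":"2017-12-18T00:00:00Z"}`)

    var rec dataRecord
    if err := json.Unmarshal(raw, &rec); err != nil {
        panic(err)
    }
    fmt.Printf("flag %d served %q to %s %s\n", rec.FlagID, rec.VariantKey, rec.EntityType, rec.EntityID)
}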

Performance

Flagr sits at the center of feature rollout: flag evaluation runs in front of every flagged code path, so every millisecond of latency matters. We've load-tested locally with Vegeta (a versatile HTTP load testing tool), and we monitor Flagr's production traffic via New Relic. Benchmarks are included in the repository for reference.

Requests      [total, rate]             56521, 2000.04
Duration      [total, attack, wait]     28.2603654s, 28.259999871s, 365.529µs
Latencies     [mean, 50, 95, 99, max]   371.632µs, 327.991µs, 614.918µs, 1.385568ms, 12.50012ms
Bytes In      [total, mean]             23250552, 411.36
Bytes Out     [total, mean]             8308587, 147.00
Success       [ratio]                   100.00%
Status Codes  [code:count]              200:56521
Error Set:

Try it now

# Run Flagr with docker
docker run -it -p 18000:18000 checkr/flagr
# Open in browser
http://localhost:18000

Or visit the demo app hosted on Heroku: https://try-flagr.herokuapp.com

Conclusion

If you are looking for a feature flagging or A/B testing tool to reduce risk, iterate faster, and gain more control, Flagr is a good fit. It's actively maintained and serves production traffic at Checkr.

We're only at the beginning of what we can do with the service. In the future, we plan to feed evaluation data back into the rollout process and take automation to the next level, for example by using metrics and machine learning to drive feature rollout.
