Mastering Feature Flags: Performance and Resilience

Martin Chaov
DraftKings Engineering
Apr 30, 2024

This article is part of a series; if you got here first, it might be worth checking some of the previous ones.

Performance and resilience are essential aspects of software development, and feature flags add their own intricacies to both. This article explores the nuances of the performance and resilience challenges introduced by feature flagging in client applications and outlines strategies for their mitigation.

Latency, resource utilization, throughput

Latency is not solely a network concern; it encompasses the time from cause to effect within an application. A common source of latency in applications using feature flags is decision-making overhead. Every time a feature flag check occurs, it introduces a slight delay as the system determines whether to enable a feature. This is compounded when new logic must be downloaded or when flags are evaluated sequentially. Depending on product requirements, SLAs, and implementation details, a feature flag could trigger the download of new application logic, which adds another layer of complexity (the ability to load and execute code on demand) and latency before the feature can run. It is not always obvious how a single flag can degrade the application’s performance. A feature flag can also add latency if its value has to be re-validated with a back-end service periodically during an active session or on application bootstrap.
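
For illustration, one way to bound the latency a flag check can add is to race the remote evaluation against a timeout and fall back to a safe default. A minimal sketch in TypeScript, where remoteEvaluate stands in for whatever call reaches the flag service (an assumed helper, not a specific SDK API):

// Minimal sketch: bound the latency a single flag check can add.
// remoteEvaluate is an assumed call to the flag service or SDK.
async function isEnabled(
  flagKey: string,
  remoteEvaluate: (key: string) => Promise<boolean>,
  timeoutMs = 50,
  defaultValue = false
): Promise<boolean> {
  const timeout = new Promise<boolean>((resolve) =>
    setTimeout(() => resolve(defaultValue), timeoutMs)
  );
  // Whichever settles first wins; a slow flag service cannot block the feature path.
  return Promise.race([remoteEvaluate(flagKey), timeout]).catch(() => defaultValue);
}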

While resource utilization should not be significantly affected in most cases, budgeting how much CPU time and memory feature flags are allowed to consume is a good practice. Does it make sense to store a 100MB flag configuration on the device versus downloading it every time? Is there a way to download only what is required, at the cost of additional latency or the inability to download it on demand? What is the worst case for the user if a flag is unavailable or cannot be switched on/off?

Throughput is a business-defined metric expressed in technical values. The basic definition of throughput is the rate at which a system can process requests. Periodic polling for feature flags can increase the load on different components in the system (gateways, authentication, authorization, etc.). One source of degradation can be the system serving the flag values itself if it becomes overwhelmed with traffic or requests. On the surface, returning a flag value for a user in a segment is a simple transaction. However, user segments can change dynamically, a user can be in multiple segments, and those segments may overlap. Determining which value is appropriate for a user can quickly become time-consuming for the back end.
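
One way to keep that resolution predictable is to resolve overlapping segments deterministically, for example by an explicit priority. A minimal sketch, assuming segments carry a priority value (an assumption for illustration, not part of the data model shown later):

type SegmentRule = { segmentId: string; priority: number; value: boolean };

// Pick the value of the highest-priority segment the user belongs to,
// falling back to a default when no segment matches.
function resolveFlagValue(
  userSegments: Set<string>,
  rules: SegmentRule[],
  defaultValue = false
): boolean {
  const matching = rules
    .filter((rule) => userSegments.has(rule.segmentId))
    .sort((a, b) => b.priority - a.priority);
  return matching.length > 0 ? matching[0].value : defaultValue;
}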

Mitigation strategies:

Caching

Caching strategies should be tailored to flag types, with micro-caching implemented to reduce load for frequently accessed flags; a minimal sketch of such type-aware caching follows the list:

  • Release — short-term caching with quick invalidation. Release flags are often used to control the rollout of new features. These flags may change rarely, but when they do, especially in the case of a rollback, the change should take effect near instantaneously. Cache the values for a few minutes before revalidation, and default to the “off” state rather than using a stale “on” state. This balances the need for up-to-date flag states against the benefit of fewer calls reaching the back end, and it fails safely when the back-end system is unavailable and can’t respond with up-to-date values.
  • Experiment — session-based caching. Experiment flags control A/B tests or other user experiments. User experience must remain consistent throughout a session once an experiment variant is assigned. Cache the flag state at the beginning of a user session and maintain that state for the session’s duration (a clear definition of the user session is required in the context of the experiment). This ensures users don’t switch between experiment groups mid-session, preserving the experiment’s integrity. Any flag should have the ability to be switched off if it hurts the business.
  • Operational — long-term caching with manual invalidation. These flags toggle operational aspects of the application, such as feature maintenance modes, back-end service endpoints, logging levels, etc. Since operational flags change infrequently, caching their states for extended periods (hours, days, months) until an update is manually triggered can significantly reduce system load.
  • Permission — user-specific caching with dynamic invalidation. Permission flags control access to features based on user roles, which can change as user segments are updated (user transitioning from casual to professional or VIP tier). Such caching should be done per user, with mechanisms to invalidate the cache when the user segment changes. This requires a more sophisticated mechanism, potentially integrating with the system’s authentication and authorization layer to detect permission changes.
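
The sketch below keeps one cache but applies a different TTL per flag type; the TTL values and the “off” fallback for release flags are illustrative assumptions, not prescribed numbers:

type FlagType = 'Release' | 'Experimentation' | 'Operational' | 'Permission';

// Illustrative TTLs per flag type; tune them to the product's SLAs.
const TTL_MS: Record<FlagType, number> = {
  Release: 2 * 60 * 1000,           // short-lived, quick invalidation
  Experimentation: Infinity,        // pinned once assigned for the session
  Operational: 24 * 60 * 60 * 1000, // long-lived, manual invalidation
  Permission: 10 * 60 * 1000,       // per user, invalidated on segment change
};

type CacheEntry = { value: boolean; cachedAt: number };
const cache = new Map<string, CacheEntry>();

async function getFlag(
  key: string,
  type: FlagType,
  fetchRemote: (key: string) => Promise<boolean> // assumed call to the flag back end
): Promise<boolean> {
  const entry = cache.get(key);
  if (entry && Date.now() - entry.cachedAt < TTL_MS[type]) return entry.value;
  try {
    const value = await fetchRemote(key);
    cache.set(key, { value, cachedAt: Date.now() });
    return value;
  } catch {
    // Back end unavailable: prefer a safe "off" over a stale "on" for release flags.
    return type === 'Release' ? false : entry?.value ?? false;
  }
}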

The caching strategy should not lead to stale or inconsistent flag states across different parts of the application or multi-device applications. The caching mechanisms must scale with the application, considering the number of flags and the volume of requests.

Optimized flag evaluation with strict SLAs

  • Lazy evaluation — evaluate flags only when necessary, rather than at the start of the application, to reduce unnecessary loading times (a minimal sketch follows this list).
  • Consider flag SDKs — most flag management platforms provide an SDK, which often comes with the required optimizations built in (local caching, background polling, push updates, etc.).
  • User segmentation — pre-compute segment/flag/environment combinations and store them for efficient access to reduce the need for complex evaluation logic on each request.
  • Monitoring and alerting — track the performance impact of feature flags and set up alerts on SLA breaches. A near real-time feedback loop allows for quick adjustments to flags, especially for the release and experimentation types.
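
The lazy evaluation point above can be as small as a memoized accessor: nothing is fetched until a code path actually asks for the flag, and later checks reuse the first result. A minimal sketch, with evaluateRemote standing in for the flag back end or SDK call (an assumption):

// Lazy, memoized flag accessor: the remote evaluation runs only on first use.
function lazyFlag(
  key: string,
  evaluateRemote: (key: string) => Promise<boolean>
): () => Promise<boolean> {
  let pending: Promise<boolean> | undefined;
  return () => {
    // The first caller triggers the evaluation; later callers share the same promise.
    pending ??= evaluateRemote(key);
    return pending;
  };
}

// Usage (assuming fetchFlagFromService is provided by the flag SDK):
// const newCheckoutEnabled = lazyFlag('new-checkout', fetchFlagFromService);
// Nothing is fetched until the checkout path first calls newCheckoutEnabled().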

Data model efficiency

Different flag types can require different data modeling from an efficiency point of view. Here are a few things to consider when modeling each type of flag:

  • Release — to support gradual rollout strategies, these flags should track targeted segments, rollout percentages and timing (e.g., a 10% ramp-up per day), and rollout state (not started, in progress, completed).
  • Experiment — these require a model that can handle variant assignment and track user engagement metrics as defined by the experiment or A/B test: variant assignments per user or user segment, engagement metrics, and so on.
  • Operational — requires a straightforward and accessible data model. Such flags usually need to be easy to access and evaluate. Besides their current value, a timestamp of when the last change occurred can help synchronize their state in a distributed system.
  • Permission — standardized according to RBAC; should support dynamic updates to avoid edge cases with users moving between segments.

An example of a standard data model for feature flags:

type FlagType = 'Release' | 'Experimentation' | 'Operational' | 'Permission';
type FlagScope = 'short-lived' | 'long-lived' | 'user-based' | 'system-based';

type Flag = {
  id: string;                // GUID
  name: string;
  type: FlagType;
  description: string;
  scope: FlagScope[];
  environments: {
    development: boolean;
    staging: boolean;
    production: boolean;
  };
  targeting: {
    segments: string[];      // segment identifiers
    conditions: {
      attribute: string;     // user attribute
      operation: 'equals' | 'greaterThan' | 'lessThan'; // etc.
      value: string;         // value if the attribute meets the condition
    }[];
  };
  rollout: {
    strategy: 'gradual' | 'targeted' | 'all'; // etc.
    percentage: number;      // required
  };
  // Holds all the values required for the application to use this flag.
  operationalParameters: Record<string, unknown>;
  createdAt: number;         // timestamp
  updatedAt: number;         // timestamp
};
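
As a usage sketch for this model, a gradual rollout percentage can be applied deterministically per user by hashing the user id into a stable bucket, so the same user always sees the same decision. The hash below is purely illustrative, not a prescribed algorithm:

// Deterministically map a user to a 0-99 bucket so the same user
// always lands on the same side of the rollout percentage.
function bucketFor(userId: string, flagId: string): number {
  let hash = 0;
  for (const ch of userId + flagId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple 32-bit rolling hash
  }
  return hash % 100;
}

function isInRollout(flag: Flag, userId: string): boolean {
  if (flag.rollout.strategy === 'all') return true;
  return bucketFor(userId, flag.id) < flag.rollout.percentage;
}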

Monitor performance

The different flag types have very distinct purposes and reasons for existing. Thus, monitoring them can vary greatly. What could be considered “performant” for the different types of flags can be affected by the business niche and system constraints.

  • Release — rolling out new system updates mainly affects system stability and availability. Important metrics to track (before and after the update) include response times, error rates, user metrics, and conversion rates, typically collected through RUM and APM tooling.
  • Experiment — A/B testing requires close monitoring of user behavior and experiment outcomes. Important metrics to track include conversion rates, click-through rates, funnels, and other business-specific KPIs to assess the variants. In addition, the system should be monitored for performance degradation to understand whether it could skew the results. Typical tooling: Google Analytics, APM, RUM. A sketch of tagging telemetry with the served variant follows this list.
  • Operational — as these control operational aspects of the system (i.e., log level), they should be tracked in terms of CPU, memory, network, and other relevant resource utilization. In addition, error rates help ensure operational changes do not introduce instability. Typical tooling: ELK, Grafana, PRTG, etc.
  • Permission — monitoring focus should be on access patterns, access violations, malicious behavior (brute force attempts, unauthorized access attempts), and tracking how often and by whom such features are accessed.
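
Whatever the tooling, before/after and per-variant comparisons only work if the telemetry records which flag state the user was actually served. A minimal sketch, where track stands in for whichever analytics/APM client is in use (an assumption):

// Attach the served flag variants to every event so dashboards can
// segment response times, error rates, and conversions per variant.
type TrackFn = (event: string, properties: Record<string, unknown>) => void;

function trackWithFlags(
  track: TrackFn,
  activeFlags: Record<string, string | boolean>
): TrackFn {
  return (event, properties) => track(event, { ...properties, flags: activeFlags });
}

// Usage: const track = trackWithFlags(analyticsClient.track, { 'new-checkout': 'variantB' });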

Selective use

Feature flags should be applied with great care and attention; in many cases, using a feature flag can increase the TCO of a change disproportionately to the benefits it provides. You can read more about this topic in our article about risk mitigation.

Error rates, availability, failover time

Feature flags can significantly increase the complexity of tracking error rates due to the permutations of flag states. An application with multiple flags might see a manifold increase in potential error states. Ensuring high availability and swift failover in the event of errors requires the following:

  • Rate limiting — manage excessive polling for flag values; the goal is to avoid service degradation or failures due to sudden spikes in traffic.
  • Circuit breakers — when the application detects that a failure threshold has been reached, it should temporarily stop calling the failing dependency. Retries should use an incremental back-off with random jitter to avoid multiple clients stressing the back end simultaneously (a sketch follows this list).
  • The previous section already covers real-time monitoring (RUM) and alerting.
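
A minimal sketch of the circuit breaker and back-off described above; the threshold, cool-down, and delay values are illustrative assumptions:

// Simple circuit breaker around the flag service: after enough failures,
// skip the call entirely for a cool-down period and serve the fallback.
class FlagServiceBreaker {
  private failures = 0;
  private openUntil = 0;

  constructor(
    private readonly threshold = 5,       // failures before the circuit opens
    private readonly coolDownMs = 30_000  // how long the circuit stays open
  ) {}

  async call<T>(fn: () => Promise<T>, fallback: T): Promise<T> {
    if (Date.now() < this.openUntil) return fallback; // circuit open: skip the call
    try {
      const result = await fn();
      this.failures = 0;
      return result;
    } catch {
      if (++this.failures >= this.threshold) {
        this.openUntil = Date.now() + this.coolDownMs;
      }
      return fallback;
    }
  }
}

// Back-off with random jitter: the delay grows with each attempt and is spread out
// so many clients do not retry against the back end at the same instant.
function backoffDelayMs(attempt: number, baseMs = 200, maxMs = 10_000): number {
  const capped = Math.min(maxMs, baseMs * 2 ** attempt);
  return capped / 2 + Math.random() * (capped / 2);
}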

Feature flags that control critical system aspects can complicate failover mechanisms. System downtime can be prolonged if there is no easy way to revert the feature flags to a confirmed working state and set of values.

Third-party services integrated via feature flags must not compromise the system’s performance and resilience. Implementing rate limiting and circuit breakers can mitigate the risk of third-party failures. Additionally, asynchronous loading of feature flags can improve startup times without sacrificing user experience.
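
A minimal sketch of such non-blocking loading: start from cached or default values immediately so rendering is never blocked, then refresh from the flag service in the background. The storage key and fetchRemote call are assumptions for illustration:

type FlagValues = Record<string, boolean>;

// Start with cached/default values, then refresh in the background.
function bootstrapFlags(
  defaults: FlagValues,
  fetchRemote: () => Promise<FlagValues>, // assumed call to the flag service
  storageKey = 'flag-cache'               // assumed local storage key
): FlagValues {
  const cached = localStorage.getItem(storageKey);
  const flags: FlagValues = { ...defaults, ...(cached ? JSON.parse(cached) : {}) };

  // Fire-and-forget refresh; the next session (or a live-update hook) picks it up.
  fetchRemote()
    .then((latest) => localStorage.setItem(storageKey, JSON.stringify(latest)))
    .catch(() => { /* keep cached/default values on failure */ });

  return flags;
}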

Often, a third-party system would handle the heavy lifting related to the operational aspects of using feature flags.

Measuring the performance of feature flags involves assessing both the impact of the flags on the system and the effectiveness of the flags in achieving their intended goals. A few starting points:

  1. Collect system performance metrics before/after a flag is enabled: response times, error rates, resource usage, etc.
  2. Measure feature usage and how often the end users use the code behind the flag.
  3. Business metrics such as session length and click-through rate.
  4. Compare behavior and revenue data during A/B testing.
  5. Flag management metrics: how often a flag gets toggled, related incidents, how long it remains in the codebase, etc.

Glossary

  • Lazy Evaluation: A strategy to delay checking and applying feature flags until necessary, minimizing the performance impact on application startup and operation.
  • Flag SDKs: Software Development Kits provided by feature flag management platforms, offering built-in optimizations like local caching, background polling, and push updates to facilitate efficient flag handling.
  • Pre-compute Segment/Flag/Environment Combinations: The practice of calculating and storing the results of complex feature flag evaluations in advance to speed up flag checks during runtime.
  • RBAC (Role-Based Access Control): A method of managing users’ permissions based on their organizational role; relevant to permission flags that control access to certain features.
  • Failover Time: The duration required to switch to a stable system state in case of failure, which can be extended by the complexity introduced by feature flags.
  • Throughput: The rate at which a system can process requests, which can be impacted by feature flag checks and the management of flag values, particularly in high-load scenarios.
  • Caching: Storing copies of feature flag states to reduce system load and improve performance. Different strategies (micro-caching, session-based, long-term, user-specific) are used based on flag type and use case.
  • Error Rates: The frequency of errors occurring within a system or application, often used to measure reliability or quality.
  • Availability: The proportion of time a system is operational and accessible to users, typically expressed as a percentage of total time.
  • Failover: Switching to a standby database, server, or network if the primary system fails or is temporarily unavailable.
  • Rate Limiting: The practice of controlling the amount of incoming or outgoing traffic to or from a network or application to prevent overload.
  • Circuit Breakers: A system design pattern that stops the flow of operations in an application to prevent failures from cascading to other system parts.
  • A/B Testing: Comparing two versions of a webpage or app against each other to determine which one performs better.
  • RUM (Real User Monitoring): A type of performance monitoring that captures and analyzes every transaction by users of a website or application.
  • APM (Application Performance Monitoring): Tools or platforms that monitor and manage the performance and availability of software applications.
  • TCO (Total Cost of Ownership): The purchase price of an asset plus the costs of operating it over its life.
  • User Segmentation: The practice of dividing users into groups based on specific criteria, such as behavior or demographics, to provide tailored experiences.


Martin Chaov
DraftKings Engineering

15+ years as a software architect, currently Lead Software Architect at DraftKings, specializing in large-scale systems and award-winning iGaming software.