Using the LaunchDarkly Relay Proxy for Fun and Profit
By: Regis Wilson
TrueCar has been using a “dark launch” culture to launch features into production so that we can either test code with live traffic or perform verification and testing before it goes live to the world. During our migration from a legacy code base on premises to the cloud, TrueCar recognized early that a feature-flag-based “dark launch” philosophy would be necessary to turn features on and off as we tested and tuned our new code and platforms.
From their website, “LaunchDarkly enables development and operations teams to deploy code at any time, even if a feature isn’t ready to be released to users. Wrapping code with feature flags gives you the safety to test new features and infrastructure in your production environments, without impacting the wrong end users.” TrueCar has been using LaunchDarkly feature flags for a while to power our new features.
We use feature flags in our applications in two key ways. First, our backend and frontend monoliths turn specific product features on and off based on percentage of traffic, partner affiliation, and specific targeting rules; the backend is written in Ruby on Rails and served by the Puma web server. Second, a rules routing engine running in Lambda@Edge via AWS CloudFront routes requests to specific endpoints where specific features are enabled. You can read about our implementation in a previous blog post.
Delights, Problems, and Solutions
Using LaunchDarkly for our “dark launch” culture was effective and efficient for the several years during which we migrated to our new cloud-native platform. The LaunchDarkly team and product were well tuned and responsive. It really is enlightening when a product “just works” and does what it says it does.
However, adopting any new technology or vendor product follows a familiar pattern, consisting of the following phases:
- The promise of what the product or technology offers in general
- The promise of what the product or technology offers to us in particular
- The proof of concept and/or testing phase
- The implementation and rollout phase
- The realization or reality of what the product or technology actually offers and does not offer in the real world
- A new wisdom about what the product or technology does and does not do, when and how to use it, and a detailed accounting of its pros and cons
When we finally reached the last stage after a few years of implementing LaunchDarkly in our code base, we had a list of positive items that we enjoyed:
- Feature flags were easy to control
- Implementation of new feature flags was relatively painless in our Ruby codebase because we had abstracted out feature flags into our own gem library wrapper
- Targeting and testing feature flags was quick, surgical, and accurate
- The technology that LaunchDarkly uses in their client libraries was efficient and had no discernible overhead in our codebase
Of course, there were some problems as well:
- While the core feature flag client libraries do not use outbound connections for targeting, using the full feature set of analytics, events, and updating new flags as well as changing existing flags does require communication with LaunchDarkly servers upstream.
- Connection handling can be tricky or even impossible depending on how sessions are handled. In our case, using Rack and Puma as an HTTP server with multiple threads gets quite involved. We also delegated frontend feature flags to the backend, because that is where the session logic lived. And we were using Lambda@Edge to read feature flag state on every invocation of a CloudFront request, which not only adds initialization time but occasionally fails.
- Relying on an external vendor for correct operation of virtually all our rendered web traffic made us uncomfortable, and, indeed, we did see some unexpected errors and interactions with code for session handling and feature flags and during Lambda@Edge invocation.
Thus, after a few years of learning what works and doesn’t work with feature flags and our “dark launch” culture, we still had some nagging problems with using an external API for a core functioning of every single web request we process in our stack.
After we worked closely with LaunchDarkly and found out that other customers were in a similar situation, they suggested we try their new feature called the “LaunchDarkly Relay Proxy.” The relay proxy is an open-source service, written in Go, that we could run closer to our infrastructure to offload outbound connections and flag updates. It intermediates some (or all) of our connections to LaunchDarkly’s APIs, and it also let us experiment with a new idea called the “Feature Store,” which writes feature flag status into an external database (such as Redis, DynamoDB, or Consul). Our LaunchDarkly clients could then use the local proxies and Feature Store, moving the operations of LaunchDarkly’s API a lot closer to us and taking their API and updates out of the primary path of our site operation.
From the documentation, “The LaunchDarkly relay proxy is a microservice that establishes a connection to the LaunchDarkly streaming API, then proxies that stream connection to multiple clients.” Also from the documentation, “In most cases, the relay proxy is unnecessary.” We were working closely with the support team so we bravely forged ahead.
Two modes are possible. The first is the so-called “proxy mode,” where the microservice intermediates the requests to upstream LaunchDarkly servers. We could also allow the microservice instances to query a local cache or Feature Store for fast start-up and increased resilience. This is the image from the documentation.
This diagram shows a load-balanced set of relay instances to offload requests to the LaunchDarkly servers.
The above diagram seemed suited to our Ruby on Rails application backend, which could offload event streams and API calls locally for better stability and lower latency.
Alternatively, we were attracted to a slightly different mode called “Daemon Mode” for Lambda@Edge, which offloads all API calls to a local Feature Store that is kept up to date by several relay instances.
This diagram shows how requests are completely offloaded to a local Feature Store.
The above scenario was especially applicable to Lambda@Edge because we would only need access to feature flags stored in DynamoDB for remote lookup at the edge and lower latency to the DynamoDB endpoints.
The relay is conveniently available as a Docker image, which made deployment relatively straightforward for us. We wanted proxy mode to offload outbound requests from our backend, combined with the fast Redis lookups that Daemon Mode provides (again for the backend), plus extra bonuses like the Datadog metrics integration, so we knew we had a solid design to move forward with. For Lambda@Edge, we settled on pure Daemon Mode with DynamoDB.
One key consideration we ran into early on was the fact that the LaunchDarkly relay proxy does not allow us to write to two Feature Stores simultaneously. Thus, it was necessary for us to run two separate (but equal) relay proxy “clusters” with two different configurations: one for using Redis with Rails; the other for using DynamoDB with Lambda@Edge.
Redis for Ruby on Rails
Our first step was to decide which Feature Store backend to use. We could have used any of Redis, DynamoDB, or Consul, since we were familiar with all of them and had used them all in our infrastructure patterns. However, we gravitated toward Redis for the backend because we already use Redis for our Rails and Ruby applications: the LaunchDarkly relay proxy’s footprint is small, and the Feature Store required no new infrastructure since we already have several Redis instances in use. Also, Redis is fast.
Next, we took the Docker repository from GitHub and added it to our existing build and deploy pipeline to create the Docker image and push it to AWS Elastic Container Registry (ECR). From there, our internal deployment tool, Spacepods, starts a container in Elastic Container Service (ECS) in each of our environments. We use the ECS environment variables integration with Parameter Store to keep configuration and encrypt secrets (like the SDK key). The proxy has certain settings we create in the task definition, including two environment key prefixes for development and production. For example:
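The original settings are not reproduced here; below is a sketch of what the task definition environment might look like, based on the variable names in the ld-relay documentation. The SDK keys, hostnames, and prefixes are placeholders, not our real values:

```
# Relay proxy settings (illustrative; variable names per the ld-relay docs)
LD_ENV_development=<development-sdk-key>
LD_PREFIX_development=ld-flags-development
LD_ENV_production=<production-sdk-key>
LD_PREFIX_production=ld-flags-production
USE_REDIS=1
REDIS_HOST=redis.internal.example.com
REDIS_PORT=6379
```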
We then create an AWS Application Load Balancer (ALB) Target Group pointed at the ECS service to create a load-balanced service. We can test it by visiting the /status endpoint to view confirmation that the service is alive.
Next, we configured our feature flag gem client initialization as shown in this code snippet:
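The original snippet is not shown here; the sketch below illustrates the general shape of the initialization, assuming the launchdarkly-server-sdk gem. The environment variable names (`USE_LDD`, `LD_REDIS_URL`) and the helper method are illustrative, not our actual gem code:

```ruby
# Sketch of how a gem wrapper might build LaunchDarkly client options
# from environment variables. The variable names (USE_LDD, LD_REDIS_URL)
# and this helper are illustrative, not TrueCar's actual code.
def launchdarkly_options(env = ENV)
  options = {}
  if env["USE_LDD"] == "true"
    # Daemon mode: the SDK evaluates flags straight from the Redis
    # Feature Store instead of opening a streaming connection upstream.
    options[:use_ldd] = true
    options[:redis_url] = env.fetch("LD_REDIS_URL", "redis://localhost:6379")
  end
  options
end

# With the real SDK, these options would feed LaunchDarkly::Config.new,
# and then LaunchDarkly::LDClient.new(sdk_key, config).
```

In daemon mode the SDK reads flags directly from Redis, so the Rails processes themselves never open streaming connections to LaunchDarkly.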
In the code sample above, we set the environment variables for our application service definition in Spacepods via Parameter Store as follows (note that USE_LDD is used to configure the client for LaunchDarkly Daemon mode):
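The actual values live in Parameter Store; below is an illustrative sketch in which every value is a placeholder:

```
# Application service environment (illustrative placeholders)
LAUNCH_DARKLY_SDK_KEY=<sdk-key-from-parameter-store>
USE_LDD=true
LD_REDIS_URL=redis://our-redis.internal:6379
```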
Here are the definitions for the wrapper gem:
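The gem itself is internal, so its code is not reproduced here; the sketch below shows what a minimal wrapper of this kind might look like, with a fallback default when the client has not been initialized. All names and behavior are assumptions:

```ruby
# Illustrative sketch of a feature flag wrapper gem. The real gem is
# internal to TrueCar; the names and behavior here are assumptions.
module FeatureFlags
  class << self
    # The underlying LaunchDarkly client; anything responding to
    # #variation(flag_key, user, default) will do.
    attr_accessor :client

    # Evaluate a boolean flag for a user, falling back to a default
    # when the client is unavailable (e.g. during boot or an outage).
    def enabled?(flag_key, user, default: false)
      return default if client.nil?
      client.variation(flag_key, user, default)
    end
  end
end
```

Injecting the client keeps the wrapper easy to test and lets the application swap in a daemon-mode client without changing call sites.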
And then, we add this section to our puma.rb file:
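The original section is not reproduced; because Puma forks worker processes, the usual pattern (and a reasonable guess at what the addition does) is to re-initialize the client after each fork:

```ruby
# Sketch: re-create the LaunchDarkly client in each forked Puma worker,
# since client connections do not survive a fork.
# FeatureFlags.initialize_client is a placeholder for our gem's setup.
on_worker_boot do
  FeatureFlags.initialize_client
end
```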
DynamoDB for Lambda@Edge
We gravitated toward DynamoDB for Lambda@Edge because we wanted our routing solution stateless and serverless. Using DynamoDB global tables allows us to replicate the Feature Store data across regions, so that we can lower latency to each of the edge locations where users request our site. We used the following configuration data in Parameter Store for the Lambda@Edge daemons.
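An illustrative sketch of that configuration, using the DynamoDB variable names from the ld-relay documentation; the key and table name are placeholders:

```
# Relay daemon-mode configuration for DynamoDB (illustrative placeholders)
LD_ENV_production=<production-sdk-key>
USE_DYNAMODB=1
DYNAMODB_TABLE=ld-feature-store
AWS_REGION=us-east-1
```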
Then we used configuration code with our Lambda@Edge (written in Node.js) to initialize the LDD mode and point at the DynamoDB Feature Store as shown below:
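The original snippet is not shown; the sketch below captures the shape of the SDK options object, with the DynamoDB Feature Store factory injected as a parameter so the example stays self-contained. The helper name `buildLddOptions` and the TTL value are assumptions, not the actual code:

```javascript
// Illustrative sketch of daemon-mode ("LDD") options for the Node SDK.
// The store factory is injected; in real code it would be the DynamoDB
// Feature Store from LaunchDarkly's DynamoDB add-on package.
function buildLddOptions(tableName, storeFactory) {
  return {
    // Daemon mode: evaluate flags from the Feature Store only, never
    // opening a streaming connection from the Lambda itself.
    useLdd: true,
    featureStore: storeFactory(tableName, { cacheTTL: 30 }),
  };
}

// With the real SDK, these options would be passed to the client
// constructor along with the SDK key.
module.exports = { buildLddOptions };
```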
Using this configuration, we can see that the client is starting up in LDD mode as expected.
We also tested the application with several feature flags: turning flags on and off works instantly; changing the targeting percentages works immediately as well. We even created and deleted flags to verify they start to function and then are removed immediately from our application. All of this happens quickly with updates and reads from Redis, with almost unnoticeable delay!
After running this configuration for several days, we confirmed via the LaunchDarkly statistics dashboard that our active server connection count dropped as expected (see if you can spot the deployment):
This graph shows a steep drop in server connection counts using LDD mode.
We can also verify that the connections from our Ruby backend have dropped (showing a move to the Go microservice proxy instead):
This graph confirms that our Ruby backend client connections have all but stopped.
Most LaunchDarkly customers would probably be very happy without having to implement a relay proxy. However, due to our special use cases with Lambda@Edge and with Ruby on Rails behind a Puma web server, we solved our stability and latency issues by accessing a local proxy and/or retrieving feature flags from a local, fast Feature Store.
We are hiring! If you love solving problems please reach out, we would love to have you join us!