Dark launching with analytics

Will Parks
Published in Vanguard Tech
5 min read · Jan 19, 2023

As the enterprise world moves from legacy services to cloud-based data stores, how do you verify that both your business logic and your data migration strategy are providing consistent results that your clients can rely on? Does the cloud offer an alternative to a traditional monolithic implementation?

Our engineering team at Vanguard found an opportunity to break up a monolith and take advantage of an all-cloud call chain. To remove data manipulation at the service level and take advantage of new data stores, we wanted to revamp our business logic and turn a web service's direct REST call to a dependency into an indirect one.

Implementing this change didn’t appear difficult, but putting it in place would mean extraordinary changes in data migration and business logic across all our dependencies. We realized that the only way we were going to be able to verify that the system was working as intended was to use a dark launch strategy for the entire call chain.

Read on to see how our Vanguard team leveraged state-of-the-art tech to meet this important opportunity head on!

Our Proposal

While dark launching looked like the right way to go, we knew that the traditional playbook for executing it wasn't going to serve our purposes. Our clients' uniqueness meant we couldn't simply split our traffic. Instead, we opted for a prod-parallel test, evaluating our new feature with every client while keeping the current system intact. Rather than using a proxy to split traffic, we forked within Elastic Container Service (ECS), making two REST calls in parallel: one for the client requesting data, and the other for the feature under test. We made the two calls concurrently so as not to introduce latency into the prod experience.
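
To make the fork concrete, here is a minimal sketch of the pattern using Java's built-in HTTP client. The endpoint URIs and class names are hypothetical; our actual service code and framework are internal to Vanguard.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;

public class ForkingService {

    private final HttpClient http = HttpClient.newHttpClient();

    // Hypothetical endpoints standing in for the real internal services.
    private static final URI PROD_PATH = URI.create("https://prod-service.internal/data");
    private static final URI TEST_PATH = URI.create("https://feature-service.internal/data");

    public HttpResponse<String> handle(String requestBody) {
        // Fire both calls concurrently so the dark-launch path adds no latency.
        CompletableFuture<HttpResponse<String>> prod = send(PROD_PATH, requestBody);
        CompletableFuture<HttpResponse<String>> test = send(TEST_PATH, requestBody);

        // The test path is fire-and-forget: log failures, never block the client.
        test.exceptionally(ex -> { /* log and swallow */ return null; });

        // Only the production response is ever returned to the caller.
        return prod.join();
    }

    private CompletableFuture<HttpResponse<String>> send(URI uri, String body) {
        HttpRequest req = HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        return http.sendAsync(req, HttpResponse.BodyHandlers.ofString());
    }
}
```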

Logging presented another challenge. The new feature we needed was owned by another team within Vanguard and available only within a separate downstream dependency. To overcome this hurdle, we reached beyond simple logging and developed a Java library that forwards REST requests and responses to a Kinesis Firehose, which in turn removes any personally identifiable information and stores the data in our data lake!
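
A sketch of what that forwarding might look like with the AWS SDK for Java v2. The stream name and the JSON envelope are illustrative assumptions, and the PII scrubbing happens downstream in the Firehose pipeline as described above:

```java
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.firehose.FirehoseClient;
import software.amazon.awssdk.services.firehose.model.PutRecordRequest;
import software.amazon.awssdk.services.firehose.model.Record;

public class TrafficRecorder {

    private final FirehoseClient firehose = FirehoseClient.create();

    // Hypothetical stream name; the real delivery stream is internal.
    private static final String STREAM = "rest-traffic-capture";

    /** Forwards one request/response pair; bodies are assumed to already be JSON. */
    public void record(String path, String requestBody, int status, String responseBody) {
        String json = String.format(
                "{\"path\":\"%s\",\"request\":%s,\"status\":%d,\"response\":%s}",
                path, requestBody, status, responseBody);
        firehose.putRecord(PutRecordRequest.builder()
                .deliveryStreamName(STREAM)
                .record(Record.builder()
                        .data(SdkBytes.fromUtf8String(json))
                        .build())
                .build());
    }
}
```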

All that remained was a diffing strategy. At first we were stumped, until we started talking to peers outside our product family. A cloud architect with prior experience in data analytics proposed the following idea: assuming that a given REST request body should yield the same response body and status when tested with and without our new feature, we could diff the two paths directly. With that premise, we designed our own “parity engine.”
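
In code, that working assumption reduces to a simple predicate. This sketch is illustrative rather than our production definition:

```java
import java.util.Objects;

/** The working definition of parity: same request in, same status and body out. */
public final class Parity {

    public record Observation(int status, String body) {}

    // Two paths are "at parity" for a request when status and body match exactly.
    public static boolean atParity(Observation prodPath, Observation testPath) {
        return prodPath.status() == testPath.status()
                && Objects.equals(prodPath.body(), testPath.body());
    }
}
```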

The Parity Engine
  1. An ECS batch job runs the test for all our clients on a weekly basis, during non-work hours.
  2. The batch job calls the forking service: one path through the current system, and the other through the new feature.
  3. The two paths sync back up at the shared downstream dependency instrumented with our Java library.
  4. Kinesis Firehose ingests the REST requests and responses, normalizes the data, and stores it in the data lake.
  5. A purpose-built process compares request objects across the two paths, taking into account the business logic that enhances the client experience. (Though we considered AWS Kinesis Data Analytics, its SQL-based queries weren't a good match for our team of Java developers. Instead, we set up an Elastic MapReduce (EMR) cluster, using Jupyter Notebooks to interface with Apache Spark.) We chose EMR because we could effectively emulate the business logic of dependent services to remove false positives, and our developers could rapidly create reports in languages they already knew or could quickly learn. A sketch of this comparison follows the list.
  6. The report generated by our notebook is available for data analysis, production support, and quality assurance.
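
The comparison in step 5 ran in Jupyter Notebooks against Spark on EMR. A rough sketch of the idea in Spark's Java API, with hypothetical column names and data-lake paths, might look like this:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class ParityReport {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("parity-report").getOrCreate();

        // Hypothetical data-lake layout: one Parquet table per path, keyed by request hash.
        Dataset<Row> prod = spark.read().parquet("s3://data-lake/traffic/prod/");
        Dataset<Row> test = spark.read().parquet("s3://data-lake/traffic/test/");

        // Join the two paths on the normalized request, then keep only the mismatches.
        Dataset<Row> mismatches = prod.alias("p")
                .join(test.alias("t"), col("p.requestHash").equalTo(col("t.requestHash")))
                .where(col("p.status").notEqual(col("t.status"))
                        .or(col("p.response").notEqual(col("t.response"))))
                .select(col("p.clientId"), col("p.requestHash"),
                        col("p.response").alias("prodResponse"),
                        col("t.response").alias("testResponse"));

        // The mismatch report feeds data analysis, production support, and QA.
        mismatches.write().mode("overwrite").parquet("s3://data-lake/reports/mismatches/");
    }
}
```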

Results

We quickly learned that achieving complete parity would be impossible. Exploring the data more closely revealed that comparing response payloads would yield a more accurate result and better define what success looked like for our team. We simply compared prod-path and test-path values; if the delta was over 5%, we considered the feature a failure for that client and prioritized the test case for investigation.
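
One way the 5% rule can look in code, assuming a single numeric value extracted from each response payload (the actual fields we compared are internal):

```java
/** Flags a client's result when the test-path value drifts more than 5% from the prod path. */
public class DeltaCheck {

    private static final double THRESHOLD = 0.05; // the 5% tolerance described above

    public static boolean failsParity(double prodValue, double testValue) {
        if (prodValue == 0.0) {
            return testValue != 0.0; // avoid division by zero; any drift from zero counts
        }
        double delta = Math.abs(testValue - prodValue) / Math.abs(prodValue);
        return delta > THRESHOLD;
    }
}
```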

After three months of investigations, conversations and adjustments, the reliability of the 150,000 clients in our test group rose from 50% to 90%.

We saw anomalies midway through and late in our testing effort. In these instances, a service we depended on was introducing a breaking bug for specific clients. We reported the bug and pursued a patch.

Further testing helped us uncover additional aspects of the business logic, and 10 months into our dark launch, 99% of our clients were able to take advantage of our new feature.

The road ahead

We came up with this strategy to safely test new features without impacting our current client experience. That said, it came with its own challenges.

A key learning was to be more flexible in defining how we achieved success. When we shifted from focusing on parity to focusing on the client response, we knew we were onto something. In retrospect, spending less effort on parity of request objects and turning our attention to the client's response sooner would have helped us prioritize the discrepancies we found.

Midway through our analysis, Amazon released EMR Studio, which would have reduced our architectural burden by sparing us from manually managing controller and worker nodes. If we were to do this again, we would explore EMR Studio for the cost savings, easier debugging, and real-time collaboration on notebooks, on top of the lighter maintenance.

Despite the challenges we ran into, we accomplished a production-parallel test that verified our data migration and business logic produced results within acceptable thresholds, all without impacting the client's current experience of a system they trusted.

In rapidly designing, prototyping, and shipping a novel solution to a production-parallel problem, we were given an opportunity that is rare for an application development team: we got to create something completely new and learn technologies that would otherwise have been excluded from our daily toolset.

Come work with us!
Vanguard’s technologists design, architect, and build modernized cloud-based applications to deliver world-class experiences to 30 million investors worldwide. Hear more about our tech — and the crew behind it — at vanguardjobs.com.

©2023 The Vanguard Group, Inc. All rights reserved.
