Enhancing Quality and Resiliency: DraftKings Isolated Testing Solution — Part 1

Nicola Atorino
DraftKings Engineering
11 min read · Apr 16, 2024

Introduction

When dealing with a microservice-based architecture, it's often the case that the complexities of a system are hidden in the integration between services more than inside individual components. Given a fixed-scope solution, the rule of thumb is that the simpler the logic defined for each service, the more complex the interactions between them become. This is one of the reasons why the implementation of nano-services is considered an anti-pattern. Many challenges lie in the shadows when different systems need to be orchestrated to provide a solution, for example:

  • Communication: services can experience network issues and must be able to react correctly in case of a temporary disruption of communication.
  • Breaking changes: microservices that evolve independently may unexpectedly produce a change in their contract that is no longer compatible with the contracts implemented by other services that depend on them.
  • Separation of concerns: the logic for implementing functional requirements must be spread across multiple components. This can be challenging (primarily when different teams own different components) and requires additional care around testing and work breakdown.
  • Transactionality: operations that could easily be made atomic in a monolithic infrastructure are now more complicated to handle due to the system's distributed nature.
  • Fault tolerance: each service may experience failures at any point in time, and a proper resiliency strategy must be implemented to keep the impact on the solution as a whole to a minimum.
  • Observability: the state of a distributed system can only be reported and troubleshot with a powerful and flexible monitoring strategy.

One of the most critical aspects for DraftKings is the reliability of our services: there is no sense in further improving the system's capabilities if what already exists is not up to the quality standards we adhere to. The system must be stable, performant, and accessible at all times, and it is not acceptable for new functionality to disrupt the existing ecosystem and worsen the customer experience throughout the rest of the product.

With all this in mind, testing each service in isolation, no matter how thorough the coverage, is no longer a guarantee that the system as a whole behaves as expected. It is crucial to cover scenarios that span a long list of different components with diverse characteristics in terms of languages, communication protocols, storage providers, etc.

At DraftKings, we also faced these complexities. We were determined to find a way to solve them efficiently and cost-effectively, as our existing solutions did not give us the reliability we needed. To solve this problem, we built an isolated testing environment called CleanRoom, utilizing fundamental Kubernetes components, infrastructure as code, and a CI/CD system. This enabled our engineers to quickly spin up and tear down customized isolated environments to address the complexities of our microservice architecture. This article will walk through some of our CleanRoom case studies and dive deeper into the system's architecture.

Introducing CleanRoom

Over the past few years, many of our efforts have been dedicated to migrating our existing software infrastructure to Kubernetes. Due to this transition, our development teams became familiar with working with Kubernetes manifests and Helm charts. In turn, CleanRoom was designed to leverage our engineers' existing knowledge. The solution we devised entails the following key components:

  1. Creation of a New Kubernetes Cluster for Isolated Testing: We established a dedicated Kubernetes cluster specifically tailored for isolated testing purposes. This cluster mirrors the configuration of our shared-environment clusters to ensure consistency and compatibility of the software running there.
  2. Unique Namespace for Each Environment: Every environment an engineer creates has its unique namespace within the Kubernetes cluster. This isolation ensures that each testing environment remains independent and does not interfere with others.
  3. Infrastructure as Code Configuration: The configuration of isolated environments is stored as Infrastructure as Code (IaC) in a Git repository. This approach allows for versioning, reproducibility, and easy management of environment configurations.
  4. Time-to-Live (TTL) for Namespaces: Engineers must define a Time-to-Live (TTL) for each namespace before deployment. Upon reaching the TTL threshold, the namespace is automatically deleted, promoting efficient resource utilization and preventing clutter. A sketch of what such a definition might look like follows this list.
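
To make points 2–4 more concrete, here is a minimal sketch of what a per-environment definition kept as IaC in Git might boil down to: a namespace manifest carrying a TTL annotation. This is an illustration only; the namespace name, labels, and the cleanroom.draftkings.com/ttl annotation key are hypothetical, not CleanRoom's actual schema.

    # Hypothetical sketch of an environment definition stored in Git.
    # Names and the annotation key are illustrative assumptions.
    apiVersion: v1
    kind: Namespace
    metadata:
      name: pick6-rc-loadtest            # unique namespace per isolated environment
      labels:
        owner: pick6-team                # which team owns this environment
      annotations:
        cleanroom.draftkings.com/ttl: "4h"   # environment is torn down once the TTL expires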

In the sections below, we will discuss the setup of CleanRoom further and some of the benefits we received from it here at DraftKings.

What CleanRoom Enabled Us to Do

CleanRoom is a versatile system, and DraftKings has used it creatively to cover a wide range of needs. In this section, we're going to cover four different scenarios where CleanRoom proved invaluable and allowed our development teams to provide:

  • On-time delivery
  • Great quality at launch
  • Maintenance of acceptable performance throughout a system's lifecycle
  • Ongoing assurance of software robustness, even during external disruptions

We will explore different developer use cases below, starting with the ones that provided us with the most benefit.

Automated Load Testing

The usual approach at DraftKings for load testing is to reserve a certain amount of developer and QA capacity to provision data and execute load tests on a shared testing environment. This process is typically executed only in particular situations during the year:

  • The launch of a new product
  • Significant changes to a system's logic
  • Particular times of the year when traffic for the domain as a whole is expected to grow significantly (e.g., the first week of the NFL season or Super Bowl night)

These cases are chosen so sparingly and carefully because a comprehensive load test that mimics a natural production environment as closely as possible requires considerable effort from multiple teams, with coordination between different verticals. What we wanted to achieve with CleanRoom was the possibility of automating our load testing in a way that would no longer require this capacity strain, generating acceptable results in a steadier way that fits nicely into the existing software development lifecycle.

DraftKings launched a new product in 2023 called Pick6. The architecture team wanted to provide the development team with a way to continuously validate that the improvements and features implemented during day-to-day application maintenance kept the system's performance intact under load. For example, it is surprisingly easy to seriously impact the performance of a stored procedure by adding a new index for a completely unrelated task (e.g., reporting).
Running a complete suite of manual load tests for each change (completely mimicking the production environment) is not feasible due to the effort mentioned before. However, with CleanRoom, it is possible to completely automate a load test suite running in an on-demand environment for each and every release candidate. The system would:

  • Start up an on-demand environment with a scaled-down version of the system.
  • Provision this environment with a critical mass of data using automated scripts.
  • Run a load test and produce a detailed report of the performance of the key areas. This report will then be used as a baseline.
  • Allow developers to compare their release candidate report against the baseline to quickly notice any relative improvement or degradation due to the changes introduced in the release candidate.

Because it wouldn't be feasible to provision the exact same amount of data that a real production environment has to deal with, a process like this cannot be 100% reliable when it comes to reproducing production behavior. However, it provides fast feedback to the team, signaling possible issues that could occur on production BEFORE the new version is shipped. Combining this system with traditional manual load testing, performed on a shared environment, provides several advantages. This is the final process that the Pick6 development team adopted:

  • A particular software version was manually tested on a shared load environment, provisioned with 400 million entities in the database and scaled up to the same database instance size and number of replicas used in production. This load test proved that this version of the application can withstand a peak of 20 thousand entries per minute.
  • The identical version, deployed in CleanRoom with 30 million entities stored, a much smaller database instance, and far fewer replicas for the microservice, can withstand a peak of 5 thousand entries per minute. This becomes our baseline, and a report is generated.
  • From then on, every new release is tested with the same configuration. Suppose one of the reports shows that a release candidate can only handle a peak of 2 thousand entries per minute. In that case, the development team notices it as part of their software lifecycle in a matter of minutes and can investigate the cause and take corrective action, significantly reducing the risk of a regression in production. (A simplified sketch of this comparison follows this list.)
  • On the other hand, it is also possible to quickly validate that a change meant to improve the system's performance did its job correctly: CleanRoom can be used as an initial validation before the actual change, if necessary, gets tested following the manual load testing approach. At that point, a new performance baseline is generated and used from then on as the comparison against subsequent releases.
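
To illustrate the comparison step, here is a minimal C# sketch of how a release candidate's report might be checked against the baseline. The types, names, versions, and the 10% tolerance are illustrative assumptions, not CleanRoom's actual implementation.

    using System;

    // Example using the numbers above: the baseline sustains 5,000 entries per
    // minute in the scaled-down environment; a release candidate reaches 2,000.
    var baseline  = new LoadTestReport("v1.4.0", PeakEntriesPerMinute: 5_000);      // hypothetical versions
    var candidate = new LoadTestReport("v1.5.0-rc1", PeakEntriesPerMinute: 2_000);
    Console.WriteLine(BaselineComparer.IsRegression(baseline, candidate));          // prints: True

    // A report boiled down to the single metric used for the comparison.
    public record LoadTestReport(string Version, double PeakEntriesPerMinute);

    public static class BaselineComparer
    {
        // Flags a regression when the candidate's peak throughput falls more than
        // `tolerance` below the baseline measured in the same scaled-down setup.
        public static bool IsRegression(LoadTestReport baseline, LoadTestReport candidate, double tolerance = 0.10)
            => candidate.PeakEntriesPerMinute < baseline.PeakEntriesPerMinute * (1 - tolerance);
    }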

A complete automated test, from the commit of the release branch up to the generation of a report, will take, at most, a few minutes. This is a great tool to confirm that a system is always in the best shape possible, moving from a reactive approach to a proactive one. This avoids cases where significant degradations are noticed only a few days before an expected stressful period for the system — or even worse, DURING the stressful period — leaving no time to react.

The following images show two examples of a load test report for two separate releases. The second release's performance has worsened, requiring more investigation from the development team before shipping to production.

This is our initial baseline, used for comparison against subsequent releases of the application.
The performance here is clearly degraded: something in the new version needs to be checked again before shipping.

Automated System Functional Testing

As mentioned earlier in this article, when dealing with a microservice-based architecture, the complexities of a system are often hidden in the integration between services more than inside each individual component. When a team is ready to release multiple features and wants to ensure everything is working as expected, the flow looks like this:

  • The team creates a branch ready to be promoted to a shared environment (or multiple branches if the feature requires changes to several services).
  • In CleanRoom, a dedicated environment tailored to the release is automatically provisioned with a unique namespace and a short Time-to-Live (TTL), as sketched in the pipeline below.
  • A set of functional system tests is kicked off.
  • Results are promptly reported back to the team for immediate insights into system performance and reliability.
  • The team retains the flexibility to manually test use cases and visually inspect system elements within the CleanRoom environment. This allows for real-time debugging and manual validation of feature functionality if necessary.
High-level diagram of the automated testing infrastructure
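
As a rough illustration of how such a flow can be wired into CI, here is a hypothetical pipeline sketch. GitHub Actions syntax is used purely for illustration, and the chart path, namespace convention, and test command are assumptions; this is not DraftKings' actual pipeline.

    # Hypothetical CI sketch: provision an isolated environment per pull request,
    # then run the system tests against it. All names below are illustrative.
    name: cleanroom-system-tests
    on: pull_request

    jobs:
      system-tests:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Deploy an isolated environment for this release branch
            run: |
              helm upgrade --install "env-pr-${{ github.event.number }}" ./deploy/chart \
                --namespace "pr-${{ github.event.number }}" --create-namespace
          - name: Run the functional system tests against the new namespace
            run: dotnet test ./tests/SystemTests
            env:
              TARGET_NAMESPACE: pr-${{ github.event.number }}

The TTL mechanism described earlier takes care of tearing the namespace down afterward, so the pipeline itself does not need an explicit cleanup step.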

To show an example of this, we can discuss the integration of a 3rd-party odds provider into our sportsbook system. When a new provider is integrated, several concerns need to be handled, and usually a separate component takes care of each responsibility. Some of these concerns are:

  • Communication with the 3rd party, following its specific protocols and rules;
  • Processing the rapid changes in the probabilities for the different sports and events that the 3rd party provides;
  • Association between the 3rd-party entities (games, teams, leagues) and DraftKings entities.

Here is a small diagram showing the integration's high-level architecture (up to a certain point). Each box represents a different deployed microservice that takes part in the integration.

Each one of the components can be heavily covered with unit tests, usually executed during the Continuous Integration process. However, using CleanRoom opens up opportunities to test the system end-to-end: it is up to the engineering team to decide what level of coverage to aim for and to balance the advantages of that coverage against the effort required to implement it.

The way this was implemented for this particular integration looks like this:

Specifically, we agreed to cover the integration between distribution and odds processing and the integration between distribution and events mapping. The decision is usually made following the 80/20 principle: 80% of the coverage is achieved with 20% of the effort that would be needed to cover the solution in every possible aspect.

This kind of testing is entirely functional: end-to-end testing aims to describe and validate behaviors, not implementations. For this reason, we used SpecFlow to define the system behavior without referencing the actual internal service implementations or communication protocols. Below is a sample of one of the scenarios for the first set of tests, involving odds processing. Notice there is no reference to communication protocols, storage, or how many components are involved in fulfilling the business requirement.

Scenario Outline: Update DoubleChance markets only when 3rd party provides all selections
Given 3rd Party has new available fixture
When 3rd Party provides selection update for market <market> with the price <homePrice> and selection type Home
Then eventTypeId <eventTypeId> should not be updated
When 3rd Party provides selection update for market <market> with the price <awayPrice> and selection type Away
Then eventTypeId <eventTypeId> should be updated with values 1: <homePrice> 2: <awayPrice>
    Examples:
| market                 | homePrice | awayPrice | eventTypeId |
| DoubleChance           | 120       | -110      | 0           |
| 1st Half Double Chance | 130       | -120      | 3302        |

This scenario covers a slightly convoluted test that ensures DK internal updates are correctly generated based on the 3rd party's specific logic. It confirms not only that the internal logic of each component works as expected but also that the communication between the various components behaves according to specifications and is not disrupted by any internal implementation change.
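
For readers unfamiliar with SpecFlow, each Gherkin step above is bound to ordinary C# test code. The sketch below shows what bindings for the first two steps might look like; the ThirdPartyDriver class and its methods are hypothetical stand-ins for whatever drives the isolated environment, not the actual DraftKings bindings.

    using System.Threading.Tasks;
    using TechTalk.SpecFlow;

    [Binding]
    public class OddsProcessingSteps
    {
        // Hypothetical test driver that talks to the components deployed in the
        // CleanRoom namespace; SpecFlow injects it via its built-in container.
        private readonly ThirdPartyDriver _thirdParty;

        public OddsProcessingSteps(ThirdPartyDriver thirdParty) => _thirdParty = thirdParty;

        [Given(@"3rd Party has new available fixture")]
        public Task GivenThirdPartyHasNewAvailableFixture()
            => _thirdParty.PublishNewFixtureAsync();

        [When(@"3rd Party provides selection update for market (.*) with the price (.*) and selection type (.*)")]
        public Task WhenThirdPartyProvidesSelectionUpdate(string market, int price, string selectionType)
            => _thirdParty.PublishSelectionUpdateAsync(market, price, selectionType);
    }

    // Stubbed-out driver, shown only to keep the sketch self-contained.
    public class ThirdPartyDriver
    {
        public Task PublishNewFixtureAsync() => Task.CompletedTask;
        public Task PublishSelectionUpdateAsync(string market, int price, string selectionType)
            => Task.CompletedTask;
    }

Because the bindings only describe behavior, the services behind them can change their internal implementation, or even their communication protocol, without invalidating the scenario.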

This, instead, is a sample scenario from the second set of tests, involving events mapping:

Scenario: 3rd party event is mapped to DK event
Given a DK event exists for the 3rd party event
And there is an association between the 3rd party event and the DK event
When 3rd party produces event update
Then the 3rd party event is mapped to the DK event

This simple scenario ensures that the communication between the components is valid and that a change in the mapping system (perhaps implemented to cover a different 3rd party's use case) does not affect the expected business behavior.

Thanks to this system, our sportsbook team could substitute many detailed tests with a much smaller number of end-to-end tests and maintain similar, if not higher, coverage of use cases. These tests validate the behavior of the system as a whole and can catch issues that cannot easily be found by covering each component in isolation. While this solution cannot remove the need for more detailed tests, it's a great addition to our toolset, changing how systems are tested to focus on a more holistic approach.

What's Next

In the second part of the article, linked here, we will discuss the architecture of CleanRoom and how it allows us to cover the use cases just described.

Want to learn more about DraftKings’ global Engineering team and culture? Check out our Engineer Spotlights and current openings!
