Enhancing Quality and Resiliency: DraftKings' Isolated Testing Solution — Part 2

Jonathan Jesuraj
DraftKings Engineering
10 min read · Apr 16, 2024
Dall-E Image for Isolated Testing

Introduction

In the first part of this article, we explored some interesting use cases that the CleanRoom system was able to solve. In this second part, we will explore the architecture of the CleanRoom system and how its components work internally.

CleanRoom Architecture

The Kubernetes Cluster

The Kubernetes cluster is configured to mirror the production-level cluster, ensuring consistency across environments and allowing developers to deploy their production-ready manifests with little to no change. Some of the critical features of our clusters that CleanRoom relies on heavily are the following:

  1. Cluster Auto-Scaling — Prevents resource contention amongst engineers, who can deploy their environments without worrying about occupying shared resources.
  2. Logging and Metric Collection — Allows developers to reuse existing queries and dashboards to observe what is going on with their systems while working in the environment.

Organizing the IaC — Git Repository Structure

Due to the intricate dependencies within an SOA, we've set up a Git repository structure that ensures services and infrastructure are operational before dependent services are deployed. Our directory structure follows the hierarchy below, which guarantees that software and infrastructure are deployed in the specified order:
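
The hierarchy shown here is only an illustrative sketch of such a repository; the folder and service names are hypothetical examples rather than our actual repository contents:

dependency/
  00-databases/
    postgres.yaml
  00-messaging/
    rabbitmq.yaml
  01-mocks/
    third-party-mock.yaml
system-under-test/
  entries-api.yaml
  entries-processor.yaml
testapp/
  automated-tests-job.yaml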

Dependency Folder

The initial dependencies in this folder are deployed sequentially according to their numerical prefixes. The deployment process also ensures that the subsequent tier is only initiated once the current tier has been confirmed as ready based on the Kubernetes API readiness check.

For instance, in the example provided, items prefixed with "00" will be deployed initially, ensuring their stability before deploying items prefixed with "01" and so forth. This approach to deployment guarantees that dependencies are established in a controlled and orderly manner, minimizing disruptions and ensuring the environment's readiness for subsequent deployment tiers.
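
The readiness check itself is the standard Kubernetes mechanism; a dependency manifest would typically declare a readiness probe, as in this hypothetical RabbitMQ pod sketch:

# Hypothetical dependency pod; the tiered rollout waits until it reports Ready
apiVersion: v1
kind: Pod
metadata:
  name: rabbitmq
  namespace: {currentNamespace}
spec:
  containers:
    - name: rabbitmq
      image: rabbitmq:3-management
      ports:
        - containerPort: 5672
      readinessProbe:
        tcpSocket:
          port: 5672
        initialDelaySeconds: 10
        periodSeconds: 5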

System Under Test

Once the dependencies from the previous step have been deployed, the contents of this folder are deployed. The main difference is that these services are deployed concurrently, and the deployment process progresses only after every service in this folder reports readiness, ensuring the environment advances to the next stage only when all services are confirmed as ready for operation.

TestApp (Optional)

In this optional folder, developers execute automated tests within the environment by deploying them as a Kubernetes Job in the same namespace. For example, a test can verify that an API call adds a message to a queue within the environment by calling the API and immediately scanning the queue to ensure the message has been sent. The reports generated from these tests are returned to the developer as output from the cluster deployment job, providing valuable insights into the system as a whole.
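
A minimal sketch of such a test Job, assuming a hypothetical test image and entry point, might look like this:

# Hypothetical test Job; the image, command, and namespace placeholder are illustrative
apiVersion: batch/v1
kind: Job
metadata:
  name: testapp
  namespace: {currentNamespace}
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: testapp
          image: registry.example.com/cleanroom-tests:latest
          command: ["/run-tests.sh", "--report-dir", "/reports"]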

Deploying to the Cluster

We've established a Jenkins job that requires developers to specify the Infrastructure as Code (IaC) repository they intend to deploy to the environment. This Jenkins job validates the repository's format and deploys the manifests using Jenkins agents. To optimize cost-effectiveness, calls to this job must also specify the desired lifespan, in hours, of the planned environment. This enables us to automatically delete namespaces after the specified duration, preventing unnecessary resource consumption. The average time to spin up an environment is ~2 minutes.

Accessing Infrastructure in the Namespace

Since these environments are deployed remotely in a Kubernetes cluster, we prioritized infrastructure accessibility from developer machines. Leveraging Kubernetes ingresses, we established unique, custom DNS entries pointing to pods for traffic over the HTTP and HTTPS protocols. For traffic not transmitted over HTTP/HTTPS, Kubernetes port forwarding routes a port on the user's machine to a corresponding port on the cluster. By utilizing these two Kubernetes components, developers can interact with infrastructure components directly from their local machines, simplifying testing and debugging of the environment.
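
As a sketch of the HTTP/HTTPS path, assuming a hypothetical host pattern and service name (the actual DNS scheme is internal):

# Hypothetical ingress exposing a service under a namespace-specific host name
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: entries-api
  namespace: {currentNamespace}
spec:
  rules:
    - host: entries-api.{currentNamespace}.cleanroom.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: entries-api
                port:
                  number: 80
# Non-HTTP traffic (e.g., a database) can instead be reached via port forwarding:
# kubectl port-forward -n {currentNamespace} svc/postgres 5432:5432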

Customizing the Namespace

The isolated namespaces provided by CleanRoom offer users a wide range of customization options without affecting shared environments. Engineers can test new resource limits and configurations, ensuring consistent performance of their software. We can also quickly deploy public Docker images or any custom images into the cluster. This flexibility is especially useful when setting up dependent infrastructure such as databases, messaging systems, and mocking tools, where we control initial seed data, scalability, and performance characteristics.
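
For instance, raising resource limits or swapping in a custom image is just a manifest change within the namespace; all names and values in this sketch are hypothetical:

# Hypothetical per-namespace customization: custom image tag and resource limits
apiVersion: apps/v1
kind: Deployment
metadata:
  name: entries-processor
  namespace: {currentNamespace}
spec:
  replicas: 2
  selector:
    matchLabels:
      app: entries-processor
  template:
    metadata:
      labels:
        app: entries-processor
    spec:
      containers:
        - name: entries-processor
          image: registry.example.com/entries-processor:my-feature-branch
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi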

Other noteworthy use cases

Fast Experiments

Usually, during the first phase of designing a software solution, the architectural team makes assumptions about a specific component's capacity to perform acceptably under high load. Without a solution like CleanRoom, these assumptions can be verified only after a large part of the infrastructure and software has been implemented. More often than not, a wrong assumption caught this late in development can lead to significant disruptions to the development roadmap and could jeopardize the launch of a new product.

CleanRoom allows for fast implementation of Proofs of Concept in isolation, even complex ones requiring the combined interplay of software applications and specific infrastructure topologies. More importantly, these can be tested directly by an architect or by the technical leads of the team without the need to involve other departments (for example, requesting the creation of a new database resource in a shared environment or the setup of a set of exchanges on a RabbitMQ cluster).

During the first design phase of Pick6, one of the products launched in 2023, the design team had to choose a database access strategy for inserting customer entries. The architecture involved two separate microservices, a set of RabbitMQ queues and exchanges (to allow communication between the two services), and one RDBMS database (where the customer entries had to be persisted). CleanRoom allowed the quick testing of 36 different combinations of logic and configuration changes to understand the optimal setup. Some examples of what was tested and fine-tuned, sketched in the configuration example after this list, included:

  • size of the RDBMS connection pool
  • use of temporary DB tables and cursors versus application-internal loops for database access (i.e., complexity on the app side versus higher complexity on the DB side)
  • degree of parallelization at the application level
  • minimum number of threads for the microservice apps
  • total number of replicas for the microservice apps
  • amount of CPU and RAM necessary for the applications to perform optimally without overscaling
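
Here is a hedged sketch of what one such combination could look like as configuration; every name and value below is hypothetical and only illustrates the kind of knobs varied between runs:

# Hypothetical configuration values varied across the 36 combinations
entries-service:
  replicas: 3
  resources:
    requests: { cpu: 500m, memory: 512Mi }
    limits: { cpu: "1", memory: 1Gi }
  database:
    connectionPoolSize: 50
    accessStrategy: temp-table-cursor   # alternative: application-side loop
  processing:
    parallelism: 8
    minThreads: 16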

All these combinations were then tested with various degrees of traffic, varying from a few concurrent users up to several hundred. This is a small sample showing some results for the different approaches taken; the absolute numbers are insignificant in this case, and only the relative differences between them matter.

Thanks to this, the assumptions made by the design team were validated much earlier in the product's development lifecycle, and a proper set of guidelines for system configuration under high load was provided to the development team. Similar POCs were implemented for all the critical aspects of the product, providing an excellent skeleton before the actual development of the various features was undertaken. This significantly reduced the risk of finding a fundamental issue in the design very late in the system's roadmap and allowed the Pick6 product to be launched on time without last-minute surprises. All the key aspects were then re-validated once the solution was fully up and running in a pre-production environment.

Resiliency testing

Another excellent use case for CleanRoom is resiliency testing. In real-world scenarios, applications can suffer all sorts of disruptions: Kubernetes pods can get evicted, a dependency could fail, an instance of the service could experience an out-of-memory exception, the network can become delayed, and so on. It is essential to ensure that in such cases our components notice the disruption, automatically take appropriate actions to minimize the potential impact, and recover as soon as the disruption has ended.
Since CleanRoom runs on Kubernetes, we have integrated a tool called Chaos Mesh (https://chaos-mesh.org/) that allows us to run simulations of different disruption scenarios and confirm that our system behaves correctly.

Chaos Mesh supports several kinds of experiments, and depending on the component, we would concentrate on one or more of the most likely cases:

  • crash of a needed dependency
  • loss of network connectivity
  • crash of one of the application instances (if the application works with an active/passive approach, crash of the leader instance)

For some of our 3rd party integration services, we have built a comprehensive resiliency test suite that covers the cases where the 3rd party suffers a temporary disruption. Situations like this often require specific actions on the DK side (for example, turning off betting on the affected games or markets), and we do not always have the luxury of asking the 3rd party to simulate these disruptions on their side in a lower environment to confirm that the system behaves as expected. CleanRoom allows us not only to test all these cases before launch but also to run an automated set of regression tests verifying that any change made during the system's lifecycle (bug fixing or feature development) does not jeopardize the application's resiliency.

Chaos Mesh defines so-called 'experiments' that can be applied to different targets in a Kubernetes cluster. An experiment can be left running for a certain period to simulate a temporary disruption, and the automated testing system can then validate that the end-to-end system has recovered correctly once the experiment has ended. An example of an experiment looks like this:

kind: PodChaos
apiVersion: chaos-mesh.org/v1alpha1
metadata:
  namespace: {currentNamespace}
  name: kafka-pod-fault-{iteration}
spec:
  selector:
    namespaces:
      - {currentNamespace}
    labelSelectors:
      app.kubernetes.io/name: kafka
  mode: all
  action: pod-failure
  duration: 1m
  gracePeriod: 0

This experiment is applied to all pods currently carrying the Kafka label. The experiment type is 'pod-failure', which makes the pods unavailable for a certain period. In the CleanRoom environment, the Kafka dependency is hosted in these pods, so this experiment effectively disrupts communication with the Kafka cluster for every application that consumes or produces messages.
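
The other disruption cases listed above are covered in the same way; for example, a loss-of-connectivity experiment against the same Kafka pods could be sketched as follows (illustrative values rather than the exact experiment we run):

kind: NetworkChaos
apiVersion: chaos-mesh.org/v1alpha1
metadata:
  namespace: {currentNamespace}
  name: kafka-network-loss-{iteration}
spec:
  selector:
    namespaces:
      - {currentNamespace}
    labelSelectors:
      app.kubernetes.io/name: kafka
  mode: all
  action: loss
  loss:
    loss: "100"
    correlation: "0"
  duration: 1m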

With this in mind, here is an example of some basic resiliency scenarios used for several of our 3rd party integrations. The implementation specifics would then vary based on the integration.

Feature: 3rd party Resiliency Tests

  Scenario Outline: 3rd party integration health status gets updated correctly when dependencies experience issues
    Given 3rd party integration is healthy
    When we apply experiment "<experiment>"
    Then the application exposes a "<initialhealthstatus>" health status
    When the experiment ends
    Then the application exposes a "<finalhealthstatus>" health status

    Examples:
      | experiment           | initialhealthstatus | finalhealthstatus |
      | 3rd-party-fault.yaml | warning             | passing           |
      | kafka-fault.yaml     | critical            | passing           |

  Scenario Outline: 3rd party Leader election system is able to switch leader instance when necessary
    Given 3rd party integration is healthy
    When we apply experiment "<experiment>" to leader instance
    When the experiment ends
    Then the leader instance should have changed
    And new leader instance exposes a "<finalhealthstatus>" health status

    Examples:
      | experiment        | finalhealthstatus |
      | leader-fault.yaml | passing           |

The first scenario confirms that the application's exposed health status correctly reflects the state of the system. The health status is critical at DraftKings, as it is monitored by external processes that trigger various mechanisms to protect the downstream environment when cases like this arise.

The second scenario covers cases relevant to an active/passive component (multiple instances deployed, but only one of them actively processing data): with this topology, it is crucial to test that if something happens to the current leader instance, another one can quickly take its place and continue processing as expected.

Once the testing has ended and the results are published via our CI/CD pipelines, the entire CleanRoom environment is decommissioned, and each new test run starts from scratch in a brand-new environment. This removes any risk that a previous set of tests may have left the environment in a corrupted state, a common problem when testing resiliency on shared environments that are not destroyed after testing.

Experience has taught us that maintaining this robustness on fast-changing services is not trivial, especially with complex integrations — and that this behavior can easily be broken by changes that, at first glance, may seem unrelated. An automated test suite like this gives developers and product representatives the peace of mind that a fundamental resiliency requirement is always respected, no matter how fast the product evolves to accommodate new features or update existing ones.

Conclusion

In conclusion, CleanRoom is a vital component of our SDLC due to its flexibility, and it is primarily used to ensure the robustness of our systems. Our investment in CleanRoom has reaped substantial benefits, as seen in our improved software quality, operational efficiency, and quick feedback loops. By enabling fast experiments, automated load tests, resilience tests, and functional tests, CleanRoom provides our engineers with quick and cost-effective insights. Integrated seamlessly with Kubernetes and Infrastructure as Code principles, CleanRoom allows our teams to use a familiar deployment framework while ensuring consistency and reliability across deployments. As we look ahead, refining CleanRoom remains near the top of our priorities, with a focus on aligning its capabilities with the evolving demands of our organization. While we've only begun to scratch the surface of testing with CleanRoom, its numerous features and flexibility ensure that it will continue to play a significant role in our software development in the future.

Additional Resources

Want to learn more about DraftKings’ global Engineering team and culture? Check out our Engineer Spotlights and current openings!
