Saving costs while improving KPIs with AMD EPYC

Martin Kostov
DraftKings Engineering
6 min read · Nov 28, 2023

One of the engineering teams at DraftKings faced the challenge of optimizing operational costs while ensuring a top-tier betting experience for customers. During their discovery tests, they found a promising area for improvement in the bonus management system, a system with a throughput KPI of tens of thousands of bonuses per minute, primarily operated by a Redis cluster. This article shares their journey, the challenges they encountered, and the improvements they made. The results were enhanced Redis cluster performance, cost savings, and maintained system integrity, all confirmed under stress conditions.

Understanding the Bonus Management System

The bonus management system consists of several services and a Redis cluster. Redis, an open-source in-memory store, serves as a cache, substantially enhancing performance. Everything is deployed in a Kubernetes cluster, and the Redis deployment consists of a primary node and three replicas. Bonuses are initially created in a relational database and subsequently cached in Redis for quick access and processing. Crucially, Redis operates as the primary data source for bonuses, emphasizing the importance of data integrity within the Redis cluster: it must not serve stale data. Therefore, the team's Redis cluster operates under the "replica-serve-stale-data no" configuration, ensuring that only the most recent data from the primary node is served and outdated data from the replicas is avoided.
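To illustrate what this setting means in practice, here is a minimal sketch using the redis-py client (the host name and key are hypothetical). With replica-serve-stale-data no, a replica rejects reads while its link to the primary is down, rather than returning potentially stale values.

```python
import redis

# Hypothetical replica endpoint; adjust host/port for your environment.
replica = redis.Redis(host="redis-replica-0", port=6379, decode_responses=True)

# Confirm the replica is configured to refuse stale data.
print(replica.config_get("replica-serve-stale-data"))
# -> {'replica-serve-stale-data': 'no'}

# While the link to the primary is down, reads fail with a MASTERDOWN
# error instead of silently returning outdated values.
try:
    replica.get("bonus:12345")  # hypothetical key
except redis.exceptions.RedisError as exc:
    print(f"Replica refused the read: {exc}")
```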

The images below illustrate what happens to the cluster during a sustained stress test above the regular rate limit.

Working Redis cluster

Redis cluster that lost sync to its replicas

The result is a loss of synchronization between the primary and its replicas, as visible in the second image.
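One way to observe this desynchronization is to poll INFO replication on both sides. The sketch below uses redis-py with hypothetical endpoints: the primary reports the state and offset of each attached replica, while a replica whose link is lost reports master_link_status as down.

```python
import redis

# Hypothetical endpoints for the primary and one of its replicas.
primary = redis.Redis(host="redis-primary", port=6379, decode_responses=True)
replica = redis.Redis(host="redis-replica-0", port=6379, decode_responses=True)

# The primary lists each attached replica (slave0, slave1, ...) with its
# state and replication offset; a growing gap versus master_repl_offset
# means that replica is falling behind.
info = primary.info("replication")
print("primary offset:", info["master_repl_offset"])
for key, value in info.items():
    if key.startswith("slave"):
        print(key, value)

# On the replica, 'up' means in sync; 'down' means the link was lost and,
# with replica-serve-stale-data no, the replica will refuse reads.
print("link status:", replica.info("replication")["master_link_status"])
```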

Unlocking Cost Savings with AMD EPYC

The discovery tests used the same configuration in a new node pool with different types of processors, as shown in the following image.

Migration of the GCP node pool

The transition moved the workload from Intel-powered N2 instances on the Google Cloud Platform (GCP) to Compute-Optimized C2D instances equipped with AMD EPYC processors. N2 instances are GCP's general-purpose compute instances, providing a balanced combination of memory and compute power. They are powered by 2nd generation Intel Xeon Scalable processors, making them a versatile option for a wide range of workloads.

After the initial discovery tests, they ran a series of stress tests and performance benchmarks on the new setup. Crucially, the new setup disabled hyper-threading, so only physical CPU cores were used. While hyper-threading often improves performance by allowing a single core to handle two threads concurrently, Redis is single-threaded, so disabling hyper-threading ensures each Redis process has a dedicated physical core, optimizing CPU resource use. You can see the results of the change in the image below.

The transition yielded highly encouraging results. The new node pool handled the high load during peak betting periods, proving its efficiency and reliability. The nodes showed up to four times lower CPU usage, while operational costs stayed within 10% of the original setup's. The team significantly enhanced performance without inflating costs, demonstrating a successful cost optimization strategy.
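The stress-testing harness itself is beyond the scope of this article, but a minimal throughput benchmark along the following lines gives an idea of how two node pools can be compared under identical load. This is a sketch using redis-py pipelines; the endpoint, key names, and operation counts are illustrative, not the team's actual test setup.

```python
import time

import redis

r = redis.Redis(host="redis-primary", port=6379)  # hypothetical endpoint

OPS = 100_000   # illustrative total number of writes
BATCH = 1_000   # pipeline size, to amortize network round trips

start = time.perf_counter()
for batch in range(OPS // BATCH):
    pipe = r.pipeline(transaction=False)
    for i in range(BATCH):
        pipe.set(f"bonus:{batch * BATCH + i}", "payload")
    pipe.execute()
elapsed = time.perf_counter() - start

print(f"{OPS / elapsed:,.0f} ops/sec")
```

Running the same script against each node pool while watching CPU usage is enough to reproduce a comparison of the kind summarized below.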

Summary and Comparison

The following table compares the initial node pool and the EPYC node pool, focusing on the cost and CPU usage for achieving the required KPI.

Node pool             CPU usage for the stress test   Cost
Initial (Intel N2)    0.9 CPUs                        baseline
EPYC (AMD C2D)        0.28 CPUs                       ~10% higher

While the initial setup required 0.9 CPUs for the stress test, the EPYC setup reduced CPU usage to 0.28 CPUs. A 3.1-times improvement in throughput came at a 10% price increase.

The engineering team's stress-testing setup could not fully load the newly deployed Redis cluster. As a result, the next step is to expand the testing capabilities so the system can be stress tested under higher loads. Stay tuned for our next article, where we will delve into the design and execution of these enhanced stress tests, aiming to push the system to new limits and derive valuable learnings for further cost and performance optimizations.

Appendix 1

Potential Redis Limitations

While the new setup is far from fully loaded at the required throughput KPI, it is essential to acknowledge the current limitations and possible improvements.

Adding more replicas would undoubtedly increase the overall capacity of the Redis cluster. However, it wouldn't inherently improve the crucial synchronization between the primary and its replicas; if anything, more replicas could exacerbate the issue. The primary node, tasked with maintaining data consistency across an increased number of nodes, would be subjected to an even greater workload, potentially leading to further performance degradation. Furthermore, given the "replica-serve-stale-data no" configuration, all reads would still default to the primary node during synchronization loss, leaving the additional replicas idle while the primary continues to bear the load.

Similarly, while adding more CPUs may seem like an obvious solution, it's essential to consider Redis's operational nature. Redis executes commands in a single-threaded mode, primarily utilizing a single CPU core for processing. Consequently, no matter how many CPUs are available, Redis cannot execute more operations concurrently: it handles thousands of concurrent connections but processes their requests sequentially, one after the other. Therefore, adding CPUs would not significantly improve the synchronization issue, and the extra CPU resources would largely remain underutilized.

Possible solutions to overcome these limitations include:

Transitioning to KeyDB

KeyDB, a multithreaded fork of Redis, has been gaining attention due to its superior performance and added features while maintaining compatibility as a drop-in replacement for Redis. The engineering team conducted an in-depth comparison between KeyDB and Redis, closely evaluating key performance metrics such as latency, throughput, and CPU usage under different load conditions. While the results were promising, with KeyDB displaying potential for performance uplift, it was also observed that KeyDB used more CPUs than Redis under similar load conditions. This indicated a particular inefficiency in KeyDB's resource utilization. However, given KeyDB's promising performance under high load conditions, they remain open to transitioning to KeyDB once its efficiency at lower loads has been optimized.

Implementing Redis Sharding

Cluster Sharding

This approach involves establishing multiple Redis clusters and implementing sharding at the application level. However, this strategy would introduce complications. Incorporating sharding at the application level would require significant modifications to the existing application code, adding a layer of complexity to the data management processes. From a cost perspective, this solution would considerably increase baseline costs, as they would now have to maintain not just one but multiple separate Redis clusters.
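As a rough sketch of what application-level sharding could look like, the snippet below routes keys across several independent Redis clusters by hashing. The endpoints are hypothetical, and a production version would need consistent hashing, resharding, and migration tooling rather than this simple modulo placement.

```python
import zlib

import redis

# Hypothetical endpoints, one per independent Redis cluster.
SHARDS = [
    redis.Redis(host="redis-shard-0", port=6379),
    redis.Redis(host="redis-shard-1", port=6379),
    redis.Redis(host="redis-shard-2", port=6379),
]

def shard_for(key: str) -> redis.Redis:
    """Pick the shard that owns a key via a stable hash of the key."""
    return SHARDS[zlib.crc32(key.encode()) % len(SHARDS)]

# Every read and write must now go through the routing layer.
shard_for("bonus:12345").set("bonus:12345", "payload")
print(shard_for("bonus:12345").get("bonus:12345"))
```

Every call site in the application would have to pass through such a routing layer, which is precisely the added code complexity, and the extra per-cluster baseline cost, that made the team cautious about this option.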

Want to learn more about DraftKings’ global Engineering team and culture? Check out our Engineer Spotlights and current openings!
