Case study: Betting history data propagation

Martin Kostov
DraftKings Engineering
Mar 31, 2023 · 10 min read

The 2021 NFL season officially kicked off after months of preparation. The team had been working arduously for the previous six months to make sure everything was ready for the launch, and in the weeks leading up to the start of the season they felt confident that it would go off without a hitch. Despite their efforts, they were met with unexpected challenges.

That confidence was misplaced. One crucial metric had been overlooked: its performance had never been measured, let alone improved. Everything appeared to function normally at first, but after the first batch of games the team discovered just how much this single metric affected the overall experience.

What really happened

A quote from how it started:

Hello, We are still having many … bets with Opened status from the early games. The events have been settled … Can someone check?

The moment the team realized there was an issue, they knew they had missed something crucial. The problem wasn’t how quickly bets were settled after games finished, but how quickly the settled bets were propagated to the Reporting Database. This delay was causing customers to see their winnings much later than expected, and in some cases their balances were updated while their betting history still lagged behind. It was a major concern that required immediate attention.

Some nerdy baseline metrics

These metrics indicate the number of operations and the time required to record them in our Reporting Database.

A basic diagram shows our baseline of 110,000 operations per minute: with 10 million operations to record, it takes more than 90 minutes (10,000,000 / 110,000 ≈ 91 minutes) before all updates are visible to customers. Depending on the number of operations, some customers may have to wait hours to see their updated history.

Steps during the 1st weekend

Once the team realized the impact on the customer experience, the problem was easy to identify from the monitoring metrics they had previously considered unimportant for the start of the season. The propagation of updates to the Reporting Db, which stores customer bets to serve betting history requests, was slow. This “Hot” database could only handle 110,000 updates per minute. With many small games finishing at different times, that had never been a problem; during the peak hours of the NFL games, it caused significant issues.

The reporting system is simple: a Reporting Db Writer listens for settlements on the Kafka Updates Stream and writes them to the Reporting Db.
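The article doesn’t include the writer’s code; as a minimal sketch of that shape, assuming a Kafka consumer that writes each settlement to MongoDB one document at a time (the topic, connection strings, and field names below are hypothetical):

```python
# Minimal sketch of the baseline Reporting Db Writer: consume one settlement
# update at a time and write it straight to MongoDB. All names are hypothetical.
import json

from confluent_kafka import Consumer
from pymongo import MongoClient

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "reporting-db-writer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["bet-updates"])

bets = MongoClient("mongodb://reporting-db:27017")["reporting"]["bets"]

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    update = json.loads(msg.value())
    # One write round trip per settlement update; at the baseline this
    # per-document path handled roughly 110,000 operations per minute.
    bets.replace_one({"_id": update["betId"]}, update, upsert=True)
```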

Why not scale the component that writes

The team’s initial plan was to scale the component responsible for writing to the Reporting Db, but they ran into a technical limitation. The Updates Stream, the Kafka topic that feeds updates to the Reporting service, already had one consumer instance running per partition, so additional instances would simply sit idle. Adding more partitions or making code changes during high-load periods was not a viable option, so the team had to focus on finding a solution for the following weekend.

Preparation for 2nd Weekend

It was the first working day after the games ended, and time for the team to define and execute their plan. They had to work within a hard constraint: only 5 working days until the next weekend. The team already had a draft from the incident, so they started from there and prepared an optimization plan.

Option 1. Add Kafka partitions & scale up the Db writer

Option 1 was just regular scaling. It was the first action item, mentioned even during the games, but executing it without understanding its limitations was not worth the risk. The team added partitions to the topic and scaled out the Reporting Db Writer. They saw an improvement, but it was not linear, and beyond a certain number of instances the Reporting Database itself became the bottleneck. The team was now limited by the Reporting Database’s IOPS; even though IOPS utilization had been only around 50% during the games, they were not able to double their throughput.
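Adding partitions to the existing topic required no changes to the writer’s code; as an illustration only, with a hypothetical topic name and partition count, the topic can be grown through Kafka’s admin API:

```python
# Sketch of Option 1's infrastructure side: raise the topic's partition count
# so more Reporting Db Writer instances can consume in parallel.
# Topic name and counts are hypothetical.
from confluent_kafka.admin import AdminClient, NewPartitions

admin = AdminClient({"bootstrap.servers": "kafka:9092"})

# Grow the topic to 24 partitions; existing messages stay where they are,
# only new messages land on the added partitions.
futures = admin.create_partitions([NewPartitions("bet-updates", 24)])
for topic, future in futures.items():
    future.result()  # raises if the broker rejected the change
    print(f"{topic} now has 24 partitions")
```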

Pros

  • Fast time to market, doable for Weekend 2.
  • Minimal risk as there were no code changes.

Cons

  • Performance improved by only 72%, a clear example of non-linear scaling.
  • Doubled cost, because the number of instances was doubled.

Option 2. Batching in the Writer

The team was already reading from a stream, so they decided to batch their writes to storage. The solution was to introduce pipeline processing (e.g. TPL Dataflow) that batches writes, along with additional changes to the database queries. The most challenging aspect was managing the Kafka offset commits properly: the execution order could no longer be guaranteed, and missing a message could leave a gap in a customer’s history. Initially, the team thought they could get away with one partition per application instance, so the offset commit logic could stay simple; but with only one assigned partition the batches shrank, and performance was comparable to simply adding more instances.
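The article mentions TPL Dataflow, which suggests a .NET writer; purely to illustrate the batching-plus-manual-commit idea, here is a sketch in Python, with hypothetical topic, collection, and batch-size values:

```python
# Sketch of Option 2: batch settlement updates from Kafka and flush them to
# MongoDB in a single bulk write, committing offsets only after the flush
# succeeds so no update is silently lost. Names and sizes are hypothetical.
import json

from confluent_kafka import Consumer
from pymongo import MongoClient, ReplaceOne

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "reporting-db-writer",
    "enable.auto.commit": False,   # commit manually, only after a successful flush
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["bet-updates"])

bets = MongoClient("mongodb://reporting-db:27017")["reporting"]["bets"]

BATCH_SIZE = 500

while True:
    msgs = consumer.consume(num_messages=BATCH_SIZE, timeout=1.0)
    if not msgs:
        continue
    ops = []
    for msg in msgs:
        if msg.error():
            continue
        update = json.loads(msg.value())
        ops.append(ReplaceOne({"_id": update["betId"]}, update, upsert=True))
    if ops:
        # ordered=False lets MongoDB apply the batch without stopping on the
        # first failed document.
        bets.bulk_write(ops, ordered=False)
    # Only after the batch is durably written do the group's offsets move forward.
    consumer.commit(asynchronous=False)
```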

Pros

  • The number of instances was reduced by 50%, cutting the overall cost in half compared to the baseline from before the first improvement.
  • Performance improved 2.7 times.

Cons

  • It was highly unlikely that they would be able to deliver for Weekend 2.
  • The solution was risky: a change in the commit strategy could lead to missed updates.

Option 3. Introducing CDC

With the introduction of batching, performance improved 2.7 times, but for 10 million operations the wait time was still over 33 minutes. The team realized they would need another order-of-magnitude improvement to effectively resolve the issue. They shifted their approach and decided to implement Change Data Capture (CDC). CDC involves monitoring changes in a database and making those changes available in a format other systems can consume, such as a stream of events.
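The article doesn’t name the CDC tool or the source database. Purely as an illustration of the idea, assuming a MongoDB source, a change stream could be tailed and each change republished as an event (all names below are hypothetical):

```python
# Illustration of the CDC idea only - the article does not name the team's
# source database or CDC tooling. Assuming a MongoDB source, its change stream
# can be tailed and each change republished to Kafka. All names are hypothetical.
import json

from confluent_kafka import Producer
from pymongo import MongoClient

source = MongoClient("mongodb://bets-db:27017")["betting"]["bets"]
producer = Producer({"bootstrap.servers": "kafka:9092"})

# full_document="updateLookup" attaches the whole post-change document, so the
# downstream sink can upsert it without re-reading the source database.
with source.watch(full_document="updateLookup") as stream:
    for change in stream:
        doc = change.get("fullDocument")
        if doc is None:
            continue  # e.g. deletes carry no full document in this mode
        producer.produce(
            "bets-cdc",
            key=str(doc["_id"]),
            value=json.dumps(doc, default=str),
        )
        producer.poll(0)  # serve delivery callbacks without blocking
```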

The team had to weigh the pros and cons of this new solution carefully. On one hand, CDC had the potential to greatly improve the performance of the system. On the other hand, it came with uncertainties and risks, such as the time required for the initial data transfer and the reliability of the new tooling. The team had to consider all of these factors before deciding on a course of action. The time needed to transfer billions of historical records to Reporting Db2 was unknown at that point. Their best guess for the initial transfer: assuming roughly 1 billion records and a 10x throughput improvement (about 1.1 million operations per minute), it would take more than 15 hours of constant writing, and if something went wrong, in the worst case they would have to start from scratch.

Pros

  • Cost was reduced 10 times compared to the baseline, excluding the period when they would run the 2 solutions in parallel.
  • Could run in parallel with the existing pipeline and be rolled out to a small group of test customers initially.
  • Easy rollback.
  • Performance improved 8 times compared to the baseline.

Cons

  • Realistic delivery was for Weekend 3 or 4.
  • New replication tooling for the organization.
  • A new database cluster, Reporting Db2, was required in order to have a fast and easy rollback strategy.
  • Unknown time to populate the new Reporting Db2.
  • Double Db cost for a short period of time.

Option 4. Sharding

When dealing with vast amounts of data, sharding became a necessary step. Previous PoCs had demonstrated that the next bottleneck was the Reporting Database, a MongoDB replica set. Sharding was possible, but it would require additional modifications to the Reporting Db Writer.
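As a rough sketch of what that setup could look like in MongoDB (the database, collection, and shard key below are hypothetical, and choosing the key well is exactly the risk listed next):

```python
# Sketch of enabling sharding for the reporting data. Database, collection,
# and shard-key names are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://reporting-mongos:27017")  # connect through mongos

client.admin.command("enableSharding", "reporting")

# A hashed key on the account id spreads writes evenly across shards while
# keeping one customer's betting history on a single shard for reads.
client.admin.command(
    "shardCollection",
    "reporting.bets",
    key={"accountId": "hashed"},
)
```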

The drawbacks of this approach:

  • Further changes to components that had already undergone alterations.
  • It couples the solution to MongoDB.
  • Poor choice of sharding key could lead to degraded performance due to slow read operations or uneven data distribution.

It became evident that this option was not feasible for the upcoming second weekend, and there was uncertainty if it could be completed by the third weekend.

Weekend 2 Summary

In an effort to stabilize the product and minimize risk, the decision was made to only add more partitions and more instances of the Reporting Db Writer for the second weekend (Option 1). This doubled cloud costs but yielded only a 72% performance improvement, which was not enough.

Preparation for 3rd Weekend

As preparations were made for the third weekend, the team considered scaling the Reporting Db with faster disks, since the bottleneck was not CPU or memory. The option of dedicated NVMe disks plus additional instances of the Reporting Db Writer was investigated, but it proved prohibitively expensive, at more than 100 times the current cost of the database.

It was clear that simply spending more was not the solution; the setup was still too slow. A proof of concept for batching writes to the Reporting Db had been built during preparation for the previous weekend, and performance tests had been run. However, several issues were encountered. The biggest challenge was the requirement for a stateful application, which the team wanted to avoid in the ETL process, along with the need to change queries to support differential updates. In the end, it was decided to continue with Option 3 instead of pursuing this approach.

Option 3. Introducing CDC — Implementation

This option involved little real implementation; instead of writing code, the team mostly configured an existing open-source solution, together with its metrics and alerting. The transformation and filtering of data was the most challenging part of the process.

Our main tasks included:

  • Conducting a short resilience validation.
  • Setting up metrics, alerting, and logging.
  • Transforming data; each individual mapping was trivial and not very interesting (see the sketch after this list).
  • Implementing retention.
  • Validating performance.
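As a purely hypothetical illustration of the transformation and filtering step, mapping a raw change event into the shape the betting history view needs (all field names are invented for the example):

```python
# Hypothetical illustration of the transformation/filtering step: project a raw
# change event down to the fields the betting-history view actually needs.
# Every field name here is invented for the example.
from typing import Optional


def to_history_document(change_event: dict) -> Optional[dict]:
    doc = change_event.get("fullDocument")
    if doc is None:
        return None  # filter out events that carry nothing to report
    return {
        "_id": doc["betId"],
        "accountId": doc["accountId"],
        "status": doc["status"],          # e.g. Opened / Settled
        "stake": doc["stake"],
        "payout": doc.get("payout"),
        "settledAt": doc.get("settledAt"),
    }
```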

With this solution in place, we needed to run multiple types of tests. Previously we had only worked with snapshots, so all tests had been performed with full betting history objects. Now differential updates were supported, but in reality not all updates would be differential; initial inserts would still be full messages.

In the image above, the green series shows the combined expected performance of differential and full updates, based on their estimated volumes.

Data Ingestion

The team faced a significant challenge during the initial data ingestion phase of the Change Data Capture (CDC) implementation: transferring hundreds of terabytes of data from one database to another through Kafka. While Kafka has the storage capacity to handle this amount of data, doing so would be cost-prohibitive. The team had to publish data slowly enough for the sink writing to the target Reporting Db2 to keep up, yet fast enough to complete the process within a reasonable time frame.
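As a sketch of that balancing act, assuming a simple rate-limited publisher reading from the source collection (the names, rates, and batch sizes below are hypothetical):

```python
# Sketch of the initial ingestion: publish historical records fast enough to
# finish in reasonable time, but slowly enough that the sink writing to
# Reporting Db2 keeps up. All names, rates, and sizes are hypothetical.
import json
import time

from confluent_kafka import Producer
from pymongo import MongoClient

source = MongoClient("mongodb://bets-db:27017")["betting"]["bets"]
producer = Producer({"bootstrap.servers": "kafka:9092"})

TARGET_PER_SECOND = 20_000   # tune against the sink's observed lag

sent = 0
window_start = time.monotonic()

for doc in source.find({}, batch_size=1_000):
    producer.produce("bets-cdc", key=str(doc["_id"]), value=json.dumps(doc, default=str))
    producer.poll(0)
    sent += 1
    if sent % TARGET_PER_SECOND == 0:
        # Crude rate limit: never publish more than TARGET_PER_SECOND per second.
        elapsed = time.monotonic() - window_start
        if elapsed < 1.0:
            time.sleep(1.0 - elapsed)
        window_start = time.monotonic()

producer.flush()
```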

Multiple restarts were required along the way, and the team’s optimistic initial estimate of 15 hours turned out to be off by a factor of two compared to the actual ingestion time.

Summary

The 2021 NFL season had started, and a crucial metric had been overlooked, resulting in slow propagation of settled bets to the Reporting Database. This caused customers to see their winnings later than expected. During the season, the team ran into a technical limitation while scaling the component responsible for writing to the Reporting Database. To solve the problem, they considered two initial options:

  1. Adding Kafka partitions and scaling up the database writer, which improved performance by 72% but doubled the cost.
  2. Batching in the writer, which reduced the number of instances by 50%, halved the overall cost, and improved performance 2.7 times over the baseline.

A third improvement, a switch to a CDC approach for populating the Reporting Database, was planned but ran into problems with data ingestion. The team was unable to deliver it for the third weekend, but was able to do so the following week.

Finally, sharding (Option 4) was considered and proved to be the most effective solution for their case. The performance increase was close to linear per added shard when data was evenly distributed.

Cost

The team was able to reduce their cloud costs for reporting by nearly 9 times while also achieving an 8.2 times performance improvement. The reduced costs do not include the transitional period when both the additional Reporting Db2 and the old Reporting Db Writer were still running; both systems were kept for a period of time to preserve a rollback strategy in case of issues with the new approach.

The chart compares the performance and cost efficiency of the changes to the system. The left axis measures throughput per minute, while the right axis shows cost efficiency relative to the baseline, giving a visual picture of the relationship between performance and cost efficiency at each improvement. The gains in both, achieved over roughly 4 weeks, were impressive.

Want to learn more about DraftKings’ global Engineering team and culture? Check out our Engineer Spotlights and current openings!
