Building Resilience Around Uncharted Technologies

Evan Resnick
DraftKings Engineering
Oct 4, 2021

Over the last five years I have worked across several engineering teams at DraftKings. I started on the Daily Fantasy Sports product, later moved to a Marketing Platform team, and most recently joined the Architecture team. A few months ago I was offered the opportunity to lead a backend team in planning and launching a new product called DraftKings Marketplace.

In those earlier roles I became intimately familiar with designing software to handle errors from dependent systems. On the Architecture team I spent a lot of time building resilience libraries for calling common resources, including databases, caches, and web clients. Engineers across the company could include our libraries in their microservices and automatically get timeouts, circuit breaking, retries, and metrics logging. This experience set me up for success in my new role on the Marketplace Engineering team.

My work on Marketplace began when DraftKings decided to introduce blockchain technologies into our existing infrastructure. We were challenged to release a minimum viable product in just three months. This included primary sales through DraftKings, a secondary market where users could resell assets, and profiles for viewing purchased assets. Our team faced many challenges along the way and overcame them to release a high-quality product on time.

DraftKings Marketplace Primary Sales Page. Also Known as “Drops”.

When our team began, we were not yet blockchain experts and we encountered several unknowns. We considered several blockchain technologies before landing on the one we felt best fit our long-term needs. We chose to develop on Polygon rather than Ethereum because of its faster processing time and lower fees. Polygon offers an Ethereum-based layer-2 network built to handle scaling issues, with a bridge to Ethereum. This allows assets to be transferred easily between the two networks.

At DraftKings, our backend microservices are typically written in C#. Our research led us to an open-source C# Ethereum library called Nethereum. This allowed us to write a backend service that integrates with the Polygon blockchain.

To ensure the success of our Marketplace, we brainstormed which failures might occur when developing against the blockchain, how to handle unexpected failures, and how to prevent cascading errors from toppling our system.

Resolving Expected Failures

As a starting point, I delved into the Nethereum library code and determined what types of exceptions could occur from calling various methods. This gave me a solid foundation for designing specific error handling for our blockchain service.

Retry, Retry Again

One common issue we encountered was timeouts when sending a transaction to the blockchain. Sometimes a network slowdown in Polygon caused the request to time out. Timeouts also happened when the Polygon client calls made through the Nethereum library were rate limited. The client code for sending a transaction timed out after 20 seconds. This issue lent itself well to automatic retrying. In this scenario, we assumed the transaction did not run and retried until it succeeded. To cover the case where the transaction did run and the client still timed out, we added an extra layer of protection. Each Polygon transaction uses an auto-incrementing number called a nonce, which prevents the same transaction from being run multiple times. When retrying a transaction we attached the same nonce to ensure Polygon ran the transaction only once.

DraftKings Backend Blockchain Service Retrying a Timed Out Transaction.
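To make the nonce idea concrete, here is a minimal Python sketch. It is not the DraftKings service code (which is written in C# against Nethereum); `FakeChain`, `SendTimeout`, and `send_with_retry` are hypothetical names invented for illustration, and the toy chain simply returns a receipt on a repeated nonce, whereas a real node would report that the nonce has already been used.

```python
from dataclasses import dataclass


class SendTimeout(Exception):
    """Hypothetical error: the client gave up waiting for a response."""


@dataclass(frozen=True)
class Transaction:
    nonce: int
    payload: str


class FakeChain:
    """Toy stand-in for the network: each nonce executes at most once."""

    def __init__(self, lost_responses: int = 0):
        self.executed = {}  # nonce -> payload that actually ran
        self._lost_responses = lost_responses

    def send(self, tx: Transaction) -> str:
        # A nonce that has already run is never executed a second time.
        if tx.nonce not in self.executed:
            self.executed[tx.nonce] = tx.payload
        if self._lost_responses > 0:
            # The transaction ran, but the caller never hears back.
            self._lost_responses -= 1
            raise SendTimeout(f"no response for nonce {tx.nonce}")
        return f"receipt-{tx.nonce}"


def send_with_retry(chain: FakeChain, tx: Transaction, max_attempts: int = 3) -> str:
    """Resend the same transaction (same nonce) after each timeout."""
    for attempt in range(1, max_attempts + 1):
        try:
            return chain.send(tx)
        except SendTimeout:
            if attempt == max_attempts:
                raise  # out of attempts: surface the timeout to the caller
```

Calling `send_with_retry(FakeChain(lost_responses=2), Transaction(nonce=7, payload="mint"))` sends three times, yet the toy chain records a single execution for nonce 7, which is the property the retry relies on.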

Dead End

Retrying is not always a viable strategy for resolving blockchain errors. A transaction can reach the network, execute, and still fail. Once a transaction has executed and failed it cannot be retried; a new transaction needs to be created and run. Running a transaction requires specifying an upper limit on the gas it may use. Gas represents the computational expense of running a transaction, and it is paid for in order to execute the transaction. Early in development, we set the gas limits for our transactions too low, which caused them to fail when they ran. We could not simply retry such a transaction, since the gas limit would still be too low. To resolve this issue, we had to rebuild the transaction and run it with a higher gas limit. Having the flexibility not only to retry the same transaction but also to rebuild a failed one gave us the ability to resolve a diverse set of problems.

DraftKings Backend Blockchain Service Rebuilding a Failed Transaction.
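The difference between retrying and rebuilding can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not our production logic: `execute`, `OutOfGas`, `run_with_rebuild`, and the doubling factor are all hypothetical, and the sketch assumes each rebuilt transaction takes a fresh nonce because the failed one already consumed its own.

```python
from dataclasses import dataclass
from itertools import count


class OutOfGas(Exception):
    """Hypothetical error: the transaction ran and failed on its gas limit."""


@dataclass(frozen=True)
class Transaction:
    nonce: int
    gas_limit: int


def execute(tx: Transaction, gas_needed: int) -> str:
    """Toy executor: fails whenever the limit is below what the work needs."""
    if tx.gas_limit < gas_needed:
        raise OutOfGas(f"limit {tx.gas_limit} < needed {gas_needed}")
    return f"ok nonce={tx.nonce}"


def run_with_rebuild(first_tx, gas_needed, next_nonce, bump=2, max_rebuilds=3):
    """Run a transaction; on failure build a NEW one with a higher gas limit."""
    tx = first_tx
    for rebuilds in range(max_rebuilds + 1):
        try:
            return execute(tx, gas_needed), tx
        except OutOfGas:
            if rebuilds == max_rebuilds:
                raise  # give up rather than raising the limit forever
            # Resending the same transaction would fail the same way,
            # so create a replacement with a new nonce and a larger limit.
            tx = Transaction(nonce=next_nonce(), gas_limit=tx.gas_limit * bump)


nonces = count(1)  # nonce 0 is used by the first transaction below
result, final_tx = run_with_rebuild(
    Transaction(nonce=0, gas_limit=21_000),
    gas_needed=60_000,
    next_nonce=lambda: next(nonces),
)
```

In this run the first two transactions fail at limits of 21,000 and 42,000, and the third succeeds at 84,000. Capping the number of rebuilds keeps a genuinely broken transaction from burning fees indefinitely.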

Even though we prepared for common blockchain failures, we knew other errors were likely to pop up after we launched our product.

Dealing with the Unknown

Sometimes understanding what you don’t know can be as valuable as what you do know. In order to prepare for unexpected errors, the Marketplace team spent a lot of time building general error handling and observing failures in our service.

The Cheapskate

A single error is not the end of the world, but a cascade of errors spells disaster. When designing our blockchain service we worked to ensure that a single transaction could not break later transactions. This matters because transactions from a Polygon account run in nonce order: a transaction cannot run until the transactions sent before it complete.

Originally, our batched transactions tried to optimize the gas price by using the average network price at the time they were run. This caused problems when the price we captured was abnormally low. If the network price then stayed above our captured price for a long period, all of our pending transactions would eventually be dropped from the network. Trying to predict the network’s price trends proved far more complex than using one standardized price across all batched transactions. In the end this optimization saved very little in costs and added significant complexity to our service, so we removed it in favor of a flat price.

Dropped Transaction After a Few Hours Pending, Due to a Low Gas Price.
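A deliberately simplified model shows why one underpriced transaction is so costly. The Python sketch below is an assumption-laden illustration: the `runnable` helper, the prices, and the rule "a transaction runs when its price meets the current network price" are all invented for the example, and real validators weigh fees in more nuanced ways.

```python
def runnable(pending, network_price):
    """Return the nonces that can run, in order.

    `pending` is a list of (nonce, gas_price) pairs for one account.
    Transactions run strictly in nonce order, so the first underpriced
    one blocks everything queued behind it.
    """
    ready = []
    for nonce, gas_price in sorted(pending):
        if gas_price < network_price:
            break  # this transaction is stuck, and so is every later nonce
        ready.append(nonce)
    return ready


NETWORK_PRICE = 30  # hypothetical current price, in gwei

# Prices captured from a fluctuating average: nonce 1 caught a dip.
averaged = [(0, 30), (1, 12), (2, 35)]

# One flat, comfortably high price for every batched transaction.
FLAT_PRICE = 50
flat = [(nonce, FLAT_PRICE) for nonce in range(3)]

stuck_batch = runnable(averaged, NETWORK_PRICE)
flat_batch = runnable(flat, NETWORK_PRICE)
```

With the averaged prices only nonce 0 runs; nonce 2 is priced well enough but still waits behind nonce 1. With the flat price all three run. The flat price pays somewhat more per transaction in exchange for never stalling the queue.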

Lincoln Logs and So Do We

Our blockchain service always included base exception handling and logging around its core logic. This gave us insight into any unexpected failure. With this information, our team was able to build specific error handling around these cases if we expected them to recur. The Nethereum library threw a general error, RpcClientUnknownException, for all unknown exceptions. By adding specific handling for this exception we were able to break down the different types of unexpected errors. Each logged error also contained important data about the transaction we were sending, for context and for comparison with similar failures. We could also count how often each issue happened, so we could prioritize accordingly. This data was extremely valuable in helping our team understand the overall health of the system.

Log From Unknown Exception Thrown from Nethereum Library
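The pattern of catching one general exception, sorting it into categories, and counting each category can be sketched as follows. This is a Python illustration rather than our C# code: `UnknownRpcError` is a hypothetical stand-in for the library's general exception, and the message patterns, category names, and `send_logged` helper are assumptions chosen for the example.

```python
import logging
from collections import Counter

logger = logging.getLogger("blockchain-service")
error_counts = Counter()  # category -> number of occurrences


class UnknownRpcError(Exception):
    """Hypothetical stand-in for a library's catch-all RPC exception."""


# Illustrative message fragments mapped to categories we want to track.
KNOWN_PATTERNS = [
    ("nonce too low", "nonce_too_low"),
    ("underpriced", "underpriced"),
]


def categorize(message: str) -> str:
    """Map a raw error message onto a category name."""
    lowered = message.lower()
    for fragment, category in KNOWN_PATTERNS:
        if fragment in lowered:
            return category
    return "uncategorized"


def send_logged(send, tx: dict):
    """Call `send(tx)`; on a general RPC error, log context and count it."""
    try:
        return send(tx)
    except UnknownRpcError as exc:
        category = categorize(str(exc))
        error_counts[category] += 1
        # Include transaction data so similar failures can be compared later.
        logger.error(
            "send failed category=%s nonce=%s error=%s",
            category, tx.get("nonce"), exc,
        )
        return None
```

Anything that lands in `uncategorized` often enough is a candidate for its own pattern and, eventually, its own handling.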

Tool Time

When all else failed, we had to get our hands dirty. As part of our blockchain integration work, we built internal tooling to manage our service. This allowed us to manually resolve unexpected production issues that arose when communicating with the blockchain. One error we encountered caused our service to get out of sync with the blockchain. Our system timed out when calling the blockchain and recorded an incorrect transaction hash in our database. This prevented us from recovering the successful transaction when we later retried, because we could not find a matching transaction on Polygon. After locating the correct transaction on the Polygon network, we used our tooling to update our internal records to match the blockchain, which put us back in the correct state. As a follow-up, we investigated the issue and created a longer-term fix.
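A reconciliation step of this kind might look like the Python sketch below. It is a guess at the general shape, not a description of our internal tool: the record fields, the `reconcile` function, the idea of looking up the on-chain hash by nonce, and the "confirmed" status are all assumptions made for illustration.

```python
def reconcile(record: dict, chain_hash_by_nonce: dict):
    """Bring one database record in line with what the chain reports.

    `record` holds the nonce and the hash our service stored.
    `chain_hash_by_nonce` maps nonce -> hash of the transaction that
    actually ran for this account, as found by an operator.
    Returns the (possibly corrected) record and a note for the audit log.
    """
    actual = chain_hash_by_nonce.get(record["nonce"])
    if actual is None:
        return record, "not found on chain, left unchanged"
    if actual == record["tx_hash"]:
        return record, "already in sync"
    # The chain is the source of truth: copy its hash into our record.
    fixed = {**record, "tx_hash": actual, "status": "confirmed"}
    return fixed, f"hash corrected from {record['tx_hash']} to {actual}"


stored = {"nonce": 12, "tx_hash": "0xbad", "status": "pending"}
on_chain = {12: "0xgood"}
fixed, note = reconcile(stored, on_chain)
```

Returning a new record and a note, instead of mutating in place, leaves the operator a before-and-after trail for a manual change to production data.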

In Summary

It is critical to perform due diligence around error cases when designing software. This provides a baseline for handling expected failures. Even with forethought, we cannot predict everything that can go wrong, and that’s OK! When working quickly to integrate with the blockchain we didn’t have time to become experts in the entire domain. I learned that undoing a mistake on the blockchain is far more difficult than holding off and resolving the issue once the dust clears. Having a plan for handling those unknown errors set us up for long-term success. When it comes to external dependencies, it is best to prepare for the worst and hope for the best.
