Resiliency in Feeds Integration

Nicola Atorino
DraftKings Engineering
21 min read · Jan 6, 2023

The DraftKings Sportsbook product is a state-of-the-art software suite that offers a wide range of sport events so players can enjoy endless gaming possibilities. To provide such an experience, different teams are responsible for generating or integrating different kinds of content. While DraftKings sport content is a composition of internal intelligence, content creation, and external content consumption, the Feeds Team deals mainly with the ingestion and processing of external content into the core sports domains, which includes integration with 3rd parties.

The Context

Now, one of the most important characteristics of these integrations is stability. We need to react quickly to problems and return to a functioning state as soon as possible after the disruption has passed, regardless of whether or not DraftKings is its source. We also need to understand whether the disruption has the potential to cause incidents, or whether it is possible to continue processing with a graceful degradation of some functionalities. While some standards exist, every 3rd party provider has its own integration mechanism and its own structure. Some work mainly via HTTP calls, others use AMQP protocols or SSH connections. There is a wide variety of cases and, generally, a new solution is required every time.

We can split the integrations into two types: request-response style (think about a web API) and producer-consumer style (think about an application consuming a queue, processing the data, and publishing results to another queue). This is an important characteristic of the integration, as it dictates how it should react in case of issues. A request/response integration may need to stop accepting new requests, while a consumer/producer style of application may require stopping the consumption of new messages, or delaying publishing. There are many cases that could require different approaches, but what they have in common is the necessity to quickly know when, and if, work can be resumed safely.

Because of this, any kind of health monitor system has to be:

  • Self-healing: even if we are alerted immediately, relying on manual intervention is simply too slow when live data is involved, especially if the problem happens during the night. A problematic system needs to be capable of automatically restoring its functionality as soon as possible.
  • Flexible: different providers have different necessities and different ways of confirming that they are working fine. In some cases, we need to get creative and find a way ourselves as the provider does not offer clear capabilities.
  • Generalized: the expected result is always the same, so finding a common way of approaching the issue ensures that implementing the feature for a new provider won’t require reinventing the wheel every time.
  • Capable of real-time reporting: there is a huge advantage in providing real-time visibility over ongoing issues and degradation of services for each provider involved in potential incidents.

Example: SSH-based provider

Let’s think about a fictional provider of LIVE sport content. This company provides:

  • offering information: which events they have available in their system, for example football matches, or motorbike races
  • pricing information: the probabilities of something happening in the events they offer, for example a certain player scoring, or a certain racer winning, etc.

This provider offers the live pricing information via an SSH feed, to which only one connection at any time is allowed as per a technical requirement on their side. Their offering information, instead, is provided via an HTTP API that they expose. The pricing information must be available all the time because the data is time-sensitive: when the games are live, odds change in a matter of seconds. The offering data, instead, changes only once every week, when the provider decides to prepare the new set of events that they plan to offer. On top of this, the DraftKings application integrating the provider is required to get all this data, update a SQL Server DB when some conditions are met (for example, a new event is discovered for the first time), and, after some internal processing, publish it as a stream via Kafka, available for consumption further along in the pipeline.

This system has at least 5 external dependencies that may affect its resiliency:

  • The SSH connection to the 3rd party provider. This is fundamental and should stay healthy all the time, otherwise we risk a potential incident
  • The HTTP endpoints of the 3rd party provider. This is less sensitive and our integration could survive some unavailability, because games are usually offered several days before they are needed
  • The DraftKings SQL Server Database. Based on the requirements, this may or may not be sensitive. For the sake of this example, let’s say it is accepted for this component to not be updated immediately, because offering discovery usually happens several days before the start of the game
  • The DraftKings Kafka cluster. Again, this is fundamental as it’s the main source of information for pricing that other components rely on. If this component is not available, or not reachable, our integration won’t work and we can risk a potential incident immediately
  • The DraftKings Vault secret storage. This is needed because DraftKings uses a dedicated solution for storing secrets (Vault by HashiCorp), and these secrets in our case are the credentials required to connect to DraftKings SQL Server

So we can already split our dependencies into 2 groups: dependencies that when missing will bring the integration into an unhealthy state, and dependencies that when missing will bring the system into a degraded state.

The Solution

Resiliency Architecture

The vast majority of our integrations are written in .NET Core and deployed in Linux-based containers, orchestrated by a Kubernetes cluster. On top of this, they are registered in a service discovery system based on Consul (by HashiCorp). By taking advantage of the flexibility and power of the Kubernetes orchestrator, two different architectures can be implemented to increase our resiliency: an active-active approach, where the processing is split evenly between several instances of the same integration, and an active-passive approach (implemented with a leader election system), where one instance is actively processing the data and a second one is ready to take over in case of any trouble with the first. The one to use depends on the integration requirements, the expected load of the system, and, generally, the balance between the complexity of the integration and its resiliency capabilities. It is important to strike a balance between near-zero recovery time and a system that has less strict requirements but is simpler to develop and maintain, while still fitting into the desired SLAs.

Regardless of the approach chosen, it is important to understand when an instance, active or passive, is healthy, where healthy means ‘capable of processing data and covering all expected functionalities’.

Example: architectural resiliency

In our SSH-based provider we can already notice how, at least for the processing of the pricing, we won’t be able to have more than one instance active because of the SSH limitation. However, having a single instance deployed is problematic. A restart (due to a crash, a deployment, k8s cluster maintenance, or k8s node compaction) could take minutes, and this kind of downtime is not acceptable if it happens during a live game, as it could generate a potential incident. The simplest option seems to be an active-passive approach. In case of a restart, another instance already available and ready to take over will reduce our potential downtime from minutes to a few seconds, completely automatically and without human intervention. A big improvement.

Health Monitoring System

An integration can be considered healthy when:

  • The instances themselves are up and running
  • All the external dependencies of the instances are also up and running. The number, type, and level of importance of the dependencies vary with the integration

Health checks

Both Consul and Kubernetes use a similar way of discovering integration health. They constantly ping specific HTTP endpoints that return health info about the application, and together they define the overall status of the application. There are 2 different kinds of health checks we have to discuss: the liveness health check and the readiness health check.

The liveness health check is interrogated by Kubernetes’ liveness probe and will trigger a restart of the container if unhealthy. It lets the system know if a container is deployed and if the instance running in it is operational. These checks should return basic health information for the service, together with the status of any dependency for which we know a restart is required in case of trouble.

The readiness health check is interrogated by Kubernetes’ readiness probe and by Consul. Consul uses this health check for service discovery and leader election purposes. It lets the system know if the application is not only running, but is also ready to process information and accept traffic. An instance can be alive but not ready, such as when it has a long startup routine that requires the application to build up its own internal state before being able to properly process requests. These checks are built around every registered dependency whose faulty or unavailable status would make our application incapable of processing business requests (of any kind). This is where we should check all dependencies that our service can reconnect to without a restart, like RabbitMQ, Kafka, SQL Server, external HTTP services, and so on.
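
The library that backs this is internal, but the liveness/readiness split can be illustrated with the standard ASP.NET Core health check primitives. In the sketch below, VaultHealthCheck, KafkaHealthCheck, and SqlServerHealthCheck are hypothetical IHealthCheck implementations used only for illustration.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddHealthChecks()
            // Dependencies that need a container restart belong to the liveness check.
            .AddCheck<VaultHealthCheck>("vault", tags: new[] { "liveness" })
            // Dependencies we can reconnect to without a restart belong to the readiness check.
            .AddCheck<KafkaHealthCheck>("kafka", tags: new[] { "readiness" })
            .AddCheck<SqlServerHealthCheck>("sql-server", tags: new[] { "readiness" });
    }

    public void Configure(IApplicationBuilder app)
    {
        app.UseRouting();
        app.UseEndpoints(endpoints =>
        {
            // Target of the Kubernetes liveness probe.
            endpoints.MapHealthChecks("/health-liveness", new HealthCheckOptions
            {
                Predicate = registration => registration.Tags.Contains("liveness")
            });
            // Target of the Kubernetes readiness probe and of the Consul check.
            endpoints.MapHealthChecks("/health-readiness", new HealthCheckOptions
            {
                Predicate = registration => registration.Tags.Contains("readiness")
            });
        });
    }
}
```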

Example: Liveness and Readiness dependencies

Going back to our example integration, our 5 dependencies can be split into 2 groups:

Liveness dependencies:

  • Vault. Given how this system is implemented in our infrastructure (a sidecar container), problems with this dependency will probably require a restart of the Kubernetes pod. The majority of Vault-related issues require re-establishing trust between Vault and the container, which means the startup logic needs to run again.

Readiness dependencies:

  • SSH. There is no need to restart the container if the SSH channel experiences problems. The integration should solve this issue seamlessly, or wait for the 3rd party to recover.
  • HTTP web API: this is a stateless kind of connection, so each request is independent. As soon as the 3rd party accepts requests again, we are good to go.
  • SQL Server. In this case too, if the problem is not with the credentials (for which we check Vault), there is no reason to restart the container.
  • Kafka: the Kafka broker and the client library are smart enough to automatically recover from several kinds of issues without any intervention. Our application can simply check the status of the consumers and, once the disruption is resolved, it just needs to resume consuming messages.

Stopping and Starting Business Logic Based on Health Status

Depending on the importance of the dependency going haywire, an instance should be marked as degraded (working, but not at 100% functionality) or unhealthy, in which case it is often safer to stop any processing. This could prevent pushing forward wrong data, or flooding our logging system with errors. The best option, in this case, is to notify our alerting system that something is not working as expected, to clearly log the nature of the problem, and to put the instance in an unhealthy state, stopping any further data processing. As discussed before, there are 2 kinds of integrations that we are covering right now:

HTTP-Based Request/Response

These kinds of applications work by exposing some endpoints that the 3rd parties call to update DraftKings with their latest information. In this case, if the readiness monitor shows that the instance is unhealthy, the instance is automatically removed from the load balancer pool and won’t be reachable by HTTP traffic. This works out of the box thanks to a very clever load balancing system that uses real-time service discovery information from Consul to decide whether traffic should be routed to an instance or not. With this system, for example, if we have 3 active-active instances all accepting HTTP traffic and one of them suddenly reports unhealthy, the remaining instances will see an increase in traffic but the overall pipeline should be unaffected.

Producer/Consumer

Several of our integrations are not pushing data through HTTP endpoints, but are, instead, receiving information through other channels such as AMQP queues they are subscribed to, or Kafka topics. In these specific cases, we cannot take advantage of the dynamic HTTP load balancing capabilities of the infrastructure. Even if the instance is marked unhealthy in the service discovery mechanism, the instance itself won’t have any knowledge of that out of the box, and won’t automatically stop any subscription or any connection it made on startup. Because of this, the application may keep trying to process data while not having all the necessary dependencies available to do so, often with problematic results. A better, more proactive approach is for that instance to react to its own health state changes, stop any processing, and allow other instances, if available, to pick up where it left off.

The library that the Feeds team implemented exposes specific events that can be used to react to health status changes of any dependency, understand the overall status of the system, and react accordingly. The biggest advantage of this kind of approach is being able to split the health monitor from the business domain of the application. A processing pipeline won’t necessarily need to check each dependency every time it tries to use it, or to implement complex resiliency mechanisms. Relying on the health monitor system will be enough to ensure that everything expected is available, greatly simplifying the implementation of required functionalities. This system will also allow the application to react to the dependencies going back online, and resume the processing automatically, without any manual intervention.
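
The library’s API is internal and not shown in the article, but a rough sketch of the shape such an event-based contract could take is below. All names here (OverallHealthStatus, HealthStatusChangedEventArgs, IHealthMonitorService) are assumptions used only for the examples in this article, not the real library’s types.

```csharp
using System;

// Hypothetical names, reused throughout the examples in this article.
public enum OverallHealthStatus { Healthy, Degraded, Unhealthy }

public sealed class HealthStatusChangedEventArgs : EventArgs
{
    public OverallHealthStatus Previous { get; init; }
    public OverallHealthStatus Current { get; init; }
}

public interface IHealthMonitorService
{
    // Latest aggregated readiness status, refreshed by a background collection loop.
    OverallHealthStatus CurrentReadiness { get; }

    // Raised whenever the aggregated readiness status changes.
    event EventHandler<HealthStatusChangedEventArgs> ReadinessChanged;
}
```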

Example: stop SSH processing

Going back to our example integration, the application is an active-passive, consumer/producer application. This means that in order to improve its resiliency we need to actively stop the SSH consumption and allow another healthy instance to pick up the work.

For example, we could have a network partition where only one of our instances is able to reach the Kafka cluster. In this case, it makes sense for the active instance to lose its leadership status, stop consuming from the SSH channel, free up the only connection available, and allow another healthy instance to gain leadership and resume processing the data.

The Feeds team has a separate library for leader election that integrates seamlessly with the health monitor system and takes advantage of Consul to allow dynamic and automatic leader switching based solely on the health status information. Of course, if all instances are unhealthy, consumption is impossible and an alert is sent to our monitoring channels for further management. More on this later in the article.

Core & Extensions

The idea behind the whole implementation is for the core logic of the monitoring module to be agnostic of any dependency, and instead accept a generic contract that can then be fulfilled by dependency-specific implementations. These implementations are called ‘monitors’. Without monitors, both the liveness and readiness logic will always return healthy, because there is no dependency that could fail.

Then, both checks can be easily extended by adding dedicated monitors based on what the application relies upon (for example, a Kafka monitor that would check exclusively whether connectivity to the Kafka cluster is established at all times). Each monitor also allows configuration of the status that should be exposed in case of issues: for example, in some cases losing Redis means the application won’t be able to process data anymore, while in other cases losing Redis won’t be a blocker and the system can continue with degraded functionalities. The integration may want to restart if it cannot connect to Kafka, may want to stop everything but stay alive and keep trying to reconnect, or may want to degrade gracefully and simply go into a degraded status until the problem is resolved. All of this is configurable without touching the infrastructure configuration (Consul, K8s, load balancing, etc.). This is usually enough to control the behavior of consumer/producer style applications whose logic we don’t want to execute on an instance that has issues with some of its dependencies.
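
As a rough illustration, a monitor contract only needs to expose the dependency’s availability and how a failure should affect the instance. The interface below is hypothetical (the internal library’s actual contract may differ) and reuses the OverallHealthStatus enum sketched earlier.

```csharp
using System.Threading;
using System.Threading.Tasks;

// Hypothetical monitor contract; the real internal interface may differ.
public interface IDependencyMonitor
{
    string Name { get; }

    // Configurable per monitor: whether a failing dependency should mark
    // the instance as Degraded or Unhealthy.
    OverallHealthStatus StatusWhenFailing { get; }

    // Returns true when the dependency is reachable and working.
    Task<bool> IsAvailableAsync(CancellationToken cancellationToken);
}
```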

Also, if the application is not request/response style and does not need to expose the health checks via HTTP, we won’t require any integration with the .NET Core middleware-based HTTP health check system. For this reason, this extension is optional and is exposed as a separate package.

[Image: the overall diagram of the resiliency system implementation]

Implementation

The implementation is simple: a background service is configured to routinely collect the status of both liveness and readiness monitors, and keep the aggregated status of the system in memory. A monitor is a piece of logic dedicated to returning the status of a specific dependency: we will have a Kafka monitor, a Redis monitor, a SQL Server monitor, and so on. The system allows us to define a list of monitors and configure them to decide whether they describe the liveness or readiness status, and whether they should mark the instance as degraded or unhealthy in case of trouble, together with any other dependency-specific configuration. There are two separate collections: one for readiness monitors and one for liveness monitors. The overall status of a monitor collection defines the status of that specific health check. It is worth mentioning that the same monitor can be configured as a liveness or readiness monitor based on the application use case.

Overall status

The logic is as follows:

  • At least one unhealthy monitor — overall health check is unhealthy
  • Otherwise, at least one monitor degraded — overall health check is degraded
  • Otherwise, overall health check healthy

This runs for both monitor collections and a separate result is stored for each of them. In addition, an event is raised every time the status of the readiness monitors changes, to allow the business logic to start or stop accordingly. When a health check endpoint is requested in the .NET host, an extension is attached that simply returns the latest aggregated status of the corresponding monitors.
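
The aggregation rule itself is simple enough to sketch directly, reusing the hypothetical OverallHealthStatus enum from earlier:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class HealthStatusAggregator
{
    // Worst status wins: any Unhealthy monitor makes the check Unhealthy,
    // otherwise any Degraded monitor makes it Degraded, otherwise it is Healthy.
    public static OverallHealthStatus Aggregate(IEnumerable<OverallHealthStatus> monitorStatuses)
    {
        var statuses = monitorStatuses.ToList();
        if (statuses.Contains(OverallHealthStatus.Unhealthy)) return OverallHealthStatus.Unhealthy;
        if (statuses.Contains(OverallHealthStatus.Degraded)) return OverallHealthStatus.Degraded;
        return OverallHealthStatus.Healthy;
    }
}
```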

Timeout Policy

Some monitors may require some time to understand whether a dependency is available or not, and to avoid status flickering in case of a brief interruption of the connection. For this reason, and also to prevent the system from getting stuck if any of the monitors hang for any reason (including bugs in the monitor implementation), there is a timeout policy in place in the overall monitor status collection system. The timeout period is configurable at the application level and, when triggered, prevents the entire system from waiting for a response from a monitor that will never come; instead, the overall status is shown as unhealthy, allowing the application and the overall infrastructure to react accordingly.
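
A minimal sketch of how such a per-monitor timeout could be applied during collection, using Task.WhenAny and the hypothetical contract from the previous sections (the real library may implement this differently):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class MonitorTimeoutPolicy
{
    public static async Task<OverallHealthStatus> CheckWithTimeoutAsync(
        IDependencyMonitor monitor, TimeSpan timeout, CancellationToken cancellationToken)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        var checkTask = monitor.IsAvailableAsync(cts.Token);

        // If the monitor hangs, the delay wins and the result is treated as Unhealthy
        // instead of blocking the whole collection cycle.
        var completed = await Task.WhenAny(checkTask, Task.Delay(timeout, cts.Token));
        if (completed != checkTask)
        {
            cts.Cancel(); // ask the hanging monitor to abort
            return OverallHealthStatus.Unhealthy;
        }

        cts.Cancel(); // stop the timeout delay
        try
        {
            return await checkTask ? OverallHealthStatus.Healthy : monitor.StatusWhenFailing;
        }
        catch (Exception)
        {
            // A throwing monitor is treated the same as a failing one.
            return monitor.StatusWhenFailing;
        }
    }
}
```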

Extensibility of the Solution

The monitors are exposed as extension packages so the core does not have unnecessary dependencies. New monitors can be added if needed, and custom monitors can be implemented for specific applications that require an ad-hoc solution. The integration with the .NET Core health check system is exposed as a separate package for the same reason.

Example: SSH monitor

Going back to our example, it is very possible that the monitor used for checking the status of the SSH connection is very application-specific. Because of this, it makes more sense to implement it as a custom monitor directly in the application code instead of creating a new extension. This allows us to reduce the time-to-market and to avoid overengineering the solution, while still keeping this logic separated from the main flow. In the future, should centralization be needed, that implementation can be extracted from the application and generalized to accommodate more generic requirements.
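
For illustration, such an application-level monitor could look like the sketch below, which assumes SSH.NET’s SshClient and the hypothetical IDependencyMonitor contract from earlier; the production implementation and its reconnection logic are not shown here.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Renci.SshNet;

// Application-specific monitor: lives in the integration's own codebase
// rather than in a shared extension package.
public sealed class SshFeedMonitor : IDependencyMonitor
{
    private readonly SshClient _client;

    public SshFeedMonitor(SshClient client) => _client = client;

    public string Name => "ssh-feed";

    // Losing the SSH feed means no live pricing, so treat it as Unhealthy.
    public OverallHealthStatus StatusWhenFailing => OverallHealthStatus.Unhealthy;

    public Task<bool> IsAvailableAsync(CancellationToken cancellationToken)
        => Task.FromResult(_client != null && _client.IsConnected);
}
```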

Reacting to Health Status Changes

Active/Active: Splitting the Work and Dynamic Rebalancing of Traffic

When the requirements allow it, an active/active approach is preferable, as it also helps in other important aspects like scalability. However, there is always the need to define how the work is repartitioned among the healthy instances when one of them is no longer capable of handling traffic, and how traffic is returned to that instance once the disruption has passed.

Usually, for request/response applications this is handled automatically by the load balancing system. Traffic is routed only to the healthy instances, and as long as we make sure that every instance is capable of processing every request, this works out of the box. An unhealthy instance will be removed from the pool, and then put back seamlessly once it’s deemed ready again.

For consumer/producer based applications in DraftKings, the preferred way of ensuring this is to take advantage of the Kafka rebalancing mechanism. All applications consume the same Kafka topic, and the Kafka system itself makes sure to repartition the traffic only to the active consumers. For this reason, it is very important to take advantage of the health monitor events to stop any consumer when the application is unhealthy and start it back up when the application is healthy again. This notifies the Kafka broker that there is a change in the consumer group membership, and the broker reacts accordingly, ensuring that traffic is rerouted properly to the remaining instances. This has the huge advantage of not requiring application instances to be aware of the status of any instance other than themselves.
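
A sketch of that pattern using Confluent’s .NET Kafka client and the hypothetical IHealthMonitorService contract from earlier (offset management and error handling omitted):

```csharp
using Confluent.Kafka;

public sealed class HealthAwareConsumerController
{
    private readonly IConsumer<Ignore, string> _consumer;
    private readonly string _topic;

    public HealthAwareConsumerController(
        IConsumer<Ignore, string> consumer, string topic, IHealthMonitorService healthMonitor)
    {
        _consumer = consumer;
        _topic = topic;
        healthMonitor.ReadinessChanged += OnReadinessChanged;
    }

    private void OnReadinessChanged(object sender, HealthStatusChangedEventArgs e)
    {
        if (e.Current == OverallHealthStatus.Unhealthy)
        {
            // Leaving the consumer group triggers a rebalance, so the healthy
            // instances take over the partitions this instance was consuming.
            _consumer.Unsubscribe();
        }
        else if (e.Previous == OverallHealthStatus.Unhealthy)
        {
            // Rejoin the group once the dependencies are back; another rebalance
            // redistributes the partitions across all ready instances.
            _consumer.Subscribe(_topic);
        }
    }
}
```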

For other integrations, the solution is not so straightforward. For RabbitMQ, for example, it may require some explicit reconfiguration of bindings and routing between the remaining instances, together with the ability for the application to request the overall status of each instance from Consul in order to better orchestrate the consumption. This is undoubtedly more complicated to handle and more error prone, but it is doable; alternatively, we could fall back to an active/passive system or a sharded system (more manual maintenance but a more straightforward architecture). As said before, it is a matter of balancing resiliency and complexity.

Active/Passive: Integration with the leader election system & additional challenges

The Feeds Team also uses Consul for leader election when implementing an active-passive approach. This combines perfectly with the health monitoring system, as Consul uses the same health check, on its side, to allow an instance to maintain its leadership status. If the readiness health check of the application returns an unhealthy status, the instance is immediately demoted as leader and any other healthy instance takes its place. This triggers an event in the application, which can be used to start or stop the processing of data that requires leadership status for consumption.

While this system works very well out of the box, there are indeed some challenges that arise in this case and that need to be taken care of:

  • How to handle monitors that check a dependency that should be connected only when an instance is currently the leader
  • How to avoid constant switching of leadership in case of troubles with a fundamental dependency

These are all cases that are currently handled with application-specific code when the need arises. A more general solution is expected to be scoped and discussed in the future. The monitoring system is pretty much alive and constantly evolving.

HTTP-Based Active/Passive System

On a side note, an interesting requirement in some of our past integrations was the need to combine an HTTP-based request/response system with an active/passive architecture. In other words, we wanted only our active instance to be reachable via HTTP (through an internal load balancer; our system won’t allow internet connections directly, there is a gateway solution for this). This is not as straightforward as it seems, because our load balancer by default exposes ALL healthy instances, regardless of whether they are active or passive, and we did not want to modify the existing infrastructure to accommodate this.
After some experiments with degraded statuses for passive instances, and with dynamically configuring the weight of the load balancing routing based on application metadata (both unsuccessful for various reasons), the solution was found by taking advantage of Consul and the way its DNS system can be configured.
Without going into too much detail, when registering applications into the Consul service discovery system, it is possible to optionally add ‘tags’ to the instance registration. We use this functionality to mark which instance is currently the leader in order to track it down in our monitoring system, using a tag with the key ‘Leader.’ The tag is added and removed by the leader election library based on the actual acquisition or demotion of leadership.
Consul has a DNS templating system that allows it to take these tags into consideration for traffic routing. When a service is named, for example, MyService, it will be reachable in our network by calling MyService.Services.Consul. However, it is also possible to use the aforementioned tags to filter which instances the traffic is routed to. So, when calling Leader.MyService.Services.Consul, the traffic is routed ONLY to the instances that are registered with that specific tag.
Given that the tag is dynamically added or removed as previously explained, this allows any client trying to connect to Leader.MyService.Services.Consul to hit the leader directly, and to reach the new leader just a few seconds after the switch.
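
For illustration, toggling such a tag could look roughly like the sketch below, assuming the community Consul client for .NET; the Feeds leader election library does this internally, and the instance id and port shown here are made up.

```csharp
using System;
using System.Threading.Tasks;
using Consul;

public static class LeaderTagUpdater
{
    // Re-registering a service with the same ID updates its registration in place,
    // including its tags, which Consul DNS then uses for filtering.
    public static Task SetLeaderTagAsync(IConsulClient client, bool isLeader)
        => client.Agent.ServiceRegister(new AgentServiceRegistration
        {
            ID = "myservice-1",   // hypothetical instance id
            Name = "MyService",
            Port = 5000,          // hypothetical port
            Tags = isLeader ? new[] { "Leader" } : Array.Empty<string>()
        });
}
```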

Visibility

The status of each component is visible in various ways:

  • Consul Dashboard — Consul provides a very clear dashboard giving us the status of every health check in real time, including the response that the application returns upon request
  • Grafana Metrics — Grafana is the main monitoring tool used in the Sportsbook product, and Consul also produces health metrics that can be used for alerting
  • Kubernetes Dashboards
  • Kibana — The Sportsbook system uses the Elastic/Logstash/Kibana stack for logging purposes, and the monitoring system takes advantage of it by inserting clear and detailed logs about any kind of issue encountered for each monitor. It is very useful for troubleshooting both the application and the monitoring system itself
  • Kafka — A dedicated Kafka topic containing the status of each provider integration. This is a separate component that can be used to allow our Sportsbook to dynamically react to health issues, for example by switching the provider used to offer a certain game for further calculation, minimizing the need to handle such cases manually.

Usage Example

To use the system, all that is needed is to register and configure the monitors in the Startup.cs class of the application.
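
The snippet below is only a sketch of what that registration could look like; the AddHealthMonitor, AddLiveness, and AddReadiness calls, as well as the monitor type names, are hypothetical rather than the internal package’s actual API.

```csharp
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // Hypothetical registration API, for illustration only.
        services.AddHealthMonitor(monitors =>
        {
            // Liveness: dependencies whose failure requires a container restart.
            monitors.AddLiveness<VaultMonitor>();

            // Readiness: dependencies we can reconnect to without restarting.
            monitors.AddReadiness<KafkaMonitor>(failureStatus: OverallHealthStatus.Unhealthy);
            monitors.AddReadiness<SqlServerMonitor>(failureStatus: OverallHealthStatus.Degraded);
            monitors.AddReadiness<HttpOfferingMonitor>(failureStatus: OverallHealthStatus.Degraded);
            monitors.AddReadiness<SshFeedMonitor>(failureStatus: OverallHealthStatus.Unhealthy);
        });
    }
}
```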

A configuration example with several monitors.

Once this is done, it is necessary to add a line, available in the extension package, to integrate with the .NET Core health check system. The status returned by the health check system automatically takes into consideration the status of all the registered monitors, for both readiness and liveness.
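
Hypothetically, that single line could look like the following; the actual extension method name in the internal package may differ.

```csharp
public void Configure(IApplicationBuilder app)
{
    // Hypothetical extension method from the optional HTTP package;
    // maps the default /health-liveness and /health-readiness routes.
    app.UseHealthMonitorEndpoints();
}
```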

This will add the default health check routes /health-readiness and /health-liveness.

Finally, to be able to start/stop business services based on the readiness of the application, simply subscribe to the event that the HealthMonitorService exposes.
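
A sketch of such a hosted service, reusing the hypothetical IHealthMonitorService contract from earlier and an imaginary SshPricingProcessor business service:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

public sealed class PricingProcessingService : BackgroundService
{
    private readonly IHealthMonitorService _healthMonitor;
    private readonly SshPricingProcessor _processor; // hypothetical business service

    public PricingProcessingService(IHealthMonitorService healthMonitor, SshPricingProcessor processor)
    {
        _healthMonitor = healthMonitor;
        _processor = processor;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        // Start or stop the business logic whenever the aggregated readiness changes.
        _healthMonitor.ReadinessChanged += (_, e) =>
        {
            if (e.Current == OverallHealthStatus.Unhealthy) _processor.Stop();
            else if (e.Previous == OverallHealthStatus.Unhealthy) _processor.Start();
        };

        if (_healthMonitor.CurrentReadiness != OverallHealthStatus.Unhealthy)
            _processor.Start();

        try
        {
            // Keep the hosted service alive until the host shuts down.
            await Task.Delay(Timeout.Infinite, stoppingToken);
        }
        catch (OperationCanceledException)
        {
            // Host shutdown.
        }

        _processor.Stop();
    }
}
```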

This can be easily combined with Background or Hosted services.

Recap of capabilities and advantages

Thanks to the monitor plugin system, the implementation of monitoring is easy and quick, and brings immediate advantages to a new project in terms of resiliency. Developers can quickly write custom monitors, and configuration is simple.

The seamless integration with the Feeds leader election package makes the creation of a self-healing, leader-election system based on Consul straightforward, as several cases are handled out of the box.

Already existing metrics and dashboards that have been refined with past experiences allow us not only to react quickly in case of troubles, but also to rule out false positives due to metrics infrastructure issues. Moreover, a quick call to any of the endpoints or a query in the Kibana logging system will return clear data about which monitor is firing and the result of the overall monitor aggregation.

The fact that the monitors are pluggable and reusable also means that the metrics and alerting logic that can be attached to them can be easily reused in different applications.

As a result, we have a self-healing system that stays down only for the minimum amount of time needed for the dependencies to recover, and that provides automatic resolution of problems that previously would have required manual intervention from developers and/or operations. In most cases, IT and business do not even notice that a (short-lived) problem has occurred; only the alerting and auditing system keeps track of it.

A business service (intended as an entity capable of processing business logic) can utilize the health-monitoring system in various ways:

  • It can subscribe to the health event triggered by it, and react accordingly to shut down a consumer, fire an alert, trigger another event, etc.
  • It can trigger a health check explicitly as part of a catch statement, to recognize as soon as possible an error caused by the health status of the application.

The overall infrastructure instead can use this system to allow:

  • Automatic restarts of containers until the dependency for which the restart is required comes back online
  • Automatic split of load if only some of the instances are unhealthy (for example, connectivity disruption in some nodes), not only in terms of HTTP traffic but also in terms of Kafka rebalancing
  • Automatic leader switch if a leader gets unhealthy, and a complete stop of the system if every instance is unhealthy

Example: results in our SSH provider integration

We achieved a very flexible resiliency system that allows us to configure exactly the behavior we would like to observe in our application. Our SSH monitor can be customized to allow the application to keep trying the connection for 30 seconds before giving up; the HTTP monitor could mark the instance degraded (allowing us to register a warning) but not do anything for several hours before marking the instance as unhealthy; and the integration with the leader election system will ensure a smooth transition to a healthy instance and continuation of processing as soon as problems with 3rd parties or DraftKings infrastructure are resolved.
All the monitors can then be turned into extension packages and reused, as the Kafka, SQL Server, Vault, and HTTP monitors have pretty standard requirements, and a package will be very useful in other cases in the future, reducing the time needed to implement our architectural requirements in our next integrations.

Conclusion

This library, together with the integration with the leader election system and the overall infrastructure at DraftKings, is the result of 2 years of constant improvement and discussions among many people in various teams. We’re proud to say that this system drastically reduced the number of incidents and manual interventions, significantly increasing confidence in the overall Sportsbook solution. The self-healing capabilities of the system reduced the number of times people had to be paged due to infrastructure problems, improving the quality of life of developers and the quality and consistency of DraftKings’ offering over time. The considerable investment of time spent in designing and implementing the solution has been fully repaid by allowing people to focus more on product and less on solving the same technical requirements every time, and reduced the time to develop a solid integration by 2 to 3 times, making our roadmaps more flexible and more capable of adjusting to new challenges and requirements.

Want to learn more about DraftKings’ global Engineering team and culture? Check out our Engineer Spotlights and current openings!
