Standardization and control in data migration

Ivan Atanasov
DraftKings Engineering
8 min read · Jan 16, 2024

Introduction

Data migration is common in nearly all software development organizations as they scale and expand their operations. Moving data from one system to another involves complex tasks that require careful planning and execution to ensure the data remains accessible and usable. Moreover, data migration presents several challenges, including security, integrity, compliance, and quality. This article covers the data migration challenges we faced at DraftKings and the strategies we adopted to address them. We also look at some of the latest tools and techniques we use to execute seamless data transfers while ensuring the integrity and security of the data throughout the migration process.

Data Warehousing
Data warehousing is one of the most significant projects that require extensive data migration. Data warehousing is critical for effective data management and analysis, especially for organizations like ours with large and complex datasets. To address this challenge, we need to develop robust and comprehensive data migration strategies that incorporate the latest tools and techniques, such as Extract-Transform-Load (ETL) processes, to ensure the integrity and security of the data throughout the migration process.

Streaming
Another critical project that requires extensive data migration is streaming. Streaming involves using internal and external data streams for various use cases. Given the high volume and frequency of data involved in streaming, data migration must be performed quickly and accurately to ensure that the data remains accessible and usable. To execute seamless data transfers, we must leverage advanced tools and techniques, such as Change Data Capture (CDC) and data replication, to ensure that the data remains consistent and up-to-date throughout the migration process.

Regulatory Integrations
We are a software development organization that operates in numerous countries and faces regulatory requirements that necessitate external integrations based on regulatory institutions’ needs. As such, data migration is a critical aspect of ensuring compliance with these regulations, making it a frequent and essential task. To address this challenge, we must develop comprehensive data migration strategies incorporating the latest tools and techniques, such as data encryption and anonymization, to ensure the data remains secure and compliant throughout the migration process.

Challenges We Face

From data warehousing to data streaming and regulatory compliance, effective data management is critical to the success of modern software development organizations. However, despite the benefits of proper data management, data migration is often fraught with challenges that hinder an organization’s ability to scale and innovate.

Repeated Code in our Codebase
Several sections of code are repeated throughout our codebase. This repetition can lead to longer development and maintenance times, reduced efficiency, and a higher likelihood of introducing errors or bugs into the system.

Monitoring, Logging, and Troubleshooting
Another challenge during data migration is troubleshooting. Monitoring and logging can vary between applications, making it difficult to identify and resolve issues. This lack of uniformity in monitoring and logging can lead to delays, potentially affecting business operations.

Scaling and Elasticity
Scaling up or down can be challenging during data migration, especially for older applications deployed on virtual machines (VMs). Older applications may not have been designed to be easily scalable, making migrating data from them challenging. This can increase infrastructure costs or, in other cases, degrade performance.

Goals for Standardizing Our Solution

Organizations must set clear goals to migrate data successfully, focusing on scalability, resilience, stability, and global monitoring and logging.

Scalability
Scalability is essential for applications that handle large volumes of data. We prioritize a configuration-based, out-of-the-box solution so that applications can handle data migration at scale, with support for both scheduled scaling and load-based scaling. By automating the scaling process, we aim to ensure that applications can handle large volumes of data without experiencing performance issues or downtime.

Efficiency, Resilience, and Stability
Our data migration solution aims for optimal efficiency, resilience, and stability. This can be achieved by minimizing the need for repetitive implementations and maximizing the utilization of pre-existing code features and development practices.

Common Monitoring and Logging Out-of-the-box
Monitoring and logging are critical components of any data migration solution. Applications should include ETL-specific metrics out of the box, especially performance metrics. This allows for faster and easier investigations and troubleshooting, reducing the likelihood of issues and downtime.

Solution

.Net-Based Solution
Most developers in our unit and across the company are already familiar with .Net, and most legacy applications are also based on .Net. Therefore, choosing a .Net-based solution significantly reduces the complexity of the migration process and facilitates a smooth transition.

Scalability and performance
Scalability and performance are critical factors in any ETL DataFlow implementation. These can be achieved through the design and implementation of the blocks, the algorithms and techniques used for data processing, and the use of parallel and asynchronous processing.
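
To illustrate, here is a minimal sketch of how per-block parallelism and bounded, asynchronous buffering can be configured with TPL Dataflow; the parallelism and capacity values are illustrative assumptions, not our production settings.

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

class ScalingSketch
{
    static async Task Main()
    {
        // Illustrative values only; in practice these are tuned per block at runtime.
        var options = new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = 4, // process up to 4 items concurrently within the block
            BoundedCapacity = 1_000     // back-pressure: cap the block's input buffer
        };

        // Transform stage runs asynchronously and in parallel inside the block.
        var transform = new TransformBlock<string, int>(
            async line => { await Task.Delay(10); return line.Length; },
            options);

        var load = new ActionBlock<int>(len => Console.WriteLine($"Loaded item of size {len}"));
        transform.LinkTo(load, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (var line in new[] { "alpha", "beta", "gamma" })
            await transform.SendAsync(line);

        transform.Complete();
        await load.Completion;
    }
}
```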

A Custom Generic Library
At first glance, developing a custom library may seem like an unfavorable choice, considering the availability of several established solutions and frameworks like Apache Spark and Apache Flink. However, after careful consideration and review of all possibilities, our decision to opt for a custom generic library is mainly based on the following reasons:

  • Most of the engineers in the unit are C# developers with experience in the .Net framework.
  • All legacy projects use the .Net framework, and the migration should be as fast and easy as possible.

To optimize the migration process, an ETL (Extract, Transform, Load) library that implements the basic ETL flow with additional features is recommended. The library provides the ability to define the input and output types of the objects and the specific implementation of the blocks. It also allows for individual internal block scaling on a thread level, with the number of instances per block as a runtime parameter. Additionally, performance metrics are available out-of-the-box for every ETL, and scheduling features are included.
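
As a rough illustration of what such a library's consumer-facing contract might look like, here is a sketch of plugin-style interfaces and runtime options; the names below are hypothetical and do not reflect the actual internal API.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical plugin contracts: the consumer chooses the input/output types
// and supplies the concrete implementation of each ETL stage.
public interface IExtractor<TOut>
{
    IAsyncEnumerable<TOut> ExtractAsync();
}

public interface ITransformer<TIn, TOut>
{
    Task<TOut> TransformAsync(TIn item);
}

public interface ILoader<TIn>
{
    Task LoadAsync(TIn item);
}

// Hypothetical runtime options: instances per block (thread-level scaling)
// and the schedule come from configuration rather than code.
public sealed class EtlBlockOptions
{
    public int InstancesPerBlock { get; init; } = 1;
    public string? CronSchedule { get; init; } // e.g. "0 0 2 * * ?" for 02:00 daily
}
```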

Features:

  • Thread parallelism using Microsoft Dataflow (Task Parallel Library).
  • Plugin Design Pattern. A plugin design pattern is recommended to define the ETL’s input and output types and the blocks’ specific implementation.
  • Performance metrics are stored in InfluxDB and visualized in Grafana.
  • Quartz.NET scheduler integration (see the sketch after this list), providing:
    • Job Scheduling
    • Control over Job Execution
    • Job Persistence
    • Job Clustering
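
To illustrate the Quartz.NET integration, here is a minimal scheduling sketch; the job class, identifiers, and cron expression are hypothetical examples, and job persistence and clustering additionally require a configured database-backed job store.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Quartz;
using Quartz.Impl;

// Hypothetical ETL job; in the library this would kick off the configured ETL pipeline.
public sealed class NightlyEtlJob : IJob
{
    public Task Execute(IJobExecutionContext context)
    {
        Console.WriteLine($"ETL triggered at {DateTimeOffset.UtcNow:O}");
        return Task.CompletedTask;
    }
}

public static class SchedulerSketch
{
    public static async Task Main()
    {
        IScheduler scheduler = await new StdSchedulerFactory().GetScheduler();
        await scheduler.Start();

        IJobDetail job = JobBuilder.Create<NightlyEtlJob>()
            .WithIdentity("nightly-etl", "etl")
            .Build();

        // Cron trigger: every day at 02:00 (illustrative schedule).
        ITrigger trigger = TriggerBuilder.Create()
            .WithIdentity("nightly-etl-trigger", "etl")
            .WithCronSchedule("0 0 2 * * ?")
            .Build();

        await scheduler.ScheduleJob(job, trigger);
        await Task.Delay(Timeout.Infinite); // keep the host alive so the scheduler can fire
    }
}
```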

Migration and Refactoring
Migrating to a custom generic .NET library is feasible for several reasons. Firstly, we already have numerous existing ETLs developed in .NET and hosted on Kubernetes, and the required refactoring with this library is minimal. Secondly, DraftKings has already hired experienced software engineers proficient in developing Microservice ETL applications hosted on Kubernetes, which provides a competitive advantage.

Hosting

Kubernetes is our primary hosting solution across the company, which fulfills all requirements for hosting an ETL. We are at an advantage with a well-established solution and highly experienced professionals supporting it.

Monitoring

  • The Block Processing Rate, denoted in items per second, is a key performance metric that measures the throughput of the extract-transform-load (ETL) process for individual data blocks. This metric indicates the efficiency of the ETL process and is essential in evaluating system performance.
  • The Block Processing Duration is a performance metric that quantifies the time required to execute each data block within the extract-transform-load (ETL) process. This metric provides valuable insight into the efficiency and effectiveness of the ETL process and serves as a critical indicator of system performance.
  • The Buffer Size metric denotes the number of items currently held in each data block’s buffer. This metric provides insight into the data processing capacity of the system and aids in identifying potential bottlenecks within the data flow.
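
To make these three metrics concrete, here is a minimal sketch of how a block's processing rate, duration, and buffer size might be sampled; the wiring is illustrative, and shipping the values to InfluxDB for Grafana dashboards is omitted.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks.Dataflow;

public static class MetricsSketch
{
    public static void Main()
    {
        long processed = 0;
        var uptime = Stopwatch.StartNew();

        var transform = new TransformBlock<string, int>(line =>
        {
            var sw = Stopwatch.StartNew();
            int result = line.Length; // the actual transform work
            sw.Stop();
            Interlocked.Increment(ref processed);
            Console.WriteLine($"Block Processing Duration: {sw.Elapsed.TotalMilliseconds:F2} ms");
            return result;
        });
        var load = new ActionBlock<int>(_ => { });
        transform.LinkTo(load, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (var line in new[] { "alpha", "beta", "gamma" })
            transform.Post(line);

        // Buffer Size: items currently queued in the block (sampled periodically in practice).
        Console.WriteLine($"Buffer Size: {transform.InputCount} items");

        transform.Complete();
        load.Completion.Wait();

        // Block Processing Rate: throughput since the pipeline started.
        double rate = Interlocked.Read(ref processed) / Math.Max(uptime.Elapsed.TotalSeconds, 0.001);
        Console.WriteLine($"Block Processing Rate: {rate:F1} items/s");
    }
}
```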

Kubernetes

Benefits

  • Horizontal Scaling.
    A robust and scalable platform for deploying and managing containerized applications is essential for data processing and analysis applications. Using Kubernetes, you can quickly deploy and scale the ETL implementation as needed without worrying about managing the underlying infrastructure.
    Regarding an ETL, this can be particularly useful for handling large amounts of data and ensuring that the data processing can keep up with the rate at which data is extracted and loaded. With Kubernetes, you can easily add more nodes to your cluster to increase the processing capacity of your ETL pipeline, and you can use Kubernetes’ horizontal scaling features to automatically add and remove nodes as needed.
  • Maintainability & Resilience.
    DraftKings already uses Kubernetes as the primary hosting solution for most applications, and many experienced engineers in the company trust it. It also has an active community and detailed documentation.
    Kubernetes can also help with the reliability and availability of your ETL. Kubernetes can automatically reschedule failed jobs and ensure that your data processing continues even if individual nodes fail. This can help prevent data loss and ensure your ETL dataflow can handle even the most extensive data volumes without interruption.
  • Consistent & Easy to Use
    Kubernetes is well known in the company, and a consistent, standardized environment for deploying applications helps simplify the development and deployment process. With Kubernetes handling the hosting, our .NET developers can focus on developing the ETL implementation itself.

Challenges

  • Resilience
    Kubernetes is designed to scale horizontally, quickly spinning up new service instances to meet demand. However, this can also create challenges when it comes to managing the health and availability of these instances. Kubernetes is often used to deploy microservices composed of many small, independent services. This introduces additional complexity when it comes to managing the health and availability of these services, as failures in one service can impact others.
  • Lack of built-in support for ETL.
    Kubernetes is not explicitly designed for ETL processes and does not include built-in support for ETL tasks.

Summary

Selecting the most appropriate data processing framework is a critical decision that requires a thorough analysis of specific requirements and trade-offs. While many robust and reliable out-of-the-box solutions exist, each has its own limitations and challenges and may only suit some use cases. Therefore, it is essential to carefully evaluate an organization’s unique needs before making any decision.

After a comprehensive analysis of our data processing requirements, it has become evident that a custom library may be the most suitable solution for our needs. While it may seem unconventional, several factors have led us to this decision.

Developing a custom library poses challenges, including compatibility and versioning. We address this by creating versioned agreements/interfaces and conducting automated tests for each commit.

Overall, while it may seem like an unconventional choice, developing a custom generic library can provide us with several significant advantages over out-of-the-box solutions. We can achieve a more efficient, agile, and cost-effective data processing pipeline by leveraging our existing technology stack, gaining greater control and flexibility, standardizing our development approach, and improving our code’s quality and reliability.

Want to learn more about DraftKings’ global Engineering team and culture? Check out our Engineer Spotlights and current openings!
