Migrating Hulu’s Hadoop Clusters to a New Data Center — Part One: Extending our Hadoop Instance

Hulu Tech blog · May 15, 2018

By Qianxi Zhang, Senior Software Developer and Xicheng Dong, Principal Software Development Lead

As part of our larger migration to a sustainable data center in Las Vegas, Hulu migrated our Hadoop data center. This migration posed unique challenges due to the volume of data we had to migrate, our complex software stack, and our use of mixed applications. In this blog, we will outline the complexities we faced and discuss the most challenging aspect of the migration — migrating the HDFS cluster.

Challenges

  • We handle large amounts of data and run complex applications on it. At Hulu, we work with several Hadoop clusters, the largest of which stores multiple petabytes of data and runs at least 30,000 applications performing different functions every day (including MapReduce, Pig, Hive, Spark, Presto, and Impala). The data migration needed to maintain the integrity of these Hadoop clusters and cause little to no impact to our daily operations.
  • We needed to keep downtime of our Hadoop system to a minimum. We wanted a maximum of 6 hours of downtime to ensure the migration had a minimal impact on daily operations.
  • This project needed to be budget friendly. We did not want to purchase an equal number of machines for the migration, so we needed to devise a way to reduce the number of machines while maintaining our data during the migration.
  • This migration was a cross-functional effort — collaboration was key. Hulu’s data-related teams sit in different offices around the globe, and this migration impacted all areas of the business that leverage Hadoop for data analysis. This migration had to be a team effort, and cooperation and strong communication were critical to our success.
  • The migration needed to be simple and create as little extra work as possible for our team. We wanted to make application migration as easy as possible and reduce the workload on our team, so we decided to build custom tools and enhance our infrastructure.

Separate solutions for separate clusters

For the migration, we classified our systems into two categories:

  • Stateless systems, such as YARN, Presto, and Impala, which could be easily migrated from one data center to another by setting up a new instance of the system.
  • Stateful systems, which we can further divide into three sub-categories: systems that manage small amounts of data (such as Hive and ZooKeeper), systems that manage windowed data (such as Kafka), and systems that manage big data (such as HDFS).

Different Hadoop clusters required our team to build two solutions for the migration. Larger, load-heavy clusters required a solution that allowed us to customize Hadoop to run in multiple data centers. For smaller, less complex clusters, we approached the migration differently, and set up a mirrored Hadoop instance in our new data center.

Two Solutions for the Hadoop Data Center Migration

The most difficult system to migrate was HDFS, which stores our big data and required a multi-data center approach. Because Hadoop does not natively have the ability to support multiple data centers, our team needed to do additional work to complete the HDFS migration, which we will discuss in this post.

Building our multi-data center solution for migration

Architecture

We extended HDFS to support multiple data centers as one whole cluster. By adjusting replica numbers in each data center, we were able to migrate HDFS data from one data center to the other gradually.

There are 3 components in this system, each with different responsibilities:

  • DCNameNode (extends NameNode): Manages DataNodes and block replicas across data centers
  • DCBalancer (extends Balancer): Balances DataNodes within a data center
  • DCTunnel: Replicates blocks across data centers asynchronously

HDFS architecture across data centers

Implementation

We enhanced existing HDFS components and added new ones to support a data-center-level topology:

DCNameNode

  • Data center level topology
  • Read locality at the data center level
  • Write only in primary data center
  • Replica management at the data center level

DCTunnel

  • A new high-performance block replication engine across data centers

DCBalancer

  • Balances data within each data center

DCNameNode

Data center level topology. To achieve read locality, we built DCNameNode to be aware of the data-center-level topology. It provides the following functions (a small illustrative sketch follows the list):

  • Understands the data center location of a specific DataNode.
  • Checks whether a DataNode is in a specific data center.
  • Calculates a distance weight, i.e., how far a node is from the reader.
  • Sorts DataNodes according to the distance weight.
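
To make the distance-weight idea concrete, here is a minimal Java sketch (not Hulu's actual implementation; the NodeLocation type and the exact weights are hypothetical) of how a data-center-aware distance could be computed and used to sort candidate DataNodes:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of data-center-aware distance weights.
// Lower weight == closer to the reader.
public class DcTopologySketch {

    // Minimal stand-in for a DataNode's location information.
    record NodeLocation(String host, String rack, String dataCenter) {}

    // 0 = same host, 1 = same rack, 2 = same data center, 3 = remote data center.
    static int distanceWeight(NodeLocation reader, NodeLocation candidate) {
        if (reader.host().equals(candidate.host())) return 0;
        if (reader.rack().equals(candidate.rack())) return 1;
        if (reader.dataCenter().equals(candidate.dataCenter())) return 2;
        return 3;
    }

    // Sort candidate DataNodes so the closest replicas come first.
    static List<NodeLocation> sortByDistance(NodeLocation reader, List<NodeLocation> candidates) {
        return candidates.stream()
                .sorted(Comparator.comparingInt(c -> distanceWeight(reader, c)))
                .toList();
    }

    public static void main(String[] args) {
        NodeLocation reader = new NodeLocation("h1", "rack-a", "dc1");
        List<NodeLocation> replicas = Arrays.asList(
                new NodeLocation("h9", "rack-z", "dc2"),   // remote data center
                new NodeLocation("h2", "rack-a", "dc1"),   // same rack
                new NodeLocation("h5", "rack-b", "dc1"));  // same data center
        sortByDistance(reader, replicas).forEach(n ->
                System.out.println(n + " weight=" + distanceWeight(reader, n)));
    }
}
```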

Read locality at the data center level. To avoid network issues when replicating large amounts of data across data centers, we added support for read locality at the data center level, evaluated after node locality and rack locality. We designed a new strategy to control which candidate DataNodes the NameNode returns to the client for read operations (see the sketch after this list).

  • If reads across data centers are allowed, DCNameNode sorts DataNodes according to distance weight, so the order in which a source DataNode is selected becomes: a DataNode on the same node, then one in the same rack, then one in the same data center, and finally one in the remote data center.
  • If reads across data centers are not allowed, DCNameNode only returns DataNodes in the local data center. This protects cross-data-center bandwidth from being congested when a large number of requested replicas are not available in the local data center.
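
The candidate-selection policy itself could look like the following sketch, again using hypothetical types: when cross-data-center reads are disabled, only local replicas are returned; otherwise all replicas are returned sorted by distance weight.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the candidate-selection policy described above.
public class ReadLocalitySketch {

    record DataNodeInfo(String host, String rack, String dataCenter, int distanceWeight) {}

    // Returns the DataNodes the NameNode would hand back to a client in data center `readerDc`.
    static List<DataNodeInfo> chooseReadCandidates(String readerDc,
                                                   List<DataNodeInfo> replicas,
                                                   boolean allowCrossDcRead) {
        if (!allowCrossDcRead) {
            // Only local replicas are returned, protecting cross-data-center bandwidth.
            return replicas.stream()
                    .filter(r -> r.dataCenter().equals(readerDc))
                    .sorted(Comparator.comparingInt(DataNodeInfo::distanceWeight))
                    .toList();
        }
        // Otherwise all replicas are returned, closest first:
        // same node < same rack < same data center < remote data center.
        return replicas.stream()
                .sorted(Comparator.comparingInt(DataNodeInfo::distanceWeight))
                .toList();
    }

    public static void main(String[] args) {
        List<DataNodeInfo> replicas = List.of(
                new DataNodeInfo("h2", "rack-a", "dc1", 1),
                new DataNodeInfo("h9", "rack-z", "dc2", 3));
        System.out.println(chooseReadCandidates("dc1", replicas, false)); // only the dc1 replica
        System.out.println(chooseReadCandidates("dc1", replicas, true));  // both, dc1 first
    }
}
```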

Write only in the primary data center. To better control data consistency, we designated one data center as primary and forced all data to be written to it. The primary designation changed during the migration: in the first stage, all new data was written to the source data center, which was set as primary. DCTunnel then synced blocks to the destination data center asynchronously, and the primary label was switched to the destination data center once all applications had been migrated.
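
A minimal sketch of this idea, assuming a hypothetical placement helper and a reloadable primary-DC label (none of these names come from Hulu's code):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: writes are pinned to whichever data center is currently "primary".
public class PrimaryDcWriteSketch {

    record DataNodeInfo(String host, String dataCenter) {}

    // Reloadable primary-DC label; flipped from the source DC to the destination DC
    // once all applications have been migrated.
    private final AtomicReference<String> primaryDc = new AtomicReference<>("dc1");

    // Only DataNodes in the primary data center are eligible targets for new blocks.
    List<DataNodeInfo> chooseWriteTargets(List<DataNodeInfo> allNodes, int replication) {
        return allNodes.stream()
                .filter(n -> n.dataCenter().equals(primaryDc.get()))
                .limit(replication)
                .toList();
    }

    void switchPrimary(String newPrimaryDc) {
        primaryDc.set(newPrimaryDc);
    }

    public static void main(String[] args) {
        PrimaryDcWriteSketch placement = new PrimaryDcWriteSketch();
        List<DataNodeInfo> nodes = List.of(
                new DataNodeInfo("h1", "dc1"),
                new DataNodeInfo("h2", "dc1"),
                new DataNodeInfo("h9", "dc2"));
        System.out.println(placement.chooseWriteTargets(nodes, 2)); // dc1 nodes only
        placement.switchPrimary("dc2");                             // after app migration
        System.out.println(placement.chooseWriteTargets(nodes, 2)); // dc2 nodes only
    }
}
```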

Inter-data center unified replica management. We introduced two new factors for each data center, a minimum replica factor (MINRP) and a maximum replica factor (MAXRP), which represent the expected minimum and maximum number of data replicas in the specified data center. The expected replica counts for each file can then be calculated as:

  • The minimum replica count for a file in a specific data center (minR) = min {file replica factor, MINRP}
  • The maximum replica count for a file in a specific data center (maxR) = max {file replica factor, MAXRP}

The NameNode is responsible for replica management within each data center. When the replica count in a data center is greater than 0 but less than minR, the NameNode increases it to minR; when the count exceeds maxR, it deletes the redundant replicas.
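
The reconciliation rule can be expressed directly from the two formulas above; the sketch below is illustrative, with hypothetical type names:

```java
// Hypothetical sketch of the per-data-center replica reconciliation rule described above.
public class ReplicaManagementSketch {

    // Per-data-center policy: MINRP and MAXRP from the post.
    record DcReplicaPolicy(int minReplicaFactor /* MINRP */, int maxReplicaFactor /* MAXRP */) {}

    enum Action { NONE, REPLICATE, DELETE_EXCESS }

    // minR = min {file replica factor, MINRP}; maxR = max {file replica factor, MAXRP}.
    static Action reconcile(int fileReplicaFactor, int currentReplicasInDc, DcReplicaPolicy policy) {
        int minR = Math.min(fileReplicaFactor, policy.minReplicaFactor());
        int maxR = Math.max(fileReplicaFactor, policy.maxReplicaFactor());
        if (currentReplicasInDc > 0 && currentReplicasInDc < minR) {
            return Action.REPLICATE;      // NameNode brings the count up to minR
        }
        if (currentReplicasInDc > maxR) {
            return Action.DELETE_EXCESS;  // NameNode removes redundant replicas
        }
        return Action.NONE;
    }

    public static void main(String[] args) {
        DcReplicaPolicy dc2 = new DcReplicaPolicy(2, 3);
        System.out.println(reconcile(3, 1, dc2)); // REPLICATE: below minR = min(3, 2) = 2
        System.out.println(reconcile(3, 4, dc2)); // DELETE_EXCESS: above maxR = max(3, 3) = 3
        System.out.println(reconcile(3, 2, dc2)); // NONE
    }
}
```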

DCTunnel

DCTunnel is a high performance block replication engine that transfers block replicas across data centers asynchronously. It performs the following functions:

  • Replicates blocks by priority. Directories are assigned different priorities; high-priority directories that needed to migrate faster were allowed to consume more bandwidth. We could also set hard limits on the bandwidth used, keeping the entire process under control.
  • Supports blacklists and whitelists. We can exclude directories from replication by adding them to a blacklist and include directories by adding them to a whitelist (a small scheduling sketch combining both ideas follows this list).
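
Below is a small illustrative sketch (with hypothetical names) of how such priority-plus-whitelist/blacklist scheduling could work: tasks are filtered by directory prefix and drained in priority order.

```java
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of DCTunnel-style task scheduling: whitelist/blacklist filtering
// plus directory priorities that decide which blocks are copied first.
public class ReplicationQueueSketch {

    record CopyTask(String path, long blockId, int priority) {}

    private final List<String> whitelist;
    private final List<String> blacklist;
    private final PriorityQueue<CopyTask> queue =
            new PriorityQueue<>(Comparator.comparingInt(CopyTask::priority).reversed());

    ReplicationQueueSketch(List<String> whitelist, List<String> blacklist) {
        this.whitelist = whitelist;
        this.blacklist = blacklist;
    }

    // A block is enqueued only if its directory is whitelisted and not blacklisted.
    void submit(CopyTask task) {
        boolean whitelisted = whitelist.stream().anyMatch(task.path()::startsWith);
        boolean blacklisted = blacklist.stream().anyMatch(task.path()::startsWith);
        if (whitelisted && !blacklisted) {
            queue.add(task);
        }
    }

    CopyTask next() {
        return queue.poll(); // highest-priority directories are drained first
    }

    public static void main(String[] args) {
        ReplicationQueueSketch sched = new ReplicationQueueSketch(
                List.of("/data"), List.of("/data/tmp"));
        sched.submit(new CopyTask("/data/tmp/scratch", 1L, 9)); // blacklisted, skipped
        sched.submit(new CopyTask("/data/warehouse/orders", 2L, 5));
        sched.submit(new CopyTask("/data/logs/raw", 3L, 1));
        System.out.println(sched.next()); // /data/warehouse/orders (higher priority)
    }
}
```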

Low overhead on the existing NameNode. DCTunnel works like an HDFS version of rsync: it calculates the gap between the two data centers before starting block replication. Calculating the gap requires all file metadata, which lives in the NameNode. To avoid imposing heavy overhead on it, we designed a new strategy to obtain the metadata: we parse the fsimage to get file-to-block mappings, and retrieve block locations from DataNode disks.

Since the file mappings and block locations change over time, DCTunnel labels and versions the location information, then uses MVCC (Multiversion Concurrency Control) to generate copy tasks.
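
A simplified sketch of the gap calculation under these assumptions (hypothetical types; the real DCTunnel works on fsimage and block reports rather than in-memory sets):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the gap calculation: compare a versioned snapshot of block IDs
// known in the source data center against a snapshot from the destination, and emit the
// blocks that still need to be copied. Versioning the snapshots (MVCC-style) lets the
// comparison run against a consistent view while fsimage and block reports keep changing.
public class GapCalculationSketch {

    record Snapshot(long version, Set<Long> blockIds) {}

    private final Map<Long, Snapshot> sourceVersions = new HashMap<>();
    private final Map<Long, Snapshot> destVersions = new HashMap<>();

    void recordSource(Snapshot s) { sourceVersions.put(s.version(), s); }
    void recordDest(Snapshot s)   { destVersions.put(s.version(), s); }

    // Blocks present in the chosen source version but absent from the destination version.
    Set<Long> blocksToCopy(long sourceVersion, long destVersion) {
        Set<Long> gap = new HashSet<>(sourceVersions.get(sourceVersion).blockIds());
        gap.removeAll(destVersions.get(destVersion).blockIds());
        return gap;
    }

    public static void main(String[] args) {
        GapCalculationSketch gap = new GapCalculationSketch();
        gap.recordSource(new Snapshot(10, Set.of(1L, 2L, 3L))); // parsed from fsimage + disks
        gap.recordDest(new Snapshot(10, Set.of(1L)));
        System.out.println(gap.blocksToCopy(10, 10)); // blocks 2 and 3 still need replication
    }
}
```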

DCTunnel architecture

Full control over replication speed. We used a PID controller to adjust the number of parallel replication threads by aligning the current speed with a target speed. In our production environment, the results were phenomenal, and network utilization could be controlled at a fine granularity.
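
As a rough illustration of the technique, here is a textbook PID loop that nudges the thread count toward a target throughput; the gains and numbers are made up and not Hulu's tuning:

```java
// Hypothetical sketch of the PID idea: adjust the number of parallel replication threads
// so that the measured copy speed converges on a target speed. Gains are illustrative.
public class PidSpeedControllerSketch {

    private final double kP, kI, kD;
    private double integral;
    private double previousError;

    PidSpeedControllerSketch(double kP, double kI, double kD) {
        this.kP = kP; this.kI = kI; this.kD = kD;
    }

    // Returns the new thread count given the current one, the measured speed, and the target.
    int nextThreadCount(int currentThreads, double measuredMBps, double targetMBps) {
        double error = targetMBps - measuredMBps;
        integral += error;
        double derivative = error - previousError;
        previousError = error;

        double adjustment = kP * error + kI * integral + kD * derivative;
        // Clamp to a sane range so the tunnel never stops or overwhelms the link.
        return Math.max(1, Math.min(256, currentThreads + (int) Math.round(adjustment)));
    }

    public static void main(String[] args) {
        PidSpeedControllerSketch pid = new PidSpeedControllerSketch(0.05, 0.01, 0.02);
        int threads = 32;
        double[] measured = {400, 520, 610, 680};   // MB/s samples
        for (double m : measured) {
            threads = pid.nextThreadCount(threads, m, 700); // target 700 MB/s
            System.out.println("measured=" + m + " MB/s -> threads=" + threads);
        }
    }
}
```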

PID in DCTunnel
Network bandwidth control in DCTunnel

DCBalancer

Each data center has its own DCBalancer, which balances data between DataNodes by moving blocks. Through this tool, we could avoid the cross-data-center replica movement that a global balancer sometimes triggers.

Five Stages of Migration

We divided the migration process into 5 stages, and our core objective was to migrate machines in batches while carefully controlling the replicas in each data center.

In the image below, the left represents our original data center (marked DC1) and the right represents our new data center (marked DC2).

  • In the first stage, we set up new clusters in DC2.
  • In the second stage, we replicated data from DC1 to DC2, storing two replicas in the data center.
  • In the third stage, once almost all of our data had been synced to the new data center, we incrementally moved all remaining data, shut down the Hadoop cluster in DC1, and migrated all remaining systems to DC2.
  • In the fourth stage, we shipped a portion of our machines to the new data center and replicated our data again, giving us three replicas in DC2.
  • In the fifth stage, we shut down all old clusters and shipped the remaining machines to the new data center.

Migration process

Best Practices

Dynamic configuration

All configurations are designed to be reloadable dynamically, which allows us to adjust them without restarting the NameNodes or other services. This is critical because restarting a NameNode is time consuming.
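
A minimal sketch of one common way to achieve this, assuming a periodically re-read properties file (the file name and keys are hypothetical):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of dynamic configuration: a background task periodically re-reads a
// properties file so settings (e.g. bandwidth limits, the primary-DC label) can change
// without restarting the service.
public class ReloadableConfigSketch {

    private volatile Properties current = new Properties();
    private final Path configFile;

    ReloadableConfigSketch(Path configFile) {
        this.configFile = configFile;
    }

    void startWatching(ScheduledExecutorService scheduler) {
        scheduler.scheduleAtFixedRate(this::reload, 0, 30, TimeUnit.SECONDS);
    }

    private void reload() {
        try (InputStream in = Files.newInputStream(configFile)) {
            Properties fresh = new Properties();
            fresh.load(in);
            current = fresh; // atomic swap: readers always see a complete snapshot
        } catch (IOException e) {
            // Keep the last good configuration if the reload fails.
        }
    }

    String get(String key, String defaultValue) {
        return current.getProperty(key, defaultValue);
    }

    public static void main(String[] args) {
        ReloadableConfigSketch config = new ReloadableConfigSketch(Path.of("dctunnel.properties"));
        config.startWatching(Executors.newSingleThreadScheduledExecutor());
        System.out.println(config.get("replication.target.mbps", "700"));
    }
}
```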

Long tail problem

Once the majority of blocks (e.g. 99.9%) have been migrated to the new data center, finding the remaining ones (e.g. 0.1%) becomes more difficult, since scanning for blocks that still need replication takes longer and longer. To solve this long-tail problem, we built a cache for each whitelisted folder to track the blocks that haven't been migrated.
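
One way such a cache could be structured is sketched below (hypothetical names; the real implementation details are not described in the post):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the long-tail cache: for each whitelisted folder, keep the set of
// block IDs that are still missing in the destination, so the last ~0.1% of blocks can be
// found without rescanning everything.
public class PendingBlockCacheSketch {

    // folder path -> block IDs not yet migrated
    private final Map<String, Set<Long>> pending = new ConcurrentHashMap<>();

    void markPending(String folder, long blockId) {
        pending.computeIfAbsent(folder, f -> ConcurrentHashMap.newKeySet()).add(blockId);
    }

    void markMigrated(String folder, long blockId) {
        Set<Long> blocks = pending.get(folder);
        if (blocks != null) {
            blocks.remove(blockId);
            if (blocks.isEmpty()) {
                pending.remove(folder); // folder fully migrated
            }
        }
    }

    Set<Long> remaining(String folder) {
        return pending.getOrDefault(folder, Set.of());
    }

    public static void main(String[] args) {
        PendingBlockCacheSketch cache = new PendingBlockCacheSketch();
        cache.markPending("/data/warehouse", 101L);
        cache.markPending("/data/warehouse", 102L);
        cache.markMigrated("/data/warehouse", 101L);
        System.out.println(cache.remaining("/data/warehouse")); // [102]
    }
}
```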

Data validation

Data validation is crucial to ensure the data migration completes without data consistency issues. We built a distributed tool to validate the metadata and data in different data centers.
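
As a simplified illustration of what such a check does, the sketch below compares per-file metadata collected in each data center; the real tool is distributed, and the types here are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch of cross-data-center validation: compare per-file metadata
// (length and a content checksum) collected in each data center and report mismatches.
// In practice this comparison would be split across many workers.
public class ValidationSketch {

    record FileMeta(long length, String checksum) {}

    static List<String> findMismatches(Map<String, FileMeta> dc1, Map<String, FileMeta> dc2) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, FileMeta> e : dc1.entrySet()) {
            FileMeta other = dc2.get(e.getKey());
            if (other == null) {
                problems.add("missing in DC2: " + e.getKey());
            } else if (e.getValue().length() != other.length()
                    || !Objects.equals(e.getValue().checksum(), other.checksum())) {
                problems.add("mismatch: " + e.getKey());
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<String, FileMeta> dc1 = Map.of("/data/a", new FileMeta(10, "abc"),
                                           "/data/b", new FileMeta(20, "def"));
        Map<String, FileMeta> dc2 = Map.of("/data/a", new FileMeta(10, "abc"));
        System.out.println(findMismatches(dc1, dc2)); // [missing in DC2: /data/b]
    }
}
```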

Fully control cross-data-center network usage

The data center bandwidth is shared between different services. Because migrating big data infrastructure consumes enormous network resources, network usage needs to be fully controlled. We maximized network utilization to ensure data was replicated quickly enough without impacting other services.

Throughout the migration process, we maintained the same DNS addresses for our Hadoop components so there was no change for our applications. All blocks and files were accounted for and there was no impact on the performance of HDFS and other applications. The entire migration process was almost imperceptible to our users and the applications.

In our next post, we’ll discuss our migration strategy for smaller, less complex clusters, where we set up a mirrored Hadoop instance in a new data center.
