Migrating Hulu’s Hadoop Clusters to a New Data Center — Part Two: Creating a Mirrored Hadoop Instance
By Anning Luo, Software Developer and Xicheng Dong, Principal Software Development Lead
As part of a larger migration to a new data center in Las Vegas, we migrated our Hadoop clusters using two approaches. The first approach we discussed in our last blog post involved extending our instance of Hadoop to migrate our largest and heaviest cluster of HDFS. In this blog post, we’ll discuss migrating smaller HDFS clusters (100–200 nodes) for special purposes by creating a mirrored Hadoop instance.
When migrating these smaller clusters, the major challenge is that some run different vendor distributions. For example, some run the Hortonworks distribution (HDP), while others run the Cloudera distribution (CDH). Maintaining clusters on different versions requires a lot of work, but the migration gave us the opportunity to unify the cluster versions. We therefore wanted to build a migration solution that not only migrated the clusters, but switched their versions as well.
To migrate less complex clusters of HDFS while switching their versions, we developed two systems:
- DCMirror: This is a distributed, high-performance replication tool for heterogeneous Hadoop clusters. DCMirror is different from DCTunnel (which we discuss in Part One of this series) in that it focuses on inter-cluster, file-level replication and can replicate files between Hadoop clusters with different versions. DCMirror divides the replication process into two stages: full sync and incremental sync. In the full sync stage, DCMirror parses the FSImage and edit log (fsedits) from the source cluster's NameNode to generate parallel replication tasks. In the incremental sync stage, DCMirror listens for the latest inotify events from the JournalNodes in the source cluster and replays those events on the target cluster.
- DCValidator: This is a validation tool to verify whether two Hadoop clusters contain the same data and metadata. It parses the FSImage of both clusters and rebuilds the entire directory hierarchy tree. DCValidator then compares all directories and files recursively to generate a “diff” report. For example, it will flag file and folder differences between the two HDFS clusters.
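The two-stage flow can be sketched roughly as follows. This is a minimal illustration of the idea, not DCMirror's actual code; all names (`ReplicationTask`, `full_sync`, and so on) are assumptions made for the example.

```python
# Illustrative sketch of DCMirror's two-stage replication flow.
from dataclasses import dataclass

@dataclass
class ReplicationTask:
    op: str    # e.g. "sync-file" or "create-directory"
    path: str

def full_sync(source_listing):
    """Full sync: turn every (path, is_dir) entry parsed from the source
    FSImage into a replication task that can be executed in parallel."""
    tasks = []
    for path, is_dir in source_listing:
        op = "create-directory" if is_dir else "sync-file"
        tasks.append(ReplicationTask(op, path))
    return tasks

def incremental_sync(event_stream, target):
    """Incremental sync: replay each inotify-style event on the target
    cluster, keeping it caught up after the full sync completes."""
    for event in event_stream:
        target.apply(event)  # e.g. re-create, rename, or delete the path

tasks = full_sync([("/data", True), ("/data/a.log", False)])
print([t.op for t in tasks])  # → ['create-directory', 'sync-file']
```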
Architecture and Implementation
Operator in HDFS
Before diving into the architecture, we will provide some background on operators in HDFS.
The operator sets from inotify event streams are as follows:
- Create (file or directory): Opens a file to create a new file, overwrite it, or make a directory
- Append: Opens an existing file and appends data on it
- Close: Closes a file after writing data is complete
- Rename: Renames a file or directory (similar to the “mv” command in Linux)
- Unlink: Deletes a file or directory
- MetaData: Modifies the metadata of a file or directory
DCMirror operators are generated from HDFS inotify events. We made two adjustments when mapping inotify events to DCMirror operators:
- We split the HDFS create operator into two: a create-file operator and a create-directory operator. Create-file can execute for an extended period of time, and we wanted to control bandwidth for it separately, so the two operators are distinguished in our new system.
- We replaced the create-file, append-file, and close operators with a single sync-file operator. The original three operators represent a file lifecycle in HDFS that covers creating, writing, and closing. DCMirror, however, can only start migrating a file after its lifecycle is complete, so we merged the three operators into one.
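The mapping above can be sketched as a simple dispatch function. The event and operator names here are assumptions for illustration, not the actual DCMirror types.

```python
# Illustrative sketch of the inotify-event-to-DCMirror-operator mapping.
def to_dcmirror_op(event):
    """Map an HDFS inotify-style event (a dict here) to a DCMirror operator."""
    kind = event["type"]
    if kind == "create":
        # Split HDFS's single create event in two, so long-running,
        # bandwidth-controlled file creation is handled separately
        # from directory creation.
        return "create-directory" if event["is_dir"] else "sync-file"
    if kind in ("append", "close"):
        # create-file/append/close describe one file lifecycle; DCMirror
        # replicates only after the file is closed, so all three collapse
        # into a single sync-file operator.
        return "sync-file"
    if kind in ("rename", "unlink", "metadata"):
        return kind
    raise ValueError(f"unknown event type: {kind}")

print(to_dcmirror_op({"type": "close"}))  # → sync-file
```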
DCMirror Architecture
The DCMirror architecture is made up of four components: BaseTaskGenerator, IncrementalTaskGenerator, MetaMirrorServer, and FileMirrorServer.
TaskGenerator: Responsible for generating data replication tasks, which will be stored in MySQL
- The BaseTaskGenerator will parse FSImage and create file replication tasks to migrate old data
- The IncrementalTaskGenerator will listen to the inotify event stream to fetch changes on the HDFS cluster, then translate those changes into DCMirror tasks.
MirrorServer: We built two types of MirrorServers — FileMirrorServer for file replication and MetaMirrorServer for metadata replication. We decoupled mirror servers for two reasons:
- File replication executes over a longer time, which may block or slow metadata replication if we combine the two
- File replication is dependent on metadata replication. Decoupling them is beneficial for concurrent control and network utilization improvements.
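The decoupling described above can be sketched as two independent task queues, with metadata replication ordered ahead of the file replication that depends on it. This is a toy illustration; the class and variable names are invented for the example, and the real servers pull tasks from MySQL and run concurrently.

```python
# Toy sketch of decoupled MirrorServers: separate queues mean slow file
# copies never block metadata replication.
from collections import deque

class MirrorServer:
    def __init__(self, name):
        self.name = name
        self.queue = deque()   # stands in for tasks stored in MySQL
        self.done = []

    def submit(self, task):
        self.queue.append(task)

    def run(self):
        while self.queue:
            self.done.append(self.queue.popleft())

meta_server = MirrorServer("MetaMirrorServer")
file_server = MirrorServer("FileMirrorServer")
meta_server.submit("/data (directory metadata)")
file_server.submit("/data/a.log (file copy)")

# Metadata replication goes first, since file replication depends on it.
meta_server.run()
file_server.run()
print(meta_server.done, file_server.done)
```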
DCValidator is a validation tool that tells us whether all directories or files are consistent between two clusters. It also reports the difference between source and target clusters.
There are three steps in DCValidator:
- FSImage parsing stage: This step fetches and parses FSImage from both the source and target clusters, then builds the entire directory hierarchy tree.
- Metadata check: This step compares the metadata of tree nodes recursively without touching the file contents. A report about metadata “diff” will then be generated after this stage.
- Content check: For files with identical metadata between the two clusters, this step will further compare the content in the two clusters by verifying their MD5 checksums. We leverage MapReduce to parallelize this stage.
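The metadata-check and content-check stages can be sketched as follows, assuming each cluster's tree has been flattened into a mapping from path to (size, mtime, md5). The data shapes and function names are assumptions for illustration, and the real content check runs as a MapReduce job rather than a loop.

```python
# Sketch of DCValidator's two compare stages over assumed data shapes.
import hashlib

def metadata_diff(src, dst):
    """Metadata check: flag paths missing on either side or whose
    (size, mtime) metadata differs, without touching file contents."""
    diffs = []
    for path in sorted(set(src) | set(dst)):
        a, b = src.get(path), dst.get(path)
        if a is None or b is None or a[:2] != b[:2]:
            diffs.append(path)
    return diffs

def content_diff(src, dst):
    """Content check: for paths with identical metadata, compare MD5
    checksums of the file contents."""
    return [p for p in src
            if p in dst and src[p][:2] == dst[p][:2] and src[p][2] != dst[p][2]]

md5 = lambda data: hashlib.md5(data).hexdigest()
src = {"/a": (3, 100, md5(b"foo")), "/b": (3, 100, md5(b"bar"))}
dst = {"/a": (3, 100, md5(b"foo")), "/b": (3, 100, md5(b"baz"))}
print(metadata_diff(src, dst))  # → []
print(content_diff(src, dst))   # → ['/b']
```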
Key Learnings and Best Practices
Throughout this process, there were several best practices that we identified.
- Last modified time change rules: Some operators impact parent directories when they change directories or files. For example, the create-file operator changes its parent directory's last modified time. Because of this, we needed to distinguish directories from files when handling data replication. For directories, the last modified time changes when a child is added or deleted. For files, the last modified time changes only when the content changes.
- Capture inotify events from JournalNode: The built-in inotify mechanism in HDFS captures HDFS changes and transforms them into an event stream, but it uses a batch-based event cache, which frequently caused OOM issues for us. To improve this process, we built our own tool to fetch the event stream directly from the JournalNode cluster. Because a JournalNode only retains edit logs for a short period of time, we needed to capture this information as soon as possible, or put it into a message queue to avoid missing data.
- Incremental validation: Validating the MD5 checksum of all files is extremely time consuming, which in turn could cause longer downtime of our Hadoop system during the migration. To optimize the validation process, we introduced incremental validation, where data is validated by time range. When migrating Hadoop to the new data center, we only needed to validate the latest data.
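The incremental-validation idea reduces to selecting only files modified since the last validated checkpoint and re-running the checks on that subset. A minimal sketch, with illustrative data shapes (a path-to-mtime mapping and a checkpoint timestamp):

```python
# Minimal sketch of incremental validation by time range: only files
# modified after the last validated checkpoint are re-validated, so
# repeated runs during the migration window stay cheap.
def incremental_candidates(files, checkpoint_ts):
    """Select files whose modification time falls after the checkpoint."""
    return [path for path, mtime in files.items() if mtime > checkpoint_ts]

files = {"/old.log": 1_600_000_000, "/new.log": 1_700_000_000}
print(incremental_candidates(files, 1_650_000_000))  # → ['/new.log']
```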
In summary, we built DCMirror to handle data replication in a dynamic environment with unpredictable delays. DCMirror also requires no changes to the HDFS kernel and has no Hadoop version restrictions. While challenges do exist when replicating large files, setting up a mirrored instance of Hadoop in our new data center allowed us to migrate the less complex HDFS clusters efficiently.