Large Scale Hadoop Upgrade At Pinterest
Yongjun Zhang | Software Engineer; William Tom | Software Engineer; Shaowen Wang | Software Engineer; Bhavin Pathak | Software Engineer; Batch Processing Platform Team
Pinterest’s Batch Processing Platform, Monarch, consists of more than 30 Hadoop YARN clusters with 17k+ nodes built entirely on top of AWS EC2. At the beginning of 2021, Monarch was still on Hadoop 2.7.1, which was already five years old. Because of the increasing complexity in backporting upstream changes (features and bug fixes), we decided it was time to invest in a version upgrade. We settled on Hadoop 2.10.0, which was the latest release of Hadoop 2 at the time.
This article shares our experience of upgrading Monarch to Hadoop 2.10.0. For simplicity, we use Hadoop 2.10 to refer to Hadoop 2.10.0 and Hadoop 2.7 to refer to Hadoop 2.7.1.
The Challenges
Since the beginning (circa 2016) of Pinterest’s batch processing platform, we have been using Hadoop 2.7. Over time, the workloads our platform processed grew and evolved, and in response we made hundreds of internal changes to address these needs. The majority of these internal patches are Pinterest-specific and required a significant time investment to port to Hadoop 2.10.
The most business critical batch workloads run on Monarch, so our highest priority was to perform the upgrade in such a way that there was no cluster downtime or performance/SLA impact to these workloads.
Upgrade Strategy
Because many user-defined applications are tightly coupled to Hadoop 2.7, we decided to split the upgrade process into two independent stages. The first stage was to upgrade the platform itself from Hadoop 2.7 to Hadoop 2.10, and the second stage was to upgrade user-defined applications to use 2.10.
During the first stage of the upgrade, we would allow users’ jobs to continue using Hadoop 2.7 dependencies while we focused on the platform upgrade. This introduced extra overhead, since we needed to make Hadoop 2.7 jobs compatible with the Hadoop 2.10 platform, but it gave us additional time to work on the second stage.
Because of the scale of the platform and of our user applications, both of the stages mentioned above needed to be done incrementally:
- We would need to upgrade the Monarch clusters one by one
- We would need to upgrade the user applications to bundle with 2.10 instead of 2.7 in batches
At the time, we didn’t have a flexible build pipeline that would allow us to build two separate versions of job artifacts with separate Hadoop direct and transitive dependencies. Likewise, it would have been unreasonable to ask all of our users (other engineers in the company) to individually validate over 10,000 separate job migrations from 2.7 to 2.10. In order to support the incremental upgrade described above, we needed to run many validations ahead of the migration to ensure that the vast majority of applications built with Hadoop 2.7 would continue to work on 2.10 clusters.
The high-level steps we came up with for the upgrade process were:
- Hadoop 2.10 release preparation: port all patches on our internal fork of Hadoop 2.7 to vanilla Apache Hadoop 2.10
- Upgrade Monarch clusters to Hadoop 2.10 incrementally (one by one)
- Upgrade user applications to use Hadoop 2.10 incrementally (in batches)
Hadoop 2.10 Release Preparation
The Pinterest 2.7 release included many internal changes made on top of open source Hadoop 2.7, which needed to be ported to Hadoop 2.10. However, there were significant changes made between Hadoop 2.7 and Hadoop 2.10. Because of this, applying the Pinterest Hadoop 2.7 changes onto vanilla Hadoop 2.10 was a non-trivial task.
Here are a few examples of internal patches we made on Hadoop 2.7 and then ported to Hadoop 2.10:
- Monarch is built on top of EC2 and uses S3 as its persistent storage, so the input and output of a task are typically on S3. We added a DirectOutputFileCommitter that lets tasks write results directly to the target location, avoiding the overhead of copying result files within S3 (a minimal sketch follows this list).
- Added application master and history server endpoints for retrieving the values of a specific counter across all tasks of a given job.
- Added range support when serving container logs, which allows fetching part of a specified container log.
- Added node-ID partitioning for log aggregation, so that logs from different nodes of a cluster can be written to different S3 partitions, which helps avoid hitting S3 access rate limits.
- Added a feature to re-resolve the NameNode RPC socket address on failover, since newly created NameNodes may have different IP addresses.
- Disabled reducer preemption when the ratio of assigned mappers to total mappers exceeds a configured threshold.
- Added a disk usage monitor thread to the ApplicationMaster, so that an application is killed if its disk usage exceeds a configured limit.
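To make the first item more concrete, here is a minimal sketch of what a direct output committer can look like. This is an illustration under assumptions, not Pinterest’s actual DirectOutputFileCommitter: the class below simply points the task work directory at the final S3 destination and turns the commit steps into no-ops, so the usual promotion of _temporary files in S3 is skipped.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;

// Hypothetical sketch: tasks write straight to the final output path on S3,
// so there is nothing to move at task or job commit time.
public class DirectOutputFileCommitter extends FileOutputCommitter {

  private final Path outputPath;

  public DirectOutputFileCommitter(Path outputPath, TaskAttemptContext context)
      throws IOException {
    super(outputPath, context);
    this.outputPath = outputPath;
  }

  @Override
  public Path getWorkPath() throws IOException {
    // Use the final destination as the task work directory instead of a
    // _temporary attempt directory, avoiding the S3 copy at commit time.
    return outputPath;
  }

  @Override
  public void commitTask(TaskAttemptContext context) throws IOException {
    // No-op: task output is already in its final location.
  }

  @Override
  public void commitJob(JobContext context) throws IOException {
    // No-op: nothing to promote. A _SUCCESS marker could be written here.
  }
}
```

The tradeoff of a direct committer is that failed or speculative task attempts can leave partial output behind, so it is only suitable for jobs where that is acceptable.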
Upgrade Monarch clusters to Hadoop 2.10
Cluster Upgrade Approach Exploration
We evaluated multiple approaches for upgrading Monarch clusters to Hadoop 2.10. Each approach has its own pros and cons, which we outline below.
Approach I: Using CCR
As mentioned in the Efficient Resource Management article, we developed cross-cluster routing (CCR) to balance workloads among the various clusters. To minimize the impact to existing 2.7 clusters, one option was to build new Hadoop 2.10 clusters and move workloads to them gradually. If any issues occurred, we could route the workloads back to their original cluster, fix the issues, and then route them to the 2.10 cluster again.
We started with this approach and evaluated it on some small production and dev clusters. There weren’t any major issues, but we discovered some disadvantages:
- We had to build a new parallel cluster for each cluster migration, which becomes costly with large YARN clusters (up to several thousand nodes).
- Workloads need to be migrated in batches, which is time-consuming. Because Monarch is such a big platform, this upgrade process could potentially take a long time to finish.
Approach II: Rolling Upgrade
In theory, we could try rolling upgrades of the worker nodes, but rolling upgrades would potentially impact all workloads on the cluster. If we hit any issues, a rollback would be costly.
Approach III: In-place Upgrade
Leveraging the same approach we use to upgrade a cluster in place from one instance type to another, we:
- Insert several canary hosts of the new instance type as a new autoscaling group (canary ASG) of nodes into a cluster
- Evaluate the canary ASG relative to the base ASG (of the existing instance type)
- Scale out the canary ASG
- Scale in the base ASG
Generally, this works well for small infrastructure-level changes where there is no service-level change. As an exploration, we wanted to see if we could do the same with the Hadoop 2.10 upgrade. One critical assumption we had to make was that Hadoop 2.7 and 2.10 components could communicate with each other compatibly. The steps with this approach were:
- Add Hadoop 2.10 canary worker nodes (which run HDFS DataNodes and YARN NodeManagers) to the Hadoop 2.7 cluster
- Identify and solve issues that arose
- Increase the number of Hadoop 2.10 worker nodes and decrease the number of Hadoop 2.7 worker nodes until the 2.7 nodes are fully replaced with 2.10 nodes
- Upgrade all manager nodes (NameNodes, JournalNodes, ResourceManagers, History Servers, etc.). This works similarly to replacing the worker nodes: we replace them with Hadoop 2.10 nodes.
Before applying this risky approach to a production cluster, we performed extensive evaluation on dev Monarch clusters. It turned out to be a seamless upgrade experience except for some minor issues which we will describe later.
Finalized Upgrade Approach
As mentioned earlier, job artifacts were originally built with Hadoop 2.7 dependencies, which means they may carry Hadoop 2.7 jars into the distributed cache. At runtime, we put the user classpath ahead of the library paths that exist on the cluster. This could cause dependency issues on Hadoop 2.10 canary nodes, since Hadoop 2.7 and 2.10 may depend on different versions of third-party jars.
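For reference, the user-classpath precedence described above is controlled by standard MapReduce settings; the snippet below is only an illustration of the kind of configuration involved, not our exact setup.

```java
import org.apache.hadoop.conf.Configuration;

public class UserClasspathPrecedenceExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Prefer classes from the job's own jars (including distributed-cache jars)
    // over the Hadoop jars installed on the cluster nodes.
    conf.setBoolean("mapreduce.job.user.classpath.first", true);
    // An alternative is mapreduce.job.classloader=true, which isolates job
    // classes in a separate classloader rather than reordering one classpath.
    System.out.println(conf.getBoolean("mapreduce.job.user.classpath.first", false));
  }
}
```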
After using Approach I to upgrade a few small clusters, we determined that it would take too long to upgrade all Monarch clusters this way. Additionally, we could not acquire enough EC2 instances to replace these clusters on such short notice, given the scale of our largest Monarch clusters (up to 3k nodes!). We weighed the pros and cons and decided to go with Approach III, since it could speed up the upgrade process significantly and most of the dependency issues could be addressed quickly. If we couldn’t solve a problem quickly for some jobs, we could use CCR to route them to another Hadoop 2.7 cluster and then take the time to fix the issues.
Issues and Solutions
After we settled on Approach III, our main focus became identifying any issues and addressing them as quickly as possible. Broadly speaking, we ran into three categories of issues: service-level issues due to incompatibilities between Hadoop 2.7 and Hadoop 2.10, dependency issues in user-defined applications, and miscellaneous other issues.
Incompatible behavior issues
- Restarting a Hadoop 2.10 NodeManager caused containers to get killed. We found that Hadoop 2.10 introduced a new config, yarn.nodemanager.recovery.supervised, which defaults to false. To prevent containers from being killed when restarting NMs, we needed to set it to true. When this configuration is enabled, a running NodeManager will not try to clean up containers as it exits, on the assumption that it will be immediately restarted and will recover its containers.
- Jobs got stuck when the AM was scheduled on a 2.10 canary node: the application priority support added in MAPREDUCE-6515 assumes this field is always set in the protobuf response. That is not the case in a mixed-version cluster (2.7.1 ResourceManager + 2.10 worker), since the response returned by the RM will not contain the appPriority field. Our fix checks whether the field is present in the protobuf and, if not, skips updating applicationPriority (see the sketch after this list).
- HADOOP-13680 changed fs.s3a.readahead.range to use getLongBytes as of Hadoop 2.8, which supports values in formats like “32M” (memory suffixes K, M, G, T, P). However, Hadoop 2.7 code can’t handle this format, which broke jobs in the mixed Hadoop version cluster. We added a fix to Hadoop 2.7 to make it compatible with the Hadoop 2.10 behavior (see the example after this list).
- Hadoop 2.10 accidentally introduced spaces between the multiple values of the io.serializations config, which caused ClassNotFoundException errors. We made a fix to remove the spaces from the configured value.
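Below is a minimal sketch of the defensive check described in the second item above. The class and field names are assumed for illustration; this is not the exact patch.

```java
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Priority;

// Hypothetical fragment of the MapReduce AM's RM-communicator logic.
public class ApplicationPriorityUpdateExample {
  private Priority applicationPriority = Priority.newInstance(0);

  void onAllocateResponse(AllocateResponse response) {
    Priority appPriority = response.getApplicationPriority();
    // A Hadoop 2.7.1 ResourceManager never sets this field, so we only update
    // our copy when it is actually present; otherwise keep the previous value.
    if (appPriority != null) {
      this.applicationPriority = appPriority;
    }
  }
}
```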
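And here is a small illustration of the readahead issue from the third item: the same value parses cleanly with the suffix-aware getLongBytes used by Hadoop 2.8+, but plain long parsing (as in pre-2.8 code paths) rejects it. The property name is real; the class below is just an example and assumes a Hadoop 2.8+ client on the classpath.

```java
import org.apache.hadoop.conf.Configuration;

public class ReadaheadRangeExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("fs.s3a.readahead.range", "32M");

    // Hadoop 2.8+ (HADOOP-13680): suffixed values are understood.
    System.out.println(conf.getLongBytes("fs.s3a.readahead.range", 65536)); // 33554432

    // Pre-2.8 code paths read such values with getLong(), which relies on
    // Long.parseLong and therefore throws NumberFormatException on "32M".
    try {
      conf.getLong("fs.s3a.readahead.range", 65536);
    } catch (NumberFormatException e) {
      System.out.println("2.7-style parsing fails: " + e);
    }
  }
}
```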
Dependency Issues
As we performed the in-place Hadoop 2.7 to 2.10 upgrade, the majority of dependency issues we faced were due to different versions of dependencies shared between Hadoop services and user applications. The solution was either to modify users’ jobs to be compatible with the Hadoop platform dependencies, or to shade the conflicting dependency within our job artifacts or the Hadoop platform distribution. Here are some examples:
- Hadoop 2.7 jars were put into the distributed cache and caused dependency issues on Hadoop 2.10 canary nodes. We implemented a change in our Hadoop 2.7 release to prevent those jars from being added to the distributed cache, so that all hosts use the Hadoop jars already deployed to them.
- Woodstox-core package: Hadoop 2.10.0 depends on woodstox-core-5.0.3.jar, whereas some applications depend on another module that in turn depends on wstx-asl-3.2.7.jar. The incompatibility between woodstox-core-5.0.3.jar and wstx-asl-3.2.7.jar caused job failures. Our solution was to shade woodstox-core-5.0.3.jar in Hadoop 2.10.
- We have some internal libraries and classes that were implemented against Hadoop 2.7 and cannot run on Hadoop 2.10. For example, we have a class named S3DoubleWrite that writes output to two S3 locations at the same time; it was developed to help us migrate logs between S3 buckets. Since we no longer needed that class, we deprecated it to resolve the dependency issue.
- Some Hadoop 2.7 libraries were packaged into users’ Bazel jars and caused dependency issues at runtime. The solution we took was to decouple user applications from Hadoop jars; more details can be found in the relevant sections below.
Miscellaneous other issues
- One of the validations we performed on dev clusters was to ensure that we could roll back midway through the upgrade process. One issue came up when we tried to roll back the NameNode to Hadoop 2.7: the NameNode did not receive block reports from the upgraded DataNodes. The workaround we identified was to manually trigger block reports. We later found the likely cause, HDFS-12749 (DN may not send block reports to NN after NN restart), and backported the fix.
- When Hadoop streaming jobs bundled with Hadoop 2.7 jars were deployed to Hadoop 2.10 nodes, the expected 2.7 jars were not available. This is because, for the majority of user artifacts, we rely on cluster-provided jars to satisfy Hadoop dependencies in order to cut down artifact size, and all of the Hadoop dependencies had versions encoded in their jar names. The solution was to have Hadoop streaming jobs bundle Hadoop jars without the version string, so the provided Hadoop dependencies were always on the classpath at runtime regardless of whether the node runs Hadoop 2.7 or 2.10.
Upgrade User Applications to Hadoop 2.10
To upgrade user applications to Hadoop 2.10, we needed to ensure that Hadoop 2.10 was used at both compile time and runtime. The first step was to make sure Hadoop 2.7 jars were not shipped with user jars, so that the provided Hadoop jars deployed to the cluster were used at runtime (2.7 jars on 2.7 nodes, and 2.10 jars on 2.10 nodes). Then we changed the user application build environment to use Hadoop 2.10 instead of 2.7.
Decouple user applications from Hadoop jars
At Pinterest, the majority of data pipelines use fat jars built by Bazel. Before the upgrade, these jars contained all dependencies, including the Hadoop 2.7 client libraries. We always honor classes from those fat jars over classes in the local environment, which means that when running those fat jars on Hadoop 2.10 clusters, we would still use Hadoop 2.7 classes.
To solve this issue (using 2.7 jars on a 2.10 cluster) permanently, we decided to decouple users’ Bazel jars from Hadoop libraries; that is, we no longer ship Hadoop jars in the fat user Bazel jar, and the Hadoop jars already deployed to the cluster nodes are used at runtime.
Bazel java_binary rules have an argument named deploy_env, whose value is a list of other java_binary targets that represent the deployment environment for this binary. We set this attribute to exclude all Hadoop dependencies and their sub-dependencies from user jars. The challenge here was that many user applications had common dependencies on libraries that Hadoop also depends on. These common libraries were tricky to identify because they were never explicitly specified: they were already provided on the Hadoop workers themselves as part of the NodeManager deployment. During testing, we put in a lot of effort to identify these cases and modified the users’ Bazel rules to declare those hidden dependencies explicitly.
Upgrade Hadoop bazel targets from 2.7 to 2.10
After decoupling user applications from Hadoop jars, we needed to upgrade the Hadoop Bazel targets from 2.7 to 2.10 so that the Hadoop versions used in the build and runtime environments are consistent. We again hit some dependency conflicts between Hadoop 2.7 and Hadoop 2.10 in this process. We identified those dependencies through our build tests and upgraded them to the correct versions accordingly.
Summary
Upgrading 17k+ nodes from one Hadoop version to another without significant disruption to the applications has its challenges. We managed to do it with quality, reasonable speed, and cost efficiency. We hope the experience we shared above can be beneficial to the community.
Acknowledgement
Thanks to Ang Zhang, Hengzhe Guo, Sandeep Kumar, Bogdan Pisica, Connell Donaghy from the Batch Processing Platform team who helped a lot through the whole upgrade process. Thanks to Soam Acharya, Keith Regier for working on issues hit in FGAC clusters. Thanks to Jooseong Kim, Evan Li and Chunyan Wang for the support along the way. Thanks to the workflow team, the query team and our platform user teams for their support.
To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To view and apply to open opportunities, visit our Careers page.