Lianghong Xu | Software engineer, Storage and caching team
Pinterest has one of the largest HBase production deployments in the industry. HBase is one of the foundational building blocks of our infrastructure and powers many of our critical services including our graph database (Zen), our general-purpose key-value store (UMS), our time series database and several others. Despite its high availability, we periodically backup our production HBase clusters onto S3 for disaster recovery purposes.
In this post we’ll cover how we dramatically improved HBase backup efficiency by both removing the intermediate HDFS cluster from the backup path and by using an offline deduplication tool to eliminate duplicate snapshot files. Ultimately this simplified the backup pipeline, cut the end-to-end backup time in half, lowered operational overhead and reduced S3 storage usage by up to two orders of magnitude.
Before we dive in, we wanted to note that for historical reasons, the backup pipeline consisted of two steps:
- Exporting HBase table snapshots and write ahead logs (WALs) to a dedicated backup HDFS cluster.
- Uploading data from the backup cluster to S3. As the amount of data (on the order of PBs) grows over time, the storage cost on S3 and the backup cluster continues to increase.
Simplifying the backup pipeline
When we first built the backup pipeline based on HBase version 0.94, there were no existing tools to directly export HBase snapshots to S3. The only supported method was to export snapshots to a HDFS cluster. This was the main reason why we originally adopted the two-step approach. However, as the system evolved, the dedicated HDFS cluster became the single point of failure and caused much operational overhead. As a result, we had to keep adding machines to accommodate the ever increasing volume of data.
Recently we completed a HBase upgrade from version 0.94 to 1.2. Along with numerous bug fixes and performance improvements, the new version of HBase comes with native support to directly export table snapshots to S3. Taking this opportunity, we optimized our backup pipeline by removing the HDFS cluster from the backup path. In addition, we created a tool called PinDedup which asynchronously deduplicates redundant snapshot files on S3 across backup cycles (described later) to reduce our S3 footprint.
Challenge and approach
One major challenge we encountered in the migration was minimizing its impact on production HBase clusters since they serve online requests. Table export is done using a MapReduce job similar to distcp. To increase the upload throughput, we use the S3A client with the fast upload option. During the experiments process, we observed that direct S3 upload tends to be very CPU-intensive, especially for transferring large files such as HFiles. This happens when a large file is broken down into multiple chunks, each of which needs to be hashed and signed before being uploaded. If we use more threads than the number of cores on the machines, the regionserver performing the upload will be saturated and could crash. To mitigate this problem, we constrain the maximum number of concurrent threads and Yarn containers per host, so that the maximum CPU overhead caused by backup is under 30 percent.
Removing duplicate files on S3
Large files rarely change
The idea of deduplicating HBase snapshots is inspired by the observation that large HFiles often remain unchanged across backup cycles. While incremental updates are merged with minor compactions, large HFiles that account for most storage usage are only merged during a major compaction. As a result, adjacent backup dates usually contain many duplicate large HFiles, especially for read-heavy HBase clusters.
Based on this observation, we proposed and implemented a simple file-level deduplication tool called PinDedup. It asynchronously checks for duplicate S3 files across adjacent backup cycles and replaces older files with references. To make deduplication more effective, we tuned our major compaction policy to be less aggressive and to only trigger when necessary.
Despite the simplicity of PinDedup, there were a few design choices we had to consider.
File- vs. chunk-level deduplication. The file-level dedup approach used in PinDedup provides good enough compression ratio. Variable-sized chunking is also implemented, but it only brings marginal benefits with increased system complexity. This is primarily because during major compaction, merged changes (although small in total data volume) spread across the entire file and modify most chunks.
Online vs. offline deduplication. Online deduplication means deduplication happens before data is uploaded to S3. This could potentially avoid transferring large files onto S3 when a duplicate is present. We chose offline deduplication because it allows us to control when deduplication occurs. Since client teams often use the latest snapshots for offline analysis, we could delay the deduplication until the analysis jobs are finished.
Keep the latest copy unchanged. When identifying two duplicate files, one important question is whether to replace the older or newer file with a reference. We chose the former, because the latest files are much more likely to be accessed. We call this approach “forward dedup chain.” Compared to its counterpart, this method incurs no overhead when accessing the latest data while avoiding the dangling pointer problem when old snapshots are deleted due to retention.
The backup pipeline was upgraded smoothly without causing any incidents (thankfully!). Removing the HDFS backup cluster reduces the operational overhead and shrinks end-to-end backup time by around 50 percent. PinDedup was rolled out to all HBase backups on S3 and achieved a storage reduction from 3 to 137X. The two combined led to significant infrastructure cost savings.
Acknowledgements: Many thanks to Tian-ying Chang and Chiyoung Seo for the support and technical guidance, Chengjin Liang and Laura Biester for knowledge transfer and code reviews, Andrew Dunlap for helping set up test environments, and the other members in the storage and caching team for valuable feedbacks during the design and implementation.