Open-sourcing Terrapin: A serving system for batch generated data

Sep 14, 2015

Varun Sharma | Pinterest engineer, Infrastructure

We employ Hadoop batch jobs extensively, from building recommendations to computing features and models for machine learning to joining various pieces of data around Pinners, Pins and Boards. These data sets are generated by batch ETL jobs and serve as the foundation for recommendations, machine learning and several products powering content discovery on Pinterest.

Today at the Facebook @scale conference, we announced the open-sourcing of Terrapin, a serving system for handling large data sets generated by Hadoop jobs. Terrapin provides low latency random key-value access over such large data sets, which are immutable and (re)generated in their entirety. Terrapin can ingest data from S3, HDFS or directly from a MapReduce job, and is elastic, fault tolerant and performant enough to be used for various online applications at Pinterest, such as Pinnability and discovery data.

Status quo

Previously, we used Apache HBase for serving batch generated data. We found that writing directly to HBase was expensive, slow and only worked well for smaller data sets (10s of GBs). We also tried using the bulk upload feature to directly upload HFiles from our compute Hadoop clusters to our serving HBase cluster. While the data transfer was substantially faster, the system had no data locality (i.e. the data for an HBase region was distributed all over the cluster rather than being colocated on the region server). Without data locality, latency suffered, particularly at the 90th and 99th percentiles. We had to run compactions to restore data locality and to garbage collect previously uploaded data. These compactions added nontrivial load to the cluster and also affected latency.

To work around these issues, we built a system that would pull HFiles generated by MapReduce jobs from S3 and distribute them to a pool of servers. It had some nice features, such as live swaps and ease of changing the number of MapReduce shards from one version of a data set to another. The system was fast, since data was always accessed locally, but we lost the elasticity, fault tolerance and ease of operation that came with HBase and HDFS. The system was also hard to scale.

From prior experience, we realized that if we could somehow marry data locality with HDFS, we could achieve all our requirements for elasticity, fault tolerance and performance. On top of that, Hadoop ETL jobs can load data into HDFS fairly quickly, thus keeping the running time of our Hadoop pipelines under control.

Architecture

Unlike HBase, which simply treats HDFS as a distributed file system, Terrapin takes the distribution of HDFS blocks into account and brings up corresponding serving shards on the same nodes that physically store the HDFS blocks. This is possible since Terrapin serves immutable data sets that can only be periodically regenerated through batch jobs.
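
To make the locality idea concrete, here is a minimal sketch, using the standard Hadoop FileSystem API, of how a controller-style process can discover which data nodes physically hold each HFile. The fileset path and directory layout are hypothetical; this only illustrates the mechanism, not Terrapin’s controller code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShardLocalitySketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Hypothetical layout: one HFile per shard under a fileset directory.
    Path fileset = new Path("/terrapin/filesets/example_fileset/latest");

    for (FileStatus hfile : fs.listStatus(fileset)) {
      // If the files were written with a large HDFS block size (the post
      // mentions 4G), each HFile occupies a single block, so each file maps
      // to one set of hosts.
      BlockLocation[] blocks = fs.getFileBlockLocations(hfile, 0, hfile.getLen());
      for (BlockLocation block : blocks) {
        System.out.printf("%s -> %s%n",
            hfile.getPath().getName(), String.join(", ", block.getHosts()));
      }
    }
  }
}
```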

The above figure illustrates the various components of Terrapin. A Hadoop job writes data in the form of HFiles to a Terrapin cluster, backed by HDFS. The Terrapin cluster consists of the following:

  • A ZooKeeper quorum stores the cluster state, propagates it to the client and helps drive cluster coordination.
  • A Terrapin server process runs on each HDFS data node. The Terrapin server is responsible for serving key-value lookups against HFiles and interacts with ZooKeeper to receive notifications as to which HFiles it needs to serve.
  • A Terrapin controller periodically queries the current HDFS block locations for each HFile, computes the appropriate notifications to be sent to Terrapin servers and writes them to ZooKeeper. The controller ensures data locality as nodes get added to or removed from the cluster or as HDFS rebalances data. It is also responsible for performing live swaps and garbage collecting older versions. Note: we write HFiles with a larger HDFS block size of 4G so that each HFile spans only one HDFS block. We haven’t found any issues with using the larger block size.
  • The Java client library reads the sharding information from ZooKeeper and routes key-value lookups appropriately. The sharding information tells the client which Terrapin server should be queried for keys stored in a particular HFile shard (a schematic of this routing appears below). Since there can be multiple replicas of an HFile shard, the Java client also retries failed reads on another replica.

In this design, the Terrapin controller drives the system toward 100 percent data locality, and Terrapin achieves low latencies while fully exploiting the ease of operation, scalability and elasticity of HDFS and Hadoop.
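
The repository ships a Java client library; the fragment below is only a schematic of the routing described above, with made-up types (ShardingInfo, TerrapinServerStub) standing in for whatever the real client uses: map the key to an HFile shard, query a server holding a replica of that shard, and retry another replica on failure.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical types: a snapshot of the sharding info read from ZooKeeper,
// plus a thin RPC stub per Terrapin server. Names are illustrative only.
interface ShardingInfo {
  int numShards(String fileset);
  List<TerrapinServerStub> replicasFor(String fileset, int shard);
}

interface TerrapinServerStub {
  byte[] get(String fileset, int shard, byte[] key) throws Exception;
}

class RoutingSketch {
  private final ShardingInfo sharding;

  RoutingSketch(ShardingInfo sharding) {
    this.sharding = sharding;
  }

  byte[] lookup(String fileset, byte[] key) throws Exception {
    // Map the key to the HFile shard that stores it (illustrative hash; the
    // actual mapping depends on how the data was partitioned when generated).
    int shard = Math.floorMod(Arrays.hashCode(key), sharding.numShards(fileset));

    Exception lastFailure = null;
    // Try each replica of the shard, falling back to another replica on failure.
    for (TerrapinServerStub server : sharding.replicasFor(fileset, shard)) {
      try {
        return server.get(fileset, shard, key);
      } catch (Exception e) {
        lastFailure = e;
      }
    }
    throw lastFailure != null
        ? lastFailure
        : new IllegalStateException("No replicas found for shard " + shard);
  }
}
```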

Design choices

Our design choices enabled us to satisfy our requirements by providing the right building blocks and saving precious engineering cycles.

  • We chose HDFS as the underlying storage for its elasticity, fault tolerance, ease of operation and tight integration with MapReduce.
  • We chose HFiles as the file format since we’ve had considerable success with HBase for online serving. It’s easy to consume and generate HFiles through Hadoop jobs.
  • We use Apache Helix for ZooKeeper-based cluster coordination. Apache Helix is used by many companies for managing stateful services (a small sketch of its admin API follows this list).
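
For context on what Helix provides, the snippet below shows the general shape of its admin API: a cluster is registered in ZooKeeper, a resource is declared with a number of partitions (think HFile shards) and a state model, and Helix then records and rebalances the partition-to-node mapping that participants and clients read back. This is generic Helix usage under an assumed OnlineOffline state model, with made-up cluster and resource names, not Terrapin’s actual controller code.

```java
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.StateModelDefinition;
import org.apache.helix.tools.StateModelConfigGenerator;

public class HelixSetupSketch {
  public static void main(String[] args) {
    // Assumed ZooKeeper address plus hypothetical cluster/resource names.
    ZKHelixAdmin admin = new ZKHelixAdmin("zookeeper:2181");

    admin.addCluster("terrapin-demo");
    admin.addStateModelDef("terrapin-demo", "OnlineOffline",
        new StateModelDefinition(
            new StateModelConfigGenerator().generateConfigForOnlineOffline()));

    // One Helix partition per HFile shard; Helix writes the target
    // partition-to-node mapping into ZooKeeper.
    admin.addResource("terrapin-demo", "example_fileset", 8, "OnlineOffline");
    admin.rebalance("terrapin-demo", "example_fileset", 2);
  }
}
```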

Terrapin offers the following features:

  • Filesets: Data on a Terrapin cluster is namespaced into “filesets.” New data is loaded/reloaded into a fileset.
  • Live swap & multiple versions: New data is loaded into an existing fileset through a live swap, with no service disruption. Terrapin can keep multiple versions of critical filesets, which enables quick rollback in case of a bad data load.
  • S3/HDFS/Hadoop/Hive: A Hadoop job can write data directly to Terrapin. Alternatively, the job can write data to HDFS or S3, and Terrapin can ingest it in a subsequent step. Terrapin can also ingest Hive tables and provide random access on a designated key column.
  • Easy to change number of output shards: It’s easy to change the number of output shards across different data loads for the same fileset, which gives developers the flexibility to tune their MapReduce jobs by tweaking the number of reducers.
  • Extensible serving/storage formats: It’s possible to plug in other (more efficient) serving formats, such as RocksDB .sst files.
  • Monitoring: Terrapin exports latency and value size quantiles as well as cluster health stats through an HTTP interface.
  • Speculative execution: Terrapin comes with a client abstraction that can issue requests against two Terrapin clusters serving the same fileset and use whichever response arrives first. This is effective for increasing availability and reducing latency (a plain-Java sketch of the idea follows below).
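
The dual-cluster client abstraction is part of the Terrapin code; the sketch below only illustrates the speculative-read idea in plain Java, using CompletableFuture and a made-up per-cluster lookup function: fire the same lookup at both clusters and return whichever answer comes back first.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.BiFunction;

public class SpeculativeReadSketch {
  // Hypothetical per-cluster lookup: (fileset, key) -> value.
  static byte[] speculativeGet(BiFunction<String, byte[], byte[]> clusterA,
                               BiFunction<String, byte[], byte[]> clusterB,
                               String fileset, byte[] key,
                               ExecutorService pool) {
    CompletableFuture<byte[]> a =
        CompletableFuture.supplyAsync(() -> clusterA.apply(fileset, key), pool);
    CompletableFuture<byte[]> b =
        CompletableFuture.supplyAsync(() -> clusterB.apply(fileset, key), pool);

    // Return whichever cluster responds first; the slower response is dropped.
    // A production version would also fall back to the other cluster if the
    // first response is an error.
    return (byte[]) CompletableFuture.anyOf(a, b).join();
  }
}
```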

Take Terrapin for a spin!

Terrapin has been running in production at Pinterest for over a year now. Several systems, including our web/mobile frontend, stream processing systems and middleware services, query our Terrapin clusters. Our MapReduce pipelines dump several terabytes of data into Terrapin every day. We’ve been running Terrapin across hundreds of nodes on EC2, serving a cumulative 1.5 million QPS with a server-side p99 latency of under 5ms. Today, our Terrapin deployment stores ~180TB of data, split across ~100 filesets and over 50,000 files.

You can now access the source code, along with setup and usage instructions, for your own use. If you have any questions or comments, reach us at terrapin-users@googlegroups.com.

Acknowledgements: This work is a joint effort by Jian Fang, Varun Sharma and two software engineering interns, Connell Donaghy and Jason Jong. We’d also like to thank the Discovery team for being early adopters of this technology and for their help with hardening it in production.
