Powering big data at Pinterest
Mohammad Shahangian | Pinterest Head of Data Science
Big data plays a big role at Pinterest. With more than 30 billion Pins in the system, we’re building the most comprehensive collection of interests online. One of the challenges associated with building a personalized discovery engine is scaling our data infrastructure to traverse the interest graph to extract context and intent for each Pin.
We currently log 20 terabytes of new data each day, and have around 10 petabytes of data in S3. We use Hadoop to process this data, which enables us to put the most relevant and recent content in front of Pinners through features such as Related Pins, Guided Search, and image processing. It also powers thousands of daily metrics and allows us to put every user-facing change through rigorous experimentation and analysis.
In order to build big data applications quickly, we’ve evolved our single cluster Hadoop infrastructure into a ubiquitous self-serving platform.
Building a self-serve platform for Hadoop
Though Hadoop is a powerful processing and storage system, it’s not a plug and play technology. Because it doesn’t have cloud or elastic computing, or non-technical users in mind, its original design falls short as a self-serve platform. Fortunately there are many Hadoop libraries/applications and service providers that offer solutions to these limitations. Before choosing from these solutions, we mapped out our Hadoop setup requirements.
1. Isolated multitenancy: MapReduce has many applications with very different software requirements and configurations. Developers should be able to customize their jobs without impacting other users’ jobs.
2. Elasticity: Batch processing often requires burst capacity to support experimental development and backfills. In an ideal setup, you could ramp up to multi-thousand node clusters and scale back down without any interruptions or data loss.
3. Multi-cluster support: While it’s possible to scale a single Hadoop cluster horizontally, we’ve found that a) getting perfect isolation/elasticity can be difficult to achieve and b) business requirements such as privacy, security and cost allocation make it more practical to support multiple clusters.
4. Support for ephemeral clusters: Users should be able to spawn clusters and leave them up for as long as they need. Clusters should spawn in a reasonable amount of time and come with full blown support for all Hadoop jobs without manual configuration.
5. Easy software package deployment: We need to provide developers simple interfaces to several layers of customization from the OS and Hadoop layers to job specific scripts.
6. Shared data store: Regardless of the cluster, it should be possible to access data produced by other clusters
7. Access control layer: Just like any other service oriented system, you need to be able to add and modify access quickly (i.e. not SSH keys). Ideally, you could integrate with an existing identity (e.g. via OAUTH).
Tradeoffs and implementation
Once we had our requirements down, we chose from a wide range of home-brewed, open source and proprietary solutions to meet each requirement.
Decoupling compute and storage: Traditional MapReduce leverages data locality to make processing faster. In practice, we’ve found network I/O (we use S3) is not much slower than disk I/O. By paying the marginal overhead of network I/O and separating computation from storage, many of our requirements for a self-serve Hadoop platform became much easier to achieve. For example, multi-cluster support was easy because we no longer needed to worry about loading or synchronizing data, instead any existing or future clusters can make use of the data across a single shared file system. Not having to worry about data meant easier operations because we could perform a hard reset or abandon a problematic cluster for another cluster without losing any work. It also meant that we could use spot nodes and pay a significantly lower price for compute power without having to worry about losing any persistent data.
Centralized Hive metastore as the source of truth: We chose Hive for most of our Hadoop jobs primarily because the SQL interface is simple and familiar to people across the industry. Over time, we found Hive had the added benefit of using metastore as a data catalog for all Hadoop jobs. Much like other SQL tools, it provides functionality such as “show tables”, “describe table” and “show partitions.” This interface is much cleaner than listing files in a directory to determine what output exists, and is also much faster and consistent because it’s backed by a MySQL database. This is particularly important since we rely on S3, which is slow at listing files, doesn’t support moves and has eventual consistency issues.
We orchestrate all our jobs (whether Hive, Cascading, HadoopStreaming or otherwise) in such a way that they keep the HiveMetastore consistent with what data exists on disk. This makes is possible to update data on disk across multiple clusters and workflows without having to worry about any consumer getting partial data.
Multi-layered package/configuration staging: Hadoop applications vary drastically and each application may have a unique set of requirements and dependencies. We needed an approach that’s flexible enough to balance customizability and ease of setup/speed.
We took a three layered approach to managing dependencies and ultimately cut the time it takes to spawn and invoke a job on a thousand node cluster from 45 minutes to as little as five.
1. Baked AMIs:
For dependencies that are large and take a while to install, we preinstall them on the image. Examples of this are Hadoop Libraries and a NLP library package we needed for internationalization. We refer to this process as “baking an AMI.” Unfortunately, this approach isn’t available across many Hadoop service providers.
2. Automated Configuration (Masterless Puppet):
The majority of our customization is managed by Puppet. During the bootstrap stage, our cluster installs and configures Puppet on every node and, within a matter of minutes, Puppet keeps all our nodes with all of the dependencies we specify within our Puppet configurations.
Puppet had one major limitation for our use case: when we add new nodes to our production systems, they simultaneously contact the Puppet master to pull down new configurations and often overwhelm the master node, causing several failure scenarios. To get around this single point of failure, we made Puppet clients “masterless,” by allowing them to pull their configuration from S3 and set up a service that’s responsible for keeping S3 configurations in sync with the Puppet master.
3. Runtime Staging (on S3): Most of the customization that happens between MapReduce jobs involves jars, job configurations and custom code. Developers need to be able to modify these dependencies in their development environment and make them available on any one of our Hadoop clusters without affecting other jobs. To balance flexibility, speed and isolation, we created an isolated working directory for each developer on S3. Now, when a job is executed, a working directory is created for each developer and its dependencies are pulled down directly from S3.
Executor abstraction layer
Early on, we used Amazon’s Elastic MapReduce to run all of our Hadoop jobs. EMR played well with S3 and Spot Instances, and was generally reliable. As we scaled to a few hundred nodes, EMR became less stable and we started running into limitations of EMR’s proprietary versions of Hive. We had already built so many applications on top of EMR that it was hard for us to migrate to a new system. We also didn’t know what we wanted to switch to because some of the nuances of EMR had creeped into the actual job logic. In order to experiment with other flavors of Hadoop, we implemented an executor interface and moved all the EMR specific logic into the EMRExecutor. The interface implements a handful of methods such as “run_raw_hive_query(query_str)” and “run_java_job(class_path)”. This gave us the flexibility to experiment with a few flavors of Hadoop and Hadoop service providers, while enabling us to do a gradual migration with minimal downtime.
Deciding on Qubole
We ultimately migrated our Hadoop jobs to Qubole, a rising player in the Hadoop as a Service space. Given that EMR had become unstable at our scale, we had to quickly move to a provider that played well with AWS (specifically, spot instances) and S3. Qubole supported AWS/S3 and was relatively easy to get started on. After vetting Qubole and comparing its performance against alternatives (including managed clusters), we decided to go with Qubole for a few reasons:
1) Horizontally scalable to 1000s of nodes on a single cluster
2) Responsive 24/7 data infrastructure engineering support
3) Tight integration with Hive
4) Google OAUTH ACL and a Hive Web UI for non-technical users
5) API for simplified executor abstraction layer + multi-cluster support
6) Baked AMI customization (available with premium support)
7) Advanced support for spot instances — with support for 100% spot instance clusters
8) S3 eventual consistency protection
9) Graceful cluster scaling and autoscaling
Overall, Qubole has been a huge win for us, and we’ve been very impressed by the Qubole team’s expertise and implementation. Over the last year, Qubole has proven to be stable at Petabyte scale and has given us 30%-60% higher throughput than EMR. It’s also made it extremely easy to onboard non-technical users.
Where we are today
With our current setup, Hadoop is a flexible service that’s adopted across the organization with minimal operational overhead. We have over 100 regular Mapreduce users running over 2,000 jobs each day through Qubole’s web interface, ad-hoc jobs and scheduled workflows.
We have six standing Hadoop clusters comprised of over 3,000 nodes, and developers can choose to spawn their own Hadoop cluster within minutes. We generate over 20 billion log messages and process nearly a petabyte of data with Hadoop each day.
We’re also experimenting with managed Hadoop clusters, including Hadoop 2, but for now, using cloud services such as S3 and Qubole is the right choice for us because they free us up from the operational overhead of Hadoop and allow us to focus our engineering efforts on big data applications.
If you’re interested in working with us on big data, join our team!
Acknowledgements: Thanks to Dmitry Chechik, Pawel Garbacki, Jie Li, Chunyan Wang, Mao Ye and the rest of the Data Infrastructure team for their contributions.
Mohammad Shahangian is a data engineer at Pinterest.