Building a new experiment pipeline with Spark
Tien Nguyen | Pinterest engineer, Data
At Pinterest we rely heavily on A/B experiments to make decisions about products and features. Every day we aim to have experiment results ready by 10 a.m. so we can make fast and well-grounded decisions. With more than 1,000 experiments running daily, crunching billions of records for more than 175 million Pinners, we need a reliable pipeline to support our growth and achieve our service-level agreement. In this post, we’ll discuss how we revamped our legacy experiment pipeline to speed up computation and make it more scalable and performant.
The legacy pipeline
Pinterest has always been a data-driven company. To support our decision-making processes, we put great effort into consolidating, building and maintaining a dedicated and reliable experiment pipeline. As we’ve quickly grown over the years, the old Hive-based pipeline reached its designed life expectancy and developed several disadvantages:
- Longer computation time. The number of monthly active users jumped from 100 million in 2015 to 150 million in 2016 and 175 million today. Moreover, the number of active experiments doubled in the same period, with more metrics to process. These increases led to many bottlenecks and longer computation times.
- Job redundancy. Because of the size of our data, we need to parallelize work by dividing it into smaller jobs. Dividing jobs led to unavoidable code duplication in our pipeline, making it harder to debug and maintain. The more jobs, the more dependencies, and thus the more potential delays.
To support our long-term growth, we had to address these shortcomings and build a new pipeline.
The new pipeline
As we designed the new pipeline, we wanted to not only address the previously mentioned challenges, but also support Pinterest’s growth in years to come. Specifically, we wanted to simplify the logic, reduce our storage footprint and easily support our users’ requests.
To that end, we embraced Spark as our solution. (Check out this post, which describes the components of our experiment framework.) A Spark workflow fits nicely into our framework. The figure below shows:
- Kafka for the log transport layer
- Pinball for orchestrating our Spark workflow
- Spark Streaming for real-time experiment group validation
- HBase and MemSQL to power the dashboard backend
- Presto for interactive analysis
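To illustrate the real-time experiment group validation component, here’s a minimal sketch of the per-record check, written as plain Python rather than actual Spark Streaming code. The record fields, function name and config shape are all hypothetical stand-ins, not Pinterest’s real schema:

```python
# Hypothetical sketch: validate that a logged (experiment, group) activation
# matches the currently configured experiment groups. In production this kind
# of check would run inside a Spark Streaming job consuming Kafka logs.

def validate_activation(record, experiment_config):
    """Return True if the record's experiment is active and its group is valid."""
    valid_groups = experiment_config.get(record["experiment"])
    if valid_groups is None:
        return False  # experiment is not currently active
    return record["group"] in valid_groups

# Hypothetical running config: experiment name -> set of valid groups
config = {"new_home_feed": {"control", "enabled"}}

ok = validate_activation({"experiment": "new_home_feed", "group": "enabled"}, config)
bad = validate_activation({"experiment": "retired_exp", "group": "control"}, config)
# ok is True; bad is False (experiment no longer active)
```

In a streaming job, records failing this check would be flagged so experimenters can catch misconfigured group assignments quickly instead of waiting for the daily batch results.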
With Spark, we also gain the following advantages:
- Faster execution time. For our particular needs, we found that given the same input data, our implementation in Spark is often faster than the legacy pipeline. The figure below shows the average execution time of a new job is well below two hours, where before the same job would have taken more than four hours. We achieve better running times by simplifying our pipeline to take advantage of Spark’s in-memory execution. Moreover, with Spark we can tune job parameters (such as the number of executors and Java memory overhead) based on the resources needed by the job itself.
- Job abstraction. With Spark, we can build an experiment analytics framework that can be extended in the future, so we can abstract jobs and pass the appropriate parameters when needed. Job abstraction helps avoid duplicated code and reduces the number of jobs in the pipeline to just a handful. We designed and implemented our pipeline with a mindset that our jobs need to be:
- extensible and generic enough so other teams at Pinterest can extend and add new metrics to the experiment dashboard.
- scalable to process, store and track a large amount of data generated by ever-increasing number of experiments, users and metrics.
- able to serve metrics via dashboards in batch/real-time with high quality, providing the optimal statistical test modules.
- Reduced storage footprint. As our data increases, we wanted to reduce our storage footprint. We use parquet format (which is native to Spark) to save costs and space.
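To make the job-abstraction idea concrete, here’s a minimal sketch of the pattern: one generic, parameterized job class that new metrics plug into, instead of one near-duplicate job per metric. All class, field and parameter names are hypothetical, and the aggregation below is a plain-Python stand-in for what would be a Spark aggregation:

```python
# Hypothetical sketch of a parameterized job hierarchy. In the real pipeline,
# a workflow manager such as Pinball would schedule run() on a Spark cluster.
from abc import ABC, abstractmethod

class ExperimentJob(ABC):
    """Base class shared by all pipeline jobs; resources are tunable per job."""
    def __init__(self, date, num_executors=50):
        self.date = date
        self.num_executors = num_executors  # example of a per-job tuning knob

    @abstractmethod
    def run(self, records):
        ...

class MetricAggregationJob(ExperimentJob):
    """One generic job, instantiated per metric, rather than one job per metric."""
    def __init__(self, date, metric, **kwargs):
        super().__init__(date, **kwargs)
        self.metric = metric

    def run(self, records):
        # Sum the chosen metric per (experiment, group); a stand-in for a
        # distributed groupBy/sum in Spark.
        totals = {}
        for r in records:
            key = (r["experiment"], r["group"])
            totals[key] = totals.get(key, 0) + r.get(self.metric, 0)
        return totals

job = MetricAggregationJob("2017-07-01", metric="repins")
result = job.run([
    {"experiment": "exp1", "group": "control", "repins": 2},
    {"experiment": "exp1", "group": "control", "repins": 3},
    {"experiment": "exp1", "group": "enabled", "repins": 4},
])
# result: {("exp1", "control"): 5, ("exp1", "enabled"): 4}
```

Under this pattern, another team adding a new metric to the dashboard would instantiate (or lightly subclass) the generic job with new parameters rather than writing and scheduling a new job from scratch.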
With our new pipeline, we now see huge time savings in job execution. On average, we gain at least three hours (as shown in the figure below), so experiment results are available before our colleagues start their workday.
The figure above shows the delivery time of the two pipelines. The black line is the legacy pipeline, and the blue line is the new Spark pipeline. The y-axis is the delivery time in Pacific Time (e.g., 15 means the pipeline finished at 3 p.m.). The new pipeline almost always finishes before 8 a.m., though there are still occasional spikes caused by delays in upstream jobs.
We’re excited to see the tremendous benefits provided by our new Spark pipeline. If you’re interested in building a great service that impacts all of Pinterest, join us!
Acknowledgements: This project is a collaboration of Tien Nguyen, Shuo Xiang, Jooseong Kim and Bryant Xiao from the Data Analytics team. People across the whole company helped launch this feature with their insights and feedback.