Mohak Nahta | Pinterest engineer, Product Engineering
A few months back, we started building a way for businesses to create and showcase their latest and favorite content on their profiles. We also wanted to make it easier for businesses to understand how their Pins are performing. If a business can quickly get useful insights about their content, they can better understand the ideas Pinners are looking for. That’s why we built Pin stats — near-real-time analytics for Pins created by businesses. In this post we’ll cover how we developed Pin stats and the challenges of logging, processing and aggregating billions of events in near real-time.
There were two main problems we wanted to solve with Pin stats.
- Near-real-time insights: Businesses have told us they’d like to see how well their Pins are doing within the first few hours of publishing them. One of the biggest challenges in building Pin stats was processing tens of billions of events in a way that would allow us to cut down the analytics delivery time to two hours, an 18x decrease.
- Canonicalization: Every time someone saves a Pin to Pinterest, we log it as a separate instance. Before, we only gave businesses analytics for their instance of that Pin. This meant businesses never got the full view of how their content was performing. With Pin stats, now all the different instances of a Pin are aggregated into a canonical stat so businesses can see the full impact of their content on Pinterest.
The first part of the project involved real-time logging of all events on Pins originally owned by businesses and sending them to Apache Kafka — the log transport layer used at Pinterest. This was challenging because we get hundreds of thousands of events of all kinds every second.
In our case, we only wanted to log events pertaining to businesses (i.e. only log impressions on a Pin originally created by a business). Due to our strict requirement of surfacing Pin stats in under two hours, we not only had to log events in real-time but also filter them before logging. We couldn’t afford to filter events offline, because it would take hours to sift all events and extract only those related to businesses.
At the same time, online filtering is extremely expensive because it can involve various network calls and increase the latency of the front-end logging endpoint. We implemented various optimizations and heuristics to minimize the number of network calls necessary. This ensured we only made network calls after we were fairly positive the event belonged to a Pin originally created by a business. This reduced the burden on our front-end logging endpoint by several factors.
The high volume of events meant we also needed to process them in an extremely efficient way. That’s why we segment the new Pin stats by three different time range aggregations — hourly (sliding 24 hour window), last seven days and last 30 days.
For the hourly segment, we needed to process the data separately in order to surface it to businesses as soon as possible. We achieved this by having two different pipelines emanating from our Kafka topic, one handling hourly data and and the other handling daily data. At the same time, we ensure both pipelines are reliable and consistent with each other through various rules to avoid data inconsistencies. The former is a data ingestion pipeline that creates hourly tables of logged events in the past hour (approximately four billion events/hour), which we then process and aggregate using efficient MapReduce jobs. This means we’re able to persist data to our storage every hour and have data workflows that run by the hour to aggregate analytics on a very granular time-level. The latter pipeline generates a daily table (approximately 100 billion events/day) that’s processed and verified more thoroughly due to the longer SLA.
After logging, processing, verifying and aggregating tens of billions of events in a short span of time, we needed a low-latency storage solution capable of handling our extremely large data sets. We decided to use Terrapin — our in-house low-latency serving system for handling large data sets. It met our requirements of being elastic, fault tolerant and able to ingest data directly from Amazon S3.
During the process, we learned many invaluable lessons. One of the main challenges was to build a data pipeline that can support a real-time stream of hundreds of thousands of events every second in a way that’s reliable and scalable as Pinterest grows. Particularly, it was difficult to have our divergent pipelines work in a way that they are both consistent with data and reliable.
Another big challenge was filtering the high volume of events each second so we don’t populate our pipelines with data we’ll never process. This was complicated, because any sort of filtering usually requires network calls that ultimately slow down logging. To solve for this, we use various signals beforehand to ensure a network call is necessary.
Finally, we learned a lot while choosing the approach — whether to spend months building a truly real-time pipeline or creating a system that can serve data in a under two hours. We prototyped a real-time analytics service to see if it was possible given the current infrastructure, and ultimately decided on a near-real time system in to order to ship the experience more quickly.
Our next steps will be to build a truly real-time system that surfaces Pin stats within seconds of a Pin being uploaded. We also hope to provide additional metrics so that businesses can create better and more actionable Pins.
Acknowledgements: This project would not have been possible without the help and support of engineers across different teams at Pinterest. In particular, I would like to extend special thanks to Ryan Shih, Andrew Chun, Derek Tia, David West, Daniel Mejia, David Temple, Gordon Chen, Jian Fang, Jon Parise, Rajesh Bhatia, Sam Meder, Shawn Nguyen, Shirley Gaw, Tamara Louie, Tian Li, Tiffany Black, Weiran Liu, and Yining Wang.