Krishna Gade | Pinterest engineering manager, Data
As thousands of people gather in the Bay Area this week for Strata + Hadoop World, we wanted to share how data-driven decision making is in our company's DNA. Most recently, we built a real-time data pipeline to ingest data into MemSQL using Spark Streaming, as well as a highly scalable infrastructure that collects, stores and processes user engagement data in real time. Along the way, we solved challenges to achieve:
- Higher performance event logging
- Reliable log transport and storage
- Faster query execution on real-time data
Let’s take a deeper look.
Higher performance event logging
We developed a high-performance logging agent called Singer, which is deployed on all of our application servers. Applications write their event logs to local disk, where Singer collects and parses the log files and ships them to a central log repository. Singer is built with at-least-once delivery semantics and works well with Kafka, which acts as our log transport system.
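The core of at-least-once delivery is that the agent only advances its checkpoint after the transport acknowledges a message, so a crash can cause duplicates but never silent loss. The sketch below illustrates that idea in miniature; the class and field names are hypothetical and this is not Singer's actual implementation.

```python
import os

# Minimal at-least-once log shipping in the style of Singer (illustrative
# only). The agent tails a log file and persists a byte-offset checkpoint
# only after the transport confirms delivery of each line.

class LogShipper:
    def __init__(self, log_path, checkpoint_path, transport):
        self.log_path = log_path
        self.checkpoint_path = checkpoint_path
        self.transport = transport  # anything with a send(msg) -> bool method

    def _load_checkpoint(self):
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path) as f:
                return int(f.read().strip() or 0)
        return 0

    def _save_checkpoint(self, offset):
        with open(self.checkpoint_path, "w") as f:
            f.write(str(offset))

    def ship_once(self):
        """Ship all complete lines past the checkpoint; return count shipped."""
        offset = self._load_checkpoint()
        shipped = 0
        with open(self.log_path, "rb") as f:
            f.seek(offset)
            while True:
                line = f.readline()
                if not line.endswith(b"\n"):
                    break  # EOF, or a partial line still being written
                if self.transport.send(line.rstrip(b"\n").decode("utf-8")):
                    offset = f.tell()
                    self._save_checkpoint(offset)  # advance only after ack
                    shipped += 1
                else:
                    break  # retry from the same offset on the next pass
        return shipped
```

If the process dies between a successful `send` and the checkpoint write, the line is re-sent on restart, which is exactly the duplicate-tolerant trade-off at-least-once semantics makes.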
Reliable log transport and storage
Apache Kafka, a high-throughput message bus, forms our log transport layer. We chose Kafka because it comes with many desirable features, including support for high-volume event streams, replicated durability and low-latency at-least-once delivery. Once event logs are delivered to Kafka, a variety of consumers built on Storm, Spark and other custom log readers process these events in real time. One of these consumers is a log persistence service called Secor that reliably writes the events to Amazon S3. Secor was originally built to save logs produced by our monetization pipeline, where zero data loss was critical. It reads event logs from Kafka and writes them to S3, working around S3's weak eventual-consistency model. From there, our self-serve big data platform loads the data from S3 into many different Hadoop clusters for batch processing.
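One way a Kafka-to-S3 persister can stay safe under retries is to derive each output object's key deterministically from the topic, partition and starting offset of the batch, so replaying the same Kafka range after a failure overwrites the same key instead of producing duplicates. The sketch below shows that idea with a dict standing in for S3; the function and key format are illustrative, not Secor's actual code.

```python
# Deterministic, retry-safe output naming for a Kafka-to-S3 persister
# (illustrative sketch; not Secor's actual implementation).

def output_key(topic, partition, start_offset):
    """Deterministic object key for a batch of messages from one partition."""
    return f"{topic}/{partition:02d}/{start_offset:020d}.log"

def flush_batch(store, topic, partition, messages):
    """Write a batch to the store; safe to retry because the key is stable.

    `messages` is a list of (offset, payload) pairs from one partition;
    `store` is any dict-like object standing in for S3 here.
    """
    if not messages:
        return None
    start_offset = messages[0][0]
    key = output_key(topic, partition, start_offset)
    store[key] = "\n".join(payload for _, payload in messages)
    return key
```

Because the key depends only on where the batch starts in the Kafka log, a consumer that crashes mid-upload and replays the batch writes the identical object rather than a duplicate.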
Spark + MemSQL Integration
While Kafka allows events to be consumed in real time, it's not a great interface for a human to ask questions of the real-time data. We wanted to enable running SQL queries on events as they arrive. Since MemSQL was built for exactly this purpose, we built a real-time data pipeline to ingest data into MemSQL using Spark Streaming. The pipeline is in a prototyping phase as we continue to work with the MemSQL team on productionizing it.
Figure 1 shows the elements of the real-time analytics platform we’ve described so far. We’ve been running the Singer -> Kafka -> Secor -> S3 pipeline in production for a few months. Currently, we’re evaluating the Spark -> MemSQL integration by building a prototype where we feed the Pin engagement data into a Kafka topic. The data in Kafka is consumed by a Spark streaming job.
In this job, each Pin engagement event is filtered and then enriched with geolocation and Pin category information. The enriched data is persisted to MemSQL using the MemSQL Spark connector and made available for query serving. The goal of this prototype is to test whether MemSQL can enable our analysts to use familiar SQL to explore real-time data and derive interesting insights.
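The per-event transformation inside such a streaming job can be thought of as a filter step followed by two lookups. The sketch below uses hypothetical field names and in-memory dicts standing in for the geolocation and Pin-category lookups; the real job's schema and lookup services aren't shown here.

```python
# Sketch of the filter-and-enrich step a streaming job might apply to each
# Pin engagement event (hypothetical field names; illustrative only).

GEO_BY_IP = {"203.0.113.7": "US"}      # stand-in for a geolocation lookup
CATEGORY_BY_PIN = {"pin_42": "food"}   # stand-in for a Pin-category lookup

def enrich(event):
    """Return an enriched copy of a valid engagement event, else None."""
    if not event.get("pin_id") or not event.get("user_id"):
        return None  # filter out malformed events
    enriched = dict(event)
    enriched["country"] = GEO_BY_IP.get(event.get("ip"), "unknown")
    enriched["category"] = CATEGORY_BY_PIN.get(event["pin_id"], "unknown")
    return enriched
```

In the actual pipeline, the output of this step is what lands in MemSQL, so analysts can group and filter by the enriched columns (for example, engagement counts by country or category) using ordinary SQL.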
While we continue to evaluate MemSQL, we'll be showcasing a demo with the MemSQL team at Strata + Hadoop World 2015 on Thursday, February 19 at the San Jose Convention Center. Come visit us at the MemSQL booth (#1015) for more details.
Demo built with Spark & MemSQL
Krishna Gade is an engineering manager on the Data team.