Pinterest Trends: Insights into unstructured data

Pinterest Engineering
Pinterest Engineering Blog
Apr 1, 2016

Stephanie Rogers | Pinterest engineer, Discovery

What topics are Pinners interested in? When are they most engaged with these topics? How are they engaging with those topics? To answer these questions, we built an internal web service that visualizes unstructured data and helps us better understand timely trends we can resurface to Pinners through the product. The tool shows the most popular Pins, as well as time series trends of keywords in Pins and searches.

The tool helps us understand which topics Pinners are interested in, when that interest typically occurs and how Pinners engage with those topics. To answer the question of when, we plot keyword volume over time, which makes seasonality and trends easier to spot. The most powerful insights, however, come from understanding Pinner behavior through top Pins.

For example, with a simple search for a holiday, like “Valentine’s day,” we can see that interest starts to rise about two months before February 14. But interest in the keyword alone wasn’t enough; we wanted to determine when one should start promoting different types of products. We saw that male Pinners were looking at products towards the beginning of the peak. These were forward-thinking individuals, looking for gifts that would have to be preordered. Approximately 2–3 days before the holiday, male Pinners were primarily looking at DIY crafts and baked goods, things that didn’t require much time or could be bought at the convenience store the night before. And finally, on Valentine’s Day itself, we saw a lot of humorous memes about being lonely. We were able to find these engagement trends in a matter of seconds.

Figure: Male Pinning trends leading up to Valentine’s Day. Panels: January 2015 (Products); Early February 2015 (DIY & Baked Goods); February 14, 2015 (Lonely Memes).

Motivation

A core part of any solution for keyword trends is the ability to perform full-text search over attributes. While MapReduce is good for querying structured content around Pins, it’s slow at answering queries that need full-text search. ElasticSearch, on the other hand, is a distributed full-text search engine.

By indexing the unstructured data around Pins (such as description, title and interest) with ElasticSearch, we produced a tool that processes full-text queries in real time and visualizes trends and related Pins in a user-friendly way. At a high level, the tool offers a keyword search over Pin descriptions and search queries to:

  • Find the top N Pins or search queries with the given keyword
  • Show and compare time series trends, including daily repin and search volume

Additionally, the tool filters keyword volume by various segments including location, gender, interests, categories and time.
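
As a rough illustration of the kind of request this implies, here is a minimal sketch that builds a search body for one keyword: top N matching Pins, a per-day volume histogram and an optional segment filter. The field names (`description`, `created_at`, `gender`, `repin_count`) and the function name are assumptions made for this example, not the tool's actual schema, and the body uses current Elasticsearch query DSL syntax, which may differ from the version the tool was built on.

```python
from datetime import date


def build_trend_query(keyword, start, end, gender=None, top_n=10):
    """Build a search body (a plain dict) for one keyword.

    All field names below are hypothetical placeholders.
    """
    # Restrict results to the requested time window.
    filters = [
        {"range": {"created_at": {"gte": start.isoformat(),
                                  "lte": end.isoformat()}}}
    ]
    # Optional segment filter, for example by gender.
    if gender is not None:
        filters.append({"term": {"gender": gender}})

    return {
        # Top N Pins matching the keyword...
        "size": top_n,
        "query": {
            "bool": {
                "must": [{"match_phrase": {"description": keyword}}],
                "filter": filters,
            }
        },
        # ...ranked by an assumed engagement counter.
        "sort": [{"repin_count": {"order": "desc"}}],
        # Daily document counts for the time series chart.
        "aggs": {
            "daily_volume": {
                "date_histogram": {"field": "created_at",
                                   "calendar_interval": "day"}
            }
        },
    }


body = build_trend_query("valentine's day", date(2015, 1, 1),
                         date(2015, 2, 14), gender="male", top_n=5)
```

The resulting dict could be passed as the request body of a search call; comparing two keywords would simply mean issuing one such query per keyword.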

Implementation

  1. Extract all text associated with Pins
  2. Insert Pin text into ElasticSearch
  3. Index text data (ElasticSearch does this for us)
  4. Build a service to call ElasticSearch API on the application backend
  5. Visualize data on the application frontend using Flask and ReactJS

Challenges

Data Collection
Gathering all of the text related to a Pin, including description, title, tagged interests, categories and timestamps, as well as Pinner demographics, requires complex logic that scales. A nightly Pinball workflow runs a series of Hive and Cascading jobs (both MapReduce-based frameworks) that extract all text associated with the previous day’s Pins and dump it into our ElasticSearch clusters, which then index it.
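
To make the final "dump" step concrete, the sketch below serializes a batch of extracted records into the newline-delimited JSON format that the Elasticsearch Bulk API accepts (one action line followed by one document line per record). The record fields, the `pin_id` key and the index name are hypothetical; the actual output format of the Hive and Cascading jobs is not described here.

```python
import json


def to_bulk_body(records, index_name):
    """Serialize records as a Bulk API request body (NDJSON).

    `records` is an iterable of dicts; each must carry a "pin_id" key,
    which is an assumed field used here as the document id.
    """
    lines = []
    for rec in records:
        # Action line: index this document into the given index.
        lines.append(json.dumps(
            {"index": {"_index": index_name, "_id": rec["pin_id"]}}))
        # Source line: the document itself.
        lines.append(json.dumps(rec, sort_keys=True))
    # The Bulk API requires every line, including the last, to end in "\n".
    return "".join(line + "\n" for line in lines)


# Hypothetical records for one day's extract.
sample = [
    {"pin_id": "1", "description": "DIY valentine card", "gender": "male"},
    {"pin_id": "2", "description": "heart shaped cookies", "gender": "female"},
]
payload = to_bulk_body(sample, "pins-2015.02.12")
```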

Design

A major design decision was to use daily indexes (one index per day), since many high-volume time-series projects, including Logstash, do this by default. Using daily indexes had several benefits for the scalability and performance of our entire system, including:

  • Increased flexibility in specifying time ranges.
  • Faster reads as a result of well-distributed documents among various nodes.
  • Minimized number of indexes involved in each query to avoid associated overhead.
  • Bulk insertion or bulk reads through parallel calls.
  • Easier recovery after failure.
  • Easier tuning of cluster properties (number of shards, replication, etc.). Smaller indexes let us iterate faster when testing properties that are fixed at index creation, such as shard count.
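
One consequence of the one-index-per-day layout is that a time range in a query maps directly onto a list of index names, so a query only touches the days it needs. The sketch below shows that mapping, assuming a Logstash-style `prefix-YYYY.MM.DD` naming scheme; the `pins` prefix and the function name are illustrative, not the names used in production.

```python
from datetime import date, timedelta


def indexes_for_range(start, end, prefix="pins"):
    """Return the daily index names covering start..end, inclusive.

    The "<prefix>-YYYY.MM.DD" pattern is an assumed naming convention.
    """
    if end < start:
        raise ValueError("end must not be before start")
    names = []
    day = start
    while day <= end:
        names.append(f"{prefix}-{day:%Y.%m.%d}")
        day += timedelta(days=1)
    return names


# Three days around Valentine's Day 2015.
targets = indexes_for_range(date(2015, 2, 12), date(2015, 2, 14))
# A search can then address exactly these indexes, for example by
# joining the names with commas in the request path.
index_path = ",".join(targets)
```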

Scalability

Despite using big data technologies, we faced various scalability challenges with our workflows. There was simply too much data to run simple Hive queries, so we optimized our Hive query settings, switched to Cascading jobs and made trade-offs on implementation choices.

With more than 14GB of data daily and around two years’ worth of data stored thus far (around 10TB of data total), a bigger scalability issue came from our ElasticSearch clusters. We have had to continuously scale our clusters by adding more nodes. Today we have 33 i2.2xlarge search nodes and 3 m3.2xlarge master nodes. Although replication isn’t needed to protect against data loss, since ES isn’t the primary persistent storage, we still decided to use a replication factor of 1 (meaning there are two copies of all data) to spread read load across multiple servers.
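
The figures above can be checked with some back-of-the-envelope arithmetic. The sketch below is only an estimate: it assumes exactly 14 GB per day for 730 days, decimal units (1 TB = 1000 GB), data spread evenly across the search nodes, and it treats the stated data volume as the stored size, which the on-disk index size may not match.

```python
def cluster_storage_estimate(gb_per_day, days, replicas, data_nodes):
    """Rough storage estimate for a cluster of daily indexes."""
    primary_gb = gb_per_day * days           # one copy of the data
    total_gb = primary_gb * (1 + replicas)   # primaries plus replica copies
    return {
        "primary_tb": primary_gb / 1000,
        "total_tb": total_gb / 1000,
        "per_node_gb": total_gb / data_nodes,
    }


estimate = cluster_storage_estimate(gb_per_day=14, days=730,
                                    replicas=1, data_nodes=33)
```

Under these assumptions, one copy of the data comes to about 10.2 TB, which is consistent with the "around 10TB" figure; with one replica the cluster holds about 20.4 TB, or roughly 619 GB per search node.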

Performance

After launching our prototype, we saw a lot of room for improvement in application performance, especially as the number of users grew. We switched from raw HTTP requests to the ElasticSearch Python client and optimized the ElasticSearch query code in our service, which led to a 2x performance increase. We also implemented server-side and client-side caching, so that frequent queries return results almost instantly. The end result of all of these optimizations is sub-two-second queries for users.
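
The details of the caching layer are not given here, but the server-side idea can be sketched as a small time-to-live cache keyed by the query parameters: a repeated query within the TTL is served from memory instead of hitting the cluster. The class name, the one-hour TTL and the key shape are all assumptions for illustration, and `run_query` is a stand-in for the real ElasticSearch call.

```python
import time


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested
        self._entries = {}         # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]          # fresh entry: skip the expensive call
        value = compute()          # miss or expired: recompute and store
        self._entries[key] = (now + self._ttl, value)
        return value


def run_query(keyword):
    # Placeholder for the real search request (hypothetical).
    return {"keyword": keyword, "top_pins": []}


cache = TTLCache(ttl_seconds=3600)
key = ("valentine's day", "2015-01-01", "2015-02-14", "male")
first = cache.get_or_compute(key, lambda: run_query(key[0]))
second = cache.get_or_compute(key, lambda: run_query(key[0]))  # cached
```

Since the underlying data only changes with the nightly load, a fairly long TTL seems reasonable in a setup like this, though the right value depends on how the indexes are refreshed.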

Outcomes

The tool has been a tremendous success. It’s widely used internally to derive Pinner insights, highlight popular content and even detect spam.

If you’re interested in working on large-scale data processing and analytics challenges like this one, join our team!

Acknowledgements: This project is a joint effort across multiple teams inside Pinterest. Various teams provided insightful feedback and suggestions. Major engineering contributors include Stephanie Rogers, Justin Mejorada-Pier, Chunyan Wang and the rest of the Data Engineering team.
