Adventures in big data wonderland: Going down the Pinterest Path

Pinterest Engineering
Pinterest Engineering Blog
Aug 16, 2019

Tamara Louie | Data Scientist, Discovery

Have you ever started looking for plants on Pinterest, only to end up shopping for pillows instead? Ever find yourself layers of Pins deep into finding a new idea, not sure of how you got there or where you started?

When Pinners are in an exploration mindset, they might not be sure what they want until they see it. They could start with a general idea, explore adjacent interests, drill down to a more specific idea, or pivot to an entirely different focus.

One way this behavior manifests is through a “Pinterest Path,” a term for describing a chain of clicks on Pins (“close-ups”) within a session. For example, you might start searching for a hiking backpack, switch to looking at inspiration for a hike in Machu Picchu, and then see ideas for Peruvian food. Not only did you find a hiking backpack to buy, but now you are also inspired and making travel and food plans. Oftentimes, Pinterest Paths help connect ideas that might not be obviously similar, thereby helping Pinners discover new ideas.

How do I enter a Pinterest Path?

Pinterest is a discovery engine that connects ideas across a taste graph, so for every Pin on Pinterest, there are Related Pins (Pins that are visually and semantically similar to that Pin), which we are always working to keep fresh.

Related Pins lets you browse idea after idea, each related in some way to an original Pin. A Pinterest Path begins in Related Pins: the recommendations you see under “More Like This” when you click on a Pin (see Figure 1 below).

Figure 1. Example of a Related Pins session.

How does a Pinterest Path play out?

An example of a Pinterest Path is featured in Figure 2 (below).

  1. The Pinner has clicked a Pin with plants.
  2. The Pinner sees other ideas in the resulting “Related Pins” section below and decides to click a Pin of a living room with a similar color theme and overall aesthetic to the original Pin.
  3. The Pinner then finds interest in a specific pillow in the living room and finds that exact pillow in the Related Pins section.

Figure 2. Example of a Pinterest Path.

Pinners can go hundreds of layers deep on the Pinterest Path, exploring for hours looking through thousands of Pins, finding inspiration along the way.

Constructing Pinterest Paths as a graph

When thinking about how to visualize and construct these complex Pinterest Paths, it can be useful to think of a Pinterest Path as a graph. The nodes in this graph are individual Pins, and the edges are the close-up actions that occur between one Pin and another. Constructing Pinterest Paths as a graph proved to be a bit tricky, given how the underlying Pin interaction data was stored.
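To make the graph framing concrete, here is a minimal Python sketch. It assumes each close-up is available as a (source Pin, clicked Pin) pair; the function name `build_path_graph` and the Pin labels are hypothetical and only loosely mirror the plants, living room, and pillow example above.

```python
def build_path_graph(closeups):
    """Build an adjacency list: each Pin maps to the Pins clicked from it."""
    graph = {}
    for source_pin, clicked_pin in closeups:
        # Edge: a close-up on clicked_pin that started from source_pin.
        graph.setdefault(source_pin, []).append(clicked_pin)
        # Make sure Pins with no outgoing clicks still appear as nodes.
        graph.setdefault(clicked_pin, [])
    return graph


# Hypothetical close-up pairs for illustration only.
closeups = [("plants", "living_room"), ("living_room", "pillow")]
graph = build_path_graph(closeups)
```

In this representation, a Pin with an empty list is the end of a chain, and a Pin that never appears as a clicked Pin is a candidate entry point.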

It can be hard to create graphs directly from data in relational databases

We often access our very large data sources in the form of processed Hive tables. The information we are interested in is stored across many rows of a Hive table, and the rows that represent a single Pinner’s session don’t store information identifying the first Pin in a Pinterest Path. Thus, it can be difficult to attribute all of the different rows to the same original Pinterest Path Pin (see Figure 3 below).

Figure 3. Example of a Pinterest Path stored in Hive. In this case, it would be difficult to know in Hive that row 3 was part of the same Pinterest Path, originating from Pin A.

We wanted to avoid requiring new logging be added to the session data, so we needed to figure out how to use Hive to efficiently aggregate the information needed to identify and quantify Pinterest Paths. This would require identifying all of the rows in the Hive data that stemmed from an original Pin and creating summary information on the depth and breadth of each Pinterest Path. While straightforward to describe, that type of operation can be very awkward and inefficient in SQL.
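Outside of SQL, the same aggregation is a short graph traversal. The sketch below is an illustration under stated assumptions, not the production logic: each row is assumed to be a (parent Pin, clicked Pin) pair with `None` as the parent of an entry Pin, "depth" is taken to mean the longest chain of close-ups from the entry Pin, and "breadth" the largest number of Pins at any one level. The post does not define these statistics precisely, so these definitions are assumptions.

```python
from collections import defaultdict


def summarize_paths(rows):
    """Group rows under their entry Pin and compute simple path statistics.

    rows: iterable of (parent_pin, clicked_pin); parent_pin is None
    for the Pin that starts a path.
    """
    children = defaultdict(list)
    roots = []
    for parent_pin, clicked_pin in rows:
        if parent_pin is None:
            roots.append(clicked_pin)
        else:
            children[parent_pin].append(clicked_pin)

    stats = {}
    for root in roots:
        depth, breadth = 0, 1
        visited = {root}
        level = [root]
        # Breadth-first walk: every Pin reached here belongs to this root.
        while level:
            next_level = []
            for pin in level:
                for child in children.get(pin, []):
                    if child not in visited:  # guard against repeated Pins
                        visited.add(child)
                        next_level.append(child)
            if next_level:
                depth += 1
                breadth = max(breadth, len(next_level))
            level = next_level
        stats[root] = {"depth": depth, "breadth": breadth, "pins": len(visited)}
    return stats


# Pin C is only linked to Pin B in its row, yet it is attributed to entry Pin A.
rows = [(None, "A"), ("A", "B"), ("B", "C"), ("A", "D")]
stats = summarize_paths(rows)
```

The traversal needs all of a session's rows in memory at once, which is natural in a script but awkward to express as joins over a table.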

Creating a solution using a Python map-reduce script

The solution was to create a custom Python map-reduce script in Hive to read in the Pinterest Path data, process it, and write aggregated network data to another Hive table.

Figure 4. Example of using a Python map-reduce script in Hive for a Pinterest Path example.
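For readers without access to the figure, here is a rough sketch of what such a reduce script could look like. Hive streams rows to the script as tab-separated lines on standard input, with NULL written as `\N`, and reads tab-separated lines back from standard output. The column layout (session, parent Pin, clicked Pin), the output columns, and all function names are assumptions made for illustration. The sketch also assumes that rows for a session arrive contiguously and that each Pin is clicked at most once per session.

```python
import sys
from itertools import groupby


def parse(line):
    """Split one tab-separated input row; Hive writes NULL as the string \\N."""
    session, parent_pin, clicked_pin = line.rstrip("\n").split("\t")
    return session, (None if parent_pin == r"\N" else parent_pin), clicked_pin


def chain_length(pin, parent_of):
    """Count close-up steps from `pin` back to its entry Pin."""
    steps, seen = 0, {pin}
    while parent_of.get(pin) is not None and parent_of[pin] not in seen:
        pin = parent_of[pin]
        seen.add(pin)
        steps += 1
    return steps


def reduce_sessions(lines):
    """Yield one output row per session: session, Pin count, maximum depth."""
    records = (parse(line) for line in lines if line.strip())
    # groupby only merges adjacent rows, so a session's rows must be contiguous.
    for session, group in groupby(records, key=lambda r: r[0]):
        parent_of = {clicked: parent for _, parent, clicked in group}
        max_depth = max(chain_length(pin, parent_of) for pin in parent_of)
        yield f"{session}\t{len(parent_of)}\t{max_depth}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point when the file is run as a streaming script."""
    for row in reduce_sessions(stdin):
        stdout.write(row + "\n")
```

When saved as a script for Hive to invoke, the file would end with a call to `main()`. Keeping the logic in `reduce_sessions` makes it easy to exercise with a plain list of strings before running it against a real table.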

We used the Apache Hive documentation as a resource. Figure 5 (below) shows an example of the Hive syntax used to call the custom Python map-reduce script.

Figure 5. Example Hive syntax to utilize custom Python map-reduce script.

Identifying and fixing bugs when scaling to production

When testing with individual examples, our first solution worked fine, outputting the expected Pinterest Path statistics. However, the output was incorrect when running on production data.

One reason we hit this problem in production is that the custom reduce operation needs all rows from the same session to arrive at the same reducer. If the rows for a given session were split across reducers, the result could be two or more entries for that session (one from each reducer), with each entry underestimating the final network statistics.
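The failure mode is easy to reproduce in a small simulation. The sketch below is a toy model rather than Hive itself: it assigns rows to a fixed number of "reducers" with a pluggable assignment function and runs a simple per-session row count on each one. The names `count_per_reducer` and `by_session`, and the sample rows, are hypothetical.

```python
from collections import Counter, defaultdict


def count_per_reducer(rows, num_reducers, assign):
    """Count rows per session independently on each simulated reducer."""
    buckets = defaultdict(list)
    for i, (session, _pin) in enumerate(rows):
        buckets[assign(i, session) % num_reducers].append(session)
    output = []
    for reducer_id in sorted(buckets):
        # Each reducer only sees its own bucket, like a real reduce task.
        output.extend(sorted(Counter(buckets[reducer_id]).items()))
    return output


def by_session(i, session):
    """Deterministic stand-in for hashing the session key."""
    return sum(session.encode())


rows = [("s1", "A"), ("s1", "B"), ("s1", "C"), ("s2", "X")]

# Rows distributed round-robin, ignoring which session they belong to.
scattered = count_per_reducer(rows, 2, lambda i, session: i)

# Rows routed by session, so each session lands on exactly one reducer.
clustered = count_per_reducer(rows, 2, by_session)
```

With round-robin assignment, session `s1` appears twice in the output with counts of 2 and 1, neither of which is the true total of 3. Routing by session yields a single, correct entry per session.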

Adding a CLUSTER BY clause in Hive solved the issue by controlling which groupings of rows are sent to each reducer. With this change, the production output matched what we had observed in offline testing, which allowed us to safely launch this work as a production data job that processes millions of rows of data every day to find Pinterest Paths.

Figure 6. Example of using CLUSTER BY syntax to send rows from the same session to the same reducers.

The final Hive query functioned like Figure 7 below.

Figure 7. Example Hive syntax to utilize custom Python map-reduce script, incorporating CLUSTER BY syntax. In this case, all rows with the same value of the field “session” are sent to the same reducer.
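Since the figure is not reproduced here, the following is a hedged reconstruction of the general shape such a query might take, wrapped in a small Python helper so the table and script names stay parameterized. It uses Hive's `ADD FILE`, `SELECT TRANSFORM ... USING ... AS`, and `CLUSTER BY` constructs. The table names, column names, and script name are all hypothetical, and the actual production query may differ.

```python
def build_path_query(source_table, target_table, script_name):
    """Return HiveQL that streams session-clustered rows through a script."""
    return f"""
ADD FILE {script_name};

INSERT OVERWRITE TABLE {target_table}
SELECT TRANSFORM (session, parent_pin, clicked_pin)
  USING 'python {script_name}'
  AS (session, num_pins, max_depth)
FROM (
  SELECT session, parent_pin, clicked_pin
  FROM {source_table}
  CLUSTER BY session  -- same session => same reducer, rows kept together
) clustered_rows;
""".strip()


# Hypothetical names for illustration only.
query = build_path_query("closeups", "path_stats", "path_reducer.py")
```

The important detail is that `CLUSTER BY session` sits in the subquery that feeds the script, so the rows are distributed and sorted by session before the script ever sees them.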

This work highlighted the importance of verifying that data queries behave the same in production as they do on the example data used during development, as well as the challenges that can arise when building complex network statistics from Hive data.

Now that we have solved these data processing problems and productionized our workflow, we are able to better understand the tastes of Pinners who go down Pinterest Paths. Some common entry points into Pinterest Paths are from DIY, home decor, and women’s fashion Pins. Pinners can spend hours in these Pinterest Paths, going hundreds of levels deep into a single topic or across multiple topics, discovering new ideas along the way.

The very nature of Pinterest is to bring everyone the inspiration to create a life they love, allowing people to explore until they find the best answer for them. Whether it is jumping from idea to idea in Related Pins, narrowing in on a recommendation from a Lens visual search, or browsing similar Pins in the new “more ideas” tab, we are continuing to build ways to help people find the best ideas for them, whether right away or by venturing into a Pinterest Path of their own.

Acknowledgments: I would like to thank the Related Pins team, as well as Grace Huang and Dan Frankowski, for their extensive discussions, guidance, and help in making this project a reality.
