When is data science a house of cards?

Pinterest Engineering
Pinterest Engineering Blog
Oct 7, 2016


June Andrews | Pinterest engineer, Data Science

As data scientists, when we reach an answer, we often communicate that answer and move on. But what happens when there are multiple data scientists with varying answers? The expense of replicating and testing the quality of work often leaves critical business challenges unstaffed. At Pinterest, we lowered the cost of replication to the point that we could afford to run an experiment. So we did. We asked nine data scientists and machine learning engineers the same question, in the same setting, on the same day. We received nine different results.

Reducing the costs of data science

To efficiently replicate results nine times, we used a new method of iterative supervised clustering. It’s phenomenally easy to grok and comes with a three-step Python notebook with preloaded data. It makes analysis fun again. The algorithm is an extension of Klaufman and Kleinberg’s KDD paper and is explained in the following diagrams.

Stage 1: Use your favorite clustering algorithm to break up data into candidate clusters.

Stage 2: Ask a domain expert to interact with visualizations of each cluster, select the one with the most human-interpretable description and define that cluster. A cluster definition includes a name, a description and a short Python function determining whether a point belongs to that cluster.

Stage 3: Now that we have a human-interpretable cluster, we don’t need the machine to focus on data in that cluster, so remove the labeled data. Repeat from stage 1, and stop when the domain expert is no longer interested in the remaining data and labels the remaining points as Unclassified.
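The three stages can be sketched in a few lines of Python. This is a minimal illustration, not the notebook we used: the names ClusterDefinition, iterative_supervised_clustering, split_on_median and scripted_expert are all hypothetical, the median split stands in for whatever clustering algorithm you prefer, the scripted expert stands in for a person looking at visualizations, and the four toy domains are made-up numbers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Point = Dict[str, float]


@dataclass
class ClusterDefinition:
    """A name, a description and a short membership function (Stage 2)."""
    name: str
    description: str
    contains: Callable[[Point], bool]


def iterative_supervised_clustering(
    points: Sequence[Point],
    propose_clusters: Callable[[List[Point]], List[List[Point]]],
    ask_expert: Callable[[List[List[Point]]], Optional[ClusterDefinition]],
    max_rounds: int = 20,
) -> Tuple[List[ClusterDefinition], List[str]]:
    labels = ["Unclassified"] * len(points)
    remaining = list(range(len(points)))  # indices of still-unlabeled points
    definitions: List[ClusterDefinition] = []
    for _ in range(max_rounds):
        if not remaining:
            break
        # Stage 1: cluster only the data that has no label yet.
        candidates = propose_clusters([points[i] for i in remaining])
        # Stage 2: the expert defines one cluster, or returns None to stop.
        definition = ask_expert(candidates)
        if definition is None:
            break
        definitions.append(definition)
        # Stage 3: label the matching points and remove them from the pool.
        still_unlabeled = []
        for i in remaining:
            if definition.contains(points[i]):
                labels[i] = definition.name
            else:
                still_unlabeled.append(i)
        remaining = still_unlabeled
    return definitions, labels


def split_on_median(pts: List[Point]) -> List[List[Point]]:
    """Stand-in clustering algorithm: split the points at the median repin count."""
    if not pts:
        return []
    cutoff = sorted(p["repins"] for p in pts)[len(pts) // 2]
    low = [p for p in pts if p["repins"] < cutoff]
    high = [p for p in pts if p["repins"] >= cutoff]
    return [c for c in (low, high) if c]


# A scripted "expert": defines one cluster, then loses interest (returns None).
_script = iter([
    ClusterDefinition(
        name="High repin",
        description="Domains heavily repinned on Pinterest",
        contains=lambda p: p["repins"] >= 100,
    ),
    None,
])


def scripted_expert(candidates: List[List[Point]]) -> Optional[ClusterDefinition]:
    return next(_script)


# Toy data with made-up numbers, one dict per link domain.
domains = [
    {"new_pins": 5, "repins": 500},
    {"new_pins": 200, "repins": 10},
    {"new_pins": 3, "repins": 150},
    {"new_pins": 50, "repins": 40},
]

definitions, labels = iterative_supervised_clustering(
    domains, split_on_median, scripted_expert
)
print(labels)  # ['High repin', 'Unclassified', 'High repin', 'Unclassified']
```

Keeping the membership test as a plain function is what makes each cluster shareable: a colleague can read the rule, disagree with it and rerun it on their own data.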

There we go! The power of human-interpretable clustering is now in your hands.

Digging into billions of data-rich Pins

There was one particular question we wanted to apply this method to. As a catalog of ideas, Pinterest is built on an interest graph of 75+ billion Pins saved by 100+ million monthly active users onto 1+ billion boards. Boards articulate how ideas can become reality. This is incredible data. There’s a tiny detail that works so well, you rarely notice it: behind each Pin is a link into the wild wild web. For the sources behind Pins, we wanted to know: how do Pinners engage with link domains?

Measuring Pin engagement

To answer this question, we pulled a sample of 100K link domains and looked at how Pinners engaged with content during its first year on Pinterest. In particular, we pulled Pins created from the Pinterest Save button, both on and off Pinterest. The volume of new Pins reflects how a domain is performing on the web, and repins reflect how a domain is performing on Pinterest. The data was cleaned, normalized and loaded into a Python notebook. (We love our app aesthetic, and couldn’t help but have our notebooks follow suit.)

Find additional clustering details in my talk from the O’Reilly Strata conference.

You’ll find link domains fall into an interesting set of clusters. My favorite is “Pinterest Specials”, domains whose popularity or reachability has greatly diminished on the web, but whose content lives on and thrives within the Pinterest ecosystem. Here are our monikers for the link domain types:

Replicating data science

We asked the question of how Pinners engage with link domains and found an interesting and insightful answer that helps us understand what types of products to build. Let’s ask that question again. This is where we asked nine data scientists and machine learning engineers the same question. Each is an industry veteran and has been at Pinterest for more than one year. They work with Pinterest content and are part of the team helping surface great content to Pinners through 1.5 trillion recommendations every year. With the above algorithm and handling of the data, each person completed a clustering of link domains within an hour. The only remaining step before sharing the cluster with colleagues was pulling domain examples from each cluster.

Before we reveal the results, let us take a quick minute to review existing work touching on the replicability of analysis. Three incredible industry studies have surfaced in the last year. The first was a study on how skin tone affected the rate at which red cards were given in soccer, published in Nature. Twenty-nine crowd-sourced research teams analyzed the same data and shared reviews of each other’s methods. While there’s a relatively consistent answer that yes, it makes a slight difference, 10 of the 29 teams had deviating results, ranging from the opposite conclusion to an astonishingly strong correlation.

The replication crisis in medicine and science deserves at least one citation in this context. Last year, Begley and Ioannidis [Reproducibility in Science] pegged 75 percent to 90 percent of preclinical research as irreproducible. If you care about the effectiveness of cancer treatments, you’re in for a scary read. While some flaws have arisen from scandals of fabricated data, such as with Diederik Stapel, a majority of shortcomings have been attributed to the analysis of data and to human error under pressure to produce publishable results.

In a recent test of asking the same question of the same data, The New York Times sent the same poll results to four other reputable pollsters. While the difference between Clinton at +3 and Clinton at +4 may seem negligible, one reputable pollster reported a conclusion of Trump winning Florida, which is an astronomically different outcome.

For data science, is the diversity of our results on the level of Clinton within one point, or are data science results on polar ends of the spectrum?

Going back to our test with nine data scientists and machine learning engineers: through the development of lightweight interactive algorithms and the use of Python notebooks with preloaded data, we lowered the cost of replicating data science work to the point where we could ask everyone the same question: how do Pinners engage with link domains?


The nine results we received were so different from one another that they may as well be as diverse as those in the earlier reproducibility studies.

We found two reasons for the different results. The first, and the smaller influence, was that some results contained bad answers. These answers traced back to two habits we can detect and coach people out of:

  1. Preconceived notions of what the data entails before looking at the data.
  2. Cherry-picking a subset of features without understanding the larger picture.

The second cause comes from a difference in perspective. Some data scientists were intent on the viral aspects of growth while others focused on the return on investment within the Pinterest ecosystem. For a sample of different universes of perspective on Pinterest content, here are the unique monikers of clusters in different results:

House of cards

We asked the same question nine times and received nine astronomically different answers. When have we built irreproducible analysis on top of irreproducible analysis to the point that data-driven decisions are no longer supported by data? If we want to advance, we must ask the hard question; we must speak Lord Voldemort’s name. When is data science a house of cards?

There is an avalanche of supporting work I believe will enable data science as a field to answer this question in the near future. A key component is the infrastructural investment by many companies throughout Silicon Valley, which is making experimental systems and fast access to data the standard. Another is the industry-wide effort to recruit and train data scientists, which has taken data scientists from a scarce resource to one within reach. The most recent key effort is in reproducibility, a natural precursor to replicability: the ability to run the same analysis over the same data, with the same parameters, twice. Setting those parameters and designing models is still an expensive process, requiring a week or more for a broad question. With the development of faster Human-in-the-Loop algorithms, we’re lowering the cost of having multiple data scientists answer the same question. All of these components combined create the perfect storm to experiment and understand how different data science practices impact the business bottom line.

It’s a hard question. But as a field, I believe we can take it on. To stay informed and be involved in future efforts, join us.