A careful study of recommendation traffic on Amazon.com shows that oft-cited statistics overestimate the effect.
Recommender systems are everywhere: Netflix suggests which movies to watch, Amazon suggests which products to buy, Xbox suggests which games to play, and Medium suggests which stories you should read next.
You might assume that recommendation systems are so pervasive because they add value both for companies and users. If so, you’d be in good company: industry reports frequently claim that recommender systems account for huge amounts of traffic, contributing anywhere from 10 to 30% of total revenue [1]. On Amazon.com, nearly 30% of traffic is via click-throughs on recommendations.
These numbers aren’t wrong per se, but they are misleading. In a recent study with my colleagues Jake Hofman and Duncan Watts, published at the ACM Conference on Economics and Computation, we analyzed activity on 4,000 different products on Amazon.com and found that only about a quarter of the clicks typically attributed to the recommender system are actually caused by it. In other words, if you took the recommender system away, traffic would drop by only about 8%: still a decent effect, but not nearly as large as most people have been led to think.
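The arithmetic behind that figure can be checked in a couple of lines. This is only a back-of-the-envelope sketch using the rounded numbers quoted above; the variable names are illustrative, not taken from the study.

```python
# Rounded figures quoted in the article.
naive_share = 0.30      # share of traffic arriving via recommendation clicks
causal_fraction = 0.25  # share of those clicks estimated to be truly causal

# Share of total traffic that would be lost without the recommender.
causal_share = naive_share * causal_fraction
print(f"{causal_share:.1%}")
```

With these rounded inputs the result is 7.5%, which is consistent with the "about 8%" figure once rounding in the inputs is taken into account.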
To understand this discrepancy, let’s think of an example. Suppose that, as the weather gets colder, I decide to go on Amazon.com to search for a winter hat. Once I navigate to a nice hat, Amazon shows me a recommendation for gloves, indicating that these two products are frequently bought together. Remembering that I also need some new gloves, I click on the recommended pair and end up buying gloves as well as the hat.
Intuitively, my purchase of the gloves seems attributable to the recommendation engine, and indeed that’s precisely how those impressive numbers from above are computed. But let’s ask ourselves: what would have happened if there had been no recommendations from Amazon (see right panel in Figure 2)? Possibly I would have bought the hat and then stopped. But possibly my general interest in winter clothing would have kept me looking, and I would have ended up purchasing the gloves through some other sequence of clicks.
In the latter case, we couldn’t really say that the recommender caused me to buy gloves, because I would have done that anyway. Rather, it was just one of many possible pathways leading to the same end.
The key question raised by this example is then:
How many observed clicks on the recommended product are “causal” clicks, in the sense that they wouldn’t have happened without the recommender, versus “convenience” clicks, which would have happened anyway?
There is good reason to think that convenience clicks are common, if only because recommendations on e-commerce sites are often based on products that have previously been co-purchased (i.e. “people who bought this also bought that”). When users click through on these similar recommendations, it is therefore quite likely that they already knew about many of them, or would have found out about them anyway during their browsing session. It could be, in fact, that most of the traffic that is naively attributed to a recommender is just convenience traffic, not caused by the recommender at all.
Unfortunately, disentangling causal from convenience clicks is tricky. Ideally, we would run an experiment (or “A/B test”) in which some users (“treatment A”) were randomly selected to see recommended gloves while the rest (“treatment B”) were not. Since nothing differentiates these two groups of users except random chance, it would be easy to estimate the causal effect of showing recommendations, simply by comparing the number of times the gloves were viewed by users in treatments A vs. B.
However, such experiments are costly to perform in terms of time or revenue, and may impact user experience. They also require access to the full system, which may not always be available, especially to outside researchers. For these reasons, it would be useful to have a non-experimental approach: one that relies only on naturally generated data but nonetheless simulates the random assignment of a true experiment.
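To make the A/B comparison concrete, here is a minimal sketch of how such an experiment could be analyzed. The function name and the counts are hypothetical, chosen purely for illustration; they are not data from the study.

```python
def ab_effect(views_a, users_a, views_b, users_b):
    """Compare per-user view rates of the recommended product.

    Treatment A sees the recommendation; treatment B does not.
    Returns (causal lift in view rate, fraction of A's views that are causal).
    """
    rate_a = views_a / users_a
    rate_b = views_b / users_b  # views that happen even without the recommender
    lift = rate_a - rate_b
    return lift, lift / rate_a

# Hypothetical counts: 10,000 users per arm.
lift, causal_fraction = ab_effect(views_a=400, users_a=10_000,
                                  views_b=300, users_b=10_000)
print(f"lift = {lift:.3f}, causal fraction = {causal_fraction:.2f}")
```

In this made-up example, three quarters of the views in treatment A would have happened anyway (they also occur in treatment B), so only a quarter are attributable to the recommendation.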
A data-mining approach to natural experiments
A common solution to this problem is to look for what are called natural experiments, in which the researcher exploits some naturally occurring variation that is arguably random [2].
For example, in the case of Amazon recommendations, one might look at books that were featured on Oprah’s book club. Often these books experience large and sudden influxes of traffic, or “shocks.” Assuming that these shocks are uncorrelated with background demand either for the featured product or for the products recommended on the featured product’s page, we ought to be able to estimate the causal effect of the recommender by counting how much of the shock flows through to the recommended products [3].
There’s an obvious problem with this approach, however: an author’s appearance on Oprah might simultaneously increase demand for all her books, not just the featured book, and those books are also likely to be linked to by the recommender. So if we want to use naturally occurring demand shocks to simulate a random experiment, we must also somehow rule out the possibility that observed increases in recommendation traffic are simply an artifact of the same demand shock.
Typically researchers solve this problem by making a logical argument. For example, if the author were a first-time author and previously unknown, it would be much less likely that demand for the books recommended from her book’s Amazon page would have been directly affected by her appearance on Oprah.
Often, however, arguments of this sort are hard to verify. Moreover, finding examples of shocks like this requires a lot of ingenuity and ends up restricting researchers to very specific product categories (what is the equivalent of Oprah’s book club for socks or detergent?).
Motivated by these difficulties, we adopt a different approach for finding natural experiments — namely by discovering them directly in the data. Specifically, we propose a simple, scalable method, which we call Shock-IV, that searches for shocks and allows us to identify the causal effect over a range of different product groups, and possibly other websites.
Identifying the causal impact of Amazon recommendations
Because shocks tend to be rare, our method needs a large quantity of data covering many products over extended periods of time to work well. For this reason, we chose Amazon’s recommendations, for which we have a large quantity of high-quality log data from Internet Explorer users who have installed the Bing Toolbar and have explicitly agreed to share their browsing history through it. Amazon also provides detailed parameters in each URL that is accessed, allowing us to identify the precise source of each visit: which visits are due to recommendations, which are due to search, and so on. Based on which product people were browsing before a recommendation visit, we are also able to identify the focal product for each recommended product. Thus, we can reconstruct the page visits in each of a user’s sessions.
Restricting only to page visits on Amazon.com, we obtained browsing data for 2.1 million anonymized users on 1.4 million unique products over a nine-month period in 2013–2014.
To find shocks in this data, we analyzed the page-visit time series for each product and looked for days on which traffic exceeded both five times the previous day’s traffic and five times the product’s median traffic. To filter out unusual activity, we considered only products with at least 10 unique users who visited the product’s page on the “shock day.”
Critically, among the shocked products thus selected, we then eliminated any whose recommended products did not have constant direct traffic. By direct traffic, we mean traffic that comes from direct page visits and search, but not through recommendations. As an example, we would accept the shock shown in Figure 3, but reject those that do not have constant direct traffic in the second panel. More details on building the shock sample can be found in our paper [4].
Because we were searching for shocks by their observable impact on page visits, and not by tracing the effects of a certain event, we were able to obtain a much larger and more diverse sample of products than we would have with a traditional approach such as considering only books that had been featured on Oprah or reviewed in the New York Times.
Specifically, we found about 20,000 products that had at least 10 page visits on any given date during the nine-month period. The Shock-IV method provides valid shocks for more than 4,000 of them. Since “Customers who bought this also bought” recommendations are shown across all product categories, we restrict our analysis to these recommendations.
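The shock criteria described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the code used in the study: the function name, the input layout (one count per day), and the choice to take the median over the whole series are all assumptions, and the sample numbers are invented.

```python
from statistics import median

def find_shock_days(visits, unique_users, factor=5, min_users=10):
    """Return indices of days that look like traffic shocks.

    visits[t] is the number of page visits on day t and unique_users[t]
    the number of distinct visitors. A day qualifies when its traffic
    exceeds `factor` times both the previous day's traffic and the
    product's median traffic (assumed here to be the median over the
    whole series), and at least `min_users` distinct users visited.
    """
    med = median(visits)
    shocks = []
    for t in range(1, len(visits)):
        if (visits[t] > factor * visits[t - 1]
                and visits[t] > factor * med
                and unique_users[t] >= min_users):
            shocks.append(t)
    return shocks

# Invented daily counts for one product.
visits = [4, 5, 3, 4, 60, 12, 5, 4]
users = [4, 5, 3, 4, 48, 10, 5, 4]
print(find_shock_days(visits, users))  # [4]
```

Only day 4 qualifies: its 60 visits exceed five times the previous day's 4 visits and five times the median of 4.5, and 48 distinct users visited the page.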
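A check for constant direct traffic might look like the sketch below. The article does not say how "constant" is defined, so the relative tolerance used here (20%) is purely an assumption for illustration, as are the function name and the sample numbers; the paper describes the actual criterion.

```python
def has_constant_direct_traffic(direct_before, direct_during, tolerance=0.2):
    """Check that direct traffic to a recommended product stayed flat.

    direct_before: daily direct visits (page visits and search, not
    recommendations) on the days before the shock.
    direct_during: direct visits on the shock day.
    tolerance: assumed maximum relative deviation from the pre-shock mean.
    """
    baseline = sum(direct_before) / len(direct_before)
    if baseline == 0:
        return direct_during == 0
    return abs(direct_during - baseline) / baseline <= tolerance

# Invented numbers: direct traffic stays flat in the first case,
# and jumps together with the focal product in the second.
print(has_constant_direct_traffic([20, 22, 19, 21], 21))  # True
print(has_constant_direct_traffic([20, 22, 19, 21], 45))  # False
```

A shock that fails this check is discarded, because a simultaneous rise in direct traffic suggests that demand for the recommended product was itself affected by whatever caused the shock.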
Figure 1 shows the causal click-through rate (CTR) estimates obtained using the Shock-IV method. For comparison, we also show the naïve observational estimate, as a red dotted line. For different product categories, the naïve estimate inflates the actual effect of recommendations by up to 200%. For example, instead of the 10% or above observed CTR for recommendations, the actual CTR is closer to 5% for popular categories such as Books, eBooks, and Toys.
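To illustrate how the two estimates can differ, here is a simplified sketch. It assumes that the causal CTR is estimated as the extra recommendation click-throughs during a shock divided by the extra visits to the focal product, which is one reading of "how much of the shock flows through"; the paper gives the actual estimator. The function names and counts are hypothetical.

```python
def naive_ctr(rec_clicks, focal_visits):
    """Observational CTR: all recommendation clicks per focal-page visit."""
    return rec_clicks / focal_visits

def shock_ctr(rec_before, rec_during, focal_before, focal_during):
    """Shock-based CTR: extra recommendation clicks per extra focal visit.

    Only meaningful when direct traffic to the recommended product stayed
    constant, so that the extra clicks can be credited to the recommender.
    """
    extra_visits = focal_during - focal_before
    if extra_visits <= 0:
        raise ValueError("focal traffic must increase during a shock")
    return (rec_during - rec_before) / extra_visits

# Invented counts: a normal day versus a shock day for one focal product.
print(f"naive: {naive_ctr(12, 100):.0%}")             # naive: 12%
print(f"shock: {shock_ctr(12, 37, 100, 600):.0%}")    # shock: 5%
```

In this made-up example the visitors brought in by the shock click through at 5%, well below the 12% rate observed on a normal day, which is the kind of gap the figure reports.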
A related question is what fraction of observed click-throughs are causal, rather than simply due to convenience. For this, we look at the recommendation click-throughs before the shock and make the generous assumption that there are no convenience click-throughs in that period. That allows us to compute an upper bound for the fraction of click-throughs that are caused by the recommender system. Applying this to all products across categories, we find that nearly three-quarters of the recommendation click-throughs may be due to convenience. Only about a quarter of click-throughs, at most, are caused by the recommender system.
Data-driven causal inference
There are a number of possible problems with these estimates. For example, it could be that most of the shocks are due to discounts on focal products, which may attract a different set of users than the general Amazon.com population. Alternatively, it could be that most of the shocks are concentrated in certain periods, such as holidays, thus capturing effects that are specific to those periods. We tested for these concerns using additional data in our paper and found that our estimates were robust with or without discounts or shocks during the holiday period.
Nevertheless, our sample is unlikely to be representative of the general Amazon product population. For example, it is likely that we captured more popular products (because we require at least 10 visits in a day) or products that are more susceptible to large and sudden variations in traffic. As a result, we can’t claim that our estimate of causal clicks applies to all of Amazon.com, let alone to other websites (a caveat that also applies to most other “instrumental variables” analyses in economics and the social sciences [5]).
Fortunately, we can show that our estimate does apply to a large and reasonably diverse sample of products, and so can still indicate the magnitude of overestimation by observational CTR estimates. Moreover, we believe our Shock-IV method is generalizable to other websites and problem domains. For instance, using the same principles, we can use Shock-IV to estimate the impact of other recommender systems, or the impact of online ads.
More generally, our method should be viewed as just one example of a general data-driven strategy for identifying causal effects in online systems. Unlike typical instrumental variable studies, which must be justified by logical arguments that can be hard to verify, our method relies instead on fine-grained data from the recommendation system itself. In addition, having data for a vast number of products allowed us to mine for shocks and admit a large fraction of valid shocks, instead of being restricted to a few single-source shocks. These improvements boost both the validity and the generalizability of the causal estimate we obtained.
Looking forward, such data-driven strategies for causal inference hold promise as more and more fine-grained data becomes available from socio-technical systems.
[1] Grau, J., Personalized product recommendations: predicting shoppers’ needs. eMarketer, March 2009.
[2] Dunning, T., Natural experiments in the social sciences: A design-based approach. Cambridge University Press, 2012.
[3] Carmi, E., G. Oestreicher-Singer and A. Sundararajan, Is Oprah contagious? Identifying demand spillovers in online networks. Available at SSRN 1694308, 2012.
[4] Sharma, A., J. M. Hofman and D. J. Watts, Estimating the causal impact of recommendation systems from observational data. ACM Conference on Economics and Computation, pp. 453–470, 2015.
[5] Angrist, J. D. and J. S. Pischke, Mostly harmless econometrics: An empiricist’s companion. Princeton University Press, 2008.