You’re about to publish that blog post and really want more people to engage — to read, share, and respond to it. What do you do? Naturally, you turn it into a catchy list of items perfectly packaged for click-and-share bait on social networks: you make a listicle. How long should it be? Five items feels a bit short. Thirty feels a tad long, and way too even. But 29 seems like a good, shareable length. What if I told you that, using data, we’ve found a statistically significant difference between the performance of odd- and even-length lists? Sounds odd? Read on.
Lists have been around for a long time. From the Bible to the Billboard charts, packaging items in lists is an effective way to gain heightened attention from a broader audience. The format makes content more easily consumable, promising an effortless way to get through a finite amount of information. Choosing the right length involves a dash of voodoo magic and a lot of speculation. If you search online you’ll find a plethora of myths and beliefs that one number performs better than another, specifically odd versus even.
Looking further into this, I discovered discussions on the “rule of odds” and how placing an odd number of items in a row in e-commerce sites may capture more attention. Take a look at Gilt, or the New York Times, where there are typically three items (or columns) in the main pages.
Numerous folks are claiming that odd-length listicles, especially on BuzzFeed, are the preferred length-du-jour. Is there a reason why editors choose them? Do they know something we don’t? And can we use data to understand this?
That is exactly what I set out to figure out.
Looking at ten thousand published BuzzFeed listicles over a period of three months I found a statistically significant difference in the performance of odd-length listicles compared to even ones.
The betaworks data team sits at an intersection of data streams around media production, consumption, and social gestures. We have varied insight into content produced across mainstream media and blogs, what of it is actually read, and which stories are shared. Given this nuanced view, we can attempt to identify content that is performing better than the norm for a given publisher, domain, or blog. An internal metric we call the audience score measures the quality of the users interacting with a piece of content. It combines a number of parameters, but is heavily influenced by the number of unique users interacting with a link. The audience score helps us identify content that is performing well before it becomes heavily shared or very visible. The Digg editors use it as one of various signals that help them identify content at an early stage, and it also helps us rank popular content for Digg Reader.
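The real audience score is an unpublished internal metric, so the following is only a toy sketch of the idea described above: a score that combines several signals but is dominated by the number of unique users interacting with a link. The weights and the two-signal structure are my own assumptions for illustration.

```python
from collections import defaultdict

def audience_score(interactions, weight_unique=0.7, weight_total=0.3):
    """Toy stand-in for an audience-score-style metric.

    `interactions` is a list of (user_id, link) pairs. The real metric
    combines many parameters; here we simply weight unique users more
    heavily than raw interaction volume, per the description above.
    """
    users_per_link = defaultdict(set)   # link -> set of unique users
    total_per_link = defaultdict(int)   # link -> total interaction count
    for user_id, link in interactions:
        users_per_link[link].add(user_id)
        total_per_link[link] += 1
    return {
        link: weight_unique * len(users_per_link[link])
              + weight_total * total_per_link[link]
        for link in users_per_link
    }

scores = audience_score([("u1", "a"), ("u2", "a"), ("u1", "a"), ("u1", "b")])
```

Because unique users dominate, link "a" (two distinct users) scores well above link "b" (one user), even though repeat interactions from the same user add little.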
Inspired by Brian Abelson and Noah Veltman’s Listogram analysis of BuzzFeed listicle length, I decided to put together a similar chart, not only showing the distribution of published listicle length, but also using our audience score as a proxy for the article’s performance. (interactive version here)
My initial dataset included all BuzzFeed content published over a period of three months. Using a standard classification technique, I separated the listicles from other BuzzFeed content; I won’t describe this step in detail, as it is not the primary focus of this article. At the end of this process, my dataset included approximately 10k listicles. The majority of articles were correctly identified as listicles, with a small number of false positives (for example: “What would happen if the world lost Oxygen for 5 seconds?”).
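The classifier itself isn’t described in the post, but a crude, hypothetical stand-in shows the shape of the problem: extract a list length from titles that lead with a count. This regex heuristic is my own illustration, not the method actually used.

```python
import re

def looks_like_listicle(title):
    """Crude heuristic stand-in for a listicle classifier: treat a title
    that starts with a count (e.g. "29 Signs You ...") as a listicle and
    return the implied list length, or None otherwise. A real classifier
    would need more features to avoid the kinds of false positives
    mentioned above.
    """
    match = re.match(r"\s*(\d+)\s+\w", title)
    return int(match.group(1)) if match else None

looks_like_listicle("29 Signs You Grew Up In The 90s")  # 29
looks_like_listicle("Why Cats Rule The Internet")       # None
```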
With this significantly larger dataset of listicles, the distribution of published listicles by length pretty much matches Abelson and Veltman’s results. There are many more listicles of length 10 published compared to other numbers. This is primarily because BuzzFeed sells the 10-length listicle to partner brands, such as the Michael J. Fox Show, Nordstrom Topman, and Buick. The second most popular length is 15, followed by 12. Listicle length drops off quite rapidly in the 20s, although, surprisingly, lengths 11-21 are far more popular than those under 10. A BuzzFeed community member posted “150 Brilliant Harry Potter GIFs,” which is a fun outlier in this dataset.
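Tallying a length distribution like this is a one-liner; the lengths below are made up for illustration, since the underlying dataset isn’t public.

```python
from collections import Counter

# Hypothetical listicle lengths, echoing the shape described above:
# 10 is most common, 15 second, with a 150-length outlier.
lengths = [10, 10, 10, 15, 12, 15, 10, 29, 11, 21, 150]

distribution = Counter(lengths)
distribution.most_common(2)  # 10 and 15 lead the tally
```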
If we look at the bar chart by audience score we see a completely different picture — odd-number-length listicles (highlighted in red below) tend to have a higher audience score on average, and in our dataset, the number 29 tends to have an advantage over the rest.
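Splitting scores by parity is straightforward; the (length, score) pairs below are hypothetical, standing in for the real per-article data.

```python
from statistics import mean

# Hypothetical (listicle_length, audience_score) pairs for illustration.
articles = [(10, 0.42), (15, 0.55), (12, 0.40), (29, 0.71),
            (11, 0.58), (20, 0.38), (21, 0.60), (14, 0.41)]

odd_scores = [score for length, score in articles if length % 2 == 1]
even_scores = [score for length, score in articles if length % 2 == 0]

print(mean(odd_scores), mean(even_scores))
```

In this toy sample the odd-length group averages higher, mirroring the pattern in the chart; whether such a gap is meaningful is exactly what the significance tests below address.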
Judging from the graph alone, prime-number lengths don’t seem to have a significant advantage over odds or evens in terms of average audience score (see below), certainly not compared to odd numbers more generally.
Now for the fun part: how can we tell whether the difference in the data is statistically significant? Is the difference we’re seeing in performance an actual noteworthy trend?
There are a number of statistical tools that can be used to determine whether two sets of data are significantly different from each other. Hypothesis testing is the cornerstone of statistical inference, a core method used in every branch of science and data analysis to assess whether an observed difference is a random occurrence or a “real” phenomenon. Explained broadly: a Null Hypothesis is stated, assuming that there is no difference between the two observed datasets. Then a significance level (typically 5%) is picked, reflecting the acceptable probability of concluding that there is a difference when there actually is none.
Finally, we select a statistical test and perform the analysis. Given the results of the computation, we can make a decision about the Null Hypothesis using what’s called a p-value. If the p-value is lower than the chosen significance level (0.05, or 5%, in many cases), the Null Hypothesis is rejected: there is less than a 5% chance of seeing a difference this large if the two datasets actually came from the same distribution, so there is likely an underlying difference between them. If the p-value is high, say 0.2, then a difference this large would arise by chance 20% of the time even with no real effect, and we fail to reject the Null Hypothesis — meaning we can’t claim a statistically significant difference between the two datasets.
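The whole procedure can be made concrete with a permutation test, which implements the null-hypothesis logic directly: if the two groups really came from the same distribution, shuffling the labels shouldn’t change the difference in means much. The sample values here are made up for illustration.

```python
import random

def permutation_p_value(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test of the Null Hypothesis that samples
    `a` and `b` come from the same distribution: how often does a random
    relabeling produce a mean difference at least as large as observed?
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            hits += 1
    return hits / n_resamples

alpha = 0.05  # the chosen significance level
p = permutation_p_value([0.60, 0.70, 0.65, 0.62], [0.40, 0.42, 0.38, 0.45])
reject_null = p < alpha
```

With two clearly separated groups like these, almost no shuffle reproduces the observed gap, so the p-value lands well under 0.05 and we reject the Null Hypothesis.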
Calculating these tests, I observed a statistically significant difference in the performance of odd-length BuzzFeed listicles versus even ones.
I computed the t-statistic and its corresponding p-value, comparing odd-length and even-length listicles for both the raw article-score data and the average score per listicle length (articles grouped by list length). In both cases the p-value was lower than 0.05, allowing me to reject the Null Hypothesis and effectively showing that the difference between the observed datasets is unlikely to be due to chance alone.
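A t-test of this kind is a few lines with `scipy.stats.ttest_ind`; the scores below are made up for illustration (the author’s actual numbers live in the linked notebook), and `equal_var=False` selects Welch’s variant, which doesn’t assume equal variances between the groups.

```python
from scipy import stats

# Hypothetical audience scores for odd- and even-length listicles.
odd_scores = [0.58, 0.71, 0.55, 0.60, 0.66, 0.62]
even_scores = [0.42, 0.40, 0.38, 0.41, 0.45, 0.39]

# Welch's two-sample t-test: t-statistic and two-sided p-value.
t_stat, p_value = stats.ttest_ind(odd_scores, even_scores, equal_var=False)

if p_value < 0.05:
    print("reject the Null Hypothesis: odd and even differ")
```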
I ran a number of other statistical tests, such as Fisher’s exact test and the Mann-Whitney U test; all pointed to the same conclusion about odd-length lists. When I tested the performance of prime-number-length listicles versus even ones, the resulting p-value was borderline at 4.7% — too close to the threshold to read as a meaningful difference. And finally, the number 29 seemed to consistently over-perform compared to the other top odd numbers. I’ve published an iPython notebook with my calculations and results here.
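Both of those alternative tests are also available in `scipy.stats`. The scores below are made up for illustration; note that Fisher’s exact test operates on a 2x2 contingency table, so one plausible setup (my assumption, not necessarily the author’s) is to count articles above and below the overall median score, split by parity.

```python
from scipy import stats

# Hypothetical audience scores for odd- and even-length listicles.
odd_scores = [0.58, 0.71, 0.55, 0.60, 0.66, 0.62]
even_scores = [0.42, 0.40, 0.38, 0.41, 0.45, 0.39]

# Mann-Whitney U: a non-parametric test with no normality assumption.
u_stat, u_p = stats.mannwhitneyu(odd_scores, even_scores,
                                 alternative="two-sided")

# Fisher's exact test needs a 2x2 table: here, counts of articles at or
# above vs. below the pooled median score, split by length parity.
pooled = sorted(odd_scores + even_scores)
median = pooled[len(pooled) // 2]
table = [
    [sum(s >= median for s in odd_scores), sum(s < median for s in odd_scores)],
    [sum(s >= median for s in even_scores), sum(s < median for s in even_scores)],
]
odds_ratio, f_p = stats.fisher_exact(table)
```

With clearly separated toy groups like these, both tests agree with the t-test and return p-values under 0.05.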
We already know that the human brain has a preference for images and visuals. As an industry we’ve been A/B testing page layouts along with catchy headlines, learning how to make experiences and content more alluring in order to optimize for clicks. Now we can add that odd-length listicles tend, over time, to generate more shares on social networks, and hence attain more attention. Of course, now that this finding is public, its effect may diminish significantly, depending on the visibility of this blog post.
If BuzzFeed editors are aware that certain tactics lead to more clicks, should they feel obligated to tell their users? Business-wise, it makes no sense at all. This knowledge obviously gives them a competitive edge over other publishers, especially at a time when there’s an ever-growing battle over user attention.
The tougher question is where do we draw the line? As a data scientist I am tasked with finding techniques to optimize performance, not only for algorithms, but for businesses. Many commonly used tactics involve this type of behavioral analysis, comparing datasets based on parameters that describe the data itself (such as listicle length) or on user metadata (your typical user segmentation). By building a recommendation system that gets users to interact with more content than they typically would have and spend more time on my site, am I crossing an ethical boundary? What if I tweak the recommendation system to affect user purchase behavior? Or emotional state?
Hmm… that’s a whole other rabbit hole, fit for a different blog post.
Gilad Lotan | @gilgul