Was There Really A Fake News Epidemic? Three Reasons Why You Shouldn’t Rely On The Buzzfeed Analysis Of Fake News

If you have been following the fake news beat over the last couple of months as I have been, it is easy to get down about the production of fake news. Countless intellectuals and journalists have lamented its presence. Facebook announced a suite of tools to flag fake news. Google banned 200 publishers. Both have given weight to claims of pervasive fake news. But the evidence of a fake news epidemic has largely relied on Craig Silverman’s analysis at BuzzFeed titled, “This Analysis Shows How Fake Election News Stories Outperformed Real News On Facebook.”

While I applaud Silverman for his detective work, some important assumptions about the world are baked into the analysis, and they skew the findings and the resulting story. In the following analysis, I focus on three topics. First, I dig into how news is defined, then I explore why search terms matter, and I end with an examination of the problems with Facebook data. For these reasons especially, I think we shouldn’t rely upon this study to make claims about the existence of pervasive fake news.

How Is News Defined?

Silverman puts the nut graf of the analysis up front:

In the final three months of the US presidential campaign, the top-performing fake election news stories on Facebook generated more engagement than the top stories from major news outlets such as the New York Times, Washington Post, Huffington Post, NBC News, and others, a BuzzFeed News analysis has found.

But the list of news sites doesn’t match the top news sites. Even though the most prominent sites are featured in his list, many other top-ranked news sites aren’t included, like Forbes, Yahoo, Bloomberg, The Atlantic, CNBC, SFGate, and The Daily Beast. But consider what Silverman’s list suggests. These outlets make news. News, then, comes from specific institutions and isn’t a distinct, independent, and scrutable product.

By defining news so narrowly, everything else is skipped over, and a lot is discounted. Conan O’Brien’s video featuring Louis C.K. garnered some 413,600 interactions in the last three months of the election, and the comedian lambasted Trump and endorsed Clinton. Yet it isn’t counted as news in the analysis. A story from the New Yorker titled “Queen Offers to Restore British Rule Over United States” got over 1.8 million engagements, but it wasn’t included either. Admittedly, it was a satirical piece, but it doesn’t even register in Silverman’s analysis. Business Insider got 746,900 engagements when it covered the New York Times printing two full pages of Donald Trump’s insults. Again, this doesn’t show up in the data. And what about shares of posts from the Clinton and Trump Facebook pages? Isn’t this news?

Why Search Terms Matter

Totaling up Silverman’s data highlights just what is missing. Including both the fake and the real news, total engagements from August 1 to Election Day top out at only about 16 million. Below is a chart of those numbers from his data:

Data Summarized From Silverman’s Google Doc

And yet, using a simple search of BuzzSumo, the same method Silverman employed, the term “Trump” garnered almost 229 million engagements in the month of October alone, while “Clinton” got 120 million. Below is a chart of just Clinton and Trump.

Author’s Calculations

Summing the months of August, September, and October for just those two terms puts the total interactions at 729 million, of which fake news would account for about 1 percent. And these interactions include only “Trump” and “Clinton,” not “election,” “Hillary,” or any number of other terms.
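As a rough sanity check on that share, here is the back-of-the-envelope arithmetic, assuming the roughly 8.7 million engagements commonly cited for the top fake election stories in Silverman’s data (that figure comes from the original BuzzFeed piece, not from my own tally):

```python
# Back-of-the-envelope share of fake news engagement.
# 8.7M is the widely reported total for the top fake election
# stories; 729M is the broad "Trump" + "Clinton" total above.
fake_engagements = 8_700_000   # assumed, from the BuzzFeed piece
broad_total = 729_000_000      # August-October "Trump" + "Clinton"

share = fake_engagements / broad_total
print(f"Fake news share: {share:.1%}")  # about 1.2%
```

Even granting BuzzFeed’s own numbers, fake news is a rounding error against the broader conversation.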

In truth, when I tried to replicate the study over the same time period, I wasn’t able to reproduce BuzzFeed’s numbers. The story “Pope Francis Shocks World, Endorses Donald Trump for President” got 12,783 engagements in my replicated search, compared to 960,000 in the original. Similarly, “WikiLeaks CONFIRMS Hillary Sold Weapons to ISIS… Then Drops Another BOMBSHELL!” got 789,000 in the original tally, but I was only able to count 88,040. You can find more of that analysis here, and as I will explain later, I think there might be a simple reason for this.

BuzzFeed conducted this study by searching the keywords “Hillary Clinton” and “Donald Trump,” as well as combinations such as “Trump and election” or “Clinton and emails,” in addition to “Soros and voting machine,” which was a known viral lie, according to the piece. (I emailed Silverman asking for a complete list, but as of this writing, I have not gotten a response.)

Notice the search pattern here. A search for “Soros AND voting machine” isn’t the same as one for “Soros OR voting machine.” One could easily imagine a fake news story that mentioned voting machines but wasn’t collected because it didn’t mention Soros. Boolean operators such as AND and OR can either limit or expand a search, so each search decision must be carefully considered.
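To make the difference concrete, here is a minimal sketch, with made-up headlines, of how an AND query drops stories that an OR query would catch:

```python
# Hypothetical headlines; only the first mentions both terms.
headlines = [
    "Soros-Linked Voting Machines Will Rig The Election",
    "BREAKING: Voting Machines Flipping Votes In Ohio",
]

terms = ["soros", "voting machine"]

def matches_and(text, terms):
    """True only if every term appears (an AND query)."""
    return all(t in text.lower() for t in terms)

def matches_or(text, terms):
    """True if any term appears (a broader OR query)."""
    return any(t in text.lower() for t in terms)

and_hits = [h for h in headlines if matches_and(h, terms)]
or_hits = [h for h in headlines if matches_or(h, terms)]
print(len(and_hits), len(or_hits))  # 1 2 -- AND misses the second story
```

The second headline is exactly the kind of fake voting machine story that an AND query silently excludes.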

Search terms are important because together they establish the bounds of the study. Here, the BuzzFeed analysis skips a typical first step, which is to search for associated terms. Below is a word cloud of terms currently associated with “election” on Twitter.

As you can see, the terms voter and vote are highly associated with elect. Moreover, presidential and political both show up as associated terms, and they should have been included in the BuzzFeed analysis. Had this been run back in November, a number of other terms would surely have popped up.

But you might wonder: if this cloud focuses on the term election, why is the central term elect? For this quick example, I used an off-the-shelf stemming program, which cuts words down to their roots, so election, electors, and electing all become elect. Yet there seems to have been a mistake with voter and vote in the program. This is just one kind of problem a researcher faces, but still, finding associated terms is a critical step in setting the limits of a search, and one that should be disclosed to audiences.
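To illustrate how this happens, here is a toy suffix-stripping stemmer, a deliberate simplification of Porter-style tools with invented rules, that unifies the elect family but, like the program above, leaves vote and voter as separate terms:

```python
def naive_stem(word):
    """Toy suffix-stripper: remove common suffixes when a root of
    at least four letters remains. Real stemmers (e.g. Porter) use
    more careful rules, but behave similarly on these examples."""
    for suffix in ("ion", "ors", "ing", "ers", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

words = ["election", "electors", "electing", "vote", "voter"]
print({w: naive_stem(w) for w in words})
# election/electors/electing all collapse to "elect", but "vote"
# and "voter" survive untouched -- the same quirk noted above.
```

Because “voter” doesn’t end in any of the listed suffixes, it never gets cut down, which is one plausible way the word cloud ended up treating vote and voter as distinct terms.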

The Problems With Facebook Data

So, how accurate is the Facebook data?

For one, Facebook is known to have a problem with bots and fake profiles. In the company’s first public quarterly earnings report in 2012, about 8.7 percent of all accounts were considered fake, which added up to 83 million accounts. The vast majority were simply duplicate accounts or incorrectly classified accounts, where a personal profile had been set up for a business. Yet 1.5 percent of all profiles were singled out as undesirable or spam accounts in that first report.

Subsequent follow-ups put the undesirable accounts at around 0.4 to 1.2 percent, and while the company has done a lot to combat these accounts, it is a continual cat-and-mouse game. Facebook fans are sold by the thousands, and so are batch Likes. Software to create profiles sells cheaply on the dark web, but it is just as easy to create one with some programming experience. There is a lot of money to be made in Facebook marketing, as BuzzFeed detailed in its report on Macedonian teens creating fake news. So the number of spam accounts likely hasn’t abated.

Ever since reading it, I have wondered how many real people were included in the BuzzFeed analysis. Given the difference between my reproduced numbers and the original, it could be that a number of fake accounts were axed, similar to the 3.5 million followers that Justin Bieber lost overnight in a purge of Instagram accounts.

Fake accounts in turn drive traffic numbers, so even if the absolute percentage of spam accounts has been declining, their proportion of engagement could be increasing.
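A quick hypothetical, with numbers invented purely for illustration, shows how that can happen: if spam accounts fall from 1.2 percent to 0.4 percent of profiles but each one engages far more than a typical real account, spam’s share of total engagement still rises:

```python
def spam_engagement_share(spam_frac, activity_multiplier):
    """Share of total engagement coming from spam accounts, given
    their fraction of all accounts and how many times more active
    they are than a typical real account (normalized to 1.0)."""
    spam = spam_frac * activity_multiplier
    real = (1 - spam_frac) * 1.0
    return spam / (spam + real)

# Hypothetical: spam shrinks from 1.2% to 0.4% of profiles,
# but spam accounts go from 10x to 100x as active as real ones.
before = spam_engagement_share(0.012, 10)
after = spam_engagement_share(0.004, 100)
print(f"{before:.1%} -> {after:.1%}")  # 10.8% -> 28.7%
```

The point isn’t these particular figures, which are made up; it’s that account counts and engagement counts can move in opposite directions.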

The accuracy of traffic data is a growing concern in the marketing community. As one practitioner put it to Facebook support:

We’re seeing a larger discrepancy than normal in tracking between Facebook and Google Analytics. The discrepancy is between the clicks in the Facebook platform and the clicks in Google Analytics platform (we set up UTM codes). Attached are screenshots denoting 11k more clicks in Facebook than we see in Google Analytics. What’s going on here?

A study of actual traffic statistics found that Facebook Likes correlate with actual traffic at only 0.68. While that is a strong relationship, there are far better measures. In fact, Silverman used Alexa data for web rankings, but Alexa isn’t the most accurate source; SimilarWeb has been shown to be consistently better. Where Facebook Likes correlate with actual traffic at about 0.68, Alexa rankings correlate at 0.7, while SimilarWeb data correlates at 0.84.

Well, you might say, Facebook is important because that is where adults get their news, so even if the numbers are smaller than what is being reported, any fake news on the platform is important. In this vein, it is often repeated that 62 percent of US adults, a majority, get their news from social media. Some version of that claim has shown up across numerous outlets. The problem is, it is hugely misleading.

People get their news from all kinds of sources, including newspapers, the radio, local news, cable news, and websites. In fact, when you look at the Pew report this figure is sourced from, social media is the least likely of all the potential sources for people to get their news. Using Pew’s framing, below is a table of how often people get their news, and from which sources:

Source: Pew

Yes, Facebook is an important source of news, but it still ranks at the bottom for all adults. The modern news environment has become fractured, which shouldn’t be lost in analyzing the effect of fake news.

Between the biggest institutional players and upstarts like BuzzFeed lies a gulf that is especially evident in referral data. Site referral data shows where a web visitor has just been, and a very large percentage of those reading news at the New York Times and the Washington Post arrive either directly or via a simple Google search. Below is the percentage of traffic each site gets from social media referrals.

SimilarWeb Snapshot November 2016

The larger point is this: even if fake news sites did get widely shared on Facebook, which is itself suspect, the one- or two-day peaks they experienced cannot compare to the total views the most established names consistently attract. LibertyNews, which is cited in the BuzzFeed piece, is currently ranked #227,239 in the US and #35,822 in the News and Media category at SimilarWeb. Even though the site did see a spike of 650,000 visits in October, the New York Times garnered 390,000,000 visits during the same period and ranks #2 in News as well as #43 in the US. The absolute number of views at the smaller sites simply cannot compare to the big players.

Putting It All Together

Beneath the fake news controversy lie more existential questions gripping the industry. Where do we fit in this new knowledge ecosystem? Where does our authority come from? Have we been getting this objectivity thing right?

Fake news has become a stand-in for all of the growing pains in the knowledge ecosystem. It may be messy, but until we have a better understanding of what we mean by fake news and what we want to accomplish with authentic news, little headway will be made in this debate.