Fake News Starts with the Title

This Medium post may be the 100th article you have read on the topic of fake news since the 2016 US Presidential Election. While misinformation and malicious fake news are certainly not new, everyone has now become aware of their existence. This increased awareness is well and good (the first step in solving a problem is realizing there is one, right?), but the overload has blurred the lines between fake and real for many. While it is very hard to say, computationally, what is true, I hope to offer you some new scientific evidence that fake and real news can be differentiated.

In a recent study, being published at NECO 2017, Sibel Adali and I ask the question:

“Is there any systematic stylistic difference between fake and real news?”

To approach this problem, we look at 3 different types of articles: real, fake, and satire. Real news stories are stories that are known to be true and from “well trusted” news sources. Fake news stories are stories that are from well known “fake news” sources that are intentionally trying to spread misinformation. Satire news stories are stories that are from news sources that explicitly state they are satirical and do not intentionally spread misinformation.

Now here is where it gets tricky. It is fairly difficult to get solid ground truth about fake and real news. So, to determine the ground truth of these articles, we take a “strict source” approach. If we think of the news as a spectrum from “generally very reliable” to “purposefully fake, never reliable”, we want to capture the extreme ends of this spectrum. To find these extreme ends, we use Zimdars’ crowdsourced list of fake news and Business Insider’s most trusted news list. For satire news, we simply collect news sites that state they are satirical on the front page.
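To make the “strict source” idea concrete, here is a minimal sketch of labeling an article purely by where it was published. The domains below are hypothetical placeholders, not entries from the actual lists, and the function name is my own for illustration.

```python
from urllib.parse import urlparse

# Hypothetical placeholder domains (assumption); the real source lists
# used in the study are not reproduced here.
SOURCE_LABELS = {
    "trusted-news.example": "real",
    "fake-news.example": "fake",
    "satire-news.example": "satire",
}

def label_article(url):
    """Return the label of the article's source, or None if the source
    is not on any of the lists (such articles are simply left out)."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return SOURCE_LABELS.get(host)
```

Note that under this approach the label comes from the source, not from fact-checking the individual story, which is why only the extreme ends of the reliability spectrum are used.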

With this ground truth approach in mind, we analyze 3 independent data sets: Craig Silverman’s BuzzFeed data set from his article entitled “This Analysis Shows How Viral Fake Election News Stories Outperformed Real News On Facebook”, a data set from Burfoot and Baldwin’s 2009 study on satire news, and a brand new data set of political fake, real, and satire news that we collected ourselves.

To analyze, we compute many different natural language features on both the body text and the headline text of each article in the data sets. Then, we use a mixture of Wilcoxon hypothesis testing methods and Support Vector Machines (SVMs) to tease out the differences between article types and demonstrate the features’ ability to predict when news is fake. If you are not familiar with these methods: hypothesis testing is used to say whether there is a statistically significant difference between two classes of data, and an SVM is a supervised classification method used to predict the class of a data point. In our case, the classes are fake news, real news, and satire news. We do this analysis on each data set independently to ensure no one data set’s limitations impact our final conclusions. There are many more technical details to these methods, but to avoid getting bogged down, I will get right into the results. For those of you who have a higher need for cognition (or time to burn), the paper can be found here.
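For the curious, here is a rough sketch of what a hypothesis test on a single feature could look like. It is a two-sided rank-sum test using a normal approximation, written with only the Python standard library. The exact test variant and settings used in the paper are not shown here, so treat the function and the toy numbers as assumptions for illustration.

```python
import math

def rank_sum_test(xs, ys):
    """Two-sided rank-sum test (normal approximation, no tie correction
    in the variance). Assumes both samples are non-empty.
    Returns (z, p_value); z > 0 means xs tends to be larger than ys."""
    combined = sorted((value, group)
                      for group, sample in enumerate((xs, ys))
                      for value in sample)
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j for tied values
        for k in range(i, j):
            ranks[k] = average_rank
        i = j
    n1, n2 = len(xs), len(ys)
    rank_sum = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    expected = n1 * (n1 + n2 + 1) / 2
    std_dev = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum - expected) / std_dev
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Toy title lengths in words (made up for illustration, not study data).
fake_lengths = [12, 14, 15, 17, 18, 20]
real_lengths = [6, 7, 8, 9, 10, 11]
z, p = rank_sum_test(fake_lengths, real_lengths)
```

A small p-value says the two groups differ on that feature more than chance alone would explain. In practice a library routine such as scipy.stats.ranksums would be a more robust choice than this hand-rolled version.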

Titles are a strong differentiating factor between fake and real news.

By far the biggest difference between fake and real news sources is the title. Specifically, we find, across both the BuzzFeed data set and ours, that fake news titles are longer than real news titles and contain simpler words in both length and technicality. Fake titles also use more all-capitalized words and significantly more proper nouns, but fewer nouns overall and fewer stop-words (examples: the, and, a, an). In addition, we find that in the BuzzFeed data set, fake titles use significantly more analytical words, and in our data set, fake titles use significantly more verb phrases and significantly more past tense words.

Looking at a few random examples from our data will solidify these results:

Example 1

FAKE TITLE: BREAKING BOMBSHELL: NYPD Blows Whistle on New Hillary Emails: Money Laundering, Sex Crimes with Children, Child Exploitation, Pay to Play, Perjury

REAL TITLE: Preexisting Conditions and Republican Plans to Replace Obamacare

Example 2

FAKE TITLE: URGENT: The Mainstream Media Was Hiding One HUGE Fact About Trump Win!

REAL TITLE: Obama Designates Atlantic, Arctic Areas Off-Limits To Offshore Drilling

As you can see, these results show that the writers of fake news are attempting to squeeze as much substance into the titles as possible, skipping stop-words and nouns in favor of proper nouns and verb phrases. In other words, the fake titles use many verb phrases and named entities to get many points across, while the real titles opt for a brief and general summary statement (many claims vs few claims).
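A few of the simpler title features can be computed in a handful of lines. The sketch below is only an approximation of the idea: the tokenizer is a plain regular expression and the stop-word list is a tiny illustrative one (both are my assumptions, not the tooling used in the study). Features such as proper nouns and verb phrases need a part-of-speech tagger and are left out.

```python
import re

# Tiny illustrative stop-word list (assumption); a real analysis would
# use a much fuller list.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "on", "with", "about"}

def title_features(title):
    """Compute a few surface-level features of a headline."""
    words = re.findall(r"[A-Za-z']+", title)
    n = len(words)
    return {
        "num_words": n,
        "avg_word_len": sum(len(w) for w in words) / n if n else 0.0,
        # Count words written entirely in capitals, ignoring one-letter words.
        "all_caps_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
        "stop_word_ratio": (
            sum(1 for w in words if w.lower() in STOP_WORDS) / n if n else 0.0
        ),
    }

fake = title_features(
    "URGENT: The Mainstream Media Was Hiding One HUGE Fact About Trump Win!"
)
real = title_features(
    "Obama Designates Atlantic, Arctic Areas Off-Limits To Offshore Drilling"
)
```

On the Example 2 titles, this counts 12 words and 2 all-caps words for the fake title, versus 10 words and no all-caps words for the real one. A single pair of titles proves nothing by itself, of course; the differences only become meaningful across a whole data set.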

The content of fake and real news articles is also substantially different.

Not only is the headline of an article a differentiating factor, but the content structure is actually quite different as well. In particular, we find that real articles are significantly longer than fake articles and that fake articles use fewer technical words, shorter words, less punctuation, fewer quotes, and more lexical redundancy. Further, fake news articles are easier to read, use fewer analytic words, have significantly more personal pronouns, and use fewer nouns and more adverbs.
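Lexical redundancy is probably the least familiar of these features, so here is one common way to approximate it: the type-token ratio, the number of distinct words divided by the total number of words. Using this particular measure is an assumption on my part for illustration, and the tokenizer is a deliberately simple regular expression.

```python
import re

def type_token_ratio(text):
    """Distinct words divided by total words.
    Lower values mean more repeated words, i.e. more lexical redundancy."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

repetitive = type_token_ratio("the cat saw the cat")   # 3 distinct / 5 total
varied = type_token_ratio("every word here differs")   # 4 distinct / 4 total
```

Keep in mind that this ratio tends to drop as texts get longer, so comparing articles of very different lengths with it requires some care.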

These many differences may seem abstract, so here is the take away point: Fake news has very little information or substance in the article content, but packs a ton of information into the titles.

This result is further supported by our ability to predict the news category using a small subset of our features. We achieve 78% accuracy when separating fake from real titles and 71% accuracy when separating fake from real content. This means our simple subset of features improves prediction over random by between 21 and 28 percentage points. (If you are not familiar with machine learning, this is basically saying we can automatically predict whether an article is fake or real using content structure much better than if we were to randomly choose what category a news article should fall into.)

Linear kernel SVM classification results using the top 4 features for the body and the title texts in our data set. The accuracy is the mean of 5-fold cross-validation. Baseline is the majority class.
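To see where the improvement figures come from, here is a small sketch of the majority-class baseline and the gain over it. It assumes the fake and real classes are balanced, so the baseline is 50%; that assumption is consistent with the reported 21 and 28 point improvements, but the toy label list below is made up.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

def improvement_over_baseline(accuracy, baseline):
    """Gain in accuracy over the baseline, as a fraction (0.28 = 28 points)."""
    return accuracy - baseline

# Toy balanced label set (assumption), giving a 50% baseline.
labels = ["fake"] * 3 + ["real"] * 3
baseline = majority_baseline(labels)

title_gain = improvement_over_baseline(0.78, baseline)  # reported title accuracy
body_gain = improvement_over_baseline(0.71, baseline)   # reported body accuracy
```

With a 50% baseline, the reported accuracies correspond to gains of about 0.28 for titles and 0.21 for body text. If the classes were not balanced, the majority baseline would be higher and the gains correspondingly smaller.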

Fake content is more closely related to satire than to real.

Now let’s include our satire news articles in the analysis. Up to this point we have only looked at the categories of fake and real, but including the category of satire may give us more insight. When adding satire articles to the analysis, we find that the majority of our feature distributions are common between satire and fake. Specifically, both satire and fake content use shorter words, fewer technical words, fewer analytic words, and significantly more lexical redundancy, as well as fewer quotes, less punctuation, more adverbs, and fewer nouns than real articles. This similarity between satire and fake content is further supported by our prediction results. When predicting satire news from fake news, we get a much smaller accuracy improvement over baseline than we do for fake versus real or satire versus real.

This finding is interesting and useful for several reasons. First, much of the journalistic coverage of fake news has assumed that fake news is inherently persuasive and meant to look like real news, but this is actually not the case. The high similarity between satire and fake content demonstrates that fake news is written in a less investigative way, as we know satire news is written to be absurd and not to make sound arguments. To many, this claim seems obvious (fake news can’t make sound arguments because it is fake, duh), but it has some important implications you may not realize. People are still fooled by fake news (just look at the 2016 US Presidential Election), even though fake news has very little logical or argumentative substance. The field of communications may provide us with some insight.

Real news persuades through arguments, while fake news persuades through short-cuts.

To better explain our findings, we look to the well studied Elaboration Likelihood Model (ELM) of persuasion. According to ELM, people are persuaded through two different avenues: the central route and the peripheral route. The central route of persuasion results from the attentive examination of the arguments and message characteristics presented. This route involves a high amount of energy and cognition. In contrast, the peripheral route of persuasion results from associating ideas or making conjectures that are unrelated to the logic and quality of the information presented. This route could also be called a heuristic route, or a short-cut, that takes very little energy and cognition. Humans are prone to these short-cuts, including relying on your trust in a friend on Facebook (my friends are smart, they would never share fake news!), skimming an article for content, or simply believing what the title of a news story states (the title makes sense to me and I don’t have time to check whether it’s legit). The damage from these short-cuts can be amplified by the homophily (birds of a feather flock together) of social networks or the algorithms that sort by our estimated interests.

So what does this mean for our fake news results? We found that fake news articles pack a lot of substance into the title, sometimes even an increased number of analytical words. In spite of this, we also found that the body content of fake articles has very little substance, including high lexical redundancy (they repeat themselves a lot), a lack of analytical words, and a lack of direct quotations. Moreover, we found that much of the fake news’ content structure is similar to the well known, eccentric satire content structure. Since humans are prone to taking short-cuts in trust decisions, and fake news packs all the claims into the title, users may have very little need to open the article to find out more. These fake news titles often present claims about people and entities in complete sentences, associating them with actions. Therefore, titles serve as the main mechanism to quickly make claims which are easy to assess. Thus, we may believe fake news because we are negligent or simply low on energy. This notion is further supported by what we already know about information in social networks: many of the links shared or commented on are never clicked, and thus, only the titles of the articles are ever read (one study that discusses this is Wang, Ramachandran, and Chaintreau 2016).

This finding is concerning, as a person may be convinced of fake news simply because of low energy or mental overload, not just due to a lack of education or a lack of care (Didn’t have lunch today? You may share fake news during your Facebook break! No matter how well educated you are…). Unfortunately, misleading claims in the titles of fake news articles can lead to established beliefs which can be hard to change through reasoned arguments (especially if those fake beliefs are consistent with your previously held beliefs). One possible remedy for these issues is for articles that aim to counter fake claims to pack the counter-claim into their titles, taking advantage of human short-cuts in persuasion.

Overall, this work points out that we can detect fake news to some extent, but it is still our responsibility to take the time to read an article’s arguments and assess the veracity of our biases before sharing information.

We have much more work to come in this area, but in the meantime, please, read the article content and think before sharing.

Benjamin D. Horne


Upstate New York