Deep Learning for Fake News Detection: Part I — Exploratory Data Analysis
In this series of articles, I would like to show how we can use deep learning for fake news detection and compare several neural network architectures.
This is the first part of the series, in which I perform exploratory data analysis on the text data.
The topic of fake news is as old as the news industry itself: misinformation, hoaxes, propaganda, and satire have long existed. In short, fake news is information that cannot be verified, lacks sources, and may be untrue.
Wikipedia puts it this way: “Fake news is written and published usually with the intent to mislead in order to damage an agency, entity, or person, and/or gain financially or politically, often using sensationalist, dishonest, or outright fabricated headlines to increase readership. Similarly, clickbait stories and headlines earn advertising revenue from this activity.”
Fake news undermines serious media coverage and makes it more difficult for journalists to cover significant news stories. An analysis by BuzzFeed found that the top 20 fake news stories about the 2016 U.S. presidential election received more engagement on Facebook than the top 20 election stories from 19 major media outlets. Anonymously hosted fake news websites lacking known publishers have also been criticized because they make it difficult to prosecute sources of fake news for libel.
In the 21st century, the impact of fake news has become widespread. The Internet has grown to unimaginable size, with an endless stream of information arriving all the time, which makes it a host for plenty of unwanted, untruthful, and misleading content that anyone can produce. Fake news has moved from email chains to social media. Besides referring to made-up stories designed to deceive readers into clicking on links and maximizing traffic and profit, the term has also been applied to satirical news, whose purpose is not to mislead but rather to inform viewers and share humorous commentary about real news and the mainstream media. So the problem is a big one. Let's try to detect fake news with a deep learning approach.
For my training data, I will use an open Kaggle competition dataset. Let's start with a brief exploratory data analysis.
The dataset has the following columns:
- id: unique id for a news article
- title: the title of a news article
- author: author of the news article
- text: the text of the article; could be incomplete
- label: a label that marks the article as potentially unreliable
  - 1: unreliable
  - 0: reliable
In my EDA, I will analyze the title, text, and label columns.
As a first step, let's look at the class distribution.
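Below is a minimal sketch of loading the data with pandas; the file name train.csv and the idea of dropping rows with missing title or text are my assumptions about how the Kaggle files are stored and cleaned:

```python
import pandas as pd

# Load the Kaggle training data (assumed to be saved locally as train.csv)
df = pd.read_csv("train.csv")

# Keep only the columns used in this EDA and drop rows with missing title or text
df = df[["title", "text", "label"]].dropna(subset=["title", "text"])
print(df.shape)
print(df.head())
```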
As we can see, the classes in the dataset are balanced.
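A quick way to check the class balance, building on the `df` DataFrame from the loading sketch above:

```python
import matplotlib.pyplot as plt

# Count articles per class: 1 = unreliable, 0 = reliable
class_counts = df["label"].value_counts()
print(class_counts)

class_counts.plot(kind="bar", title="Class distribution")
plt.xlabel("label")
plt.ylabel("number of articles")
plt.show()
```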
The next step is to analyze the title and text columns, grouped into fake and non-fake news. Text statistics are simple but very insightful visualization techniques.
First, I'll look at the number of characters in each news title, grouped by label. This gives us a rough idea of headline length.
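One possible way to plot the character counts, assuming the `df` DataFrame from the earlier sketch (the bin count and colors are arbitrary choices):

```python
import matplotlib.pyplot as plt

# Number of characters in each title
df["title_len"] = df["title"].str.len()

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
df[df["label"] == 0]["title_len"].hist(ax=axes[0], bins=50, color="green")
axes[0].set_title("Reliable (0): title length in characters")
df[df["label"] == 1]["title_len"].hist(ax=axes[1], bins=50, color="red")
axes[1].set_title("Unreliable (1): title length in characters")
plt.show()
```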
I made the same analysis for the news text.
The main insight is that, without any preprocessing, both the titles and the texts of fake news are shorter than those of reliable news.
Now, I will move on to word-level exploration. Let's plot the number of words in each news title and text by label.
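A sketch of the word-level counts, again assuming the `df` DataFrame from above and using a simple whitespace split as the tokenizer:

```python
import matplotlib.pyplot as plt

# Number of words in each title and text (simple whitespace split)
df["title_words"] = df["title"].str.split().map(len)
df["text_words"] = df["text"].str.split().map(len)

# Average word counts per class, then a quick box plot for the titles
print(df.groupby("label")[["title_words", "text_words"]].mean())
df.boxplot(column="title_words", by="label")
plt.show()
```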
I made the same analysis for the news text.
Again, the insight is that fake news titles and texts contain fewer words than reliable ones.
Word frequency without stop words
The next step is the analysis without stopwords. Stopwords are the most commonly used words in a language, such as “the”, “a”, and “an”. Since these words are short, they may have caused the distributions above to be skewed. To get a corpus of stopwords, you can use the NLTK library, which ships stopword lists for many languages. Since we are only dealing with English news, I will filter out the English stopwords from the corpus.
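A small sketch of counting the most frequent non-stopword tokens, assuming the `df` DataFrame from earlier; the simple lowercasing and `isalpha` filter are my own minimal preprocessing choices:

```python
from collections import Counter

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

def top_words(series, n=10):
    """Count the most frequent non-stopword tokens in a column of texts."""
    counter = Counter()
    for doc in series:
        tokens = [w for w in doc.lower().split()
                  if w.isalpha() and w not in stop_words]
        counter.update(tokens)
    return counter.most_common(n)

print("Fake titles:    ", top_words(df[df["label"] == 1]["title"]))
print("Reliable titles:", top_words(df[df["label"] == 0]["title"]))
```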
I made the same analysis for the news text.
The most frequent words in the titles and texts of fake and reliable news are different, and their ranking differs as well.
N-gram analysis
The next step is n-gram analysis. An n-gram is simply a contiguous sequence of n words, for example “river bank” or “The Three Musketeers”. A sequence of two words is called a bigram, a sequence of three words a trigram, and so on.
Looking at the most frequent n-grams gives a better understanding of the context in which words are used. To build a representation of our vocabulary, we will use CountVectorizer, a simple tool that tokenizes the corpus and represents it as a matrix of token counts.
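A sketch of extracting the top bigrams with scikit-learn's CountVectorizer, assuming a recent scikit-learn version and the `df` DataFrame from above:

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_ngrams(corpus, ngram_range=(2, 2), n=10):
    """Return the n most frequent n-grams in the corpus."""
    vec = CountVectorizer(ngram_range=ngram_range, stop_words="english").fit(corpus)
    bag = vec.transform(corpus)
    counts = bag.sum(axis=0).A1  # total count of each n-gram over all documents
    return sorted(zip(vec.get_feature_names_out(), counts),
                  key=lambda x: x[1], reverse=True)[:n]

# Top bigrams in fake vs. reliable titles
print(top_ngrams(df[df["label"] == 1]["title"]))
print(top_ngrams(df[df["label"] == 0]["title"]))
```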
I made the same analysis for the news text.
Here we can see that for fake news the most frequent n-grams of the title and the text are different, while for reliable news they are the same.
LDA
Let's apply topic modeling to compare fake and reliable news. Topic modeling is the process of using unsupervised learning techniques to extract the main topics that occur in a collection of documents.
Latent Dirichlet Allocation (LDA) is an easy-to-use and efficient model for topic modeling. Each document is represented as a distribution over topics, and each topic is represented as a distribution over words.
Once we categorize our documents into topics, we can dig further into the data for each topic or topic group.
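A rough sketch of how this could look with scikit-learn, building on the `df` DataFrame from earlier; the number of topics and the vectorizer parameters here are my own untuned assumptions:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def show_topics(corpus, n_topics=5, n_top_words=8):
    """Fit LDA on a document-term matrix and print the top words of each topic."""
    vectorizer = CountVectorizer(stop_words="english", max_df=0.9, min_df=5)
    dtm = vectorizer.fit_transform(corpus)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    lda.fit(dtm)
    words = vectorizer.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        top = [words[j] for j in topic.argsort()[-n_top_words:][::-1]]
        print(f"Topic {i}: {' '.join(top)}")
    return lda, dtm, vectorizer

# Topics for reliable news titles
lda_rel, dtm_rel, vec_rel = show_topics(df[df["label"] == 0]["title"])
```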
The first analysis is LDA on the news titles, starting with the reliable ones. Let's look at the topics:
We can also visualize the result with the pyLDAvis library in Python:
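A minimal sketch of producing the interactive visualization, assuming the `lda_rel`, `dtm_rel`, and `vec_rel` objects from the LDA sketch above; note that the scikit-learn helper module was renamed between pyLDAvis versions:

```python
import pyLDAvis
# Older pyLDAvis versions expose the scikit-learn helper as pyLDAvis.sklearn;
# from version 3.4 it was renamed to pyLDAvis.lda_model.
import pyLDAvis.lda_model as sklearn_vis

panel = sklearn_vis.prepare(lda_rel, dtm_rel, vec_rel)
pyLDAvis.save_html(panel, "lda_titles_reliable.html")
```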
Let's do the same for the fake news titles.
As we can see, the topics of fake and reliable news titles are different.
Let's run the same analysis for the news text.
Here is the text topic analysis for fake news:
And the same analysis for reliable news:
The results show that the topics of fake and reliable news are different. One more insight is that, for fake news, the topics of the title and the text differ.
The last stage of my exploratory data analysis of the text is word cloud analysis. A word cloud is a great way to represent text data: the size and color of each word indicate its frequency or importance. Creating a word cloud in Python is easy, but we need the data in the form of a corpus. Luckily, I prepared it in the previous section.
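A short sketch using the wordcloud library, assuming the `df` DataFrame from earlier; here the corpus is simply all fake-news titles joined into one string:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Join all fake-news titles into one corpus string
fake_title_corpus = " ".join(df[df["label"] == 1]["title"])

wc = WordCloud(stopwords=STOPWORDS, background_color="white",
               width=800, height=400).generate(fake_title_corpus)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.title("Fake news titles")
plt.show()
```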
Here is the word cloud analysis of the news titles:
And, for comparison, the word cloud analysis of the news text:
In the results, we can see how often particular words are used in the titles and texts of fake and reliable news.
Conclusions
Exploratory data analysis of text gives us several techniques for comparing fake and reliable news. With this approach, we could craft our own rules to detect fake news, but that path is quite difficult and requires a lot of routine work. Also, in this example we can see that the dataset is dominated by news about the United States presidential election, so it would be hard to derive general rules and a general style of fake news from this data alone.
Let's give the deep learning approach a chance to do this automatically. See you in the next part of this series.
You can find all the code in the Git repository: link.