Detecting Fake News With NLP
What is fake news?
In a recent study, Stanford University researchers provide the following definition of fake news: "We define fake news to be news articles that are intentionally and verifiably false, and could mislead readers." When defining fake news, it is important not to confuse it with news bias.
One person may say that New York City has a population of 8.406 million and is very dirty and noisy, while another may say it has a population of 8.406 million but is an amazing city. Both statements are true, so we cannot call either of them fake, but both are biased.
The main goal of my project is to find out whether a news article is fake or not. In the end, the user sees the probability that the article is fake.
To achieve that, I used more than 52,000 articles: 12,000 labeled as fake, downloaded from kaggle.com, plus 29,000 from The Guardian and 12,000 from The New York Times, scraped using their respective APIs.
I limited the corpus to articles from January 2016 to March 2017 in the topics of U.S. news, politics, business, and world news. Since the majority of fake news articles in my corpus fall under business as well as domestic and world politics, it is important that the real news fall into these same categories.
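As a rough illustration of how the real-news side of such a corpus can be collected, here is a minimal sketch of building a query against The Guardian's public content API. The parameter names follow the API's documented search endpoint; `YOUR_API_KEY` is a placeholder, and the date range mirrors the one described above.

```python
# Sketch: construct a Guardian content-API search URL for one section
# and date range. "YOUR_API_KEY" must be replaced with a real key.
from urllib.parse import urlencode

BASE_URL = "https://content.guardianapis.com/search"

def build_query(section, from_date, to_date, api_key, page=1):
    """Return a URL requesting articles from one section and date range."""
    params = {
        "section": section,         # e.g. "politics", "business", "world"
        "from-date": from_date,     # ISO dates, e.g. "2016-01-01"
        "to-date": to_date,
        "show-fields": "bodyText",  # ask for the full article body
        "page": page,
        "api-key": api_key,
    }
    return BASE_URL + "?" + urlencode(params)

url = build_query("politics", "2016-01-01", "2017-03-31", "YOUR_API_KEY")
```

Fetching the URL (for example with `urllib.request` or `requests`) returns JSON whose results can be paged through and saved; the NYT Article Search API works along the same lines with its own parameter names.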
Why Detect Fake News?
Social media and the internet are flooded with fake accounts, fake posts, and fake news. The intention is often to mislead readers and/or manipulate them into purchasing or believing something that isn't real. That is why this project does not filter out news based on its fakeness, since such filtering could be abused to serve a person's or company's interests. Quite the contrary: I give readers the option of reading an article after telling them the probability that it is fake. It is up to the reader whether they still want to read it. In the last section of the post, you can see the website's prototype.
Which one is fake?
As an example of the problem, let's try to detect the fake news among the four headlines given below.
It is hard to detect at first sight, isn't it? Let’s see the answer.
Now let's try to detect the real news among fake ones.
And here is the answer…
As the examples above show, unless we are careful and actively looking for fake or real news, they are hard to detect. Even when we are careful, looking for fakeness, and have only four options, it is challenging to tell the fake news from the real. If we don't read them all carefully, we are often mistaken.
How to detect fake news?
As human beings, when we read a sentence or a paragraph, we interpret the words in the context of the whole document. In this project, we teach a computer how to read and understand the differences between real news and fake news using Natural Language Processing (NLP). We do this with a TF-IDF vectorizer. TF-IDF measures how important a word is to a given article relative to the entire corpus. We will discuss it in the last section.
I collected more than 200,000 articles and filtered them by topic and date range. Eventually, I had 52,000 articles from 2016–2017 in Business, Politics, U.S. News, and World News. 12,000 of them were labeled as fake news and 40,000 were real news. I used the NYT API and The Guardian API to get the real news, and Kaggle's fake news data set for the fake news.
Machine Learning Algorithms
Since this is a classification problem, I started with Logistic Regression, Random Forest, and XGBoost. Logistic regression is a simple algorithm, whereas Random Forest and XGBoost are more advanced. I expected the advanced models to perform better, but the results were surprising.
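A comparison along these lines can be sketched as follows on synthetic data. Since the sketch should need only scikit-learn, `GradientBoostingClassifier` stands in here for XGBoost; the data and settings are illustrative, not the project's.

```python
# Hedged sketch: cross-validated F1 comparison of the three model
# families on synthetic data (GradientBoosting stands in for XGBoost).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=50, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Mean F1 across 5 folds for each model.
scores = {name: cross_val_score(model, X, y, cv=5, scoring="f1").mean()
          for name, model in models.items()}
```

On real article vectors the ranking can differ from intuition, which is exactly the surprise described above.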
After vectorizing the documents with TF-IDF, Random Forest gave an 82% F1 score, and XGBoost gave 65%. The F1 score balances false negatives and false positives, which in our case are equally important, so I optimized my models for F1 rather than accuracy alone. In the end, I got 95% accuracy from Logistic Regression using the body of the article, and reduced the number of columns from 8 million to 5,700 by grid-searching over both the vectorizer and the regression model.
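The kind of grid search described above can be sketched with a scikit-learn pipeline that tunes the TF-IDF vocabulary size together with the regression's regularization strength. The toy corpus and grid values below are invented for illustration; only the `max_features` cap of 5,700 echoes the figure in the text.

```python
# Sketch: grid search over the vectorizer and the classifier jointly.
# Toy documents and labels are made up; 0 = real, 1 = fake.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

toy_docs = [
    "president signs the new budget bill",
    "markets rally after the rate decision",
    "senate passes the trade law",
    "aliens endorse presidential candidate",
    "miracle cure discovered in secret lab",
    "secret moon base controls elections",
]
toy_labels = [0, 0, 0, 1, 1, 1]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "tfidf__max_features": [50, 5700],  # cap the number of columns
    "clf__C": [0.1, 1.0, 10.0],         # inverse regularization strength
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
search.fit(toy_docs, toy_labels)
```

Because the vectorizer sits inside the pipeline, each fold refits the vocabulary on its own training split, so the column cap is chosen without leaking test data.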
The figure above shows the number of columns on the x-axis and the size of the coefficients before the grid search, and the figure below shows them after the grid search.
After getting such a high F1 score and accuracy, every data scientist should ask: "Am I over-fitting? What is my bias-variance trade-off? Can I get the same score with less data?"
To answer those questions, I plotted learning curves using the headline, which show the bias-variance trade-off and whether I need more data or whether less than I have is enough.
After generating the learning curves for the headline and the body, I realized that using the headline is not necessarily helpful for solving my problem.
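Learning curves like the ones described here can be produced with scikit-learn's `learning_curve`; in this sketch, synthetic data stands in for the TF-IDF article vectors.

```python
# Sketch: compute learning-curve scores for a logistic regression.
# Synthetic data stands in for the project's TF-IDF vectors.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=600, n_features=40, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="f1",
)

# A large gap between the mean train and test curves signals high
# variance; two low, converged curves signal high bias.
gap = train_scores.mean(axis=1) - test_scores.mean(axis=1)
```

Plotting the two mean curves against `train_sizes` (for example with matplotlib) reproduces the kind of figure discussed above.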
Here is how the site looks…
After copying and pasting the article, we click the check button, and the result is…
In the graph below, we can see that the train and test curves are far apart, which indicates a high level of variance (over-fitting).
Using the body, however, we see low bias and low variance, and we draw the conclusion that more data helps to improve the metrics.
A next step is combining LDA with cross-validation across news media agencies to check whether they share the same perspective on a given topic. The model will detect the topic and the facts of a news article, do the same for trustworthy articles, and compare the given article's topic with the trustworthy ones. Each news agency will have a weight based on its trustworthiness, and we will set a threshold: if the weight is above the threshold, we will label the article as real news; if not, it will be labeled as fake news.
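This future-work idea can be sketched with scikit-learn's `LatentDirichletAllocation`: fit topics on articles from trusted sources, then compare a new article's topic mixture to the trusted average. The tiny corpus, the similarity measure, and the 0.5 threshold are all illustrative assumptions, not the project's final design.

```python
# Illustrative sketch of the LDA-based trust comparison. The trusted
# corpus, cosine similarity, and threshold are made-up examples.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

trusted_docs = [
    "senate debates the new budget bill",
    "markets react to interest rate decision",
    "president meets foreign leaders at summit",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(trusted_docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
trusted_topics = lda.fit_transform(counts)  # per-document topic mixtures

def topic_similarity(article, threshold=0.5):
    """Cosine similarity between an article's topic mix and the trusted mean."""
    vec = lda.transform(vectorizer.transform([article]))[0]
    mean = trusted_topics.mean(axis=0)
    sim = float(np.dot(vec, mean) / (np.linalg.norm(vec) * np.linalg.norm(mean)))
    return sim, sim >= threshold
```

A production version would weight each agency by trustworthiness before averaging, as the paragraph above describes.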
The code is available at www.github.com/genyunus/Detecting_Fake_News