Combating Fake News with the Help of Machine Learning

Vishwani Gupta
7 min read · Jan 24, 2018


In the age of social media, everyone has witnessed the effect of fake news, most notably during the 2016 US presidential election, as also discussed in the NYTimes. Recently there have been many efforts to mitigate the impact of false stories, and many of them use machine learning to predict the probability that a story is false.

Fake news is not only affecting people’s opinions and hurting the economy, but also damaging democracy. There is an urgent need to combat it.

As a news junkie, I used to get annoyed by fake news articles shared on social media, since social media has become one of my main sources of news. To filter such articles from my feed, I followed the same procedure most of us would: fall back on reliable sources and check whether they carry similar stories on the topic in question.

Nonetheless, searching trusted sources for a similar article is a slow and cumbersome process. To automate it, I turned to Natural Language Processing (NLP) and came up with a simple algorithm that does this tedious work for me. My intention was to give it the link to a dubious story and have the algorithm tell me whether the story can be trusted.

I followed an unsupervised approach for detecting fake news. The main reasons I chose an unsupervised approach were as follows:

  1. One cannot predict real-world events, which may act as outliers in a supervised approach.
  2. Actors in news stories change very frequently, so training a neural network on a fixed set of articles makes little sense once a new actor enters the news.
  3. An unsupervised approach also generalizes easily to any language and region, without training on a huge dataset.

Regardless of Donald Trump’s claims, I trust the following sources, as also cited by Forbes:

CNN, Washington Post, NYTimes, WSJ, CNBC, TIME, Al-Jazeera.

The algorithm finds the 20 most related articles from each of these sources and generates a distance measure. Using this distance measure, it then classifies the query article as fake or genuine.
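In outline, the decision rule can be sketched as follows. The 0.65 threshold comes from later in this post; the way the per-source distances are summarized (median of each source’s best match) is my own simplification, not necessarily the exact rule the algorithm uses:

```python
# Sketch of the decision rule: gather distances from each trusted
# source's most related articles, then compare a summary statistic
# against a threshold (0.65, as discussed later in the post).
THRESHOLD = 0.65

def classify(distances_per_source):
    """distances_per_source: {source_name: [distance, ...]}"""
    # Take each source's best (smallest) distance to the query article.
    best = [min(d) for d in distances_per_source.values() if d]
    if not best:
        return "needs more proof"
    # Use the median of the best distances as the summary statistic.
    median = sorted(best)[len(best) // 2]
    return "genuine" if median < THRESHOLD else "fake"

print(classify({"CNN": [0.2, 0.4], "NYTimes": [0.3], "WSJ": [0.25]}))  # -> genuine
```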

When one of my friends shared a fake story claiming that ‘US banks begin closing customer accounts caught using Bitcoin’, I decided to test my approach.

My algorithm first scrapes articles related to Bitcoin from these reliable news sources, taking into account the time period in which the query article was published. It then uses text mining to evaluate a similarity score/distance between those articles and the article in question, making sure that the idea behind the story and its facts are captured.
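To give a feel for what such a distance looks like, here is a deliberately simplified stand-in: cosine distance over bag-of-words counts. The real pipeline uses much richer text mining (including the embeddings described below), so this is only a toy illustration:

```python
import math
from collections import Counter

def cosine_distance(text_a, text_b):
    """1 - cosine similarity over simple bag-of-words counts.
    (The real pipeline uses richer text mining; this is a toy stand-in.)"""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the words the two texts share.
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)

# Identical stories -> distance near 0; unrelated stories -> distance 1.
print(cosine_distance("bitcoin banks close accounts",
                      "bitcoin banks close accounts"))  # -> 0.0
```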

As expected, the similarity scores were very low and the distances between the reliable source articles and query article were very high.

Note: In the following plots, the X-axis represents the reliable sources and the Y-axis the distance between each source’s articles and the query article. For closely related articles these distances should be low, and vice versa.

Result: Fake News. In this plot, the Y-axis is the distance between articles about Bitcoin from reliable sources and the query article. The greater the distance, the lower the similarity.

I used box plots because they are a useful way to visualize the range and distribution of the different articles’ distances from the query article. Details on box plots can be found here.
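A plot like the ones in this post could be produced with matplotlib roughly as follows; the distance values here are hypothetical, just to show the shape of the figure:

```python
# Sketch of the per-source distance box plots (hypothetical distance
# values; the real ones come from the algorithm's text mining step).
import matplotlib
matplotlib.use("Agg")          # render off-screen, no display needed
import matplotlib.pyplot as plt

distances = {
    "CNN":     [0.81, 0.86, 0.90],
    "NYTimes": [0.78, 0.84, 0.88],
    "WSJ":     [0.80, 0.83, 0.91],
}
fig, ax = plt.subplots()
box = ax.boxplot(list(distances.values()))
ax.set_xticks(range(1, len(distances) + 1))
ax.set_xticklabels(distances.keys())
ax.axhline(0.65, linestyle="--", color="red", label="threshold = 0.65")
ax.set_xlabel("Reliable source")
ax.set_ylabel("Distance to query article")
ax.legend()
fig.savefig("distances.png")
```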

At one point, the algorithm uses word embeddings as one of its text mining steps to compute the distance/similarity between articles. This may make NLP researchers wonder about relatively new words that do not have enough examples to yield good word vectors, since the Word2vec model, which generates word vectors, needs a huge dataset to produce high-quality vectors. This is taken care of by training the word vectors on a much smaller dataset using the Kernel PCA Skip-gram model (developed as part of my master’s thesis). More details can be found in the following post:

Kernel PCA Skip gram Model
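The generic idea of comparing documents through word embeddings can be sketched as follows. The vectors below are made-up 3-dimensional toys; the actual KPCA Skip-gram embeddings from the thesis are trained on a real corpus and have far more dimensions:

```python
import math

# Toy 3-d "embeddings"; the post's KPCA Skip-gram vectors would be
# trained on a real corpus and have many more dimensions.
vectors = {
    "bitcoin": [0.9, 0.1, 0.0],
    "banks":   [0.7, 0.3, 0.1],
    "crypto":  [0.8, 0.2, 0.1],
    "sports":  [0.0, 0.9, 0.4],
}

def doc_vector(words):
    """Average the vectors of the words we have embeddings for."""
    known = [vectors[w] for w in words if w in vectors]
    n = len(known)
    return [sum(v[i] for v in known) / n for i in range(3)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Related stories end up closer in embedding space than unrelated ones.
related = cosine(doc_vector(["bitcoin", "banks"]), doc_vector(["crypto"]))
unrelated = cosine(doc_vector(["bitcoin"]), doc_vector(["sports"]))
```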

Intrigued by the fake-article results, I also tested the algorithm on genuine articles like:

Result: Genuine News. All the articles from these reliable sources are at low distances.

I found that most of the sources had reported the same story, and hence the distances between the query article and the reliable-source articles were very small.

From these two examples, I saw that it is possible to set a threshold on the distance plots separating the two cases: when the query article is genuine and when it is fake.

Thresholding for fake articles:

The threshold value was determined by testing a large set of articles (containing both fake and genuine ones) with this algorithm. The set also included articles from different news sections such as politics, entertainment, sports, and technology.

Based on the results of these tests, a threshold value of 0.65 proved easy to interpret when using Kernel PCA embeddings. If the distance is below 0.65, we can define a probability distribution of trust in the article.

Whereas if the distance is greater than 0.65, we can define a probability distribution of mistrust in the article.
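One simple way to turn a distance into such a trust probability is a piecewise-linear map around the threshold. This particular formula is my own illustration, not necessarily the distribution the algorithm uses:

```python
THRESHOLD = 0.65

def trust_score(distance):
    """Map a distance in [0, 1] to a trust probability:
    below the threshold -> trust (score above 0.5),
    above the threshold -> mistrust (score below 0.5),
    scaled linearly on each side. Illustrative only."""
    if distance <= THRESHOLD:
        return 0.5 + 0.5 * (THRESHOLD - distance) / THRESHOLD
    return 0.5 * (1.0 - distance) / (1.0 - THRESHOLD)

print(trust_score(0.2))   # well below 0.65 -> high trust
print(trust_score(0.9))   # well above 0.65 -> low trust
```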

Some special scenarios:

I ran the algorithm on articles from various scenarios, for example articles that have a false headline but a true body, and vice versa.

For example:

Although the article’s title is genuine, the story itself is false, as it claims that Chelsea Clinton practices Satanism. Therefore the distances are above the threshold. Here are the results:

Result: Fake News. The body of the article contains fake news.

Another example, this time of a genuine news story, and the results: as expected, all the distances are below the threshold.

Result: Genuine News. This followed the same trend, with distances below the 0.65 threshold.

Now one might object that the evaluated article could be very recent news that has not been published elsewhere yet. My algorithm takes care of this and notifies the user to test again in a few minutes.
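This check can be sketched as a minimum-coverage rule: if some trusted sources match but too few of them, ask the user to retry later. The cutoff of four sources is an assumption of mine for illustration; the post only shows that three matches was treated as “needs more proof”:

```python
THRESHOLD = 0.65
MIN_SOURCES = 4  # assumed cutoff; the post treats 3 matches as not enough

def verdict(distances_per_source):
    """Return the verdict, asking the user to retry when too few
    reliable sources have published a matching story yet."""
    matching = [s for s, d in distances_per_source.items()
                if d and min(d) < THRESHOLD]
    if 0 < len(matching) < MIN_SOURCES:
        return "genuine, but needs more proof - try again in a few minutes"
    return "genuine" if matching else "fake"

# Just-published story: only three sources have picked it up so far.
print(verdict({"CNN": [0.3], "WSJ": [0.4], "TIME": [0.2]}))
```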

An example scenario is:

Result: Genuine News, but needs more proof. This was tested just after the article was published. Here one can see that only three sources had reported a similar article with a very low distance (high similarity).

Here it can be seen that only three of the reliable sources had reported on the same news. When tested again after some time to accumulate more proof, one can see that in the interval CNN also published a story on the topic:

More proof accumulates as time passes.

Future Work:

Based on this thresholding and the number of fake news articles a source has published, I can assign a reliability score to each source, including novice news sources. This will push news sources to be more careful about what they publish.

Facebook measures for fighting fake news:

In pursuit of “meaningful social interactions”, Facebook has recently announced that content directly from publishers won’t perform well unless people engage with it. This means there will be fewer news articles on users’ Facebook walls. According to the WSJ, this is especially bad because 45% of Americans get their news from Facebook.

Helping the crusade against fake news:

The algorithm in this article offers a solution to this problem. There could be an initial check before any news post appears on a Facebook wall: if the distance is below the threshold, the post could appear. Furthermore, the number of reliable sources carrying a similar article should be weighed into the scoring.
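One way to fold source coverage into such a pre-posting check is to score a link by the fraction of trusted sources whose closest article falls under the threshold. This scoring function is hypothetical, my own sketch of the idea rather than the post’s actual formula:

```python
THRESHOLD = 0.65

def feed_score(distances_per_source, total_sources=7):
    """Hypothetical pre-posting score for a shared link: the fraction
    of trusted sources (7 are listed in this post) whose closest
    article falls under the distance threshold."""
    matches = sum(1 for d in distances_per_source.values()
                  if d and min(d) < THRESHOLD)
    return matches / total_sources

# Only 2 of the 7 trusted sources corroborate the story.
print(feed_score({"CNN": [0.2], "WSJ": [0.3]}))
```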

The use of KPCA embeddings allows me to generalize the algorithm to any language and region, even languages for which one doesn’t have a huge dataset.

Conclusion:

Using this algorithm, I can protect myself from fake news articles published on social media by dubious sources, without missing the genuine articles from those same sources.

Once fake news publishers realize that this system can be used at scale and will filter their fake news out of people’s feeds (negatively affecting their business), they will be compelled to be more responsible with the content they share.

Details of the algorithm will soon be published in a scientific paper.

If you are interested in how this can be a stepping stone, here are some more cool examples.


Vishwani Gupta

Applied machine learning enthusiast. Trying to make a difference in this society.