NLP — Detecting Fake News On Social Media

Implementing and comparing Bag of Words and TF-IDF to build a model to detect fake news

Photo by Matthew Guay on Unsplash

The advent of the World Wide Web and the rapid adoption of social media platforms (such as Facebook and Twitter) paved the way for information dissemination on a scale never before witnessed in human history. With the current usage of social media platforms, consumers are creating and sharing more information than ever before, some of which is misleading, with no relevance to reality.

The following program helps in programmatically identifying whether a news article is fake or not. Let us first understand the two feature extraction techniques I have used to build the model:

  • Bag of Words (BOW)
    In simple terms, it is a collection of words used to represent a sentence by its word counts, mostly disregarding the order in which the words appear.
  • TF-IDF
    TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It is computed by multiplying two metrics: how many times a word appears in a document (term frequency), and the inverse document frequency of the word across the set of documents.

Table Of Contents

  1. About the Dataset
  2. Data Pre Processing
  3. Exploratory Data Analysis
  4. Natural Language Processing
  5. Training and Validation (Bag of words & TF IDF)
  6. Summary
  7. Future Work
  8. References

№1: About the Dataset

Each sample in the train and test set has the following information:

  • The title of the news article.
  • The text of the news article corresponding to each title.
  • The subject of the news article.
  • The date of the news article.

I am predicting whether a given news article is real or not: if real, predict 1; if not, predict 0.

The dataset can be accessed from Kaggle.

№2: Data Pre Processing

Importing all the required libraries
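
The import cell is embedded as a gist on Medium; as a sketch, the imports this pipeline relies on (assuming pandas, scikit-learn, matplotlib, and NLTK, which the later steps use) might look like:

```python
import re                        # regex-based text cleaning

import pandas as pd              # data loading and manipulation
import matplotlib.pyplot as plt  # plots for the EDA section
import nltk                      # stopwords and lemmatization

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix
```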

Downloading the dataset from Kaggle
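
The download can be scripted with the Kaggle CLI; a sketch, assuming the dataset slug `clmentbisaillon/fake-and-real-news-dataset` and that you have an API token (`kaggle.json`) from your Kaggle account settings:

```shell
# Install the CLI and put the API token where it is expected
pip install kaggle
mkdir -p ~/.kaggle && cp kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

# Download and unpack the dataset (yields True.csv and Fake.csv)
kaggle datasets download -d clmentbisaillon/fake-and-real-news-dataset
unzip -o fake-and-real-news-dataset.zip
```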

Accessing a few sample records
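
In the notebook this is a `head()` call on the loaded CSVs; here is a self-contained sketch with a toy frame standing in for `True.csv` (the column names match the Kaggle dataset, the rows are invented):

```python
import pandas as pd

# Toy stand-in for pd.read_csv("True.csv"); columns mirror the Kaggle files
true_df = pd.DataFrame({
    "title":   ["Senate passes budget bill", "Markets rally on jobs data"],
    "text":    ["The Senate on Thursday ...", "Stocks rose sharply after ..."],
    "subject": ["politicsNews", "politicsNews"],
    "date":    ["December 21, 2017", "December 8, 2017"],
})

# Peek at a few sample records
print(true_df.head(2))
```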

№3: Exploratory Data Analysis

Now we will go through an exploratory data analysis to get insights from the news articles. The aim here is to divide this section into topics so we can explore graphics for each one.

Labels distribution

Size of dataset downloaded

We add a column to each dataset to identify the real and fake news articles. We denote real news with 1 and fake news with 0.

Now we combine the two individual datasets so that we can analyze the complete dataset.

As we are running the analysis on the news titles, the remaining columns are not required.
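
The labeling, concatenation, and column-selection steps above can be sketched like this (toy frames stand in for the real CSVs; the column name `label` is my choice, the original notebook may use another):

```python
import pandas as pd

# Toy stand-ins for the two Kaggle files
true_df = pd.DataFrame({"title": ["Senate passes budget bill"],
                        "text": ["..."], "subject": ["politicsNews"],
                        "date": ["December 21, 2017"]})
fake_df = pd.DataFrame({"title": ["Aliens endorse candidate"],
                        "text": ["..."], "subject": ["News"],
                        "date": ["December 25, 2017"]})

# Tag each frame: real news -> 1, fake news -> 0
true_df["label"] = 1
fake_df["label"] = 0

# Combine the two frames into one dataset for analysis
news = pd.concat([true_df, fake_df], ignore_index=True)

# Only the titles are analyzed, so keep just title and label
news = news[["title", "label"]]
print(news)
```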

The labels seem to be evenly distributed. This is a good sign and confirms that the dataset is not biased.
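
On the real data, `shape` and `value_counts` are what reveal the dataset size and the near-even label split; a sketch on a toy frame:

```python
import pandas as pd

# Toy combined dataset of titles and labels (1 = real, 0 = fake)
news = pd.DataFrame({
    "title": ["Senate passes budget bill", "Aliens endorse candidate",
              "Markets rally on jobs data", "Moon made of cheese, study says"],
    "label": [1, 0, 1, 0],
})

print(news.shape)                    # (rows, columns) of the combined data
print(news["label"].value_counts())  # count of real vs fake titles
```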

№4: Natural Language Processing

Cleaning, Formatting and Lemmatization

Before processing the text, let us check whether any rows/columns have null values.
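
A sketch of the null check, plus one way to handle any hits, shown on a toy frame with a deliberately missing title:

```python
import pandas as pd

# Toy frame with one missing title
news = pd.DataFrame({"title": ["Senate passes budget bill", None],
                     "label": [1, 0]})

# Count nulls per column before any text processing
print(news.isnull().sum())

# Drop rows whose title is missing so the cleaning steps never see NaN
news = news.dropna(subset=["title"]).reset_index(drop=True)
```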

Let us now remove all string punctuation (like !”#$%&’()*+,-./:;<=>?@[\]^_`{|}~). We can achieve this by simply keeping the characters in [a-z] and [A-Z] and replacing everything else with a space.

Also, let us lowercase all the text so that, when stemming/lemmatization is applied, words spelled in capitals are not treated differently from the same words in small letters.
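
Both steps fit in one small helper: a regex keeps only the letters and everything else becomes a space, then the result is lowercased.

```python
import re

def clean_title(title: str) -> str:
    # Keep a-z / A-Z, replace every other character with a space, then lowercase
    return re.sub("[^a-zA-Z]", " ", title).lower()

print(clean_title("BREAKING: Senate passes $1.5T bill!"))
```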

Next, we remove stopwords, words that can safely be ignored without sacrificing the meaning of the sentences.

In the end, we apply lemmatization to convert each word to its base form.

№5: Training and Validation (Bag of words & TF IDF)

We split the data into training and validation sets.
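
A sketch of the split on toy titles and labels (`stratify` keeps the real/fake balance in both halves):

```python
from sklearn.model_selection import train_test_split

# Toy cleaned titles with labels (1 = real, 0 = fake)
titles = ["senate passes budget bill", "aliens endorse candidate",
          "markets rally on jobs data", "moon made of cheese study says",
          "court upholds tax ruling", "miracle cure hidden by doctors"]
labels = [1, 0, 1, 0, 1, 0]

# Hold out a third for validation; stratify preserves the class balance
X_train, X_val, y_train, y_val = train_test_split(
    titles, labels, test_size=0.33, random_state=42, stratify=labels)

print(len(X_train), "train /", len(X_val), "validation")
```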

Applying ML Models using Sklearn (Bag of words)

Multinomial Naive Bayes models are well suited to text features extracted using Bag of Words.
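
A minimal sketch of the Bag of Words + Multinomial Naive Bayes combination on toy data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Toy training and validation titles (1 = real, 0 = fake)
X_train = ["senate passes budget bill", "markets rally on jobs data",
           "aliens endorse candidate", "miracle cure hidden by doctors"]
y_train = [1, 1, 0, 0]
X_val = ["senate passes jobs bill", "aliens hidden cure"]
y_val = [1, 0]

# Bag of Words: each title becomes a vector of raw word counts
bow = CountVectorizer()
X_train_bow = bow.fit_transform(X_train)
X_val_bow = bow.transform(X_val)  # reuse the training vocabulary

# Multinomial NB expects exactly this kind of non-negative count feature
model = MultinomialNB()
model.fit(X_train_bow, y_train)
preds = model.predict(X_val_bow)
print("validation accuracy:", accuracy_score(y_val, preds))
```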

Evaluating Results(Bag of Words)

We write the below functions to build the confusion matrix visualizations.
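
A sketch of such a helper using matplotlib directly (the original notebook may style it differently):

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def plot_confusion(y_true, y_pred, labels=("fake", "real")):
    # Build the 2x2 matrix and draw it as an annotated heatmap
    cm = confusion_matrix(y_true, y_pred)
    fig, ax = plt.subplots()
    ax.imshow(cm, cmap="Blues")
    ax.set_xticks([0, 1]); ax.set_xticklabels(labels)
    ax.set_yticks([0, 1]); ax.set_yticklabels(labels)
    ax.set_xlabel("Predicted"); ax.set_ylabel("Actual")
    for i in range(2):
        for j in range(2):
            ax.text(j, i, cm[i, j], ha="center", va="center")
    return cm

cm = plot_confusion([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])
```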

Implementing hyperparameter tuning (Bag of Words)

Multinomial Classifier with Hyperparameter

With Bag of Words we achieved a maximum accuracy of 81%.
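
For Multinomial Naive Bayes the main knob is the smoothing parameter `alpha`; a sketch of tuning it with `GridSearchCV` over a toy corpus (the grid values here are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

titles = ["senate passes budget bill", "markets rally on jobs data",
          "court upholds tax ruling", "aliens endorse candidate",
          "miracle cure hidden by doctors", "moon made of cheese study says"]
labels = [1, 1, 1, 0, 0, 0]

# Cross-validate several smoothing strengths and keep the best
pipe = Pipeline([("bow", CountVectorizer()), ("nb", MultinomialNB())])
search = GridSearchCV(pipe, {"nb__alpha": [0.1, 0.5, 1.0]},
                      cv=3, scoring="accuracy")
search.fit(titles, labels)
print(search.best_params_, search.best_score_)
```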

Implementing TF-IDF

Implementing TF-IDF on the corpus extracted earlier
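
A sketch of the vectorization step (toy corpus; on the real data the input would be the cleaned titles):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["senate passes budget bill",
          "senate debates budget",
          "aliens endorse candidate"]

# Each title becomes a vector of term-frequency * inverse-document-frequency
# weights, so words shared by many titles count for less
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)

print(X.shape)  # (number of documents, vocabulary size)
```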

Applying ML Models using Sklearn & Evaluating Results (TF-IDF)

Implementing hyperparameter tuning (TF-IDF)

Multinomial Classifier with Hyperparameter
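
The recipe mirrors the Bag of Words run, swapping `CountVectorizer` for `TfidfVectorizer`; a sketch on the same toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

titles = ["senate passes budget bill", "markets rally on jobs data",
          "court upholds tax ruling", "aliens endorse candidate",
          "miracle cure hidden by doctors", "moon made of cheese study says"]
labels = [1, 1, 1, 0, 0, 0]

# TF-IDF features feeding the same Multinomial NB, tuned over alpha
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
search = GridSearchCV(pipe, {"nb__alpha": [0.1, 0.5, 1.0]},
                      cv=3, scoring="accuracy")
search.fit(titles, labels)

preds = search.predict(["senate passes jobs bill"])
print(search.best_params_, preds)
```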

With TF-IDF we are able to achieve an accuracy of 93%. Great!

№6: Summary

  • We downloaded the Fake News Dataset from Kaggle.
  • We performed the NLP preprocessing and EDA to understand the label distribution.
  • We trained the model using both Bag of Words and TF-IDF features.
  • We tuned hyperparameters to achieve maximum accuracy.
  • We analyzed the accuracy: with TF-IDF we achieved 93% and with Bag of Words 81%, so we conclude that TF-IDF performed better than Bag of Words.

№7: Future Work

Although this notebook is only a start, there are many possibilities to build further on our analysis.

  • Try implementing Word2Vec to further improve the accuracy.
  • Implement techniques like LSTMs and evaluate the results.
  • Build a neural network and check its performance.

№8: References

I really hope you learned something from this post. Feel free to 👏 if you like what you learned. Let me know if there is anything you need my help with.

Happy Learning 😃


