Fake News Detection Using TfidfVectorizer and PassiveAggressiveClassifier

Raghav Palriwala · Published in Analytics Vidhya · Jul 11, 2020 · 8 min read

I think it’s safe to assume that all of us have come across news articles on our social media feeds that seem too good to be true. More often than not, we see conflicting facts on the same topic and wonder which of them is true. We are left in a fix over which source to put our faith in. Well, not anymore. This task can be made easier using Python and machine learning: we can train a classifier that predicts whether a “news” article is fact or fake.

Also, check out my other posts for more such applications of machine learning algorithms. Share your insights in the comments, and share the articles with your friends to see what they think. You can also follow my articles to build similar models and tweak them to your interests.

What is Fake News?

A type of yellow journalism, fake news encapsulates pieces of news that may be hoaxes and is generally spread through social media and other online media. This is often done to further or impose certain ideas, frequently with political agendas. Such news items may contain false and/or exaggerated claims, may be amplified by recommendation algorithms until they go viral, and may leave users trapped in a filter bubble.

About the Project

This is a project I am working on while learning concepts of data science and machine learning. The goal is to identify whether a “news” article is fake or fact. We will take a dataset of labelled news articles and apply classification techniques on top of a frequency-based vectorizer. We can later test the model’s accuracy and performance on unlabelled articles. Similar techniques can be applied to other NLP applications such as sentiment analysis.

Data

I am using a dataset from kaggle.com which contains the following features:

  • id: unique id for a news article
  • title: the title of a news article
  • author: author of the news article
  • text: the text of the article; could be incomplete
  • label: a label that marks the article as potentially unreliable
    1: unreliable
    0: reliable

Model

We use a TfidfVectorizer to convert our text strings into numerical representations and fit a PassiveAggressiveClassifier on them. In the end, the accuracy score and confusion matrix tell us how well our model works.

Term Frequency(Tf) — Inverse Document Frequency(Idf) Vectorizer

The Tf-Idf Vectorizer is a common algorithm to transform text into a meaningful numerical representation. It is used to extract features from text strings based on word occurrence.

We assume that a higher number of repetitions of a word implies greater importance in the given text. We normalize the occurrence count of the word by the size of the document and call the result term frequency: tf(w) = doc.count(w) / total number of words in the doc

When computing term frequency alone, each term is given equal weight. There may be words with a high occurrence across all documents which therefore contribute little to the meaning of any particular document. Such words, for example ‘a’, ‘the’, etc., might suppress the weights of more meaningful words. To reduce this effect, tf is discounted by a factor called inverse document frequency: idf(w) = log(total_number_of_documents / number_of_documents_containing_word_w)

Tf-Idf is then computed as the product of Tf and Idf. More important words get a higher tf-idf score: tf-idf(w) = tf(w) * idf(w)
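To make this concrete, here is a minimal sketch of vectorizing a toy corpus with scikit-learn’s TfidfVectorizer. The three example sentences are mine, purely for illustration, and note that scikit-learn’s default idf is a smoothed variant of the formula above.

from sklearn.feature_extraction.text import TfidfVectorizer

# A toy corpus, purely for illustration
docs = ['the cat sat on the mat',
        'the dog ate my homework',
        'the cat chased the dog']

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(docs)   # sparse matrix of shape (documents x vocabulary)

# Column order of the matrix, then the tf-idf score of each word in each document;
# words that appear everywhere (like 'the') get low scores, rarer words score higher
print(sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get))
print(matrix.toarray().round(2))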

Passive Aggressive Classifier

The passive-aggressive algorithms are a family of algorithms for large-scale, online learning. Intuitively, passive signifies that if an example is classified correctly, we keep the model unchanged, and aggressive signifies that if it is misclassified, we update the model just enough to correct that example. Unlike most batch learners, it does not iterate to convergence; it keeps making updates that correct the loss on each example it sees.
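As a rough illustration of that intuition, here is a minimal sketch of a single passive-aggressive (PA-I style) update for a linear classifier. The function name, the weight vector w, the feature vector x, the label y in {-1, +1} and the aggressiveness parameter C are all illustrative assumptions, not code from this project.

import numpy as np

def pa_update(w, x, y, C=1.0):
    # Hinge loss of the current model on this one example
    loss = max(0.0, 1.0 - y * np.dot(w, x))
    if loss == 0.0:
        return w                          # passive: correct with margin, keep the model
    tau = min(C, loss / np.dot(x, x))     # aggressive: smallest step that fixes this example
    return w + tau * y * x

# One misclassified sample nudges the weights towards classifying it correctly
w = pa_update(np.zeros(3), np.array([1.0, 0.5, -0.2]), y=1)

In the scikit-learn implementation used below, all of this is handled internally by PassiveAggressiveClassifier.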

Developing the ML Model

Step 1: Import the necessary packages:

import numpy as np
import pandas as pd
import itertools
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Step 2: Load the dataset into pandas data-frame:

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
test = test.set_index('id', drop = True)

Step 3: Read and understand the data. One of the most important steps in creating any ML model is to first prepare the data. This includes cleaning and filtering the data, removing outliers and creating features that are independent and sensible (I will discuss more on this while working on another model).

We use the .shape attribute to identify the number of columns in the dataset and the total number of news samples. Then we read the data table using the .head() method to see what the data looks like. Next, we identify the column containing the article text and the one where the classification is marked.

We then use .isna() to check for null values in the column containing the article text, in this case the column named ‘text’, and .sum() to count how many such values exist. Once identified, we drop the rows where ‘text’ is null, and fill null values in the remaining columns with a blank space.

# Counting number of rows and columns in the data
print('Shape of Training Data: ', train.shape)

# Getting a hang of the data in each column and their names
print('\n \n TRAIN \n', train.head())
print('\n \n TEST \n', test.head())

# Looking for any places where training data has NaN values
print('\n \nNumber of Null values in Train Set: ', train['text'].isna().sum())
print('Number of Null values in Test Set: ', test['text'].isna().sum())

# Dropping all rows where text column is NaN
train.dropna(axis=0, how="any", thresh=None, subset=['text'], inplace=True)
test = test.fillna(' ')


Step 4: Let us now see if we have any outliers in the data. We will do this by checking the length of each article and identifying the range and mean length across all articles. We will use the len() function on the article text, so the length here is measured in characters.

# Checking the length (in characters) of each article
train['length'] = train['text'].apply(lambda text: len(str(text)))
print('Minimum Length: ', min(train['length']),
      '\nMaximum Length: ', max(train['length']),
      '\nAverage Length: ', round(sum(train['length']) / len(train['length'])))

Output:

Minimum Length:  1 
Maximum Length: 142961
Average Length: 4553

We notice that there are articles whose text is only a single character long. Let us now set a minimum length we require to consider a news article valid, which I have set at 50 characters.

We will now see how many articles are shorter than 50 characters and what these articles look like.

# Minimum length is 1. We need to spot such outliers and get rid of them. Counting how many there are
print('Number of articles with fewer than 50 characters: ', len(train[train['length'] < 50]))
# Skimming through such short texts just to be sure
print(train['text'][train['length'] < 50])

Output:

Number of articles with fewer than 50 characters:  207
82
169
173 Guest Guest
196 They got the heater turned up on high.
295
...
20350 I hope nobody got hurt!
20418 Guest Guest
20431 \nOctober 28, 2016 The Mothers by stclair by
20513
20636 Trump all the way!
Name: text, Length: 207, dtype: object

We notice that some articles are simply blank and many others are just random statements, totalling a staggering 207 such articles. Just imagine how much noise these would have added to our model’s understanding of the data. Let us now remove such articles from our dataset and reprint the article-length statistics.

# Removing these outliers; it will reduce noise and overfitting
train = train.drop(train['text'][train['length'] < 50].index, axis = 0)
print('Minimum Length: ', min(train['length']),
      '\nMaximum Length: ', max(train['length']),
      '\nAverage Length: ', round(sum(train['length']) / len(train['length'])))

Output:

Minimum Length:  50 
Maximum Length: 142961
Average Length: 4598

Step 5: One final step before we apply the model is to separate the label column from the rest of the input features, and then divide the dataset into training and testing subsets. We do this split to ensure that our model also performs well on data it has not seen. We take 90% of our data as the training set and 10% as the testing set. This split percentage can be customised to tune the model.

# Secluding labels in a new pandas dataframe for supervised learning
train_labels = train['label']
# Splitting data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(train['text'], train_labels, test_size=0.1, random_state=0)

Step 6: Let’s initialize a TfIdfVectorizer with stop words from the English language and a maximum document frequency of 0.7 (terms with a higher document frequency will be discarded). Stop words are the most common words in a language that are to be filtered out before processing the natural language data. And a TfIdfVectorizer turns a collection of raw documents into a matrix of Tf-Idf features.

Now, fit and transform the vectorizer on the train set, and transform the vectorizer on the test set.

# Setting up Term Frequency - Inverse Document Frequency Vectorizer
tfidf = TfidfVectorizer(stop_words = 'english', max_df = 0.7)
# Fit and transform training set and transform test set
tfidf_train = tfidf.fit_transform(x_train)
tfidf_test = tfidf.transform(x_test)
tfidf_test_final = tfidf.transform(test['text'])

Step 7: Next, we’ll initialize a PassiveAggressiveClassifier. We’ll fit this on tfidf_train and y_train.

Then, we’ll predict on the test set from the TfidfVectorizer and calculate the accuracy with accuracy_score() from sklearn.metrics.

# Setting up Passive Aggressive Classifier
pac = PassiveAggressiveClassifier(max_iter = 50)
# Fitting on the training set
pac.fit(tfidf_train, y_train)
# Predicting on the test set
y_pred = pac.predict(tfidf_test)
score = accuracy_score(y_test, y_pred)
print(f'Accuracy: {round(score * 100, 2)}%')

Output:

Accuracy: 97.08%

We got an accuracy of 97% with this model. We can now print out classification_report() and confusion_matrix() using the sklearn library.

# Confusion matrix: rows are the actual labels (0 = reliable, 1 = unreliable), columns are the predictions
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
df_cm = pd.DataFrame(cm, range(2), range(2))
sn.set(font_scale=1)
sn.heatmap(df_cm, annot=True, annot_kws={'size':14}, fmt='d').set_title('Confusion Matrix')
plt.show()
# Creating classification report
print('\nClassification Report: \n', classification_report(y_test, y_pred))

Output:

Result for Fake News Detection

Results:

We successfully implemented a machine learning and natural language processing model to detect whether an article is fake or fact. We got 1034 articles correctly identified as fake and 962 correctly identified as real. When doing such a classification, it is important to limit the number of false positives, since they cause factual articles to be marked as fake.

Future Work:

I intend to extend this project by adding a graphical user interface (GUI) where one can paste any piece of text and get its classification as a result. Write to me if you have some tips for me!
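Until then, here is a minimal sketch of the kind of helper such an interface could wrap, reusing the fitted tfidf vectorizer and pac classifier from the steps above. The function name and the returned label strings are my own illustrative choices, not part of this project.

# Hypothetical helper reusing the fitted vectorizer and classifier from above
def classify_article(text):
    vec = tfidf.transform([text])    # vectorize the pasted text with the trained TfidfVectorizer
    label = pac.predict(vec)[0]      # 1 = unreliable, 0 = reliable (per the dataset's labels)
    return 'UNRELIABLE' if label == 1 else 'RELIABLE'

print(classify_article('Paste any article text here...'))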

Reference

You can find my code here on Github.

If you liked my work, throw me some appreciation via sharing and following my stories. This will keep me motivated to share with you all as I keep learning newer things!

If you did not like my work, please share your thoughts and recommendations. This will help me improve and write better articles for you next time!

Thank you.
