Build Your Own Fake News Classifier

Jeevanshi Sharma
Published in The Startup
6 min read · Jun 25, 2020


While solving various classification and regression problems as part of my #100DaysOfMLCode journey, I worked on an NLP project that detects fake news.

This post walks you through building your own Fake News Classifier.

Misinformation in the form of fake news has become a characteristic of the 21st century, driven by technologies such as social media platforms that enable information to spread quickly and to be targeted at individual beliefs, biases, and emotions.

Fake News is any news that is factually wrong or misrepresents the facts, and that spreads virally (or to a targeted audience). It can spread through regular news media as well as social media platforms like Facebook, Twitter, WhatsApp, etc.

What truly differentiates Fake News from simple hoaxes like “Moon landing was fake”, etc. is the fact that it carefully mimics the “style” and “patterns” that real news usually follows. That’s what makes it so hard to distinguish for the untrained human eye.


The dataset I used for this Python project is news.csv. It contains News, Title, Text, and Label as attributes. You can download it from here.


# Reading the data
This is how the dataset looks.
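The loading code and the screenshot are not reproduced here. A minimal sketch of both steps, assuming the file is read with pandas: the three rows below are a made-up stand-in for news.csv, and the column names title, text, and label are taken from how they are used later in this post.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for news.csv; the real file is much larger.
sample_csv = io.StringIO(
    "title,text,label\n"
    "Budget passes,The senate approved the budget on Tuesday.,REAL\n"
    "Aliens endorse mayor,Sources say visitors from Mars back the mayor.,FAKE\n"
    "Rain expected,Forecasters predict showers this weekend.,REAL\n"
)
df = pd.read_csv(sample_csv)   # with the real data: pd.read_csv('news.csv')

print(df.shape)                    # (rows, columns)
print(df.head())                   # first few rows
print(df["label"].value_counts())  # how many REAL vs FAKE articles
```

Checking the label counts early is useful because a heavily imbalanced dataset would make plain accuracy a misleading metric later on.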

Before proceeding, check whether your dataset has any null values.

# Check whether any column has NaN values
check_nan_in_df = df.isnull()
print(check_nan_in_df)

This data frame has no null values. If yours does, fill them with spaces before combining columns into a feature. Here’s how:

df = df.fillna(' ')
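Printing the full boolean frame can be hard to read on a large dataset. A more compact check, sketched here on a made-up two-row frame, is to count the nulls per column before filling them. The data and the null_counts name are illustrative assumptions.

```python
import pandas as pd

# Toy frame with one missing title (illustrative data, not from news.csv).
df = pd.DataFrame({
    "title": ["Budget passes", None],
    "text": ["The senate approved the budget.", "No headline was given."],
})

null_counts = df.isnull().sum()   # number of missing values per column
print(null_counts)

df = df.fillna(" ")               # replace missing values with a space
df["total"] = df["title"] + " " + df["text"]
print(df["total"].tolist())
```

Filling first matters because concatenating a string with a missing value yields a missing value, which would silently drop the article text from the combined feature.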

Both the ‘title’ and ‘text’ features are important, so we combine them into a single feature named ‘total’.

df['total'] = df['title'] + ' ' + df['text']
The dataset looks like this.


Preprocessing your text simply means bringing it into a form that is predictable and analyzable for your task. We used the NLTK library for this.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import sent_tokenize, word_tokenize
  1. Removing Stopwords: Stopwords are words in a language that add little meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence. You can read more about it here.
  2. Tokenization: Tokenization is the process of splitting a string of text into a list of tokens. You can think of a token as a part of a larger whole: a word is a token in a sentence, and a sentence is a token in a paragraph. For example:
from nltk.tokenize import word_tokenize
text = "Hello everyone. You are reading NLP article."
word_tokenize(text)

The output looks like this:

['Hello', 'everyone', '.', 'You', 'are', 'reading', 'NLP', 'article', '.']

  3. Lemmatization: Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item.

Text preprocessing can include both stemming and lemmatization. People often confuse the two terms or treat them as the same thing. Lemmatization is preferred over stemming here because it performs a morphological analysis of the words.

Examples of lemmatization:
swimming → swim
rocks → rock
better → good

For a deeper look at stemming vs. lemmatization, check here.
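To make the difference concrete, here is a deliberately simplified sketch. The suffix rules and the three-word dictionary are toy assumptions for illustration only; real stemmers and lemmatizers (such as those in NLTK) are far more thorough.

```python
def toy_stem(word):
    """Crude rule-based stemmer: strip a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary standing in for a real lexicon such as WordNet.
TOY_LEMMAS = {"swimming": "swim", "rocks": "rock", "better": "good"}

def toy_lemmatize(word):
    """Dictionary lookup: return the base form if known, else the word itself."""
    return TOY_LEMMAS.get(word, word)

for w in ("swimming", "rocks", "better"):
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The stemmer turns "swimming" into the non-word "swimm" and leaves "better" untouched, while the dictionary lookup returns proper base forms. One practical caveat: NLTK's WordNetLemmatizer.lemmatize() treats every word as a noun unless you pass a pos argument, so "swimming" and "better" are only reduced to "swim" and "good" with pos='v' and pos='a' respectively.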

The following code does all the pre-processing.

import re

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

for index, row in df.iterrows():
    filter_sentence = ''
    sentence = row['total']
    # Cleaning the sentence with regex (drops punctuation)
    sentence = re.sub(r'[^\w\s]', '', sentence)
    # Tokenization
    words = nltk.word_tokenize(sentence)
    # Stopwords removal
    words = [w for w in words if w not in stop_words]
    # Lemmatization
    for word in words:
        filter_sentence = filter_sentence + ' ' + str(lemmatizer.lemmatize(word)).lower()

    df.loc[index, 'total'] = filter_sentence
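One detail worth checking in a loop like this is the order of operations. NLTK's English stopword list is lowercase, and the loop above lowercases tokens only after the stopword check, so a capitalized "The" survives the filter. The dependency-free sketch below illustrates the effect; the hand-picked stopword set, the whitespace tokenizer, and the preprocess function are simplifying assumptions, not NLTK's behavior.

```python
import re

# Tiny hand-picked stopword set for illustration; NLTK's list is far longer.
STOP_WORDS = {"the", "is", "a", "on", "of"}

def preprocess(text, lowercase_first=True):
    """Strip punctuation, tokenize on whitespace, and drop stopwords."""
    text = re.sub(r"[^\w\s]", "", text)      # remove punctuation
    if lowercase_first:
        text = text.lower()
    tokens = text.split()                     # simple whitespace tokenizer
    kept = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(t.lower() for t in kept)

headline = "The Senate is voting on the budget!"
print(preprocess(headline, lowercase_first=True))
print(preprocess(headline, lowercase_first=False))
```

Whether to lowercase first is a design choice; doing so gives a smaller, cleaner vocabulary, but it will change the features and therefore the accuracy figures reported below.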


The labels here are classified as FAKE and REAL. To train our model, we have to convert them into numerical form.

df.label = df.label.astype(str)
df.label = df.label.str.strip()
# Map each class name to an integer (the name `dict` is avoided because it shadows the built-in)
label_map = {'REAL': 1, 'FAKE': 0}
df['label'] = df['label'].map(label_map)
df.head()
The label feature looks like this.
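A mapping like this silently turns any label it does not recognize into NaN, so it is worth verifying that nothing was lost. A small sketch on made-up, deliberately messy labels; the integer codes and variable names here are assumptions for illustration.

```python
import pandas as pd

labels = pd.Series(["REAL", " FAKE", "FAKE ", "real"])   # made-up, messy labels
label_map = {"REAL": 1, "FAKE": 0}

cleaned = labels.astype(str).str.strip()
encoded = cleaned.map(label_map)

# Any label missing from the mapping becomes NaN, so list those explicitly.
unmapped = cleaned[encoded.isna()]
print(encoded.tolist())
print("Unmapped labels:", unmapped.tolist())
```

Here the stray whitespace is handled by strip(), but the lowercase "real" is not in the mapping and shows up as unmapped, which is the kind of issue you want to catch before training.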

Next, we separate our dataset into input and output features, ‘x_df’ and ‘y_df’.

x_df = df['total']
y_df = df['label']


Vectorization is a technique in NLP that maps words or phrases from a vocabulary to corresponding vectors of real numbers, which can then be used for word prediction and for measuring word similarity or semantics.

If you are curious, check out this article on ‘Why data are represented as vectors in Data Science Problems’.

To make a corpus of documents more palatable for computers, it must first be converted into some numerical structure. Several techniques can achieve this, such as Bag of Words.
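As a rough illustration of the Bag of Words idea, the sketch below builds count vectors by hand with only the standard library. The three short documents and the bag_of_words helper are made up for this example.

```python
from collections import Counter

docs = ["fake news spreads fast", "real news spreads slowly", "fake fake claims"]

# Vocabulary: every distinct word in the corpus, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Return the count of each vocabulary word in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

print(vocab)
for doc in docs:
    print(bag_of_words(doc))
```

Each document becomes a vector with one slot per vocabulary word. Word order is discarded, which is where the "bag" in the name comes from.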

Here, we are using vectorizer objects provided by Scikit-Learn which are quite reliable right out of the box.

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

count_vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and builds the term-count matrix in one call
freq_term_matrix = count_vectorizer.fit_transform(x_df)

tfidf = TfidfTransformer(norm="l2")
tf_idf_matrix = tfidf.fit_transform(freq_term_matrix)


Here, with ‘TfidfTransformer’, we first compute word counts using ‘CountVectorizer’, then the IDF values, and finally the TF-IDF scores. With ‘TfidfVectorizer’ we can do all three steps at once.

The code above gives you a matrix representing your text. It is a sparse matrix with a large number of elements, stored in Compressed Sparse Row format.
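To check that the two routes really agree, the sketch below runs both on a made-up three-document corpus and compares the results. Default settings are assumed for everything except norm.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer,
)

docs = ["fake news spreads fast", "real news spreads slowly", "fake fake claims"]

# Two-step route: raw counts first, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer(norm="l2").fit_transform(counts)

# One-step route: TfidfVectorizer does both in a single call.
one_step = TfidfVectorizer(norm="l2").fit_transform(docs)

print(one_step.shape)   # (number of documents, vocabulary size)
print(np.allclose(two_step.toarray(), one_step.toarray()))
```

Calling toarray() is fine for a toy corpus, but on a real dataset it would materialize a large dense matrix, so it is usually better to keep the sparse format and pass it straight to the model.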

The most commonly used vectorizers are:

  • Count Vectorizer: The most straightforward one, it counts the number of times a token shows up in the document and uses this value as its weight.
  • Hash Vectorizer: This one is designed to be as memory efficient as possible. Instead of storing the tokens as strings, the vectorizer applies the hashing trick to encode them as numerical indexes. The downside of this method is that once vectorized, the features’ names can no longer be retrieved.
  • TF-IDF Vectorizer: TF-IDF stands for “term frequency-inverse document frequency”, meaning the weight assigned to each token depends not only on its frequency in a document but also on how common that term is across the entire corpus. More on that here.
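The TF-IDF weighting described in the last bullet can be worked through by hand. The sketch below uses raw counts as term frequency, the smoothed IDF that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 normalization. The corpus and helper names are made up, and whitespace tokenization is a simplifying assumption.

```python
import math
from collections import Counter

docs = ["fake news spreads fast", "real news spreads slowly", "fake fake claims"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocab}

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_vector(tokens):
    """Raw term counts times IDF, scaled to unit L2 length."""
    counts = Counter(tokens)
    weights = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

for tokens in tokenized:
    print([round(w, 3) for w in tfidf_vector(tokens)])
```

In the first document, "fast" appears in only one document while "fake" appears in two, so "fast" ends up with the larger weight even though both occur once.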


After Vectorization, we split the data into test and train data.

from sklearn.model_selection import train_test_split

# Splitting the data into test data and train data
x_train, x_test, y_train, y_test = train_test_split(tf_idf_matrix, y_df, random_state=0)
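When no test_size is given, train_test_split holds out 25% of the rows for testing. The sketch below shows that default on toy arrays, and additionally passes stratify, which is not used in the call above, to keep the class ratio identical in both splits. The arrays are arbitrary stand-ins for the TF-IDF matrix and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the TF-IDF matrix and label column (values are arbitrary).
X = np.arange(80).reshape(40, 2)
y = np.array([0] * 20 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)                 # 30 training rows, 10 test rows
print(np.bincount(y_train), np.bincount(y_test))   # class counts in each split
```

Fixing random_state makes the split reproducible, which matters when comparing several models on the same held-out rows.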

I fit four ML models to the data: Logistic Regression, Naive Bayes, Decision Tree, and Passive-Aggressive Classifier.

After that, I predicted on the TF-IDF features of the test set and calculated the accuracy, either with each model’s score() method or with accuracy_score() from sklearn.metrics.

  1. Logistic Regression

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(x_train, y_train)
Accuracy = logreg.score(x_test, y_test)


Accuracy: 91.73%

  2. Naive Bayes


from sklearn.naive_bayes import MultinomialNB

NB = MultinomialNB()
NB.fit(x_train, y_train)
Accuracy = NB.score(x_test, y_test)


Accuracy: 82.32%

  3. Decision Tree


from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)
Accuracy = clf.score(x_test, y_test)


Accuracy: 80.49%

  4. Passive-Aggressive Classifier

from sklearn.metrics import accuracy_score
from sklearn.linear_model import PassiveAggressiveClassifier

pac = PassiveAggressiveClassifier(max_iter=50)
pac.fit(x_train, y_train)

# Predict on the test set and calculate accuracy
y_pred = pac.predict(x_test)
score = accuracy_score(y_test, y_pred)
print(f'Accuracy: {round(score*100,2)}%')


Accuracy: 93.12%


The passive-aggressive classifier performed the best here and gave an accuracy of 93.12%.
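The four fits follow the same pattern, so they can also be run in a single loop. The sketch below does this on a made-up miniature corpus; the texts, labels, and random_state values are assumptions, and the scores it prints say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Made-up miniature corpus; 1 = REAL, 0 = FAKE.
train_texts = [
    "senate approves annual budget after debate",
    "city council opens new public library",
    "scientists publish study on ocean temperatures",
    "central bank holds interest rates steady",
    "shocking miracle cure doctors hate revealed",
    "aliens secretly control world governments claim",
    "celebrity clone spotted you will not believe",
    "secret miracle pill melts fat overnight",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = [
    "council approves budget for public library",
    "shocking secret cure revealed overnight",
]
test_labels = [1, 0]

vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(train_texts)   # learn the vocabulary on training text only
x_test = vectorizer.transform(test_texts)         # reuse that vocabulary for the test text

models = {
    "Logistic Regression": LogisticRegression(),
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Passive-Aggressive": PassiveAggressiveClassifier(max_iter=50, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(x_train, train_labels)
    scores[name] = model.score(x_test, test_labels)   # mean accuracy on the test set
    print(f"{name}: {scores[name] * 100:.2f}%")
```

Keeping the models in a dictionary makes it easy to add or swap classifiers while guaranteeing they are all evaluated on exactly the same split.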

We can print a confusion matrix to gain insight into the number of false and true negatives and positives.
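A minimal sketch of that step with scikit-learn's confusion_matrix, using made-up labels in place of y_test and y_pred. Treating REAL (1) as the positive class is an assumption of this example.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration; 1 = REAL, 0 = FAKE.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

With this label order, a false positive is a fake article classified as real, which is arguably the more costly mistake for a fake news detector.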

Check out the code here.


Fake news detection techniques can be divided into those based on style and those based on content, or fact-checking. Too often it is assumed that bad style (bad spelling, bad punctuation, limited vocabulary, using terms of abuse, ungrammaticality, etc.) is a safe indicator of fake news.

More than ever, this is a case where the machine’s opinion must be backed up by clear and fully verifiable indications for the basis of its decision, in terms of the facts checked and the authority by which the truth of each fact was determined.

Collecting the data once isn’t going to cut it given how quickly information spreads in today’s connected world and the number of articles being churned out.

I hope you find this helpful. Feel free to leave any questions in the comments section.