Breaking news: ML can help detect fake news, and this isn’t fake news 😉

Serjan Kaur
9 min readFeb 4, 2022


Whether it is Web 1.0, 2.0, or Web3, fabricated news has always existed, even before the Internet. Fake news can be defined as articles that are purposefully made up to deceive readers. These articles are used as a tactic to increase readership or as part of psychological warfare to profit through clickbait. Enticing, exaggerated, flashy headlines and clickbait are used to spark curiosity and increase advertising revenue.

It would be an understatement to say that false news, when wielded effectively, provides a weapon unlike any other. Fake news has been omnipresent for a long time, but it has lately gotten a lot of attention because of the massive quantity of disinformation surrounding the novel coronavirus.

This article aims to walk you through a feasible approach to detecting and filtering out articles containing misleading information by building a machine learning model using Python and NLP.

Don't know what NLP is? Fair enough, I got you!

Natural Language Processing

NLP is basically a crossover between computer science and computational linguistics. Computers use binary digits (0, 1) to communicate even for the most basic tasks, while humans have roughly 6,500 different languages, each with its own syntax. NLP is used for language translators, social media monitoring, chatbots, and survey analysis.

Let’s go over the Fundamentals

  • Tokenization: It is a process of splitting text objects into smaller units called tokens. Tokens can be numbers, texts, or symbols.
  • Part-of-speech-tagging: POS is categorizing texts into lists of words where they are segregated according to whether the word is a noun, adjective, verb, and so on.
  • Stemming and lemmatization: The process of removing prefixes and suffixes to extract the base form of a word. Lemmatization takes into account the context and converts the word to its meaningful base form, whereas stemming only removes the last few characters, resulting in inaccurate interpretations and spelling mistakes.
  • Stop word removal: As the name suggests, it is the elimination of words that appear in a large number of documents in the corpus. Articles and pronouns, for example, are commonly characterized as stop words. (A quick sketch of these steps follows right after this list.)
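To make these terms concrete, here is a minimal, hypothetical sketch using NLTK (my assumption; this project’s model only uses stemming and stop word removal, but NLTK covers the other steps too, and the exact download names can vary slightly between NLTK versions):

# a quick illustration of the fundamentals using NLTK
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

# one-time downloads of the resources NLTK needs
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

sentence = "The studies were exaggerating the findings"

# tokenization: split the text into individual tokens
tokens = word_tokenize(sentence)
print(tokens)

# stop word removal: drop common words that carry little meaning
filtered = [w for w in tokens if w.lower() not in stopwords.words('english')]
print(filtered)

# stemming chops off endings, lemmatization maps to a meaningful base form
print(PorterStemmer().stem('studies'))           # 'studi'
print(WordNetLemmatizer().lemmatize('studies'))  # 'study'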

Getting the environment set up

For starters, to set up the environment I’ll be using Google Colab because it’s a fantastic tool for tasks requiring deep learning. It’s a hosted Jupyter notebook that doesn’t require installation and has a great free tier that gives you access to Google computing resources like GPUs and TPUs.

Let’s go over the code

I’ll break down the code down below, but here is the Colab link!

Start by importing the libraries :

# importing libraries
import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# the stopwords list has to be downloaded once before it can be used
nltk.download('stopwords')

NumPy: Numerical Python is a Python library that includes multidimensional array and matrix data structures. I used it to create and manipulate data arrays. It basically did all the math stuff we don’t really want to do.

Pandas: another open-source library, which we use for data analysis.

re: the regular expressions module, used to search for patterns in strings.

Other Essentials

Stopwords: removes words such as articles and prepositions (“the”, “a”, “an”, “in”) that do not add much value.

PorterStemmer: implements the Porter stemming technique. This is the procedure for stripping suffixes from a word to reveal the root word.

TfidfVectorizer(): a text vectorizer that uses Term Frequency (TF) and Inverse Document Frequency (IDF) to convert text into usable vectors. TF is the number of occurrences of a specific term in a document, and it indicates how important that term is in that document. Document Frequency (DF) is the number of documents containing a specific term; IDF is its inverse, which downweights terms that appear in many documents, since very common terms carry less information.
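To build a bit of intuition, here is a tiny, made-up example of what TfidfVectorizer produces on a toy corpus (the corpus and variable names are mine, not from the project; get_feature_names_out is available in recent scikit-learn versions):

# a toy corpus: terms that appear in many documents get a lower IDF weight
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "breaking news about the election",
    "fake news spreads fast",
    "election results are in",
]

toy_vectorizer = TfidfVectorizer()
tfidf_matrix = toy_vectorizer.fit_transform(corpus)

print(toy_vectorizer.get_feature_names_out())  # the learned vocabulary
print(tfidf_matrix.toarray())                  # one TF-IDF vector per document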

LogisticRegression(): a classification algorithm that predicts the probability of a categorical dependent variable. In other words, the logistic regression model predicts P(Y=1) as a function of X.

The theory behind logistic regression and the TfidfVectorizer is extensive and complex, and it would take more than one blog post to explain fully. Furthermore, I would still be unable to explain them as precisely as their creators, so I will refrain from doing so.

Importing the Datasets

Importing datasets is fundamental when working on a machine learning model. A dataset is a set of data that is usually presented in tabular format, with each column denoting a distinct variable. The dataset we’re using has 5 variables: id, title, author, text, and label. The label signifies whether a news item is real or fake.

0 = real

1 = fake

Here is the dataset I used! Kaggle provided the dataset for this project as a CSV file.

# loading the dataset to a pandas DataFrame
news_dataset = pd.read_csv('train.csv')
news_dataset.shape

The read_csv() function in Pandas imports a CSV file into a DataFrame.

It’s time to Pre-Process the Data

1. Swapping empty strings with missing values

We count the number of missing values in the dataset, and the null values are replaced with empty strings.
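There is no snippet shown for this step, so here is a minimal sketch of how it is typically done with pandas (assuming the same news_dataset DataFrame loaded above):

# counting the number of missing values in each column
print(news_dataset.isnull().sum())

# replacing the missing (null) values with empty strings
news_dataset = news_dataset.fillna('')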

Purpose: to deal with the missing values in our dataset, since they do not add any value; replacing them with empty strings keeps the text columns usable when we combine them later.

2. Merging the author name and news title

Make a new variable called ‘content’ by combining the author and title columns. Then separate the data (content) from the label and assign x and y variables respectively.

# merging the author name and news title
news_dataset['content'] = news_dataset['author'] + ' ' + news_dataset['title']
print(news_dataset['content'])
# separating the data & label
X = news_dataset.drop(columns='label', axis=1)
Y = news_dataset['label']
print(X)
print(Y)

Purpose: instead of focusing on the ‘text’ column from the dataset, I decided to combine ‘author’ and ‘title’, which is the data I’ll use for prediction since it is shorter than ‘text’.

3. Stemming time!

As I’ve explained before, stemming is the process of reducing a word to its root word.

For instance : Playing , Plays, Played → Play

port_stem = PorterStemmer() creates the stemmer object. The Porter stemming algorithm (also known as the ‘Porter stemmer’) is a method for removing inflexional endings from English words. This is done because it is important to reduce words to their roots to get the most accurate outcome.

The next step seems complicated but it is really not, let me break it down!

We create a new function labeled ‘stemming’

re.sub('[^a-zA-Z]', ' ', content): we’re using the .sub command to substitute certain values and the regular expression library (re) to search for patterns in the data. The command keeps only alphabetic characters and subs in spaces for punctuation, numbers, and other non-alphabetic data.

stemmed_content.lower(): changes all data into lowercase letters

stemmed_content.split() : gives you a list of all the words in the string or line.

[port_stem.stem(word) for word in stemmed_content if not word in stopwords.words('english')]: each word in the list passes through a list comprehension; all stop words are removed, and then stemming is performed on the remaining words.

Finally, using .join, all the words are combined back into a single string.
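Putting those pieces together, the stemming function looks roughly like this (a sketch reconstructed from the breakdown above):

port_stem = PorterStemmer()

def stemming(content):
    # keep only letters; replace numbers, punctuation and symbols with spaces
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    # lowercase everything and split into a list of words
    stemmed_content = stemmed_content.lower()
    stemmed_content = stemmed_content.split()
    # drop stop words and reduce the remaining words to their root form
    stemmed_content = [port_stem.stem(word) for word in stemmed_content
                       if not word in stopwords.words('english')]
    # join the words back into a single string
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content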

4. Stemming is applied to all data

# applying stemming to the data
news_dataset['content'] = news_dataset['content'].apply(stemming)

Purpose: this leaves us with data that is strictly lowercase and alphabetical and does not include any numbers, punctuation, or stop words, which makes it easier to transform the data into numerical data later on.

5. Separating the content (data) and the label, and assigning the x and y variables
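In code, that separation looks something like this (note that the lowercase x and y defined here are the ones used in the snippets that follow):

# separating the content (data) and the label
x = news_dataset['content'].values
y = news_dataset['label'].values

print(x)
print(y)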

6. Convert texts to numbers

Now that we’re done removing stop words and performing stemming, it’s time to convert the textual data into numerical data using TfidfVectorizer!

# converting the textual data to numerical data
vectorizer = TfidfVectorizer()
vectorizer.fit(x)
x = vectorizer.transform(x)
print(x)

Create a variable called “vectorizer” and store the TfidfVectorizer there. The vectorizer is then fitted (.fit) on the x value, which means all the data under the x variable passes through the vectorizer so it can learn the vocabulary and IDF weights. Then .transform converts all the values into their respective features.

Purpose: since our computer cannot understand textual data, it is CRUCIAL to convert it into numerical data for it to be processed.

7. Next we split the dataset into training and testing data

# split between train and test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=2)

The datasets are divided in two parts

  • Train Dataset: This is the data set that is used to fit the machine learning model.
  • Test Dataset: This dataset is used to assess how well a machine learning model fits.

The goal is to estimate the machine learning model’s performance on new data that was not used to train the model.

test_size=0.2 means 20% of the total dataset is allocated to testing.

stratify=y keeps the proportion of fake and real news in the splits similar to the one in the original dataset.

Purpose: you test a model by making predictions against the test set after it has been trained on the training set. It is simple to determine whether the model’s guesses are correct because the data in the testing set already contains known values for the attribute that you want to predict.

8. Finallyyy it’s time to train the model using logistic regression

This one line of code does all the work we don’t really want to :)
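Here is roughly what that looks like, using the LogisticRegression class imported at the top (a sketch; the original snippet isn’t shown in the post):

# training the logistic regression model on the training data
model = LogisticRegression()
model.fit(x_train, y_train)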

The variable ‘model’ is assigned to the function LogisticRegression().

.fit will train the model by fitting the sigmoid function curve used by logistic regression to the training data.

As previously stated, this machine learning classification model aids in predicting the probability of a categorical dependent variable, where the dependent variable is a binary variable coded as 1 or 0 (in our case, 1 = fake and 0 = real).

The sigmoid formula maps the model’s output to a probability between 0 and 1, which is compared against a threshold (0.5 by default); if the prediction is greater than the threshold, the value is 1, which in this project means fake news; if it is less than the threshold, the value is 0, which means real news.
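If you want to peek at those probabilities yourself, scikit-learn exposes them through predict_proba (an optional sketch of mine; .predict already applies the threshold for you):

# probabilities for the first few test samples: column 0 = P(real), column 1 = P(fake)
print(model.predict_proba(x_test[:5]))

# .predict applies the 0.5 threshold and returns the final 0/1 labels
print(model.predict(x_test[:5]))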

9. Let’s check the accuracy on the training and testing datasets
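The accuracy snippet isn’t shown either, so here is a sketch using the accuracy_score function from the imports:

# accuracy score on the training data
x_train_prediction = model.predict(x_train)
training_data_accuracy = accuracy_score(y_train, x_train_prediction)
print('Accuracy on the training data:', training_data_accuracy)

# accuracy score on the test data
x_test_prediction = model.predict(x_test)
test_data_accuracy = accuracy_score(y_test, x_test_prediction)
print('Accuracy on the test data:', test_data_accuracy)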

As you can see the accuracy is pretty great, so good job!

10. Predictive System

Lastly, we make a predictive system, so that if new data is introduced to the model, it can still predict its label correctly.

We’re going to store the first row [0] from the x_test dataset in a new variable called x_new. Next, we use the model we previously trained (model.predict) to predict the label of x_new.

Then we introduce an if statement which prints “the news is real” if the prediction is 0, or else prints “fake news” if the prediction output is 1.
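Here is roughly what that looks like in code (a sketch matching the description above):

# taking the first row of the test data as 'new' data
x_new = x_test[0]

prediction = model.predict(x_new)
print(prediction)

if prediction[0] == 0:
    print('The news is real')
else:
    print('The news is fake')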

That’s it, You see it was pretty simple!

Conclusion & Takeaways

Building this project sure was a process but I am glad I went through it. I’ve learned quite a bit about building machine learning models and got some experience in working with libraries & algorithms. In the beginning, I was doubting and questioned the process but in retrospect, I would strongly recommend it to anyone who is just getting started and willing to learn.

  • It won’t do any good if you only copy-paste all the steps; understand EVERY line of code you’re writing. Make notes because you will forget about them later. Truly understand what you’re doing!
  • Don’t make the project (much like anything else) for the sake of doing it, do it for the learning experience and to expand your skills.
  • Take breaks! If you’re stuck on the same thing for a long time, it’s probably time to step back and work on something else. When you get back to debugging you’ll look at it from a different perspective/light.

Aaaand we’re done!!!!!

S/O to the YouTube tutorial I followed!

I really hope you found this article interesting. If that’s the case, you can find more of my fascinating stuff here:

Hey! Thanks for reading till the end :) I am a 16-year-old AI enthusiast and an innovator at The Knowledge Society (TKS).

My LinkedIn: https://www.linkedin.com/in/serjan-kaur-

My Twitter: https://twitter.com/serjannnnnn

Subscribe to my monthly newsletters: https://serjan.substack.com/
