Fake News Detection Using Logistic Regression & Decision Tree Classifier

Supervised Machine Learning for detecting fake or real news

Published in

Courage to Write

4 min read · Jul 6, 2024

Rapid changes in the digital world create new problems, one of which is fake news that can endanger social life. That is why we need a fake news detector that flags such news as early as possible, before it spreads. With so much data available on the internet, we can train a supervised machine learning model to simplify and speed up fake news detection.

In this project, we tried two methods for classifying news: logistic regression and decision tree. Logistic regression produces a probability between 0 and 1, which is then turned into a binary classification (in this case, “0” for “fake” and “1” for “real” news). The model passes its output through a sigmoid function, and a threshold of 0.5 on the resulting probability determines whether the news is classified as “0” or “1” (more about logistic regression: here).
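The sigmoid-plus-threshold step described above can be sketched in a few lines. This is a minimal illustration rather than the project's code: `z` stands for the linear score a trained model would compute from the text features, and the function names are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(z: float, threshold: float = 0.5) -> int:
    """Return 1 ("real") if the probability reaches the threshold, else 0 ("fake")."""
    return 1 if sigmoid(z) >= threshold else 0


# A positive score maps above 0.5 (real), a negative score below 0.5 (fake).
label_for_positive_score = classify(2.0)
label_for_negative_score = classify(-2.0)
```

With a threshold of 0.5, the decision depends only on the sign of `z`, because `sigmoid(0)` is exactly 0.5.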

We also used a decision tree to compare classification accuracy. Decision trees are widely used in machine learning, especially for classification. The method classifies new data based on previously seen training data by modelling the relationship between different variables (more about decision tree: here).

Installations and Libraries

Before we go through the code, here are some packages that should be installed and imported first:

# Installing Packages

!pip install pandas
!pip install seaborn
!pip install matplotlib
!pip install tqdm
!pip install nltk
!pip install wordcloud

# Importing Libraries

import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt

Datasets and Gist Codes

Datasets

Then, we need to download the dataset to our local machine. The dataset contains news from many fields, such as politics, health, and others. In this project, we’ll use health news, politics news, and all news as training data. Here are the train-test combinations that we’ll evaluate:

All news as training data and politics news as test data
All news as training data and health news as test data
Politics news as both training and test data
Health news as both training and test data
Politics news as training data and health news as test data
Health news as training data and politics news as test data

Dataset can be downloaded: here

To prevent bias during training, we should train on the same number of fake and real news articles (we’ll drop the surplus from the larger class).
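One way to drop the surplus is to downsample every class to the size of the smallest one. The sketch below assumes the labels live in a column called `label` (0 for fake, 1 for real); the column name, the helper `balance_classes`, and the toy rows are illustrative and not taken from the gists.

```python
import pandas as pd


def balance_classes(df: pd.DataFrame, label_col: str = "label", seed: int = 42) -> pd.DataFrame:
    """Downsample every class to the size of the smallest class."""
    smallest = df[label_col].value_counts().min()
    # Draw the same number of rows from each class.
    balanced = df.groupby(label_col, group_keys=False).sample(n=smallest, random_state=seed)
    # Shuffle so the classes are interleaved rather than grouped together.
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy data: 5 fake (0) and 3 real (1) articles.
news = pd.DataFrame({
    "text": [f"article {i}" for i in range(8)],
    "label": [0, 0, 0, 0, 0, 1, 1, 1],
})
balanced_news = balance_classes(news)
```

Fixing `random_state` keeps the dropped rows the same across runs, which makes accuracy numbers easier to reproduce.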

Gist Codes

Gist code for all news as data training: here

Gist code for health as data training: here

Gist code for politics as data training: here

Results and Analysis

After training on all news, politics news, and health news, we obtain the classifier accuracies shown below.

Training accuracy using logistic regression (Log) and decision tree (DT)

The training accuracy of the decision tree is higher than that of logistic regression for all news, politics news, and health news.
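For readers who want to reproduce this kind of comparison, here is a minimal sketch using scikit-learn. It assumes TF-IDF features, which the article does not specify (the gists define the actual preprocessing); the function `compare_models` and the toy texts are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier


def compare_models(train_texts, train_labels, test_texts, test_labels):
    """Fit both classifiers on TF-IDF features and return train/test accuracy per model."""
    vectorizer = TfidfVectorizer()
    x_train = vectorizer.fit_transform(train_texts)  # learn the vocabulary on training data only
    x_test = vectorizer.transform(test_texts)        # reuse that vocabulary for the test data

    models = {
        "Log": LogisticRegression(max_iter=1000),
        "DT": DecisionTreeClassifier(random_state=0),
    }
    results = {}
    for name, model in models.items():
        model.fit(x_train, train_labels)
        results[name] = {
            "train": accuracy_score(train_labels, model.predict(x_train)),
            "test": accuracy_score(test_labels, model.predict(x_test)),
        }
    return results


# Toy corpus: 0 = fake, 1 = real.
train_texts = [
    "miracle cure doctors hate",
    "shocking secret cure revealed",
    "senate passes budget bill",
    "health ministry publishes vaccine report",
]
train_labels = [0, 0, 1, 1]
test_texts = ["secret miracle cure", "ministry publishes budget report"]
test_labels = [0, 1]

results = compare_models(train_texts, train_labels, test_texts, test_labels)
```

An unpruned decision tree can usually memorize its training set, so a high training accuracy on its own says little about how well the model generalizes; the test accuracy is the more informative number.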

We can also look at the test accuracy for the various kinds of training data, as shown below.

Test accuracy using logistic regression (Log) and decision tree (DT)

As shown above, the test accuracy of the decision tree is generally somewhat higher than that of logistic regression. However, with health news as both training and test data, the decision tree’s test accuracy is slightly lower than that of logistic regression. Accuracy is low for both methods when the training data is of a different type than the test data. With all news as training data and health news as test data, the accuracy is the lowest among all the variations we tested. When we checked the all news data itself, we found that it is mostly about politics (health news makes up less than 10% of the data). That is probably why we got the lowest accuracy, 47.87%, with all news as training data and health news as test data.
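A quick way to check how a dataset is split across topics is to compute the share of rows per topic. The sketch below assumes a topic column named `subject`, which is a hypothetical name, and uses made-up rows rather than the project's data.

```python
import pandas as pd


def topic_share(df: pd.DataFrame, topic_col: str = "subject") -> pd.Series:
    """Return the fraction of rows per topic, largest first."""
    return df[topic_col].value_counts(normalize=True)


# Toy data: nine politics articles and one health article.
toy = pd.DataFrame({"subject": ["politics"] * 9 + ["health"]})
shares = topic_share(toy)
```

Running a check like this before training makes an imbalance between topics visible early, so a low cross-topic accuracy is less of a surprise.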

Conclusions

From this project we can conclude several interesting things, such as:

On the training data, the decision tree classifier achieves better accuracy than logistic regression, reaching up to 98%.
Test accuracy increases when the training data is of the same type as the test data.
Test accuracy is below 80% when the test data is of a different type than the training data.

This project was modified by Syazwanar, Almira Chusnul, and Abdussalam Al Fansuri as a Research Based Learning Project for the Simulation and Modelling of Physical Systems course. You can find the original project here.