Simple news classifier using a Logistic Regression

Guillermo Fernandez
3 min read · May 21, 2020

The goal is to train a model on news headlines so it can classify them as fake or real news.

All the code featured here is available in the GitHub repository: https://github.com/frogfreg/data-mining-project/blob/master/fakenews-by-title.ipynb

To achieve this, we will use the Python programming language along with the NLTK, pandas, sklearn, and NumPy libraries.

We will be using the Anaconda Python distribution. You can download it from https://www.anaconda.com/products/individual.

We use a Jupyter notebook to write and execute the code.

The first step is to import the libraries and load the datasets.

The datasets we use are publicly available here:

https://www.kaggle.com/antmarakis/fake-news-data

https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset

import pandas as pd
import numpy as np
import re
import nltk
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
# Load only the headline column from each dataset
dff1 = pd.read_csv('Datasets/fake-news/Fake.csv', usecols=['title'])
dft1 = pd.read_csv('Datasets/fake-news/True.csv', usecols=['title'])
dff2 = pd.read_csv('Datasets/fake-news/fake2.csv', usecols=['title'])
dft2 = pd.read_csv('Datasets/fake-news/real2.csv', usecols=['title'])
# Drop rows with missing headlines
dff1 = dff1.dropna()
dft1 = dft1.dropna()
dff2 = dff2.dropna()
dft2 = dft2.dropna()
# Label the data: 0 means fake news, 1 means real news
dff1['class'] = 0
dft1['class'] = 1
dff2['class'] = 0
dft2['class'] = 1

Then we merge all datasets into one.

frames = [dff1, dft1, dff2, dft2]
dfn = pd.concat(frames)
dfn.info()

At this point, the merged data has two columns: title, holding the headline text, and class, holding the label (0 for fake, 1 for real).
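As a quick check (not shown in the original post), we can confirm how many fake and real headlines the merged frame contains:

dfn['class'].value_counts()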

Now we need to clean the text, so we must remove punctuation, tokenize, remove stopwords, and lemmatize.

We can find a useful article to learn about these concepts here: https://towardsdatascience.com/nlp-for-beginners-cleaning-preprocessing-text-data-ae8e306bef0f.

# Download the NLTK resources used below (only needed once)
nltk.download('stopwords')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
def preprocessor(text):
    # Lowercase the text and replace any non-word characters with spaces
    return re.sub(r'[\W]+', ' ', text.lower())
dfn['title'] = dfn['title'].apply(preprocessor)
tokenizer = RegexpTokenizer(r'\w+')
# Split each headline into a list of word tokens
dfn['title'] = dfn['title'].apply(tokenizer.tokenize)
lemmatizer = WordNetLemmatizer()
def lemmat(text):
    # Reduce each word to its base form (e.g. 'cities' -> 'city')
    return [lemmatizer.lemmatize(i) for i in text]
def remove_stopwords(text):
    # Drop common words such as 'the', 'is', and 'in'
    return [word for word in text if word not in stop_words]
def untokenize(tokens):
    # Join the tokens back into a single string
    return " ".join(tokens)
dfn['title'] = dfn['title'].apply(remove_stopwords)
dfn['title'] = dfn['title'].apply(lemmat)
dfn['title'] = dfn['title'].apply(untokenize)
dfn.head()

After these steps, each title should be a single lowercase string with punctuation and stopwords removed and every word reduced to its base form.
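To see the effect on a single example, here is an illustrative check (not part of the original notebook) that runs a made-up headline through the same cleaning steps:

sample = "Witnesses Describe Explosions Near The Cities"
print(" ".join(lemmat(remove_stopwords(tokenizer.tokenize(sample.lower())))))
# Prints something like: witness describe explosion near city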

We now divide the data into two sets, one for training and one for testing, using a 70/30 split. The logistic regression will receive a tf-idf matrix, so we must transform the data first. More info about tf-idf here: https://towardsdatascience.com/natural-language-processing-feature-engineering-using-tf-idf-e8b9d00e7e76.

tfidf = TfidfVectorizer()
label = {0: 'fake', 1: 'true'}
# Hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(dfn['title'], dfn['class'], test_size=0.3, random_state=50)
# Fit the vectorizer on the training set only, then transform both sets with it
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
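As an optional sanity check (not in the original post), we can inspect the shapes of the resulting sparse matrices; the column count is the size of the vocabulary learned from the training set:

print(X_train_tfidf.shape, X_test_tfidf.shape)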

We now use the sets to train the logistic regression. An overview of Logistic Regression is available here: https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc.

clf = LogisticRegression(random_state=0).fit(X_train_tfidf, y_train)
# score() reports the mean accuracy on the test set
clf.score(X_test_tfidf, y_test)

The displayed score should be around 0.94, meaning the classifier labels roughly 94% of the test headlines correctly.
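Accuracy alone can hide class-specific errors. Here is a minimal sketch (not part of the original notebook) of a fuller evaluation using scikit-learn's classification_report:

from sklearn.metrics import classification_report
# Per-class precision, recall, and F1 on the held-out test set
y_pred = clf.predict(X_test_tfidf)
print(classification_report(y_test, y_pred, target_names=['fake', 'true']))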

Now we test it on the headline from https://www.economist.com/asia/2020/05/20/china-punishes-australia-for-promoting-an-inquiry-into-covid-19. This headline does not appear in any of the datasets and comes from a reputable source, so the classifier should label it as real news.

headline= "China punishes Australia for promoting an inquiry into covid-19"
test_headline = [" ".join(lemmat(remove_stopwords(tokenizer.tokenize(headline.lower()))))]
test_headline_tfidf = tfidf.transform(test_headline )

We print the prediction as follows:

prediction = label[clf.predict(test_headline_tfidf)[0]]
probability = np.max(clf.predict_proba(test_headline_tfidf)) * 100
print('Prediction: %s\nProbability: %.2f%%' % (prediction, probability))

The output shows a prediction of true, with a high probability.

This time, our classifier nailed it!
Just don’t expect it to work every time. The classifier could still be improved, for example with more training data or a stronger model.
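If you want to keep experimenting, the whole flow can be wrapped in a small helper. Here is a hypothetical convenience function (not in the original post) that reuses the fitted vectorizer and classifier defined above:

def classify_headline(headline):
    # Clean the headline with the same steps used on the training data
    cleaned = " ".join(lemmat(remove_stopwords(tokenizer.tokenize(headline.lower()))))
    vec = tfidf.transform([cleaned])
    # Return the predicted label and the model's confidence in it
    return label[clf.predict(vec)[0]], np.max(clf.predict_proba(vec))
print(classify_headline("China punishes Australia for promoting an inquiry into covid-19"))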
