TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Photo credit: Unsplash

COVID Fake News Detection with a Very Simple Logistic Regression

Natural Language Processing, NLP, Scikit Learn

3 min read · Jul 22, 2020


This time, we are going to create a simple logistic regression model to classify COVID news as either true or fake, using the data I collected a while ago.

The process is surprisingly simple and easy. We will clean and pre-process the text data, stem the tokens with the NLTK library, extract TF-IDF features and build a logistic regression classifier with the Scikit-Learn library, save the model, and evaluate its accuracy at the end.

The Data

The data set contains 586 true news articles and 578 fake ones, an almost 50/50 split. Because of bias in how the data was collected, I decided not to use “source” as one of the features. Instead, I will combine “title” and “text” into a single feature, “title_text”.

fake_news_logreg_start.py
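
A minimal sketch of this step, assuming the data is held in a pandas DataFrame. The toy rows and the “label” column name are assumptions for illustration only; the “title”, “text”, “source” and “title_text” names come from the description above.

```python
import pandas as pd

# Toy stand-in for the real data set; the rows and the "label" column name are hypothetical.
df = pd.DataFrame({
    "title": ["Masks reduce spread", "Garlic cures COVID"],
    "text": ["Health agencies recommend masks.", "Eating garlic kills the virus."],
    "source": ["site-a", "site-b"],
    "label": ["TRUE", "FAKE"],
})

# Combine title and text into one feature; fill missing values so the join never yields NaN.
df["title_text"] = df["title"].fillna("") + " " + df["text"].fillna("")

# Leave "source" out of the features because of the collection bias.
df = df.drop(columns=["source"])

# Check the class balance.
print(df["label"].value_counts())
```

Dropping the column is just one way to keep “source” out of the model; simply never passing it to the vectorizer would work as well.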

Pre-processing

Let’s have a look at an example of the title and text combination:

df['title_text'][50]

Looking at the above example, the title and text are pretty clean, so simple text pre-processing will do the job. We will strip out any HTML tags and punctuation, and convert the text to lowercase.

fake_news_logreg_preprocessing.py
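
One possible way to write this cleaning step, using only the standard library. The function name preprocessor and the sample string are hypothetical, and the regular expression is a simple tag stripper rather than a full HTML parser.

```python
import re
import string

def preprocessor(text):
    # Hypothetical helper: remove HTML tags, then punctuation, then lowercase.
    text = re.sub(r"<[^>]*>", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Lowercase and collapse runs of whitespace into single spaces.
    return " ".join(text.lower().split())

print(preprocessor("<p>COVID-19: Is it <b>REAL</b>?</p>"))
# covid19 is it real
```

On a DataFrame this could be applied with df['title_text'].apply(preprocessor).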

The following code combines tokenization and stemming into one function, which we will apply to “title_text” later.

from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
    # Split on whitespace, then reduce each token to its Porter stem.
    return [porter.stem(word) for word in text.split()]

TF-IDF

Here we transform the “title_text” feature into TF-IDF vectors.

  • Because we already converted “title_text” to lowercase, we set lowercase=False.
  • Because we already applied our own pre-processing to “title_text”, we set preprocessor=None.
  • We override the default string tokenization step by passing the tokenizer_porter function defined earlier as the tokenizer.
  • Set use_idf=True to enable inverse-document-frequency reweighting.
  • Set smooth_idf=True to add one to the document frequencies, as if one extra document contained every term, which avoids zero divisions.
fake_news_logreg_tfidf.py
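
The settings in the list above can be sketched as follows. The three toy documents are hypothetical stand-ins for the cleaned “title_text” column, and the variable names are assumptions.

```python
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

porter = PorterStemmer()

def tokenizer_porter(text):
    # Whitespace tokenization followed by Porter stemming.
    return [porter.stem(word) for word in text.split()]

# Toy, already-cleaned documents standing in for df["title_text"].
docs = [
    "masks reduce the spread of the virus",
    "garlic cures the virus",
    "vaccines are being tested",
]

tfidf = TfidfVectorizer(
    lowercase=False,             # text was lowercased during pre-processing
    preprocessor=None,           # pre-processing was already applied
    tokenizer=tokenizer_porter,  # replaces the default tokenization step
    use_idf=True,                # inverse-document-frequency reweighting
    smooth_idf=True,             # add one to document frequencies
)

# Rows are documents, columns are stemmed terms.
X = tfidf.fit_transform(docs)
print(X.shape)
```

Scikit-Learn may warn that token_pattern is unused once a custom tokenizer is supplied; the warning is harmless here.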

Logistic Regression for Document Classification

  • Instead of tuning the C parameter manually, we can use the LogisticRegressionCV estimator, which selects C by cross-validation.
  • We specify the number of cross-validation folds, cv=5, to tune this hyperparameter.
  • The model is scored on classification accuracy.
  • By setting n_jobs=-1, we use all available CPU cores.
  • We raise the maximum number of iterations of the optimization algorithm so that the solver has room to converge.
  • We use pickle to save the model.
fake_news_logreg_model.py
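
A rough sketch of the training step described in the list above, on a toy corpus. The documents, the max_iter value, the random_state and the file name are all assumptions made for illustration, and the default TfidfVectorizer is used to keep the example short.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV

# Toy corpus (hypothetical); 1 = true news, 0 = fake news.
true_docs = [f"health agency confirms vaccine trial result {i}" for i in range(10)]
fake_docs = [f"miracle garlic cure kills virus overnight {i}" for i in range(10)]
texts = true_docs + fake_docs
labels = [1] * 10 + [0] * 10

X = TfidfVectorizer().fit_transform(texts)

clf = LogisticRegressionCV(
    cv=5,                # 5-fold cross-validation to choose C
    scoring="accuracy",  # pick the C with the best accuracy
    n_jobs=-1,           # use all CPU cores
    max_iter=300,        # assumed value; raise it if the solver warns about convergence
    random_state=0,
)
clf.fit(X, labels)

# Save the fitted model with pickle; the path is hypothetical.
model_path = os.path.join(tempfile.gettempdir(), "fake_news_logreg.pkl")
with open(model_path, "wb") as f:
    pickle.dump(clf, f)

print("Chosen C:", clf.C_)
```

To classify new text later, the fitted vectorizer has to be kept as well, for example by pickling it next to the model.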

Model Evaluation

  • Use pickle to load our saved model.
  • Use the model to check the accuracy score on data it has never seen before.
fake_news_logreg_eva.py
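
A small sketch of loading a pickled model and scoring it on held-out data. Everything here is a toy stand-in: the corpus, the 70/30 split, the file name, and the plain LogisticRegression that substitutes for the tuned model so the example runs quickly.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy corpus (hypothetical); 1 = true news, 0 = fake news.
texts = [f"health agency confirms vaccine trial result {i}" for i in range(20)]
texts += [f"miracle garlic cure kills virus overnight {i}" for i in range(20)]
labels = [1] * 20 + [0] * 20

# Hold out 30% of the documents; the split ratio is an assumption.
train_txt, test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, stratify=labels, random_state=0
)

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_txt)
X_test = tfidf.transform(test_txt)  # reuse the vocabulary fitted on the training data

# Stand-in for the model saved in the training step.
model_path = os.path.join(tempfile.gettempdir(), "fake_news_logreg_eval.pkl")
with open(model_path, "wb") as f:
    pickle.dump(LogisticRegression().fit(X_train, y_train), f)

# Load the saved model and score it on the unseen documents.
with open(model_path, "rb") as f:
    saved_clf = pickle.load(f)

accuracy = accuracy_score(y_test, saved_clf.predict(X_test))
print(f"Test accuracy: {accuracy:.3f}")
```

The key point is that the test documents go through transform, not fit_transform, so the model never sees information from the held-out set.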

The Jupyter notebook can be found on GitHub. Enjoy the rest of the week.

Published in TDS Archive


Written by Susan Li

Changing the world, one post at a time. Sr Data Scientist, Toronto Canada. https://www.linkedin.com/in/susanli/
