COVID Fake News Detection with a Very Simple Logistic Regression
Natural Language Processing, NLP, Scikit Learn
This time, we are going to build a simple logistic regression model to classify COVID news as either true or fake, using the data I collected a while ago.
The process is surprisingly simple and easy. We will clean and pre-process the text data, tokenize and stem it with the NLTK library, extract TF-IDF features and build a logistic regression classifier with the Scikit-Learn library, and evaluate the model's accuracy at the end.
The Data
The data set contains 586 true news articles and 578 fake ones, an almost 50/50 split. Because of bias in the data collection, I decided not to use "source" as one of the features; instead, I combine "title" and "text" into a single feature, "title_text".
Pre-processing
Let's have a look at an example of the title and text combination:
df['title_text'][50]

Looking at the above example of title and text, the data is pretty clean, so simple text pre-processing will do the job: we strip out any HTML tags and punctuation, and convert everything to lower case.
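A minimal sketch of such a preprocessor (the exact regular expressions in the notebook may differ):

import re

def preprocessor(text):
    # Drop HTML tags, then keep only word characters and lower-case the rest
    text = re.sub(r'<[^>]*>', '', text)
    text = re.sub(r'[\W]+', ' ', text.lower())
    return text.strip()

df['title_text'] = df['title_text'].apply(preprocessor)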
The following code combines tokenization and Porter stemming into a single function, which we will apply to "title_text" later.
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]
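For example, tokenizer_porter('runners like running') returns ['runner', 'like', 'run'].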
TF-IDF
Here we transform the "title_text" feature into TF-IDF vectors; a sketch of the vectorizer call follows the list below.
- Because we have already converted "title_text" to lower case, we set lowercase=False.
- Because we have already applied preprocessing to "title_text", we set preprocessor=None.
- We override the string tokenization step with the combination of tokenization and stemming we defined earlier.
- We set use_idf=True to enable inverse-document-frequency reweighting.
- We set smooth_idf=True to avoid divisions by zero.
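Putting these settings together, a minimal sketch of the vectorizer (the name of the label column is an assumption):

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(lowercase=False,
                        preprocessor=None,
                        tokenizer=tokenizer_porter,
                        use_idf=True,
                        smooth_idf=True)

X = tfidf.fit_transform(df['title_text'])
y = df['label'].values  # assumed column holding the true/fake labels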
Logistic Regression for Document Classification
- Instead of tuning the C parameter manually, we use the LogisticRegressionCV estimator, which tunes it by cross-validation.
- We specify cv=5 cross-validation folds to tune this hyperparameter.
- The model is scored by classification accuracy.
- By setting n_jobs=-1, we dedicate all the CPU cores to the problem.
- We raise the maximum number of iterations so the optimization algorithm has room to converge.
- We use pickle to save the trained model, as sketched below.
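A minimal sketch of the training step under these settings (the train/test split, the max_iter value, and the file name are assumptions):

import pickle
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV

# Assumed 80/20 hold-out split; the notebook's split may differ
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = LogisticRegressionCV(cv=5,
                           scoring='accuracy',
                           n_jobs=-1,
                           max_iter=300,  # raised from the default of 100
                           random_state=0).fit(X_train, y_train)

# Persist the fitted model to disk (file name is illustrative)
with open('saved_model.sav', 'wb') as f:
    pickle.dump(clf, f)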
Model Evaluation
- Use pickle to load our saved model.
- Use the model to compute the accuracy score on data it has never seen before, as sketched below.
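A sketch of the evaluation, reusing the illustrative file name from the training step:

import pickle

# Load the saved model and score it on the held-out test set
with open('saved_model.sav', 'rb') as f:
    saved_clf = pickle.load(f)

print(saved_clf.score(X_test, y_test))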
The Jupyter notebook can be found on GitHub. Enjoy the rest of the week.

