Fake & Real News Classifier With Machine Learning

Adarsh Verma · Deep Data Science · May 3, 2019

[ This is part of 100 Days of ML ]

The code for this post is linked here.

Fake news is dangerous: it can influence everything from personal opinions to elections. There are some cues that help distinguish fake news from real news. Fake news often uses lots of UPPERCASE letters and “dramatic-punctuation”!?; to grab readers’ attention, and it may contain spelling mistakes. Even so, it’s difficult for ordinary readers to classify fake news by hand, so they can use machine learning to do this kind of tiresome work.

In this project we will classify news articles into one of two categories, fake or real. The project consists of the following steps:

1. Feature Extraction from Text
2. Text Data Preprocessing
3. Convert Categorical Variable into Numeric
4. TF-IDF calculation
5. Cross-validation for Model Evaluation
6. Model selection & Results Analysis
7. Save Models For Future Predictions

The dataset used in the project was downloaded from Kaggle; here are the links:

dataset: https://www.kaggle.com/anthonyc1/gathering-real-news-for-oct-dec-2016 and https://www.kaggle.com/mrisdal/fake-news

As can be seen below, the dataset contains the features ‘title’, ‘content’, ‘publication’, and ‘label’, which are self-explanatory: publication is the publisher of that particular news article. The initial shape of the dataset is (28711 × 5):

News articles with the fake label
News articles with the real label

To make the data easier to work with, I merged two columns, title and content, into one named newstext; this let me work on a single text feature during the feature extraction and preprocessing steps (sketched below).
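Here’s a minimal sketch of that merge in pandas; the file name fake_and_real_news.csv is a placeholder, not the original file:

import pandas as pd

# Load the combined dataset (placeholder file name)
df = pd.read_csv("fake_and_real_news.csv")

# Merge title and content into a single text feature
df["newstext"] = df["title"].fillna("") + " " + df["content"].fillna("")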

1. Feature Extraction From Text — Text has many potential features; since we are classifying fake versus real news, we need to look for the features that capture the differences between the two, such as the number of UPPERCASE words, average word length, word count, and the words themselves. Here’s a detailed guide on feature extraction, which was followed for this project:
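For illustration, here is one way such features could be derived with pandas; these exact features are an assumption based on the list above, not the original code:

# Simple text-derived features from the merged newstext column
df["word_count"] = df["newstext"].str.split().str.len()
df["upper_count"] = df["newstext"].apply(
    lambda t: sum(1 for w in str(t).split() if w.isupper())
)
df["avg_word_len"] = df["newstext"].apply(
    lambda t: sum(len(w) for w in str(t).split()) / max(len(str(t).split()), 1)
)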

2. Text Data Preprocessing — Before feeding the text into the models, we need to preprocess it to remove irrelevant features/data and make the text more suitable for machine learning models. For text preprocessing, follow this detailed guide:
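As a rough sketch, a common preprocessing recipe is lowercasing, stripping non-letter characters, and removing NLTK stopwords; the exact steps in the linked guide may differ:

import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # Lowercase, strip non-letter characters, and drop stopwords
    text = re.sub(r"[^a-z\s]", " ", str(text).lower())
    return " ".join(w for w in text.split() if w not in stop_words)

df["newstext_clean"] = df["newstext"].apply(preprocess)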

3. Convert Categorical Variable into Numeric

Apart from the text features, the dataset contains categorical features: publication, and our target variable, label. Since machine learning models can process only numeric data, we need to convert these features into numbers.

Publication: Publication is a categorical variable with 250 unique values. Dummy (one-hot) encoding cannot be used here, because it would unnecessarily increase the dimensionality, adding 250 mostly sparse features. Assigning numerical values from 0 to 249 is also inappropriate in this scenario, because it imposes an arbitrary order and we might lose a particular category’s weight. Instead, we can use the frequency distribution: it assigns more weight to values that occur often and less weight to values that occur rarely. Here’s the formula:

Frequency distribution of a category = (frequency of the category) / (total number of instances)

Replace each category with its frequency distribution value:

Publication with frequency distribution values
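In pandas, this frequency encoding is short to write; a sketch assuming the DataFrame df from earlier:

# Replace each publisher with its relative frequency in the dataset
freq = df["publication"].value_counts(normalize=True)
df["publication"] = df["publication"].map(freq)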

Label: It is our target variable which has only two values — fake or real. We can simply assign 0 and 1 to fake and real:

{ fake : 0 , real : 1 }
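The same mapping in pandas, assuming the label column holds the strings ‘fake’ and ‘real’:

# Encode the target variable: fake -> 0, real -> 1
df["label"] = df["label"].map({"fake": 0, "real": 1})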

4. TF-IDF Calculation — Bag of words is a technique that uses raw word frequencies in a text as features. TF-IDF is a better technique: it weighs a word’s importance rather than just its frequency. As simply explained on tfidf.com, we can calculate TF-IDF with the formulas below:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).

TF-IDF(t) = TF(t) × IDF(t)

However, there’s no need to apply this formula by hand: scikit-learn already provides the TfidfVectorizer class to calculate it, and the resulting tf-idf features can be used directly by the machine learning models.

Values of TF-IDF for words from the sparse matrix
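A sketch of computing the tf-idf matrix with scikit-learn; the parameters shown are illustrative, not taken from the original code:

from sklearn.feature_extraction.text import TfidfVectorizer

# Fit tf-idf on the preprocessed text; returns a sparse matrix
tfidf = TfidfVectorizer(sublinear_tf=True, max_features=50000)
X_text = tfidf.fit_transform(df["newstext_clean"])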

5. Cross-Validation For Model Evaluation — As the “no free lunch” principle says, no single model/algorithm works best for every problem. We need to try a few different ML models and evaluate their accuracy, precision, recall, etc. (depending on your problem) to find the best one. We will also build a custom voting classifier in future posts, where different models will use different sets of features and different sets of data, and the result will be decided by their votes. Here, we use TF-IDF features and 4 machine learning models for our classification problem: Random Forest, Support Vector Machine (linear), Multinomial Naive Bayes, and Logistic Regression.

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(),
    MultinomialNB(),
    LogisticRegression(random_state=0),
]

5-fold cross-validation was used here: the dataset was divided into 5 folds, and over 5 iterations each model was trained on 4 folds and tested on the remaining one. Here is the average accuracy over all 5 iterations for the 4 models:

Mean accuracy over 5-fold cross-validation
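A sketch of that loop with scikit-learn’s cross_val_score; for simplicity it uses only the tf-idf matrix X_text as features, whereas the full feature set would also include the encoded columns:

from sklearn.model_selection import cross_val_score

y = df["label"]
for model in models:
    # 5-fold cross-validated accuracy for each candidate model
    scores = cross_val_score(model, X_text, y, cv=5, scoring="accuracy")
    print(model.__class__.__name__, round(scores.mean(), 3))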

6. Model Selection & Results Analysis — Two models perform noticeably better than the others: LinearSVC with an accuracy of 89.7% and LogisticRegression with an accuracy of 88.6%. The difference in accuracy is only slight, so let’s dig deeper and look at the precision, recall, and F1 score of these two models. We evaluated each of them separately with a train-test split, using 75% of the data for training and 25% for testing (sketched below). It’s interesting to see that both models’ accuracy increased a little, but that could change if you change the split, or choose different data with the same split.
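A sketch of that evaluation for the SVM (the same pattern applies to Logistic Regression); the random_state is illustrative:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# 75/25 train-test split, as described above
X_train, X_test, y_train, y_test = train_test_split(
    X_text, y, test_size=0.25, random_state=0
)
svm = LinearSVC().fit(X_train, y_train)
pred = svm.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
print(classification_report(y_test, pred, target_names=["fake", "real"]))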

Logistic Regression achieved an F1-score of 92% with an accuracy of ~91.8%.

Logistic Regression’s Results

The other contender, the Support Vector Machine, performs better than Logistic Regression, with about 2% higher accuracy and better precision and recall. We can select the Support Vector Machine for our classifier.

LinearSVC’s results

7. Save Models For Future Predictions

How do you save machine learning models for future use? We can save our models for future work, such as deeper evaluation with an ROC curve and statistical testing. sklearn makes it pretty easy to save your models, sparing you from re-training them so you can save some valuable time. Here’s how you can do it:

  • Import ‘joblib’ from sklearn.externals
  • Then dump your models (or anything else, like intermediate data or results)
  • Re-load them when you want to use them
# Load the module to save models and intermediate data
# (note: newer scikit-learn versions removed this path; use `import joblib` directly)
from sklearn.externals import joblib

# Save our model for later evaluation or prediction
joblib.dump(svm, r"Day 5-6-7\LinearSVC")

# Load our trained model
svm = joblib.load(r"Day 5-6-7\LinearSVC")

Code can be found here

Future Work:

  1. A custom voting classifier with different feature sets and datasets.
  2. Model evaluation and statistical testing (paired t-test) of the results.

Cheers!

#machinelearning #textprocessing #featureextraction #fakenewsclassifier
