Getting Real with Fake News

Detecting fake news using scikit-learn and deep learning

Ever since the 2016 Presidential Election, fake news has become a commonplace term in everyone’s vocabulary. Even Donald Trump’s most-liked Tweet is about fake news.

#1 most liked and retweeted Donald Trump tweet

Typically spread over social media and traditional news outlets, misinformation remains rampant through the use of clickbait headlines and polarizing content.

With recent world events, we’ve seen how much impact the news has on our lives. From understanding what is happening surrounding the pandemic to the movement of the stock market, I know that I rely heavily on the news, and I’m sure everyone else does too. However, it is often difficult to distinguish between articles with false information and those providing real, fact-checked news. Given that companies such as Facebook and Twitter deploy algorithms to ensure that people are receiving accurate information on their feeds, I wanted to explore utilizing Natural Language Processing and text analysis to build a fake news classifier.

Code for this analysis can be found on my Github.

Data Sources

The data utilized comes from two places: a) 7 web-scraped news websites and b) Kaggle. The Kaggle dataset contains fake and true news articles from 2015–2018.

Scraping

Beyond using the Kaggle dataset, I scraped articles from the following websites to get more articles and more recent news:

To do so, I installed the feedparser and newspaper packages, as some articles are scraped from the RSS feeds of the main news sites. Further code can be found on my Github.
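
For reference, here is a minimal sketch of how feedparser and newspaper fit together; the feed URL below is a placeholder, not one of the actual sources:

```python
import feedparser
from newspaper import Article

FEED_URL = "https://example.com/rss"  # placeholder feed; the real site list is in the repo

feed = feedparser.parse(FEED_URL)      # read the RSS feed
rows = []
for entry in feed.entries:
    article = Article(entry.link)      # newspaper handles the article page itself
    article.download()
    article.parse()
    rows.append({
        "title": article.title,
        "text": article.text,
        "date": article.publish_date,
    })
```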

I ended up with 24,194 fake news articles and 22,506 true news articles. This was ideal, as this is a relatively balanced dataset. After removing irrelevant columns, reformatting, and adding labels, a sample of both data frames is shown below.

Two Data Frames: 5 Fake News Articles, 5 True News Articles

Exploratory Analysis

The chart below shows the time period from which my news articles were collected. Since I web-scraped additional articles beyond those in the Kaggle dataset to get recent news, there is a sharp peak in April 2020. The scraper was unable to pull articles published between 2018 and 2020.

Among the articles in the Kaggle dataset (2015–2018), there is a sharp increase in fake news in January 2016, which I would hypothesize is a result of the start of the U.S. presidential election year. There is also a spike in true news in November 2016, the month the election results were announced.

Since I am utilizing textual analysis for this project, the discrepancy in dates will not be an issue. The goal of utilizing articles from two vastly different news periods (e.g., 2016 dominated by election news, 2020 dominated by coronavirus news) is to have a diverse set of words to train my classifier on.

Article Body Text

An article’s length could play a role in determining whether it is fake or real.

Fake news articles are longer, on average, by 30 words. There is also more variance in their lengths, evidenced by a higher standard deviation and more articles on the right tail. There are some outliers, with some true and fake news articles being > 4000 words in length.

Interestingly, there isn’t much difference in sentiment between fake news and real news, for the articles in this dataset. Using TextBlob’s sentiment function, where -1 means negative sentiment and 1 means positive sentiment, the average sentiment is 0.055 for real news and 0.059 for fake news. This also indicates that most of the articles in this dataset are neutral, shown by the distribution below.
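
For reference, the sentiment scores come from a call like the following (a minimal sketch; the DataFrame column name is illustrative):

```python
from textblob import TextBlob

def polarity(text: str) -> float:
    """Return TextBlob sentiment polarity, from -1 (negative) to 1 (positive)."""
    return TextBlob(text).sentiment.polarity

# Illustrative usage on a pandas DataFrame with a 'text' column:
# df["sentiment"] = df["text"].apply(polarity)
```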

Article Headline Text

A headline’s length could also play a role in determining whether it is fake or real.

Headline text follows a similar trend to the text in the body of articles. Fake news has slightly longer headlines, on average. There is also more variance, evidenced by higher standard deviation and a longer tail on the graph. Real news headlines follow a relatively normal distribution.

Within this dataset, real news headlines are slightly more polarized, with an average sentiment of 0.026 compared to 0.003 for fake news headlines. Both classes of headlines are still mostly neutral.

Some examples of polarizing headlines are shown below. Most negative headlines signal fear and death, while positive headlines show hope and pride.

Highly negative sentiment (-1.0)
Highly positive sentiment (1.0)

Although the headlines could yield interesting, significant results, I decided to include only the body text for classification. Headlines are much shorter, resulting in fewer words to use for classification. I could have combined the headlines and body text, but I wanted to keep the focus of my analysis narrow. A further extension of this analysis would be a deeper look at the differences in sentiment between fake and real news.

Pre-Processing

Tokenizing

As is the case when working with any kind of text, the first step is separating each article’s body text into tokens to get a corpus. Using the corpus, I can extract features from the words. To tokenize each article, I used the NLTK package to do the following (a sketch of the pipeline follows the list):

  • Imported English and Arabic stop words and added extra words to the stop words list (e.g., CNN, since the name of a news outlet should not be in the corpus)
  • Separated each article into tokens (removing whitespace)
  • Lowercased all words
  • Removed punctuation and stop words
  • Removed numbers
  • Lemmatized words (converted them to their root form)
  • Created n-grams
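
Here is a sketch of that pipeline with NLTK; the extra stop words and helper names are illustrative, and the exact implementation is on my Github:

```python
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

# Requires: nltk.download("punkt"), nltk.download("stopwords"), nltk.download("wordnet")
stop_words = set(stopwords.words("english") + stopwords.words("arabic"))
stop_words.update({"cnn"})  # illustrative extra stop words (news outlet names)
lemmatizer = WordNetLemmatizer()

def tokenize(text):
    tokens = word_tokenize(text.lower())                    # split into lowercase tokens
    tokens = [t for t in tokens if t not in string.punctuation]
    tokens = [t for t in tokens if t not in stop_words]
    tokens = [t for t in tokens if not t.isdigit()]         # drop numbers
    return [lemmatizer.lemmatize(t) for t in tokens]        # reduce words to root form

def make_ngrams(tokens, n=2):
    return list(ngrams(tokens, n))                          # bigrams by default
```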

After creating a corpus, I found the top 20 words in each class of articles. The sizes of each word are scaled by frequency (e.g., trump appears 2–3 times as frequently as one, another top-20 word). Of the top 20 words in each class, 9 overlap. Furthermore, trump is overwhelmingly the most common word in each corpus, as expected. Given that much of the news in this dataset is from around election years and the coronavirus, it is no surprise that the top 20 words are almost all politically related.

These are the top n-grams found after tokenizing the articles, separated by true news and fake news.

True News — Top 6 Bigrams
Fake News — Top 6 Bigrams
True News — Top 6 Trigrams
Fake News — Top 6 Trigrams

After analyzing these n-grams, I ended up not using them in the modeling process because 1) processing time multiplied when n-grams were used on top of single tokens, 2) the n-grams were quite similar between the two classes of articles, and 3) the metrics from modeling (see below) were already strong without them.

TF-IDF

Once the text was processed, I converted it into features using TfidfVectorizer. This vectorizer first calculates the term frequency (TF): the number of times a word appears in a document divided by the total number of words in the document. Next, it calculates the inverse document frequency (IDF): the log of the number of documents divided by the number of documents that contain the word. Finally, the TF-IDF score for a word is TF × IDF.

After splitting the data into train and test sets, I configured the vectorizer to keep only words appearing in more than 10% of documents (reducing the number of features to keep the matrix manageable) and overrode the preprocessor so that only the tokens created during tokenization (see above) would be used as features.
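
A sketch of that vectorization step, assuming the tokenize function from the pre-processing section and train_text / test_text holding the raw article strings (names are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    min_df=0.1,                    # keep words appearing in more than 10% of documents
    preprocessor=lambda doc: doc,  # bypass the default preprocessing
    tokenizer=tokenize,            # reuse the NLTK tokenizer from above
)

X_train = vectorizer.fit_transform(train_text)  # learn vocabulary on the train set only
X_test = vectorizer.transform(test_text)        # reuse that vocabulary on the test set
```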

The result is a Document-Term Matrix — one for the train set, one for the test set. I had 215 words (features) with 37,368 articles in the train set and 9,343 articles in the test set. These matrices serve as the inputs for the modeling process!

Modeling: ML Models

I fit 4 ML models to the data, predicting whether each article fell into 1 of 2 categories: fake or true news. Results are shown below.
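
All four models share the same scikit-learn fit/predict pattern; a minimal sketch is below, assuming the TF-IDF matrices from above plus y_train / y_test label arrays (hyperparameters shown are defaults or illustrative, with tuning discussed in the individual sections):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "Linear SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=200, max_depth=60),
}

for name, model in models.items():
    model.fit(X_train, y_train)                  # X_train/X_test are the TF-IDF matrices
    preds = model.predict(X_test)
    print(name)
    print(confusion_matrix(y_test, preds))
    print(classification_report(y_test, preds))
```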

Binomial Logistic Regression

First, I started with a basic binomial logistic regression.

Naive Bayes Classifier

This is a very common model for text classification, built on Bayes’ theorem. Since it is naive, it assumes that each word is independent: one word does not influence the chance of another word appearing. This can be a downside; for example, the words donald and trump would be treated as unrelated in this dataset, when we know that the presence of donald almost always indicates the presence of trump. However, the classifier still performs well.

Naive Bayes allows me to calculate the words that have the most indication of being in fake news and the words that have the most indication of being in real news. This was done by calculating the number of times a word appeared across all articles within a class and then dividing by the total number of articles in the class.

Top 4 most “fake news” words
Top 4 most “real news” words

Words like ‘Hillary’ and ‘video’ seem to be the most indicative of fake news. Given the nature of the articles in this dataset and knowing what we know about fake news, it makes sense that most articles are oriented towards Hillary Clinton (the 2016 presidential election has been widely criticized for fake news campaigns) and video (deepfake videos are a huge issue). The popularity of ‘video’ could also come from the fact that many fake news sites attach video to their articles. Real news has more common words, such as days of the week and general news terms such as ‘senate’ and ‘foreign’.
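
A sketch of that word-level calculation is below. It assumes raw counts (a CountVectorizer built with the same tokenizer) rather than TF-IDF weights, and that y_train is a NumPy array of 0/1 labels; the label coding is illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count_vec = CountVectorizer(tokenizer=tokenize, preprocessor=lambda d: d, min_df=0.1)
counts = count_vec.fit_transform(train_text)        # raw word counts per article
words = np.array(count_vec.get_feature_names_out())

def top_words(label, k=4):
    mask = y_train == label
    totals = np.asarray(counts[mask].sum(axis=0)).ravel()  # occurrences of each word in the class
    rates = totals / mask.sum()                            # divided by the number of articles in the class
    top = np.argsort(rates)[::-1][:k]
    return list(zip(words[top], rates[top]))

print("fake:", top_words(label=1))  # assuming 1 = fake news
print("real:", top_words(label=0))  # assuming 0 = real news
```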

Support Vector Machines

Given that this problem has two classes that can be separated by a linear classifier, support vector machines are another good machine learning model to use. I use a linear kernel.

Random Forest

Next, I trained a random forest, utilizing GridSearch for hyper-parameter tuning. I settled on a max_depth of 60 with 200 estimators.
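
A sketch of that tuning step; the parameter grid is illustrative rather than the exact grid I searched:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [20, 40, 60, 80],   # illustrative values
    "n_estimators": [100, 200, 300],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_)  # the search settled on max_depth=60, n_estimators=200
rf = search.best_estimator_
```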

False positives denote misclassified real news, and false negatives denote misclassified fake news. In this analysis, false negatives are more concerning, as we do not want a reader to assume a fake article is providing them with real, accurate information.

Out of these models, Random Forest performs the best, with the highest accuracy and lowest number of false negatives. Support Vector Machines (SVM) and Binomial Logistic Regression have similar accuracies and false negatives. Furthermore, their precision, recall, and F1-scores are the exact same. Therefore, based on AUC, I would recommend a Logistic Regression model over SVM. Naive Bayes performs the worst. A possible reason could be the downside I discussed above — since I did not utilize bi-grams and tri-grams due to memory issues, a lot of dependent words were not recognized.

Feature importances from the Random Forest model are shown below.

Although Random Forest performs well, deep learning may perform better.

Modeling: BERT (Bidirectional Encoder Representations from Transformers) Neural Network

BERT is a neural network released by Google in late 2018; it was rolled out to Google Search in 2019. More information can be found here. It provides pre-trained deep learning models that can be downloaded by the user (me) for two uses: 1) extracting features from text data (similar to the TF-IDF method I used above), or 2) fine-tuning the existing model to produce predictions from text data. Since I had already extracted features, I focus on the second use case: I modify and fine-tune BERT to create a classifier.

So why use BERT instead of other forms of deep learning (e.g., a BiLSTM, another text-based model, or a CNN)?

  1. BERT can be developed more rapidly since the bottom layers of the neural network have already been trained; all I needed to do was tune them. Furthermore, only 2–4 epochs of training are recommended. This is much less than the time and number of epochs needed to train other models!
  2. Building a deep learning model from scratch typically requires a large dataset. BERT’s weights are pre-trained, so my relatively small dataset works.
  3. BERT has proven results through simply fine-tuning. Thanks Google 😀

I used PyTorch and the transformers library as the interface to run BERT. I also used the BertForSequenceClassification pre-trained model, one of multiple pre-trained models available in the transformers library. This model has a linear layer on top of the pooled output that is used for classification.

I split the data into new train-validation-test sets due to data formatting specifications, and re-tokenized the articles using the BertTokenizer. I did not use the original tokenization I had performed because: 1) BERT has a specific vocabulary that can’t be changed and 2) BERT has a special way of handling words not in its vocabulary.

The maximum length of an article in BERT is 512 tokens. However, I was running out of memory when I increased the number of tokens (or I had to significantly reduce the batch size, resulting in a very long runtime). I ended up sticking to 64 tokens per article since I was getting pretty good results.
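
A sketch of the tokenization and model setup with the Hugging Face transformers API (variable names like train_text are illustrative):

```python
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# BERT's own tokenizer handles lowercasing, sub-word splitting,
# and out-of-vocabulary words, so the raw article text goes in directly.
encodings = tokenizer(
    list(train_text),
    max_length=64,           # 64 tokens per article, as discussed above
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,            # fake vs. true news
)
```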

I also utilized the Adam optimizer and a learning rate scheduler to reduce the learning rate as training epochs increased. The scheduler ensures that the model does not converge at a suboptimal solution or get stuck. Training the model with a batch_size of 16 and 4 epochs took ~36 minutes.
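
Continuing from the setup above, a sketch of the training loop with AdamW (the Adam variant typically used with BERT) and a linear learning-rate scheduler; train_labels and the warmup setting are illustrative:

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from transformers import get_linear_schedule_with_warmup

EPOCHS = 4
BATCH_SIZE = 16

dataset = TensorDataset(encodings["input_ids"], encodings["attention_mask"],
                        torch.tensor(train_labels))          # 0/1 labels for each article
loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

optimizer = AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,                                      # illustrative: no warmup
    num_training_steps=len(loader) * EPOCHS,
)

model.train()
for epoch in range(EPOCHS):
    for input_ids, attention_mask, labels in loader:
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        outputs.loss.backward()                              # cross-entropy loss from the model
        optimizer.step()
        scheduler.step()                                     # decay the learning rate each step
```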

As shown above, the model performs great on the validation set, with an accuracy of 99%. I decided to stick with 4 epochs for evaluation.

The results of the BERT model are fantastic! The model has a 99.8% accuracy and shows a large reduction in false positives and false negatives compared to any of the ML models. The confusion matrix is below.

Conclusions

The BERT model performs best, and I will use that as my final classifier. Advantages of using this model include its built-in tokenizer, which removes the need to do text pre-processing manually. Additionally, BERT has a very extensive vocabulary and a deep network of layers.

However, BERT is much more computationally-expensive than the ML models, requiring a GPU. Additionally, the number of tokens per article is limited to 512, while ML models can be scaled up based on available memory or usage of the cloud. Consequently, based on resource availability, one could also use Random Forest on this dataset.

Challenges

This was my first time completing an NLP-based project that utilized deep learning. I think text analysis is a particularly interesting application of machine learning since it allows computers to make sense of non-numeric data. A particularly interesting application is how a customer’s non-traditional data (geolocation, social media networks, etc) can be analyzed to determine their credit-worthiness when they have little credit history to begin with.

I learned a lot about the pre-processing of text. At first, I struggled with determining the best tokenizing package to use — I started with spaCy but realized NLTK was the best way to get the output I wanted. I also experimented with different vectorizers before I settled on TfidfVectorizer.

In terms of model-building, training BERT was a challenge. At one point, I ran into a problem with forward propagation that could only be fixed if I were using Jupyter notebooks, but my project was built on Google Colab. Therefore, I tried multiple methods of training, combining resources from various tutorials. Still, I would often run out of GPU memory and have to start over. Ensuring the data was in the right format and understanding where to tweak the existing BERT model were also obstacles I had to overcome.

Next Steps

My analysis could have been improved in the following ways:

  • Using a larger dataset spanning a longer time period
  • Performing Latent Dirichlet Allocation (LDA) on the data to determine topics
  • Utilizing bi-grams and tri-grams in training the ML models
  • Incorporating the headlines, as they could provide valuable information about the type of article
  • Increasing the number of tokens used for each article in BERT
  • Trying other NLP deep learning models
  • Incorporating sentiment as a factor for prediction

Thanks for reading! 😸

Check out the code here.
