Fake News Detection

Jermyn Yeo
Published in Clement & Jermyn
11 min read · May 16, 2021

Introduction

The advances in information technology and the rapid growth of social media have increased the spread of fake news. Fake news is defined as conspiracy theories and news written with the intention of spreading misinformation. With two million news articles published online every day (Singh, 2015), it is clear why determining the legitimacy of news is a growing and prominent need. Society and businesses have been affected by fake news as users are unable to distinguish truths from lies (Stewart, 2020), resulting in a loss of resources such as money and time due to unwise decisions misled by fake news. Hence, we aim to identify features that will help us (1) design a classifier for fake news in a specific domain; and (2) design a classifier that can be applied to news in multiple domains, a cross-domain classifier.

Dataset

The Kaggle community is looking for more insights into the detection and classification of fake news. The data provided is already split into separate CSV files for fake and real news. The data is labelled as real or fake using known algorithms from various research papers.

Since most of the datasets on Kaggle are US-centric, a potential challenge is classifying real or fake news from sources outside of the United States when the trained model is limited to US news. As such, we combined data from 3 different datasets, retrieving the Title, Text, and Classification of each news article. By doing so, we include news from a variety of sources and make our analysis more robust and comprehensive.

However, due to computational constraints, we conducted our analysis on a corpus of 20,993 documents instead of the full 50,000.

Solution Overview

Solution Overview Framework

We designed a framework to extract 5 features — Topic Distribution, POS Tags, Entity counts, Emotions distribution and Sentiment distribution — to create our classifier. We generated 3 types of classifiers — (1) Base Classifier, (2) Base + Topic Classifier and (3) Cross-Domain Classifier.

First, the dataset is preprocessed to stem the words. Next, we extract the TFIDF word vector. Concurrently, we run topic modelling, emotion/sentiment analysis, Information Extraction (Entity Recognition) and POS Tagging to engineer the features for our classifiers. The engineered features are:

  • Topic distribution from topic modelling
  • Emotion distribution and sentiment distribution from emotion/sentiment analysis
  • Counts of person, organization and location entities from Information Extraction
  • POS Tag groups from POS Tagging

These engineered features are then used for the different classifiers, so classification is performed only after all the features have been created. The first classifier is generated with just the TFIDF word vector, the second with the TFIDF word vector and the topic distribution, and the last with the remaining engineered features, as sketched below.
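For concreteness, below is a minimal sketch of the preprocessing and TFIDF extraction step using NLTK and scikit-learn. The documents and parameters (e.g. `max_features`) are illustrative assumptions, not our exact settings.

```python
# Minimal sketch: stemming + TFIDF extraction (requires NLTK 'punkt' data).
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def stem_text(text):
    # Lowercase, tokenize and stem each word
    return " ".join(stemmer.stem(tok) for tok in word_tokenize(text.lower()))

docs = [
    "Breaking news about the election results!",
    "Officials confirmed the report on Tuesday.",
]
stemmed_docs = [stem_text(d) for d in docs]

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
tfidf_matrix = vectorizer.fit_transform(stemmed_docs)  # document-term matrix
```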

Topic Modelling

Topic Modelling is a statistical modelling method that generates groups of words that frequently occur together (topics) and best represent the corpus. One of the crucial steps of topic modelling is selecting an optimal number of topics K, one that shows distinct sets of top N words for each topic with minimal overlap, ensuring that each topic is distinguishable.

The Machine Learning for Language Toolkit (MALLET), accessible from Python through a Gensim wrapper, performs a sampling-based implementation of topic modelling (LDA). It generates more optimized β and θ distributions with an optimized Gibbs sampling technique, SparseLDA (Yao et al., n.d.).

Topic Modelling helps us to identify the topics present in the corpus and the Topic-Document distribution of all documents in the corpus which will then be used as a feature for our classification model.
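As a rough sketch, the per-document topic distribution can be extracted as follows. We use Gensim's built-in LDA here as a stand-in for the MALLET-based model, and the toy documents are purely illustrative.

```python
# Minimal sketch: per-document topic distributions with Gensim LDA
# (a stand-in for the MALLET-based model; docs are toy data).
from gensim import corpora, models
from gensim.models import CoherenceModel

docs = [["tax", "bill", "senate"], ["email", "server", "scandal"]]
dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(d) for d in docs]

K = 14  # the number of topics chosen for our corpus
lda = models.LdaModel(bow_corpus, num_topics=K, id2word=dictionary,
                      passes=10, random_state=42)

# Coherence scores can guide the choice of K (higher is generally better)
coherence = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                           coherence="c_v").get_coherence()

# Topic distribution feature vector for each document
topic_features = [
    [p for _, p in lda.get_document_topics(bow, minimum_probability=0.0)]
    for bow in bow_corpus
]
```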

Topic-Word Distribution

These are our interpreted topics relating to the Topic-Word distribution: 0 — Equality, 1 — US White House, 2 — Taxes, 3 — Social Media, 4 — Hillary Clinton’s Email Scandal, 5 — US Bills, 6 — Protest, 7 — America for Americans, 8 — Nuclear Bomb, 9 — US Legislation, 10 — Sustainability, 11 — EU, 12 — Terrorism, 13 — US Elections. Feel free to suggest other possible interpretations of our Topic-Word distribution.

It seems that fake news circulates around controversial and sensitive topics, while real news circulates around topics that are international in nature. This skewed distribution indicates that the topic distribution can help the classifiers discriminate between real and fake news.

Dominant-Topic Distribution by Class (Real/Fake)

Named Entity Recognition (NER)

NER is a subtask of information extraction that locates and classifies named entities mentioned in unstructured text into predefined categories such as people, organizations, locations, time expressions and quantities. A classifier that uses a term-frequency vector representation of detected phrases and NER entities can achieve an accuracy of 96.74% (Al-Ash & Wibowo, 2018).

However, due to the vast number of specific entities present in our dataset, we decided to aggregate the term frequency counts for the Location, Person and Organisation entity types.
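A minimal sketch of how these aggregated counts can be computed with NLTK's off-the-shelf NER chunker is shown below; the mapping from NLTK's labels to our three features is our assumption.

```python
# Minimal sketch: aggregate entity-type counts per document with NLTK
# (requires the punkt, averaged_perceptron_tagger, maxent_ne_chunker
#  and words data packages).
from collections import Counter
import nltk

def entity_type_counts(text):
    counts = Counter()
    for sent in nltk.sent_tokenize(text):
        tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
        for node in tree:
            if hasattr(node, "label"):  # named-entity subtrees only
                counts[node.label()] += 1  # e.g. PERSON, ORGANIZATION, GPE
    # Map NLTK's labels onto our three aggregated features
    return {
        "person_count": counts["PERSON"],
        "org_count": counts["ORGANIZATION"],
        "location_count": counts["GPE"] + counts["LOCATION"],
    }
```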

Aggregated Term Frequency Count for Entity Types

It seems that both real and fake news contain similar amounts of Organization and Person entities. Notwithstanding, real news tends to contain a significantly higher number of Location entities.

Average Entity Type Counts by Class (Real/Fake)

POS Tagging

Morphology is a branch of linguistics that focuses on the way words are formed from morphemes (Hancox, n.d.). It allows a machine to recognise a word, understand its basic constituents and the internal structure of complex words, and thereby ascertain the word-formation process and the mental process of the user forming the overall linguistic expression (Agyei, 2015). As such, POS Tagging, a form of morphological analysis that considers the relations between words (SketchEngine, n.d.), was used for feature engineering.

We generated the count of each POS tag present in each document as a classification feature. However, this resulted in a sparse vector, so we removed POS tags with more than 70% missing values. Despite doing so, there were still 24 features, which is large for our dataset size.

To avoid the curse of dimensionality, we followed research by Kapusta, Hájek, Munk, and Benko (2020), which groups these POS tags into grammatical-semantic classes (POS tag groups). The sum of the counts of all POS tags in each POS tag group was used as the value of a feature in our classification model. With this, the number of features was reduced to nine (Groups C, D, F, I, J, M, N, R, and V).

POS Tag Groups Representation (Kapusta, Hájek, Munk, & Benko, 2020)

Out of the nine groups, Fake News contains higher levels of POS Tag Group I, J and N.
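For illustration, here is a minimal sketch of this grouping, under the assumption that each group corresponds to the first letter of the Penn Treebank tag (e.g. JJ, JJR and JJS all fall into Group J):

```python
# Minimal sketch: POS tag group counts per document (NLTK).
# Assumption: each group is keyed by the first letter of the Penn
# Treebank tag, matching the nine groups C, D, F, I, J, M, N, R, V.
from collections import Counter
import nltk

GROUPS = set("CDFIJMNRV")

def pos_group_counts(text):
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    counts = Counter(tag[0] for tag in tags if tag[0] in GROUPS)
    return {f"group_{g}": counts[g] for g in sorted(GROUPS)}

print(pos_group_counts("Fake stories often spread faster than corrections."))
```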

Emotions and Sentiment Analysis

Sentiment and emotion analysis can give further measures of the writer’s attitude, evaluations, sentiments, and emotions using the computational treatment of subjectivity. The Valence Aware Dictionary and sEntiment Reasoner (VADER), provided by the NLTK package, was used for sentiment analysis. The Text2Emotion library was used for emotion analysis; it uses a lexical database to detect the “Happy”, “Sad”, “Surprise”, “Angry” and “Fear” emotions.

Sentiment analysis provides a sentence-level understanding of the polarity of the text, while emotion analysis provides understanding at the token level, detecting more granular levels of subjectivity.
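Both sets of scores can be turned into per-document features with just a few lines; a minimal sketch (assuming `pip install nltk text2emotion` and the `vader_lexicon` data package):

```python
# Minimal sketch: sentiment (VADER) and emotion (Text2Emotion) features.
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import text2emotion as te

sia = SentimentIntensityAnalyzer()
doc = "Shocking new report reveals the scandal everyone feared."

sentiment = sia.polarity_scores(doc)  # {'neg', 'neu', 'pos', 'compound'}
emotions = te.get_emotion(doc)        # {'Happy', 'Angry', 'Surprise', 'Sad', 'Fear'}

features = {**sentiment, **emotions}  # one feature vector per document
```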

While the distributions for real and fake news look similar, a closer look shows that real news tends to carry positive emotions and be more neutral, while fake news tends to carry negative emotions and be more sentimental.

Emotion distribution between Real (1) and Fake (0) news
Sentiment (VADER) distribution between Real (1) and Fake (0) news

Classification

As previously mentioned, three classification models were constructed and each of them was optimized for different inputs.

  • Classification on TFIDF Text Representation (Base Model)
  • Classification on TFIDF and Topic Distribution (Base + Topic Model)
  • Classification on POS Tags, Sentiment Distribution, Emotion Distribution and Entity Counts (Cross-Domain Model: features that appear in all articles)

We utilized linear, rule-based classification models and Neural Networks to perform our classification.

Logistic Regression (LR) trained with Stochastic Gradient Descent (SGD) identifies whether there are linear relationships between our variables and the classification labels. We used elastic net, which combines L1-norm and L2-norm regularisation, to improve predictions.
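A minimal sketch of this linear baseline in scikit-learn; the `l1_ratio` here is an illustrative value, not our tuned setting:

```python
# Minimal sketch: logistic regression via SGD with elastic-net penalty.
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(
    loss="log_loss",       # logistic regression ("log" in older scikit-learn)
    penalty="elasticnet",  # combination of L1-norm and L2-norm regularisation
    l1_ratio=0.15,         # illustrative mixing ratio
    random_state=42,
)
# clf.fit(X_train, y_train); y_pred = clf.predict(X_test)
```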

Since most real-world data is not linear, we trained rule-based classifiers and ensemble models such as Random Forest (RF), AdaBoost (ADA), and XGBoost (XGB) to identify whether segregating variables according to a purity criterion can enhance our classification performance. Probabilistic models and rule-based models differ in the way they work and in the interpretation of their results. To keep this article short, details of how these models work are not included.

To optimize the classification models, our team utilized random search on each classifier (Base, Base + Topic, Cross-Domain) to further improve the models’ precision. 5-fold cross-validation was used to tune the models and reduce overfitting.
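A minimal sketch of this tuning setup for one of the rule-based classifiers; the parameter ranges are assumptions for illustration:

```python
# Minimal sketch: random search with 5-fold cross-validation (scikit-learn).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 30],
    "min_samples_leaf": [1, 3, 5],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20, cv=5, scoring="f1", random_state=42,
)
# search.fit(X_train, y_train); best_model = search.best_estimator_
```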

We trained and optimized the aforementioned classification models on our dataset of ~21,000 documents to understand which model is best at classifying real from fake news in the political domain. For our second goal, designing a classifier for multiple domains, our team used a sample dataset from the entertainment domain obtained from Kaggle.

Classification Findings

It can be seen that the rule-based classifiers generally perform better than the linear models, suggesting an absence of linear relationships between our features and the class of the news. Furthermore, the Base + Topic classifier performs slightly better than our Base classifier, while the Cross-Domain classifier performs the worst. This suggests that the presence of domain-specific features, such as a Topic-Document distribution trained on the domain, can boost the classification models.

Summary of Classification F1-Score, Precision and Recall on Politics dataset

We then investigated the Cross-Domain classifier to identify the important features from the rule-based classifiers. We found that location entity counts and the POS tag groups are better predictors than the emotion/sentiment data. In particular, POS tag groups I, J and N are important determinants for classifying real and fake news, and “Surprise” was a significant emotional feature in most articles.

Top features from each category (Highlighted represents top feature of each classifier)

We then moved on to test our classifiers on an alternate domain, using 600 entertainment news articles as our sample.

As our topics were fitted specifically to the political domain, the Base + Topic model was unable to adapt to the new dataset and was the worst-performing model. The Base classifier performed the best of our 3 models. While the topic distribution enhances the performance of domain-specific classifiers, it becomes a confounding factor that hinders the performance of classifiers in alternate domains.

On the other hand, our Cross-Domain classifier, which aims to pick up features that appear in all texts, produced results comparable to the Base + Topic classifier. This indicates that there is value in exploring such features when classifying real and fake news, and that their influence can extend across domains.

Summary of Classification F1-Score, Precision and Recall on Entertainment dataset

Future work and Conclusion

This project successfully met its two objectives: designing a classifier for fake news in a specific domain and designing a classifier that can be applied to news in multiple domains, a cross-domain classifier. Using different types of classification models, enhanced with parameter-tuning techniques, allowed our models to distinguish between real and fake news quickly and accurately.

Our focus here was on feature engineering for the classification models. Possible future work is to extend the types of features used, create a variety of models with varying sets of features, and extract the most explainable features to build the final classifier, for both domain-specific and cross-domain classifiers. Some features our team would look to include are the following:

For domain-specific classifiers, we can include more domain-specific features. Additionally, the metadata of an entity can be included, e.g. occupation, using factbooks or resources like LIWC.

For cross-domain classifiers, more types of emotions can be detected, e.g. “Disappointment” or “Satisfied”. Inverse POS tag frequency (a concept similar to TFIDF) could also be used as a feature to identify rarely occurring grammar classes.

For classifying real from fake news in general, we can analyze the entity that provided the article and cross-reference it with well-established sites such as Channel News Asia or the BBC.

We believe there is value in further exploring the different types of features present in a news article and the possible combinations that distinguish real from fake news. Especially as the digital age brings forth the spread of unfiltered information, it is crucial that fact-checking sites become more accurate.

We would love to hear your opinions and suggestions about our work and are happy and open to suggestions for improvement.

Personal Reflections

Clement

Text mining analytical projects are fun. I got to explore different areas of how documents and text can be analysed and represented. Over the course of the project I had the opportunity to explore many different text mining and natural language processing libraries, and I came to realize how much text mining analytics has advanced. As I was researching feature engineering techniques for emotions, sentiments and POS tagging frequency distributions, I realised the number of libraries and resources available was vast. Reading through many research papers about how other data scientists and analysts deal with fake news was very eye-opening. The techniques and procedures introduced in those papers gave me a lot of understanding of how our team could narrow down and decide on an approach for this problem. The techniques taught in this course were also very helpful, better equipping me with the skills and knowledge to boost my understanding and takeaways from the project. The domain of text mining analytics is growing fast and, with technology improving every day, it’s only a matter of time before more accurate, efficient and comprehensive models for feature extraction and prediction are created.

Jermyn

Working on this project has taught me a lot, especially in terms of Topic Modelling and understanding current works to further enhance our project. Gaining a basic understanding of what is available and making use of existing methodologies and suggestions for future work was an eye-opener in terms of the possibilities we can achieve with Text Mining. A simple classification task can be broken down into multiple depths of features used, their representation and the countless possible combinations of features. Learning how and why certain features are used was a delightful experience, and thinking of how we can extend the use of these models beyond our current project was intellectually satisfying. Beyond the classification task itself, a good dataset and the preprocessing/preparation of the data are important, as they are the backbone of the whole project. Due to our limited experience with certain techniques and the different stages of progress of the project, this posed a challenge as we had to keep producing multiple Excel sheets with varying data to fit the need at each point in time.
