Fake News Disambiguation: Why Is It Fake?

Attention-enhanced Multi-channel Recurrent Convolutional Network for Explainable Fake News Detection

Dec 11, 2021

This project considers the fake news detection problem under a more realistic scenario on social media. Given just a short source tweet, we aim to predict whether the tweet is fake or not, and to generate an explanation for this prediction. We present the Attention-enhanced Multi-channel Recurrent Convolutional Network (AMRCN) for explainable fake news detection. We explain our final predictions by highlighting the essential words in the short tweet text. Experimental results and extensive ablation studies show that our model outperforms the baseline systems on two benchmark datasets.

Social media has become an indispensable part of people’s lives. Users can interact, express their views, comment on others’ views, and access news through various social media platforms. Some people misuse these platforms to spread misinformation.

Figure 1. Motivating example behind highlighting the relative importance of words towards final prediction as fake/real.

As fake news pieces are intentionally created to spread inaccurate information and lure people into believing them, they often have opinionated and sensational language styles, which can help verify the integrity of the news piece. In addition, a news document contains linguistic cues at different levels, such as word-level and sentence-level, which provide different degrees of importance for the explainability of why the news is fake.

For example, consider an unreliable news piece, Iranian woman jailed for a fictional unpublished story about a woman stoned to death for adultery. The words jailed and fictional contribute more to deciding whether the news claim is fake or not than the other words in the sentence. Further, as shown in the above figure, the highlighted words matter more to the final prediction than the other parts of the news piece, as they give more information about its writing style.

Problem Statement

Given the short text content of a tweet, we aim to classify the tweet as fake or real, i.e., binary classification. Further, we aim to explain why the tweet is fake/real by highlighting the tokens/phrases in the tweet content that contribute relatively more to the final prediction.

Dataset Collection and Preprocessing

We use two well-known datasets, Twitter15 and Twitter16, for our work. These datasets contain source tweets and their corresponding sequences of retweet users. The dataset statistics are laid out in Table 1. Both datasets are class-balanced and consist of short tweets. Before running the experiments, we remove the stop words from the tweets, replace all URLs with the token URL, and stem the tweets.
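The three preprocessing steps above can be sketched as follows. This is only an illustration using the standard library: the stop-word list is a tiny hypothetical one, and `crude_stem` is a simple suffix stripper standing in for whichever stemmer was actually used, which the text does not specify.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a
# fuller list from an NLP library.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "for", "of", "to",
              "in", "on", "and", "about"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
SUFFIXES = ("ing", "ed", "es", "s")  # crude stand-in for a real stemmer


def crude_stem(token: str) -> str:
    """Strip one common suffix. This is NOT the Porter algorithm."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(tweet: str) -> List[str]:
    """Replace URLs with the token URL, drop stop words, stem the rest."""
    text = URL_PATTERN.sub(" URL ", tweet)
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    cleaned = []
    for token in tokens:
        if token == "URL":
            cleaned.append(token)  # keep the placeholder untouched
            continue
        token = token.lower()
        if token in STOP_WORDS:
            continue
        cleaned.append(crude_stem(token))
    return cleaned


print(preprocess("Iranian woman jailed for a fictional story http://t.co/abc"))
# ['iranian', 'woman', 'jail', 'fictional', 'story', 'URL']
```

Mapping every link to a single URL token keeps the vocabulary small while still letting the model see that a link was present.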

Baseline Models

We evaluate several baselines on the collected datasets, as listed in Table 2.

NB (Multinomial Naive Bayes)
LR (Logistic Regression)
DT (Decision Trees)
SVM (Support Vector Machines)
RF (Random Forest, Bagging)
XGB (XGBoost, Boosting)
CNN (Convolutional Neural Networks)
LSTM (Long Short Term Memory cells)
GRU (Gated Recurrent Units)
Bi-RNN (Bi-directional Recurrent Neural Networks)
RCNN (Recurrent Convolutional Neural Networks)

The neural network baselines have an embedding layer containing the 100-dimensional GloVe embeddings of the tokens. We report the F1 score, accuracy, precision and recall to get a better picture of the models’ performance on both datasets. All the models were trained on an 80:10:10 Train-Test-Val split of the dataset.
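For reference, the four metrics can be computed from the confusion counts of a binary classifier as in the sketch below. Treating label 1 as the fake class is an assumption made here for illustration, and the labels in the example are made up rather than taken from either dataset.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 1 = fake (assumed positive class), 0 = real.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
# {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```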

Table 2. Baseline Results and Ablation Study

Baseline Models Analysis and Ablation Study

For the simple machine learning models (1–5), we transform the input tweet into a feature vector. We experiment with four types of features: (a) Count Vectors, (b) Word-level TFIDF, (c) Character-level TFIDF, and (d) N-Gram-level TFIDF vectors. We observe that for the simple ML models, the Count vectors and word-level TFIDF vectors lead to better F1 scores than the other two. Further, NB, SVM and LR perform better on both datasets than the Bagging and Boosting models. This is probably because the simpler models are better suited to short tweets.

In the case of CNN, we experiment with both average and max pooling, with adding a dropout layer, and with making the embedding layer trainable. The results show that max pooling gives a better F1 score than average pooling.
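The four feature types differ mainly in what counts as a term. The sketch below makes that concrete in plain Python. The helper names, the n-gram sizes and the smoothed IDF variant are assumptions chosen for illustration, not the exact configuration used in the experiments.

```python
import math
from collections import Counter


def word_tokens(text):
    return text.lower().split()


def word_ngrams(text, n=2):
    words = word_tokens(text)
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n=3):
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def count_vectors(docs, analyzer):
    """(a) Raw term counts per document."""
    return [Counter(analyzer(doc)) for doc in docs]


def tfidf_vectors(docs, analyzer):
    """(b)-(d) TF-IDF; the analyzer decides word, char or n-gram level."""
    counts = count_vectors(docs, analyzer)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for c in counts for term in c)
    # One common smoothed IDF variant (an assumption, not the authors' choice).
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


docs = ["fatal bear attack", "bear attack report"]
word_level = tfidf_vectors(docs, word_tokens)
char_level = tfidf_vectors(docs, char_ngrams)
ngram_level = tfidf_vectors(docs, word_ngrams)

# "bear" occurs in both tweets, so it is weighted lower than "fatal".
print(word_level[0])
print(sorted(ngram_level[0]))  # ['bear attack', 'fatal bear']
```

Terms shared by every tweet get the minimum weight, which is why TFIDF tends to push attention toward distinctive words.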

Making the embedding layer trainable boosts performance in the case of LSTM, GRU, BiGRU and RCNN. This is intuitive, because a trainable embedding layer lets the initial GloVe embeddings be fine-tuned on the dataset at hand. In a more general scenario, however, trainable embeddings may not be ideal, because they make the model highly dataset-specific.

Further, in RCNN, using multiple CNNs instead of one increases performance, because a hierarchical CNN structure has the added benefit of capturing sequential correlation and more linguistic cues.

Methodology

The model architecture is shown in Figure 3. It can be divided into four broad parts: (1) CNN-based Representation, (2) Attention-enhanced Word-level Encoder, (3) Multiple Channels, and (4) Explainability of the Prediction.

CNN-based Representation

Convolutional Neural Networks (CNNs) are good at learning sequential correlations, and we use this to our advantage. The input GloVe embeddings of the tokens are passed through a 1D convolutional layer.
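To make the 1D convolution step concrete, here is a minimal NumPy sketch. The embeddings are random stand-ins for 100-dimensional GloVe vectors, and the filter count, the absence of padding and the ReLU activation are assumptions, since the text does not specify them.

```python
import numpy as np


def conv1d_valid(embeddings, kernels, bias):
    """Slide each filter over the token axis without padding.

    embeddings: (seq_len, emb_dim)
    kernels:    (kernel_size, emb_dim, n_filters)
    bias:       (n_filters,)
    returns:    (seq_len - kernel_size + 1, n_filters)
    """
    seq_len, _ = embeddings.shape
    kernel_size, _, n_filters = kernels.shape
    steps = seq_len - kernel_size + 1
    out = np.empty((steps, n_filters))
    for t in range(steps):
        window = embeddings[t:t + kernel_size]  # (kernel_size, emb_dim)
        # Contract the window with every filter at once.
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)  # ReLU (assumed activation)


rng = np.random.default_rng(0)
tweet_embeddings = rng.normal(size=(10, 100))   # 10 tokens, 100-d stand-ins
kernels = rng.normal(size=(3, 100, 8)) * 0.1    # kernel size 3, 8 filters
features = conv1d_valid(tweet_embeddings, kernels, np.zeros(8))
print(features.shape)  # (8, 8)
```

Each output row summarizes a window of three neighbouring tokens, which is the local context the recurrent encoder then consumes.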

Figure 3. The architecture of the Attention Mechanism used in the Model.

Word Encoder

We learn the sentence representation via a recurrent neural network (RNN) based word encoder. Although an RNN can capture long-term dependencies in theory, in practice it suffers from an information bottleneck and vanishing gradients, and it tends to “forget” earlier information. To capture long-term dependencies better, we employ a bidirectional GRU, since GRUs are known for more persistent memory.

The CNN-based representation is fed into the word encoder. The word-level representations are obtained by concatenating the forward and backward hidden states of the BiGRU, which helps capture the word-level context. Next, using a one-layer MLP and a softmax function, we calculate the normalized importance (attention) weights over the word representations generated by the word encoder. Figure 3 shows the attention mechanism along with the model architecture used.
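The attention step described above can be sketched in NumPy as follows. The tanh projection and the learned context vector `u` follow a common formulation of word-level attention and are assumptions here. All parameters are random stand-ins rather than trained values.

```python
import numpy as np


def softmax(x):
    shifted = x - np.max(x)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def attention_pool(hidden, W, b, u):
    """One-layer MLP scoring followed by a softmax over the words.

    hidden: (seq_len, hidden_dim) word representations from the BiGRU
    W, b:   MLP parameters, shapes (hidden_dim, attn_dim) and (attn_dim,)
    u:      context vector, shape (attn_dim,)  (assumed formulation)
    """
    projected = np.tanh(hidden @ W + b)  # (seq_len, attn_dim)
    scores = projected @ u               # (seq_len,)
    weights = softmax(scores)            # normalized importance per word
    context = weights @ hidden           # weighted sum, (hidden_dim,)
    return weights, context


rng = np.random.default_rng(0)
seq_len, units, attn_dim = 6, 4, 5
forward_states = rng.normal(size=(seq_len, units))
backward_states = rng.normal(size=(seq_len, units))
# Word-level representation: forward and backward states concatenated.
hidden = np.concatenate([forward_states, backward_states], axis=-1)

W = rng.normal(size=(2 * units, attn_dim))
b = np.zeros(attn_dim)
u = rng.normal(size=attn_dim)

weights, context = attention_pool(hidden, W, b, u)
print(weights.round(3), context.shape)
```

The `weights` vector has one entry per word and sums to one, which is what makes it usable as a word-importance explanation.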

Multiple Channels

The final model, which achieves the highest performance, has three different channels. The channels start with CNN-based representations with kernel sizes 3, 4 and 5 respectively, followed by a bidirectional GRU with merge modes concat, sum, and ave respectively. They also have dropout layers with different dropout ratios, i.e. 0.5, 0.4, and 0.2, respectively. Finally, the outputs of all three channels are concatenated together.
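The sketch below illustrates how the three merge modes change the size of each channel's output and how the channels are then concatenated. The kernel sizes, merge modes and dropout ratios come from the description above. The number of GRU units and the random vectors standing in for real hidden states are assumptions, and dropout is omitted because it is inactive at inference time.

```python
import numpy as np

# Per-channel settings as described in the text.
CHANNELS = [
    {"kernel_size": 3, "merge_mode": "concat", "dropout": 0.5},
    {"kernel_size": 4, "merge_mode": "sum", "dropout": 0.4},
    {"kernel_size": 5, "merge_mode": "ave", "dropout": 0.2},
]


def merge_directions(forward, backward, mode):
    """Combine forward and backward GRU states according to the merge mode."""
    if mode == "concat":
        return np.concatenate([forward, backward], axis=-1)
    if mode == "sum":
        return forward + backward
    if mode == "ave":
        return (forward + backward) / 2.0
    raise ValueError(f"unknown merge mode: {mode}")


rng = np.random.default_rng(0)
units = 32  # hypothetical number of GRU units per direction

channel_outputs = []
for cfg in CHANNELS:
    # Random stand-ins for the final forward/backward states of a channel.
    forward = rng.normal(size=units)
    backward = rng.normal(size=units)
    channel_outputs.append(merge_directions(forward, backward, cfg["merge_mode"]))

combined = np.concatenate(channel_outputs)
print([out.shape for out in channel_outputs], combined.shape)
# [(64,), (32,), (32,)] (128,)
```

Only the concat channel doubles its width, so the combined representation is four times the unit count rather than six.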

Explainability

We use the attention weights at the end of the model pipeline to highlight the words most relevant to deciding whether a piece of news is fake. Figures 4 and 5 visualize these attention weights for four example input tweets from the Twitter15 dataset. Figure 4 shows the attention weights in a model without channels, and Figure 5 shows the distribution of weights in the final model with three channels. As the figures show, the weights in the model with channels are more concentrated, which gives a more precise explanation. When we remove the channels, the weights are spread more evenly across the tweet text, which makes the highlighting of the most important words less precise. This gives a visual justification for using multiple channels in the model architecture. The visualizations show the natural logarithm of the attention weights rather than the raw weights: a few weights are much larger than the rest, which makes the raw values hard to visualize, and taking the log compresses this range.
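A small sketch of the log transform used for visualization is given below. The attention weights are made-up numbers, not model output, and the min-max rescaling to a 0 to 1 highlight intensity is an added assumption about how the log values could be mapped to colours.

```python
import math


def log_scaled(weights, eps=1e-12):
    """Natural log of each weight, rescaled to [0, 1] for highlighting."""
    logs = [math.log(w + eps) for w in weights]  # eps guards against log(0)
    lo, hi = min(logs), max(logs)
    if hi == lo:
        return [1.0] * len(weights)
    return [(value - lo) / (hi - lo) for value in logs]


tokens = ["fast", "and", "furious", "7", "scrapped", "following",
          "paul", "walker's", "death", "report"]
# Hypothetical attention weights that sum to 1 (illustration only).
raw = [0.001, 0.001, 0.002, 0.001, 0.005, 0.002, 0.4, 0.38, 0.2, 0.008]

intensity = log_scaled(raw)
for token, w, shade in zip(tokens, raw, intensity):
    print(f"{token:10s} raw={w:.3f} log-scaled={shade:.2f}")

top3 = sorted(zip(intensity, tokens), reverse=True)[:3]
print([token for _, token in top3])  # ['paul', "walker's", 'death']
```

On the raw scale the seven small weights would be nearly indistinguishable from zero, whereas after the log they occupy a visible part of the colour range.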

As can be seen from the example visualization above, in the sentence fast and furious 7 scrapped following paul walker’s death report, the words paul, walker, and death are the most significant for the final prediction that it is a true piece of information. Similarly, in the other example, words like fatal, bear, and attack are more relevant. The explanations given by the attention weights are therefore quite reasonable.

Model Results and Analysis

Table 3 presents the ablation study for our model, AMRCN. The final model, AMRCN w/ Channels, outperforms all the baseline models in terms of F1 score on the binary classification task, in addition to producing the explanations discussed in the Explainability section. This suggests that attention, which we use for explainability, also boosts performance as a side effect. This is intuitive: a model that can identify the words driving its prediction is likely, directly or indirectly, to make better predictions. Our attention-enhanced framework thus proves effective both at giving explanations and at distinguishing between fake and real tweets.

In the ablation study, as shown in Table 3, removing the CNN from the model degrades performance. This supports our hypothesis that the CNN-based representation helps capture the sequential correlation and thus boosts performance. We can also see the importance of using multiple channels, as AMRCN without channels performs worse than the final model. All the variants of AMRCN introduced in the ablation table perform comparably with the baseline models.

Further, performance improves when a simple CNN is replaced with a hierarchical CNN (HCNN), primarily because the HCNN can capture more linguistic cues. For the merge mode in the bidirectional GRU (how the hidden state representations from the forward and backward layers are combined), we tried summing, averaging and simple concatenation. The averaging merge mode performs better than the other two. All three modes are used in parallel in our final model, one per channel. We also observed that replacing the BiGRU with a unidirectional GRU lowers the F1 score, since the BiGRU captures both forward and backward context.

Conclusion

Our model AMRCN outperforms the baseline models and provides a reasonable explanation by highlighting the most important words/phrases towards the final prediction. We perform extensive experiments on two real-world datasets and observe the effectiveness of this attention-enhanced framework in generating explanations.

Acknowledgements

This work was carried out as a part of the Machine Learning (PG) course project, for Winter 2021, at Indraprastha Institute of Information Technology, Delhi, with valuable contributions from teammates Niralis and Aisha Aijaz Ahmad.

Link to Source Code

https://github.com/karish-grover/Fake-News-Disambiguation-Why-Is-It-Fake