Are You for Real?

Published in Stanford CS224W GraphML Tutorials

13 min read · Jul 29, 2023

Fake News Detection on Incomplete Twitter Networks

by Amrita Palaparthi, Apoorva Dixit, and Megan Worrel as part of Stanford’s CS 224W course: Machine Learning with Graphs.

Background/Motivation

The Nature and Spread of Fake News

Fake news has emerged as a powerful, insidious force over the past decade, one with the capacity to lend false credibility to conspiracy theories, influence perceptions on key political issues, and even endanger public health and safety. Social media serves as a catalyst for the spread of misinformation online, enabling falsehoods from the fringes of the internet to gain near-instantaneous worldwide reach over platforms like Facebook and Twitter. According to a 2018 study, tweets containing deceptive content were a staggering 70% more likely to be retweeted than truthful ones and were significantly faster to diffuse through the network than the truth [1].

These worrying statistics highlight how critical it is to detect and remove fake news before it can spread malicious falsehoods through social media networks. Since manually pinpointing untruths on global platforms with billions of users is intractable, intelligent automatic detection is necessary to combat fake news at scale. Traditionally, efforts toward this goal have focused solely on the text within fake news articles to tell truth apart from fiction. However, we build on recent work to show how the manner in which information spreads from user to user can also be leveraged to help differentiate fake news from legitimate content [3].

In this post and its accompanying implementation tutorial, we explore the spread of fake news over Twitter. More specifically, we detail how to use GNNs for fake news identification, classifying a piece of news shared on the platform based on both the content of the article itself and its spread from user to user across the Twitter graph.

Further, our implementation aims to account for the real-world problem of tweet ephemerality. In real fake news propagation networks, it is possible that malicious or bot accounts that posted or retweeted misinformation pieces will have been deactivated, rendering their tweets inaccessible. In addition, users may choose to delete tweets referencing fake news. Both of these disruptions would cause once-visible nodes to no longer be accessible in our news propagation graph and make tracking the spread of misinformation through Twitter more challenging. Our approach moves toward a fake news detection model that is robust to such incomplete representations of the spread of articles on Twitter.

Our implementation has four main components and builds on the architecture presented by Dou et al. in their paper, User Preference-aware Fake News Detection (UPFD) [3]:

A GNNStack class that builds on the GraphSAGE GNN architecture
Graph preprocessing for node removal to simulate tweet ephemerality
A BERT embedding of the source news article itself, provided as part of the UPFD dataset
Sentiment and lexical analysis using features derived from Empath [5]

Follow along with us as we integrate each of these components into a cohesive model that outperforms UPFD to detect fake news with over 86% accuracy!

Follow Along in Our Colab:

FakeNewsClassifer.ipynb

Detecting Misinformation using GNNs

colab.research.google.com

Note 1: Before running any cells, navigate to Runtime > Change Runtime Type and select GPU. Training is much faster on a GPU!
Note 2: Be sure to run all cells in this notebook in order to correctly preserve variables for later use.

Dataset

For this tutorial, we use the UPFD dataset available on PyG, which contains 5778 graphs extracted from the FakeNewsNet dataset. Each graph is undirected and depicts the spread of either fake or real news on Twitter.

Graphs in UPFD have a tree structure, with the root node corresponding to the source news article. Edges in the graphs correspond to retweets. Nodes that are direct neighbors of the root node correspond to users who directly retweeted the source tweet. Nodes that are 2-hop neighbors of the root node correspond to users who retweeted the news from the retweet of a direct neighbor of the root node. Since Twitter doesn’t explicitly state which account a user retweeted a tweet from, edges in the graph are determined according to two rules. First, for users who follow other users who previously retweeted the news, edges are added between those users and the users they follow whose retweet is most recent. Second, for users who don’t follow other users who previously retweeted the news, edges are added between those users and other users who previously retweeted the news with the greatest number of followers.
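The two edge rules above can be sketched in plain Python. This is only an illustrative sketch, not the code used to build UPFD: the function name `infer_parent`, the input structures (`earlier`, `follows`, `follower_count`), and the fallback to the root node when nobody has retweeted yet are all assumptions made for this example.

```python
def infer_parent(user, earlier, follows, follower_count, root=0):
    """Pick the node that `user` is assumed to have retweeted from.

    earlier: list of (user_id, timestamp) pairs for retweets made before `user`'s.
    follows: dict mapping a user to the set of users they follow.
    follower_count: dict mapping a user to their number of followers.
    """
    if not earlier:
        # Assumption: with no earlier retweeters, attach the user to the news node.
        return root

    followed = [(u, t) for u, t in earlier if u in follows.get(user, set())]
    if followed:
        # Rule 1: among followed users, take the one with the most recent retweet.
        return max(followed, key=lambda pair: pair[1])[0]

    # Rule 2: otherwise take the earlier retweeter with the most followers.
    return max(earlier, key=lambda pair: follower_count.get(pair[0], 0))[0]


earlier = [("ann", 10), ("bob", 25), ("cat", 40)]
follows = {"dan": {"ann", "bob"}, "eve": set()}
follower_count = {"ann": 500, "bob": 90, "cat": 120}

print(infer_parent("dan", earlier, follows, follower_count))  # bob
print(infer_parent("eve", earlier, follows, follower_count))  # ann
```

Here "dan" follows two earlier retweeters and is linked to the more recent one, while "eve" follows none of them and is linked to the retweeter with the most followers.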

One-hop (yellow) and two-hop (purple) neighbors of a root news node (blue)

The dataset contains an even 50/50 split between real and fake news propagation graphs sourced from the Politifact and Gossipcop fact-checking websites. The Politifact-labeled graphs contain an average of 131 nodes per graph, and the Gossipcop-labeled graphs contain an average of 58 nodes per graph. Our analysis focuses on the 314 graphs with labels sourced from Politifact. The dataset is split into train/val/test sets using the inductive setting, where we take all Politifact-labeled graphs in the dataset and place 20% of them in the training set, 10% in the validation set, and the remaining 70% in the test set (following the split used by Dou et al.).

Nodes can be assigned one of four different feature vectors included in the UPFD dataset. We use the 768-dimensional BERT features. For the user nodes, the features represent an embedding of the Twitter user’s previous tweets encoded by bert-as-a-service. For the root news node, the features represent an embedding of the news textual data.

An example tree structure depicting the spread of a news article on Twitter. Node 0 represents the news embedding, nodes 1–9 represent tweets of this article, and the remaining nodes represent retweets of other Twitter users.

The code below details how we load this graph data from UPFD:

Task

We build a Graph Neural Network (GNN) that performs binary classification on a given input graph and corresponding news article to determine whether the graph depicts the spread of real or fake news.

Our Model

Graph Preprocessing

To simulate the problem of tweet ephemerality, we preprocessed our graphs by randomly removing a fraction of the nodes and their corresponding edges. Since our graphs follow a tree structure, we reattached any children of removed nodes to the root (news) node, which was never removed from the graphs. This allowed the child nodes to still receive information from the root node. We originally loaded the UPFD dataset as directed graphs but converted them to undirected graphs to be consistent with the work done by Dou et al.

Our 3-step node removal process: randomly select nodes, remove nodes from the tree, and reconnect children to the root news node

We removed between 0% and 90% of the nodes in each graph, in increments of 10%. In each experiment, we removed nodes from the training and validation sets but not from the test set. This mimics real-world training on limited historical tweet information (due to the ephemerality of bot accounts) alongside richer real-time tweet information at prediction time.
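As a minimal sketch of the three-step process (select, remove, reconnect), the plain-Python example below works on a tree stored as a child-to-parent dictionary. It is not the Colab's function: the names `prune_tree` and `remove_fraction`, the dictionary representation, and the choice to apply the fraction to non-root nodes only are assumptions made for illustration.

```python
import random


def prune_tree(parent, removed, root=0):
    """Drop `removed` nodes and reattach their surviving children to the root.

    parent: dict mapping each non-root node to its parent in the tree.
    """
    new_parent = {}
    for node, par in parent.items():
        if node in removed:
            continue  # the node and the edge to its parent disappear
        new_parent[node] = root if par in removed else par
    return new_parent


def remove_fraction(parent, fraction, root=0, seed=None):
    """Randomly remove `fraction` of the non-root nodes, then repair the tree."""
    rng = random.Random(seed)
    candidates = sorted(parent)  # the root is never a key, so it is never removed
    k = round(fraction * len(candidates))
    removed = set(rng.sample(candidates, k))
    return prune_tree(parent, removed, root), removed


# Toy propagation tree: 0 is the news node; 1 and 2 retweeted it directly.
tree = {1: 0, 2: 0, 3: 1, 4: 1, 5: 3}
print(prune_tree(tree, removed={1}))  # {2: 0, 3: 0, 4: 0, 5: 3}

for fraction in [i / 10 for i in range(10)]:  # 0.0, 0.1, ..., 0.9
    pruned, removed = remove_fraction(tree, fraction, seed=0)
    print(fraction, sorted(removed), pruned)
```

Removing node 1 reconnects its children (3 and 4) to the news node, while node 5 keeps its surviving parent 3.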

Visualization of Graphs at Different % Nodes Removed

Check out our function that performs our three-step removal process:

News Article Preprocessing for Sentiment+Topic Analysis

We additionally augment our model by exploring the impact of sentiment and topic analysis. To do this, we leverage Empath [5], a tool that scores a piece of text across 194 categories based on word frequency. Categories include emotional and psychological terms such as “joy” and “irritability” as well as topics like “security” and “football.”
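To make the idea of frequency-based category scores concrete, here is a toy sketch. It does not use Empath or its lexicon: the two categories and their word lists are made up for illustration, and dividing by the token count is an assumption about how such scores might be normalized.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon; Empath's real category word lists are far larger.
TOY_LEXICON = {
    "joy": {"happy", "delighted", "celebrate"},
    "security": {"police", "guard", "threat"},
}


def category_scores(text, lexicon=TOY_LEXICON):
    """Return, per category, the share of tokens that belong to that category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return {
        category: sum(counts[word] for word in words) / total
        for category, words in lexicon.items()
    }


scores = category_scores("Police celebrate as the threat ends and fans celebrate")
print(scores)
```

Each article then yields one fixed-length score vector (194 entries in Empath's case), which is what allows the scores to be used as an extra feature vector per graph.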

To compute these Empath scores, we must first extract the text of each news article. While the PyG UPFD dataset does not provide source articles, we use the web scraping tool available on the original FakeNewsNet dataset GitHub to retrieve article text from the Wayback Machine internet archive. We then match the scraped articles to their article IDs within the UPFD dataset to generate a mapping from each graph to its corresponding text. For your convenience, we provide this mapping as a JSON file loaded as part of our Colab.

The following lines in the load_processed_graphs preprocessing function detail how to retrieve the mappings from this JSON file given a preset constant URL, as well as how to retrieve the integer indices corresponding to each graph in UPFD:

In the following function, we calculate the Empath features corresponding to this text and combine this result with our node removal process to generate our DataLoader:

GraphSAGE

Before we dive into the model architecture, it is important to understand the pieces that enable our model to make reliable predictions on graphs. GraphSAGE [6] is a type of GNN that learns inductive representations of graphs. This means that GraphSAGE is able to generalize and make predictions on previously unseen graphs, and its ability to classify a graph is independent of the graph size or structure. This is important for our use case as we want to classify the graphs in our test set without having trained on them. Additionally, GraphSAGE is used for performing graph-level tasks as it is adept at representing graphs which have rich node attributes. These node attributes in our dataset represent user profiles and behavior in addition to original news article content, which is exactly the kind of information we want to encode in our representation of a graph.

GraphSAGE updates representations for each node in order to incorporate their local neighborhood structure using the following equation.

Message Passing in GraphSAGE

While this equation depicts message passing for a single node, GraphSAGE applies it to every node in the graph. It incorporates the neighborhood structure of each node, also referred to as the central node, by aggregating the hidden encodings of all its neighboring nodes from the previous layer. The aggregation method can be any permutation-invariant function. In our implementation, we use a mean over all the neighborhood node embeddings. It then concatenates this aggregation with the previous-layer embedding of the central node and performs a non-linear transformation to get the final embedding of the central node. Note that our implementation applies only a linear transformation here, omitting the sigmoid. This does not hurt expressivity because we apply a ReLU in our GNN, as described below.

In order to get the final embedding of the graph, we use the READOUT function as described above. The final embedding representing the entire graph is simply a mean over the embeddings of all nodes in the graph.
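For reference, the update and readout described in the text can be written out as follows. This is a reconstruction from the prose, using notation in the style of the GraphSAGE paper [6]; the symbols are assumptions: h_v^{(l)} is the embedding of node v after layer l, N(v) is its set of neighbors, W^{(l)} is the layer's weight matrix, and L is the final layer. The outer non-linearity of the original formulation is omitted here to match the linear-only variant described above.

```latex
\begin{align*}
h_{N(v)}^{(l)} &= \frac{1}{|N(v)|} \sum_{u \in N(v)} h_u^{(l-1)}
  && \text{(mean aggregation over neighbors)} \\
h_v^{(l)} &= W^{(l)} \cdot \mathrm{CONCAT}\left(h_v^{(l-1)},\, h_{N(v)}^{(l)}\right)
  && \text{(linear transformation of the concatenation)} \\
h_G &= \mathrm{READOUT}\left(\{h_v^{(L)} : v \in V\}\right)
     = \frac{1}{|V|} \sum_{v \in V} h_v^{(L)}
  && \text{(mean over all node embeddings)}
\end{align*}
```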

Visualization of a single layer of message passing in GraphSAGE
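The same computation can be traced by hand on a tiny graph. The sketch below uses plain Python lists instead of PyTorch tensors so that every step is visible. It is not the Colab's layer: the function names, the dictionary-based graph, and the hand-picked weight matrix are assumptions for illustration.

```python
def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def sage_layer(h, neighbors, weight):
    """One mean-aggregation GraphSAGE step with a linear transformation.

    h: dict node -> embedding (list of floats)
    neighbors: dict node -> list of neighboring nodes
    weight: matrix with one row per output dimension, each of length 2 * in_dim
    """
    out = {}
    for v, h_v in h.items():
        nbrs = neighbors.get(v, [])
        # Aggregate neighbor embeddings; isolated nodes get a zero vector.
        agg = mean([h[u] for u in nbrs]) if nbrs else [0.0] * len(h_v)
        z = h_v + agg  # list concatenation: [self embedding, neighbor mean]
        out[v] = [sum(w * x for w, x in zip(row, z)) for row in weight]
    return out


def readout(h):
    """Graph embedding as the mean of all node embeddings."""
    return mean(list(h.values()))


# Node 0 is the root; nodes 1 and 2 are its one-hop neighbors.
h = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [4.0, 2.0]}
neighbors = {0: [1, 2], 1: [0], 2: [0]}
weight = [[1, 0, 1, 0], [0, 1, 0, 1]]  # adds self embedding and neighbor mean

h1 = sage_layer(h, neighbors, weight)
print(h1)           # {0: [3.0, 2.0], 1: [1.0, 2.0], 2: [5.0, 2.0]}
print(readout(h1))  # [3.0, 2.0]
```

With this particular weight matrix, each output is simply the node's own embedding plus the mean of its neighbors' embeddings, which makes the result easy to check by hand.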

Follow along with our GraphSAGE message passing implementation:

GNNStack

We create the GNNStack class to combine the GraphSAGE message passing layer with other traditional deep learning methods in order to increase the expressiveness of our model and prevent overfitting. In particular, we have:

Linear transformation of the input,
Batch Normalization in order to speed up learning,
GraphSAGE to facilitate message passing,
ReLU to add a non-linearity and improve expressiveness,
Dropout to help regularization and prevent overfitting.

These components are detailed in the following GNNStack class definition:

Number of Layers

For our final model, we want to apply the GNNStack layer described above repeatedly. Graph neural networks differ from traditional deep learning pipelines in that increasing the number of layers does not necessarily improve expressiveness. If we perform message passing over too many layers, the embeddings of the nodes in a graph can converge to nearly identical values, a problem known as over-smoothing. This is undesirable because our model should generate different embeddings for different nodes. At the same time, it is important that we capture the global structure of the graph. Thus, we compute the number of layers as the logarithm of the average number of nodes per graph in the dataset. The logarithm base can be thought of as the average branching factor of the graph and can be treated as a hyperparameter. This allows final node embeddings to be influenced by neighbors up to k hops away, where k is our approximation of the depth of the tree.
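A small sketch of this calculation is shown below. The helper name `num_layers`, the rounding to the nearest integer, and the minimum of one layer are assumptions, since the text only specifies a logarithm of the average node count with the branching factor as its base.

```python
import math


def num_layers(avg_nodes, branching_factor):
    """Approximate tree depth as log_b(average nodes), with at least one layer."""
    depth = math.log(avg_nodes) / math.log(branching_factor)
    return max(1, round(depth))


# Average graph sizes reported for the two UPFD subsets.
for name, avg_nodes in [("Politifact", 131), ("Gossipcop", 58)]:
    for b in (2, 3, 5):
        print(name, "branching factor", b, "->", num_layers(avg_nodes, b), "layers")
```

A larger assumed branching factor gives a shallower tree estimate and therefore fewer message passing layers.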

Augmenting Features

Original News Source

Like Dou et al., we concatenate the original embedding of the news textual data with the output of our GNN so that it can be represented more richly for the final classification. This embedding is provided as part of the UPFD dataset, and it is generated by applying BERT to the source news article for each graph and setting this value as the node feature for the graph’s root.

The following three lines selected from our implementation of the FakeNewsClassifier class demonstrate how we derive the news embedding from our input data and pass it through Linear and ReLU layers:

Sentiment Analysis

As described in the preprocessing section above, we additionally generate Empath scores in 194 categories to represent sentiment and topic analysis results for our original news article. We append these scores to the output of our GNN as well, augmenting the final set of features representing the graph. This allows our model to make predictions based on not only the content of the original news article and how the news is spread, but also on the sentiment reflected in the original news article.

We process these Empath features similarly to the original news source in our model architecture prior to concatenation, as shown in this code snippet:

Putting it all Together

The architecture diagram above shows how the three branches of our model (relating to our input graph, news article embedding, and Empath features) interact with one another. The outputs of the three branches are concatenated, then passed through a linear layer and a sigmoid to generate a binary classification of Real/Fake for a given graph.

See how the pieces come together in our FakeNewsClassifier class:

Analysis

We tuned our learning rate, weight decay, dropout, and batch size on our model with 0 nodes removed. The different hyperparameters tested were:

Learning rates: 0.1, 0.01, 0.001, 0.0001
Weight decays: 1e-4, 1e-5, 1e-6, 1e-7
Dropouts: 0.5, 0.6, 0.7
Batch sizes: 16, 32
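The search space listed above can be enumerated as follows. This sketch assumes an exhaustive grid search, which the text does not state explicitly. The `evaluate` callback is a hypothetical placeholder for a function that trains a model with a given configuration and returns its validation accuracy.

```python
from itertools import product

search_space = {
    "lr": [0.1, 0.01, 0.001, 0.0001],
    "weight_decay": [1e-4, 1e-5, 1e-6, 1e-7],
    "dropout": [0.5, 0.6, 0.7],
    "batch_size": [16, 32],
}

# One dict per combination of values, e.g. {"lr": 0.1, "weight_decay": 1e-4, ...}.
configs = [
    dict(zip(search_space, values)) for values in product(*search_space.values())
]
print(len(configs))  # 96 = 4 * 4 * 3 * 2


def grid_search(configs, evaluate):
    """Return the configuration with the highest score from `evaluate(config)`."""
    return max(configs, key=evaluate)
```

Under this assumption, tuning means 96 training runs, which is manageable for the 314 Politifact graphs on a Colab GPU.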

The best hyperparameter configuration achieved a validation accuracy of 0.9354838709677419. The best hyperparameters were:

Learning Rate: 0.001
Weight Decay: 1e-06
Dropout: 0.7
Batch Size: 16

Note: Our high dropout rate introduces some stochasticity into model performance. If you’ve been following along in our Colab so far, you may see different best hyperparameters and accuracies.

Results

Performance Without Node Removal

We first examine the performance of our model when trained on the full training dataset without the added constraint of our random node removal process. Let’s compare the performance of this model with and without the addition of Empath sentiment analysis features.

Model Performance at 0% Nodes Removed and No Empath Features

Best Validation Accuracy: 0.8709677419354839
Test Accuracy: 0.8190045248868778

Model Performance at 0% Nodes Removed and Empath Features

Best Validation Accuracy: 0.9354838709677419
Test Accuracy: 0.8552036199095022

The training loss and validation accuracy curves achieved while training our model with Empath Features show that training loss converges fairly smoothly.

Validation Accuracy and Training Loss Over 200 Training Iterations

Accuracy comparisons between our models and UPFD

As the table reveals, the use of Empath features boosts our test accuracy by roughly 3.6 percentage points (from 81.9% to 85.5%)!
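The reported accuracies can be read as counts of correctly classified graphs. The sketch below assumes split sizes of 62/31/221, obtained by applying the 20/10/70 split to the 314 Politifact graphs and rounding down. These sizes are an assumption, although the reported accuracies are consistent with them.

```python
# Assumed split sizes: 20% / 10% / 70% of 314 graphs, rounding down.
n_total = 314
n_train = int(0.2 * n_total)
n_val = int(0.1 * n_total)
n_test = n_total - n_train - n_val

# Accuracies as reported in the results above, paired with their split size.
reported = {
    "val, no Empath": (0.8709677419354839, n_val),
    "test, no Empath": (0.8190045248868778, n_test),
    "val, Empath": (0.9354838709677419, n_val),
    "test, Empath": (0.8552036199095022, n_test),
}


def correct_count(accuracy, n):
    """Number of correctly classified graphs implied by an accuracy on n graphs."""
    return round(accuracy * n)


for name, (accuracy, n) in reported.items():
    print(f"{name}: {correct_count(accuracy, n)}/{n} correct ({100 * accuracy:.1f}%)")

gain = reported["test, Empath"][0] - reported["test, no Empath"][0]
print(f"test gain from Empath: {100 * gain:.1f} percentage points")
```

Under these assumptions, the Empath features correspond to 8 additional test graphs classified correctly (189 versus 181 out of 221), and each validation graph accounts for more than 3 percentage points of validation accuracy.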

Performance With Node Removal

Now, let’s examine how removing nodes during training to simulate a dearth of past data impacts our performance.

The following two figures plot how validation and test accuracies change as the percentage of nodes removed increases from 0% to 90%:

Trends in Validation and Test Accuracy Across Varying Node Removal Percentages

Overall, we see that validation and test accuracy tend to decrease as the percentage of nodes removed increases. This matches our intuition: as nodes are removed from the training data, the network has less information to train on. Interestingly, the model still performs relatively well with 90% of nodes removed. This suggests that the root news node and sentiment analysis features are highly predictive of whether the news is real or fake, even without substantial graph information.

Surprisingly, the model accuracy is highest when some fraction of nodes is removed from the training data! With node removal (the optimal percentage varies between 10% and 30%), we achieve our highest test accuracy of 86.4%.

Differences in Validation and Test Accuracy Across Varying Node Removal Percentages

Further, it appears that the gap between validation and test accuracy decreases as the fraction of nodes removed increases. We visualize the difference in the plot above.

It is probable that the node removal served as a form of regularization, leading to high accuracies even with substantial fractions of node removal. In particular, removing 10–30% of nodes in our training data enabled our model to generalize better from the smaller training and validation sets to the larger, more diverse test set.

Congratulations!

You’ve implemented all of the preprocessing, model architecture, training, and evaluation for a fake news detection model. We hope you’ve enjoyed walking through this application of GNNs with us as much as we’ve enjoyed making it!

We’d like to thank Prof. Jure Leskovec and the course staff of CS224W: Machine Learning with Graphs for providing us with the tools and knowledge to understand, implement, and build on GNN models over the course of the quarter.

References

[1] Vosoughi, S., Roy, D., & Aral, S. (2018). The spread of true and false news online. Science, 359(6380), 1146–1151. https://doi.org/10.1126/science.aap9559

[2] Trusted Web Foundation. (January 28, 2021). Share of people who have ever accidentally shared fake news or information on social media in the United States as of December 2020 [Graph]. In Statista. Retrieved March 20, 2023, from https://www.statista.com/statistics/657111/fake-news-sharing-online/

[3] Dou, Y., Shu, K., Xia, C., Yu, P. S., & Sun, L. (2021). User preference-aware fake news detection.

[4] Loesche, D. (2017). How Do Consumers Get to the News?. Statista. Statista Inc.. Accessed: March 21, 2023. https://www.statista.com/chart/10262/selected-gateways-to-digital-news-content/

[5] Fast, E., Chen, B., & Bernstein, M. (2016). Empath: Understanding topic signals in large-scale text. Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, 4647–4657. https://doi.org/10.1145/2858036.2858535

[6] Hamilton, W. L., Ying, R., & Leskovec, J. (2018). Inductive representation learning on large graphs. arXiv. http://arxiv.org/abs/1706.02216