Predicting Virality: How soon can we tell?

Austin Botelho
Cybersecurity for Democracy
Nov 2, 2023

Introduction

The ability to predict viral content has clear ramifications for content moderation and harm mitigation. Under the resource constraints that most Trust and Safety teams face, knowing which adverse content (e.g., hate, harassment, misinformation) will garner the most engagement enables more efficient allocation of time and effort to address it. We found that by looking at engagement alone, without considering post or account features, we can predict whether a post’s final engagement will be in the top 1% with a binary F1 of 0.8 by hours 13–17.

Methodology

Data

The dataset for this analysis consists of ~2.7 million Facebook posts from English-, Ukrainian-, and Russian-language news pages, posted between June 15 and July 15, 2023. We kept only posts with a minimum of 10 engagement history observations over two days of observation.¹

Feature Creation

For each post, we linearly interpolated the engagement history observations reported by the CrowdTangle API to obtain a total engagement value at every 30-minute interval after creation, up to the 48-hour mark. We also computed rolling averages of the engagement’s velocity and acceleration over time using a 3-hour window.
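A minimal sketch of this resampling step for a single post, assuming NumPy. The function names, the dictionary keys, the use of first differences for velocity and acceleration, and the trailing (rather than centered) window are illustrative assumptions, not details given in this post.

```python
import numpy as np

def trailing_mean(values, window):
    """Mean over the current and previous (window - 1) points; shorter at the start."""
    csum = np.cumsum(np.insert(values, 0, 0.0))
    idx = np.arange(1, len(values) + 1)
    start = np.maximum(idx - window, 0)
    return (csum[idx] - csum[start]) / (idx - start)

def engagement_features(obs_hours, obs_engagement,
                        step_hours=0.5, horizon_hours=48.0, window_hours=3.0):
    """Resample one post's engagement history and derive smoothed velocity/acceleration."""
    # Regular grid: 0.5 h, 1.0 h, ..., 48 h after creation
    grid = np.arange(step_hours, horizon_hours + step_hours / 2, step_hours)
    # Linear interpolation; outside the observed range np.interp holds the
    # first/last observed value constant
    engagement = np.interp(grid, obs_hours, obs_engagement)
    # Assumption: velocity = change in engagement per hour between grid points,
    # acceleration = change in velocity per hour
    velocity = np.diff(engagement, prepend=engagement[0]) / step_hours
    acceleration = np.diff(velocity, prepend=velocity[0]) / step_hours
    # 3-hour window = 6 half-hour steps
    window = int(round(window_hours / step_hours))
    return {
        "hours": grid,
        "engagement": engagement,
        "velocity": trailing_mean(velocity, window),
        "acceleration": trailing_mean(acceleration, window),
    }

# Hypothetical post observed at creation and at 48 h, gaining 10 engagements per hour
feats = engagement_features([0.0, 48.0], [0.0, 480.0])
```

Each returned array has 96 entries, one per half-hour step.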

The output variable is a binary indicator of whether a post ultimately went viral as determined by the percentile of its last observed engagement. Posts in the top 1% of final engagement were deemed viral.
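The labeling rule can be sketched as below, assuming NumPy. The function name and the strict "greater than the 99th percentile" cutoff are assumptions; the post does not say how ties at the boundary were handled.

```python
import numpy as np

def label_viral(final_engagement, top_fraction=0.01):
    """Flag posts whose last observed engagement falls in the top `top_fraction`."""
    final_engagement = np.asarray(final_engagement, dtype=float)
    # Engagement value at the (1 - top_fraction) quantile, e.g. the 99th percentile
    cutoff = np.quantile(final_engagement, 1.0 - top_fraction)
    # Assumption: posts strictly above the cutoff count as viral
    return final_engagement > cutoff, cutoff

# Hypothetical final engagement counts for 1,000 posts
labels, cutoff = label_viral(list(range(1, 1001)))
```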

Model

At each point in time, we aim to find a decision boundary that optimally splits the data, maximizing the F1 score, based on each feature (engagement, velocity, acceleration) both independently and collectively. For velocity and acceleration, we consider the maximum reached up to the time being evaluated.
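To make the idea of an F1-maximizing boundary on a single feature concrete, here is a brute-force threshold search, assuming NumPy. This is not the Gradient Boosting model used in this analysis; it is only an illustration, and all function names are hypothetical. The `running_max` helper shows the "maximum reached up to the time being evaluated" transformation applied to velocity and acceleration.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

def best_threshold(feature, y_true):
    """Try every observed value as a cutoff and keep the one with the highest F1."""
    feature = np.asarray(feature, dtype=float)
    y_true = np.asarray(y_true, dtype=bool)
    best_cut, best_f1 = None, 0.0
    for cut in np.unique(feature):
        score = f1_binary(y_true, feature >= cut)
        if score > best_f1:
            best_cut, best_f1 = float(cut), score
    return best_cut, best_f1

def running_max(series):
    """Largest value seen so far at each time step."""
    return np.maximum.accumulate(np.asarray(series, dtype=float))

# Hypothetical engagement values at one time step, with viral labels
cut, score = best_threshold([1, 2, 3, 10, 12], [0, 0, 0, 1, 1])
```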

We achieve this with Gradient Boosting, an ensemble approach in which each subsequent predictor (a decision tree) is fit on the prior’s residuals (prediction error). Each predictor minimizes the log loss (also known as cross-entropy). Training used a random split of 80% of the data, leaving the remaining 20% for testing.

We control several parameters in the training process. We set num_estimators, which controls the number of decision trees in the ensemble, to 40, and learning_rate, which controls the pace of learning from tree to tree, to 0.1. Setting num_estimators too high increases the likelihood of overfitting, while setting it too low prevents the model from extracting meaningful insights. We also set max_depth to 1 in the single-feature models and 2 in the multi-feature model.
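A sketch of a comparable setup using scikit-learn's GradientBoostingClassifier. The choice of library is an assumption, since the post does not name one; in scikit-learn the tree-count parameter is called n_estimators, and log loss is the classifier's default objective. The data below is synthetic and made up purely so the example runs; it does not reproduce the results reported here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def fit_virality_model(X, y, max_depth=1, seed=0):
    """Fit a boosted-tree classifier on an 80/20 random split and report test F1."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = GradientBoostingClassifier(
        n_estimators=40,      # number of trees, as described in the text
        learning_rate=0.1,    # contribution of each successive tree
        max_depth=max_depth,  # 1 for single-feature models, 2 for multi-feature
        random_state=seed,
    )
    model.fit(X_train, y_train)
    # Binary F1 on the held-out 20%
    return model, f1_score(y_test, model.predict(X_test))

# Synthetic stand-in data (assumption): early engagement loosely drives final engagement
rng = np.random.default_rng(0)
early = rng.lognormal(mean=3.0, sigma=1.5, size=5000)
final = early * rng.lognormal(mean=0.5, sigma=0.3, size=5000)
X = early.reshape(-1, 1)                  # single feature: engagement at one time step
y = final > np.quantile(final, 0.99)      # top 1% of final engagement

model, test_f1 = fit_virality_model(X, y, max_depth=1)
```

For the multi-feature model, X would instead hold one column each for engagement, maximum velocity so far, and maximum acceleration so far, with max_depth=2.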

Results

Plotting the binary F1 score over time reveals that momentary engagement is always the best predictor of virality, followed by velocity, then acceleration. The predictiveness of velocity and acceleration levels off quickly, likely because their maximums occur early in a post’s lifecycle, whereas engagement’s grows rather consistently over time, from a starting F1 of 0.4–0.5 to crossing 0.8 by 13–17 hours depending on the language. Velocity and acceleration never cross 0.75. Ensembling the features provides a slight boost over using engagement alone, moving the time at which F1 crosses 0.8 earlier by up to 3 hours.

Footnotes

  1. For posts where we have engagement measurements beyond two days after creation, the two-day mark represents 80.1%–83.1% of lifetime engagement, depending on language.

About NYU Cybersecurity for Democracy

Cybersecurity for Democracy is a research-based, nonpartisan, and independent effort to expose online threats to our social fabric — and recommend how to counter them. It is a part of the Center for Cybersecurity at the NYU Tandon School of Engineering.

Would you like more information on our work? Visit Cybersecurity for Democracy online and see how tools, data, investigations, and analysis are fueling efforts toward platform accountability.
