Data Wrangling: Missing Engagement Interpolation

Published in Cybersecurity for Democracy · Oct 18, 2023

In our research, we at Cybersecurity for Democracy compare engagement dynamics and virality of Facebook news content across three languages: English, Ukrainian, and Russian. This required us to assemble the most detailed ongoing multilingual collection of content engagement over time available: more than 900 million point-in-time engagement observations for roughly 30 million posts, and counting. In this post, we describe that process, the challenges we found in the data, and our solutions to them.

Data Collection

We ingested data through the CrowdTangle API. API keys are linked to dashboards, which, in our case, monitor a list of pages. Given the size of the pages we monitor, our ingestor can only keep up with the pace of data creation if we split the dashboards into smaller ones and assign each to a separate worker, each daemonized with Supervisor. The collector writes the returned post objects to zipped JSON files stored on disk. While a post is less than one week old, we periodically check whether it is still present or has been removed.
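The write-to-disk step can be sketched roughly as follows. This is a minimal illustration, not our collector: it assumes gzip compression and one JSON object per line, and the function names `write_posts` and `read_posts` are hypothetical.

```python
import gzip
import json


def write_posts(posts, path):
    """Write a list of post dicts to a gzip-compressed JSON Lines file."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for post in posts:
            # ensure_ascii=False keeps Ukrainian and Russian text readable on disk
            fh.write(json.dumps(post, ensure_ascii=False) + "\n")


def read_posts(path):
    """Yield post dicts back from a file written by write_posts."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

Storing one object per line means a later loader can stream a file post by post instead of holding the whole file in memory.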

A different Dockerized process loads the files in batches, parses their content, and writes to the corresponding tables in a PostgreSQL database. Batch size is determined by the available RAM. Separately, the media objects referenced in the post object are crawled and temporarily stored in a GCP bucket. When a post in our collection is deleted, its media object is moved to permanent storage so that we maintain the ability to see the complete post.
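One simple way to size batches against a memory budget is to group files until their combined size would exceed a limit. The sketch below is an assumption about how this could be done, not a description of our loader: `batch_by_size` and the `max_bytes` budget are hypothetical, and file size on disk is only a rough proxy for memory use once compressed JSON is parsed.

```python
def batch_by_size(files, max_bytes):
    """Group (name, size_in_bytes) pairs into batches whose total size stays within max_bytes.

    A single file larger than the budget still gets a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for name, size in files:
        # Close the running batch if adding this file would exceed the budget
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```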

Data Issues

Within each post’s JSON is a history field containing a list of dictionary objects that look like this:

{
  'actual': {
    'likeCount': 3,
    'shareCount': 1,
    'commentCount': 4,
    'loveCount': 0,
    'wowCount': 0,
    'hahaCount': 8,
    'sadCount': 1,
    'angryCount': 3,
    'thankfulCount': 0,
    'careCount': 0,
  },
  'expected': {
    'likeCount': 3,
    'shareCount': 2,
    'commentCount': 2,
    'loveCount': 1,
    'wowCount': 2,
    'hahaCount': 2,
    'sadCount': 3,
    'angryCount': 2,
    'thankfulCount': 0,
    'careCount': 1,
  },
  'timestep': 2,
  'date': '2023-08-20 11:27:44',
  'score': 1.0526315789473684,
}

We encountered several challenges working with the engagement history data. The most serious of these was its incompleteness, which we measured by comparing the highest timestep value to the length of the history list. On average, nearly 20 (σ = 3.4) history observations are missing per post.
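The completeness check can be sketched as below. It assumes that timesteps are consecutive integers starting at 0, which the example object does not confirm, so `first_timestep` is exposed as a parameter; both function names are hypothetical.

```python
def count_missing(history, first_timestep=0):
    """Estimate how many history observations are missing for one post."""
    if not history:
        return 0
    highest = max(obs["timestep"] for obs in history)
    # Number of observations we would expect if no timestep were skipped
    expected = highest - first_timestep + 1
    return max(expected - len(history), 0)


def missing_timesteps(history, first_timestep=0):
    """List the timesteps that never appear in the history."""
    if not history:
        return []
    seen = {obs["timestep"] for obs in history}
    return [t for t in range(first_timestep, max(seen) + 1) if t not in seen]
```

For a history containing timesteps 0, 2, and 5, the first function reports three missing observations and the second returns `[1, 3, 4]`.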

While sizable, the missing history is not insurmountable. We also have an average of 33 (σ = 5.5) history observations per post. This suggests that there are enough data points to accurately interpolate missing ones.

Interpolation

We devised a test to compare two different approaches to interpolation: linear and curve fitting. First, we randomly sampled 10,000 posts from our collection. Twenty-five percent of each post’s history observations were reserved for testing. The remaining were used to train the interpolators.
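A per-post holdout split along these lines might look like the sketch below. It assumes the 25% is drawn uniformly at random from each post's observations; `split_history`, `test_fraction`, and `seed` are illustrative names, not taken from our code.

```python
import random


def split_history(history, test_fraction=0.25, seed=0):
    """Split one post's history into (train, test), preserving the original order."""
    rng = random.Random(seed)
    indices = list(range(len(history)))
    rng.shuffle(indices)
    n_test = round(len(history) * test_fraction)
    test_idx = set(indices[:n_test])
    train = [obs for i, obs in enumerate(history) if i not in test_idx]
    test = [obs for i, obs in enumerate(history) if i in test_idx]
    return train, test
```

With a purely random split, a held-out point can fall before the first or after the last training point, in which case the interpolator has to extrapolate. Keeping the first and last observations in the training set is one way to avoid that.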

For each post, missing data points were either linearly interpolated from the training data or estimated from a curve (linear, sigmoid, logarithmic, or exponential in shape) fit to the training data. The interpolated values were then compared to the observed values in the test set using Mean Absolute Error. This procedure is explained in the figures below.
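The comparison can be illustrated with a small pure-Python sketch. It is not our evaluation code: the curve fitter here covers only the logarithmic shape, y = a + b·ln(1 + t), because that has a closed-form least-squares solution. Sigmoid and exponential shapes would need a nonlinear optimizer such as `scipy.optimize.curve_fit`. All function names are hypothetical, and training timesteps are assumed to be sorted and distinct.

```python
import math
from bisect import bisect_left


def linear_interpolate(train_t, train_y, t):
    """Piecewise-linear estimate at t; clamps to the end values outside the training range."""
    if t <= train_t[0]:
        return train_y[0]
    if t >= train_t[-1]:
        return train_y[-1]
    hi = bisect_left(train_t, t)  # first training timestep >= t
    lo = hi - 1
    frac = (t - train_t[lo]) / (train_t[hi] - train_t[lo])
    return train_y[lo] + frac * (train_y[hi] - train_y[lo])


def fit_log_curve(train_t, train_y):
    """Least-squares fit of y = a + b * ln(1 + t); returns a prediction function."""
    xs = [math.log1p(t) for t in train_t]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(train_y) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, train_y))
    b = sxy / sxx if sxx else 0.0
    a = mean_y - b * mean_x
    return lambda t: a + b * math.log1p(t)


def mean_absolute_error(predicted, actual):
    """Average absolute difference between predictions and observed values."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
```

To score one post, predict each held-out timestep with both methods and pass the predictions and the held-out counts to `mean_absolute_error`. Averaging those per-post errors across the sample gives a single figure per method.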

The outcome of this process applied to random rows from our data is shown below for each approach, where the continuous function represents the interpolation.

Predictions vs Actual Examples: Linear Interpolation

Predictions vs Actual Examples: Curve Fitting Interpolation

Across the 10,000-row data sample, the linear interpolator had an average Mean Absolute Error of 0.28, compared to the curve fitter’s 1.74, roughly 6x higher.

The code for reproducing this can be found here.

About NYU Cybersecurity for Democracy

Cybersecurity for Democracy is a research-based, nonpartisan, and independent effort to expose online threats to our social fabric — and recommend how to counter them. It is a part of the Center for Cybersecurity at the NYU Tandon School of Engineering.

Would you like more information on our work? Visit Cybersecurity for Democracy online and see how tools, data, investigations, and analysis are fueling efforts toward platform accountability.

Written by Austin Botelho