Analysis: Using Data to Dig into Congress Members’ Letter to Reddit

Sentropy Technologies
Published in Sentropy, Jun 16, 2020

The following post is a more detailed version of the snapshot analysis that appeared in the Washington Post’s Technology 202 Newsletter on June 15, 2020.

After years of working in digital communities, we’ve noticed time and time again how language online seeps beyond our screens and into the real world. One such moment came in late May, when Congressional representatives penned a letter to Reddit expressing their frustration after a community on Reddit called r/The_Donald was quarantined. Reddit had taken this action after announcing that the community was violating the site’s content policy. The quarantine meant that r/The_Donald, which is largely made up of President Trump’s supporters, would be omitted from Reddit’s home page and excluded from search results on the site.

The ongoing conversation around r/The_Donald sparked our curiosity about the types of abusive content actually present on the site, and we decided to dig deeper — not just into r/The_Donald, but into subreddits on both sides of the political aisle.

We started by collecting data from liberal-leaning communities: r/bidenbro, r/WayOfTheBern, r/bernie, r/YangGang, r/democrats, and r/Liberal. We called this Group A. Then we collected data from conservative-leaning communities: r/The_Donald, r/DebateAltRight, r/Republican, and r/Conservative. This was Group B.

As a reminder, while our research does look at groups that fall within America’s two-party political system, which tends to lend itself to contrast, the data in and of itself isn’t intended to be divisive.

Our data is just that: data. Not politics.

What we found

After collecting up to one million of the most recent messages from each subreddit in Groups A and B, here is a snapshot of our findings, presented in two ways:

(1) Normalized by subreddit comment volume. Because our samples contain different numbers of comments from each community, we present each abuse type as a percentage of all comments in that subreddit. This normalizes for the uneven distribution of data between subreddits.

Normalized volume of abusive comments (grouped by type of abuse) flagged by Sentropy Detect on left-leaning subreddits (Group A) compared to right-leaning subreddits (Group B).
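To make the normalization concrete, here is a minimal sketch of how per-subreddit percentages could be computed. It is not our production code: the record fields `subreddit` and `labels`, the subreddit names, and the sample data are all hypothetical and exist only for illustration.

```python
from collections import Counter

def abuse_rates(comments):
    """Return {subreddit: {abuse_type: percent of that subreddit's comments}}.

    Each comment is a dict with hypothetical fields:
      "subreddit": the community name
      "labels":    list of abuse types flagged on the comment (may be empty)
    """
    totals = Counter()   # all comments per subreddit (the denominator)
    flagged = {}         # per-subreddit counts of each abuse type
    for comment in comments:
        sub = comment["subreddit"]
        totals[sub] += 1
        for label in comment["labels"]:
            flagged.setdefault(sub, Counter())[label] += 1
    return {
        sub: {
            label: 100.0 * count / totals[sub]
            for label, count in flagged.get(sub, {}).items()
        }
        for sub in totals
    }

# Made-up sample: the two communities have different comment volumes.
sample = [
    {"subreddit": "r/example_a", "labels": ["insult"]},
    {"subreddit": "r/example_a", "labels": []},
    {"subreddit": "r/example_a", "labels": []},
    {"subreddit": "r/example_a", "labels": []},
    {"subreddit": "r/example_b", "labels": ["insult"]},
    {"subreddit": "r/example_b", "labels": ["insult", "identity_attack"]},
]
rates = abuse_rates(sample)
print(rates)
```

Dividing by each subreddit's own comment count is what lets a small community be compared with a large one on the same scale.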

(2) Raw counts. Abusive content impacts users and communities regardless of the concentration in a particular community. A single abusive comment can still have powerful negative impacts even if it’s surrounded by non-abusive comments.

Counts of abusive comments (grouped by type of abuse) flagged by Sentropy Detect on left-leaning subreddits (Group A) compared to right-leaning subreddits (Group B).

Overall, messages posted in communities across both groups exhibit similar rates of abuse: 3.3% for Group A versus 3.5% for Group B. However, the specific types of abusive behavior vary between groups.

In Group A (the liberal communities), a large majority (75%) of the abuse we detected was from our general Insult classifier. Group B (the conservative communities) was a little more varied. Compared with Group A, Group B contained 3 times as much hate speech containing racist, sexist, religious, homophobic, or xenophobic attacks, and 6 times as much hate speech containing white supremacist extremism. r/The_Donald contained 5.5x as much content classified as an identity attack as did Group A, and r/DebateAltRight contained 3.6x as much.

Additionally, we compared how concentrated abusive content was within small pools of user accounts in each group. In general, communities with abuse highly concentrated in a small number of user accounts can be more easily moderated, since fewer users are posting abusive content.

In Group A, the most frequently abusive user in each community accounted for 1.8% of the total abuse detected there, on average. This was nearly 3 times the average concentration in Group B subreddits, which suggests that Group B communities have a larger base of user accounts posting abusive content and likely require more effort to moderate.
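As an illustration of the concentration measure, the sketch below computes the share of flagged comments posted by the single most frequent author in each community, then averages that share across a group. The function names, usernames, and sample data are hypothetical, and this is one plausible reading of the metric rather than the exact calculation behind the figures above.

```python
from collections import Counter

def top_user_share(flagged_authors):
    """Percent of a community's flagged comments posted by its most frequent author."""
    if not flagged_authors:
        return 0.0
    counts = Counter(flagged_authors)
    _, top_count = counts.most_common(1)[0]
    return 100.0 * top_count / len(flagged_authors)

def mean_top_user_share(flagged_by_subreddit):
    """Average the top-user share across every community in a group."""
    shares = [top_user_share(authors) for authors in flagged_by_subreddit.values()]
    return sum(shares) / len(shares)

# Made-up data: one author name per flagged comment.
group = {
    "r/example_x": ["user1", "user1", "user2", "user3"],  # top author posts 2 of 4
    "r/example_y": ["user4", "user5", "user6", "user7"],  # top author posts 1 of 4
}
average_share = mean_top_user_share(group)
print(average_share)
```

A higher average means abuse is concentrated in fewer accounts, which is the situation described above as easier to moderate.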

Learn more about our abuse detection models at sentropy.com.

Appendix: data collection methodology

Our goal was to build two samples of data from Reddit — one from subreddits closely associated with liberal political views and one from subreddits closely associated with conservative political views.

For an unbiased evaluation, we require an equal number of messages from each group. This means each group has approximately the same number of messages, but individual subreddits may be over- or under-represented within a group.

Our methodology was as follows:

  1. Retrieve up to 1 million of the most recent messages for each of the following communities using the Pushshift API:
    Group A: r/bidenbro, r/WayOfTheBern, r/bernie, r/YangGang, r/democrats, r/Liberal
    Group B: r/The_Donald, r/DebateAltRight, r/Republican, r/Conservative
  2. Remove any message where the body field equals [deleted] or [removed]. After filtering, 94.3% of Group A messages remained, and 92.2% of Group B messages remained.
  3. Sample an equal number of messages from each Group, randomly downsampling if required.
  4. Process messages through the Sentropy Detect API, using a 90% confidence threshold (i.e., a message was included in the graphs above if any of Sentropy’s classification models detected a category of abuse with a confidence of at least 90%). All class definitions are based on the definitions utilized by Sentropy’s Detect API.
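Steps 2 through 4 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the code we ran: it makes no calls to the Pushshift or Sentropy Detect APIs, the `body` field name follows step 2, and the `scores` dictionary is a hypothetical stand-in for per-category classifier confidences.

```python
import random

# Placeholder bodies left behind when a message is deleted or removed (step 2).
PLACEHOLDERS = {"[deleted]", "[removed]"}

def drop_placeholders(messages):
    """Step 2: keep only messages whose body still contains real text."""
    return [m for m in messages if m["body"] not in PLACEHOLDERS]

def balance(group_a, group_b, seed=0):
    """Step 3: randomly downsample so both groups have the same size."""
    n = min(len(group_a), len(group_b))
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(group_a, n), rng.sample(group_b, n)

def is_flagged(scores, threshold=0.90):
    """Step 4: flag a message if any category meets the confidence threshold.

    `scores` maps a category name to a confidence in [0, 1]; its shape is
    an assumption made for this sketch.
    """
    return any(score >= threshold for score in scores.values())

# Made-up messages for illustration.
group_a = drop_placeholders([
    {"body": "first comment"}, {"body": "[deleted]"},
    {"body": "second comment"}, {"body": "third comment"},
])
group_b = drop_placeholders([
    {"body": "[removed]"}, {"body": "another comment"}, {"body": "one more"},
])
sample_a, sample_b = balance(group_a, group_b)
print(len(sample_a), len(sample_b))
```

Downsampling to the size of the smaller group is one simple way to satisfy the equal-size requirement; the threshold is inclusive, matching "a confidence of at least 90%".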


We all deserve a better internet. Sentropy helps platforms of every size protect their users and their brands from abuse and malicious content.