Post 2 of 3: First-Ever Social Experiment vs. Gender Harassment on Twitter

Empathization Website

by Derek S. Chan & Shruti Deshpande

Our views in this series don’t necessarily reflect those of the people thanked below.

Thanks to Lucas Dixon (Chief Scientist at Jigsaw) for leading efforts against online toxic language and discussing with us our write-up of this series.

Thanks to faculty at UC Berkeley’s Master of Information & Data Science (MIDS) program — Joyce Shen, Alberto Todeschini, D. Alex Hughes, and Dan Gillick — and the school’s Dean, Anno Saxenian, for their support.

Thanks to the women who granted anonymized interviews, sharing their experiences with online gender harassment and informing our efforts.


Warning: Given the nature of the problem we’re trying to address, this blog post contains a screenshot of tweets with offensive and violent language.

Since gender harassment is prevalent across social media platforms (Pew Research Center, 2017; Norton, 2016; Women, Action, & the Media, 2015), why did we focus on Twitter? We chose Twitter because its data was the most accessible within our timeframe. Twitter deserves credit for granting such access to the public.

A number of women who have been harassed on Twitter granted us interviews and gave feedback on the potential value of social experiments: What if Twitter bots automatically replied to users with the aim of discouraging harassing behavior? Informed by these interviewees, we conducted small-scale preliminary studies on Twitter to explore and refine that approach. Finally, we ran a full social experiment from 7/27/2017 to 9/3/2017.

For the full social experiment, we created 4 Twitter bots — 2 disguised as human — to automatically detect misogynistic tweets and respond to the users who had sent these tweets. Below is our simplified, past timeline.

To explain part of the timeline further, we randomly assigned users at two stages. At the first stage, we randomly assigned users to 1 of the 4 bots and then collected their real-time tweets for roughly 2 weeks. At the second stage, as soon as a user sent a tweet that was automatically flagged as misogynistic, the user was automatically and randomly assigned to one of two types of groups. The first type (control group) wouldn’t receive an automated reply, reflecting the scenario as if the experiment had never existed. The second type (treatment group) would receive an automated reply in 30 seconds, intended to reduce their harassing behavior. We then collected users’ real-time tweets for roughly another 2 weeks.
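For readers who think in code, here is a minimal sketch of that two-stage random assignment. It is an illustration, not our production code: the function names, the user IDs, and the 50/50 split between control and treatment are all assumptions made for the example.

```python
import random

BOTS = ["bot1", "bot2", "bot3", "bot4"]


def assign_bot(rng):
    """Stage 1: randomly assign a user to 1 of the 4 bots."""
    return rng.choice(BOTS)


def assign_group(rng, p_treatment=0.5):
    """Stage 2: randomly assign a flagged user to control or treatment.

    p_treatment=0.5 is an assumed split used only for illustration.
    """
    return "treatment" if rng.random() < p_treatment else "control"


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical user IDs
users = [f"user{i}" for i in range(8)]

# Stage 1 happens up front for every tracked user.
stage1 = {user: assign_bot(rng) for user in users}

# Stage 2 happens lazily, the first time a user's tweet is flagged.
stage2 = {}


def on_flagged_tweet(user):
    """Return the user's group, assigning it on their first flagged tweet."""
    if user not in stage2:
        stage2[user] = assign_group(rng)
    return stage2[user]


first = on_flagged_tweet("user3")
again = on_flagged_tweet("user3")  # same group: assignment happens only once
```

The point of the second function is that a user’s group is fixed at their first flagged tweet, so later flagged tweets never move them between control and treatment.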

Kevin Munger (2016) from NYU conducted an inspirational social experiment to address racist online harassment. Our approach builds on his but differs in key ways.

  • Months before our full social experiment, we created a dataset of 18.8K tweets labeled either misogynistic or not. We hired women through Amazon Mechanical Turk (MTurk) to determine whether 2.5K of the tweets constituted misogyny.
  • With that dataset, we created AI algorithms to detect misogynistic tweets automatically. Of all the tweets it classifies as misogynistic, our set of algorithms is correct about 78% of the time (its precision). And of all the misogynistic tweets it is exposed to, our set of algorithms detects about 34.5% (its recall).
  • We relied on automation to detect, randomly assign, and respond to Twitter harassers.
  • We measured not the number but the percent of users’ misogynistic tweets before vs. after our bots intervened. Why? A trend based on a user’s number of misogynistic tweets can be misleading. For instance, a user’s number can decrease from 15 misogynistic tweets last month to 13 this month, yet their percent of misogynistic tweets can increase from 15% (15 out of 100) last month to 50% (13 out of 26) this month.
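The count-versus-percent example in the last bullet can be checked with a few lines of arithmetic. This is only a sketch of the calculation; the function name is ours, made up for the illustration.

```python
def pct_misogynistic(flagged, total):
    """Percent of a user's tweets that were flagged as misogynistic."""
    return 100.0 * flagged / total


# Numbers from the example above
last_month = pct_misogynistic(15, 100)  # 15 flagged out of 100 tweets
this_month = pct_misogynistic(13, 26)   # 13 flagged out of 26 tweets

count_went_down = 13 < 15
percent_went_up = this_month > last_month

print(f"last month: {last_month:.0f}%, this month: {this_month:.0f}%")
```

The count falls from 15 to 13 while the percent rises from 15% to 50%, which is why we tracked the percent.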

The following table shows the number of Twitter harassers tracked during our 5-week study.

Treatment Bot #1 and Treatment Bot #2 were disguised to look human and had the same photo.

But their automated Twitter replies to harassers differed, which allowed us to explore whether the type of reply mattered. Treatment Bot #1’s reply reflected an injunctive norm (behavior that is typically not accepted): “Hey there’s no need to use offensive language here.” Treatment Bot #2’s reply, by contrast, reflected a descriptive norm (how people typically behave): “Over 83% of your fellow Twitter users don’t tweet such offensive language.”

Also shown in the image above, Treatment Bots #3 and #4 had the same photo as each other and identified themselves as bots via “Bot” or “bot” in their usernames and descriptions, respectively. That allowed us to compare which of the 4 bots might be more impactful.

Each treatment group was compared to a control group: Twitter harassers who didn’t receive any messages from the treatment bots. To reiterate, the control group reflects the scenario as if the experiment had never existed.

As you view the results below, note that all groups, including the control group, had a slightly lower percent of tweets detected as misogynistic in the weeks after 8/11/2017. The high-profile Charlottesville, VA, USA event tied to racism and nationalism occurred 8/11/2017, but didn’t seem to escalate these groups’ misogyny on Twitter.

The horizontal trend lines show the change in percent of tweets detected as misogynistic before vs. after 8/11/2017, the earliest date when treatment bots started to reply to harassers. The horizontal trend lines of Treatment Bots #1 and #2 aren’t statistically different from that of the control group. That is, any difference among them could plausibly be due to random chance. [Note: The vertical bars reflect the likely potential variation in percent of misogynistic tweets that could have occurred if we had replicated the experiment.]*

As with Treatment Bots #1 and #2 above, the horizontal trend lines of Treatment Bots #3 and #4 in the image below aren’t statistically different from that of the control group. And even though Treatment Bot #4 has a lower horizontal line than the control group’s, any difference between them could plausibly be due to random chance.*

While our experiment shows no statistically significant impact on misogynistic behavior, value still exists:

As we attempted above, imagine if social media companies also publicly quantified the impact, or lack of impact, of their anti-harassment initiatives. Would that increase the urgency to find an initiative that works? And if an initiative works, would a company implement it at scale if the consequence were far more suspended users?

Importantly, such an experiment also sheds light on ethical issues. Notable concerns existed on our end, especially with bots disguised as human, for a few reasons.

  • Our bots aimed to fool users. Though we have a social responsibility to combat a pervasive issue, our approach needs to be careful (e.g., Gandhi taught “means are ends in the making”).
  • Among tweets it predicted as misogynistic, our set of algorithms is accurate about 78% of the time. On the flip side, that also means our set of algorithms incorrectly responded to users about 22% of the time. And that isn’t fair to these users.
  • While our bots received support from some users, our bots also incited frustration in other users.

Though we contemplated a bot that displayed the Twitter company logo and thought it might be effective, we concluded that it would have violated Twitter rules and didn’t proceed. However, unlike us, Twitter can run a bot (not disguised as human) with its company logo, reflecting the authority to suspend users.

What other efforts can social media platforms such as Twitter try related to our work? Check out our next blog post.


*Endnotes: We try to present blog posts in everyday language to communicate with a wider audience. For readers who prefer brief technical language, please see the optional notes below.

  • Graphs: We show non-regression, weighted means. The vertical bars represent 95% confidence intervals. For each group’s weighted mean, the standard error was computed from a bootstrapped sampling distribution of 200 weighted means. And the “Post - Pre” value, [e.g., “-0.04 (0.02)”], is a weighted mean, followed by a standard error in parentheses.
  • Weights: We weighted each user by their number of overall tweets sent within the study. Why? Percent of misogynistic tweets would be inflated if, for example, a person with 5 overall tweets (1 misogynistic out of 5 overall tweets) were weighted the same as a person with 50 overall tweets (10 misogynistic out of 50 overall tweets).
  • Model: We used weighted least squares rather than difference-in-differences regression to estimate the social experiment impact, as weighted least squares regression offers a slightly more flexible functional form. It allows the coefficient on the pre-treatment percent of misogynistic tweets to differ from 1.0. It also allows straightforward weighting (i.e., weighting each user by number of overall tweets for more reliability).
  • Equation: post-treatment percent of misogynistic tweets = intercept + b1 × (pre-treatment percent of misogynistic tweets) + b2 × treatment_bot1 + b3 × treatment_bot2 + b4 × treatment_bot3 + b5 × treatment_bot4, where b1 through b5 are the estimated coefficients
  • R-squared: 0.530, Adjusted R-squared: 0.528
  • Distribution: The dependent variable (post-treatment percent of misogynistic tweets) is skewed rather than normally distributed. However, by the Central Limit Theorem, as the sample becomes large the sampling distribution of the estimates approaches a normal distribution, so the regression coefficients will be approximately normally distributed even if the dependent variable isn’t.
  • Outliers: Since some Twitter “users” are bots with high tweet activity, we researched several methods for outlier removal: standard deviation, interquartile range, log transformation, median absolute deviation, and top 5% trimming. However, because it’s best to keep all observations unless clear evidence about a specific observation shows otherwise, we proceeded without outlier removal. In general, discretionary outlier removal is associated with p-hacking, a controversial practice.
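For readers who want to see the Graphs and Weights notes in code, here is a minimal sketch of a tweet-count-weighted mean with a bootstrapped standard error from 200 resamples. The data are hypothetical, and resampling users with replacement is an assumption made for the illustration; it is not a copy of our analysis code.

```python
import random
import statistics


def weighted_mean(values, weights):
    """Mean of values, weighting each user by their overall tweet count."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def bootstrap_se(values, weights, n_boot=200, seed=0):
    """Standard error from a bootstrapped distribution of weighted means.

    Each resample draws users with replacement (an assumed scheme),
    then recomputes the weighted mean.
    """
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        means.append(
            weighted_mean([values[i] for i in idx], [weights[i] for i in idx])
        )
    return statistics.stdev(means)


# Hypothetical users: "Post - Pre" change in share of flagged tweets,
# and each user's overall number of tweets in the study.
changes = [-0.10, 0.02, -0.05, 0.00, -0.03, 0.04]
tweet_counts = [5, 50, 120, 30, 80, 10]

estimate = weighted_mean(changes, tweet_counts)
se = bootstrap_se(changes, tweet_counts)

# 95% confidence interval using the normal approximation
ci = (estimate - 1.96 * se, estimate + 1.96 * se)
print(f"Post - Pre: {estimate:.3f} ({se:.3f})")
```

With weights, a user who tweeted 120 times moves the group mean far more than a user who tweeted 5 times, which is the reliability argument made in the Weights note.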