Empathization Website
  • Months before our full social experiment, we created a dataset of 18.8K tweets labeled either misogynistic or not. We hired women through Amazon Mechanical Turk (MTurk) to judge whether 2.5K of the tweets constituted misogyny.
  • With that dataset, we created AI algorithms to detect misogynistic tweets automatically. Of all the tweets our set of algorithms classifies as misogynistic, about 78% really are misogynistic (its precision). And of all the misogynistic tweets it is exposed to, our set of algorithms detects about 34.5% (its recall).
  • We relied on automation to detect, randomly assign, and respond to Twitter harassers.
  • We measured not the number but the percent of each user’s tweets that were misogynistic, before vs. after our bots intervened. Why? A trend based on a user’s number of misogynistic tweets can be misleading. For instance, a user’s number can decrease from 15 misogynistic tweets last month to 13 this month, yet their percent of misogynistic tweets can increase from 15% (15 out of 100) last month to 50% (13 out of 26) this month.
  • Our bots aimed to fool users. Though we have a social responsibility to combat a pervasive issue, our approach needs to be careful (as Gandhi taught, “means are ends in the making”).
  • Among the tweets it predicted as misogynistic, our set of algorithms was correct about 78% of the time. On the flip side, that means about 22% of the tweets our bots responded to were not actually misogynistic. And that isn’t fair to those users.
  • While some users supported our bots, other users were frustrated by them.
  • Graphs: We show weighted means computed directly from the data, not from a regression. The vertical bars represent 95% confidence intervals. For each group’s weighted mean, the standard error was computed from a bootstrapped sampling distribution of 200 weighted means. The “Post - Pre” value (e.g., “-0.04 (0.02)”) is a weighted mean, followed by its standard error in parentheses.
  • Weights: We weighted each user by their number of overall tweets sent within the study. Why? Percent of misogynistic tweets would be distorted if, for example, a person with 5 overall tweets (1 misogynistic out of 5 overall tweets) were weighted the same as a person with 50 overall tweets (10 misogynistic out of 50 overall tweets), because a percent based on only 5 tweets is far less reliable.
  • Model: We used weighted least squares regression rather than difference-in-differences regression to estimate the impact of the social experiment, as weighted least squares is a slightly more flexible functional form. It allows the coefficient on the pre-treatment percent of misogynistic tweets to differ from 1.0, whereas a difference-in-differences on the “Post - Pre” change implicitly fixes that coefficient at 1.0. It also allows straightforward weighting (i.e., weighting each user by number of overall tweets for more reliability).
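  • Equation: post-treatment percent of misogynistic tweets = β0 + β1 × (pre-treatment percent of misogynistic tweets) + β2 × treatment_bot1 + β3 × treatment_bot2 + β4 × treatment_bot3 + β5 × treatment_bot4, where β0 is the intercept and each treatment_bot term indicates assignment to that bot.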
  • R-squared: 0.530, Adjusted R-squared: 0.528
  • Distribution: The dependent variable (post-treatment percent of misogynistic tweets) is skewed rather than normally distributed. However, the Central Limit Theorem says that as the sample becomes large, the sampling distribution of the estimates approaches a normal distribution, so the regression coefficients will be approximately normally distributed even if the dependent variable isn’t.
  • Outliers: Since some Twitter “users” are bots with high tweet activity, we researched several methods for outlier removal: standard deviation, interquartile range, log transformation, median absolute deviation, and top 5% trimming. However, because it’s best to keep all observations unless there is clear evidence against a specific observation, we proceeded without outlier removal. More generally, discretionary outlier removal is controversial because it can become a form of p-hacking.
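
The two classifier figures quoted above (about 78% and about 34.5%) correspond to what is usually called precision and recall. Here is a minimal sketch of how the two differ. The counts `tp`, `fp`, and `fn` are hypothetical, chosen only so the arithmetic lands near the reported rates; they are not the study's actual confusion matrix.

```python
def precision(tp, fp):
    """Share of tweets flagged as misogynistic that really are misogynistic."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of all misogynistic tweets that the classifier manages to flag."""
    return tp / (tp + fn)


# HYPOTHETICAL counts for illustration only (not the study's data):
tp = 69   # flagged and truly misogynistic
fp = 19   # flagged but not misogynistic
fn = 131  # misogynistic but missed

p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision = {p:.1%}, recall = {r:.1%}")
```

A high-precision, low-recall classifier like this one rarely responds to the wrong tweet but misses most misogynistic tweets.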
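
The number-versus-percent example above can be checked with a few lines of arithmetic. This sketch uses the figures from that example (15 of 100 tweets, then 13 of 26); the function name `misogynistic_share` is ours, for illustration.

```python
def misogynistic_share(misogynistic, total):
    """Fraction of a user's tweets that are misogynistic."""
    if total == 0:
        raise ValueError("user sent no tweets in this period")
    return misogynistic / total


last_month = (15, 100)  # (misogynistic tweets, overall tweets)
this_month = (13, 26)

count_change = this_month[0] - last_month[0]
share_change = misogynistic_share(*this_month) - misogynistic_share(*last_month)

print(f"change in number:  {count_change:+d}")
print(f"change in percent: {share_change:+.0%}")
```

The number falls while the percent rises, because the user's overall tweeting dropped much faster than their misogynistic tweeting.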
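
The graph notes above describe a weighted mean whose standard error comes from 200 bootstrapped weighted means. A minimal sketch of that idea follows. The per-user values and tweet counts are made up for illustration, resampling whole users with replacement is our assumption about the resampling unit, and the normal-approximation interval (estimate ± 1.96 × SE) is likewise an assumption rather than the confirmed method behind the plotted bars.

```python
import random
import statistics


def weighted_mean(values, weights):
    """Mean of values, each weighted by the user's overall tweet count."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def bootstrap_se(values, weights, n_boot=200, seed=0):
    """Standard deviation of n_boot weighted means, each computed on a
    resample of users drawn with replacement."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        means.append(weighted_mean([values[i] for i in idx],
                                   [weights[i] for i in idx]))
    return statistics.stdev(means)


# HYPOTHETICAL per-user "Post - Pre" changes and overall tweet counts.
changes = [-0.10, 0.02, -0.05, 0.00, -0.08, 0.04]
tweets = [50, 5, 120, 12, 30, 8]

estimate = weighted_mean(changes, tweets)
se = bootstrap_se(changes, tweets)
low, high = estimate - 1.96 * se, estimate + 1.96 * se
print(f"{estimate:.2f} ({se:.2f}), 95% CI [{low:.2f}, {high:.2f}]")
```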
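
The model and equation above can be sketched as a weighted least squares fit that solves the normal equations (XᵀWX)β = XᵀWy, with each user weighted by their overall tweet count. Everything below is illustrative: the users are simulated without noise from coefficients we picked ourselves (`true_coefs`), purely to check that the fit recovers them. Treating arm 0 as a control group with no bot is also an assumption. None of this is the study's data, code, or results.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def wls(rows, y, w):
    """Weighted least squares coefficients for design rows, outcome y, weights w."""
    k = len(rows[0])
    xtwx = [[sum(wi * r[a] * r[b] for r, wi in zip(rows, w)) for b in range(k)]
            for a in range(k)]
    xtwy = [sum(wi * r[a] * yi for r, yi, wi in zip(rows, y, w)) for a in range(k)]
    return solve(xtwx, xtwy)


def design_row(pre, arm):
    """[intercept, pre-treatment percent, bot1..bot4 indicators]; arm 0 = control."""
    return [1.0, pre] + [1.0 if arm == k else 0.0 for k in range(1, 5)]


# HYPOTHETICAL coefficients: intercept, pre-treatment slope, four bot effects.
true_coefs = [0.02, 0.8, -0.03, -0.01, 0.00, -0.02]

# SIMULATED users: (pre-treatment percent, assigned arm, overall tweets).
users = [(0.10, 0, 40), (0.30, 0, 15), (0.20, 1, 60), (0.50, 1, 8),
         (0.15, 2, 25), (0.40, 2, 100), (0.05, 3, 12), (0.35, 3, 30),
         (0.25, 4, 55), (0.45, 4, 20)]

rows = [design_row(pre, arm) for pre, arm, _ in users]
post = [sum(b * x for b, x in zip(true_coefs, row)) for row in rows]
weights = [n for _, _, n in users]

coefs = wls(rows, post, weights)
names = ["intercept", "pre", "bot1", "bot2", "bot3", "bot4"]
for name, value in zip(names, coefs):
    print(f"{name:>9}: {value:+.3f}")
```

Note that the fitted slope on the pre-treatment percent is free to differ from 1.0 (0.8 in this simulation), which is the flexibility the model note refers to.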





Mitigating online gender harassment through 1) user feedback, 2) empathetic innovation, and 3) data-science products