Case Study: ML Hate Speech Moderation of Twitter (X)
Industry Problem
The problem I am looking to solve is the prevalence of hate speech on Twitter (X). Hate speech can include discrimination, language that incites violence, and prejudice against groups of people (based on religion, gender, sexual orientation, etc.).
Why use ML/AI to solve hate speech
- Scalability: Twitter generates millions of pieces of content each day, far more than any manual process can cover
- Efficiency: It is not realistic for human moderators alone to review this volume of content, and ML/AI can detect harmful content before it gets audience impressions
- Consistency: Human judgment, especially on a topic like hate speech, can be incredibly inconsistent
Data Bias Risks
With over 500 million tweets sent each day, Twitter content is very likely to contain high degrees of bias. To mitigate bias in the data:
- Build contextual understanding to catch the nuances of slang and coded language
- Account for time-sensitive events that trigger hate speech but may not appear in the training data (see the sampling sketch below)
- Seek a diverse group of people to create the training data labeling guidelines
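To make the time-sensitivity point concrete, here is a minimal sketch of stratifying the labeled training sample across time windows and topic slices so that recent, event-driven language is represented. The DataFrame columns (`text`, `topic`, `week`) and the sample size are assumptions for illustration, not Twitter's actual schema.

```python
# Minimal sketch: stratified sampling of training tweets across (week, topic)
# strata so that no single period or community dominates the training set.
# Column names and sample sizes are illustrative assumptions.
import pandas as pd

def stratified_sample(tweets: pd.DataFrame, per_stratum: int = 500) -> pd.DataFrame:
    """Draw an equal-sized sample from every (week, topic) stratum."""
    return (
        tweets
        .groupby(["week", "topic"], group_keys=False)
        .apply(lambda g: g.sample(n=min(per_stratum, len(g)), random_state=42))
        .reset_index(drop=True)
    )
```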
Model Built Internally vs. Externally
Avoid External Training Platforms
While faster to deploy, external platforms may be unable to scale to cover 500M tweets daily, and a metered plan could become extremely expensive at that volume of data.
Benefits of In-House Model Training
Building models in-house allows Twitter to maintain control over its data, and there are potential long-run cost savings on compute (see the illustrative comparison below).
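As a rough framing of the cost trade-off, the sketch below compares a metered vendor plan against in-house inference at 500M tweets per day. Every price and throughput figure is an assumed placeholder for illustration, not a real quote.

```python
# Back-of-envelope cost comparison; all unit costs and throughput numbers are
# assumed placeholders, not real vendor or cloud pricing.
DAILY_TWEETS = 500_000_000

vendor_price_per_1k = 0.05       # assumed metered price per 1,000 classifications (USD)
inhouse_gpu_hour = 2.50          # assumed hourly cost of one inference GPU (USD)
tweets_per_gpu_hour = 3_600_000  # assumed throughput (~1,000 tweets/sec per GPU)

vendor_daily = DAILY_TWEETS / 1_000 * vendor_price_per_1k
inhouse_daily = DAILY_TWEETS / tweets_per_gpu_hour * inhouse_gpu_hour

print(f"Vendor (metered):   ${vendor_daily:,.0f}/day")
print(f"In-house inference: ${inhouse_daily:,.0f}/day")
```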
Outsourced Training Data Annotation
There will be a large volume of data labeling needed. Given the sensitive nature of protected tweets and of viewing disturbing content, I would suggest partnering with a vendor who can provide a large, standardized workforce, with labeling quality monitored as sketched below.
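One way to keep an outsourced workforce standardized is to track inter-annotator agreement against an internal golden set. This sketch uses Cohen's kappa with hypothetical labels; the threshold is an assumption.

```python
# Sketch: checking vendor labeling quality with inter-annotator agreement.
# The label lists are hypothetical; in practice they would come from
# overlapping assignments between the vendor workforce and an internal
# golden set reviewed by trust-and-safety specialists.
from sklearn.metrics import cohen_kappa_score

vendor_labels   = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]  # 1 = hate speech, 0 = benign
internal_labels = [1, 0, 0, 1, 0, 0, 1, 0, 1, 1]

kappa = cohen_kappa_score(vendor_labels, internal_labels)
print(f"Cohen's kappa: {kappa:.2f}")  # flag labeling batches below an agreed threshold, e.g. 0.7
```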
Evaluating Model Results
I would evaluate the model with a combination of precision and recall.
- False negatives indicate under-tagging of violating content. I would want the model to maintain 95%+ recall; this stance assumes that hate speech content is very harmful.
- False positives indicate a poor user experience because the model is over-tagging for violations. I would set a precision guardrail of 80%+ because users have access to an appeals process to recover their accounts.
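A minimal sketch of checking those guardrails on a held-out, human-labeled test set, assuming scikit-learn and toy model outputs in place of real data:

```python
# Sketch of the evaluation guardrails described above: recall >= 0.95 and
# precision >= 0.80 on a held-out, human-labeled test set. The arrays are
# toy stand-ins for real labels and model decisions.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1]  # 1 = hate speech per human reviewers
y_pred = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]  # model decisions at the chosen threshold

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

meets_guardrails = recall >= 0.95 and precision >= 0.80
print(f"recall={recall:.2f} (target 0.95), precision={precision:.2f} (guardrail 0.80), ship={meets_guardrails}")
```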
MVP Design
Deployment Plan
I would take a phased rollout approach to monitor performance and mitigate the risks of widespread mistakes in moderating hate speech. I would prioritize English first, since the development team is most familiar with it and can spot-check results. Additionally, I would provide customer communications about how and why we use ML to moderate content and the remediation paths available to appeal if the models make mistakes. A rollout-gating sketch follows the wave list below.
- Wave 1: 5% of Twitter accounts
- Wave 2: 50% of Twitter accounts
- Wave 3: 100% of Twitter accounts
- Wave 4: Non-English Localization
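A minimal sketch of the rollout gate, assuming a deterministic hash of account IDs into percentage buckets; this is an illustrative scheme, not Twitter's actual experimentation framework.

```python
# Sketch: an account enters the new moderation pipeline once its hash bucket
# falls under the current wave's percentage. Wave percentages mirror the plan
# above; the hashing scheme is an assumption for illustration.
import hashlib

WAVE_ROLLOUT = {"wave_1": 5, "wave_2": 50, "wave_3": 100}

def in_rollout(account_id: str, wave: str, language: str = "en") -> bool:
    if wave != "wave_4" and language != "en":
        return False  # non-English localization is deferred to Wave 4
    pct = 100 if wave == "wave_4" else WAVE_ROLLOUT[wave]
    bucket = int(hashlib.sha256(account_id.encode()).hexdigest(), 16) % 100
    return bucket < pct

print(in_rollout("12345", "wave_1"))  # stable per-account decision across waves
```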
Go-To-Market Plan
- Pricing Strategy: This should be accessible to ALL users as part of the standard product experience. It doesn't make sense to exempt anyone from hate-speech moderation based on what they pay.
- Distribution Plan: See above in rollout plan
- Value Proposition to Moderators: Provides a platform to review millions of tweets a day for hate speech violations, quickly and at scale
- Value Proposition to Twitter Users: Twitter is the safest social media platform for sharing and receiving up-to-date content
Designing for Longevity
In the long-term I would want to consider:
- Adaptivity of model to change with emerging slang or satire
- Consideration for reclaimed slurs and language
- Quality of translation and localized cultural context
- Adding additional context such as images, videos, links, and account information
Data labeling efforts must be maintained for the model to continue to learn and reflect the current state of the world. A/B testing can help determine whether new or updated models are effective in reducing the prevalence of hate speech.
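As a sketch of that A/B readout, the comparison below applies a two-proportion z-test to hate-speech prevalence (confirmed violating tweets out of sampled tweets) in the control and treatment arms; the counts are illustrative placeholders.

```python
# Sketch: compare hate-speech prevalence between the current model (control)
# and an updated model (treatment). Counts are illustrative placeholders.
from statsmodels.stats.proportion import proportions_ztest

violations = [1_200, 950]            # confirmed violating tweets in control vs. treatment
sampled = [1_000_000, 1_000_000]     # tweets sampled per arm

stat, p_value = proportions_ztest(count=violations, nobs=sampled)
print(f"z={stat:.2f}, p={p_value:.4f}")  # ship the update if prevalence drops significantly
```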