Building a Real-Time Hate Speech Detector for the Web

Antti Havanko
Published in The Startup · Oct 1, 2020 · 5 min read

As a learning project, I wanted to build a hate-speech detector for the web. There’s really no practical use for such an application, but it felt like a fun thing to do. It would record the user’s webcam and then (virtually) hush the user when she says something that is frowned upon. This article describes how I ended up doing it, and hopefully it’ll be useful for you as well!

Here’s a sneak peek of the final implementation (starring me!!):

Detection in action with the web app

Implementation

The web app itself is quite simple and mostly just orchestrates everything. I decided to go with a Node.js backend and a simple vanilla JavaScript front end.

The implementation consisted of four steps:

  1. Transcribing audio from the microphone to text
  2. Detecting hate speech from text
  3. Building a mouth detector (with machine learning)
  4. Detecting mouths from a video stream

I’ll go through each step in detail next.

1. Transcribing audio from the microphone to text

You can access the microphone with the getUserMedia API in the WebRTC specification. If the user grants access, the API will return a stream that contains the audio data from the microphone.
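In its simplest form, requesting the microphone looks like this (a minimal sketch; error handling and permission denials omitted):

```javascript
// Ask the user for microphone access; resolves to a MediaStream with the audio.
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
```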

I used Google’s Speech-to-Text API to transcribe the audio to text so it can be classified further. There is also the Web Speech API, which provides speech recognition in the browser, but as it’s only supported in Chrome and Edge at the moment, I decided to go with Google’s API. Google offers 60 minutes of free processing per month, which is enough for this demo project.

The browser sends the audio data to the backend, which then calls the Google Speech API. The communication between the browser and the backend is done with the Socket.io library.
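Here’s a rough sketch of both sides of that pipe. The event names, sample rate, and conversion details are my own assumptions; a production version also needs resampling and error handling:

```javascript
// Client: capture PCM chunks and push them to the backend over Socket.io.
// (io() comes from the socket.io client script; ScriptProcessorNode is
// deprecated but keeps the sketch short.)
const socket = io();
const ctx = new AudioContext();
const source = ctx.createMediaStreamSource(stream); // stream from getUserMedia
const processor = ctx.createScriptProcessor(4096, 1, 1);

processor.onaudioprocess = (event) => {
  const float32 = event.inputBuffer.getChannelData(0);
  // Convert Float32 samples in [-1, 1] to the 16-bit PCM (LINEAR16) the API expects.
  const int16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    int16[i] = Math.max(-1, Math.min(1, float32[i])) * 0x7fff;
  }
  socket.emit('audio', int16.buffer);
};
source.connect(processor);
processor.connect(ctx.destination);
```

```javascript
// Server: forward incoming audio chunks into a Google Speech streaming request.
const speech = require('@google-cloud/speech');
const io = require('socket.io')(3000); // minimal Socket.io server for the sketch

const client = new speech.SpeechClient();

io.on('connection', (socket) => {
  const recognizeStream = client
    .streamingRecognize({
      config: {
        encoding: 'LINEAR16',
        sampleRateHertz: 44100, // must match the client's AudioContext sample rate
        languageCode: 'en-US',
      },
      interimResults: true,
    })
    .on('data', (data) => {
      const result = data.results[0];
      if (result) {
        // Ship the transcript back to the browser for classification.
        socket.emit('transcript', result.alternatives[0].transcript);
      }
    });

  socket.on('audio', (chunk) => recognizeStream.write(chunk));
  socket.on('disconnect', () => recognizeStream.end());
});
```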

There is a nice sample project from Vinzenz Aubry on GitHub about this.

2. Detecting hate speech from text

After I had the transcribed text, I had to figure out whether it contained anything that could be considered hate speech. I won’t go into detail here on how to do it, but you can, for example, catch specific words/phrases or train a classifier using datasets that are freely available online (e.g. http://hatespeechdata.com). Or use proprietary software. ;)
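As a toy illustration of the word/phrase approach (the blocklist entries are obviously hypothetical placeholders):

```javascript
// Naive baseline: flag a transcript if it contains a blocklisted phrase.
const BLOCKLIST = ['some slur', 'another offensive phrase']; // placeholder entries

function isHateSpeech(transcript) {
  const text = transcript.toLowerCase();
  return BLOCKLIST.some((phrase) => text.includes(phrase));
}
```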

3. Building a mouth detector

I also wanted to display a “hush sign” 🤫 on top of the user’s mouth when she’s saying bad things. For this, I needed to detect the position of the mouth in the video. I couldn’t find any ready-made mouth detection solutions, so I decided to build one myself (i.e. I didn’t even look for any existing solutions because I wanted to build it myself). I was originally planning to build and train a model with TensorFlow but eventually decided to go with Google’s AutoML Vision Edge. Vision Edge enables you to build low-latency, high-accuracy models just by providing examples of the objects you want to detect, so it’s super easy to build something like a mouth detector. The trained models can then be exported as a TensorFlow.js model (among other formats), which enables you to do the inferencing on the client side. Perfect!

To build such a model, I needed a lot of photos of different kinds of mouths. As I only have one mouth myself, I searched for public datasets that I could leverage. TensorFlow offers many public datasets, and there is one meant for face attribute recognition: https://www.tensorflow.org/datasets/catalog/the300w_lp. This dataset (300W-LP) contains ~61k images of faces with 68 different landmarks per image (eyes, nose, mouth, etc.), which is perfect for my use case.

In order to make this compatible with AutoML Vision Edge, I had to put the images in a Cloud Storage bucket and provide a CSV file containing the position of the mouth in each image. The CSV should be in the following format:

set,path,label,x_min,y_min,,,x_max,y_max,,

The path contains the Google Cloud Storage URI for the image and the x’s/y’s refer to the position of the object you’re trying to detect. More information about the CSV format can be found here: https://cloud.google.com/vision/automl/object-detection/docs/csv-format

In the 300W-LP dataset, there are 19 different landmarks for the mouth. I used them to calculate a bounding box for the mouth in each image and then uploaded this data as a CSV to a Cloud Storage bucket, together with the images.
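Here’s roughly what that calculation looks like, assuming the mouth landmarks for an image have already been extracted as [x, y] pixel pairs (the function and label names are my own):

```javascript
// Sketch: turn the mouth landmarks of one image into an AutoML CSV row.
function toCsvRow(gcsUri, mouthLandmarks, imageWidth, imageHeight) {
  const xs = mouthLandmarks.map(([x]) => x);
  const ys = mouthLandmarks.map(([, y]) => y);

  // AutoML Vision expects coordinates normalized to [0, 1].
  const xMin = Math.min(...xs) / imageWidth;
  const yMin = Math.min(...ys) / imageHeight;
  const xMax = Math.max(...xs) / imageWidth;
  const yMax = Math.max(...ys) / imageHeight;

  // Leaving the set column empty lets AutoML split train/validation/test itself.
  return `,${gcsUri},mouth,${xMin},${yMin},,,${xMax},${yMax},,`;
}
```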

Extracting mouths from 300W-LP
Final CSV file

After you have imported the CSV in AutoML’s web UI, you can easily verify that the mouths are marked correctly. You can even relabel images from the UI if you want!

Mouth dataset on AutoML Web UI

Training

AutoML Vision Edge offers three model options with different latency and accuracy estimates: https://cloud.google.com/vision/automl/docs/example-devices. I chose the “Best trade-off” option to keep the model size relatively small and thus inferencing fast. After 6 hours of training, the model reached 100% precision (!) and ~98% recall. The super high precision is probably explained by the rather homogeneous dataset, which only contains photos of faces. But this should be fine because I’m planning to use the model only for this kind of photo anyway.

4. Detecting mouths from a video stream

Using the exported TensorFlow.js model is quite simple. TensorFlow.js has a set of APIs for loading and running models produced by AutoML Edge. These APIs also take care of image preprocessing and accept an HTMLVideoElement as input, so you don’t need to extract frames from the video stream yourself.
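Loading and running the model looks roughly like this, using the @tensorflow/tfjs-automl package (the model path and detection thresholds are my own placeholders):

```javascript
import '@tensorflow/tfjs'; // peer dependency of the AutoML wrapper
import * as automl from '@tensorflow/tfjs-automl';

// Load the exported object detection model (the path depends on where you host it).
const model = await automl.loadObjectDetection('model/model.json');

// The API accepts the <video> element directly; preprocessing happens internally.
const video = document.querySelector('video');
const predictions = await model.detect(video, { score: 0.5, iou: 0.5, topk: 5 });
// Each prediction: { label, score, box: { left, top, width, height } }
```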

Overall, the model seems to work surprisingly well, even in low-light environments. The inferencing time is OK on my MacBook (~60–70 ms per frame) but a bit slow on mobile (~500 ms on a Samsung S9). But this will do for now.

Detection in a low-light environment (desktop)
Detection on mobile (Samsung S9)
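And to close the loop with the “hush sign” from the beginning: once you have the detected box, overlaying the emoji is just a bit of canvas work. A rough sketch, assuming a canvas positioned over the video at the same size and the predictions from the snippet above:

```javascript
const canvas = document.querySelector('canvas');
const ctx = canvas.getContext('2d');

function drawHush(predictions) {
  ctx.clearRect(0, 0, canvas.width, canvas.height);
  for (const { box } of predictions) {
    // Scale the emoji to the detected mouth and center it on the bounding box.
    ctx.font = `${Math.round(box.width)}px serif`;
    ctx.textAlign = 'center';
    ctx.textBaseline = 'middle';
    ctx.fillText('🤫', box.left + box.width / 2, box.top + box.height / 2);
  }
}
```

In the real app, you’d only draw this while the latest transcript trips the hate-speech check.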

Conclusion

As you can see, building something like this is really easy nowadays. Browsers have easy-to-use APIs for interacting with peripherals, and most of the other building blocks are readily available, either as open-source software or as a (commercial) API. Of course, if you can’t or don’t want to use a commercial solution, you have some more work ahead of you. But often those commercial products are so cost-efficient that it doesn’t make much sense to build things yourself unless you’re operating at Google’s scale.
