Conversation is at the core of building and maintaining a remote culture! It follows that certain jargon, abuse or harassment, even when not said with malicious intent, can stop people from contributing to conversations or, even worse, stop them from engaging altogether. The same goes for counter-productive vocabulary, such as negatively associated words and poor machine translations. Having something that provides insight into how others might interpret a message, especially in a multi-lingual company, is a valuable tool and a great way to keep conversations about inclusivity relevant.
Google’s Jigsaw has been working on this problem for a while and has built https://www.perspectiveapi.com to help moderation teams and commenters get real-time feedback on the quality of their conversations. As part of this project, Jigsaw has been engaging with the Kaggle data science community by publishing labelled datasets, which means we mere mortals can now explore and begin to build our own models!
The model is based on the Jigsaw dataset, which was obtained by asking people to rate internet comments on a scale from “Very toxic” to “Very healthy”. Jigsaw defined “toxic” as “a rude, disrespectful, or unreasonable comment that is likely to make you leave a discussion.”
We decided to take this dataset as an exercise to see what we could produce with minimal effort, and to implement it in a fun and exciting way! As you might have seen in a previous post, we have been using Uber’s Ludwig as a zero-code alternative for building ML models. TensorFlow and Keras are great, but they aren’t the easiest tools to prototype with or to teach!
The first step is to install our dependencies. We are running on GPU instances to speed up training, but don’t be scared: Google provides free GPU machines for ML research at https://colab.research.google.com, which is AWESOME!
#!/usr/bin/env python3.7
pip install ludwig
pip install tensorflow-gpu==1.13.1
python -m spacy download en
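Before moving on, it can be worth confirming that the packages actually resolved in the environment you are running in. The snippet below is a minimal sketch using only the standard library; the module names in `REQUIRED_MODULES` are an assumption about what the packages above expose, and `missing_modules` is a helper made up for this post.

```python
import importlib.util

# Assumption: these are the top-level module names provided by the
# packages installed above.
REQUIRED_MODULES = ("ludwig", "tensorflow", "spacy")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies found.")
```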
Once everything is installed, our next step is to build a model. This sounds scary but, in essence, we just need to declare the inputs in our training dataset and map them to outputs.
Using the header from our training.csv file, we can start building and testing models using Ludwig’s YAML configuration format.
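# Feature types are assumed here: free text in, one 0/1 label per category out.
input_features:
  - {name: comment_text, type: text}
output_features:
  - {name: toxic, type: binary}
  - {name: severe_toxic, type: binary}
  - {name: threat, type: binary}
  - {name: insult, type: binary}
  - {name: obscene, type: binary}
  - {name: identity_hate, type: binary}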
In its simplest form, this gives us a usable model that we can begin to integrate with!
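To make the "header to configuration" step concrete, here is a rough sketch of how the feature list above could be derived from the CSV header in Python. It rests on a few assumptions that are not from the original setup: that the file has an `id` column to skip, that `comment_text` is the only input, and that every other column is a 0/1 label. The helper names are made up for illustration.

```python
import csv
import io

# Assumption: the Kaggle file carries an "id" column that is neither an
# input nor a label.
IGNORED_COLUMNS = {"id"}
TEXT_COLUMN = "comment_text"


def read_header(csv_file):
    """Return the first row of an open CSV file as a list of column names."""
    return next(csv.reader(csv_file))


def build_model_definition(header):
    """Map CSV column names to a dictionary shaped like the YAML configuration."""
    inputs, outputs = [], []
    for column in header:
        if column in IGNORED_COLUMNS:
            continue
        if column == TEXT_COLUMN:
            inputs.append({"name": column, "type": "text"})
        else:
            # Assumption: every remaining column is a 0/1 label.
            outputs.append({"name": column, "type": "binary"})
    return {"input_features": inputs, "output_features": outputs}


# A stand-in for the first line of training.csv.
sample = io.StringIO(
    "id,comment_text,toxic,severe_toxic,threat,insult,obscene,identity_hate\n"
)
definition = build_model_definition(read_header(sample))
print([feature["name"] for feature in definition["output_features"]])
```

The dictionary mirrors the structure of the YAML file, so it can be dumped to disk with any YAML library and handed to Ludwig from there.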
As our model is in Python, we decided to just go ahead and write the bot in Python too, and it turned out to be super simple!
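The core of such a bot can be sketched as follows. This is an illustration under stated assumptions and not our production code: `predict` and `react` are hypothetical callables (in a real bot, `predict` would wrap the trained Ludwig model and `react` would wrap whatever Slack client call adds a reaction), and the emoji names and the 0.5 threshold are placeholders.

```python
import time

# Hypothetical mapping from label to Slack emoji name. The labels come from
# the model configuration; the emoji choices are placeholders.
REACTIONS = {
    "threat": "rotating_light",
    "severe_toxic": "no_entry",
    "identity_hate": "no_entry",
    "insult": "warning",
    "obscene": "warning",
    "toxic": "warning",
}


def pick_reaction(scores, threshold=0.5):
    """Return the emoji for the top label if it scores at or above threshold."""
    if not scores:
        return None
    label, score = max(scores.items(), key=lambda item: item[1])
    if score < threshold:
        return None
    return REACTIONS.get(label, "warning")


def handle_message(text, predict, react, threshold=0.5):
    """Score one message, react if needed, and return the elapsed seconds."""
    started = time.perf_counter()
    emoji = pick_reaction(predict(text), threshold)
    if emoji is not None:
        react(emoji)
    return time.perf_counter() - started


# Stand-ins so the sketch runs without Slack or a trained model.
def fake_predict(text):
    return {"toxic": 0.9 if "idiot" in text.lower() else 0.1, "threat": 0.05}


sent = []
elapsed = handle_message("You are an idiot", fake_predict, sent.append)
print(sent, f"{elapsed:.4f}s")
```

Passing the model and the Slack call in as plain functions keeps the reaction logic testable on its own, and the returned duration is the number to watch when judging response time.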
And just like that, we have a Slack bot that runs messages against our model and returns an emoji reaction. And the response time? Just over 1 second :D Which I think isn’t too bad for a PoC.