- Have no prior experience with ML.
- Interested in learning ML but scared of its complexity.
- Interested in learning ML but scared of Python, Anaconda, and Jupyter Notebook.
If you can classify yourself under any of these categories, go ahead and download the folder found in the repository below:
This section contains a simple implementation of hate speech classification using the following technologies: The…
Machine learning sounds scary and fancy to most people. I like to describe it as the capability to train computer software to handle a certain situation without coding every single decision permutation it could potentially make.
Infusing machine learning into your software enables your organization to:
- Automate traditional decision-making workload done by humans
- Reduce operational cost through workload automation
- Enable employees to complete tasks faster
Now that you know what ML is, it’s time to start doing the dirty work and train your first ML model.
2019 is the year when we started seeing chatbot applications turn into a differentiating asset between competing digital platforms. They enable organizations to handle more customer queries, even while employees are resting or on holiday!
However, one common attack that chatbot applications experience is the spamming of hateful speech and offensive language.
Microsoft’s self-learning bot Tay is the most popular victim of this kind of attack. It was shut down just 16 hours after its release!
Tay picking up hateful messages could have been prevented if a hate speech/offensive language classifier had been applied to the data used to train her.
Our ML Model Could Have Been a Solution!
You’ve guessed it! We will build a stereotypical use case in the field of machine learning: a text classifier that predicts whether a given text is:
- Hate Speech
- Offensive Language
- Neither
Where do we get data?
Training ML models requires a good amount of data, though it doesn’t have to be enormous! I would advise against building your own dataset unless your use case is specific to a certain context, because it takes considerable time and human effort to collect, compile, and cleanse the data!
For this sample, I borrowed a dataset from the awesome repository below! A huge shout out to t-davidson for compiling this dataset!
Repository for Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. “Automated Hate Speech Detection…
Training the model
It’s time to start training our classification model! I’m going to break down the script file used for training into smaller chunks to gently describe the purpose of each step. The summarized version of the training process can be seen in the code below:
Some parts of the training script are omitted for the sake of brevity. The original file can be found on GitHub.
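If you’d rather see the whole flow inline, here is a rough, self-contained sketch of what the training script does. The file names (`labeled_data.csv`, `classifier.json`) are my assumptions, and `class`/`tweet` follow the dataset’s column names:

```javascript
// Rough sketch of the full training flow (not the exact GitHub script).
const LABELS = ["Hate Speech", "Offensive Language", "Neither"];

async function train(csvPath, modelPath) {
  const natural = require("natural");
  const csv = require("csvtojson");

  const classifier = new natural.BayesClassifier();
  const rows = await csv().fromFile(csvPath);

  // Feed every tweet to the classifier under its human-readable label.
  rows.forEach(row =>
    classifier.addDocument(row.tweet, LABELS[Number(row.class)])
  );

  classifier.train();

  // Persist the trained weights so we don't have to re-train every run.
  classifier.save(modelPath, err => {
    if (err) throw err;
    console.log(`Weights persisted to ${modelPath}`);
  });
}

// Kick off training (uncomment to run against your copy of the dataset):
// train("labeled_data.csv", "classifier.json");
```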
Utilizing Natural JS to Access Naive Bayes
```javascript
const natural = require("natural");
let classifier = new natural.BayesClassifier();
```
Loading the Data
In order to train the ML model, we have to feed it with data. We can do this by utilizing the csvtojson library to load the CSV file from disk using the code below:
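A minimal sketch of this step looks like the following; the file name `labeled_data.csv` is my assumption, so point it at wherever you saved the dataset:

```javascript
// Load the CSV from disk with csvtojson. Each row becomes an object keyed
// by the CSV headers, e.g. { class: "1", tweet: "..." }.
function loadDataset(csvPath) {
  const csv = require("csvtojson");
  return csv().fromFile(csvPath); // resolves to an array of row objects
}

// Example usage (uncomment to run):
// loadDataset("labeled_data.csv").then(rows =>
//   console.log(`Loaded ${rows.length} rows`)
// );
```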
Pre-processing the Dataset
In order to provide a nicer set of labels that can be read by humans, we translate the numerical labels (0, 1, 2) of each row into human-readable ones (“Hate Speech”, “Offensive Language”, and “Neither”).
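This translation step can be sketched as a tiny mapping function; the property names follow the dataset’s `class` and `tweet` columns, and `toReadable` is an illustrative name of my own:

```javascript
// Map the numeric "class" column to a human-readable label.
const LABELS = ["Hate Speech", "Offensive Language", "Neither"];

function toReadable(row) {
  return { text: row.tweet, label: LABELS[Number(row.class)] };
}

// toReadable({ class: "0", tweet: "..." })
//   => { text: "...", label: "Hate Speech" }
```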
Persisting the Weights to the Disk
In order to avoid re-training the model every time we need to predict/classify incoming messages, we persist the weights that the model needs to do its job to disk, so they can be re-loaded later:
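With `natural`, freezing and thawing the model boils down to its `save`/`load` API. A hedged sketch, where `classifier.json` is an assumed file name:

```javascript
// Freeze the trained weights to disk, and thaw them later.
function persist(classifier, modelPath, done) {
  classifier.save(modelPath, done); // serializes the trained weights
}

function restore(modelPath, done) {
  const natural = require("natural");
  // null keeps the default stemmer; done receives (err, classifier)
  natural.BayesClassifier.load(modelPath, null, done);
}

// Example usage (uncomment once you have a trained model on disk):
// restore("classifier.json", (err, clf) => {
//   if (err) throw err;
//   console.log(clf.classify("an incoming chat message"));
// });
```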
At the end of the day, most business owners are interested in the results, not in how awesome your model-training techniques are. In the field of ML, product owners judge models by how well they perform their job! In our case, distinguishing hateful text from text that isn’t.
To evaluate the accuracy, we will re-load the model that we froze to disk in the previous sections and use it to classify the training dataset! This gives us a data-driven measurement of how well our classification model performs.
I’ve written a script, available on GitHub, that performs an evaluation of our model. Check the code snippet below:
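The evaluation loop boils down to: thaw the frozen model, classify every training row, and count the hits. A hedged sketch of that idea (file names are my assumptions, not necessarily those of the GitHub script):

```javascript
// Re-load the frozen model and measure its hit rate over the dataset.
const LABELS = ["Hate Speech", "Offensive Language", "Neither"];

function accuracyPercent(correct, total) {
  return ((correct / total) * 100).toFixed(2);
}

async function evaluate(modelPath, csvPath) {
  const natural = require("natural");
  const csv = require("csvtojson");

  natural.BayesClassifier.load(modelPath, null, async (err, classifier) => {
    if (err) throw err;
    const rows = await csv().fromFile(csvPath);

    const correct = rows.filter(
      row => classifier.classify(row.tweet) === LABELS[Number(row.class)]
    ).length;

    console.log(
      `Correctly Predicted Items: ${correct} out of ${rows.length} ` +
        `(${accuracyPercent(correct, rows.length)}%)`
    );
  });
}

// evaluate("classifier.json", "labeled_data.csv");
```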
Correctly Predicted Items: 17,706 out of 24,783 (71.44%)
False Classifications: 7,077 out of 24,783 (28.56%)
Our model classified 17,706 items correctly, which represents 71.44% of the total training population! It’s not 90%+, but hey! It means we can correctly handle a significant amount of the workload that humans previously had to do manually!
Fiddling with our Classifier
Below is code that you can use to test the frozen model. As I mentioned earlier, it’s not perfect due to the limited data! Feel free to try it out by replacing the input strings below!
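A sketch of what that fiddling looks like; swap the sample strings for anything you like (`classifier.json` is an assumed file name):

```javascript
// Poke at the frozen model with your own strings.
function fiddle(modelPath, samples) {
  const natural = require("natural");
  natural.BayesClassifier.load(modelPath, null, (err, classifier) => {
    if (err) throw err;
    samples.forEach(text =>
      console.log(`"${text}" => ${classifier.classify(text)}`)
    );
  });
}

// Example usage (uncomment once you have a trained model on disk):
// fiddle("classifier.json", ["Have a wonderful day!", "You are all terrible"]);
```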
One more cool thing you can do by fiddling with this script from GitHub: compare the accuracy of the Naive Bayes and Logistic Regression algorithms!
Handling False Classifications
False classifications can hurt organizations in ways you can only imagine! Handling them gracefully is another big topic that we could discuss at length. For the sake of wrapping up this article, here is a list of common ways to handle false classifications:
- Adding a threshold that will prevent low-confidence classifications from being performed!
- Offloading low-confidence classifications to humans (not so shabby, since our goal is to reduce human intervention, not to fire everyone!).
- Performing manual re-classification of false positives.
- Building additional models using different algorithms and merging the predictions into a voting system, where the class/label with the most votes is presented as the result (I will write an article explaining this soon!).
- Collecting all false classifications and using them to re-train an enhanced model, familiarizing it with unseen data.
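The first item on the list, a confidence threshold, can be sketched on top of `natural`’s `getClassifications()`, which returns every label ranked by score (best first). The function name and the 0.75 ratio below are illustrative choices of mine, not recommendations:

```javascript
// Refuse to answer when the runner-up label scores too close to the winner.
function classifyWithThreshold(classifier, text, maxRatio = 0.75) {
  const [best, second] = classifier.getClassifications(text);
  if (second && second.value / best.value > maxRatio) {
    return null; // low confidence: offload to a human instead
  }
  return best.label;
}
```

Returning `null` is the hook for the second item on the list: that’s where you’d queue the message for a human to review.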
If you are interested in re-classification, I’ve written an article that explains how to emit notifications to Slack using AWS Lambda functions, the Slack API, and NodeJS.
This is a perfect baseline for informing and reaching out (in real time) to the people who have to re-classify predictions made by your ML models!
Reporting Errors via ChatOps using AWS Lambda & NodeJS
This article explains how to add real-time notifications using Slack, ChatOps, Lambda, and NodeJS
Is there a model that delivers a “100%” accuracy?
Yes, I call him God!! Kidding aside (please don’t judge me, as I love God with all my heart), 100% accuracy is too good to be true! Even humans make mistakes! “The problem comes from the data configuration” (“Le problème vient de la configuration des données”, my typical blaming phrase in French).
Good ML engineers don’t chase 100% by over-training their models! Extremely high accuracy often indicates a bias toward the data the model has already seen, and the model will succumb to the new forms and variants of text that you’ll have to classify in the future.
Thank you for reading
I hope you guys enjoyed reading my article and land a job that involves machine learning! Digital transformation is bringing us more exciting capabilities than any previous generation has seen! I’d be glad if I can at least influence people to become machine learning engineers and ride the digital transformation wave!
I’ll be writing articles on how you can infuse this ML model (and other variants) into Lambda functions, enabling organizations to run ML models without spending too much money!
Part 2: Deploy it to AWS Lambda!
Have you ever wondered how you can operationalize this ML model? Give the article below a chance and you’ll learn why machine learning models are better off residing inside APIs!