A high-accuracy framework for binary text classification — Machine Learning
Written with Raphael Silva.
Abstract
Have you ever heard of binary classification problems? Binary classification is an Artificial Intelligence task in which the algorithm must identify whether something belongs to a given class or not. Typical binary classification problems involve yes-or-no decisions (do or do not). Here are some examples: (i) medical testing to determine whether a patient has a certain disease; (ii) identifying whether an email is spam; or (iii) classifying the sentiment polarity of a text as positive or negative.
In this article, we present our solution (a framework) for working with multiple binary classification problems with high accuracy, along with an API (Application Programming Interface) to consume it. Most of the services offered by Sinch are related to texts and messages, such as WhatsApp, SMS (Short Message Service), and others. Thus, our framework was developed to work with text classification problems. However, it can be extended to any binary classification problem if the modeling and encoding stages are adapted.
As mentioned, we offer messaging services. We process a lot of messages from different channels, so we needed a single solution to address many different issues across those channels. We decided to provide an internal API to be consumed by our services. We have been using the binary text classifier framework for: (1) detecting automatic message replies, e.g., “Hi, I’ll be back soon”; (2) filtering out-of-scope messages, e.g., spam or messages from other contexts; (3) detecting whether the user is dissatisfied, i.e., the bot is not helping the user (so the system can switch the service to a human, a process called offloading); and other cases.
Solution
OK, but how does it work? It is called a framework because it is a software abstraction that provides generic functionality; therefore, it can be used for diverse binary text classification problems without major modifications. Our framework is split into two main procedures: (i) machine learning; and (ii) the API itself. Let us detail each of these steps.
Consider that we have a binary text classification problem, for example, detecting spam emails. First, we must develop the Machine Learning algorithm that will predict whether an email is spam or not. The first procedure is responsible for designing this algorithm by executing a series of steps: (i) collect and preprocess the data (remove duplicates and null values, and oversample the minority class); (ii) learn the features (extract word, character, and n-gram features to build a refined bag-of-words model); (iii) train the estimator, i.e., the Machine Learning algorithm (we’ve used a Neural Network with over three million parameters); (iv) evaluate the algorithm on a test dataset (the test set usually represents 20% of the total data); and finally, (v) save the model (create dumps of the model). Now we have a trained estimator to detect spam emails, and the same approach can be applied to other binary text classification problems. The sketch below illustrates these steps.
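As a rough illustration, here is a minimal sketch of this training procedure using scikit-learn. The dataset file, column names, labels, and hyperparameters are hypothetical, and the small MLP merely stands in for the much larger neural network mentioned above.

```python
# Simplified sketch of the training procedure; names and values are illustrative only.
import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import FeatureUnion, Pipeline

# (i) collect and preprocess the data (hypothetical CSV with "text" and "label" columns)
df = pd.read_csv("spam_emails.csv")
df = df.drop_duplicates().dropna(subset=["text", "label"])
# (oversampling of the minority class omitted for brevity)

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)

# (ii) learn the features: word and character n-grams (a bag-of-words style model)
features = FeatureUnion([
    ("word_ngrams", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char_ngrams", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
])

# (iii) train the estimator (a small neural network stands in for the real model)
model = Pipeline([
    ("features", features),
    ("clf", MLPClassifier(hidden_layer_sizes=(256,), max_iter=50)),
])
model.fit(X_train, y_train)

# (iv) evaluate on the held-out 20% test split
print("f1-score:", f1_score(y_test, model.predict(X_test), pos_label="spam"))

# (v) save the model (the dump that the API will later load)
joblib.dump(model, "spam_classifier.joblib")
```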
We only move on to the next procedure after finding a good model that reaches good results. Here are the f1-scores on some of our binary problems: (1) in automatic message detection, we reached 99.7%; (2) in filtering out-of-scope messages, we also got 99.7%; and (3) in detecting whether the user is dissatisfied, the most difficult problem, we got 76.8%. As you can see, the framework performs well in our cases and has presented promising results.
The second procedure is responsible for exposing the trained models via an API. The API maintains a list of all trained models and their files (the dumps) and exposes them using a unique identifier (ID) for each model. In this way, our services can consume these models by calling the API and passing the classifier’s unique ID. In summary, this API is created by the following steps: (i) register the saved models under their unique IDs; (ii) create a Docker container with all the models’ files; and (iii) deploy the container in the cloud (we’re using Google Cloud Platform, GCP). Thereby, we have one API that can be used to consume several classifiers; a minimal sketch of such a service is shown below. In our benchmarks, each message is processed in ~0.1 seconds. In addition, the API is product agnostic, which means that the same classifier can be used in many products.
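To give an idea of what such a service could look like, here is a minimal sketch assuming FastAPI and joblib. The classifier IDs, file names, and request schema are hypothetical, not the actual production API.

```python
# Minimal sketch of a serving API; model IDs, file names, and schema are illustrative only.
import joblib
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

# (i) register each saved model dump under a unique identifier
MODELS = {
    "auto-reply-detector": joblib.load("auto_reply.joblib"),
    "out-of-scope-filter": joblib.load("out_of_scope.joblib"),
    "dissatisfaction-detector": joblib.load("dissatisfaction.joblib"),
}

class Message(BaseModel):
    text: str

@app.post("/classifiers/{classifier_id}/predict")
def predict(classifier_id: str, message: Message):
    # look up the requested classifier by its unique ID
    model = MODELS.get(classifier_id)
    if model is None:
        raise HTTPException(status_code=404, detail="Unknown classifier ID")
    label = model.predict([message.text])[0]
    return {"classifier_id": classifier_id, "label": str(label)}
```

In the container image, the model dumps would be copied alongside the application and the service started with a standard ASGI server such as uvicorn; the exact packaging details depend on the deployment setup.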
Discussion
We presented our framework for working with any binary text classification problem. As stated, the framework is generic and has configurable steps and parameters to train different kinds of models. In some of our cases, we achieved over 99% f1-score. In addition, the models are deployed behind an API that can process each message in ~0.1 seconds. If you wish, please contact us to learn more about it.
Future work. Our classifiers currently only understand Portuguese because they are trained on bag-of-words features and the datasets we used are in that language. Therefore, to enable the classifiers for all Sinch customers, in many languages, we are incorporating language-agnostic sentence embeddings (as features), which cover more than 100 languages. The sketch below gives an idea of how such embeddings could replace the bag-of-words features. Follow us to learn more in our next post.
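As a rough sketch of this direction, a multilingual encoder such as LaBSE (available through the sentence-transformers library, covering 100+ languages) could produce the sentence embeddings; this model and the tiny training set below are only examples, not necessarily what we will use in production.

```python
# Sketch: replacing bag-of-words features with language-agnostic sentence embeddings.
# LaBSE is one example of a multilingual encoder; the examples and labels are hypothetical.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("sentence-transformers/LaBSE")

def embed(texts):
    # Texts in different languages map to a shared embedding space,
    # so a classifier trained mostly on Portuguese can generalize across languages.
    return encoder.encode(texts)

# Hypothetical labelled examples in mixed languages (1 = automatic reply, 0 = not)
train_texts = ["Oi, volto logo", "Hi, I'll be back soon", "Quero falar com um atendente"]
train_labels = [1, 1, 0]

clf = LogisticRegression().fit(embed(train_texts), train_labels)
print(clf.predict(embed(["Estarei de volta em breve"])))
```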
If you want to work with us or learn more, please contact me: