From Zero to Production

A Machine Learning Journey

At Komfo, we recently released a feature that automatically predicts whether a given social media post has a positive, neutral or negative sentiment, a.k.a. automated sentiment analysis. This article outlines some of the issues we faced and the lessons we learned.

The Story

One of the main areas in Komfo is Monitor. There, our users can monitor all their activities across social media channels. They also have a designated button to mark posts with their sentiment — positive, neutral or negative. Our task was to devise a model to do this automatically through machine learning and thus save time and improve the customer service quality provided by our users.

Left — users manually select a sentiment; Right — sentiment is applied automatically

Keep It Simple Stupid

Deep Learning is becoming more and more popular these days. It is an advanced technique that allows machines to come up with their own representation of the domain and leverage it to make decisions. Online media keeps providing us with success stories of its applications. These kinds of neural networks are very promising, but it might be better to choose a simpler, time-tested model for your first machine learning project.

The main reason for preferring a simpler model is interpretability. When you first deploy the model, you are bound to receive a lot of questions about how it works or why it gave a particular score. These questions are a lot easier to answer if you don’t have to explain convolutions, dropout, embeddings, etc.

How exactly does it work?

More importantly, an interpretable model lets you verify that the algorithm has actually learned something correct. For example, in the sentiment analysis task, a good model should never associate a single smiley face with a negative sentiment.

However, don’t be too quick to discard the model when it doesn’t match your worldview. For instance, we expected a swear word like “f**k” to have a negative polarity. When the feature extraction phase completed, it turned out to be neutral. A closer analysis of the data revealed that it is often used as an intensifier for positive emotions, as in “f**k yeah”, and these positive uses balance out the negative ones.
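To make this kind of inspection concrete, here is a minimal sketch, assuming a smoothed log-odds score per token. This is not necessarily the feature extraction we ran; the toy corpus and the `token_polarity` helper are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus of (text, label) pairs; not real Komfo data.
POSTS = [
    ("f**k yeah this is great", "positive"),
    ("f**k yeah love it", "positive"),
    ("great service love it", "positive"),
    ("f**k this is awful", "negative"),
    ("f**k off awful service", "negative"),
    ("awful support hate it", "negative"),
]


def token_polarity(posts, smoothing=1.0):
    """Return a smoothed log-odds score per token.

    Scores above zero lean positive, scores below zero lean negative,
    and scores near zero mean the token shows up on both sides.
    The toy corpus only has two labels, so anything that is not
    "positive" is counted as negative.
    """
    pos, neg = Counter(), Counter()
    for text, label in posts:
        target = pos if label == "positive" else neg
        target.update(text.split())
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + smoothing * len(vocab)
    neg_total = sum(neg.values()) + smoothing * len(vocab)
    return {
        tok: math.log((pos[tok] + smoothing) / pos_total)
        - math.log((neg[tok] + smoothing) / neg_total)
        for tok in vocab
    }


polarity = token_polarity(POSTS)
for tok in ("great", "f**k", "awful"):
    print(f"{tok:>6}: {polarity[tok]:+.2f}")
```

In this toy data the swear word appears equally often in positive and negative posts, so its score lands close to zero while “great” and “awful” are clearly polarized. With a model this simple, a surprising weight can be traced back to the posts that produced it.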

Last, but definitely not least: running a complicated deep learning model is expensive. Training such a classifier may take weeks and may require special hardware.

Choose a standard, not a framework

The machine learning community is producing a new machine learning tool (be it a library, a framework or a technique) every week. The data science team might want to do some feature extraction in R, some preprocessing in Python and some deep learning with Keras, and you should give them the flexibility to do that. That doesn’t mean you should run every imaginable library in production.

Instead, you can pick a standard. The team can run whatever analysis they wish, but the final deliverable has to be in a predefined format. This way, you can swap out an algorithm any time you want.

For now, at Komfo we use the Predictive Model Markup Language (PMML). It’s ugly XML, but it is supported by many languages and libraries, and it gets the job done.
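To give a feel for what such a deliverable looks like, here is a minimal sketch that reads a tiny hand-written PMML regression model with Python’s standard library and scores one input. The field names and coefficients are invented, the fragment is not one of our models, and a real deployment would more likely rely on a full PMML evaluator library than on hand-rolled parsing.

```python
import xml.etree.ElementTree as ET

# Hand-written, minimal PMML fragment for illustration only.
# Field names and coefficients are invented.
PMML_DOC = """
<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header description="toy sentiment score"/>
  <DataDictionary numberOfFields="3">
    <DataField name="smileys" optype="continuous" dataType="double"/>
    <DataField name="complaints" optype="continuous" dataType="double"/>
    <DataField name="score" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="smileys"/>
      <MiningField name="complaints"/>
      <MiningField name="score" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="0.1">
      <NumericPredictor name="smileys" coefficient="0.5"/>
      <NumericPredictor name="complaints" coefficient="-0.75"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

NS = {"pmml": "http://www.dmg.org/PMML-4_4"}


def load_linear_model(pmml_text):
    """Extract the intercept and coefficients from a RegressionTable."""
    root = ET.fromstring(pmml_text)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefficients


def score(model, features):
    """Evaluate the linear model; missing features count as zero."""
    intercept, coefficients = model
    return intercept + sum(
        coef * features.get(name, 0.0) for name, coef in coefficients.items()
    )


model = load_linear_model(PMML_DOC)
print(score(model, {"smileys": 2, "complaints": 0}))
```

The serving code only knows about the format, not about the tool that produced the file, which is what makes swapping algorithms cheap.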

Plan for change

Deploying the machine learning model is not where the process ends. It’s where it shifts into gear. Now you need to monitor the output, analyze the errors and perform model selection.

The main reason to monitor the output is to verify that the model is actually doing its job. There are many reasons why a model might perform badly. Maybe the training data was not representative of the whole picture. Or maybe the learning procedure produced a model that mimics the training data too closely (in this case, we say the model has overfit). “But it works on my dataset!” is the data science equivalent of “But it works on my machine!”.

You should define a set of objectives and use them as acceptance criteria. For the sentiment analysis task, we wanted to automate a user action, so we were mostly interested in precision for the positive and negative classes. So, we could count how many times the model agreed with the user:

Model predictions for items marked positive by users.
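A count like this is easy to compute from pairs of user and model labels. The sketch below uses made-up pairs and two hypothetical helpers. `predictions_for` mirrors the chart (what the model said for items users marked with a given label), while `precision` looks from the other direction (how often users agreed when the model chose a label).

```python
from collections import Counter

# Hypothetical (user_label, model_label) pairs; not real Komfo data.
PAIRS = [
    ("positive", "positive"),
    ("positive", "positive"),
    ("positive", "neutral"),
    ("negative", "negative"),
    ("negative", "positive"),
    ("neutral", "neutral"),
    ("neutral", "negative"),
]


def predictions_for(pairs, user_label):
    """Count what the model predicted for items users marked `user_label`."""
    return Counter(model for user, model in pairs if user == user_label)


def precision(pairs, label):
    """Of the items the model labelled `label`, the share users agreed with."""
    predicted = [user for user, model in pairs if model == label]
    if not predicted:
        return None  # the model never predicted this label
    return sum(user == label for user in predicted) / len(predicted)


print(predictions_for(PAIRS, "positive"))
print(precision(PAIRS, "positive"), precision(PAIRS, "negative"))
```

Tracking both views helps: the first shows where the model disagrees with users, the second shows how much a predicted label can be trusted.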

Only looking at the output will not make much of a difference — you need to do something about it! The best thing to do is to have multiple models running in parallel. Even if you have developed just a single model, you can deploy several versions with different hyper-parameters.

When you have the ability to run several models in parallel and you can evaluate their performance, you can now select the best one. This allows you to deploy new models and verify that they are better than their previous versions in an objective way.

Development pipeline

This experiment would ideally be done with A/B testing. At Komfo, we chose to use something a bit different. We deploy multiple active models, but we always show the end-users the result from a chosen one. Whenever an end-user gives a sentiment score, we evaluate all active models against it. This way we can continuously compare and contrast the models using our internal tools.

Comparing 3 of our models based on the number of correctly classified negative items.

When a given model consistently outperforms the currently selected model, then we switch to the new one.
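The approach described above can be sketched roughly as follows. This is an illustration under assumptions, not our internal tooling: the `ShadowEvaluator` class, the toy models, and the `min_samples` and `margin` thresholds are all hypothetical, and it compares plain accuracy where a real setup would use whatever metric was chosen as the acceptance criterion.

```python
from collections import defaultdict


class ShadowEvaluator:
    """Scores every active model against user labels; serves only one."""

    def __init__(self, models, selected):
        self.models = models      # name -> callable(text) -> label
        self.selected = selected  # name of the model shown to end-users
        self.correct = defaultdict(int)
        self.seen = 0

    def predict(self, text):
        """Return the label end-users see (from the selected model only)."""
        return self.models[self.selected](text)

    def record_feedback(self, text, user_label):
        """Evaluate all active models against a label given by a user."""
        self.seen += 1
        for name, model in self.models.items():
            if model(text) == user_label:
                self.correct[name] += 1

    def accuracy(self, name):
        return self.correct[name] / self.seen if self.seen else 0.0

    def maybe_switch(self, min_samples=4, margin=0.1):
        """Switch only after enough feedback and a clear accuracy lead."""
        if self.seen < min_samples:
            return self.selected
        best = max(self.models, key=self.accuracy)
        if self.accuracy(best) - self.accuracy(self.selected) >= margin:
            self.selected = best
        return self.selected


# Two toy stand-ins for real model versions.
def always_positive(text):
    return "positive"


def keyword_model(text):
    return "negative" if "awful" in text else "positive"


evaluator = ShadowEvaluator(
    {"v1-always-positive": always_positive, "v2-keyword": keyword_model},
    selected="v1-always-positive",
)
feedback = [
    ("awful support", "negative"),
    ("love it", "positive"),
    ("awful app", "negative"),
    ("great work", "positive"),
]
for text, label in feedback:
    evaluator.record_feedback(text, label)
print(evaluator.maybe_switch())
```

Because every active model is scored on the same user feedback, the comparison is apples to apples, and end-users only ever see one model’s output.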

Streaming Architecture

Pick a random machine learning tutorial. The second step will most likely be something along the lines of “split the data into a training and testing set”. This is easy to do offline, but may be trickier in production.

In the real world, data doesn’t sleep — your datasets change all the time. You need to plan how to change your models accordingly — incorporating new data and deprecating old data.

Streaming architecture is a very good fit for the changing nature of the datasets because a stream is infinite by definition. At Komfo, we use Kafka. Whenever we gather a new post, we add it to a stream. The machine learning solution then consumes this message and we get a classification in near real-time. Whenever a user gives a sentiment score for an item, we add this to another stream. The data science team can then proceed to consume from these streams and do as they please.

Another added benefit is that you can rewind the stream and have the same messages processed differently. This way, you can fix some bugs or try out a new model.
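The idea of consuming and rewinding can be illustrated without a broker. The sketch below is an in-memory stand-in for a stream, not the Kafka client API: the `Stream` and `Consumer` classes and the two toy models are hypothetical and only mimic the concept of an append-only log read by offset.

```python
class Stream:
    """In-memory append-only log standing in for a Kafka topic."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the new message

    def read(self, offset):
        """Return all messages from `offset` onwards."""
        return self._messages[offset:]


class Consumer:
    """Tracks its own offset, so consumers progress independently."""

    def __init__(self, stream, handler):
        self.stream = stream
        self.handler = handler
        self.offset = 0

    def poll(self):
        """Process every message not yet seen by this consumer."""
        batch = self.stream.read(self.offset)
        self.offset += len(batch)
        return [self.handler(message) for message in batch]

    def rewind(self, offset=0):
        """Move back so earlier messages are processed again."""
        self.offset = offset


def model_v1(text):
    return (text, "positive")


def model_v2(text):
    return (text, "negative" if "awful" in text else "positive")


posts = Stream()
for text in ("love it", "awful support"):
    posts.append(text)

consumer = Consumer(posts, model_v1)
first_pass = consumer.poll()

# Swap in a new model and replay the same messages from the start.
consumer.handler = model_v2
consumer.rewind()
second_pass = consumer.poll()
print(first_pass, second_pass)
```

The messages themselves never change; only the consumer’s position and handler do, which is why replaying a stream is a cheap way to test a fix or a new model on real traffic.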

Final Words

Machine Learning is a powerful beast that needs to be tamed. Building an intelligent system is a challenge, but it is quite manageable if you plan ahead of time. Even though machine learning is a mighty tool, you shouldn’t develop it just for the sake of it. In the end, you need to provide value for your customers and your community. The most important step is to find a problem that your users need solved.

What is your experience with machine learning on social? Can you share any lessons for running machine learning solutions in a production environment?