Building ML models is hard. Deploying them in real business environments is harder.

From idea to production system — the story of how an NLP project in the Ocado contact center improved reply times by up to 4x. Also, ten tips for other Data Science teams.

A few months ago, we described on our blog how machine learning (ML) improved efficiency in our contact center. Today we would like to tell you how we built this system, what we have learned along the way, and how we were able to reduce response times for customer emails by up to 4x.

Presenting the problem

Imagine that you are a manager of a sizeable contact center that is getting a few thousand customer emails on a daily basis. Your customers typically contact you about very different things. For example:

  • John wants to give feedback about how polite his driver was
  • Matthew asks for a refund because his product was damaged
  • Alice informs us that she isn’t at home so the delivery won’t be successful
  • Jane wants to thank us for her great first delivery

As a manager, you need to decide:

  1. How long can an email wait in the queue without a response?
  2. Is Alice’s request more important than John’s feedback?

There are no easy answers to these questions. All contact center managers need to deal with these problems, and Ocado is no exception.

Offering a solution

Imagine that you have a system that assigns appropriate tags based on the content of an email, as in the example below:

Later, another function determines the priority of that email (and how quickly you should react) based on tags returned from the machine learning model. In our contact center, the tag cloud included labels such as Feedback, Food issue, Spam, Damaged item, Voucher, Quality and a few others.

You may wonder why we split this process into two steps rather than classifying priority directly.

This was one of our lessons learned. When you are building machine learning models for a real business, you need to take into account that the business will change and priorities will shift. Baking such business rules into your model is a bad idea, because every change in priorities would then mean changing the model itself. To be agile, you need to give your business a lot of flexibility.

Here is an example of how assigning priorities would work:

In our proposed solution, the contact center manager can decide that emails tagged “Thank you” (generally sent by happy customers) are not as important as “Payment issue”-type emails which must be answered in a matter of minutes.
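The second step can be as simple as a lookup table that the business owns. Below is a minimal sketch of the idea, not our production code: the tag names, the priority levels and the fallback value are all hypothetical and chosen only for illustration.

```python
# Hypothetical tags and priority levels (1 = most urgent), for illustration only.
PRIORITY_RULES = {
    "Payment issue": 1,
    "Damaged item": 2,
    "Feedback": 3,
    "Thank you": 4,
}
DEFAULT_PRIORITY = 3  # assumed fallback for tags without an explicit rule


def email_priority(tags, rules=PRIORITY_RULES, default=DEFAULT_PRIORITY):
    """Return the most urgent (lowest) priority among the predicted tags."""
    if not tags:
        return default
    return min(rules.get(tag, default) for tag in tags)


# An email tagged with both labels inherits the more urgent priority.
print(email_priority(["Thank you", "Payment issue"]))
```

Because the rules live outside the model, a manager could change a priority by editing the table, with no retraining involved.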

First things first

Before we started gathering data, we wanted to ensure that we all understood the problem domain correctly. Nothing beats hands-on experience, so we switched off our computers and spent a day in the contact center to understand what the work there really looks like. That experience was fun and, in hindsight, very useful; it helped us build relationships with many colleagues we hadn’t interacted with before and picture their problems in greater detail.

To determine the success of this project, we defined a clear business goal: to minimize the amount of time which urgent emails need to wait in the queue before receiving a response.

At the end of the project, we wanted to see the following pattern appear on the contact center dashboard.

From a machine learning perspective, this is a classic multilabel text classification problem. In multilabel problems, evaluating solutions quickly often means computing a single aggregate measure that combines the measures for individual labels. We decided to use the well-known F1 score, compute it for every label, and average the results (an approach known as macro averaging).
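To make the metric concrete, here is a small sketch of macro-averaged F1 in plain Python. The function names and the toy data are ours for illustration; a label that is never predicted and never present is given an F1 of 0 here, which is one common convention rather than the only one.

```python
def f1_for_label(y_true, y_pred):
    """F1 for a single label, given parallel lists of 0/1 values."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(true_rows, pred_rows):
    """Average the per-label F1 scores; each row is one email's 0/1 label vector."""
    n_labels = len(true_rows[0])
    scores = []
    for j in range(n_labels):
        col_true = [row[j] for row in true_rows]
        col_pred = [row[j] for row in pred_rows]
        scores.append(f1_for_label(col_true, col_pred))
    return sum(scores) / n_labels


# Toy data: four emails, three labels.
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 0]]
print(round(macro_f1(y_true, y_pred), 3))  # per-label F1: 1, 2/3 and 0
```

Macro averaging weights every label equally, so a rare tag that the model always misses drags the score down just as much as a common one would.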

The dataset

Ocado maintains a large dataset of inbound emails that has been manually categorised by our contact centre advisors over the course of several years; this gave us over one million training examples for our multilabel classification. We couldn’t use the data in its raw format, however; some emails contained confidential data like phone numbers, postal or email addresses and customer names. Before we did anything with the data, we had to anonymise it. The process of deleting personal information is a very complex task and could be the topic of a standalone blog post.
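To give a flavour of the easiest part of that task, the sketch below masks email addresses and phone-like digit runs with regular expressions. It is a deliberately naive illustration, not the process we used: the patterns and placeholder tokens are assumptions, and names and postal addresses need far more than a regex.

```python
import re

# Simplified, assumed patterns; real-world data needs much more careful handling.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{7,}\d")


def mask_contact_details(text):
    """Replace email addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)


print(mask_contact_details("Call me on 07700 900123 or mail jane.doe@example.com"))
```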

Building the machine learning model

Before building any machine learning model, it’s always worth creating a simple heuristic baseline to benchmark against. With our particular problem, we had a set of 19 sparsely distributed tags; if we always choose only the most common label or predict at random, our F1 score will be around 0.05.
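A quick back-of-the-envelope check shows why such a baseline scores so low. If one label is predicted for every email, its recall is 1 and its precision equals its prevalence, while every other label is never predicted and scores 0. The sketch below assumes that convention; the prevalence values are hypothetical, not our real label frequencies.

```python
def most_common_label_macro_f1(prevalence, n_labels):
    """Macro F1 when a single label is predicted for every email.

    prevalence: fraction of emails that truly carry the predicted label.
    """
    # For the predicted label: precision = prevalence, recall = 1.
    f1_predicted = 2 * prevalence / (1 + prevalence)
    # All other labels are never predicted, so their F1 is counted as 0.
    return f1_predicted / n_labels


# Hypothetical prevalences for the most common of 19 labels.
for p in (0.3, 0.5, 1.0):
    print(p, round(most_common_label_macro_f1(p, 19), 3))
```

Even in the extreme case where every email carries the predicted label, the score cannot exceed 1/19, which is roughly 0.05.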

We started the modeling phase with a Logistic Regression model on a Bag of Words representation of the data. This simple solution achieved an F1 score close to 0.35 and helped us ensure that all parts of the system worked properly, so that we could later focus purely on improving the accuracy of the model. A neural network was an obvious choice for that. We decided to evaluate two different neural network architectures: the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN). We found recurrent architectures such as GRUs and LSTMs harder to train and very close to CNNs in terms of performance (but not better). Although a bit surprising, our findings probably reflect the simplicity of our problem: each tag is usually associated with the presence or absence of particular phrases, so we don’t especially need to learn the long-term dependencies that LSTMs are designed to capture.

Below you can find the structure of our neural network which consists of a word embedding layer, two parallel convolutional layers, and a max pooling over the entire text followed by two fully connected layers; for each layer we applied batch normalization. In order to speed up the training we used word2vec embeddings as an initialization to our word embedding layer.

The whole architecture is surprisingly shallow. It was trained with a sigmoid cross-entropy loss for around 20 epochs over our dataset and achieves a macro-averaged F1 score of around 0.8 in production.
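Sigmoid cross entropy treats every tag as its own yes/no decision, which is what makes it suitable for multilabel problems. As an illustration, here is a plain-Python sketch of the loss for a single email, using the usual numerically stable formulation; the function name and the example values are ours, and averaging over labels is an assumed reduction.

```python
import math


def sigmoid_cross_entropy(logits, targets):
    """Mean per-label binary cross entropy, computed from raw logits."""
    losses = []
    for x, z in zip(logits, targets):
        # Stable form of -z*log(s) - (1-z)*log(1-s), where s = sigmoid(x).
        losses.append(max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x))))
    return sum(losses) / len(losses)


# One email, three labels: raw model outputs and the true 0/1 tags.
print(sigmoid_cross_entropy([2.0, -1.0, 0.0], [1, 0, 1]))
```

Unlike a softmax loss, nothing here forces the labels to compete, so an email can carry several tags at once.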

You can read more about text classification from the following list of useful papers:

  • Convolutional neural networks for sentence classification (link)
  • Semi-supervised convolutional neural networks for text categorization via region embedding (link)
  • Effective use of word order for text categorization with convolutional neural networks (link)
  • Comparative study of CNN and RNN for natural language processing (link)
  • Making a Manageable Email Experience with Deep Learning (link)

Deploying the model into production

Many recent papers, articles, and blog posts on machine learning focus only on improving the accuracy of a model. It’s worth emphasizing that modeling is only one of many steps in a data science project, and that other steps are equally important for the project to be successful.

A model that does not work in production is worth nothing.

From the first day you embark on a data science project, you should think about how you will expose your model — the sooner, the better. There are many reasons why a project can fail during deployment into a production environment.

We found three top reasons why this might happen:

  • Using the wrong technologies
  • Forgetting about software engineering practices
  • The lack of monitoring and support

Using the wrong technologies

To make sure that the wrong technology does not block your deployment, you need to choose your platforms and tools wisely. It’s worth using technology that can be easily moved between environments and modes (i.e., the code remains the same during training, prediction and serving).

We decided to build our models in TensorFlow and deploy them in Google Cloud Machine Learning. TensorFlow allows you to specify the architecture in a high-level Python API and run those models on distributed computing systems, including GPUs. Google Cloud Machine Learning provides managed services that enable you to easily expose your ML model as a REST API.

Forgetting about software engineering practices

When you focus on building the best machine learning model, it’s very easy to forget that you write normal code. There is no magic to this: software engineering best practices will help make your code easier to maintain. For a software engineer’s perspective on data science, please have a look at this presentation.

Monitoring and support

At Ocado, we believe that teams work better when they are self-sufficient (as they don’t need to wait for other teams). Thanks to technologies like TensorFlow and Google Cloud Machine Learning, data scientists can also write and support production code. We feel we have ownership of the whole solution, i.e., the data product, machine learning model, dashboards, alerting policies, etc.

A screenshot from the production dashboard built with Google Data Studio

Reaping the benefits

Thanks to this project, we were able to significantly boost the efficiency of the contact centre. For example, we found that 7% of all inbound messages did not require a reply; this meant that our customer service advisors could spend more time on higher-priority tasks.

Because the machine learning model automatically categorises emails, we have access to information more quickly than ever before and can react much faster to sudden spikes in customer issues.

The project has also had an impact on the overall customer experience: urgent emails are now answered up to four times faster than before.

Final remarks

We would love to hear your feedback about this article and project. If you have any questions or comments, feel free to drop us a line on social media.