How to Actually Use ML in Production: Reading Comprehension

How to build software—not experiments—with machine learning

Caleb Kaiser
6 min read · Dec 12, 2019

Chatbots have been a part of AI since the beginning.

Alan Turing’s famous imitation game—now formalized as the Turing Test, one of the most popular “tests” of artificial intelligence—was simply a test of whether or not a machine could communicate like a human.

In other words, the first formalized test of artificial intelligence was “does this chatbot work?”

Even now, chatbots are some of the most ubiquitous applications of machine learning. Just think about how many times you’ve tried to find help on a website, and found yourself talking to a chatbot.

The reason for chatbots’ popularity is simple: having a bot that can answer customer questions saves you a lot of time. reply.ai, a startup that makes support bots for companies, reports that a bot handles 40% of support tickets on average. How modern chatbots work, however, is where things get interesting.

Modern ML-powered chatbots rely on machine reading comprehension, which is an application of NLP (natural language processing) in which a machine learning model is trained to “understand” a piece of text—so much so that the model can answer questions about it.

To illustrate how this works, and to give you a sense of how to actually use machine reading comprehension in production, we’re going to build our own support bot.

Note that I’m not going to get under the hood of the model we’ll be using. The goal of this article is to demonstrate how to build software with machine reading comprehension, not to dive into the math behind it.

Step 1. Planning

The first thing we need to figure out is what we want to build. Do you want a troubleshooting bot for customers? Maybe an internal-facing bot that explains company policies and procedures? For illustrative purposes, I’ll be building a chatbot that can answer questions about the README of my company’s project, Cortex, but the bot can be pointed at any text. Whatever you choose, the approach will be the same.

Once we have a chatbot in mind, we have to decide which model we want to use and how we want to deploy it.

For this chatbot, we’re going to use ELMo-BiDAF, implemented with AllenNLP. You can get a more in-depth look at AllenNLP and ELMo-BiDAF here, but the high-level summary is:

  • ELMo-BiDAF is a question answering model that combines ELMo’s contextualized word embeddings, which model a word’s meaning in context, with the BiDAF (Bi-Directional Attention Flow) architecture, making it well suited for parsing both questions and source material.
  • AllenNLP is a framework that makes it easy to use ELMo-BiDAF. Major companies like Facebook and Airbnb use AllenNLP.

As we’ll see later on, when we initialize our model with AllenNLP, we’re given access to a very intuitive predict() method that provides the question answering functionality we’re looking for out-of-the-box.

Finally, we need to decide how we’re going to deploy our model. The best way to do this is to consider our application’s structure. Our chatbot needs to respond to requests in realtime, and it needs to be as lightweight as possible, since it runs in the user’s browser.

That means our model needs to be hosted somewhere else, and that our chatbot needs to be able to query it as needed. The natural choice, in this case, is to serve realtime inference from the cloud, which means deploying our model as a JSON API.

Now, with the groundwork laid, we can write some code.

Step 2. Implementing ELMo-BiDAF

For our API to work, we’re going to need to write a Python script that can serve predictions. For our purposes, our script (predictor.py) will be very simple, leaning heavily on AllenNLP’s built-in functionality.
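Here’s a sketch of what predictor.py can look like. The pretrained model’s archive URL and the exact function signature Cortex expects both vary by version, so treat the specifics as illustrative; the structure is simply “load the model once, answer questions in predict()”:

# predictor.py: load the pretrained ELMo-BiDAF model once, then serve predictions.
from allennlp.predictors.predictor import Predictor

# The archive URL below is AllenNLP's published ELMo-BiDAF model; adjust it for
# the AllenNLP version you have installed.
predictor = Predictor.from_path(
    "https://allennlp.s3.amazonaws.com/models/bidaf-elmo-model-2018.11.30-charpad.tar.gz"
)

def predict(payload):
    # payload is the parsed JSON request body: {"passage": ..., "question": ...}
    prediction = predictor.predict(
        passage=payload["passage"],
        question=payload["question"],
    )
    # BiDAF returns a dict of spans and scores; the answer text is in best_span_str.
    return prediction["best_span_str"]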

The script itself is pretty simple. We initialize the ELMo-BiDAF model using AllenNLP. Then, we define a predict() function which we will use to serve inference. Our predict() function takes a payload argument—which is the user’s input—and passes it into our model for a prediction.

Step 3. Deploying our API

Typically, to serve a model as a JSON API, there’d be a fair amount of infrastructure work required. You’d need to implement autoscaling, monitoring, rolling updates, etc. And while all of these features are necessary, implementing them yourself is difficult and time-consuming.

Instead of spending all that time and effort, we’re going to automate our infrastructure work with Cortex, an open source tool for deploying models as production-ready APIs on AWS.

With Cortex installed, all you need to deploy your model is:

  • A Python script for serving inferences (our predictor.py)
  • A config file for defining our deployment (which we will make now)
  • The Cortex CLI

Cortex needs a config file called cortex.yaml, which can specify everything from the name of the deployment to the compute resources allocated. For our purposes, cortex.yaml is going to be pretty simple.
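Something along these lines should work; the exact field names differ a bit between Cortex releases (and the compute block is optional), so check the docs for the version you installed:

# cortex.yaml: a sketch; field names vary slightly between Cortex releases.
- kind: deployment
  name: chatbot

- kind: api
  name: comprehender
  predictor:
    path: predictor.py
  compute:
    cpu: 1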

We’ll also need to create a file called requirements.txt, which will list the dependencies Cortex needs to run predictor.py. Our requirements.txt will look like this:

allennlp==0.9.*

And now, we can deploy our model by running cortex deploy and retrieve our endpoint by running cortex get comprehender.
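From the terminal, that’s just two commands (I’ve omitted the CLI output here, since it varies between Cortex releases):

$ cortex deploy             # package and deploy everything defined in cortex.yaml
$ cortex get comprehender   # check the API's status and grab its endpoint URL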

Deploying from the command line with Cortex

Once we have our endpoint, we can work on our frontend.

Step 4. Connecting our API to our bot

We’re not going to get into coding the frontend here, but whatever framework you use for web development should be able to consume JSON APIs. All you need to do is provide users with a form to message you, send their question to your API along with your documentation, and display your API’s response.

Our chatbot, for example, is just a simple message form on our site. On the backend, all it does is query our API with the user’s input and the text from the Cortex README. The JSON in our query is formatted like this:

{
  "passage": "Your support docs",
  "question": "Your user's input"
}
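As a sketch, querying the endpoint from any HTTP client looks like this (shown with Python’s requests library for illustration; your frontend’s HTTP client works the same way, and the endpoint URL below is a placeholder for whatever cortex get comprehender returned):

import requests

# Placeholder: substitute the endpoint URL returned by `cortex get comprehender`.
ENDPOINT = "https://<your-api-endpoint>/comprehender"

response = requests.post(ENDPOINT, json={
    "passage": open("README.md").read(),    # your support docs
    "question": "How do I deploy a model?"  # the user's input
})
print(response.text)  # the model's answer, as returned by predict()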

Because my bot is only answering questions about the Cortex README, my passage will always be the same. It would be trivial, in situations like these, to embed my passage directly in my predictor.py to simplify my JSON.
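If you go that route, the predictor.py sketch from Step 2 only needs a small change: read the passage once at startup (the README path below is illustrative) and take only the question from the client:

# In predictor.py: bake the passage in, so the client only sends a question.
PASSAGE = open("README.md").read()

def predict(payload):
    prediction = predictor.predict(passage=PASSAGE, question=payload["question"])
    return prediction["best_span_str"]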

Similarly, if you are going to be answering questions about a multitude of documents, it’s easy to envision some interface in your bot that has users select a category their question belongs to, which in turn tells your bot which document to pass in its JSON.

But, regardless of how you choose to handle your JSON, this is it! You now have a chatbot connected to a production-ready API, no devops work required.

Using machine reading comprehension—in any industry

While my bot was fairly simple, there is no ceiling on what you can build with this approach. To illustrate, ask yourself two questions:

  • What topics do you (or your team) get asked about most?
  • Do any of those topics have good documentation?

Any topic with thorough (or even just usable) documentation can be turned into an ML-powered chatbot.

For example, if you work in healthcare, you probably get a lot of questions about insurance. And while information on insurance coverage is available, it’s often buried in labyrinthine policy documents. Imagine a bot that could answer users’ questions for them, without wasting your time or making users dig through pages of legalese.

Or what about retail? What percentage of customer questions do you think are about locating an item, or understanding a store’s return policy? It’s not hard to imagine a bot that could field the vast majority of these questions automatically.

Regardless of what you ultimately choose to build, the approach we’ve used in this guide is all you need to get a machine reading comprehension API up and running.

If you have any questions about this project or approach, please feel free to ask them in the Cortex Gitter!
