What time? NLP with Red Sift.

Chris Savvopoulos
Published in Red Sift Outbox
Jan 26, 2017

In the last post, we created a Machine Learning app that looks at your emails, finds the ones with a scheduling intent, and counts them. Let’s apply some more advanced Machine Learning techniques to extract the proposed time of the meeting and get one step closer to building a virtual secretary!

Named-Entity Recognition (NER) is an NLP task whose goal is to extract entities out of documents; in our case, event times out of emails.

Once again, this is a supervised Machine Learning problem; we will need a bank of examples with which we can train a classifier to decide whether each particular word is part of the Event Time or not. That way, our program will learn that words that follow the phrase “how about” have a good chance of being an Event_Time expression, especially if they contain numbers, end in -day, are words like a.m., March, etc. Given enough examples, it will also learn that “how about the green one” does not contain an event time.
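To make the per-word framing concrete, here is a minimal sketch of how one annotated example could be turned into one label per word. The whitespace tokenizer, the `label_tokens` helper, and the character offsets are illustrative assumptions, not necessarily how the tutorial's dataset or spaCy represent things.

```python
def label_tokens(text, span_start, span_end, label="Event_Time"):
    """Return (token, label) pairs; tokens overlapping the span get `label`, others 'O'."""
    labeled = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)  # character offset of this token
        end = start + len(token)
        pos = end
        inside = start < span_end and end > span_start
        labeled.append((token, label if inside else "O"))
    return labeled


example = "how about coffee tomorrow at 5pm?"
# Hypothetical annotation: the span covering "tomorrow at 5pm".
span_start = example.index("tomorrow")
span_end = example.index("5pm") + len("5pm")
pairs = label_tokens(example, span_start, span_end)
for token, tag in pairs:
    print(token, tag)
```

A classifier trained on many such pairs is what learns that words after “how about” are often, but not always, part of an event time.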

Named Entity Recognition in 5 minutes

Just like in the last tutorial, we will use open source tools to achieve this, so you can train your own state-of-the-art tagger in less than 5 minutes!

To follow this tutorial you will need Python (ideally Python 3) and spaCy. Let’s go over the steps to train our NER model:

#1 Install spaCy:

pip3 install spacy==1.3

#2 If you haven’t followed the previous tutorial, or don’t have its code anymore, run the following commands to catch up:

git clone --recursive https://github.com/savvopoulos/classifier-sift
cd classifier-sift/
git checkout a34c434e21
git checkout -B ner
cd server/fastText
make
cd ../
./train.sh

If you did follow it and still have the code, you will need to move classifier/fastText to classifier/server/fastText and adapt parse.js with the new location.

#3 Save spaCy’s train_ner.py in your classifier-sift/server/ directory. We will use this as a base and adapt it for our purposes.

#4 Run it with: python3 train_ner.py

You might get a warning that you need to download spaCy’s models or accuracy will suffer. For this tutorial, you can skip downloading the models because doing so will slow things down; see the end of this post for how to use them.

In this example, there’s a dataset with two entries that gets trained to detect person and location; when it’s done running, it should print:

Who 896 897 ""  2
is 716 716 "" 2
Shaka 980 981 "" PERSON 3
Khan 984 985 "" PERSON 1
? 983 983 "" 2

This is an easy example; all the tagger has to do is learn that Shaka Khan is a person, which it does based on just two examples. In our case, we will use hundreds of examples to solve a much more complex problem.

#5 Next, save the dataset into the classifier-sift/server/ directory.

#6 Then, we’ll change spaCy’s example to work for our problem:

  • first change the hardcoded train_data variable to this line, to load our dataset:
train_data = json.load(open('train_ner.json'))
  • in the next line, change the entity types that we want to detect from PERSON and LOC to Event_Time.
  • finally, change the test example from ‘Who is Shaka Khan?’ to ‘how about coffee tomorrow at 5pm?’ Your final train_ner.py file should look like this.
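Before training, it can help to sanity-check the dataset. The sketch below assumes each entry in train_ner.json has the same shape as the hardcoded train_data in spaCy's example, that is `[text, [[start, end, label], ...]]` with character offsets. That shape is an assumption about the file, the `check_entries` helper is hypothetical, and the inline sample stands in for the real file.

```python
import json


def check_entries(entries, allowed_labels=("Event_Time",)):
    """Return a list of human-readable problems found in the entries."""
    problems = []
    for i, (text, spans) in enumerate(entries):
        for start, end, label in spans:
            if label not in allowed_labels:
                problems.append("entry %d: unexpected label %r" % (i, label))
            if not (0 <= start < end <= len(text)):
                problems.append("entry %d: bad offsets (%d, %d)" % (i, start, end))
    return problems


# Inline sample standing in for the contents of train_ner.json (assumed shape).
sample = json.loads('[["how about coffee tomorrow at 5pm?", [[17, 32, "Event_Time"]]]]')
text, spans = sample[0]
print(text[spans[0][0]:spans[0][1]])  # the annotated substring
print(check_entries(sample))          # an empty list means no problems were found
```

To check the real file, you would pass `json.load(open('train_ner.json'))` to `check_entries` instead of the sample.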

#7 Let’s run our script again:

python3 train_ner.py

You should now get something like this — although the annotated tokens may vary for you, see below:

how 552 552 WRB  2
about 513 513 IN 2
coffee 2105 2105 NN 2
tomorrow 2021 2021 NN Event_Time 3
at 507 507 IN Event_Time 1
5 779 779 CD Event_Time 1
pm 2868 2868 NN Event_Time 1
? 482 482 . 2

This is the test example we added earlier, in step #6. The tokens tomorrow, at, 5, and pm have been correctly identified as event time tokens (you might get different results; see below). The numbers that follow specify whether the token is at the beginning of an Event_Time expression (3), inside one (1), or outside (2). The initials stand for parts of speech such as noun (NN), preposition (IN), or number (CD).
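The begin/inside/outside codes are what let you stitch individual tokens back into whole expressions. The following is a minimal sketch that works on plain tuples rather than spaCy tokens; the `group_entities` helper is hypothetical, and the input simply mirrors the output listed above.

```python
# Codes as described above: 3 = begins an entity, 1 = inside, 2 = outside.
BEGIN, INSIDE, OUTSIDE = 3, 1, 2


def group_entities(tagged_tokens):
    """Group (text, entity_type, code) triples into (entity_type, [tokens]) spans."""
    spans = []
    current = None
    for text, ent_type, code in tagged_tokens:
        if code == BEGIN:
            current = (ent_type, [text])
            spans.append(current)
        elif code == INSIDE and current is not None:
            current[1].append(text)
        else:
            current = None  # outside any entity: close the open span, if any
    return spans


tagged = [
    ("how", "", 2), ("about", "", 2), ("coffee", "", 2),
    ("tomorrow", "Event_Time", 3), ("at", "Event_Time", 1),
    ("5", "Event_Time", 1), ("pm", "Event_Time", 1), ("?", "", 2),
]
spans = group_entities(tagged)
for ent_type, tokens in spans:
    print(ent_type, " ".join(tokens))
```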

Sometimes, some of the words might not be detected correctly. The results will vary from run to run, and will not be as good as above if you didn’t download the spaCy model.

Plug it into the Sift

Congrats! You now have a basic NER model that will extract the event time from your emails. All we need to do now is put this model into a node in our classifier-sift and connect it with the output of the fastText classifier from the previous tutorial. Even though our previous post was in NodeJS, Red Sift allows us to seamlessly plug in Python code and make it all work together!

Note: because of Docker’s limitations on OS X, you will need to run the rest on a Linux box — even if it’s a virtual instance inside of VirtualBox. Let us know in the comments, if you get stuck. Note that if you access that instance remotely, you will need to provide the remote host flag.

#1 Install Docker—Red Sift depends on it to run multilingual sifts.

#2 Let’s change the Parse node to return the email body in addition to whether there was a scheduling intent, and the word count. Open classifier-sift/server/parse.js, and change const wordsValue to the following:

    const wordsValue = {
      words: countWords(body),
      schedIntent: hasSchedulingIntent(jmapInfo),
      text: 'SUBJECT: ' + (jmapInfo.subject || '') + ' ' + body
    };

#3 Next, let’s create a new file, classifier-sift/server/ner_node.py. For now, we will simply make it print the number of inputs to the console:

def compute(req):
    print('Got', len(req['in']['data']), 'values.')
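If you want to try the node function outside of Red Sift first, a small local harness along these lines may help. It assumes only the request layout that this tutorial's code relies on (`req['in']['data']`, with each item carrying a JSON string under `'value'`); the real request may hold more fields, `make_request` is a hypothetical helper, and the sample values are made up.

```python
import json


def compute(req):
    # Same body as the node above: report how many values arrived.
    print('Got', len(req['in']['data']), 'values.')


def make_request(values):
    """Build a dict with the req['in']['data'] layout that compute() reads."""
    return {'in': {'data': [{'value': json.dumps(v)} for v in values]}}


# Made-up stand-ins for the wordsValue objects produced by the Parse node.
fake_req = make_request([
    {'words': 6, 'schedIntent': True,
     'text': 'SUBJECT: Coffee how about coffee tomorrow at 5pm?'},
    {'words': 3, 'schedIntent': False, 'text': 'SUBJECT: Hi see you soon'},
])
compute(fake_req)
```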

#4 We also need to register the node, which we can achieve by adding the following snippet inside of the “nodes” array in sift.json:

    {
      "#": "Tagger",
      "implementation": {
        "python": "server/ner_node.py",
        "sandbox": "quay.io/redsift/sandbox-python:v3.4.3"
      },
      "input": {
        "bucket": "messages"
      },
      "outputs": {
        "messages": {}
      }
    },

This tells Red Sift to create a Python node, specifying the Python sandbox to use, which bucket to receive inputs from, and which bucket to write its outputs to.

#5 Create a new file, server/requirements.txt, and in it add a single line:

spacy==1.3

This tells Red Sift to load the spacy Python module at version 1.3.

#6 Run the Sift again; be aware that this can take a long time. In the console, you should see “Got 2 values.” The number of values that you get should match what you see in the frontend.

#7 From this spaCy example, copy the load_model method and all its imports to the top of classifier-sift/server/ner_node.py (the highlighted lines).

#8 Next, let’s load the model. In the top-level scope, add this:

(nlp, ner) = load_model('server/ner')

def compute(req):
    ...

#9 We need to parse the wordsValue objects from JSON strings; add this import to the top of the file:

...
from spacy.vocab import Vocab
import json
...

#10 Now, let’s use the loaded models to extract event times in emails and print them to console (for now!). First, let’s read the input JSONs. Add this inside of compute():

    data = [json.loads(x['value']) for x in req['in']['data']]

If you are not used to Python, remember that indentation matters, so make sure to keep those spaces.
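The one-liner above assumes every value is well-formed JSON with a text field. As a more defensive variant, the sketch below skips anything that does not parse or lacks `'text'`. The `parse_values` helper is hypothetical and the sample request is made up; whether malformed values can actually reach the node is not something this tutorial establishes.

```python
import json


def parse_values(req):
    """Decode each input value, skipping entries that are not valid JSON or lack 'text'."""
    records = []
    for item in req['in']['data']:
        try:
            record = json.loads(item['value'])
        except (ValueError, TypeError):
            continue  # not a JSON string: ignore this entry
        if isinstance(record, dict) and 'text' in record:
            records.append(record)
    return records


req = {'in': {'data': [
    {'value': '{"words": 6, "schedIntent": true, "text": "SUBJECT: Coffee how about 5pm?"}'},
    {'value': 'not json'},
]}}
records = parse_values(req)
print([r['text'] for r in records])
```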

#11 Finally, let’s make our node print all the event time words in the input:

    ...
    data = [json.loads(x['value']) for x in req['in']['data']]
    for datum in data:
        doc = nlp.make_doc(datum['text'])
        ner(doc)
        print('Found event times:', [x for x in doc if x.ent_type_ == 'Event_Time'])

Your final ner_node.py should look like this.

#12 Time to see if it worked! Run your Sift; in the console, you should see something like this:

Found event times: [5, pm]
Got 1 values.
...
Found event times: [Thursday, 8, am]
Got 1 values.

Improving the Model

In order to speed this tutorial up, we cut a few corners — and the accuracy of the model suffered as a result. There are a few ways to improve this:

#1 Download the spaCy model with:

python3 -m spacy.en.download --force all

#2 Train for longer. spaCy’s train_ner.py uses the training examples only for 5 iterations. You can improve the model by changing that, in line 14, inside of the train_ner() method: change range(5) to range(50). This should make the results much more consistent.
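To show what the iteration count controls, here is a minimal sketch of a multi-pass training loop with the number of passes exposed as a parameter. It does not call spaCy: `update_fn` is a hypothetical stand-in for whatever per-example update the training script performs, and reshuffling the examples on each pass is an assumption modeled on common practice.

```python
import random


def run_epochs(train_data, update_fn, n_iter=50, seed=0):
    """Call update_fn(text, spans) on every example, in shuffled order, n_iter times."""
    rng = random.Random(seed)
    examples = list(train_data)  # copy so the caller's list is not reordered
    for _ in range(n_iter):
        rng.shuffle(examples)
        for text, spans in examples:
            update_fn(text, spans)


calls = []
data = [
    ("how about 5pm?", [(10, 13, "Event_Time")]),
    ("see you Thursday", [(8, 16, "Event_Time")]),
]
# Record which examples are visited instead of updating a real model.
run_epochs(data, lambda text, spans: calls.append(text), n_iter=50)
print(len(calls))
```

Each extra pass gives the model another look at every example, which is why a small dataset tends to benefit from more than 5 iterations.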

#3 When we registered the node in step #4, we instructed Red Sift to use a Docker image that contains Python. There’s an image specific to Machine Learning tasks in Python that, among other tools, contains spaCy and its models. To use it, simply replace the value of “sandbox” with:

quay.io/redsift/sandbox-scipy:v3.4.3

We now have the beginnings of a virtual assistant, built on top of state-of-the-art Natural Language Understanding techniques. The next step will be to close the loop by communicating the results to the user, on Slack. This will be the topic of our next post, so stay tuned!

If you want some background information on programming with Red Sift, you can go to our online documentation and video tutorials at docs.redsift.com.

Chris Savvopoulos
Red Sift Outbox

Software Engineer focusing on ML, Big Data, and scalable backends. Ex-Senior SWE @Google, on Maps.