Building an Autosuggest Corpus, Part 2
Using NLP to extract key phrases from documents.
In the last post we talked about how to decide between query logs and documents as your data source. For our product we settled on using our documents. In this post we’ll talk about how to turn documents into suggestions. This post will be longer, a bit more involved, but it should be a ton of fun. There will be code and math, but don’t worry, you should be able to follow along if you have some rudimentary programming skills and know how division works.
When Life gives you Natural Language, just Process it.
Our goal is to produce a list of terms or short phrases that we can suggest to users as search queries. Our documents are composed of structured and unstructured fields. For instance, you might have a
person field with a value like “Buffalo Bill”. It’s easy to imagine that with a little bit of normalization these structured fields might make decent suggestion candidates. But often the high value phrases are buried deep in unstructured fields like
title. In the example above, if we only indexed structured fields we would miss out on “wild west”, which may be much more likely as a query to the uninitiated user.
The task of determining key phrases from text is called terminology extraction. Like most data sciencey problems, there are generally three ways to approach it:
- Ad-Hoc Heuristics — You can hand-tune some rules about what phrases to keep and what phrases to discard.
- Supervised Learning — You can get some training data and train a model to extract phrases for you. For instance, you could represent this task as a binary classification problem where given a word or phrase, a model will give you a yes/no answer.
- Unsupervised Learning — You can use unsupervised techniques like TextRank that exploit patterns in the text to discover important phrases.
In this post, we’ll focus on the the first. It’s a decent way to get started, and these basic NLP techniques will be useful to prepare your data should you choose to move on to more sophisticated strategies later on.
One important caveat is that unlike some terminology extraction tasks, our goal isn’t only to extract interesting terms/phrases, it’s to extract phrases that users are likely to search for, and that would yield a reasonable number of results. So the suggestion corpus should probably contain specialized terminology as well as more general terms that users are likely to use.
We’ll be using NLTK, a python library with a ton of utilities, corpora, and trained-models for common NLP tasks. It’s tremendously chill.
The code in this post will be edited for narrative purposes, but you can find fully functioning examples in the accompanying Jupyter notebook.
We’ll be playing with a dataset of descriptions from the Denver Public Library’s Western Art Collection.
Another name for a sequence of words is an n-gram. A sequence of a single word is a unigram, two words is a bigram, and so on. Note: the term n-gram is sometimes used to denote sequences of other linguistic atoms like characters, syllables, etc. In the context of this blog post, we’re always talking about words. Let’s start by looking at simple numeric approaches to gathering unigrams.
The gist of the techniques we’re going to talk about is to gather statistics about the appearance of terms in your corpus and use them to help you decipher which terms are important. Here’s a really dumb example:
We stream through all of our documents and collect counts for how often each word appears. Easy peasy. This tiny program illustrates some of the pieces in a typical NLP pipeline:
- Tokenization — the process of turning a long string into a sequence of words (or tokens). In this case, we’re breaking up the string whenever we encounter whitespace.
- Normalization — the process of mapping a token to a more canonical form, which allow us to treat variants of a token identically. In this case, we’re just making our analysis case-insensitive by lower-casing everything.
A production pipeline will have more components and will be more sophisticated and resilient to the variance of human-produced text, but the idea is the same: you are building a pipeline of components that create, transform, and filter a stream of tokens. Lucene calls this “Analysis”, or the “Analysis Chain”. A lot of your blood, sweat, and tears will be spent here, and small changes in this pipeline will have out-sized effects on your output.
NLTK has a dictionary specifically designed for capturing token statistics. It defaults missing terms to 0 and has a common API of utility methods that other NLTK objects depend on, so let’s use that and abstract away the analysis:
Now we have a dictionary of words and their frequencies. What do we do with it? We have to decide which words to keep in our suggester. What are the top words?
These aren’t terribly useful. A lot of them are stop words — the glue words that we need to string sentences together but that don’t really add any meaningful information. NLTK has a corpus of english stopwords we can use, let’s try filtering them out:
Getting somewhere, but we can do better. We’ve gotten rid of the glue, but there are still some low-value terms like “two”, “left”, “right”, “side”, etc. No one is going to search for these words in isolation.
Let’s try a slightly different tack. Rather than considering all the terms across the entire corpus, perhaps we can get better results by only considering the best N terms from each document:
How do we measure the importance of terms in a document? An easy way to get started is our good old friend Term Frequency — Inverse Document Frequency (TF-IDF). In case you aren’t old friends, here’s a primer. Below is a sketch of the math for TF-IDF, ignoring a log, a constant, and the many variations you can choose from. It’s OK to ignore some of these details — we’re just trying to develop a mathematical intuition for how it works.
TF-IDF is a real number that estimates the importance of a term (or terms with a summation) in a document (which is implied above). The higher, the better.
There are two ideas here. First, as we observed earlier, extremely popular words tend to not be super useful. This idea is codified in the denominator, i.e. Inverse Document Frequency (IDF). The more frequent a word is in your corpus, the larger the denominator, and thus the smaller your TFIDF score. Second, if a document mentions a term like “wild” a ton of times, and that term isn’t particularly popular in the corpus, that term is probably important for the given document. The more a document mentions it, the higher the numerator, and the higher your TFIDF score.
Here’s an example of picking the top three unigrams for a tiny document:
[‘fronting’, ‘snowbank’, ‘business’]
If you have short documents, like we do in this case, the TF part of TF-IDF may be pretty low-signal, in which case you can try using IDF alone or set TF to 1.
Here’s what the top terms look like if we use IDF and limit to 10 terms per document:
These are pretty good. There are still some low-value verbs and prepositions like “via”, “reached”, “near”, “towards”, etc. You’ll see the set change pretty dramatically if you fiddle with the number of terms to keep per document.
What a POS
It seems like we’re mostly interested in nouns. Can we use our knowledge of language to help filter down our candidates? Yes! We can use a Part-of-Speech (POS) tagger. Given a stream of natural language, a POS-tagger will output a stream annotated with syntactic information like “noun”, “adjective”, “preposition”, etc. NLTK comes with a bunch of pre-trained POS models you can leverage:
[(‘Several’, ‘JJ’), (‘men’, ‘NNS’), (‘stand,’, ‘VBD’), (‘and’, ‘CC’), (‘one’, ‘CD’), (‘sits,’, ‘NN’), (‘on’, ‘IN’), (‘the’, ‘DT’), (‘corner’, ‘NN’), (‘boardwalk’, ‘NN’), (‘opposite’, ‘IN’), (‘the’, ‘DT’), (‘hotel.’, ‘NN’)]
You can find a list of tags here. Underneath the hood, POS taggers are often supervised machine learning models trained on corpora painstakingly annotated by humans. They might depend on features like “is the current word capitalized?”, “was the previous word a preposition?”, etc.
Once you have a tagged sentence, you can filter the sequence down to only the tags you are interested in. Here’s what the most frequent unigrams look like after we’ve filtered to noun variants:
Neat. We succeeded in removing stuff like “via”, “reached”, etc. We can also try combining this technique with the IDF approach (i.e. only take the top N nouns per document):
Now that we’ve got a decent list of unigrams, let’s turn our attention to longer sequences, like “wild west”. 2+-grams are sometimes called collocations. NLTK has some built-in facilities for finding these. Let’s look at an example:
The idea is that we collect statistics about how often two terms appear together, and in isolation. Then we use some statistical tests to figure out which pairs occur often enough to suggest that they are a meaningful concept together.
Note: You don’t have to collect your own
BigramCollocationFinder will take a sequence of tokens and do that for you. I did it to give you a glimpse of what’s happening under the hood, but it’s actually easy to screw up, which in fact, I did while writing this post.
NLTK offers a range of statistical tests you can use, but let’s look at the math behind Point-wise Mutual Information because it’s easy and illustrative (again ignoring a log):
If you take any two contiguous words in a corpus chances are good they aren’t a meaningful concept as a phrase. You can think of PMI as encoding that idea in a hypothesis: the two words are independent. Whether or not they appear together is pure chance.
The numerator is the probability that the two words actually occur together. The denominator is the probability that they would appear together if you assumed they were independent. Let’s make that real by doing the math for a simple example:
As the numerator goes up, the terms appear together more often than you would expect if they were truly independent, and thus PMI goes up. High PMI bigrams are probably legit concepts. As the numerator goes down, the occurrence of the terms looks more and more like randomness.
Unfortunately PMI doesn’t have a fixed range. So the PMI values you will see will vary based on your corpus. To filter down to a decent set of bigrams, you’ll have to take a look at the results and come up with a reasonable threshold. You will probably also want to limit your results to only those bigrams that occurred a reasonable number of times — the probabilities will be unreliable for tokens that appear only a handful of times.
Now, here are some bigrams!
TIL about color notes and bracketed cornices. Awesome! Again, we find some non-noun stuff like “made color”, “former tabor”, and “reached via”. We can try playing with POS tagging again to help with that.
Piece of Chunks
This time instead of filtering to noun unigrams, we’re interested in finding noun phrases like “episcopal church” or “watercolor paintings”. Often these will contain adjectives, prepositions, etc. So we will need a mechanism to segment our text that allows us to capture this complexity. NLTK calls this “chunking” and provides a parser that will segment some text for you into labeled chunks according to a regex grammar you provide:
(NounPhrase music/NN stands/NNS)
(NounPhrase chair/NN backs/NNS)
Once we have noun-phrase chunks, we can use these to filter the set of candidate bigrams and perform the same analysis:
Mixed bag — we got rid of “reached via” but our first-pass grammar was too restrictive so we also killed “bracketed cornices”. We should go back and adjust it to allow for adjectives and other intervening tags. This is particularly important as you look at trigrams and longer phrases. The challenge with using POS annotations is the complexity of designing a grammar that will strike a reasonable balance between precision and recall.
Another note of caution about POS taggers is that they tend to work better on longer prose as opposed to short strings from structured form fields, and heavily processing the tokens before using them may reduce their effectiveness. For instance, if you lower-case and remove punctuation before you POS-tag, and the model depends heavily on capitalization and punctuation features, you may get poor results.
Trigrams and beyond are left as an exercise to the reader.
Iterating and Evaluating
There’s a lot of painstaking labor involved in any data science task and this is no exception. You’ll probably have to try a bunch of these strategies, and/or combine a few to get reasonable results. There are a ton of hyperparameters we can play with like minimum frequency thresholds, number of terms to keep per document (in the TF-IDF approach), minimum PMI, POS grammars, etc. How do we know if we’re getting anywhere?
First, I would recommend getting a visceral sense for how your corpus changes as you fiddle with the knobs. Looking at the top N in a notebook visualization is not bad, but it also helps to see a more comprehensive changeset. I will often generate a text file and then sort it and use diff:
diff -u unigrams.csv pos_unigrams.csv@@ -3941,30 +3048,20 @@
Even though we’re using heuristics, we can still leverage supervised techniques to tune our thresholds or evaluate our models. For instance, you could take a small sample of phrases and annotate them with yes/no labels, which indicate whether or not we want to keep the phrases in our final list of suggestions. Then we can use any number of binary classifier metrics to evaluate how we’re doing:
Baseline F1: 0.473684
POS Filtered F1: 0.800000
In this example we’re measuring the F1 score, a metric which combines precision and recall into a single figure between [0, 1], the higher the better. Looks like POS filtering works better than just keeping any terms that appear more than 5 times. Sweet!
Remember if you are tuning hyperparameters, you’ll want to split your data into validation and test sets lest you bias your outcomes.
Of course, if you produce enough training data you could just train a classifier, perhaps using some of the techniques we’ve discussed to prepare your candidates or to generate features.
And naturally, the most effective way to evaluate your suggestions, if you have enough traffic, is to get them in front of users and run an AB experiment, probably measuring some of the metrics we talked about in the first post.
In the next post we’ll talk about building an autosuggest engine to serve up the corpus we just constructed.