Unigrams in Elasticsearch: Finding Words by Letter Frequency

Dave Sanderling
Learning.com Tech Blog
6 min read · May 9, 2018

We’ve been using Elasticsearch here at Learning.com for a few years, in order to satisfy a common use case: quickly searching through zillions of log entries for a few keywords. But what if you wanted to zoom in — and search for single words that had high occurrences of a few key letters? This requirement came up during development of Adaptive Keyboarding, our flagship typing app that has the ability to drill students on words containing their top three “practice keys.” For example, if you have been having trouble hitting ‘t’, ‘h’, and ‘e’, our app might prompt you with “teacher,” “theme,” and “Earth.” As it turns out, Elasticsearch solves this very easily.

Maybe we should call it Elastic Keyboarding

A Little Background

Elasticsearch (“ES” from here) is a “real-time search and analytics engine” used by such giants as Wikipedia, GitHub, and Stack Overflow. It is basically a document database with a set of highly configurable indexing and search tools. A basic ES instance consists of an index (a database) of many documents (records, in JSON format); each document field is searchable thanks to an inverted index (a mapping from each unique word in the database to the documents that contain it). You might envision the database as a book, where each page is a “document” with a bunch of words on it. At the end of the book is an (inverted) index that tells you, for each word, which pages (which documents) contain that word.

[note: I will use the word ‘database’ when referring to an ‘index,’ so as to avoid confusion with the ‘inverted index’ concept.]

In the case of our logs, a “document” is a log entry, which has fields for timestamp, log text, logging level, environment, etc., each of these fields having its own inverted index — meaning you can search on any or all of them. When you tell ES that you want to find all documents with log text containing “NullReferenceException” (not that our dev team would ever need to; please humor me for the sake of illustration), it looks up that search term in an inverted index for that field. It then returns a list of results that are sorted by relevancy (more on this later).
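For illustration only (our real log schema is beside the point), that kind of search boils down to a simple match query. The index and field names here are hypothetical:

GET logs/_search
{
  "query": {
    "match": {
      "log_text": "NullReferenceException"
    }
  }
}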

Analyze All The Things!

When you set up a basic database like the one we use for logs, ES uses a default analyzer to build the inverted index for each field. An analyzer in ES is responsible for taking the raw documents and creating the searchable terms, or tokens. By default these terms are the individual words in your documents’ text, but you can customize the analyzer using any combination of the following (a quick sketch follows the list):

  • character filters which tidy up the string prior to tokenization (such as “html_strip,” which strips html tags)
  • tokenizers which split the full text into individual tokens, usually by whitespace
  • token filters which can change, remove, or add tokens (such as “lowercase”)
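You can try a combination out with the _analyze API before committing to it. A rough sketch, using the ES 5+ parameter names (older versions spell them slightly differently):

GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<b>Earth</b> Day!"
}

This strips the tags, splits on word boundaries, and lowercases, producing the tokens ‘earth’ and ‘day.’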

Your analyzer is not only used at setup time, however. Every time a search is performed, the query itself is passed through the same analyzer that the documents went through when they were initially indexed — or else you might not get any results! For example, you would usually want the lowercase filter applied at both ends, so that “Earth” matches “earth”. ES also ships with smarter filters; the word_delimiter token filter, for instance, splits tokens on case changes, so a search for “reference” could find “NullReferenceException”.
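One way to see the difference query-time analysis makes: a match query runs its text through the field’s analyzer, while a term query does not. Assuming that hypothetical logs index again:

GET logs/_search
{
  "query": { "match": { "log_text": "Earth" } } //analyzed to ‘earth,’ so it matches
}

GET logs/_search
{
  "query": { "term": { "log_text": "Earth" } } //not analyzed; looks for the literal token ‘Earth’ and misses
}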

Rainbows and Unigrams

trigrams of fnord

The humble ngram token filter is how we solved our puzzle in Adaptive Keyboarding. An ngram, according to the ES docs, “can be best thought of as a moving window on a word.” Here, n is the size of that window. For example, if we pass the word “fnord” through a 3-gram filter, the tokens ‘fno,’ ‘nor,’ and ‘ord’ are produced. Zooming all the way in, of course, we find the 1-gram, or unigram, which splits a word into single letter tokens. Ta-da! This was exactly what we needed. Since, as mentioned above, a token filter may change or add tokens, the ngram token filter (where n=1) is perfect for indexing our words into searchable one-letter tokens.
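If you want to watch the window slide, the _analyze API will also take an inline filter definition (ES 5+ syntax; older versions differ slightly):

GET _analyze
{
  "tokenizer": "standard",
  "filter": [{ "type": "ngram", "min_gram": 3, "max_gram": 3 }],
  "text": "fnord"
}

This returns ‘fno,’ ‘nor,’ and ‘ord’; change both gram sizes to 1 and you get the unigrams ‘f,’ ‘n,’ ‘o,’ ‘r,’ ‘d.’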

By creating a database of words where each word is a document indexed using a unigram analyzer, we are able to search for words containing a high concentration of “practice keys” (the search term). For example, if there is a document for the word “beginning,” the unigram analyzer will index it on the tokens ‘b’, ‘e’, ‘g’, etc. When we search with the query “gbn,” the query string will also be analyzed using unigrams (tokenized into ‘g’ ‘b’ ‘n’) and will likely return the document “beginning” with a high relevancy score.

Show me the code already

When creating our ES database, we first defined the document schema and the analyzer (with the unigram token filter). Since ES speaks REST, we used something like this:

PUT unigram_words //name of the index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_unigram_filter": { //we name our new filter
          "type": "ngram", //magic happens here
          "min_gram": 1, //define the "window" size
          "max_gram": 1
        }
      },
      "analyzer": {
        "my_unigram_analyzer": { //we name our analyzer
          "type": "custom",
          "tokenizer": "standard", //nothing fancy in the tokenizer;
                                   //the token *filter* does the cool stuff
          "filter": [
            "lowercase", //ignore case
            "my_unigram_filter" //use the filter we defined above
          ]
        }
      }
    }
  },
  "mappings": {
    "ak_word": { //schema for our document
      "properties": {
        "word": {
          "type": "string",
          "analyzer": "my_unigram_analyzer" //use our unigram analyzer on the "word" field
        },
        "library": {"type": "string"},
        "grade_min": {"type": "integer"},
        "grade_max": {"type": "integer"}
      }
    }
  }
}

That’s it! Then we were able to use a script to insert thousands of words, one per document.
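The insert script itself is nothing special. As a rough sketch of the idea (the words, library names, and grade ranges below are made up), the bulk API lets you load many documents in one request:

POST unigram_words/ak_word/_bulk
{ "index": {} }
{ "word": "beginning", "library": "core", "grade_min": 2, "grade_max": 5 }
{ "index": {} }
{ "word": "teacher", "library": "core", "grade_min": 1, "grade_max": 4 }
{ "index": {} }
{ "word": "theme", "library": "core", "grade_min": 3, "grade_max": 6 }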

Relevancy Score

When searching through logs, an entry containing two or more instances of your search term would be considered a better match than an entry containing only one. This is because ES, by default, scores results based largely on term frequency. By the same token (pun totally intended), a word in our AK library containing more of our practice letters scores higher. For example, when searching “DL”, the word “doll” scores higher than “old”. Here is that query:

GET unigram_words/ak_word/_search
{
  "query": {
    "match": {
      "word": "dl"
    }
  }
}

And here are the top 10 results:

doll (score=4.79)
hold (4.23)
would (3.96)
could (3.77)
cold (3.76)
children (3.76)
old (3.76)
all (3.68)
tell (3.68)
did (3.00)

Down the Rabbit Hole (or not)

It seems that everything in Elasticsearch is customizable, including how the scores are calculated. But we really didn’t see a need to dig deeply into that, at least on this first iteration of our app. For our purposes, a practically-out-of-the-box, only-slightly-tweaked solution made things work like a charm. If we get some downtime between future projects (ha!) we might consider honing the scoring algorithm. For example, one of the other things that ES uses to calculate relevancy is inverse document frequency, which means that the more often a term (or a letter, in our case) appears in the entire database, the lower weight it will carry in a specific calculation. That shouldn’t matter to us, so we could probably turn it off and get slightly less biased results, for those students who really need to practice their letter ‘e’. But for how little time we’ve spent developing this feature, we have gotten huge value.
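If we ever circle back to it, newer ES releases support scripted similarities, so one option would be a per-field similarity that scores on term frequency alone and skips IDF entirely. This is just a sketch of the idea, not something we’ve shipped (note it uses the newer “text” mapping syntax rather than “string”):

PUT unigram_words_tf_only
{
  "settings": {
    "analysis": {
      "filter": {
        "my_unigram_filter": { "type": "ngram", "min_gram": 1, "max_gram": 1 }
      },
      "analyzer": {
        "my_unigram_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_unigram_filter"]
        }
      }
    },
    "index": {
      "similarity": {
        "tf_only": {
          "type": "scripted",
          "script": { "source": "return doc.freq;" } //score is just the term frequency; no IDF term
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "word": { "type": "text", "analyzer": "my_unigram_analyzer", "similarity": "tf_only" }
    }
  }
}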
