A Dictionary for Those Tip-of-the-tongue Frustrations
I’ve built a search engine to look up words from their definitions.
Have you ever had that tip-of-the-tongue situation? It’s a state in which we know part of the definition but cannot recall the word. Furthermore, we fully believe that such a word exists, yet simply cannot access it. It is a psychological phenomenon that is universal across gender, education, and language differences. While it does not deter me from writing, it does happen often enough to slow down the writing process. Thus, when I decided to start a new side project, I was eager to tackle this problem.
Fittingly abbreviated as TOT, the tip-of-the-tongue phenomenon has a rich history of psychological studies. For those who have already read its Wikipedia article, please allow me to regurgitate a couple of highlights I found especially interesting. The “tip of the tongue” expression is believed to have come from the French “avoir le mot sur le bout de la langue”. As a psychological state, it was first described by William James in 1890.
There are two prevailing theories for its cause: the direct-access view and the inferential view. The direct-access view suggests that partial recall is due to insufficient memory strength. While the memory is strong enough for us to recognize that the word exists, it is not strong enough to access the actual word. The inferential view theorizes that TOT is unrelated to the memory of the word. Our strong memory of the definition causes us to believe that we can recall the word, when actually we may not be able to.
The solution I conceived is a reverse dictionary, aptly named ReverseDict. In its generic definition, a reverse dictionary is any dictionary that organizes words in something other than alphabetical order. A rhyme dictionary, for example, is a reverse dictionary. The one I’ve built uses the definition of each word as the index. We can use a rough definition to retrieve a list of candidates with similar meanings. It relies on two key technologies: WordNet and Elasticsearch.
Constructing reverse dictionaries used to be a laborious task. Fortunately for us, there is WordNet. WordNet is a database of English words constructed and maintained by Princeton. As a database, it’s far more powerful than a simple dictionary. It is lexical, which means that we can easily examine a word in its various forms (e.g. run, running, ran, etc.). Furthermore, it is also semantically aware, drawing connections from more generalized concepts (e.g. dog) to their specifics (e.g. German shepherd).
There are a number of ways to access the rich WordNet data. The one I am most familiar with is the powerful Natural Language Processing (NLP) Python library NLTK. Besides WordNet integration, NLTK is stuffed with NLP goodies such as the latest language models, text corpora, and parsing techniques. While NLTK is powerful, it can also be overwhelming to use. Therefore, I chose TextBlob instead (and recommend you do too). TextBlob is a wrapper for NLTK and another text processing library (which I won’t discuss now). Its goal is to ease access to the massive power of NLTK (similar to what requests does for urllib3). I’ve published some notes on TextBlob basics if you’re interested in learning more.
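As a rough illustration of pulling definitions out of WordNet through TextBlob, the sketch below flattens each word's definitions into one entry per word-definition pair. It assumes TextBlob and its corpora are installed. The names `build_entries`, `textblob_definitions`, and the `lookup` parameter are hypothetical and chosen for this example; they are not taken from the ReverseDict source.

```python
def textblob_definitions(word):
    """Return the WordNet definitions of a word via TextBlob.

    Assumes the textblob package and its corpora are installed. The import
    is done lazily so the rest of the sketch runs without it.
    """
    from textblob import Word
    return Word(word).definitions


def build_entries(words, lookup=textblob_definitions):
    """Flatten words into one {"word", "definition"} entry per definition.

    `lookup` is any callable mapping a word to a list of definition strings,
    which makes it easy to swap in another source than TextBlob.
    """
    entries = []
    for word in words:
        for definition in lookup(word):
            entries.append({"word": word, "definition": definition})
    return entries


if __name__ == "__main__":
    # Requires TextBlob; prints every sense WordNet has for the word.
    for entry in build_entries(["octopus"]):
        print(entry["word"], "->", entry["definition"])
```

Keeping one entry per definition, rather than one per word, means each sense of a word can be matched separately against a rough description.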
While WordNet provides us with the data, Elasticsearch provides the structure and access to that data. Elasticsearch is an open-source search engine, designed for scalability, availability, and speed. Our need to parse the user's description prompted me to use a search engine instead of a database. In a typical dictionary, even a reverse one such as a rhyme dictionary, the words are clustered together by the organizing factor (e.g. rhyme scheme). This lends itself well to a database, which can be easily optimized to query that field. Our dictionary, however, does not have that luxury. There are numerous ways our users can define the same word.
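To make the free-text idea a little more concrete, here is a minimal sketch of what a dictionary entry and a search request might look like as JSON bodies for Elasticsearch. The field names `word` and `definition` and the helpers `make_document` and `make_query` are assumptions made for illustration, not the actual ReverseDict schema. The `match` query itself is part of Elasticsearch's standard query DSL.

```python
import json


def make_document(word, definition):
    """Body of one dictionary entry to be indexed (hypothetical field names)."""
    return {"word": word, "definition": definition}


def make_query(description, size=10):
    """Search body: a full-text match of the description against definitions."""
    return {
        "size": size,  # maximum number of candidate words to return
        "query": {
            "match": {
                "definition": {"query": description},
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(make_document("puppy", "a young dog"), indent=2))
    print(json.dumps(make_query("baby dog", size=5), indent=2))
```

Because the `match` query analyzes the description into terms before searching, two users who phrase the same definition differently can still land on overlapping candidates.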
Similar to our scenario, search engines are designed to connect the concepts in user queries to the contents of web pages. (Why did I choose to use Elasticsearch instead of others, e.g. Lucene and Solr? (⌐■_■) Because I want to work for Elastic, the company behind it.) Once we upload a document of mapped fields, Elasticsearch indexes the available text for free-text search. The primary method by which it ranks the entries (our dictionary of words) against the user query (user-provided definitions) is the vector space model. The vector space model is a method of mapping text documents onto a mathematical space through their word usage. If you are interested, I wrote a post explaining the model in greater detail.
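For a feel of how the vector space model ranks entries, here is a toy sketch in plain Python. It weights terms with a simple TF-IDF scheme and compares vectors by cosine similarity. This is a simplified illustration under my own assumptions (whitespace tokenization, a particular IDF formula, made-up definitions), not Elasticsearch's actual scoring function.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def weight(tokens, idf):
    """Map tokens to a sparse TF-IDF vector, ignoring terms unseen at index time."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def build_vectors(definitions):
    """Turn {word: definition} into ({word: vector}, idf)."""
    tokenized = {w: tokenize(d) for w, d in definitions.items()}
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    n = len(tokenized)
    # Rare terms get a higher weight than terms shared by many definitions.
    idf = {t: math.log(1 + n / df) for t, df in doc_freq.items()}
    vectors = {w: weight(tokens, idf) for w, tokens in tokenized.items()}
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, vectors, idf):
    """Return words ordered by similarity to the query, best match first."""
    q = weight(tokenize(query), idf)
    scored = [(cosine(q, v), w) for w, v in vectors.items()]
    return [w for score, w in sorted(scored, reverse=True) if score > 0]


if __name__ == "__main__":
    toy = {
        "puppy": "a young dog",
        "kitten": "a young cat",
        "kennel": "a shelter for a dog",
    }
    vectors, idf = build_vectors(toy)
    print(rank("young dog", vectors, idf))
```

In this toy data, "puppy" ranks first for the query "young dog" because its definition shares both query terms, while the other two entries share only one each.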
It’s eye-opening how ubiquitous and universal the tip-of-the-tongue phenomenon is. In fact, I’ve already used my own ReverseDict a few times while writing this. By inverting our understanding of a dictionary, I was able to build a reverse dictionary to look up words from their definitions. You can find the source code on GitHub.