ELMo Embedding — The Entire Intent of a Query
As a continuation of Search Query Understanding, the next problem to solve is Holistic Query Understanding. We will talk about it in detail here, having already covered the Reductionist Query Understanding part. We take the case of e-commerce search: the problem is to find the intent of a query within a particular taxonomy, such as the L1/L2/L3 category hierarchy, which is also called category classification.
Let’s take a real-life online shopping example: you search for something like ‘onion 1kg fresh’. To make the results more precise, they are shown along with the probable category in which the intent of your query may lie, i.e. ‘Grocery & Gourmet Foods’ in this case. If you click on the predicted category, you see only products belonging to that category, giving a more precise result and a better experience.
The solution to this problem is Machine Learning based, and that is what we are going to talk about.
The Intent of the Query
This is a multi-class, multi-label classification problem: the input is a set of search queries labeled with their categories, and a particular search query may belong to more than one category. To elaborate, say we have three category hierarchies: L1, L2, and L3. A search query may belong to one or more categories in each hierarchy, e.g. the query “apple” may lie in “Electronics” and “Groceries” in L1, in “Laptop” and “Fruit” in L2, and so on. We use a Deep Learning based model with ELMo as its embedding layer.
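To make the setup concrete, below is a minimal sketch (an illustration, not the exact pipeline used here) of how such multi-label targets can be encoded: each query gets one multi-hot vector spanning all L1/L2/L3 categories.

from sklearn.preprocessing import MultiLabelBinarizer

# Toy queries and labels, for illustration only
queries = ["apple", "onion 1kg fresh"]
labels = [
    ["L1:Electronics", "L1:Groceries", "L2:Laptop", "L2:Fruit"],
    ["L1:Groceries", "L2:Vegetables", "L3:Fresh Vegetables"],
]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)  # multi-hot matrix of shape (num_queries, num_distinct_labels)
print(mlb.classes_)
print(y)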
ELMo Embedding
ELMo was created by AllenNLP. Unlike GloVe, fastText, Word2Vec, etc., it provides contextualized word embeddings: the vector representation of a word differs from sentence to sentence.
ELMo is a deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus.
The ELMo model was trained on the 1 Billion Word Benchmark, approximately 800M tokens of news crawl data from WMT 2011.
Why ELMo? Motivation behind using ELMo
Other word representation models like fastText, GloVe, etc. also generate word embeddings, but we chose ELMo for a few reasons. Let’s visit them one by one:
- ELMo provides the embedding of a word as it appears inside a sentence, i.e. a word may have a different meaning depending on the context in which it is used, similar to the “apple” example above. The word “Apple” may be a brand name as well as a fruit. So, for a query like ‘Apple Juice’, the embedding generated for the token ‘apple’ is different from the one in ‘Apple Laptop’. In typical e-commerce search queries, this case is very likely to happen.
- Another reason: since ELMo builds its token representations from characters before passing them through its biLSTM, we need not worry about words that are absent from the training vocabulary. This allows the network to use morphological clues to form robust representations for out-of-vocabulary tokens unseen in training. In practice we face the out-of-vocabulary problem mostly with brand names.
- Furthermore, the ELMo paper won the Best Paper Award at NAACL-HLT 2018, one of the top conferences in the field.
Implementation using TF Hub
TensorFlow Hub is a platform to publish, discover, and reuse parts of machine learning modules in TensorFlow. A module is a self-contained piece of a TensorFlow graph, along with its weights, that can be reused across other, similar tasks. By reusing a module, a developer can perform transfer learning. The pre-trained ELMo model is also available on TensorFlow Hub.
# Sample code to get instant embeddings
import tensorflow as tf
import tensorflow_hub as hub

elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)
embeddings = elmo(["apple juice", "apple tablet"], signature="default", as_dict=True)["elmo"]
The output embedding has 1024 dimensions per word, hence a shape of [2, 2, 1024] for the two queries of two words each in the example above. If using Keras for modeling, this module needs to be wrapped in a Lambda layer (a minimal sketch is shown in the “Model after Integration” section below). The ELMo model weights have already been downloaded to a local folder, so we need not make a network call to TensorFlow Hub while training.
mkdir -p /content/elmo_module && curl -L "https://tfhub.dev/google/elmo/2?tf-hub-format=compressed" | tar -zxvC /content/elmo_module
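Once the weights are extracted, the module can be loaded from the local folder instead of the URL, using the same /content/elmo_module path as in the command above and the hub import from the earlier snippet:

elmo = hub.Module("/content/elmo_module", trainable=True)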
Remember to initialize the graph variables and tables before calling the ELMo embedding layer.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    embed = sess.run(embeddings)
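As a quick sanity check on the contextual claim made earlier, we can compare the vector of the token “apple” across the two queries. This is a small sketch using the embed array computed above:

import numpy as np

apple_in_juice, apple_in_tablet = embed[0, 0], embed[1, 0]  # token "apple" in each query
cosine = np.dot(apple_in_juice, apple_in_tablet) / (
    np.linalg.norm(apple_in_juice) * np.linalg.norm(apple_in_tablet)
)
print(cosine)  # noticeably below 1.0: the same token gets context-dependent vectors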
Model after Integration
Here we use in-memory caching via Python’s cachetools library. Since the ELMo weights are very large, running the graph again to recompute the embedding of a sentence that has already been embedded once is not a good idea: it increases training time, and we may also run out of RAM. After one epoch, the embeddings of all the search queries are cached and returned directly, without going through the TensorFlow Hub module again.
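A minimal sketch of this caching idea is shown below, assuming embeddings are looked up per query string; the cache size, the placeholder, and the embed_query helper are illustrative rather than the exact code used here.

import tensorflow as tf
import tensorflow_hub as hub
from cachetools import cached, LRUCache

elmo = hub.Module("/content/elmo_module", trainable=False)
query_ph = tf.placeholder(tf.string, shape=[None])
elmo_out = elmo(query_ph, signature="default", as_dict=True)["elmo"]

sess = tf.Session()
sess.run([tf.global_variables_initializer(), tf.tables_initializer()])

@cached(cache=LRUCache(maxsize=200_000))
def embed_query(query):
    # Runs the ELMo graph only the first time a query is seen; later calls hit the cache
    return sess.run(elmo_out, feed_dict={query_ph: [query]})[0]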
After getting the embedding of a sentence, we use a BiLSTM to extract a richer contextual representation (which also helps with out-of-vocabulary tokens), and then two dense layers with a sigmoid activation at the end. Since this is multi-label classification and a query can belong to each of the three hierarchies, we use a sigmoid layer: the sum of the probabilities need not be 1. In this example we have 1500 categories, say 500 in each of the hierarchies L1, L2, and L3. To show the top 2 categories in each hierarchy, we take the categories with the highest probabilities from the sigmoid layer at the end.
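Below is a minimal sketch of this architecture in Keras, assuming a TF1-style graph; the layer sizes (a 128-unit BiLSTM, a 512-unit dense layer), variable names, and the example query are illustrative, not the exact production model.

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras.layers import Input, Lambda, Bidirectional, LSTM, Dense
from tensorflow.keras.models import Model
import tensorflow.keras.backend as K

NUM_CATEGORIES = 1500  # e.g. 500 each for L1, L2 and L3

elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

def elmo_embedding(x):
    # x carries raw query strings of shape (batch, 1); ELMo returns per-token 1024-d vectors
    return elmo(tf.squeeze(x, axis=1), signature="default", as_dict=True)["elmo"]

query_in = Input(shape=(1,), dtype="string")
embedded = Lambda(elmo_embedding, output_shape=(None, 1024))(query_in)
contextual = Bidirectional(LSTM(128))(embedded)
hidden = Dense(512, activation="relu")(contextual)
probs = Dense(NUM_CATEGORIES, activation="sigmoid")(hidden)  # independent per-label probabilities

model = Model(query_in, probs)
model.compile(optimizer="adam", loss="binary_crossentropy")

# Initialize the hub module's variables and tables in the Keras session before training
K.get_session().run([tf.global_variables_initializer(), tf.tables_initializer()])

# Top-2 categories per hierarchy, assuming outputs are ordered L1 (0-499), L2 (500-999), L3 (1000-1499)
scores = model.predict(np.array([["onion 1kg fresh"]]))[0]
for name, block in [("L1", scores[:500]), ("L2", scores[500:1000]), ("L3", scores[1000:])]:
    top2 = np.argsort(block)[-2:][::-1]
    print(name, top2, block[top2])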
TF-Serving Response
As the model is fairly complex, with a large number of weights, it affects the overall response time in TF Serving. The latency also depends on the number of categories being predicted at the end. The model size has grown to ~850MB.
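For completeness, a minimal export sketch for TF Serving is shown below, assuming the tf.keras model built above and a TF1-style session; the export path and signature names are illustrative.

import tensorflow as tf
import tensorflow.keras.backend as K

export_dir = "/models/query_intent/1"  # TF Serving expects a numeric version sub-folder
tf.saved_model.simple_save(
    K.get_session(),
    export_dir,
    inputs={"query": model.input},
    outputs={"category_probs": model.output},
)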
ELMo can be replaced with BERT as well; BERT has several variants available on TF Hub. The ELMo/BERT model can be used for unsupervised learning and can be trained on custom datasets as well by creating a custom layer in Keras. But make sure you have enough resources, like GPU and memory, before proceeding with this task.
Conclusion
So, now that we can predict the taxonomy hierarchy for a particular search query, we can display it to the end-user to select, and show more precise results containing the products the user intended to buy, by enriching the search request sent to the engine. As this is being called the ‘ImageNet moment’ of NLP, Transformer-based language models like BERT are reigning in this era. Other language models include Universal Language Model Fine-tuning (ULMFiT) and the OpenAI Transformer, which have achieved state-of-the-art results on a diverse range of Natural Language Processing tasks.
Thanks for reading the article. Please watch this space for more updates in the future, and do read the other stories as well.