Building Intelligent Question Answering Systems with ELMo

Rajat Newatia
Saarthi.ai
Jul 31, 2019

In this article, we demonstrate how to build an intelligent Question Answering System based on deep contextualized word representations (word embeddings), using ELMo.

If you're familiar with Question Answering systems, you have probably seen them implemented with techniques like attention-based Recurrent Neural Networks (RNNs). However, the goal of this article is not to show how to achieve state-of-the-art results in Question Answering, but rather to learn and explore different solutions.

We will be using the Stanford Question Answering Dataset 1.1 (SQuAD) in this Question Answering System tutorial.

Understanding the Question Answering System

Question answering (QA) is a computer science discipline within the fields of natural language processing (NLP) and information retrieval (IR) that involves building systems capable of automatically answering questions posed by human beings in a natural language. (Wikipedia)

A system's understanding of natural language is defined by its ability to translate sentences into an internal representation such that it can generate valid responses to questions asked by a user. Valid responses are answers relevant to the question asked. By internal representation of natural language, we mean that the sentences must precisely map to the semantics, or meaning, of their statements.

Stanford Question Answering Dataset (SQuAD)


Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of a set of Wikipedia articles (contexts) and questions posed on those articles by a group of crowdworkers, where the answer to every question is a text segment, or span, from the corresponding reading passage (or the question may be unanswerable). SQuAD 1.1, a previous version of the dataset, contains more than 100,000 question-answer pairs on more than 500 articles, making it a perfect dataset for building a Question Answering System using ELMo.

Imports:

import warnings
warnings.filterwarnings('ignore')
import pickle
import numpy as np
import pandas as pd
import json
from textblob import TextBlob
import nltk
from scipy import spatial
import torch
import spacy
import tensorflow as tf
import tensorflow_hub as hub

Firstly, import the data from JSON into a pandas DataFrame.

# import the training dataset
train = pd.read_json("data/train-v1.1.json")
# validation dataset
valid = pd.read_json("data/dev-v1.1.json")
train.head(5)

The head of the DataFrame shows one Wikipedia article per row, with its paragraphs nested inside the data column. To inspect a single paragraph entry:

train.iloc[2,0]['paragraphs'][0]

Output:

{'context': 'Montana i/mɒnˈtænə/ is a state in the Western region of the United States. The state\'s name is derived from the Spanish word montaña (mountain). Montana has several nicknames, although none official, including "Big Sky Country" and "The Treasure State", and slogans that include "Land of the Shining Mountains" and more recently "The Last Best Place". Montana is ranked 4th in size, but 44th in population and 48th in population density of the 50 United States. The western third of Montana contains numerous mountain ranges. Smaller island ranges are found throughout the state. In total, 77 named ranges are part of the Rocky Mountains.',
 'qas': [{'answers': [{'answer_start': 112, 'text': 'Spanish word montaña (mountain)'}],
          'question': "Where does the state's name come from?",
          'id': '5733bd9bd058e614000b6199'},
         {'answers': [{'answer_start': 370, 'text': '4th'}],
          'question': 'What is the states rank in size?',
          'id': '5733bd9bd058e614000b619a'},
         {'answers': [{'answer_start': 387, 'text': '44th'}],
          'question': 'What is its rank in popularion?',
          'id': '5733bd9bd058e614000b619b'},
         {'answers': [{'answer_start': 590, 'text': '77'}],
          'question': 'How many ranges are part of the Rocky Mountains?',
          'id': '5733bd9bd058e614000b619c'},
         {'answers': [{'answer_start': 103, 'text': 'from the Spanish word montaña'}],
          'question': "Where does the state's name come from?",
          'id': '5733f0e34776f41900661573'}]}

Therefore, for each entry in the training set, we have a:

  • context
  • question
  • text (the answer span)

You have a closed dataset here, meaning that the answer to a question lies within the context which is provided.
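
Since the answer always lies within the context, it helps to have one row per question. Here is a minimal sketch (assuming the train DataFrame loaded above) that flattens the nested JSON into (context, question, text) rows:

# flatten the nested SQuAD JSON: one row per question
rows = []
for article in train['data']:
    for paragraph in article['paragraphs']:
        for qa in paragraph['qas']:
            rows.append({'context': paragraph['context'],
                         'question': qa['question'],
                         'text': qa['answers'][0]['text']})  # first answer span
flat = pd.DataFrame(rows)
print(flat.shape)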

Why Do Deep Contextualized Word Embeddings Do It Better?

You must be familiar with the term "word embeddings". If you are not, it is worth reading up on them first. There are various approaches to word embeddings, like word2vec, GloVe, etc.

The basic concept behind all these word embeddings is to represent a word (natural language) in the form of a fixed-length vector of numerical values (an internal representation), which makes it easier for a computer to work with words in various downstream tasks.

Traditionally, we represented a word using a one-hot vector and represented a sentence by averaging the vectors of all its words; this was known as the bag-of-words approach. With pre-trained embeddings, each sentence is tokenized into words, a vector for each word is looked up (using GloVe embeddings, for example), and the average of all these vectors is taken as the sentence representation.

This technique works decently, but it is not very accurate, as it does not care about the order of words; the semantic and syntactic role of a word as part of a sentence is therefore completely lost. We need some other form of representation, one in which the contextualized meaning of a word as part of its sentence is retained.
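
To make the limitation concrete, here is a minimal sketch of the averaging baseline, assuming spaCy's en_core_web_md model (which ships GloVe-style word vectors):

import numpy as np
import spacy

nlp = spacy.load('en_core_web_md')

def sentence_vector(sentence):
    # average the word vectors of all tokens that have one
    vectors = [token.vector for token in nlp(sentence) if token.has_vector]
    return np.mean(vectors, axis=0)

v1 = sentence_vector('montana is ranked 4th in size')
v2 = sentence_vector('in size montana is ranked 4th')
# Same words in a different order produce the same averaged vector,
# so all word-order information is lost.
print(np.allclose(v1, v2))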

ELMo

ELMo is a deep contextualized word representation that models both the syntactic and semantic characteristics of word use in a sentence. These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. They can easily be used with an existing model and have significantly improved the state-of-the-art results across a wide range of challenging NLP problems, including sentiment analysis, question answering and textual entailment.

Coming back to the SQuAD problem, you’ll be using ELMo to find the sentence from a given context that contains the right answer to the user’s question. This will be achieved as follows:

For the context, you can tokenize it into multiple sentences in order to generate embeddings for each individual sentence. You can use spaCy for this task. For instance, using spaCy to tokenize the above context:

text = train.iloc[2,0]['paragraphs'][0]['context']
nlp = spacy.load('en_core_web_md')
# get rid of problem characters
text = text.lower().replace('\n', ' ').replace('\t', ' ').replace('\xa0', ' ')
text = ' '.join(text.split())  # a quick way of removing excess whitespace
doc = nlp(text)
sentences = []
for i in doc.sents:
    if len(i) > 1:
        sentences.append(i.text.strip())  # tokenize into sentences
print(len(sentences))
print(sentences)

Output:

7
['montana i/mɒnˈtænə/ is a state in the western region of the united states.',
"the state's name is derived from the spanish word montaña (mountain).",
'montana has several nicknames, although none official, including "big sky country" and "the treasure state", and slogans that include "land of the shining mountains" and more recently "the last best place".',
'montana is ranked 4th in size, but 44th in population and 48th in population density of the 50 united states.',
'the western third of montana contains numerous mountain ranges.',
'smaller island ranges are found throughout the state.',
'in total, 77 named ranges are part of the rocky mountains.']
  • Cosine similarity: In the unsupervised setting, we simply evaluate the similarity between the question and each sentence in the context, and select the sentence with the greatest similarity score (a minimal definition of the metric is sketched below).
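
For reference, cosine similarity is just the dot product of two vectors normalized by their magnitudes:

import numpy as np

def cosine(u, v):
    # dot product scaled by the vectors' lengths
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))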

You can simply use TensorFlow Hub to load the ELMo model; it saves a lot of time, as the hub contains a large number of pre-trained models that can be loaded with a few lines of code. For instance:

# loading the model
url = "https://tfhub.dev/google/elmo/2"
embed = hub.Module(url)

Generating the ELMo embeddings for each statement in the above context:

# This tells the model to run through the 'sentences' list and return
# the default output (1024-dimensional sentence vectors).
embeddings = embed(sentences, signature="default", as_dict=True)["default"]
# Start a session and run ELMo to return the embeddings in variable context
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    context = sess.run(embeddings)
# Number of sentences in the context:
print(len(context))
# Embeddings for the context:
print(context)

Output:

7
[[ 0.36899847 0.47257534 0.32068962 ... -0.20088142 0.9022676
-0.02785916]
[ 0.15339084 0.10321682 0.25895926 ... -0.40209684 0.72194535
0.02366676]
[ 0.21802966 0.12451631 0.1021833 ... -0.40154076 0.6026789
0.04383117]
...
[ 0.21858774 0.4698825 0.20587952 ... -0.14859062 1.0065801
0.35133246]
[ 0.320227 0.0681043 0.3059626 ... 0.04991718 0.0975334
-0.13061483]
[ 0.15054972 0.07059898 0.13760993 ... -0.1717388 0.30478427
0.06774213]]

Similarly, you can generate the embeddings for questions. Let's try to answer the following question from the given context (note that the answer will be one of the sentences from the context):

Question: 'What is the states rank in size?'

question = 'What is the states rank in size?'
q_list = []
q_list.append(question)

embeddings = embed(q_list, signature="default", as_dict=True)["default"]
# Start a session and run ELMo to return the embedding in variable ques
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    ques = sess.run(embeddings)

Therefore, evaluating the cosine similarity between the question embedding and the embedding of each sentence in the context, the sentence with the highest similarity score is your answer.

from sklearn.metrics.pairwise import cosine_similarity

score = []
index = 0
for i in context:
    statement = [i]
    value = cosine_similarity(ques, statement)[0][0]
    answer = [value, index]
    index += 1
    score.append(answer)

score.sort()
e = score[-1][1]  # index of the highest-scoring sentence
ans = sentences[e]
print(ans)

Output:

montana is ranked 4th in size, but 44th in population and 48th in population density of the 50 united states.
  • In the supervised setting, the process of creating a valid training set can be a bit tough, as the number of sentences in each context is not fixed and the answer can range from a single word to multiple words. The Stanford team used multinomial logistic regression for this problem, creating 180 million features (the sentence detection accuracy they achieved with this model was 79%); a toy version of this setup is sketched below.
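
As a toy illustration of that supervised setup (not the Stanford system, and with made-up overlap features), one could label each (question, sentence) pair with whether the sentence contains the answer and fit a logistic regression:

from sklearn.linear_model import LogisticRegression
import numpy as np

# toy (question, sentence, label) pairs; label 1 means the sentence
# contains the answer span
pairs = [
    ('what is the states rank in size?',
     'montana is ranked 4th in size, but 44th in population.', 1),
    ('what is the states rank in size?',
     'smaller island ranges are found throughout the state.', 0),
]

def features(question, sentence):
    q_tokens = set(question.lower().replace('?', '').split())
    s_tokens = set(sentence.lower().replace('.', '').split())
    shared = len(q_tokens & s_tokens)
    # token overlap, absolute and relative to question length
    return [shared, shared / max(len(q_tokens), 1)]

X = np.array([features(q, s) for q, s, _ in pairs])
y = np.array([label for _, _, label in pairs])
clf = LogisticRegression().fit(X, y)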

To further improve the performance of our system, we can use one of two methods:

Dependency Parsing
A dependency parser analyses the grammatical structure of a sentence, establishing relationships between “head” words and words which modify those heads.

Semantic Role Labelling

Semantic Role Labeling (SRL) models recover the latent predicate argument structure of a sentence. SRL builds representations that answer basic questions about sentence meaning, including “who” did “what” to “whom,” etc.
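
As a pointer (not part of this article's pipeline), the AllenNLP library ships a pre-trained SRL predictor; the model archive URL below is an assumption from that era and may have moved:

# a hedged SRL sketch using AllenNLP; the model URL is an assumption
from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path(
    'https://s3-us-west-2.amazonaws.com/allennlp/models/srl-model-2018.05.25.tar.gz')
result = predictor.predict(sentence='montana is ranked 4th in size.')
# each detected predicate comes back with its labelled arguments
for verb in result['verbs']:
    print(verb['description'])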

For instance, let's visualize our data using spaCy tree parsing. I am using the same example that I used earlier.

import nltk
from nltk import Tree
import spacy

def to_nltk_tree(node):
    if node.n_lefts + node.n_rights > 0:
        return Tree(node.orth_, [to_nltk_tree(child) for child in node.children])
    else:
        return node.orth_

en_nlp = spacy.load('en_core_web_md')
[to_nltk_tree(sent.root).pretty_print() for sent in en_nlp(question).sents]

Question: 'What is the states rank in size?'

Answer statement from the context: Montana is ranked 4th in size, but 44th in population and 48th in population density of the 50 united states.

[to_nltk_tree(sent.root).pretty_print() for sent in en_nlp(ans).sents]

So the objective here is to match the roots of the question's dependency tree with the roots of the sentences in the context. The more roots a sentence shares with the question, the higher the probability that the question is answered by that sentence. A rough sketch of this matching follows.
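
A minimal sketch of that idea, reusing the en_nlp pipeline, question string, and sentences list from above (comparing root lemmas is just one simple choice):

# compare the dependency-tree root lemmas of the question and each sentence
def root_lemmas(text):
    return {sent.root.lemma_ for sent in en_nlp(text).sents}

question_roots = root_lemmas(question)
for index, sentence in enumerate(sentences):
    overlap = root_lemmas(sentence) & question_roots
    print(index, len(overlap), sentence)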

Conclusion:

I hope this article has provided you with at least a general understanding of how to implement a Question Answering system using techniques other than sequence models. You should also have gained some insight into using pre-trained contextualized embeddings like ELMo, natural language frameworks like spaCy, and important NLP techniques like Semantic Role Labelling and Dependency Parsing.

For similar articles, follow our Facebook page.
