Automatic Question Answering — Document Retriever (Machine Learning)

Leonardo Mauro P. Moraes
Published in Sinch Blog · 4 min read · Jun 7, 2022

Have you ever used Google or Wikipedia to satisfy your curiosity? I often use these systems to search for a topic or get quick information, and I imagine you do the same. In this blog post, we are going to build a system like them: a Question Answering (QA) system. A good QA system allows users to extract knowledge from data in a natural way, by asking questions.

QA is an Artificial Intelligence (AI) task that answers users’ questions (natural language queries) using a large collection of documents; a document here can be a web page, a Word or PDF file, an FAQ entry, and so on. In short, a QA system queries many documents to extract an answer to the user’s question. It consists of two main models: (1) the document retriever, which fetches the documents most likely to contain the answer to a given question; and (2) the document reader, a machine reader that carefully examines the retrieved contexts and identifies the correct answer. In this blog post, we are going to explore the Document Retriever using AI techniques.

Source: Intro to Automated Question Answering

Let’s look at document retrieval with Wikipedia, just to better understand our goal:

import wikipedia as wiki

# Retrieve the titles of the k Wikipedia pages that best match the question
k = 5
question = "What are the tourist hotspots in Portugal?"

results = wiki.search(question, results=k)
print('Question:', question)
print('Pages:', results)
----- Output -----
Question: What are the tourist hotspots in Portugal?
Pages: [
'Tourist attraction', 'Portugal',
'Goa', 'Tourism', 'Algarve'
]

We can notice a few things:

  • The question is asked in plain natural language; and
  • We must specify the number k of documents to be retrieved.

Thus, the Document Retriever must be able to (1) process natural language, usually with Natural Language Processing (NLP) techniques; and (2) rank the documents most useful for answering a given question.

Document Retriever

In QA, the Document Retriever is usually implemented with TF-IDF or BM25, which match keywords between the question and the documents, representing both as sparse vectors. Alternatively, we can use dense embeddings, such as Word2Vec and BERT. In that case, synonyms or paraphrases that consist of completely different words but have similar meanings may still be mapped to vectors close to each other.
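To see why pure keyword matching can fall short, consider a toy illustration (the sentences are invented for this example): two sentences with the same meaning but no shared keywords get a TF-IDF cosine similarity of zero, whereas a dense embedding could still map them close together.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Same meaning, but no shared keywords after stop-word removal
docs = ["the car is fast", "the automobile is quick"]
tfidf = TfidfVectorizer(stop_words='english')
vectors = tfidf.fit_transform(docs)

# The sparse vectors are orthogonal, so the similarity is 0.0
print(cosine_similarity(vectors[0:1], vectors[1:2])[0][0])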

This model can be implemented in two steps: (1) transform the question and the documents into vectors; then (2) compute the similarity between them and return the k documents most likely to contain the answer. For the first step, we can use TF-IDF, Word2Vec, etc.; for the second, the Nearest Neighbors algorithm.

Development

We will evaluate our model on the Stanford Question Answering Dataset (SQuAD), which contains over 100k questions and answers. For simplicity, we will only walk through the TF-IDF implementation here, but you can find a Word2Vec implementation as well in the Jupyter Notebook (linked at the end of the blog post).

“TF-IDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document […] It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling.” [Wikipedia]
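Before configuring the vectorizer, a quick intuition check may help (the corpus below is invented for this example): a word that appears in every document gets a low inverse-document-frequency weight, while rarer, more discriminative words get higher weights.

from sklearn.feature_extraction.text import TfidfVectorizer

# "tourism" appears in both documents, so it receives a lower IDF
# weight than the words that appear in only one document
corpus = [
    "tourism in portugal",
    "tourism statistics worldwide",
]
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
for word, idx in sorted(vectorizer.vocabulary_.items()):
    print(word, round(vectorizer.idf_[idx], 2))  # tourism: 1.0, others: 1.41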

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_configs = {
    'lowercase': True,
    'analyzer': 'word',
    'stop_words': 'english',
    'binary': True,
    'max_df': 0.9,
    'max_features': 10_000
}

embedding = TfidfVectorizer(**tfidf_configs)

The TF-IDF settings are: (I) English preprocessing, with lowercasing and English stop-word removal; (II) word-level analysis; (III) binary importance, meaning each word is counted at most once per document; and (IV) a sparse vector size of 10k (max_features), so infrequent words are ignored. In addition, max_df=0.9 drops words that appear in more than 90% of the documents.
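As a quick sanity check, we can transform a couple of sample documents with this configuration (the texts are illustrative, not from SQuAD):

sample_docs = [
    "Lisbon is the capital of Portugal",
    "The Algarve is famous for its beaches",
]
# One sparse row per document, with at most 10,000 columns
sample_vectors = embedding.fit_transform(sample_docs)
print(sample_vectors.shape)  # (2, vocabulary size)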

from sklearn.neighbors import NearestNeighbors

retriever_configs = {
    'n_neighbors': 3,
    'metric': 'cosine'
}

retriever = NearestNeighbors(**retriever_configs)

Meanwhile, the Nearest Neighbors settings are: (I) compute the similarity between the question and document vectors using the cosine function, the most popular metric for NLP problems; and (II) retrieve the 3 documents most likely to answer the question.
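Putting the two pieces together, a minimal sketch looks like the following (the documents below are a toy stand-in for the ~18k SQuAD contexts used in the notebook):

# Toy document collection, for illustration only
documents = [
    "Portugal attracts tourists to Lisbon, Porto and the Algarve.",
    "The Eiffel Tower is a landmark in Paris, France.",
    "SQuAD is a reading comprehension dataset.",
]

# Step 1: transform the documents into TF-IDF vectors
document_vectors = embedding.fit_transform(documents)
# Step 2: index the document vectors for nearest-neighbor search
retriever.fit(document_vectors)

# Retrieve the k most similar documents for a question
question_vector = embedding.transform(["What are the tourist hotspots in Portugal?"])
distances, indices = retriever.kneighbors(question_vector)
print(indices[0])  # indices of the 3 nearest documents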

OK. How good is this model? Retrieving only one document, the model reached a low accuracy of 43.22%. Note that this is a difficult problem: there are many documents (~18k) and some are similar to each other, so it is hard to retrieve exactly the right one. In contrast, retrieving the top-3 documents, the model reached a high accuracy of 98.92% (hitting 86,650 cases out of 87,599).
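The hit-rate computation can be sketched as follows (my reconstruction, assuming each question is labeled with the index of its source context; not necessarily the notebook’s exact code):

def top_k_accuracy(retriever, question_vectors, true_doc_ids, k=3):
    # A question counts as a hit when its source document appears
    # among the k retrieved document indices
    _, indices = retriever.kneighbors(question_vectors, n_neighbors=k)
    hits = sum(true_id in row for true_id, row in zip(true_doc_ids, indices))
    return hits / len(true_doc_ids)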

Conclusion

Question Answering systems are capable of responding to users’ questions using Natural Language Processing algorithms; they consist of Document Retriever and Document Reader components. In this blog post, we discussed how to build a Document Retriever and evaluated the strategy on a real dataset. As mentioned, this is a complex problem due to the high number of documents. Nevertheless, our baseline solution reached a high accuracy of 98.92% for top-3 retrieval. Besides this approach, there are other popular models for document retrieval, such as BM25 and Dense Passage Retrieval (DPR). Jupyter Notebook:

Reference

  1. Intro to Automated Question Answering
  2. Dense Passage Retrieval (DPR) for Open-Domain Question Answering


Leonardo Mauro P. Moraes is a Machine Learning Engineer and Team Leader at Sinch, working with data-related products.