Artificial Intelligence in Answer Sentence Selection.

Abhishek Srinivasan · Published in Analytics Vidhya · 5 min read · Jun 6, 2020

Business Problem.

Addressing the problem of question answering using Artificial Intelligence. Given a question, we want to find the most suitable answer from a set of candidate answers. Such a system can be used to solve comprehension tasks where users would otherwise need to read a whole paragraph to find the answer to a question, under low-latency requirements.

Source of Data.

The data used in this case study is the WikiQA dataset, a publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering. Most previous work on answer sentence selection focused on a dataset created using TREC-QA data, which includes editor-generated questions and candidate answer sentences selected by matching content words in the question. WikiQA is constructed using a more natural process and is more than an order of magnitude larger than the previous dataset.

Existing approaches to the Problem.

Currently, these kinds of problems are solved using search-engine-style architectures, which suffer from high latency, and the underlying algorithms do not perform particularly well. Okapi BM25 is one such example. BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of their proximity within the document. It is a family of scoring functions with slightly different components and parameters.
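To make the baseline concrete, here is a minimal sketch of the Okapi BM25 scoring function. The parameter names k1 and b follow the usual convention; the corpus statistics (document frequencies, corpus size, average document length) are assumed to be precomputed.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avgdl,
               k1=1.5, b=0.75):
    """Score one document against a query with Okapi BM25.

    doc_freqs maps each term to the number of documents containing it;
    num_docs is the corpus size and avgdl the average document length.
    """
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        # Inverse document frequency: rare terms contribute more.
        idf = math.log(1 + (num_docs - doc_freqs.get(term, 0) + 0.5)
                       / (doc_freqs.get(term, 0) + 0.5))
        # Term-frequency component, saturated by k1 and normalised
        # by document length via b.
        denom = tf[term] + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf[term] * (k1 + 1) / denom
    return score
```

Note that the score ignores where in the document the query terms appear; only their frequencies matter, which is exactly the bag-of-words limitation described above.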

Improvements to the existing approaches.

We will use a deep-learning-based model to improve the search results and rank documents accordingly. Interestingly, the metric used in this domain (MRR) shows better scores for this approach than for the traditional BM25 method. We will use LSTM (Long Short-Term Memory) units to capture the dependencies between words and to calculate the similarity between the question and the answer. This is the core idea of the answer sentence selection model using deep learning.

Exploratory Data Analysis.

There are three fields in the WikiQA dataset: Question, Answer, and Label.

Looking at the Distribution of the Label field.

It is a highly imbalanced dataset: the label distribution is roughly 25:1 in favour of the negative class.

There are 29,208 records in total, and the dataset has no missing values.

To embed the Question and Answer fields, we need to pad the text sequences, so the length distributions of answers and questions play an important role in deciding the padding lengths.

Answer Field.

Fixing 400 characters as the padding length for the Answer field.

Question Field.

Fixing 80 characters as the padding length for the Question field.
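As a minimal sketch of this step, assuming Keras and character-level tokenization (the lengths above are measured in characters), where `questions` and `answers` are lists of strings loaded from WikiQA:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_ANSWER_LEN = 400   # padding length chosen from the answer distribution
MAX_QUESTION_LEN = 80  # padding length chosen from the question distribution

# char_level=True so sequence lengths match the character counts above.
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(list(questions) + list(answers))

# Convert text to integer sequences and pad/truncate to fixed lengths.
q_seqs = pad_sequences(tokenizer.texts_to_sequences(questions),
                       maxlen=MAX_QUESTION_LEN, padding='post')
a_seqs = pad_sequences(tokenizer.texts_to_sequences(answers),
                       maxlen=MAX_ANSWER_LEN, padding='post')
```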

First Cut Approach.

  1. The problem is posed as a binary classification task.
  2. The system returns 1 if the answer is appropriate, and 0 if it is not.
  3. The idea of the system is to take both the question and the answer and output a relevance score.
  4. If the relevance score is greater than a threshold, the system returns 1, else 0. The value of the threshold depends entirely on the architecture of the model and the data used in training.
  5. Initially, the question and answer sets are preprocessed and cleaned.
  6. Chunking is also applied to structure the question and the answer.
  7. A 300-dimensional word vector is created for each word present in both the question and the answer (after tokenization).
  8. The vectors are then embedded into a suitable shape to feed into a bidirectional stacked LSTM.
  9. The system uses multilayer (stacked) bidirectional LSTMs to capture the context of the question and the answer (see the sketch after this list).
  10. In simple words, it uses RNNs to get the meaning of the questions and answers.
  11. We use bidirectional RNNs to get the context from both sides for each word in the question and the answer.
  12. The output of the LSTM is passed through a dense layer and a softmax layer to classify the answer.
  13. If possible, other algorithms can also be applied and their outputs stacked together.
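As a sketch of steps 7 to 9, assuming Keras and hypothetical layer sizes (64 units per direction; the post does not specify them), the encoder for one field might look like:

```python
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM

def build_encoder(max_len, vocab_size, embed_dim=300):
    """Encode a padded token sequence with two stacked bidirectional LSTMs."""
    inp = Input(shape=(max_len,))
    x = Embedding(vocab_size, embed_dim)(inp)               # 300-d vectors (step 7)
    x = Bidirectional(LSTM(64, return_sequences=True))(x)   # first biLSTM layer
    x = Bidirectional(LSTM(64))(x)                          # second layer -> fixed vector
    return inp, x
```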

Metric to be Used.

In this business problem, since we are ranking answers, we will use the MRR metric.

The mean reciprocal rank is a statistical measure for evaluating any process that produces a list of possible responses to a sample of queries, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer: 1 for first place, 1/2 for second place, 1/3 for third place, and so on.

The reciprocal value of the mean reciprocal rank corresponds to the harmonic mean of the ranks.
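A minimal sketch of computing MRR, assuming that for each question we have the candidate answers' gold labels sorted by the model's predicted relevance, best first:

```python
def mean_reciprocal_rank(ranked_labels_per_query):
    """ranked_labels_per_query: for each query, the candidate labels
    (1 = correct) sorted by predicted relevance, best first."""
    total = 0.0
    for labels in ranked_labels_per_query:
        for rank, label in enumerate(labels, start=1):
            if label == 1:
                total += 1.0 / rank   # reciprocal rank of first correct answer
                break
    return total / len(ranked_labels_per_query)

# Example: first query's correct answer at rank 2, second at rank 1
# -> MRR = (1/2 + 1) / 2 = 0.75
print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))
```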

Feature Engineering.

In this problem, there are two text columns to preprocess: Questions and Answers. An additional feature is created that counts the number of common words between the two text columns; the higher the number of common words, the higher the chance that the answer is correct. The other feature is the length of the Answer field. This could also be very useful, as the length of the answer can play an important part in finding the appropriate answer to the question.
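A minimal sketch of both features, assuming the dataset is loaded in a pandas DataFrame `df` with `Question` and `Answer` columns:

```python
import pandas as pd

def common_word_count(question, answer):
    """Number of distinct words shared by the question and the answer."""
    return len(set(question.lower().split()) & set(answer.lower().split()))

# df is assumed to hold the WikiQA records.
df['common_words'] = [common_word_count(q, a)
                      for q, a in zip(df['Question'], df['Answer'])]
df['answer_length'] = df['Answer'].str.len()   # answer length in characters
```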

Architecture of the Model.


The Question and Answer fields are passed through stacked bidirectional LSTMs to capture the context of each field. There are two additional inputs in the architecture: the number of common words between the question and the answer, and the length of the answer sentence. These are fed through a dense layer and then merged with the LSTM outputs. The last six layers are included to prevent overfitting to the training data.
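A minimal sketch of this architecture in Keras follows. The layer sizes, dropout rate, and exact number of regularization layers are assumptions, since the post does not specify them:

```python
from tensorflow.keras.layers import (Input, Embedding, Bidirectional, LSTM,
                                     Dense, Dropout, Concatenate)
from tensorflow.keras.models import Model

def build_model(vocab_size, q_len=80, a_len=400, embed_dim=300):
    # Question branch: embedding followed by stacked bidirectional LSTMs.
    q_in = Input(shape=(q_len,), name='question')
    q = Embedding(vocab_size, embed_dim)(q_in)
    q = Bidirectional(LSTM(64, return_sequences=True))(q)
    q = Bidirectional(LSTM(64))(q)

    # Answer branch: same structure, separate weights.
    a_in = Input(shape=(a_len,), name='answer')
    a = Embedding(vocab_size, embed_dim)(a_in)
    a = Bidirectional(LSTM(64, return_sequences=True))(a)
    a = Bidirectional(LSTM(64))(a)

    # Hand-crafted features: common-word count and answer length.
    f_in = Input(shape=(2,), name='features')
    f = Dense(16, activation='relu')(f_in)

    # Merge all branches; dense + dropout layers guard against overfitting.
    x = Concatenate()([q, a, f])
    x = Dense(64, activation='relu')(x)
    x = Dropout(0.5)(x)
    out = Dense(2, activation='softmax')(x)   # relevant / not relevant

    model = Model([q_in, a_in, f_in], out)
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```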

Model Performance.

We obtain an MRR score of about 0.763 (76.3 percent).

Alternate Solution.

Neural Variational Inference for Text Processing.

This business problem can also be approached with a slightly different model. Instead of using two bidirectional LSTMs plus additional features, we can use three bidirectional LSTMs without the additional features. The batch-normalization layers of the previous architecture remain the same. The performance of this approach is reported to be about 0.695 MRR. This approach is based on a research paper.
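A sketch of the variant encoder under the same assumptions as before (the unit counts are hypothetical; batch normalization is kept between the recurrent layers, as described above):

```python
from tensorflow.keras.layers import (Input, Embedding, Bidirectional,
                                     LSTM, BatchNormalization)

def build_deep_encoder(max_len, vocab_size, embed_dim=300):
    """Variant: three stacked bidirectional LSTMs, no hand-crafted features."""
    inp = Input(shape=(max_len,))
    x = Embedding(vocab_size, embed_dim)(inp)
    x = Bidirectional(LSTM(64, return_sequences=True))(x)
    x = BatchNormalization()(x)
    x = Bidirectional(LSTM(64, return_sequences=True))(x)
    x = BatchNormalization()(x)
    x = Bidirectional(LSTM(64))(x)   # final fixed-size representation
    return inp, x
```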

BERT Approach.

This business problem can also be solved using a BERT-based approach. BERT stands for Bidirectional Encoder Representations from Transformers. You can find more about BERT here.
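As a minimal sketch of scoring a question-answer pair with a BERT cross-encoder, using the Hugging Face transformers library (note that the classification head is randomly initialised until the model is fine-tuned on WikiQA pairs, so the score below is purely illustrative):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased',
                                                      num_labels=2)

# The question and candidate answer are encoded as one sentence pair;
# BERT inserts a [SEP] token between the two segments.
inputs = tokenizer('who wrote the iliad?',
                   'The Iliad is an ancient Greek epic poem attributed to Homer.',
                   return_tensors='pt', truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits
relevance = torch.softmax(logits, dim=-1)[0, 1]   # probability of "relevant"
```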

Future Work.

This model can be extended with a speech-to-text front end so that it can be upgraded into an AI assistant. It could also be used to summarise a large book and answer questions about that book, again like an AI assistant.

References.

  1. https://aclweb.org/aclwiki/Question_Answering_(State_of_the_art)
  2. https://arxiv.org/pdf/1511.04108.pdf
  3. https://arxiv.org/pdf/1707.06372.pdf
  4. https://www.aclweb.org/anthology/P15-2116.pdf
  5. https://arxiv.org/pdf/1905.12897.pdf
  6. https://machinelearningmastery.com/develop-bidirectional-lstm-sequence-classification-python-keras/
  7. https://arxiv.org/ftp/arxiv/papers/1801/1801.02143.pdf

Abhishek Srinivasan- https://www.linkedin.com/in/abhishek-srinivasan-1a6099147/
