Building a Question Answering System Part 2: Document Retrieval

Authors: The CASL Team

Petuum, Inc.
Jun 3 · 6 min read

Welcome back to our blog post series on building your own Question Answering system! In the last post (Building a Question Answering System Part 1: Query Understanding in 18 lines of Code), we showed how to implement Question Understanding, the first step of a Q&A system, in just 18 lines of code using Forte. In this post, you will learn how to build the Document Retrieval step and connect it to the Question Understanding step from Part 1.

Recall from the last blog post that Document Retrieval is the step immediately after Question Understanding: it identifies and retrieves a small number of highly relevant documents from a large document pool. It relies on the outcome of Question Understanding, which captures the intent of the user's question, to form a query against the document pool.

We use a pre-built ElasticSearch index as our document pool, which contains more than 190,000 scientific and medical papers from the CORD-19 dataset. If you're interested in how we built the pre-indexed corpus, our code can be found here. The ElasticSearch backend can quickly filter information from a large corpus, which makes it well suited for corpora with millions of documents.

Example: Open Research Dataset

The task for Document Retrieval is now very clear: take the DataPack created by Question Understanding, containing Sentence (by NLTK), Token (by NLTK and AllenNLP) and PredicateLink (by AllenNLP) annotations, and convert them into an ElasticSearch query that is used to search for the documents most relevant to the user question. Follow along and build one yourself!

Environment Setup

Re-activate the Python environment from last time

Option 1: Using Conda

conda activate forte_qa

Option 2: Using Python venv

source env/bin/activate

Clone and cd into our repo

git clone https://github.com/petuum/composing_information_system.git
cd composing_information_system

Set PYTHONPATH

export PYTHONPATH=$(pwd):$PYTHONPATH

Build the ElasticSearch index

On another terminal, start an ElasticSearch backend

Follow this guide to download and untar the ElasticSearch binary, then start an ElasticSearch backend with

./bin/elasticsearch

Switch back to the first terminal and build the index

Option 1 (recommended): using sample data provided in our repo

python examples/pipeline/indexer/cordindexer.py --data-dir sample_data/cord_paper

Option 2: using the complete CORD-19 dataset

Download the CORD-19 dataset from this link and untar the file to get the document_parses/pdf_json folder. Then run

python examples/pipeline/indexer/cordindexer.py --data-dir <path-to-cord19-dataset>/document_parses/pdf_json

Recap: Question Understanding as a Forte pipeline

In our previous blog post (Building a Question Answering System Part 1: Query Understanding in 18 lines of Code), we walked through the code for the Question Understanding step, which comes before the Document Retrieval step that we're building in this post.
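For reference, a rough sketch of that Part 1 pipeline is below. The import paths for the wrapper processors come from the forte-wrappers repo and vary across versions, so treat this as an outline rather than a drop-in implementation.

```python
from forte.pipeline import Pipeline
from forte.data.readers import StringReader

# Wrapper processors from the forte-wrappers repo; these module paths
# are assumptions and may differ in your installed version.
from fortex.nltk import NLTKSentenceSegmenter, NLTKWordTokenizer
from fortex.allennlp import AllenNLPProcessor

pipeline = Pipeline()
pipeline.set_reader(StringReader())    # reads the raw question text
pipeline.add(NLTKSentenceSegmenter())  # adds Sentence annotations
pipeline.add(NLTKWordTokenizer())      # adds Token annotations
pipeline.add(AllenNLPProcessor())      # adds PredicateLink (SRL) annotations
pipeline.initialize()
```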

Extending the Forte pipeline with Document Retrieval

Building forward from the above code, let’s implement Document Retrieval in just 3 steps:

  1. Convert the current DataPack into a MultiPack
  2. Create a search Query
  3. Retrieve relevant documents

Convert the current DataPack into a MultiPack

The above step (Question Understanding) outputs a DataPack, which is good for holding data about one specific piece of text. A complicated pipeline like the one we are building may need multiple DataPacks to be passed along the pipeline, and this is where MultiPack can help. A MultiPack manages a set of DataPacks that can be indexed by their names. MultiPackBoxer is a simple Forte processor that converts a DataPack into a MultiPack, making it the only DataPack inside; the pack's name can be specified via the config.

The following code shows how to get a DataPack from a MultiPack by name.
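Since the real classes require a Forte installation, here is a toy stand-in (not the actual Forte API; the real classes live in forte.data) that illustrates the name-indexed access pattern:

```python
# Toy stand-ins for illustration only.
class DataPack:
    def __init__(self, text=""):
        self.text = text

class MultiPack:
    def __init__(self):
        self._packs = {}

    def add_pack(self, name, pack):
        # MultiPackBoxer performs essentially this step: wrap the
        # single DataPack under a configurable name.
        self._packs[name] = pack

    def get_pack(self, name):
        # DataPacks are retrieved by the name they were boxed under.
        return self._packs[name]

multi_pack = MultiPack()
multi_pack.add_pack("query", DataPack("What medications treat COVID-19?"))
query_pack = multi_pack.get_pack("query")
```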

Create an ElasticSearch query

As mentioned earlier, we use ElasticSearch to index our documents because it is a search engine with fast search responses, perfectly suited for our pipeline.

Prior to this step, all the processors used in the pipeline were provided by Forte, available in the forte-wrappers repo. In this step, we need to build a customized processor called ElasticSearchQueryCreator that, as its name suggests, creates a search query using the features extracted above. Building a Forte processor is as simple as inheriting the proper base class and implementing the required methods for data processing. In our case, we need to create a class that inherits Forte's QueryProcessor and implement a few required methods, described below.

We will break the code down and explain it step by step.

Define a class that inherits Forte’s QueryProcessor

A QueryProcessor in Forte is a special processor that generates a Query entry for the downstream processor.

Implement the following methods required by QueryProcessor

default_configs

As its name suggests, this method is used to define the default configurations for this processor. It is also a good place to document (using docstring) all the configuration fields for this processor.

ElasticSearchQueryCreator requires two ElasticSearch-related configurations, "size" and "field": "size" is the number of results ElasticSearch should return, and "field" is the field to search in the ElasticSearch index you just built. A third configuration is used by Forte to locate the DataPack created in the Question Understanding step.
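A sketch of what this method might return is below; the key names and default values are assumptions based on the description above, not the repo's actual code.

```python
def default_configs():
    # "size": how many hits ElasticSearch should return.
    # "field": which indexed field to search.
    # "query_pack_name": the name under which the question DataPack
    # was boxed into the MultiPack (hypothetical key name).
    return {
        "size": 1000,
        "field": "content",
        "query_pack_name": "query",
    }
```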

_process_query

This is the method that contains the main logic of this processor. The implementation may look lengthy, but the logic is simple.

  1. Get the query DataPack using get_pack
  2. Extract arg0, arg1, verb from the query DataPack
  3. Determine whether the query is asking for arg0 or arg1
  4. Construct a match_phrase ElasticSearch query
  5. Return the pack that stores the query value, along with the constructed query

Note: Steps 1 and 2 are handled by the helper function query_preprocess; the code can be found here.
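The core of steps 2 to 4 can be sketched as a plain function. This is a hypothetical helper, not the repo's actual code: the wh-word check and the "content" field name are assumptions.

```python
def create_match_phrase_query(arg0, arg1, verb, size=1000, field="content"):
    """Build a match_phrase ElasticSearch query from SRL arguments.

    If one argument is a wh-word, it is the answer we are looking for,
    so we search using the other argument together with the verb.
    """
    wh_words = {"who", "what", "which", "where", "when", "how", "why"}
    if arg0.lower() in wh_words:
        # The question asks for arg0; search with arg1 and the verb.
        query_value = f"{arg1} {verb}"
    else:
        # The question asks for arg1 (or neither); search with arg0 and the verb.
        query_value = f"{arg0} {verb}"
    return {"query": {"match_phrase": {field: query_value}}, "size": size}
```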

We can now attach the ElasticSearchQueryCreator to our pipeline.
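Attaching the new processor is one more add call. The config keys below mirror the configurations described above; the exact key names are assumptions.

```python
pipeline.add(
    ElasticSearchQueryCreator(),
    config={
        "size": 1000,                # number of hits to request
        "field": "content",          # indexed field to search
        "query_pack_name": "query",  # name of the question DataPack
    },
)
```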

Try it out yourself!

Retrieve relevant documents from the ElasticSearch database

The last step is to send this query over to our ElasticSearch backend and retrieve the results. Forte provides an ElasticSearchProcessor that does this for us.

Finally, we can run Document Retrieval end-to-end.
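An end-to-end run might look like the sketch below. The module path for ElasticSearchProcessor, the pack name "query", and the example question are all assumptions for illustration.

```python
from fortex.elastic import ElasticSearchProcessor  # module path is an assumption

# Sends the Query to the backend and boxes each hit as a DataPack.
pipeline.add(ElasticSearchProcessor())
pipeline.initialize()

m_pack = pipeline.process("What medications treat COVID-19?")

# Each retrieved document lands in its own DataPack inside the MultiPack.
for name in m_pack.pack_names:
    if name != "query":  # skip the original question pack
        print(name, m_pack.get_pack(name).text[:200])
```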

Looking forward

Document Retrieval helps quickly locate candidate documents that may contain the answer the question is looking for. In the next blog post, we will show how to digest the documents and convert them into concise answers.

Why use Forte?

We hope you liked this guide to building a Q&A pipeline in Forte! We created Forte because data processing is the most expensive step in AI pipelines, and a big part is writing data conversion code to “harmonize” inputs and outputs across different AI models and open-source tools (NLTK and AllenNLP being just two examples out of many). But conversion code needs to be rewritten every time you switch tools and models, which really slows down experimentation! That is why Forte has DataPacks, which are essentially dataframes for NLP — think Pandas but for unstructured text. DataPacks are great because they allow you to easily swap open-source tools and AI models, with minimal code changes. To learn more about DataPacks and other time-saving Forte features, please read our previous blog post (Forte: Building Modular and Re-purposable NLP Pipelines)!

About CASL

CASL provides a unified toolkit for composable, automatic, and scalable machine learning systems, including distributed training, resource-adaptive scheduling, hyperparameter tuning, and compositional model construction. CASL consists of many powerful open-source components that can work in unison or be leveraged individually for specific tasks, providing flexibility and ease of use.

Thanks for reading! Please visit the CASL website to stay up to date on additional CASL and Forte announcements soon: https://www.casl-project.ai. If you’re interested in working professionally on CASL, visit our careers page at Petuum!
