Building a Question Answering System Part 2: Document Retrieval
Authors: The CASL Team
Welcome back to our blog post series on building your own Question Answering system! In the last post (Building a Question Answering System Part 1: Query Understanding in 18 lines of Code), we showed how to implement Question Understanding, the first step of a Q&A system, in just 18 lines of code using Forte. In this post, you will learn to build the Document Retrieval step and connect it with Question Understanding from Part 1.
Recall from the last blog post that Document Retrieval is the step immediately after Question Understanding. It identifies and retrieves a limited number of highly related documents from a large document pool. It relies on the outcome of Question Understanding for information about the intention of the user question and uses that information to form a query to the document pool.
We use a pre-built ElasticSearch index as our document pool, which contains more than 190,000 scientific and medical papers from the CORD-19 dataset. If you’re interested in learning how we built the pre-indexed corpus, our code can be found here. The ElasticSearch backend filters information from a large corpus quickly, which makes it a good fit for corpora with millions of documents.
The task for Document Retrieval is now clear: take the DataPack created by Question Understanding, which contains Token entries (by NLTK and AllenNLP) and PredicateLink entries (by AllenNLP), and convert them into an ElasticSearch query that searches for the documents most relevant to the user question. Follow along and build one yourself!
Re-activate the Python environment from last time
Option 1: Using Conda
conda activate forte_qa
Option 2: Using Python venv
Clone and cd into our repo
git clone https://github.com/petuum/composing_information_system.git
cd composing_information_system
Set PYTHONPATH
export PYTHONPATH=$(pwd):$PYTHONPATH
Build the ElasticSearch index
On another terminal, start an ElasticSearch backend
Follow this guide to download and untar the ElasticSearch binary, then start an ElasticSearch backend.
Switch back to the first terminal and build the index
Option 1 (recommended): using sample data provided in our repo
python examples/pipeline/indexer/cordindexer.py --data-dir sample_data/cord_paper
Option 2: using the complete CORD-19 dataset
Download the CORD-19 dataset from this link and untar the file to get the document_parses/pdf_json folder. Then run:
python examples/pipeline/indexer/cordindexer.py --data-dir <path-to-cord19-dataset>/document_parses/pdf_json
Recap: Question Understanding as a Forte pipeline
In our previous blog post (Building a Question Answering System Part 1: Query Understanding in 18 lines of Code), we introduced the following code for the Question Understanding step (which comes before the Document Retrieval step that we’re building in this post):
Extending the Forte pipeline with Document Retrieval
Building forward from the above code, let’s implement Document Retrieval in just 3 steps:
- Convert the current DataPack into a MultiPack
- Create a search query
- Retrieve relevant documents
Convert the current DataPack into a MultiPack
The above step (Question Understanding) outputs a DataPack, which is good for holding data about one specific piece of text. A complicated pipeline like the one we are building may need multiple DataPacks to be passed along the pipeline, and this is where MultiPack can help. MultiPack manages a set of DataPacks that can be indexed by their names.
MultiPackBoxer is a simple Forte processor that converts a DataPack into a MultiPack by making it the only DataPack in there. A name can be specified via the config.
The following code shows how to get a MultiPack by name.
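To make the boxing idea concrete, here is a minimal sketch in plain Python. SimpleDataPack, SimpleMultiPack, and box are hypothetical stand-ins written for this illustration; they are not Forte classes and only mimic the behavior described above (one DataPack wrapped as the sole, named member of a MultiPack). The pack name "query" and the sample question are assumptions as well.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SimpleDataPack:
    """Hypothetical stand-in for a DataPack: holds one piece of text."""
    text: str


@dataclass
class SimpleMultiPack:
    """Hypothetical stand-in for a MultiPack: DataPacks indexed by name."""
    packs: Dict[str, SimpleDataPack] = field(default_factory=dict)

    def add_pack(self, name: str, pack: SimpleDataPack) -> None:
        # Names act as keys, so each one may be used only once.
        if name in self.packs:
            raise ValueError(f"pack name already in use: {name}")
        self.packs[name] = pack

    def get_pack(self, name: str) -> SimpleDataPack:
        return self.packs[name]


def box(pack: SimpleDataPack, pack_name: str = "query") -> SimpleMultiPack:
    """Wrap a single DataPack as the only member of a new MultiPack."""
    multi_pack = SimpleMultiPack()
    multi_pack.add_pack(pack_name, pack)
    return multi_pack


# Assumed example question and pack name, for illustration only.
question = SimpleDataPack(text="What does covid-19 cause?")
multi_pack = box(question, pack_name="query")
print(multi_pack.get_pack("query").text)
```

Looking a pack up by name is what lets later processors find the question pack even after more packs, such as retrieved documents, are added alongside it.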
Create an ElasticSearch query
As mentioned earlier, we use ElasticSearch to index our documents because it is a search engine with fast search responses, which suits our pipeline well.
Prior to this step, all the processors used in the pipeline were provided by Forte and are available in the forte-wrappers repo. In this step, we need to build a customized processor called ElasticSearchQueryCreator that, as its name suggests, creates a search query using the features extracted above. Building a Forte processor is as simple as inheriting the proper base class and implementing the required methods for data processing. In our case, we need to create a class that inherits Forte’s QueryProcessor and implements some required methods. The implementation is shown below.
We will break down the code sample and explain it step by step.
Define a class that inherits Forte’s QueryProcessor
QueryProcessor in Forte is a special processor that generates a Query entry for the downstream processor.
Implement the following methods required by QueryProcessor
As its name suggests, this method is used to define the default configurations for this processor. It is also a good place to document (using docstring) all the configuration fields for this processor.
ElasticSearchQueryCreator requires two ElasticSearch-related configurations, “size” and “field”, where “size” is the number of results ElasticSearch should return and “field” is the field you want to search in the ElasticSearch index you just built. The third configuration is used by Forte to locate the DataPack created at the Question Understanding step.
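The default-configuration pattern described above can be sketched in plain Python as follows. QueryCreatorSketch is a hypothetical stand-in, not the actual ElasticSearchQueryCreator; the field name "content", the default size of 1000, and the key name query_pack_name for the third configuration are all assumptions chosen for illustration.

```python
from typing import Any, Dict, Optional


class QueryCreatorSketch:
    """Hypothetical stand-in showing the default-config pattern; not Forte code."""

    @classmethod
    def default_configs(cls) -> Dict[str, Any]:
        """Default settings, documented in one place.

        size: number of results ElasticSearch should return (assumed default).
        field: index field to search (assumed name).
        query_pack_name: name of the DataPack holding the question (assumed key).
        """
        return {"size": 1000, "field": "content", "query_pack_name": "query"}

    def __init__(self, overrides: Optional[Dict[str, Any]] = None) -> None:
        overrides = overrides or {}
        # Reject misspelled keys early instead of silently ignoring them.
        unknown = set(overrides) - set(self.default_configs())
        if unknown:
            raise KeyError(f"unknown config fields: {sorted(unknown)}")
        # User-supplied values take precedence over the defaults.
        self.configs = {**self.default_configs(), **overrides}


creator = QueryCreatorSketch({"size": 10})
print(creator.configs)
```

Keeping the defaults in a single method means a user only has to pass the values that differ, and the docstring doubles as documentation for every field.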
This is the method that contains the main logic of this processor. The implementation may look lengthy, but the logic is simple.
- Get the query
- Extract arg0, arg1, verb from the query
- Determine whether the query is asking for arg0 or arg1
- Construct a query
- Return the pack to store query value and the constructed query
Note: Steps 1 and 2 are handled by the helper function query_preprocess; the code can be found here.
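The later steps in the list above (deciding which argument is being asked for, then constructing the query) might look roughly like the sketch below. It is an illustration under stated assumptions, not the repo's implementation: build_search_query is a hypothetical helper, the choice of a match query is ours, and the arg0/arg1/verb values in the example are assigned by hand rather than produced by AllenNLP.

```python
from typing import Any, Dict, Optional


def build_search_query(
    arg0: str,
    arg1: str,
    verb: str,
    is_answer_arg0: Optional[bool],
    field: str = "content",  # assumed index field name
    size: int = 10,          # assumed number of results
) -> Dict[str, Any]:
    """Build an ElasticSearch match query from the parts of a question.

    is_answer_arg0 is True when the question word sits in arg0 (so arg0 is
    the unknown), False when it sits in arg1, and None when undetermined.
    """
    if is_answer_arg0 is None:
        parts = [arg0, verb, arg1]  # keep everything we know
    elif is_answer_arg0:
        parts = [arg1, verb]        # arg0 is the unknown, search on arg1
    else:
        parts = [arg0, verb]        # arg1 is the unknown, search on arg0
    text = " ".join(part for part in parts if part).lower()
    return {"query": {"match": {field: text}}, "size": size}


# Hand-assigned roles for "What does covid-19 cause?" (illustration only).
query = build_search_query("covid-19", "What", "cause", is_answer_arg0=False)
print(query)
# {'query': {'match': {'content': 'covid-19 cause'}}, 'size': 10}
```

Dropping the argument that contains the question word keeps words like "what" out of the search text, so the query matches on the known half of the relation.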
We can now attach the ElasticSearchQueryCreator to our pipeline.
Try it out yourself!
Retrieve relevant documents from the ElasticSearch database.
The last thing is to send this query over to our ElasticSearch backend and retrieve the results. Forte provides an ElasticSearchProcessor that does this for us.
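For readers curious about what comes back from the backend, ElasticSearch returns matches under hits.hits, each with a _score and the indexed fields under _source. The sketch below reads such a response in plain Python, independently of Forte's processor. top_documents is a hypothetical helper, the field name "content" is an assumption, and mock_response is a hand-written placeholder in the standard response shape, not real search output.

```python
from typing import Any, Dict, List, Tuple


def top_documents(
    response: Dict[str, Any], field: str = "content"
) -> List[Tuple[float, str]]:
    """Return (score, text) pairs from a search response, best match first."""
    hits = response.get("hits", {}).get("hits", [])
    pairs = [(hit["_score"], hit["_source"].get(field, "")) for hit in hits]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


# Hand-written mock with placeholder text; not real search results.
mock_response = {
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "hits": [
            {"_id": "a", "_score": 1.5, "_source": {"content": "placeholder text A"}},
            {"_id": "b", "_score": 3.0, "_source": {"content": "placeholder text B"}},
        ],
    }
}

for score, text in top_documents(mock_response):
    print(score, text)
```

The score reflects how well each document matches the query, which is why the "size" configuration effectively keeps only the highest-scoring documents.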
Finally, we can run Document Retrieval end-to-end.
Document Retrieval helps quickly locate candidate documents that contain the answers the question is looking for. In the next blog post, we will show how to digest the documents and convert them into concise answers.
Why use Forte?
We hope you liked this guide to building a Q&A pipeline in Forte! We created Forte because data processing is the most expensive step in AI pipelines, and a big part is writing data conversion code to “harmonize” inputs and outputs across different AI models and open-source tools (NLTK and AllenNLP being just two examples out of many). But conversion code needs to be rewritten every time you switch tools and models, which really slows down experimentation! That is why Forte has DataPacks, which are essentially dataframes for NLP — think Pandas but for unstructured text. DataPacks are great because they allow you to easily swap open-source tools and AI models, with minimal code changes. To learn more about DataPacks and other time-saving Forte features, please read our previous blog post (Forte: Building Modular and Re-purposable NLP Pipelines)!
CASL provides a unified toolkit for composable, automatic, and scalable machine learning systems, including distributed training, resource-adaptive scheduling, hyperparameter tuning, and compositional model construction. CASL consists of many powerful open-source components that were built to work in unison or to be used individually for specific tasks, providing flexibility and ease of use.
Thanks for reading! Please visit the CASL website to stay up to date on CASL and Forte announcements: https://www.casl-project.ai. If you’re interested in working professionally on CASL, visit our careers page at Petuum!