Gain valuable corporate insights from a 10-k report in 5 easy steps with dense passage retrieval (DPR) and extractive question answering (extractiveQA) using deepset.ai’s Haystack

In this article, we are going to help sales account executive Phil extract useful knowledge from corporate annual report PDFs with the help of deep learning, using dense passage retrieval (DPR) and extractive question answering (eQA).

Sebastian
12 min read · Apr 17, 2022

Table of Contents

  • The Setup: A Sales Account Executive’s Approach to Cold Calling
  • Beginning with the End in Mind: Taking a peek at the final result
  • Implementing a tool to auto-extract knowledge from 10-k annual reports in 5 steps
  • The results: investigation and interpretation
  • Next steps: Improvements and Accessibility
  • Summary

The Setup: A Sales Account Executive’s Approach to Cold Calling

Phil works for a large digital corporation as a sales account executive. With thirty big accounts a year, he is very, very careful about whom he engages with. Before he speaks with his accounts, he makes sure he knows everything about their space. Therefore, he reads corporate annual reports (10-k) and looks for certain aspects such as:

Does the target company make digital experiences a strategic initiative?

If they do, then he has a reason to contact someone at that company. Since Phil is calling his contacts live, it’s absolutely vital he is well informed about their business’s strategic initiatives before he ever picks up the phone.

Phil told me that he would benefit most from a tool that could accurately and quickly identify strategic insights from business reports simply by asking the tool a set of key questions.

Phil’s Questions:

An ExtractiveQA model will only be able to find answers to a question if the answer exists in the documents it’s looking at. This is why a yes/no question isn’t ideal. Hence, we’ve rephrased the first question (‘Does the target company make digital experiences a strategic initiative?’) and Phil’s additional questions as:

  • What efforts are being made in regards to digital experiences?
  • What are the strategic priorities?
  • What is the company’s growth?

Let’s find answers to these questions!

The business benefit of DPR and eQA

For the purpose of this tutorial, I downloaded five arbitrary 10-k business report PDFs (Apple, Broadcom, Pfizer, Salesforce, Walmart) to test our setup. These sample reports and all the accompanying code can be viewed in my GitHub repo (incl. requirements.txt for dependencies).

Our five business reports comprise 483 pages with approximately 450 words per page, which results in around 220,000 words. Assuming that an ordinary person reads 300 words per minute, the time it takes to read these reports adds up to roughly 12 hours.
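A quick back-of-the-envelope check of that estimate (plain Python, just to make the numbers transparent):

pages = 483
words_per_page = 450
words_per_minute = 300

total_words = pages * words_per_page                 # 217,350, i.e. roughly 220,000 words
reading_hours = total_words / words_per_minute / 60  # roughly 12 hours of reading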

Yet, an implementation like the one presented in this article can dramatically cut the time it takes Phil to identify and extract key findings.

Beginning with the end in mind

Before we start, let’s have a look at what we will achieve by the end of this article. We will ultimately build a tool to answer Phil’s questions. Here is an example of what the tool provides when we ask the example question “What are the strategic priorities?” against Walmart’s annual report:

Phil may immediately spark a conversation with the CMO of Walmart and discuss how Phil’s digital experience product can enable Walmart to improve their ‘customer-facing initiatives by helping to create a seamless omni-channel experience for their customers.’

Answers to the question ‘What are the strategic priorities?’ (actual answers marked in blue)

[In case you have any issues with installing Haystack, there are some instructions at the end of this article. You can also join the Haystack Slack. Everyone is very helpful and friendly!]

Implementing a tool to auto-extract knowledge from 10-k annual reports

In the following, we take five steps to help Phil get answers to his questions with deepset.ai’s Haystack framework.

In Step 1: We load a sample 10-k report pdf and transform it to text with optical character recognition (OCR).
In Step 2: We preprocess the texts by cleaning them.
In Step 3: We set up the retriever with our requirements for 10-k reports.
In Step 4: We set up the reader with our requirements for 10-k reports.
In Step 5: We use both the retriever and reader to analyze the 10-k report of Walmart.

Before we begin, we have to import our dependencies.

from haystack.utils import convert_files_to_docs, export_answers_to_csv
from haystack.nodes import FARMReader, DensePassageRetriever, PreProcessor, PDFToTextConverter
from haystack.document_stores import FAISSDocumentStore
from haystack.pipelines import ExtractiveQAPipeline
import pandas as pd

First, we convert our PDF reports to text. The function ‘convert_files_to_docs’ reads multiple files (.txt, .pdf or .docx) and parses them to text. Honestly, this is so cool. We can also use the PDFToTextConverter to read in strictly formatted business reports. Let us do this only with the Walmart report to reduce complexity in this article. It’s also not difficult, as you can see below. Kudos to deepset.ai for integrating xpdf!

# we can import all reports in one go with:
# all_docs = convert_files_to_docs(dir_path="reports/")
# in this instance, we only use the report of Walmart, but feel free to use other ones with
converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["en"])
all_docs = [converter.convert(file_path="reports/walmart-10k.pdf", meta=None)[0]]

Next, we preprocess the text of the docs with the PreProcessor, setting a few parameters that should improve the results further down the line. The parameters have the following purposes:

  • clean_empty_lines: This normalizes runs of three or more consecutive empty lines down to two.
  • clean_whitespace: This removes leading and trailing whitespaces in a document part.
  • clean_header_footer: Uses a heuristic to clean headers and footers.
  • split_by: The unit we will split a document by. This can be a word, sentence or passage.
  • split_length: The maximum number of split_by units per resulting document (here, 100 words).
  • split_respect_sentence_boundary: Respect the boundaries of a sentence when splitting.

In their article on parameter tweaking, the authors of Haystack recommend splitting by ‘word’ with a split_length of 100, since the lengths of sentences and passages can fluctuate widely. Additionally, it is suggested to avoid losing context from split sentences by setting split_respect_sentence_boundary to ‘True’. Finally, use the sliding window method (setting split_overlap to a positive integer, e.g. 3) to maintain context for the encoding (mind: this will inflate the corpus).

# Setting our parameters for the preprocessing
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="word",
    split_length=100,
    split_respect_sentence_boundary=True,
    split_overlap=3,
)
# Actual preprocessing
preprocessed_docs = preprocessor.process(all_docs)

Now, we instantiate the document store, and write all pre-processed documents to it.

try:
    # Instantiate the document store
    document_store = FAISSDocumentStore(faiss_index_factory_str="Flat")
    # Save all preprocessed documents to the document store
    document_store.write_documents(preprocessed_docs)
except ValueError:
    # Reset the document store from a previous run, to make sure it is fine :)
    document_store.delete_documents()

At this point, we have our pre-processed documents in the document store. Next, we need a way to retrieve them for analysis with a deep neural net. Retrievers help narrow down the scope for the reader by identifying relevant documents to pass on to it.

This ‘narrowing down’ can be done with:

  • Sparse retrievers: algorithms based on counting the occurrences of words (bag-of-words) or
  • Dense retrievers: use neural network models to create “dense” embedding vectors (embeddings are a type of word representation that allows words with similar meaning to have a similar representation).

We could use a sparse retriever to optimize for speed, but since this is a small example we’ll use a dense retriever in what follows to optimize for quality.
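For comparison, a sparse retriever could be swapped in with very little code. The sketch below uses Haystack’s TfidfRetriever, which builds a bag-of-words index over the documents in the store and needs no embedding step (not used further in this article):

# Sparse alternative: TF-IDF-based retrieval instead of dense embeddings
from haystack.nodes import TfidfRetriever
sparse_retriever = TfidfRetriever(document_store=document_store)

Here, however, we stick with dense passage retrieval: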

# We set up our retriever with the preferred parameters.
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    max_seq_len_query=64,
    max_seq_len_passage=256,
    batch_size=16,
    use_gpu=True,
    embed_title=True,
    use_fast_tokenizers=True,
)
# We calculate the embeddings for all of our documents in the document store
document_store.update_embeddings(retriever)

Next, we need a Reader. A reader scans the texts given by retrievers in detail and extracts the best answers. Readers are based on deep learning models. A reader with its QA model is the most expensive component of a question-answering pipeline — both computationally and time-wise. Hence, it makes sense to qualify whether a query is given as keywords (“digital initiative”) or as phrased questions (“What are the digital initiatives?”). In the case of keyword queries, you might want to perform a regular document search instead of employing a full semantic search pipeline as it is faster.

Since we have fully phrased questions, we use a reader with a QA model:

# After having initialized our retriever, we initialize our reader
reader = FARMReader(
    model_name_or_path="deepset/roberta-base-squad2", use_gpu=False
)

Finally, we put our building blocks together with a Haystack pipeline. We use an ExtractiveQAPipeline that combines a retriever and a reader to answer our questions. Further pipelines can be found here. The Python instruction is as simple as:

pipe = ExtractiveQAPipeline(reader, retriever)
pipe.draw()  # saves a diagram of our pipeline as an image
Our extractiveQA pipeline

We are almost done! Now, we bring in Phil’s questions so that our reader can use them to search our documents. Mind that extractiveQA does well at searching for answers in large text corpora. Hence, questions should be phrased with question words (“What are efforts regarding digital experiences?”) rather than as yes/no questions (“Are there efforts for digital experiences?”). If a question is phrased so that the answer would be yes or no, a deep neural net has a hard time finding the correct answer in the documents (credits & thanks to Tuana from deepset.ai!). Hence, we form the following questions:

questions = [
    "What are efforts regarding digital experiences?",
    "What are the strategic priorities?",
    "What is the company's growth?",
]

We can configure how many candidates the reader and retriever return for our questions. The higher the retriever’s top_k, the better (but also the slower) the answers. As a general rule, you might want to remember:

  • If you are unhappy with the results: increase the retriever’s top_k.
  • If the system is too slow: lower the retriever’s top_k.

Here, we use an arbitrary retriever top_k of 20:

# Run the pipeline for each question and export the reader's answers to a CSV file
for i, question in enumerate(questions):
    prediction = pipe.run(query=question, params={"Retriever": {"top_k": 20}, "Reader": {"top_k": 5}})
    export_answers_to_csv(output_file="answers/result-" + str(i) + ".csv", agg_results=prediction)
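Since pandas is already imported, we can take a quick peek at one of the exported files (a minimal sketch; the exact columns depend on Haystack's CSV export format):

# Inspect the answers exported for the first question
df = pd.read_csv("answers/result-0.csv")
print(df.head())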

The Results

Investigating the results

We take a look at the results, and they are pretty breathtaking. You can go through Walmart’s annual report yourself, and identify the answers. Let us see what our extractiveQA pipeline gave us.

Question 1: What efforts are being made in regards to digital experiences?

Answers to question 1 (actual answers marked in blue)

Question 2: What are strategic priorities?

Answers to question 2 (actual answers marked in blue)

Question 3: What is the company’s growth?

Answers to question 3 (actual answers marked in blue)

Two out of three questions were answered impressively well. The last question was a numeric one and quite tricky, but Haystack still found the right numbers for the international market. This is a solid base to start from and can help Phil contact the CMO of Walmart.

Evaluation of the results

If we had used a deep neural net that we fine-tuned or trained ourselves, this would be a great time to evaluate its performance. Since we used a deep neural net out of the box, this is not strictly needed. If we did evaluate, however, we would set up a test dataset to understand whether our efforts went well, and then evaluate the retriever and the reader separately: for the retriever we would look at the recall score, and for the reader we could use the F1 score and the semantic answer similarity (SAS).
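Haystack pipelines can be scored against labeled examples. The following is only a rough sketch: it assumes a hypothetical list of annotated MultiLabel objects called eval_labels and uses the Pipeline.eval() interface:

# Hypothetical evaluation run against manually annotated question/answer labels
eval_result = pipe.eval(
    labels=eval_labels,  # hypothetical, manually annotated labels
    params={"Retriever": {"top_k": 20}},
    sas_model_name_or_path="cross-encoder/stsb-roberta-large",  # enables SAS
)
metrics = eval_result.calculate_metrics()
print(metrics["Retriever"]["recall_single_hit"])  # retriever recall
print(metrics["Reader"]["f1"])                    # reader F1 score
print(metrics["Reader"]["sas"])                   # semantic answer similarity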

Interpretation of the results

In general, the current results are very good. They are a starting point for looking up specifics in a given text (or extensive report), and the surrounding context allows a human to make sense of them. Personally, I am quite impressed.

The reader is very confident about the results of these questions, as the confidence score is usually above 50% (you can see this if you run the accompanying code from the GitHub repo).
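If you run the code yourself, you can also inspect those scores directly on the prediction object, for example with Haystack’s print_answers helper (a small sketch; ‘prediction’ here is the last result returned by pipe.run above):

from haystack.utils import print_answers

print_answers(prediction, details="all")  # includes the model's score per answer

# or access the scores programmatically
for answer in prediction["answers"]:
    print(f"{answer.score:.2f}  {answer.answer}")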

There are still myriad ways to improve the situation for Phil and to further address his needs.

Next Steps

Avenues for improving our neural search results

  • Use a reader model that is trained or, at least, fine-tuned on 10-k and 10-q financial reports. An example of how to do that can be found in tutorial 2 by deepset.ai (see the sketch after this list).
  • Use knowledge distillation as explained here. We could transfer knowledge from a large model to a smaller model with fewer parameters to perform inference faster. Alternatively, we can improve a model’s prediction quality while retaining its size by first training a larger model and then distilling it into the original, smaller architecture.
  • Use different retrieval algorithms (sparse vs. dense), as done in tutorial 1 (sparse) by deepset.ai and tutorial 6 (dense) by deepset.ai.
  • Since business reports often have tables, we could consider using models that are fine-tuned on tabular data like Google Tapas (as exemplarily done in tutorial 15 by deepset.ai).
  • Further improve the phrasing of the questions by asking the same question in a variety of ways (paraphrasing), or by using a deep neural net that helps with phrasing questions (e.g., via question generation in tutorial 13 by deepset.ai), as explained here.
  • Also, next-generation neural search techniques, as described in understanding semantic search, can help us.
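As a sketch of the first avenue, fine-tuning the reader on SQuAD-style annotations of financial reports could look roughly like this (the data directory and annotation file names below are hypothetical; see tutorial 2 by deepset.ai for the full workflow):

# Fine-tune the reader on (hypothetical) SQuAD-format annotations of 10-k reports
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)
reader.train(
    data_dir="data/10k_annotations",     # folder with the annotation file (hypothetical)
    train_filename="10k-qa-train.json",  # SQuAD-style QA annotations (hypothetical)
    n_epochs=1,
    save_dir="models/roberta-10k",
)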

Avenues for making our neural search accessible for Phil

  • Deploy this service as a microservice to a cloud platform and expose it as a REST API, or quickly build a prototype for Phil to interact with. The latter can be done with Gradio or Streamlit (see the sketch after this list).
  • Annotate the documents with any metadata that Phil may deem important as described in Beyond Vanilla Question Answering.
  • Consider re-ingesting qualified data to further fine-tune the model (data flywheel), to continuously improve the neural net and its outputs.
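As a starting point for the prototype option, a minimal Streamlit app wrapping our pipeline might look like this (a sketch, assuming the ExtractiveQAPipeline built above is available as ‘pipe’ inside the app):

# streamlit_app.py -- minimal prototype UI (run with: streamlit run streamlit_app.py)
import streamlit as st

st.title("10-k Report Q&A")
question = st.text_input("Ask a question about the report")

if question:
    # 'pipe' is the ExtractiveQAPipeline built earlier; in a real app you would
    # construct and cache it once instead of rebuilding it for every request
    prediction = pipe.run(query=question, params={"Retriever": {"top_k": 20}, "Reader": {"top_k": 5}})
    for answer in prediction["answers"]:
        st.markdown(f"**{answer.answer}** (score: {answer.score:.2f})")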

Summary

Based on the business need of sales account executive Phil, we used the Haystack framework from deepset.ai to query the annual financial report of S&P 500 company Walmart with natural-language questions. The results of the DPR-based pipeline are pretty impressive. While this approach scales massively, there are still aspects to mind for the humans in the loop:

  • Based on the answers, the user has to be trained in how to ask questions in order to achieve the desired results.
  • Special care must be taken when defining the context of a question to ensure that the results are relevant.

Since I cannot fully identify what is most useful to Phil, I am not further advancing my ‘question phrasing skills’ to score better in this tutorial. Yet, I am amazed by the results.

Thank you deepset.ai

Thank you @deepset.ai for the wonderful tutorials that I used as a base for developing this business use case.

Deepset.ai is an AI startup from Berlin. They are ‘building a semantic layer for the modern tech stack — driven by the latest NLP and open source.’ I was happy to go through their tutorials since I appreciate how clearly they are written. Special thanks to Tuana for advising me on how to phrase questions, and to Joy for helping me phrase this article.

Thank you for reading this article and for any comments, Seb.

Appendix

Please find below the instructions to run this code locally and to fix issues that occurred on my machine.

## Dependencies to install on macOS with an Apple M1 chip.
## In order to run this script on my local machine, I had to download / install the following dependencies.
## Download & install xpdf tools to process PDF documents with OCR
# !wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-mac-4.03.tar.gz
# !tar -xvf xpdf-tools-mac-4.03.tar.gz && sudo cp xpdf-tools-mac-4.03/bin64/pdftotext /usr/local/bin
## Download & install ocr (optical character recognition to read in pdfs with Python)
# !pip install 'farm-haystack[ocr]' -q
## Download & install FAISS (i.e., Facebook AI Similarity Search, a library that allows developers to quickly search for embeddings of multimedia documents that are similar to each other.)
# !pip install 'farm-haystack[faiss]' -q
## Reduce noisy tokenizer parallelism warnings from the models
# import os
# os.environ["TOKENIZERS_PARALLELISM"] = "false"
## configuration to have nicer printing in pandas
# pd.set_option('display.max_colwidth', None)
## Installing pygraphviz for pipe.draw() on Mac M1
# brew install graphviz
# python -m pip install \
# --global-option=build_ext \
# --global-option="-I$(brew --prefix graphviz)/include/" \
# --global-option="-L$(brew --prefix graphviz)/lib/" \
# pygraphviz
