
Making Neural Search Queries Accessible to Everyone with Gradio — Deploying Haystack’s Semantic Document Search with Hugging Face Models in Three Easy Steps

In this article, we are going to help non-technical users benefit from semantic document search in PDF documents by providing them with a straightforward interface built with Gradio by Hugging Face.

8 min read · Apr 26, 2022



Introduction

Phil is an enterprise account executive at a digital corporation. He faces the problem of efficiently extracting information from corporate annual business reports. In a previous article, we looked at the 10-K report of Walmart Inc. Semantic document search can help Phil tremendously.

Update: Now, you can try the service on Hugging Face Spaces.

Table of Contents

  • A Look at the End Result
  • Making Semantic Search Accessible with a Gradio App in 3 Steps
  • Extensions of the Gradio App
  • Next Steps regarding Gradio & Haystack
  • Summary

Semantic search is powered by deep neural nets and allows you to search documents based on similar phrasing and context (word embeddings). This article by deepset.ai explains the specifics of semantic document search. In it, the authors provide an intuitive illustration of semantic document search as the task of finding the documents most similar to a query:

[Image] Semantic document search: find the most similar document (e.g., paragraphs or sentences) to a query

With deepset.ai’s Haystack framework, we can answer Phil’s questions about a corporation’s situation in a given business report. He uses this knowledge to prepare efficiently for important sales calls.

A Look at the End Result: Our Final User Interface with the Gradio App

In this previous article, we showed how to run semantic document search on PDF documents from Jupyter notebooks. However, there is one caveat: since Phil is not an NLP engineer, we want to make sure that our semantic document search can be used in a convenient way. Therefore, we are going to create a front end that allows Phil to rapidly and securely make use of the tool on his machine. All code for this project is available in this GitHub repository.

How will we achieve this? We will build a Gradio app with semantic document search capabilities. Let us take a look at what our Gradio app will look like once we are done:

[GIF: demo of the finished Gradio app answering questions about an uploaded PDF]

After importing our dependencies, we have to follow only three simple steps to search PDF documents intelligently. In addition to the Haystack dependencies, we need Gradio. The pip requirements for installing these modules can be found in requirements.txt in the accompanying Git repository.

from haystack.nodes import FARMReader, PreProcessor, PDFToTextConverter, DensePassageRetriever
from haystack.nodes import ElasticsearchRetriever
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.pipelines import ExtractiveQAPipeline
from haystack.utils import launch_es
import gradio as gr

Making Semantic Search Accessible with a Gradio App in 3 Steps


In step 1, we introduce a few functions that let us use Haystack within Gradio and optimize our semantic document search for speed.

First, we define our preprocessor. If you are unsure about how to set these parameters, this comprehensive article on parameter tweaking provides guidance.

preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="word",
    split_length=100,
    split_respect_sentence_boundary=True,
    split_overlap=3,
)
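As a quick illustration of what these settings do, the preprocessor splits one long document into overlapping chunks of roughly 100 words. A minimal sketch, assuming Haystack 1.x, where documents can be passed as dictionaries with a content field; the text itself is illustrative:

# Illustrative only: 200 short sentences, about 1,000 words in total.
long_text = " ".join("This is sentence number %d." % i for i in range(200))
chunks = preprocessor.process([{"content": long_text}])
print(len(chunks))             # roughly 10 chunks of ~100 words each
print(chunks[0].content[:60])  # the beginning of the first chunk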

As we have seen in the GIF above, we are able to glance at the results of our questions. However, how can we get an indication that these results are meaningful? Well, we use the confidence of the neural net as a guide. For this, we provide the "score" of any search result. [Additionally, we could also show the context in which the answer was identified by appending "context" to the list of fields.]

def print_answers(results):
    fields = ["answer", "score"]  # add "context" to also show surrounding text
    answers = results["answers"]
    filtered_answers = []
    for ans in answers:
        filtered_ans = {
            field: getattr(ans, field)
            for field in fields
            if getattr(ans, field) is not None
        }
        filtered_answers.append(filtered_ans)
    return filtered_answers

Also, we need a 'run_once' function to ensure that we read and analyze a pdf document only once, when it is uploaded (we will use this function as a decorator later). This way, we do not have to re-process the pdf every single time we ask a question; instead, we can ask multiple questions about the same document.

def run_once(f):
    def wrapper(*args, **kwargs):
        if not wrapper.has_run:
            wrapper.has_run = True
            return f(*args, **kwargs)
    wrapper.has_run = False
    return wrapper
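To see the decorator in action, here is a quick, self-contained illustration (the demo function is hypothetical):

@run_once
def demo():
    print("This runs only once.")

demo()  # prints the message
demo()  # skipped: has_run is already True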

There is little left to make semantic document search work: we initialize our reader, load the document store, and initialize our retriever. To get results fast, we will use a sparse retriever here. The notable differences between sparse and dense retrieval methods are speed and quality: sparse retrieval is fast, while dense retrieval tends to find better candidates at a higher computational cost (we sketch the dense alternative below).

launch_es()
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")
retriever_es = ElasticsearchRetriever(document_store=document_store)
pipe = ExtractiveQAPipeline(reader, retriever_es)
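If result quality matters more than speed, we could swap in the DensePassageRetriever that we imported earlier. A minimal sketch, assuming the default deepset DPR models; note that dense retrieval also requires computing embeddings for the stored documents, so the two follow-up calls stay commented out until a pdf has actually been written to the store:

# Sketch: dense retrieval as a slower, higher-quality alternative.
retriever_dpr = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
)
# After documents have been written (see written_document below):
# document_store.update_embeddings(retriever_dpr)
# pipe = ExtractiveQAPipeline(reader, retriever_dpr)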

As a last step before querying, we need to process the pdf document. We convert the pdf into text chunks, preprocess them (with the parameters that we defined at the beginning of step 1), and write them into the Elasticsearch document store.

@run_once  # use a decorator, as we process the pdf only once
def written_document(pdf_file):
    converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["en"])
    document = [converter.convert(file_path=pdf_file.name, meta=None)[0]]
    preprocessed_docs = preprocessor.process(document)
    document_store.write_documents(preprocessed_docs)
    return None

Now, we use the document chunks from the document store and run queries (i.e., we ask questions) on them. If the current results are too slow, we can get faster answers by lowering the top_k value of the retriever. Think of the retriever as an excavator: it sifts through a large number of documents with quick vector computations to find the best candidate documents for the reader. Think of the reader as a pickaxe: it looks only at those candidates and passes them through a pre-trained transformer model that determines which answer fits the query best.

def predict(question, pdf_file):
    written_document(pdf_file)
    result = pipe.run(query=question, params={"Retriever": {"top_k": 20}, "Reader": {"top_k": 5}})
    answers = print_answers(result)
    return answers
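Before wiring predict() into Gradio, we can sanity-check the pipeline from a plain Python session. A minimal sketch: the stub mimics the uploaded-file object that Gradio passes to our function, and the file path is purely illustrative:

class FileStub:
    """Mimics the uploaded-file object Gradio hands to predict()."""
    def __init__(self, name):
        self.name = name

answers = predict(
    "What are efforts regarding digital experiences?",
    FileStub("walmart_10k.pdf"),  # illustrative path to a local pdf
)
for ans in answers:
    print(ans["answer"], ans["score"])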

Congratulations! We just set up a semantic document search tool. Now, let us make it accessible to Phil! Gradio is amazing: the team behind it created a very straightforward approach to making machine learning models testable. Let us explore some of its very cool features that will be helpful for Phil.


Launching this front end is as easy as (1) plugging in a predict() function, (2) defining the inputs (we use both a text box for the question query and a file uploader for the pdf document to search through), and (3) defining the output, which will be text. That’s it.
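In its barest form, the interface boils down to exactly these three ingredients (a sketch; the extended version with additional features follows below):

iface = gr.Interface(
    fn=predict,  # (1) the prediction function
    inputs=[     # (2) a text box and a pdf uploader
        gr.inputs.Textbox(lines=3, label="Ask an open question!"),
        gr.inputs.File(file_count="single", type="file", label="Upload a pdf"),
    ],
    outputs="text",  # (3) the output is plain text
)
iface.launch()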

Extensions of the Gradio App

Yet, there are some nice additional features, such as a flagging option (to vote on the quality of a result), an interpretation tool (that helps us understand how the answers were formed), and a theme chooser (we are using the dark-grass theme in this instance). The code below shows how to make these amendments.

title = "Search PDF Business Reports with Sparse Passage Retrieval"
iface = gr.Interface(
    fn=predict,
    inputs=[
        gr.inputs.Textbox(lines=3, label="Ask an open question!"),
        gr.inputs.File(file_count="single", type="file", label="Upload a pdf"),
    ],
    outputs="text",
    title=title,
    flagging_options=["top", "medium", "bad"],
    interpretation="default",
    theme="dark-grass",  # "default", "huggingface", "dark-grass", "peach"
)

Yet, Gradio does not stop there. You can run your Gradio app out of Jupyter, in your local browser, or share it via a public link by setting the 'share=True' argument. Oftentimes, it makes sense to prevent your tool from unintended use. For these situations, you can protect your Gradio app with a username and password combination; this is as simple as adding 'auth' with a login name and a password to the interface. We can even enable a queue if we have a lot of server requests.

iface.launch(
    # share=True,
    # auth=("admin", "pass1234"),
    # enable_queue=True
)

If we run this last instruction, then we get the following interface:

[Image: the launched Gradio interface]

That was easy! We just learned how to use one of the best open-source semantic document search solutions and explored Gradio as a means to make our tool instantly usable. Lastly, let us look at a few further steps and summarize our findings.

Now, we can ask Phil’s questions through a front end! When using the tool, mind that extractive QA does well at searching for answers in large text corpora. Hence, questions should be phrased with question words (“What are efforts regarding digital experiences?”) rather than as yes/no questions (“Are there efforts for digital experiences?”). If a question is phrased so that the answer should be yes or no, a deep neural net has a hard time finding the correct answer in the documents (credits & thanks to Tuana from deepset.ai!).

Potential Next Steps:

@Gradio App

  • We can build an actual web app for even a bit more convenience.

@Haystack Semantic Search

Summary

Based on the business needs of enterprise account executive Phil, we used the Haystack framework from deepset.ai to query the annual financial report of S&P 500 company Walmart. Additionally, we rapidly iterated on a front end by using Hugging Face’s Gradio.

While this solution is incredibly fast for developing MVPs and sharing your models, we still have a few aspects to optimize. Also, it is important that users understand that the way they ask questions is crucial.

Thank you deepset.ai & Hugging Face

deepset.ai’s Haystack and Hugging Face’s Gradio are so cool! With these two frameworks, the tedious processing of pdf documents essentially turns into an immediate, context-based search experience. Thanks to deepset.ai, Hugging Face, and Gradio for offering such amazing integrations and possibilities.

Thank you for reading this article and for any comments, Seb.
