Building an end-to-end Question-Answering system for the Hindi language using Haystack
This tutorial walks you through building an NLP-based QA system, covering the full stack from raw txt files to an interactive web app. I'll use Deepset.ai's Haystack to build the QA system for the Hindi language.
Below is the outline of the Question-Answering pipeline we'll follow:
- Brief about QA system
- Haystack Pipeline Building
- Documents preparation & Indexing
- Custom training data preparation & Fine-tuning
- Search pipeline setup with the fine-tuned model
- Inference on a web app using Streamlit
Brief about QA system
Remember passage comprehension exercises: you read a passage, understand it, remember it, and then answer some questions about it. In this traditional approach, it takes a human's cognitive skills to understand the questions and find appropriate answers in the given paragraph.
And wherever cognitive skills come up, AI's role is bound to follow. With the help of NLP (Natural Language Processing) and Information Retrieval, we try to reach that human level of understanding, remembering, and answer-finding accuracy. Such systems are called NLP-based QA systems.
That was just one use-case and a brief overview of question answering systems. If you're interested in reading more about them, you can refer to this.
Haystack Pipeline Building
Haystack is a Python library that proves very convenient and handy for implementing Semantic Search and QA pipelines. I've already covered the Semantic Search part in my previous blog post. To learn more about Haystack, refer to their official docs.
Also, if you don't want to get into the technical implementation details, you can jump straight into my github repo (a colab notebook is included too).
Other Python libraries I'll use:
sentence-transformers: Sentence Transformers is very handy for providing various pretrained transformer-based models to embed a sentence or document. To check out these models (use-case wise), click here.
Streamlit: An open-source app framework that provides the easiest way for data scientists and machine learning engineers to create beautiful, performant apps.
st-annotated-text: To display annotated text in the Streamlit web app.
Document preparation & Indexing: All you need to provide is the directory location where your txt files are stored. In my case, I have around 2k history documents in the Hindi language. Now call the get_data_haystack_format() method; it returns a List[Dict], as required by Haystack.
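The body of get_data_haystack_format() isn't reproduced in the post; here is a minimal sketch of what such a helper might look like, with the dict keys following Haystack 0.x's convention ({"text": ..., "meta": {...}}):

```python
from pathlib import Path
from typing import Dict, List

def get_data_haystack_format(doc_dir: str) -> List[Dict]:
    """Read every .txt file in doc_dir into the List[Dict] shape Haystack
    expects: {"text": <file contents>, "meta": {"name": <filename>}}."""
    docs = []
    for path in sorted(Path(doc_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8").strip()
        if text:  # skip empty files
            docs.append({"text": text, "meta": {"name": path.name}})
    return docs
```

The "name" entry in meta is handy later, because Haystack carries it along with every retrieved answer, so the web app can show which document an answer came from.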
Once we have the data in the aforementioned format, we index it by calling get_haystack_document_store(). I've used Haystack's InMemoryDocumentStore() because I didn't have much data, but if you have a large amount of data, you can use another DocumentStore, e.g. Elasticsearch, FAISS, etc.
Custom training data preparation & Fine-tuning: Since I only had history data (in the Hindi language) in txt files (with questions and answers), I transformed it into SQuAD format in order to fine-tune any base model. I prepared the dataset manually on weekends 😅 and covered very basic question-answer pairs that don't require much reasoning. Below is a sample snippet of the transformed data in the json file:
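For reference, SQuAD-format training data has the shape built below; the Hindi question-answer pair and the document title are illustrative, not taken from the actual dataset:

```python
import json

# Illustrative context/answer pair (not from the real dataset).
context = "ताजमहल का निर्माण शाहजहाँ ने करवाया था।"
answer = "शाहजहाँ"

squad_data = {
    "data": [{
        "title": "history_doc_1",  # hypothetical document title
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": "q1",
                "question": "ताजमहल का निर्माण किसने करवाया?",
                "answers": [{
                    "text": answer,
                    # character offset of the answer inside the context;
                    # readers are trained on these span positions
                    "answer_start": context.index(answer),
                }],
            }],
        }],
    }]
}

with open("train_squad.json", "w", encoding="utf-8") as f:
    json.dump(squad_data, f, ensure_ascii=False, indent=2)
```

The answer_start offset must point at the exact character where the answer text begins in the context, which is why it's computed with context.index() rather than filled in by hand.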
Now that the training json file is ready, Haystack provides a simple & efficient abstraction for the fine-tuning task. You just need to provide three things: the txt files dir (containing only the txt files), the training json file, and the base model type, and Haystack will handle the rest.
#Infra: Google Colab GPU runtime (recommended)
#fine_tuning_time: ~30-45 min
Search pipeline setup with the fine-tuned model: In Haystack, there are 3 main components in a search pipeline: DocumentStore, Retriever, and Reader. There is one other component called Finder, which is just a binding abstraction over the Retriever & Reader.
We're already done with the DocumentStore; now we'll set up the Retriever & Reader. You can think of the Retriever & Reader as filters that pin down the exact results based on whichever method you've chosen. I've used TfidfRetriever() as the Retriever; there are other options like BM25, EmbeddingRetriever, etc. And don't forget to check the DocumentStore-Retriever compatibility.
Also, in my previous blog, I explained semantic search using EmbeddingRetriever in detail. Check it out if you want to experiment with a different retriever.
The docs extracted by the Retriever become the input of the Reader. In the Reader, we leverage the latest big transformer-based language models to understand the query not only syntactically but also semantically, and to pin down the exact results. That's why we use our fine-tuned model here: having already been trained on the custom data, it understands the queries better. After all this, we bind the retriever & reader with the Finder, and the Finder takes care of the in-out flow of results between them.
Here we come to the final component of the pipeline: inference. Logically, we've already created the pipeline; we just need to stitch these components together to make inferences.
Inference on a web app using Streamlit
You'd all agree that Streamlit is the go-to library for ML engineers and data scientists to showcase ML experiments. You just need to plug your inference method into it, and the web app is ready to launch. That's what I did. I also used the st-annotated-text lib to display annotated text in the Streamlit app.
Voila 🎉🎉!!!! You've just built an end-to-end Question-Answering system with your own fine-tuned model using the Haystack lib, and served the results on a web app. Let's check out the results on the web app 😋
That's it. Now you have an end-to-end understanding of building a Question-Answering system with Haystack. So go fire up a colab session, try out different models, and experiment.
Thanks for the read, folks. Have a good day!