Retrieval QA with Prompt — NO API

Retrieval QA | Prompt | Haystack

Published in Data And Beyond · 4 min read · May 23, 2023

Retrieval Question-Answering (QA) extracts answers from a given context, which lets us query our own data directly and get relevant information quickly. Note, however, that Retrieval QA only returns a list of candidate answers that the machine considers likely to be correct; it is still up to us, as humans, to interpret the results. If you're unfamiliar with how to build a Retrieval QA system, I recommend reading the following article first:

Creating AI RetrievalQA using thousands of Industries Standard. — This time, NO API

Python | Haystack| Texts

medium.com

This led me to consider improving the outcome of Retrieval QA by combining it with a prompt. The idea is to pass the distilled passages returned by Retrieval QA to a language model, which then writes a single final answer grounded in that context. Combining the two techniques has the potential to make the generated answer more accurate and more relevant to the query.

If that’s what you are looking for, keep reading…

Installation and Data Preparation:

To begin, let’s set up the environment for our experiment. We’ll install the required libraries, farm-haystack and sentence-transformers, and then import pandas, os, and the Haystack components used below.

!pip install -q farm-haystack sentence_transformers

import os
import re

import pandas as pd
from haystack import Pipeline
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import EmbeddingRetriever, FARMReader
from haystack.nodes import PromptNode
from haystack.nodes import TextConverter, PreProcessor
from haystack.pipelines import ExtractiveQAPipeline

Then load the dataframe. The one used here is a collection of thousands of rolling stock standards, saved as a pickled dataframe:


# Load the dataframe from a pickle file
df = pd.read_pickle('./standards.pkl')
# Drop rows whose text is empty, then keep only the columns used below
df = df[df['text'].str.len() > 0]
df = df[['name', 'text']]

Since the dataframe contains a lot of documents, we might as well filter it by a specific topic. In this case we use “running dynamic”:

#choose topic
import re
keyword = "running dynamic"
filtered_std = df[df['text'].str.contains(keyword, flags=re.IGNORECASE)]
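To see what this filter keeps, here is a minimal sketch of the same case-insensitive substring match on plain Python records. The function name and the sample records are made up for illustration, and the keyword is escaped so that it is matched literally rather than as a regular expression.

```python
import re


def filter_by_keyword(records, keyword):
    """Keep records whose text contains the keyword, ignoring case."""
    # re.escape makes sure characters like "." or "(" in the keyword are matched literally
    pattern = re.compile(re.escape(keyword), flags=re.IGNORECASE)
    return [record for record in records if pattern.search(record["text"])]


# Hypothetical records standing in for rows of the dataframe
records = [
    {"name": "a.pdf", "text": "Requirements for Running Dynamic behaviour"},
    {"name": "b.pdf", "text": "Door systems and passenger access"},
    {"name": "c.pdf", "text": "running dynamics test procedure"},
]
matches = filter_by_keyword(records, "running dynamic")
print([record["name"] for record in matches])
```

Note that a substring match also keeps "running dynamics", which is usually what we want for a topic filter.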

Then save each document to its own text file for later use:


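# Create a folder to store the text files
folder = 'text_files'
doc_dir = folder  # the same folder is read back during indexing
os.makedirs(folder, exist_ok=True)
# Replace the .pdf extension in file names with .txt
# (work on a copy so pandas does not warn about modifying a slice of df,
# and use regex=False so that "." is matched literally)
filtered_std = filtered_std.copy()
filtered_std['name'] = filtered_std['name'].str.replace('.pdf', '.txt', regex=False)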


# Loop through each row of the dataframe
for index, row in filtered_std.iterrows():
    # Get the file name and text for this row
    file_name = row['name']
    text = row['text']
    
    # Create a new text file with the given file name and write the text to it
    file_path = os.path.join(folder, file_name)
    with open(file_path, 'w', encoding='utf-8', errors='ignore') as f:
        f.write(text)
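The same export step can be sketched without pandas. This is only an illustration under my own assumptions: the helper names and sample records are hypothetical, and it swaps the extension only when it is the actual file suffix, rather than replacing every ".pdf" occurrence inside the name.

```python
import tempfile
from pathlib import Path


def to_txt_name(name):
    """Swap a trailing .pdf extension for .txt and leave other names untouched."""
    path = Path(name)
    if path.suffix.lower() == ".pdf":
        return path.with_suffix(".txt").name
    return path.name


def write_documents(records, folder):
    """Write each (name, text) pair to its own UTF-8 text file and return the paths."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    written = []
    for name, text in records:
        target = folder / to_txt_name(name)
        target.write_text(text, encoding="utf-8", errors="ignore")
        written.append(target)
    return written


# Hypothetical sample records, standing in for the rows of filtered_std
sample = [("standard_a.pdf", "Running dynamic behaviour"), ("notes.pdf.PDF", "More text")]
with tempfile.TemporaryDirectory() as tmp:
    paths = write_documents(sample, tmp)
    names = sorted(p.name for p in paths)
print(names)
```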

Indexing and Retrieval:

Next, we need to index our data and configure the retrieval pipeline. We’ll use an InMemoryDocumentStore to store and index our documents, and a custom indexing Pipeline to handle text conversion, preprocessing, and writing the documents to the store.

#defining pipeline
document_store = InMemoryDocumentStore()
indexing_pipeline = Pipeline()
text_converter = TextConverter()
preprocessor = PreProcessor(
    clean_whitespace=True,
    clean_header_footer=True,
    clean_empty_lines=True,
    split_by="word",
    split_length=1000,
    split_overlap=20,
    split_respect_sentence_boundary=True,
)

indexing_pipeline.add_node(component=text_converter, name="TextConverter", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["TextConverter"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["PreProcessor"])
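To get a feel for what split_length and split_overlap mean, here is a simplified sketch of word-based splitting with overlap. It is not Haystack's implementation: the function is my own, and it ignores sentence boundaries, which the real PreProcessor respects when split_respect_sentence_boundary=True.

```python
def split_words(text, split_length=1000, split_overlap=20):
    """Split text into chunks of at most split_length words.

    Each chunk starts with the last split_overlap words of the previous chunk.
    """
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once this chunk has reached the end of the text
        if start + split_length >= len(words):
            break
    return chunks


# Toy input: 25 numbered words, split into chunks of 10 words with an overlap of 2
toy_text = " ".join(f"w{i}" for i in range(25))
chunks = split_words(toy_text, split_length=10, split_overlap=2)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence that straddles two chunks is less likely to lose its context, at the cost of indexing a few words twice.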

Once the pipeline is defined, we use it to ingest the data, then build the retriever and the reader:

#reads all related files
files_to_index = [doc_dir + "/" + f for f in os.listdir(doc_dir)]
indexing_pipeline.run_batch(file_paths=files_to_index)

retriever = EmbeddingRetriever(
    document_store=document_store, embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1"
)
document_store.update_embeddings(retriever)

reader = FARMReader(model_name_or_path="bert-large-uncased-whole-word-masking-finetuned-squad", use_gpu=True)

pipe = ExtractiveQAPipeline(reader, retriever)
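Conceptually, the retriever embeds the query and every document chunk as vectors and returns the chunks with the highest similarity score. The toy sketch below assumes dot-product similarity, which is what the "dot" in the embedding model's name suggests; the three-dimensional vectors and document names are invented purely for illustration.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_dot_product(query_vec, doc_vecs, top_k=2):
    """Return the top_k (doc_id, score) pairs, highest score first."""
    scored = [(doc_id, dot(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up embeddings; real ones come from the sentence-transformers model
toy_docs = {
    "speed_limits": [0.9, 0.1, 0.0],
    "brake_tests": [0.2, 0.8, 0.1],
    "door_systems": [0.0, 0.1, 0.9],
}
toy_query = [1.0, 0.2, 0.0]
top = rank_by_dot_product(toy_query, toy_docs, top_k=2)
print([doc_id for doc_id, _ in top])
```

The reader then only has to look for answer spans inside these top-ranked chunks, which is why the retriever's top_k is usually set higher than the reader's.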

At this point the Retrieval QA pipeline is built and can read the indexed documents and select answers from the given context. In this case, the query is: “vehicle at what speed that must perform dynamic performance test?”

#QnA
k = 5
query = "vehicle at what speed that must perform dynamic performance test?"

prediction = pipe.run(
    query=query,
    params={
        "Retriever": {"top_k": k * 3},
        "Reader": {"top_k": k * 2}
    }
)

However, this only shows the top_k candidate answers according to the machine's understanding. To get a single answer, we pass the contexts of these candidates to a prompt and let a language model generate the final response:


answer_contexts = []
# The reader may return fewer than k answers, so slice instead of indexing with range(k)
for answer in prediction['answers'][:k]:
    answer_context = answer.context.replace('\n', ' ')  # Remove line feeds
    answer_contexts.append(answer_context)
joined_contexts = ' '.join(answer_contexts)

prompt_node = PromptNode(model_name_or_path="google/flan-t5-base", use_gpu=True)
prompt_text = "Consider you are a rolling stock consultant provided with this query: {query} provide answer from the following context: {contexts}. Answer:"
output = prompt_node.prompt(prompt_template=prompt_text, query=query, contexts=joined_contexts)
print(output[0])

This yields a single short answer instead of a list of options:

V adm 60 km/h
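Several of the top answers often come from the same passage, and small models only accept a limited amount of input, so it can help to clean up the contexts before they go into the prompt. The sketch below is one possible way to do that, not part of Haystack: FakeAnswer is a stand-in for Haystack's answer objects (only the context attribute is used), and the helper names, the sample texts, and the max_words limit are my own assumptions.

```python
from collections import namedtuple

# Stand-in for Haystack's answer objects; only .context is used here
FakeAnswer = namedtuple("FakeAnswer", ["answer", "context"])


def build_context(answers, k=5, max_words=300):
    """Join the contexts of the top k answers, dropping duplicates and capping length."""
    seen = set()
    pieces = []
    for ans in answers[:k]:
        # Collapse newlines and repeated spaces; tolerate a missing context
        context = " ".join((ans.context or "").split())
        if context and context not in seen:
            seen.add(context)
            pieces.append(context)
    words = " ".join(pieces).split()
    return " ".join(words[:max_words])


def fill_prompt(template, query, contexts):
    """Fill the {query} and {contexts} placeholders of a prompt template."""
    return template.format(query=query, contexts=contexts)


# Hypothetical answers: the first two share the same passage
answers = [
    FakeAnswer("a", "line one\nline two"),
    FakeAnswer("b", "line one line two"),
    FakeAnswer("c", "other passage"),
]
context = build_context(answers, k=5)
prompt = fill_prompt("Query: {query} Context: {contexts}. Answer:", "q?", context)
print(prompt)
```

The resulting string can then be passed as the contexts argument in place of joined_contexts.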

The application of this combined approach of Retrieval QA with prompts is incredibly versatile and adaptable. By simply changing the context, such as using a dataframe representing a constitution, we can tailor the system to assist legal professionals, providing them with relevant information and insights specific to constitutional law. Similarly, if we replace the dataframe with medical books, the system can be transformed into a valuable tool for doctors and healthcare practitioners, enabling them to access accurate medical knowledge and aiding them in making informed decisions.

The flexibility of this approach allows us to leverage its power in various domains and professions. By adapting the underlying data and context, we can create specialized systems that cater to the specific needs and expertise of different fields. Whether it’s law, medicine, engineering, or any other domain, integrating Retrieval QA with prompts offers the potential to enhance professional expertise and decision-making processes.

Written by bedy kharisma