Creating an AI RetrievalQA Using Thousands of Industry Standards. This Time, NO API

Python | Haystack | Texts

bedy kharisma
Data And Beyond

--

Photo by Peter Kleinau on Unsplash

Question Answering (QA) is a natural language processing task that involves answering questions posed in natural language. A popular approach is retrieval-based QA, where the system retrieves a document or a set of documents relevant to the question and extracts the answer from them. An example of this was discussed in a previous article, see here:

But the previous article used an OpenAI LLM, which consumes a lot of paid tokens. That made me wonder whether there is a way to do the same thing without tokens, i.e., for free. In this article, we will walk through how to build a retrieval-based QA system that uses no tokens, with the Farm-Haystack library in Python.

First, let’s prepare the data. We will use a DataFrame of standards that I have collected and stored as a pickle file, and filter it by a specific keyword. We will then create a folder to store the text files and loop through the filtered DataFrame, writing each row’s text to a separate file in that folder. This will allow us to index the text files and search them for the answer to a given question.

!pip install -q farm-haystack[colab,preprocessing,faiss]

import pandas as pd
import os
import re

# Load the dataframe from a pickle file
df = pd.read_pickle('./standards.pkl')

# Drop rows with empty text and keep only the columns we need
df['num_chars'] = df['text'].apply(len)
df = df[df['num_chars'] != 0]
df = df[['name', 'url', 'text']]

# Choose topic
keyword = "running dynamic"

# Filter rows whose text mentions the keyword (case-insensitive);
# .copy() avoids pandas' SettingWithCopyWarning on the assignment below
filtered_std = df[df['text'].str.contains(keyword, flags=re.IGNORECASE)].copy()

# Create a folder to store the text files
folder = 'text_files'
doc_dir = 'text_files'

if not os.path.exists(folder):
    os.mkdir(folder)

# Replace file extensions in file names (plain string match, not regex)
filtered_std['name'] = filtered_std['name'].str.replace('.pdf', '.txt', regex=False)

# Loop through each row of the dataframe
for index, row in filtered_std.iterrows():
    # Get the file name and text for this row
    file_name = row['name']
    text = row['text']

    # Create a new text file with the given file name and write the text to it
    file_path = os.path.join(folder, file_name)
    with open(file_path, 'w', errors='ignore') as f:
        f.write(text)
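
Before indexing, it is worth confirming that the files actually landed on disk. A quick check (plain Python, purely illustrative):

# Quick sanity check: how many text files did we write?
written = [f for f in os.listdir(folder) if f.endswith('.txt')]
print(f"Wrote {len(written)} text files to '{folder}'")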

Next, we will create a pipeline for indexing the text files and a retriever for searching for the relevant documents. We will use a FAISSDocumentStore for indexing and an EmbeddingRetriever for retrieval. The EmbeddingRetriever uses a pre-trained sentence-transformers model to encode the text.

from haystack import Pipeline
from haystack.nodes import TextConverter, PreProcessor
from haystack.document_stores import FAISSDocumentStore

document_store = FAISSDocumentStore(faiss_index_factory_str="Flat")

indexing_pipeline = Pipeline()
text_converter = TextConverter()
preprocessor = PreProcessor(
    clean_whitespace=True,
    clean_header_footer=True,
    clean_empty_lines=True,
    split_by="word",
    split_length=1000,
    split_overlap=20,
    split_respect_sentence_boundary=True,
)

indexing_pipeline.add_node(component=text_converter, name="TextConverter", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["TextConverter"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["PreProcessor"])

# Convert and index every file in the folder
files_to_index = [os.path.join(doc_dir, f) for f in os.listdir(doc_dir)]
indexing_pipeline.run_batch(file_paths=files_to_index)
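
At this point the document store should be populated. As a quick check, Haystack’s document stores expose count methods (note that get_embedding_count() will stay at zero until update_embeddings() is called below):

# How many passages were indexed, and how many already have embeddings?
print(document_store.get_document_count())   # indexed passages
print(document_store.get_embedding_count())  # 0 until update_embeddings() runs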

from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
)

# Compute and store embeddings for all indexed documents
document_store.update_embeddings(retriever)
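
Before wiring up the full QA pipeline, the retriever can be sanity-checked on its own. A minimal sketch, assuming the indexing above succeeded (the 'name' metadata field is populated from the source files, though this can vary by Haystack version):

# Fetch the top 3 passages for a trial query and peek at them
candidate_docs = retriever.retrieve(query="dynamic performance test", top_k=3)
for doc in candidate_docs:
    print(doc.meta.get('name', 'unknown'), '->', doc.content[:100])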

With the retriever in place and the embeddings updated via document_store.update_embeddings(), we will now create a FARMReader object: a reading-comprehension model that scans the retrieved documents and extracts the text spans most likely to answer the question. Here we use a BERT-large model fine-tuned on SQuAD.

from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="bert-large-uncased-whole-word-masking-finetuned-squad", use_gpu=True)
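
The reader can also be exercised in isolation on a few retrieved passages before building the full pipeline; a minimal sketch reusing the retriever from above:

# Run the reader directly over a handful of retrieved documents
trial_query = "What is the maximum test speed?"
docs = retriever.retrieve(query=trial_query, top_k=3)
result = reader.predict(query=trial_query, documents=docs, top_k=1)
print(result['answers'][0].answer if result['answers'] else "no answer found")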

Finally, we will combine the reader and retriever in an ExtractiveQAPipeline, which retrieves the relevant documents and extracts answer spans from them with the reading-comprehension model. We will print the top k=5 answers for a sample question.

from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader, retriever)

k = 5
prediction = pipe.run(
    query="What is the maximum speed that must perform dynamic performance test?",
    params={
        "Retriever": {"top_k": k * 3},
        "Reader": {"top_k": k * 2},
    },
)

for i in range(k):
    answer_text = prediction['answers'][i].answer
    answer_score = prediction['answers'][i].score
    answer_context = prediction['answers'][i].context
    print(f"Answer {i+1}:")
    print(f"Answer text: {answer_text}")
    print(f"Answer score: {answer_score}")
    print(f"Answer context: {answer_context}\n")

Running the script gives an output showing the top answers for the query. The desired answer may not always be ranked first, but the pipeline surfaces a range of candidates that can be refined through further iterations of the query. Because indexing and inference run locally on FAISS and sentence-transformers models, the pipeline retrieves relevant information from a large corpus of documents without consuming any paid API tokens, while still producing accurate and relevant answers to complex questions.

Answer 1:
Answer text: service speed +10%)
Answer score: 0.7130953073501587
Answer context: lable test results to allow model validation: maximum test speed (service speed +10%) has been tested over track of a suitable length and quality to d

Answer 2:
Answer text: 120 km/h
Answer score: 0.5713462233543396
Answer context: ent above lower limit of target test range TL90 in speed range 80 < V 120 km/h, see M.4 with Table M.3 j Possibly exclude exceptional track sections

Answer 3:
Answer text: 10 km/h
Answer score: 0.5554128289222717
Answer context: The speed shall be constant and not exceed 10 km/h. Tests shall be carried out successfully a minimum of 3 times. 6.1.5.1.3 Track features Figure 3

Answer 4:
Answer text: TQ = TL50
Answer score: 0.39110568165779114
Answer context: get = 1,1 I adm; I target = 1,0 I adm for quasi-static quantities; TQ = TL50 (120 km/h); very small radius curves (analogous to test zone 4): R

Answer 5:
Answer text: 60 km/h
Answer score: 0.3750748038291931
Answer context: of the present Clause 7. Vehicles with maximum admissible speed V adm 60 km/h are granted dispensation from dynamic performance assessment. 7.2 Choi
