Creating an AI trained on thousands of industry standards

Python | Langchain | Dataframe

bedy kharisma · Published in Coinmonks · 4 min read · Apr 29, 2023


Creating an AI trained on thousands of rolling stock standards may sound like a daunting task, but with the power of Python and the Langchain library, it can be accomplished in just a few lines of code. Here, we will show two different methods: load_qa_chain (fewer lines of code, but it consumes a lot of tokens) and RetrievalQA (more code, but it uses fewer tokens). The initial steps are the same for both:

Setting up the environment

First, let’s set up our environment by installing the necessary packages and importing the required modules. We will be using the Langchain and OpenAI modules, as well as other helpful tools for text processing and indexing. Here are the lines of code to do this:

!pip install -q langchain openai chromadb tiktoken pypdf

import os
os.environ["OPENAI_API_KEY"] = "your openai API key here"

from langchain.chains import RetrievalQA  # for retrieval QA
from langchain.chains.question_answering import load_qa_chain  # for load QA
from langchain.indexes import VectorstoreIndexCreator  # retrieval QA via a vector-store index creator; less code and fewer tokens, so arguably the neatest option

from langchain.document_loaders import TextLoader, PyPDFLoader, DataFrameLoader  # loaders for different types of data
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter  # to split documents into chunks

from langchain.embeddings import OpenAIEmbeddings  # embeddings
from langchain.vectorstores import Chroma  # to build a vector database of chunks

from langchain.llms import OpenAI  # the Large Language Model

import pandas as pd
import numpy as np

Next, we need to load our database of rolling stock standards. We extracted data from thousands of standards stored in Google Drive, following my previous article here:

We’ll be using a Pandas DataFrame to store the standards and filter them based on a keyword. Here’s the code to do this:

# Load the pre-extracted standards and drop rows with empty text
df = pd.read_pickle('./standards.pkl')
df['num_chars'] = df['text'].apply(lambda x: len(x))
df = df[df['num_chars'] != 0]

# Choose a topic
keyword = "bogie"

# Filter by keyword
filtered_std = df[df['text'].str.contains(keyword)]
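Depending on how the keyword appears in the source documents, a case-insensitive match may catch more standards. This optional variation on the filter above also prints how many standards matched:

# Optional: case-insensitive match, skipping rows with missing text
filtered_std = df[df['text'].str.contains(keyword, case=False, na=False)]
print(f"{len(filtered_std)} standards mention '{keyword}'")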

With our standards database loaded and filtered, we can now create a document loader to load the standards into our QA model. We’ll be using Langchain’s DataFrameLoader for this task. Here’s the code to do this:

loader = DataFrameLoader(filtered_std, page_content_column="name")

There are several ways to load data, depending on the type of data you have, so the script may differ from one data type to another. Since my data is stored in a pandas DataFrame, I use the DataFrameLoader; the page_content_column argument tells the loader which column to use as the page content (the document “title” here), while the remaining columns are kept as metadata. For more ways to load data, check here. A quick look at what the loader produces is shown below.
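As a sanity check, this is roughly what the loaded documents look like: Document objects whose page_content comes from the “name” column, with the other DataFrame columns (such as “text” and “num_chars” above) stored as metadata.

docs = loader.load()
print(len(docs))                # number of standards loaded
print(docs[0].page_content)     # the "name" column of the first standard
print(docs[0].metadata.keys())  # the remaining columns, e.g. "text", "num_chars"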

LOAD QA CHAIN

Now that we have our document loader, we can load our QA chain and run a query. We’ll be using Langchain’s load_qa_chain function to load our QA chain and the OpenAI module as our language model. In a QA chain we have the option to choose the chain type: stuff, map_reduce, refine, or map_rerank; for an explanation, see here.

Here’s the code to do this:

chain = load_qa_chain(llm=OpenAI(), chain_type="map_reduce")
query ="what is a bogie?"
chain.run(input_documents=loader.load(), question=query)

Running that code gives us the answer:

‘ A bogie is a chassis or framework carrying wheels, attached to a vehicle, thus serving as a modular subassembly of wheels and axles, or a wheeled wagon or trolley, usually mounted in pairs, that supports a railway vehicle body, and allows for swiveling movement relative to the main rails.’
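The chain_type matters because it decides how the documents are fed to the model. As a rough sketch: the “stuff” variant simply concatenates all the documents into a single prompt, so it is only viable when the selected documents fit in the model’s context window, whereas map_reduce (used above) answers per document and then combines the answers. The document slice below is just an illustration of keeping the prompt small:

chain = load_qa_chain(llm=OpenAI(), chain_type="stuff")   # everything goes into one prompt
chain.run(input_documents=loader.load()[:3], question=query)  # e.g. limit to a few documents to stay within the token limit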

RETRIEVAL QA CHAIN

Next, we’ll be using Langchain’s RetrievalQA module to create a retrieval-based QA system. First, we’ll split our documents into smaller chunks using Langchain’s RecursiveCharacterTextSplitter. Then, we’ll create embeddings using Langchain’s OpenAIEmbeddings and build an index using Langchain’s Chroma module. There are other embeddings you could use; for more options, see here. Finally, we’ll create a retriever from our index using the as_retriever method. Here’s the code to do this:

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=10000,      # characters per chunk
    chunk_overlap=200,     # overlap between consecutive chunks
    length_function=len,
)
texts = text_splitter.split_documents(loader.load())

embeddings = OpenAIEmbeddings(model="ada")

db = Chroma.from_documents(texts, embeddings)

retriever = db.as_retriever(search_type="similarity", search_kwargs={"k":5})
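If you prefer not to send the chunks to OpenAI for embedding, a local model is a drop-in replacement at this step. A minimal sketch, assuming the sentence-transformers package is installed; the model name below is just one common choice, not part of the original setup:

from langchain.embeddings import HuggingFaceEmbeddings

# Local, free alternative to the OpenAI embeddings above; requires `pip install sentence-transformers`
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = Chroma.from_documents(texts, embeddings)
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 5})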

Now that we have our retriever, we can create our QA system using Langchain’s RetrievalQA module. We’ll pass in our OpenAI language model and our retriever. Here’s the code to do this:

qa = RetrievalQA.from_chain_type(
    llm=OpenAI(), chain_type="map_reduce", retriever=retriever, return_source_documents=False)

Here, we can send a query and ask for an answer as follows:

query = "what are the parameters of a bogie that has to be complied?"
result = qa({"query": query})
result['result']

Running the code, we get the following answer:

‘ The bogie has to comply with the following parameters: wheelbase, wheel spacing, gauge, overhang, wheel diameter and width, swing radius and bogie centre distance, width, radius of curvature and distance between the bogie pivots, width, axle base, wheelbase, minimum overhang, maximum overhang, distance between bogie pivot centres, height of bogie frame above the rail surface, height of bogie frame above the top of rail, height of bogie frame below the bottom of rail, height of bogie frame below the rail surface, minimum distance between the rail surface and the lower part of the bogie frame, minimum distance between the rail surface and the upper part of the bogie frame, minimum distance between the bogie frame and the lower flange of the rail, maximum mass of 10 tonnes, maximum width of 2.7m, maximum height of 3.7m and maximum length of 9m.’
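To check which standards an answer actually came from, the same chain can be asked to return its sources. A small sketch, assuming the chain is rebuilt with return_source_documents=True:

qa = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="map_reduce",
    retriever=retriever,
    return_source_documents=True,  # also return the retrieved chunks
)
result = qa({"query": query})
print(result['result'])
for doc in result['source_documents']:
    print(doc.page_content)  # here, the "name" column of each matching standard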

So, what do you think? Imagine if this AI were applied to other sectors such as law or medicine. The possibilities are limitless.
