Using LangChain and Pinecone to chat with data

Apil Adhikari
4 min read · Jan 3, 2024

Students, researchers, and AI developers will find this blog useful.

LangChain is a framework designed to simplify the creation of applications that use large language models, and Pinecone is a vector database built for vector search. Combining these two tools with OpenAI’s API and Streamlit, we can build a simple chatbot that answers questions about the data we feed into it.

The data fed into this kind of bot is generally documents: PDFs, README files, plain-text files, and so on.

Now, let’s start coding:

First, let’s install the packages we’ll be using.

!pip install openai langchain tiktoken pypdf unstructured[local-inference] streamlit pinecone-client

We’ll start by importing the necessary packages/libraries.

# Importing necessary libraries
import os
import streamlit as st
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chains import ConversationalRetrievalChain
from langchain.document_loaders import DirectoryLoader
from langchain.llms import OpenAI
import pinecone

Let’s now set up our OpenAI API key as an environment variable using the os.environ object.

# Setting up OpenAI API key
os.environ['OPENAI_API_KEY'] = "sk-...APikey"
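
Hardcoding the key is fine for a quick experiment, but for anything you share it’s safer to export the key in your shell before launching the app and read it from the environment. A minimal sketch of that alternative:

# Safer: read the key exported in your shell, e.g. export OPENAI_API_KEY="sk-..."
import os
assert os.environ.get('OPENAI_API_KEY'), "OPENAI_API_KEY is not set"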

Now, let’s set up a document loader to load our files.

# "glob" specifies a pattern of file names/paths to load
pdf_loader = DirectoryLoader('path/to/your/directory', glob="**/*.pdf")
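
We mentioned text files earlier; DirectoryLoader handles those the same way, only the glob pattern changes. A quick sketch (txt_loader is just an illustrative name):

# Optional: a second loader for plain-text files in the same directory
txt_loader = DirectoryLoader('path/to/your/directory', glob="**/*.txt")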

Now, let’s gather our loaders in a list and use them to load the documents.

# Gather all loaders; add txt_loader here if you created one
loaders = [pdf_loader]

documents = []
for loader in loaders:
    documents.extend(loader.load())

Our documents may be too big to process in one piece, which can cause problems down the line (such as exceeding the model’s context window), so let’s split them into chunks:

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=40)
documents = text_splitter.split_documents(documents)

In the snippet above, the chunk_overlap parameter makes consecutive chunks share characters, so context isn’t cut off abruptly at chunk boundaries. For example:

from langchain.text_splitter import RecursiveCharacterTextSplitter

text1 = 'abcdefghijklmnopqrstuvwxyzabcdefg'
text_split = RecursiveCharacterTextSplitter(chunk_size=26, chunk_overlap=4)
splitted_text = text_split.split_text(text1)  # split_text works on raw strings
print(splitted_text)
# Output: ['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']

Notice that the second chunk starts with the last four characters of the first ('wxyz'): that is the overlap at work. The two splitters are not quite interchangeable, though. CharacterTextSplitter splits on a single separator (by default "\n\n"), while RecursiveCharacterTextSplitter tries a hierarchy of separators, falling back to finer-grained ones until each chunk fits, which makes it the usual recommendation for generic text.
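
If you’d rather use the recursive splitter in the main pipeline, it is a drop-in replacement. A minimal sketch with its default separator hierarchy spelled out:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Tries each separator in order, recursing to finer-grained ones
# only when a piece is still larger than chunk_size
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=40,
    separators=["\n\n", "\n", " ", ""],
)
documents = text_splitter.split_documents(documents)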

Now, let’s create embeddings for our data and initialize Pinecone with an API key and environment. You can visit app.pinecone.io to create both.

# Creating embeddings
embeddings = OpenAIEmbeddings()
# Initializing Pinecone with the correct API key and environment
pinecone.init(api_key='YOURAPI', environment='YOURENV')
# Define the index name
index_name = "chatbot-application"
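
One caveat: Pinecone.from_documents (used below) expects the index to already exist. If you haven’t created it in the Pinecone console, you can create it from code; a sketch assuming the v2 pinecone-client initialized above, where 1536 matches the dimension of OpenAI’s text-embedding-ada-002 embeddings:

# Create the index on first run if it doesn't exist yet
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=1536, metric="cosine")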

We’re now approaching the final steps of this project. Let’s create a vector store and set up a QA chain.

# Create vector store using Pinecone
vectorstore = Pinecone.from_documents(documents, embeddings, index_name=index_name)

# Set up QA chain
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 2})
qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0), retriever)
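
Before wiring up the UI, you can sanity-check the chain directly (the test question here is just an example):

# Ask a test question with an empty chat history
result = qa({"question": "What are these documents about?", "chat_history": []})
print(result["answer"])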

Finally, let’s make a simple UI using Streamlit to run our chatbot.

# Streamlit chat UI
st.write("Chat with your data. Type 'exit' to stop")

# Streamlit reruns the whole script on every interaction, so keep the
# chat history in session state rather than a plain list
if 'chat_history' not in st.session_state:
    st.session_state.chat_history = []

# Check if there is any user input
user_input = st.text_input('Please enter your question:')
if user_input and user_input.lower() != 'exit':
    result = qa({"question": user_input, "chat_history": st.session_state.chat_history})
    st.session_state.chat_history.append((user_input, result['answer']))

    st.write(f'User: {user_input}')
    st.write(f'Chatbot: {result["answer"]}')

Run the app from the terminal with:

streamlit run your_file_name.py

Running this command prints a localhost URL; open it in your browser to see the chat UI.

You can now ask questions about your documents and get answers grounded in them through this UI. Congratulations! You have built a chatbot that answers from your own data.

Conclusion:

1. Tool Overview: LangChain simplifies large language model applications, while Pinecone serves as a vector database for efficient vector search.

2. Coding Setup: Begin by installing necessary packages and importing libraries for the project.

3. API Key Setup: Configure the OpenAI API key as an environment variable using `os.environ`.

4. Document Loading: Set up a document loader to handle files like PDFs and text files.

5. Document Splitting: Large documents are split into chunks using `CharacterTextSplitter` with parameters for chunk size and overlap.

6. Embeddings and Pinecone Setup: Create embeddings for data using OpenAI and initialize Pinecone with API key and environment.

7. Vector Store and QA Chain: Establish a vector store using Pinecone and set up a Question-Answer (QA) chain for the chatbot.

8. Streamlit UI: Finally, create a simple UI with Streamlit for users to interact with the chatbot on a local server.

9. Execution: The code is run from the terminal, providing a link to a localhost server where users can ask questions about their documents.
