4 Simple Steps to Develop a WhatsApp Support Chatbot (Using LLMs, OpenAI & Python)

Aidan Kelly
Data science at Nesta
9 min read · Feb 13, 2024

In this blog, we explore how to develop a WhatsApp chatbot powered by a large language model (LLM) that can help people easily access information within support manuals to deal with on-the-job queries.

We cover the following processes involved in creating a chatbot that can engage with large volumes of specific information:

  • text chunking
  • vector embeddings
  • creating a vector database
  • building and querying the chatbot

Background

Nesta is a UK innovation agency focused on social good, working towards three goals — equality in early years, healthier lives and sustainability. One of our core ambitions is to slash household carbon emissions by 30% by 2030, with a big focus on accelerating the adoption of heat pumps. In the UK, this is still a relatively niche heating method for family homes, so an important part of our work involves helping gas heating engineers to work with this technology. Often newly trained heat pump installers lack confidence in taking on their first jobs after training, and this can discourage them from developing their work in the sector. To improve confidence, Nesta’s data scientists have been experimenting with chatbots to see how to enable installers to easily access support materials that can help them whilst on the job.

The Challenge

We wanted to develop a straightforward WhatsApp chatbot prototype that integrated an LLM with a database of heat pump installation information.

In this early proof of concept stage, we used a single heat pump installation manual, but the prototype has the scope to be expanded with further technical support material, such as textbooks on heating and cooling systems, to improve the accuracy, usefulness, and relevance of responses.

Whilst we use our project as a case study, this prototype approach is applicable to anyone attempting to build a chatbot to access discrete support materials such as manuals, training guides etc.

The Case Study

In this use case we demonstrate how to build a simple WhatsApp chatbot prototype using a large language model interfaced to a vector database that can answer questions about the NIBE F2040 heat pump.

This blog focusses on creating the mechanism for accessing the support information — that is the LLM and Retrieval Augmented Generation (RAG) aspects of the chatbot prototype using Python packages from OpenAI, LangChain and Pinecone.

For those who want to know how to actually integrate the LLM with WhatsApp, I highly recommend the comprehensive guide by my Nesta colleague: “Combining WhatsApp with large language models: prototyping with Twilio and Flask”.
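To give a flavour of how the two halves connect, here is a minimal, hypothetical sketch of a Flask webhook that Twilio could call for each incoming WhatsApp message; the `qa` object is the retrieval chain we build in Step 4, and the route name and port are illustrative. The linked guide covers the real setup in detail.

# Hypothetical sketch: a Flask webhook that Twilio calls for incoming WhatsApp messages.
# Assumes `qa` is the RetrievalQA chain created in Step 4 of this blog.
from flask import Flask, request
from twilio.twiml.messaging_response import MessagingResponse

app = Flask(__name__)

@app.route("/whatsapp", methods=["POST"])
def whatsapp_reply():
    incoming_question = request.values.get("Body", "")
    answer = qa(incoming_question)["result"]  # Ask the retrieval chain
    twiml = MessagingResponse()
    twiml.message(answer)  # Twilio relays this back to the WhatsApp user
    return str(twiml)

if __name__ == "__main__":
    app.run(port=5000)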

The Chatbot

Before going into detail, let’s see an example of the chatbot in action!

Example of a simple WhatsApp chatbot prototype using a large language model interfaced to a vector database to answer questions about the NIBE F2040 heat pump.

Technical Overview

What follows from here is a technical overview of how to build a RAG chatbot.

Step 1: Text Chunking

The first step in building the chatbot is processing the source text, in this case a heat pump installation manual, which is typically extensive and complex. This is where text chunking comes into play. Chunking breaks down large text documents into more manageable, bite-sized pieces. It's not just about reducing the amount of text, but also about preserving the context in a format that the LLM can then recognise.

To do this we need to implement a process known as ‘sentence tokenisation’.

We leveraged fitz from the PyMuPDF Python library to extract the text from the PDF installation manual, and nltk.sent_tokenize to segment the text into individual sentences. These segmented text chunks are then stored in lists, ready for our next step of embedding.

# Code to extract and chunk text from PDF files in a specified directory
# To run this code, you'll need the NLTK and PyMuPDF (fitz) libraries.
# Install them using: pip install nltk PyMuPDF
import nltk
import fitz  # PyMuPDF
import os

# nltk.download('punkt')  # Uncomment on first run to fetch the sentence tokeniser

pdf_dir = '/Users/user_name/sustainability/installer_chatbot/air2water_HP_installation_guides/'
texts = []
metadata_tags = []  # List to hold metadata tags
dir_contents = os.listdir(pdf_dir)
list_of_pdfs_paths = [os.path.join(pdf_dir, f) for f in dir_contents if f.endswith(".pdf")]

# Extract the text from each PDF
for pdf_file in list_of_pdfs_paths:
    with fitz.open(pdf_file) as pdf_document:
        pdf_text = ''
        for page in pdf_document:
            pdf_text += page.get_text()
        texts.append(pdf_text)
        pdf_file_name = pdf_file.replace(pdf_dir, "")
        metadata_tags.append(pdf_file_name)  # Store the filename as a metadata tag

# Chunk each document's text into individual sentences
chunked_texts = []
chunked_metadata_tags = []
for text, metadata_tag in zip(texts, metadata_tags):
    chunks = nltk.sent_tokenize(text)
    chunk_tags = [metadata_tag] * len(chunks)
    chunked_texts.extend(chunks)
    chunked_metadata_tags.extend(chunk_tags)
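Single sentences can make quite small chunks. If you later find that retrieved answers lack surrounding context, one option (not used in this prototype) is to group consecutive sentences into larger, overlapping chunks. A minimal, hypothetical sketch:

# Hypothetical alternative chunking strategy, not part of the original prototype:
# group consecutive sentences into larger chunks with a one-sentence overlap,
# so that each chunk carries a little of its neighbours' context.
def chunk_sentences(sentences, chunk_size=5, overlap=1):
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(sentences), step):
        chunk = " ".join(sentences[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

# Usage: larger_chunks = chunk_sentences(nltk.sent_tokenize(texts[0]))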

Step 2: Vector Embeddings

The next step is to transform the text chunks into ‘embeddings’ which the LLM can interpret. This is a crucial step in bridging the gap between natural language and the machine’s understanding.

In this example, an embedding process involves converting individual sentences into numerical ‘vectors’. These vectors capture two critical attributes of the text: its meaning and context. This involves leveraging OpenAI’s API and using their text-embedding-ada-002 model to convert the natural language chunks into high-dimensional representations (aka the embeddings).

Embedding models can be optimised for speed or for precision; in this example our model prioritises efficiency and faster performance over more nuanced understanding, a compromise we're happy to make given that we are at the prototyping stage.

In this case the set of objects is the chunks of the installation manual's text, and we use OpenAI's `text-embedding-ada-002` embedding model to turn these chunks into high-dimensional vectors. The image was taken from this Pinecone article.

Creating vector embeddings for text chunks from a PDF is similar to translating a story into a secret code that only computers can understand. Each sentence is transformed into a series of numbers that captures its essence, allowing the computer to see how all the different sentences are related to each other.

These embeddings form the backbone of our vector database. The database enables us to efficiently search and retrieve technical content for our chatbot.
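To make this concrete, here is a small, hypothetical illustration (the two sentences are invented) of how embedded sentences can be compared: related sentences have a cosine similarity closer to 1.

# Hypothetical illustration: embed two sentences and compare them with cosine similarity.
import numpy as np
from openai import OpenAI

client = OpenAI(api_key='<INSERT OPEN API KEY>')

def embed(text):
    response = client.embeddings.create(input=text, model='text-embedding-ada-002')
    return np.array(response.data[0].embedding)

a = embed("Connect the condensation run-off pipe to a drain.")
b = embed("Condensate from the heat pump should be led away to a drain.")
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine similarity: {similarity:.3f}")  # Values closer to 1 indicate more similar meanings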

Creating the Embeddings

First of all, we set some of the initial variables, such as defining our API key from OpenAI, mapping the PDF name of the manual to the relevant website link as well as some counters for tracking progress.

from openai import OpenAI

client = OpenAI(api_key='<INSERT OPEN API KEY>')
map_pdf_to_web_dictionary = {'Nibe_F2040_231844-5.pdf': 'https://www.nibe.eu/assets/documents/16900/231844-5.pdf'}
embeddings = []
metadata_list = []  # This contains our metadata with chunk and source
document_counter = 1  # Starting with the first document
chunk_counter = 0  # Initialise chunk counter
prev_tag = None  # Keep track of the previous tag

The next step is to iterate through the chunked text and embed each of these chunks whilst keeping track of the relevant metadata.

# Loop through the chunked texts and chunked metadata tags created in the previous code snippet.
for text, tag in zip(chunked_texts, chunked_metadata_tags):
    response = client.embeddings.create(input=text,
                                        model='text-embedding-ada-002')  # Use your chosen model ID
    # The API returns a single embedding for each chunk of text, so we access the first element
    embedding = response.data[0].embedding
    embeddings.append(embedding)
    # Increment document counter if the tag changes (indicating a new document)
    if tag != prev_tag:
        document_counter += 1 if prev_tag is not None else 0
        chunk_counter = 0  # Reset chunk counter for a new document
        prev_tag = tag  # Update the previous tag
    # Create the metadata with chunk number and the PDF name from the tag
    metadata = {'chunk': chunk_counter, 'pdf source': tag, 'source': map_pdf_to_web_dictionary[tag], 'text': text}
    metadata_list.append(metadata)
    # Increment the chunk counter
    chunk_counter += 1

To conclude this process, we generate an ID for each of the embeddings and create a Pandas DataFrame consisting of these embeddings, their IDs and the relevant metadata.

import pandas as pd

# Create the IDs based on the document and chunk counters
ids = [f"{document_counter}-{i}" for i in range(len(embeddings))]

# Create a DataFrame of the embeddings which is used for our database later on
hpInstallerEmbeddingsDF = pd.DataFrame({
    'id': ids,  # Use the IDs generated
    'values': embeddings,  # The vector embeddings which have been generated
    'metadata': metadata_list  # All the necessary metadata
})

An example of what the DataFrame with the embeddings looks like.

Step 3: Creating a Vector Database

Next, we need to create a vector database using Pinecone, a cutting-edge vector database service, to store the vectors. Vector databases allow for the efficient storage and retrieval of high-dimensional data, which is essential for our LLM chatbot to understand and respond accurately to a variety of technical queries.

The process of indexing these embeddings in a vector database is analogous to organising a vast library. Just as a librarian categorises books so as to enable quick retrieval, Pinecone is able to index our embeddings to facilitate fast and relevant responses from the installer chatbot. This indexing is a key component in enabling the chatbot to sift through technical information and provide reliable and accurate answers.

Why a vector database, you might ask? The answer lies in its ability to handle similarity searches effectively. Unlike traditional databases, vector databases store similar items close together, so they excel at quickly finding the closest matches in high-dimensional space. This is a key requirement for our chatbot to understand the nuances of natural language.

We can create a free API key with Pinecone and subsequently create an index where we can upload our embeddings to:

import os
import pinecone
import time

# Index name for the heat pump chatbot
hp_chatbot_index_name = 'chatbot-onboarding'

# Set up Pinecone API key and environment
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY') or '<PINECONE API KEY>'
PINECONE_ENVIRONMENT = os.getenv('PINECONE_ENVIRONMENT') or 'gcp-starter'
pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENVIRONMENT
)

# Create a new Pinecone index if it doesn't exist
if hp_chatbot_index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=hp_chatbot_index_name,
        metric='cosine',
        dimension=1536,  # text-embedding-ada-002 produces 1536-dimensional vectors
        metadata_config={'indexed': ['chunk', 'source']}
    )
    time.sleep(1)

# Initialise the Pinecone index
hp_chatbot_index = pinecone.GRPCIndex(hp_chatbot_index_name)

# Upsert data from DataFrame to the Pinecone index
hp_chatbot_index.upsert_from_dataframe(hpInstallerEmbeddingsDF, batch_size=100)
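Before wiring up the full chatbot, you can sanity-check the similarity search by querying the index directly. A minimal sketch, assuming the index created above and the OpenAI `client` from Step 2 (the question text is just an example, and the exact response shape can vary slightly between Pinecone client versions):

# Minimal sketch of a direct similarity search against the index.
query = "How do I deal with condensation run off?"  # Example question
query_embedding = client.embeddings.create(
    input=query,
    model='text-embedding-ada-002'
).data[0].embedding

# Retrieve the three most similar chunks, along with their metadata
results = hp_chatbot_index.query(vector=query_embedding, top_k=3, include_metadata=True)
for match in results.matches:
    print(match.score, match.metadata['text'][:80])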

Step 4: Building and Querying the Chatbot

This is the stage where we create the chatbot. We have processed our technical text and stored the embeddings in a vector database. Now it’s time to bring these elements together to create a chatbot that not only understands the queries but also provides informative and accurate answers.

Imagine a library with countless books, where each book represents a piece of technical information about how to install the NIBE F2040 heat pump. Our previous steps have effectively organised this library, making it easily navigable. Now we are introducing the librarian: our chatbot. This chatbot is not just any librarian; it is able to understand complex technical queries in natural language and knows where to find the answers, in this case about the NIBE F2040 heat pump.

It’s like having a specialist who has thoroughly studied and understood every detail of this specific installation manual. When a question is posed, the chatbot interprets the nuances and intentions, ensuring a contextual understanding of the question.

However, comprehending the query is just the start. The chatbot then employs the vector database — this functions as a swift and accurate search engine, identifying which text embeddings best align with the posed query.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Initialise OpenAI embeddings
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or '<INSERT OPEN API KEY>'
embed = OpenAIEmbeddings(
    model='text-embedding-ada-002',
    openai_api_key=OPENAI_API_KEY
)

# Set up the Pinecone vector store
hp_chatbot_index = pinecone.Index(hp_chatbot_index_name)
vectorstore = Pinecone(hp_chatbot_index, embed.embed_query, "text")

# Set up the chatbot language model.
# gpt-3.5-turbo offers strong language capabilities and fast response times,
# ideal for interactive chatbot applications.
# Temperature is set to 0.5 to keep the chatbot's answers consistent and focused.
llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name='gpt-3.5-turbo',
    temperature=0.5
)

# Create a QA chain with retrieval from the vectorstore.
# With the chain_type "stuff" in LangChain, the system searches the vector store
# for relevant documents; the retrieved text (the "stuff") is then concatenated
# and "stuffed" into the LLM's context window alongside the question.
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

# Example question
question = "Can you tell me how to deal with condensation run off for the NIBE F2040 heat pump?"
answer = qa(question)
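Calling the chain this way returns a dictionary, with the generated answer under the 'result' key. As a small extension (a sketch, not part of the original prototype), you can rebuild the chain with `return_source_documents=True` so replies can cite the manual's web link stored in the metadata:

# The chain returns a dictionary; the generated answer sits under 'result'.
print(answer['result'])

# Sketch: rebuild the chain so it also returns the retrieved chunks, letting the
# chatbot cite its sources (the 'source' metadata field holds the manual's web link).
qa_with_sources = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)
response = qa_with_sources(question)
print(response['result'])
for doc in response['source_documents']:
    print(doc.metadata.get('source'))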

What’s next?

OpenAI, Pinecone and LangChain enabled us to rapidly prototype an LLM application for heat pump installers to answer queries about a specific heat pump.

The next steps for this prototype are more involved prompt engineering, scaling the vector database to include more heat pump installation manuals, and user testing.

This is an early prototype to start exploring the potential and usefulness of generative AI in the sustainability sector. Our hope is that this prototype will inspire others to explore this technology and its huge potential for social good.

While we’re excited about the prospects, we’re also mindful of the challenges and limitations inherent in current generative AI technologies, including accuracy in complex scenarios, potential biases in responses, and the need for continuous monitoring and updates. Acknowledging these challenges is crucial as we strive to improve and refine our approach.

We’ve added the jupyter notebook here and we will be sharing our code on GitHub.
