Pair programming an LLM-app

Set up, develop and debug a RAG

Diogo Traveler
The Deep Hub
12 min read · Jul 17, 2024


As a Data Science and AI teacher, my students often ask, “How do you build Large Language Model (LLM) systems from scratch?” Most tutorials show the final product without explaining the development process. Here, I’ll share the decisions, bugs, and struggles involved.

Whether you’re new to LLMs or want to deepen your knowledge, this article is for you. This is the first in a multipart series that will teach you to set up, test, manage, and deploy an LLM-based app in the cloud. The diagram below outlines what will be covered in this article.

Structure of the LLM system we will build in this article (source: the author)

Please take a moment to like and share the article if you enjoyed it. Follow me to be notified when new articles in the series come out.

0. What is RAG?

LLMs excel at answering questions and powering chatbots, but their knowledge is limited to their training data. Retrieval Augmented Generation (RAG) gives them access to new information, such as private data, by retrieving relevant documents from a database and incorporating them into the LLM prompt. Think of the LLM as a brain and RAG as the act of looking up a book in a library: if the LLM doesn’t know something, it can use RAG to look it up in a database. Unlike googling a question, the LLM is confined to the information already stored in its library.
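To make this concrete, below is a deliberately naive, self-contained sketch of the retrieve-then-prompt loop. It is illustrative only: the two-entry “library” and the word-overlap retrieval are stand-ins for the embeddings and vector database we will build in the rest of the article.

import re

# toy "library" standing in for the vector database we will build later
library = [
    "The Godfather is a 1972 crime film directed by Francis Ford Coppola.",
    "Jaws is a 1975 thriller directed by Steven Spielberg.",
]

def retrieve(question: str) -> str:
    # naive retrieval: pick the snippet sharing the most words with the question
    q_words = set(re.findall(r"\w+", question.lower()))
    return max(library, key=lambda doc: len(q_words & set(re.findall(r"\w+", doc.lower()))))

question = "Who directed Jaws?"
# the augmented prompt (retrieved context + user question) is what goes to the LLM
prompt = f"Context:\n{retrieve(question)}\n\nQuestion: {question}"
print(prompt)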

Click here to open the notebook

1. Loading the data

In this example, we’ll extract Empire magazine’s list of the top 100 movies of all time and feed it into a database. Let me know in the comments if you agree with the list! 😉

Example of a movie review, containing the title, a short description and a link to a full review (source: Empire)
! pip install -q langchain langchain_community tiktoken langchain_openai \
langchain_text_splitters docarray langsmith
import os
from langchain_community.document_loaders import WebBaseLoader

MOVIES_URL = "https://www.empireonline.com/movies/features/best-movies-2/"
# Note: this site allows all user agents to crawl it

1.1 Test connection

Before we write the actual code to fetch web data for our LLM, we need to ensure that the page is accessible. Crawling can be complex, so it’s prudent to verify our access with a simple HTTP request via WebBaseLoader.

# this cell is purely for testing during development (not for the final code)
full_page = WebBaseLoader(MOVIES_URL).load()
print(full_page[0].page_content[:200])
The 100 Best Movies Of All Time | Movies | %%chann

1.2 Get data

We don’t need the entire page’s data but only a small fraction, so bs4.SoupStrainer comes to the rescue. It allows us to extract information from HTML without parsing everything first and then filtering the results. So, we’ll run the WebBaseLoader with the bs_kwargs argument to configure the strainer inside.

from bs4 import SoupStrainer


def is_target_element(elem: str, attrs: dict) -> bool:
    """
    Returns True if the HTML element is one we want to extract.
    """
    # get the movie description
    div_class = "listicleItem_listicle-item__content__Lxn1Y"
    div_mask = (elem == "div" and attrs.get("class") == div_class)
    # get the movie title
    h3_class = "listicleItem_listicle-item__title__BfenH"
    h3_mask = (elem == "h3" and attrs.get("class") == h3_class)
    return div_mask or h3_mask


strainer = SoupStrainer(is_target_element)

movie_scraper = WebBaseLoader(
    MOVIES_URL,
    bs_kwargs={
        "parse_only": strainer
    },
)
movie_reviews_raw = movie_scraper.load()

2. Process data

All the movie reviews come as a single document, but we want to split them up and remove the link to the full review at the end of each block.

import re
from langchain.docstore.document import Document


def split_movies(page: Document) -> list[Document]:
    """
    Split the page into a list of movie reviews.
    """
    page_parts = page.page_content.strip().split("\n")
    names_n_reviews = [p for p in page_parts if not p.startswith("Read")]
    pattern = r'^\d*\)? '
    movie_names = [re.sub(pattern, "", name) for name in names_n_reviews[::2]]
    movie_reviews = [
        f"{name}: {description}"
        for name, description in zip(movie_names, names_n_reviews[1::2])
    ]
    movie_docs = [
        Document(review, metadata={**page.metadata, "rank": i, "name": name})
        for review, i, name in zip(movie_reviews, range(100, 0, -1), movie_names)
    ]
    return movie_docs

movie_reviews = split_movies(movie_reviews_raw[0])
print(f"extracted {len(movie_reviews)}")
movie_reviews[0]
extracted 100

Document(page_content="Reservoir Dogs: Making his uber cool and supremely confident directorial debut, Quentin Tarantino hit audiences with a terrific twist on the heist-gone-wrong thriller. For the most part a single location chamber piece, Reservoir Dogs delights in ricocheting the zing and fizz of its dialogue around its gloriously -and indeed gore-iously) - intense setting, with the majority of the movie's action centring around one long and incredibly bloody death scene. Packing killer lines, killer needledrops, and killer, er, killers too, not only is this a rollicking ride in its own right, but it also set the blueprint for everything we've come to expect from a Tarantino joint. Oh, and by the way: Nice Guy Eddie was shot by Mr. White. Who fired twice. Case closed.", metadata={'source': 'https://www.empireonline.com/movies/features/best-movies-2/', 'rank': 100, 'name': 'Reservoir Dogs'})

3. Set up database

3.1 RAG in a nutshell (part 1)

Our next goal is to make various movie reviews accessible to our Large Language Model. The steps are:

  1. Embedding Selection: Convert text into its numerical representation (embedding) to capture its essence. Similar texts will have similar vectors (e.g., the embedding for “man” is similar to that for “woman” and very different from “meteor”).
  2. Chunking: Split the original text (movie reviews) into smaller pieces before inserting them into the database.
  3. Database Creation and Insertion: Insert the chunks and their corresponding embeddings into a database.
  4. Data Retrieval: After embedding the user query, find the closest vector to the query vector (using cosine similarity). Return the chunks corresponding to the k closest matches (you define k); see the small similarity sketch below.

Then pass the retrieved information to the LLM (more details in part 2 of this article).
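To make step 4 more concrete, here is a small, self-contained sketch of ranking stored chunks by cosine similarity. The three-dimensional vectors and chunk names are made up for illustration; real embeddings, as we will see next, have thousands of dimensions.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cosine of the angle between the two vectors (1.0 = same direction)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.9, 0.1, 0.0])  # pretend embedding of the user query
chunk_vecs = {
    "chunk about a heist thriller": np.array([0.8, 0.2, 0.1]),
    "chunk about a musical": np.array([0.1, 0.9, 0.3]),
}
# keep the k=1 closest chunk to the query
best = max(chunk_vecs, key=lambda name: cosine_similarity(query_vec, chunk_vecs[name]))
print(best)  # -> chunk about a heist thriller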

3.2 Embedding selection

We will create embeddings using OpenAI API’s models. This assumes you already have an OpenAI API key set in your environment as OPENAI_API_KEY. If you don’t have one, you can create an API key as shown here. With the basic free tier you can use all the code in this notebook without concern.

from langchain_openai import OpenAIEmbeddings


# OpenAI has multiple embedding models; this one transforms the text into
# longer vectors (here of length 3072) that carry more information about the
# original text. It is also more expensive and requires more space to store.
EMBEDDING_MODEL_NAME = "text-embedding-3-large"
embeder = OpenAIEmbeddings(model=EMBEDDING_MODEL_NAME)

# testing the embedding
test_embedding = embeder.embed_query("What is 'Hello World'?")
print(test_embedding[:5])
print(f"the model {EMBEDDING_MODEL_NAME} generates embeddings"
f" of length: {len(test_embedding)}")
[-0.015853295102715492, -0.056399740278720856, -0.014421384781599045, 0.019666852429509163, -0.017855048179626465]
the model text-embedding-3-large generates embeddings of length: 3072

3.3 Chunking

By default RecursiveCharacterTextSplitter uses the following separators ["\n\n", "\n", " ", ""] in sequence.

  1. It first tries to create chunks with as many paragraphs as possible without exceeding the chunk_size limit (using “\n\n” as a separator).
  2. If a paragraph exceeds the limit, it then splits based on lines (“\n”).
  3. If a line exceeds the limit, it splits based on words (“ “).
  4. If a word exceeds the limit, it splits by individual characters.

This recursive process stops as soon as a chunk satisfies the size limit; the chunked text is then set aside and the process repeats on the rest of the string.

from langchain_text_splitters import RecursiveCharacterTextSplitter


text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    length_function=len,
)
text_splitter.split_text(movie_reviews[5].page_content)
["Donnie Darko: A high school drama with a time traveling, tangential universe threading, sinister rabbit featuring twist, Richard Kelly's deliberately labyrinthine opus was always destined for cult classic status. A certifiable flop upon its theatrical release, Kelly's film was one of the early beneficiaries of physical media's move to DVD, with the movie gaining a fandom in film obsessives who could pause, play, and skip back and forth through it at will. Any attempt to synopsise the movie is a fool's errand, but there's more than a hint of\xa0It's A Wonderful Life in the way we see Donnie (Jake Gyllenhaal, in a star-making turn) experiencing how the world would be worse off if he survives the jet engine that mysteriously crashes through his bedroom. That the film, with all its heavy themes and brooding atmosphere, manages to eventually land on a note of overwhelming optimism is a testament to Kelly's mercurial moviemaking. A mad world (mad world) Donnie Darko's may be, but it's also one",
"brooding atmosphere, manages to eventually land on a note of overwhelming optimism is a testament to Kelly's mercurial moviemaking. A mad world (mad world) Donnie Darko's may be, but it's also one that continues to beguile and fascinate as new fans find themselves obsessed with uncovering its mysteries."]

3.4 Database: creation and insertion

The index helps us manage our vector database, so named because data is retrieved by comparing the vector (embedding) of the query to the vectors (embeddings) of the chunks in the database. While there are many vector databases available, we’ll keep it simple for this example. Given the small dataset, we can store all the data in memory using DocArrayInMemorySearch.

from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch


# from_documents is the method that inserts our list of documents into the DB
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embeder,
    text_splitter=text_splitter,
).from_documents(movie_reviews)

3.5 Data Retrieval

Create a retriever from the vector DB

retriever = index.vectorstore.as_retriever()
# this cell is purely to test that everything worked as planned; don't put it
# in the final code

# find the closest matches to the query
relevant_movies = retriever.vectorstore.similarity_search(
    "Can you recommend me an adventure movie?",
    k=3  # by default k=4
)

# As you can see below, we successfully extracted 3 adventure movies
for doc in relevant_movies:
    print(doc.page_content)
Indiana Jones And The Last Crusade: You voted... wisely. There may only be 12 years' difference between Harrison Ford and Sean Connery, but it's hard to imagine two better actors to play a bickering father and son, off on a globetrotting, Nazi-bashing, mythical mystery tour. After all, you've got Spielberg/Lucas' own version of James Bond... And the original Bond himself.
Raiders Of The Lost Ark: In '81, it must have sounded like the ultimate pitch: the creator of Star Wars teams up with the director of Jaws to make a rip-roaring, Bond-style adventure starring the guy who played Han Solo, in which the bad guys are the evillest ever (the Nazis) and the MacGuffin is a big, gold box which unleashes the power of God. It still sounds like the ultimate pitch.
Lawrence Of Arabia: If you only ever see one David Lean movie... well, don't. Watch as many as you can. But if you really insist on only seeing one David Lean movie, then make sure it's Lawrence Of Arabia, the movie that put both the "sweeping" and the "epic" into "sweeping epic" with its breath-taking depiction of T.E. Lawrence's (Peter O'Toole) Arab-uniting efforts against the German-allied Turks during World War I. It's a different world to the one we're in now, of course, but Lean's mastery of expansive storytelling does much to smooth out any elements (such as Alec Guinness playing an Arab) that may rankle modern sensibilities.

4. LLM configuration

4.1 RAG in a nutshell — part 2

Having successfully set up an in-memory vector database (managed by an index) populated with movie review data, our next steps are to:

1. Set up the LLM: select a model and establish an API connection.

2. Create the prompt template: develop a “meta prompt” that wraps the user’s query.

3. Create the RAG chain: This chain retrieves data from the database based on the user query, inserts the retrieved content into the “meta-prompt” sent to the LLM, and includes the actual user query.

4. Answer user question: provide the final answer.

4.2 Setting Up the LLM

We will use OpenAI’s GPT-3.5 Turbo, available in the free tier with a limit of 3 requests per minute (RPM) as of now. For advanced models like GPT-4, you’ll need to spend at least $5 to move to Tier 1.

There are many other model providers. Using the LangChain framework instead of OpenAI directly helps prevent vendor lock-in, allowing easy switching between vendors with minimal effort.
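For instance, swapping providers would only change the model construction; the prompts, retriever, and chains we build below stay the same. The snippet is purely illustrative: it assumes the langchain_anthropic package is installed and an ANTHROPIC_API_KEY is set in your environment, and we will not use this model in the rest of the article.

from langchain_anthropic import ChatAnthropic

# only this line changes when switching vendors; everything downstream is untouched
alt_llm = ChatAnthropic(model="claude-3-haiku-20240307", temperature=0)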

# this will use the API key set up above
# note: we are not using the OpenAI API directly but via LangChain
from langchain_openai import ChatOpenAI


LLM_MODEL_NAME = "gpt-3.5-turbo"
llm = ChatOpenAI(
    model=LLM_MODEL_NAME,
    # higher temperature means more varied and creative (less predictable)
    # answers, so here we set it to the maximum allowed value
    temperature=2,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)
# testing that the LLM works; you will observe that the response metadata
# provides a count of the tokens used in the prompt (prompt_tokens) and in the
# reply (completion_tokens), which are what you are charged for outside of the free tier
llm.invoke("Hey how are you GPTie?")
AIMessage(content="Hello! I'm just a computer program, so I don't have feelings, but I'm here to assist you. How can I help you today?", response_metadata={'token_usage': {'completion_tokens': 31, 'prompt_tokens': 15, 'total_tokens': 46}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-153ebd54-cf18-4662-822d-e104ba1306fd-0', usage_metadata={'input_tokens': 15, 'output_tokens': 31, 'total_tokens': 46})

4.3 Create prompt

The main reason to create a prompt template rather than an f-string is to integrate it with the rest of LangChain and to more clearly establish the various roles in the process (e.g. “system”, “human”, etc.). Here we ask the LLM to provide movie recommendations as if it were a movie wizard.

from langchain_core.prompts import ChatPromptTemplate


# define how the LLM should respond in general
system_message = """
When asked a question reply as if you were the wizard of movies with the \
knowledge about movies. Try to be funny were possible but base you answers in \
the information provided in the context section.\
"""
# wraps the user query represented by {question} and provides documents from
# the vector DB in {context}
# WARNING / SPOILER: there is a mistake here on purpose; to make the code work
# right away, directly replace {question} with {input}
human_message = """
User question:
{question}
-----------------------------------------
Context:
{context}
"""
# put all the messages together into a single prompt
chat_template = ChatPromptTemplate.from_messages([
    ("system", system_message),
    ("human", human_message),
])

4.4 Create the RAG chain

A LangChain chain is a sequence of modular components (like prompts, models, or tools) linked together to process inputs and generate outputs in a structured manner. It combines the following elements:

  1. Prompt Template
  2. LLM
  3. Approach for integrating retrieved documents from the database

While the first two elements are straightforward, the third might be a bit more obscure. Indeed, if we retrieve five documents from the vector database, how do we combine them into a single block of text to insert into {context}? The most obvious way is to stuff them all together, meaning appending the page_content of each document one after the other, which is what create_stuff_documents_chain does. However, there are many other methods; for example, for very large documents, each document can be summarized before the summaries are stuffed together.

from langchain.chains.combine_documents import create_stuff_documents_chain


combine_docs_chain = create_stuff_documents_chain(llm, chat_template)
# you can print combine_docs_chain to see how the chain was built; it is in
# LCEL (LangChain Expression Language), which is beyond the scope of this tutorial

The next chain adds a retriever so that our RAG can access the movie data.

from langchain.chains import create_retrieval_chain


chat_chain = create_retrieval_chain(retriever, combine_docs_chain)
# print chat_chain to view the LCEL code behind it
QUESTION = "Can you recommend me an adventure movie?"

# it is only now that your request goes to the LLM
# the dictionary key represents what you want to replace in the template
chat_answer = chat_chain.invoke({"question": QUESTION})
KeyError: 'input'

4.4.1 Debugging

So we notice something went wrong… Now let’s try to debug it. With this method the debug trace is printed to the console; use FileCallbackHandler instead to direct it to a file. Another alternative is to use langchain.debug = True to debug all chains, but this only prints to the console. In a future tutorial we will learn to use LangSmith to manage this more efficiently.

from langchain.callbacks.tracers import ConsoleCallbackHandler


try:
    chat_answer = chat_chain.invoke(
        {"question": QUESTION},
        config={'callbacks': [ConsoleCallbackHandler()]}
    )
except KeyError as e:
    print(f"KeyError: {e}")

Unlike Python error traces, which are read bottom to top, these ones are read top to bottom. Scrolling further on the first chain/error line you would find this message: "retrieval_docs = (lambda x: x[\"input\"]) | retriever\n\n\nKeyError: 'input'", meaning the retriever expected an 'input' key in the dictionary passed to the invoke method. The solution is to replace 'question' with 'input' in the dictionary. Additionally, ensure the key in the dictionary matches the input variable in the prompt template.

# we changed the human message by replacing {question} with {input}
human_message = """
User question:
{input}
-----------------------------------------
Context:
{context}
"""
fixed_chat_template = ChatPromptTemplate.from_messages([
    ("system", system_message),
    ("human", human_message),
])
# we used the same chains as before
combine_docs_chain = create_stuff_documents_chain(llm, fixed_chat_template)
chat_chain = create_retrieval_chain(retriever, combine_docs_chain)

Now that we have fixed our chain, we can query the LLM again and this time it works.

from pprint import pprint

adventure_movies = chat_chain.invoke({"input": QUESTION})
# we use pprint rather than simply print to have all the text fit the screen
pprint(adventure_movies["answer"])
Can you recommend me an adventure movie?

4.5 Answer the user question

We can try another, more complicated question whose answer requires a bit more thinking. After a quick inspection of the context, we can see that each movie retrieved is indeed a surrealist one, that the LLM did base its answer on the context, and that the answer is pertinent to the question. These are the three elements of the RAG triad, which we will cover in a future article.

# these answers are not deterministic and may change when you run it
surrealist_movies = chat_chain.invoke({"input": "Which surrealist movies should I watch ?"})
for key, val in surrealist_movies.items():
    print(10 * "-" + f" {key} " + 10 * "-")
    pprint(val)
Which surrealist movies should I watch ?

Thanks for reading, if you enjoyed it don’t forget to clap (remember you can go up to 50 claps 😉).

Diogo Traveler
The Deep Hub

Lead Data Scientist and teacher with 9 years of experience in Telecom, Media, and Adtech, trying to understand AI before it takes over the world.