RAG Detective: Retrieval Augmented Generation with website data

Ian Kelk
16 min read · Dec 11, 2023


This article was produced as part of the final project for Harvard’s AC215 Fall 2023 course.

Authors: Ian Kelk, Alyssa Lutservitz, Nitesh Kumar, Mandy Wong

A colorful image in a New Yorker style, depicting a puppet inspired by Bert from Sesame Street, dressed as a detective. This unique puppet has a long, orange nose, a unibrow, and a mop of black hair. He is wearing a classic detective outfit, complete with a trench coat and a fedora hat. The puppet is examining documents with a magnifying glass, looking focused and inquisitive. The scene is vibrant and detailed, capturing the essence of a detective’s investigative work.
Image generated using OpenAI DALL-E 3 and edited in Figma by author

Large language models (LLMs) like GPT-3.5 have proven to be capable when asked about commonly known subjects or topics for which they received a large quantity of training data. However, when asked about topics outside their training data, they either state that they do not possess the knowledge or, worse, hallucinate plausible-sounding answers.

It is not usually possible to research company offerings using an LLM; to directly compare products and services, we need data that is more recent than what the model was trained on. The problem we wish to solve is finding a way to get up-to-date answers about companies that correspond to the information on their websites.

Additionally, to fulfill milestones for this course that would not otherwise be addressed, we fine-tune a BERT model to perform financial sentiment analysis whenever GPT-3.5 reports that its response may be financial in nature.

Proposed Solution

There are two main ways of addressing this limitation: fine-tuning and retrieval augmented generation (RAG).

Fine-tuning is the process of continuing to train the model using your own data with a significantly smaller learning rate. The newly-gained knowledge is then encapsulated in the model weights themselves. However, fine-tuning requires another copy of the model and the associated costs of hosting it, as well as the risk of “catastrophic forgetting,” where the model forgets previously learned information.

RAG, however, makes use of an external source of knowledge, typically a vector store of embeddings and their associated texts. By comparing the embedding of the query to the embeddings in the vector store, we can form a prompt for the LLM that fits inside its context window and contains the information needed to answer the question.

Our solution has three major components:

  • A chatbot that uses scraped data from the website’s sitemap.xml file—a file intended to guide search engines to all scrapable links on the site—in a manner that’s more specific and insightful than using a search engine. The LLM should only use this context to answer the question, not insert its own training data or hallucinate an answer. This is simple to test with questions like “Who is Kim Kardashian?”, which the model clearly knows, while ensuring it replies that the answer “is not within the context provided.”
  • A real-time website scraper built into the application, driven by asynchronous calls to the API.
  • Financial sentiment analysis on relevant completions from the LLM. As part of the prompt for GPT-3.5, we ask if its response is financial in nature. If it says it is, then our fine-tuned BERT model is called and classifies the response, displaying a plot of the probabilities and an appropriately cute, totally non-copyright-infringing Bert puppet.

The RAG component

To understand RAG, let’s use a simplified analogy. There are two types of test questions a person might be asked; the first is a simple request for a fact:

  • What is the capital of France?
  • Who was the first person to climb Mount Everest?
  • In what year did Canada gain independence from Great Britain?

These questions don’t require any special skills to answer; someone with the right reference material could just look up the correct response.

A New Yorker-style cartoon of a student writing a test while sneakily cheating with a history textbook. The scene captures the student’s sly and mischievous behavior in a humorous and classic New Yorker cartoon style.
For some tests, “a textbook is all you need.” Image generated using OpenAI DALL-E 3

The other type of question is one where it doesn’t matter if you’re secretly hiding a textbook on the material; it involves a studied skill. If you haven’t studied and practiced, you won’t be able to respond in a satisfactory way. Some examples could be:

  • Write a poem in German.
  • Write a computer program to calculate the first million prime numbers.
  • Compose a symphony in the style of Beethoven.

Even if you were surrounded by mathematics texts on elliptic curves, it’s unlikely that you could prove Fermat’s Last Theorem unless the proof happened to appear in one of the books verbatim!

A black and white New Yorker-style cartoon of a student looking extremely puzzled and overwhelmed, surrounded by numerous math textbooks. In the background, a chalkboard is densely filled with complex mathematical equations. The scene is rendered in monochrome to capture the classic essence of traditional New Yorker cartoons.
For other tests, all the books in the world can’t save you! Image generated using OpenAI DALL-E 3

In the world of large language models, RAG is used to solve the first type of problem, where it’s easy to cheat. Fine-tuning is used for the second, where a model would likely have to actually learn the material in order to solve it. RAG is easier: it doesn’t require retraining a model, you don’t have to deal with the model’s internal workings, and you can adjust the data the model “cheats” off of rather easily. Interestingly, it also significantly reduces how often a model “hallucinates” answers, a common issue where LLMs invent fictional but plausible answers based on insufficient training data. The only hard part of RAG is finding the relevant data to give the model; models have limits on how much they can be prompted with, and a 500-page history textbook is just too long.

Here’s the basics of how RAG works:

  • Data Organization: Imagine you’re the little guy in the cartoon above, surrounded by textbooks. We take each of these books and break them into bite-sized pieces—one might be about quantum physics, while another might be about space exploration. Each of these pieces, or documents, is processed to create a vector, which is like an address in the library that points right to that chunk of information.
  • Vector Creation: Each of these chunks is passed through an embedding model, a type of model that creates a vector representation of hundreds or thousands of numbers that encapsulate the meaning of the information. The model assigns a unique vector to each chunk—sort of like creating a unique index that a computer can understand.
  • Querying: When you want to ask an LLM a question it may not have the answer to, you start by giving it a prompt, such as “What’s the latest development in AI legislation?”
  • Retrieval: This prompt goes through an embedding model and transforms into a vector itself—it's like it’s getting its own search terms based on its meaning and not just identical matches to its keywords. The system then uses this search term to scour the vector database for the most relevant chunks related to your question.
  • Prepending the Context: The most relevant chunks are then served up as context. It’s similar to handing over reference material before asking your question, except we give the LLM a directive: “Using this information, answer the following question.” While the prompt to the LLM gets extended with a lot of this background information, you as a user don’t see any of this. The complexity is handled behind the scenes.
  • Answer Generation: Finally, equipped with this newfound information, the LLM generates a response that ties in the data it’s just retrieved, answering your question in a way that feels like it knew the answer all along.
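
Conceptually, the retrieval and context-prepending steps boil down to a nearest-neighbor search over embedding vectors followed by some prompt assembly. Here is a minimal, library-free sketch of that idea; it assumes the chunks and the query have already been run through an embedding model (in our case OpenAI’s text-embedding-ada-002), and the function names and prompt wording are purely illustrative rather than our production code.

```python
import numpy as np

def top_k_chunks(query_vec: np.ndarray, chunk_vecs: np.ndarray,
                 chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the query embedding."""
    # Cosine similarity between the query vector and every chunk vector.
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec))
    best = np.argsort(sims)[::-1][:k]
    return [chunks[i] for i in best]

def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Prepend the retrieved chunks and a directive before the user's question."""
    context = "\n\n".join(context_chunks)
    return (
        "Using only the following context, answer the question. "
        "If the answer is not in the context, say it is not within the context provided.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

The assembled prompt is what actually gets sent to the LLM; the user only ever sees their original question and the final answer.
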
The image is a two-panel cartoon. Both panels depict a woman asking a man wearing a “ChatGPT” sweater, “What’s the newest MacBook?” The cartoon is a commentary on the difference in information provided when using training data only versus incorporating additional context.
Image generated using OpenAI DALL-E 3 and comic bubbles and text by author using Comic Life 3

To accomplish this goal, we built an application on top of a vector store called Weaviate, along with a Python web scraper that crawls a given website’s sitemap.xml, a listing of pages used to help search engines crawl the site. Due to the seemingly endless variability of websites on the internet, this turned out to be a bit of a challenge.

The Scraper Component

Our web scraper is built to handle many of the challenges of modern web architecture, capture data often missed by conventional scraping methods, and stream the scraping activity in real time to our web application. This endpoint represents the data collection phase of our solution: it fetches web data, moves it through various stages of storage and processing, and ultimately indexes it in the vector store for swift retrieval.

The scraper uses the Python BeautifulSoup library to sift through the HTML and CSS contents initially. However, some modern websites rely on JavaScript for dynamic content generation, creating a hurdle for standard scraping methodologies that rely purely on HTTP requests. The scraping system resolves this issue by employing Selenium WebDriver, a tool that simulates a real user’s interaction with web pages through a “headless” browser—that is, Google Chrome running without a graphical front end—that fully supports dynamic content loading. If the scraper’s initial efforts to extract data via direct HTTP requests are stymied or yield data below a set threshold, Selenium is engaged. This technique ensures that the scraper can successfully access content that would not be available through static page loads.

The scraper also bypasses unnecessary elements, such as images, to mitigate overhead and speed up processing. Once the data is collected, control reverts to BeautifulSoup to filter and extract the text from the HTML.
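
Below is a simplified sketch of that static-first, Selenium-fallback strategy. The threshold value, the browser options, and the error handling are pared down and illustrative rather than taken verbatim from our scraper.

```python
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

MIN_TEXT_LENGTH = 200  # arbitrary threshold: did the static fetch return enough content?

def fetch_page_text(url: str) -> str:
    # First attempt: a plain HTTP request parsed with BeautifulSoup.
    html = requests.get(url, timeout=30).text
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

    # Fallback: if the static fetch yields too little text (e.g. a JavaScript-rendered
    # page), load the page in headless Chrome so dynamic content can execute.
    if len(text) < MIN_TEXT_LENGTH:
        options = Options()
        options.add_argument("--headless")
        driver = webdriver.Chrome(options=options)
        try:
            driver.get(url)
            text = BeautifulSoup(driver.page_source, "html.parser").get_text(
                separator=" ", strip=True)
        finally:
            driver.quit()
    return text
```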

Post-extraction, the data is serialized and saved to Google Cloud Storage as CSV files, serving as a backup of the data for the vector store. The scraped data is then chunked and inserted into Weaviate using an orchestration library, LlamaIndex, where it is added to the vector store index.

A New Yorker-style cartoon featuring a robot using a paint scraper to humorously scrape words off of computer screens, symbolizing ‘data scraping’. The cartoon creatively and whimsically interprets this modern concept, maintaining the essence of a classic New Yorker cartoon.
“Looks like someone took ‘Ctrl-X’ a bit too literally!” Image generated using OpenAI DALL-E 3

Orchestration

Let’s look at what happens after the data is scraped in a bit more detail, as there is some orchestration to be done to insert it into the vector store. These steps were mentioned at an abstract level above, but more specifically, the steps involve:

  • Taking each scraped website and breaking it into chunks. The lengths of these chunks have a profound impact on how well the retrieval process works.
  • For each chunk, generate an embedding vector using OpenAI’s text-embedding-ada-002 model. Both Weaviate and LlamaIndex integrate this model natively.
  • Insert this embedding vector and the text chunk into the Weaviate vector store.
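
A condensed sketch of that insertion flow is below. It assumes the late-2023 (0.9-era) LlamaIndex API, which has since been reorganized, and the Weaviate URL, class name, and metadata fields are illustrative.

```python
import weaviate
from llama_index import Document, StorageContext, VectorStoreIndex
from llama_index.vector_stores import WeaviateVectorStore

# Example input: scraped page text keyed by URL (illustrative).
scraped_pages = {"https://example.com/about": "Example Corp builds widgets..."}

# Wrap each scraped page as a LlamaIndex Document, keeping the URL as metadata.
documents = [
    Document(text=page_text, metadata={"websiteAddress": url})
    for url, page_text in scraped_pages.items()
]

# Point LlamaIndex at the Weaviate instance; "Pages" is an illustrative class name.
client = weaviate.Client("http://localhost:8080")
vector_store = WeaviateVectorStore(weaviate_client=client, index_name="Pages")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# from_documents() chunks each document, embeds every chunk with
# text-embedding-ada-002 (the default embedding model), and writes the
# vectors plus text into Weaviate.
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
```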

For the retrieval step when prompting the model:

  • Take the prompt and put it through text-embedding-ada-002 to get an embedding.
  • Using that embedding, find the chunks that should answer the prompt, then prepend them to the query and send it to GPT-3.5.
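
Continuing the sketch above, LlamaIndex wraps both retrieval steps in a query engine: it embeds the prompt, pulls the most similar chunks out of Weaviate, prepends them as context, and calls GPT-3.5 (the default LLM when an OpenAI key is configured). Again, this assumes the 0.9-era API, and the question string is illustrative.

```python
# Retrieve the two most relevant chunks and let the LLM answer from them.
query_engine = index.as_query_engine(similarity_top_k=2)
response = query_engine.query("What products does this company offer?")

print(response.response)             # the generated answer
for node in response.source_nodes:   # the retrieved chunks used as context
    print(node.score, node.node.metadata.get("websiteAddress"))
```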

The actual chunking of the documents is somewhat of an art in itself. GPT-3.5 has a maximum context length of 4,096 tokens, or about 3,000 words. That budget covers both the prompt and the response: if we create a prompt with a context 3,000 words long, the model will not have enough room to generate a response. Realistically, we shouldn’t prompt with more than about 2,000 words for GPT-3.5. This means there is a trade-off with chunk size, and the right choice is data-dependent.

With smaller chunk_size values, retrieval returns more focused pieces of text, but it risks missing relevant information when that information sits in a part of the document that doesn’t closely match the query. Larger chunk_size values, on the other hand, are more likely to include all the necessary information within the top chunks, ensuring better response quality; but because fewer large chunks fit in the prompt, information that is distributed throughout the text can still be missed.

Let’s use some examples to illustrate how this trade-off works, using the recent Tesla Cybertruck release event. While some models of the truck will be available in 2024, the cheapest model—with just RWD—will not be available until 2025. Depending on the formatting and chunking of the text used for RAG, retrieval may or may not surface this fact!

In these images, blue indicates where a match was found and the chunk was returned; the grey box indicates the chunk was not retrieved; and the red text indicates where relevant text existed but was not retrieved. Let’s take a look at an example where shorter chunks succeed:

Exhibit A: Shorter chunks are better… sometimes. Background desk image generated using OpenAI DALL-E 3 and text by author using Pixelmator Pro.

In the image above, on the left, the text is structured so that the admission that the RWD model won’t arrive until 2025 sits in a separate paragraph, but that paragraph still contains text matched by the query. Retrieving two shorter chunks works better here because it captures all of the information. On the right, the retriever returns only a single chunk, so there is no room for the additional information, and the model is given incomplete information.

However, this isn’t always the case; sometimes longer chunks work better when text that holds the true answer to the question doesn’t strongly match the query. Here’s an example where longer chunks succeed:

Exhibit B: Longer chunks are better… sometimes. Background desk image generated using OpenAI DALL-E 3 and text by author using Pixelmator Pro.

After some experimentation, we opted to use chunks 1,000 tokens long and retrieve two of them to prompt GPT-3.5. Since GPT-3.5 can handle a 4,096-token context, that leaves plenty of space for an appropriate response.
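
In the 0.9-era LlamaIndex API, that choice looks roughly like the snippet below; the parameter names follow that version of the library and have since moved around.

```python
from llama_index import ServiceContext, VectorStoreIndex

# chunk_size is measured in tokens; a small overlap keeps sentences from being
# split awkwardly across neighboring chunks.
service_context = ServiceContext.from_defaults(chunk_size=1000, chunk_overlap=20)

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,  # the Weaviate-backed store from the earlier sketch
    service_context=service_context,
)

# Retrieve the two most similar 1,000-token chunks per query.
query_engine = index.as_query_engine(similarity_top_k=2)
```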

When we first began the project, we were doing the chunking, indexing, and retrieval ourselves, and it worked just fine. There was a learning curve with GraphQL, which is what Weaviate uses as its query language.

An example of the GraphQL we had to use to test if a previous website and timestamp had already been inserted into the vector store.
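
A representative query of that shape, sent through the Weaviate Python client’s raw GraphQL interface, might look like the following; the class and property names (Pages, websiteAddress, timestamp) are illustrative rather than our exact schema.

```python
import weaviate

client = weaviate.Client("http://localhost:8080")  # illustrative URL

# Has this website already been scraped and inserted with this timestamp?
graphql = """
{
  Get {
    Pages(
      where: {
        operator: And
        operands: [
          { path: ["websiteAddress"], operator: Equal, valueText: "ai21.com" }
          { path: ["timestamp"], operator: Equal, valueText: "2023-11-20T00:00:00Z" }
        ]
      }
      limit: 1
    ) {
      websiteAddress
      timestamp
    }
  }
}
"""

result = client.query.raw(graphql)
already_scraped = bool(result["data"]["Get"]["Pages"])
```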

The appeal of using a library like LlamaIndex is that it abstracts away this orchestration, allowing us to swap out other vector stores if we want to (Weaviate has many competitors in the space, such as Milvus, Qdrant, Pinecone, and others emerging all the time). Using LlamaIndex also allows us to experiment later with more complex RAG implementations, such as tree-structured data and recursive prompting. However, using such a new library came with its own share of challenges, specifically the lack of proper documentation. The vast majority of their help resources were examples, and if those examples didn’t fit our use case, there was no recourse other than asking the developers on Discord or reading the source code ourselves.

The BERT Component

To fulfill the course’s model-hosting requirements, we fine-tuned a BERT model and hosted it via a pipeline on Google Vertex AI. By adding this model to our application, we can prompt GPT-3.5 to return a flag along with its response indicating whether it thinks the answer it’s giving is financial in nature. When it does, we display an appropriately comical, non-copyright-infringing Bert puppet along with a plot of the probabilities returned by the model.

The process of refining the BERT model’s training was as much about the data as it was about the technical configuration. The goal was to teach the model to discern financial sentiments within texts—a skill useful for anyone needing insights from complex financial news. Since sentiments in such articles can be subtle and not immediately apparent, fine-tuning a specialized model was essential to providing laypersons with clear indicators of the underlying sentiment.

To train our BERT model, we used the financial_phrasebank dataset, which is composed of sentences labeled with sentiments by individuals well-versed in finance. However, an intriguing issue arises with such a dataset: variance in the levels of agreement among annotators, which can implicitly influence the model’s learning and its subsequent predictions.

When we fine-tuned the BERT model on each of these subsets—which represent 100%, 75%, 66%, and 50% annotator agreement—it seemed pretty clear that the more the annotators agreed, the better the trained model performed.
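
For reference, the fine-tuning setup looks roughly like the sketch below, using Hugging Face transformers and datasets. The base checkpoint, hyperparameters, and split are illustrative rather than our exact configuration; the dataset’s four configurations correspond to the agreement levels mentioned above.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# financial_phrasebank ships four configurations by annotator agreement:
# sentences_allagree (100%), sentences_75agree, sentences_66agree, sentences_50agree.
dataset = load_dataset("financial_phrasebank", "sentences_75agree")["train"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # 0 = negative, 1 = neutral, 2 = positive

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True)
split = tokenized.train_test_split(test_size=0.2, seed=42)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-financial-sentiment",
                           num_train_epochs=3,
                           per_device_train_batch_size=16,
                           evaluation_strategy="epoch"),
    train_dataset=split["train"],
    eval_dataset=split["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```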

When training directly on the various subsets, more agreement meant better model performance. Screenshot from the Weights & Biases platform.

Suspicious, right? What if the models weren’t actually better with higher consensus, but instead a higher level of consensus simply means that the financial statements are just... easier to classify?

It’s not hard to imagine that higher consensus might skew the model towards recognizing only clear-cut sentiments while neglecting those that are more nuanced, which are oftentimes the reality in complex financial texts.

To address this, we set out to de-bias our data. We aimed to construct a training and testing split that would give a fair representation of all sentiment clarity levels, ensuring the model’s utility across varying scenarios. This required carefully programming the data splitting process, taking extra steps like random shuffling, stratified sampling, and creating balanced partitions for validation and test sets. By doing so, we mitigated the risk of the model merely performing well on obviously sentimental data and instead ensured it could handle a realistic mix of financial texts.
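
The exact pipeline is described in our W&B report, but the core stratification idea can be illustrated with scikit-learn; the function name and the split ratios below are illustrative.

```python
from sklearn.model_selection import train_test_split

def balanced_splits(texts, labels, agreement_tiers, seed=42):
    """Split into train/val/test while stratifying on sentiment AND agreement tier.

    texts, labels, and agreement_tiers are parallel lists; agreement_tiers records
    each sentence's annotator-consensus level (e.g. 100, 75, 66, or 50).
    """
    strata = [f"{label}_{tier}" for label, tier in zip(labels, agreement_tiers)]

    # Hold out 30% of the data, shuffling and preserving the joint sentiment/consensus mix.
    X_train, X_tmp, y_train, y_tmp, s_train, s_tmp = train_test_split(
        texts, labels, strata, test_size=0.3, stratify=strata, random_state=seed)

    # Split the held-out portion evenly into validation and test sets.
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.5, stratify=s_tmp, random_state=seed)

    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```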

The de-biasing flow. For a full explanation of the process see our W&B report. Created on draw.io.

Post-de-biasing, when we evaluated the model using a more balanced approach, an interesting trend surfaced—the model trained with 75% annotator agreement displayed the highest F1 score, deviating from our initial findings. It appeared, as we had suspected, that the best dataset is actually a compromise between full annotator agreement and more complex financial statements that provoke disagreement.

After de-biasing, the 100% agreement data model (AllAgree) was no longer the best performing, but instead the 75% agreement data showed the most promise. Screenshot from the Weights & Biases platform.

Putting it all together

The application architecture consists of a FastAPI service and a front-end container connected via Nginx. The FastAPI service hosts several endpoints that handle tasks such as streaming responses for specific operations, managing queries with streaming responses, listing website addresses, retrieving timestamps for specific websites, fetching URLs and financial flags for a given query, leveraging Vertex AI’s Prediction API for sentiment analysis, and processing input for effective sitemap scraping.
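
As a flavor of how the streaming endpoints are structured, here is a pared-down FastAPI sketch; the route name, parameters, and fake token stream are illustrative and not the service’s actual API.

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/streaming_query")
async def streaming_query(query: str, website: str):
    async def token_stream():
        # In the real service this loops over the LLM's streamed completion;
        # here a few hard-coded tokens stand in to show the response shape.
        for token in ["Retrieving ", "context ", "for ", website, "..."]:
            yield token

    return StreamingResponse(token_stream(), media_type="text/plain")
```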

The tech stack of libraries and tools used to build and deploy the application. Headings from AC215 course slides and inner content by the author.

The front end is built in HTML and JavaScript as a single-page application using both synchronous and asynchronous functions. It’s styled with CSS, uses multiple effects for loading indicators and sliding panels, and uses Google’s Material Design library to create modern-looking text inputs and buttons.

We generated 30 different non-copyright-infringing off-brand Bert images using OpenAI’s DALL-E 3, 10 each for positive, negative, and neutral sentiment. Depending on which sentiment is found, the front end chooses one of the appropriate sentiment images at random to display along with the classes.

An animation showing the querying process for the company ai21.com. Bert images were generated using OpenAI DALL-E 3.

Real-time updates on the progress and completion of each step are provided to clients during the sitemap scraping process. Nginx acts as the gateway, efficiently routing requests to the appropriate endpoints, ensuring a cohesive and responsive user experience.

An animation showing the scraping process for cohere.com. Bert images were generated using OpenAI DALL-E 3.

Deployment

The RAG Detective App is deployed using Ansible for automation and reproducibility. The deployment process involves enabling the required Google Cloud Platform (GCP) APIs, setting up GCP service accounts, and creating a Docker container with the necessary software. Ansible playbooks are then used to build and push Docker containers to Google Container Registry (GCR), create a compute instance (VM) server in GCP, provision the instance with the required configurations, and set up the Docker containers in the instance. The process further involves configuring Nginx as the web server on the compute instance, ensuring proper setup and access to the Docker containers within the instance.

For scaling, we use Kubernetes. Ansible playbooks create and deploy a Kubernetes cluster, which involves building and pushing Docker containers to GCR and then provisioning the cluster itself. In the event of increased demand, scaling happens at two levels: node scaling, where GKE automatically adds more nodes to the cluster, and pod scaling, handled by the Kubernetes Horizontal Pod Autoscaler, which adjusts the number of pod replicas based on resource utilization. Load balancing is achieved through Kubernetes Services and GKE’s integration with Google Cloud Load Balancer, ensuring even distribution of incoming traffic.

One thing to note is that the Weaviate instance will not scale with our other containers, and implementing database scaling is beyond the scope of the course and a science in its own right. For scaling Weaviate, horizontal scaling is the primary approach, involving the addition of more nodes to the cluster to distribute the workload evenly. Alongside this is load balancing, which helps in evenly distributing incoming requests across the available nodes. Data sharding can be done as well, where data is partitioned across multiple nodes, allowing for efficient processing of queries. Kubernetes can also be used to automate the scaling and management of Weaviate instances, ensuring that the system remains robust and efficient as it grows to meet rising demands.

Lessons Learned

We learned some great general lessons about systems and operations:

  • Authentication can be a major pain point. Almost every time we deployed something for the first time, we would have issues where Google Cloud, Google Vertex, or OpenAI weren’t being properly authenticated or had the wrong permissions. Getting the hang of authentication and secrets management (we found using libraries like dotenv useful) is a major component of becoming skilled at operationalizing AI products.
  • GPT models are forward-thinkers. In order to decide whether to invoke our BERT model on the LLM’s answer, we asked it to return a flag indicating whether the answer was financial in nature. Initially, we had it generate the flag first and then the answer, until we realized that GPT models cannot classify text they have not yet generated, even when they are the ones about to generate it. Moving the flag to the end of the model’s streamed response (see the sketch after this list) was a little trickier, but it works better. One thing we did notice: GPT-3.5 is a lot worse at reasoning than GPT-4, and it often makes errors here that GPT-4 does not make.
  • Use bleeding-edge libraries at your own risk. Despite having a working prototype using only Weaviate, we used LlamaIndex for our orchestration with the goal of using its more advanced RAG architectures. Its documentation is full of example code but lacks a proper library reference, and we had to dive into its source code on a number of occasions. We could continue work on this project to add more complex RAG methods, which would both justify the use of LlamaIndex and give better responses to queries.
  • If you are going to use a new library, make sure to check out the Discord chat. As a workaround to poor documentation, often the authors of the libraries are active on Discord and will be able to give direct guidance when you get stuck.
  • Datasets, as usual, can contain hidden biases. We used the financial_phrasebank dataset, and as previously mentioned, there were hidden reasons why greater annotator consensus seemed to result in a better model. We should always be cautious about taking seemingly obvious paths with data!
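
To make the trailing-flag trick from the lessons above concrete: the prompt asks GPT-3.5 to finish its answer with a marker such as FINANCIAL=TRUE or FINANCIAL=FALSE, and the client strips it off the accumulated stream. The marker format and parsing below are illustrative, not our exact implementation.

```python
import re

FLAG_PATTERN = re.compile(r"\s*FINANCIAL=(TRUE|FALSE)\s*$")

def split_answer_and_flag(streamed_tokens):
    """Join a streamed completion, then peel the financial flag off the end.

    Because the flag comes after the answer, the model has already generated
    (and can therefore "see") the text it is classifying.
    """
    text = "".join(streamed_tokens)
    match = FLAG_PATTERN.search(text)
    is_financial = bool(match and match.group(1) == "TRUE")
    answer = FLAG_PATTERN.sub("", text)
    return answer, is_financial

# Example:
# split_answer_and_flag(["Revenue grew 12% ", "last quarter. ", "FINANCIAL=TRUE"])
# -> ("Revenue grew 12% last quarter.", True)
```
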
A colorful image in a New Yorker style, depicting a puppet inspired by Bert from Sesame Street, dressed as a detective. This unique puppet has a long, orange nose, a unibrow, and a mop of black hair. He is wearing a classic detective outfit, complete with a trench coat and a fedora hat. The puppet is examining documents with a magnifying glass, looking focused and inquisitive. The scene is vibrant and detailed, capturing the essence of a detective’s investigative work.
And that’s the end! Image generated using OpenAI DALL-E 3

Overall, this was a fun project involving very modern solutions to very modern problems, and the experience we gained both with training and deploying models on the cloud—as well as creating our own RAG architecture—was a great way to spend the course.

