You Built a RAG Proof of Concept…Now What?

Daniel Bukowski
10 min read · Dec 11, 2023


Follow me on LinkedIn for daily posts.

What do you do next after your RAG proof of concept is functioning?

The “Now What?” Moment

Earlier this year my Neo4j colleague Alex and I started working on a prototype RAG application to assist our colleagues with using the Neo4j Graph Data Science Library (we were later joined by another colleague, Alexander, who has done incredible work). We have written about this application and it was the basis of our GenAI and GDS Road to NODES Workshop in October 2023.

After the NODES Presentation we had to decide where to take the project next. We demonstrated that grounding the LLM improved responses and that using a Neo4j graph database provided several advantages over using a standard vector database. Now what?

Pondering What’s Next for My Project — Image generated by author using DALL-E

In this article I will share my experience determining the next steps for a RAG POC once it is functional. This is an experience I know several friends, colleagues, and customers have gone through with similar projects.

Preliminary Evaluation

Recently, at the AI Engineer Summit, co-founder and CEO of LlamaIndex Jerry Liu delivered an outstanding presentation about Building Production-Ready RAG Applications. I highly recommend that anyone working on a RAG POC watch this video.

One of the notable parts of Jerry’s talk was his early focus on “evaluating” your RAG application. He recommends that before you iterate on other aspects of the application (i.e., data, embeddings, retrieval, or synthesis), you first establish how you will evaluate performance.

I am sure few developers define evaluation criteria when starting a RAG POC (I didn’t). However, I had informal measures in the back of my mind that would indicate if this project was moving in the right direction. Therefore, while I wholeheartedly agree with Jerry’s emphasis on defining evaluation metrics early in the project, I think simple, high-level technical and functional evaluation criteria can be sufficient when you are just getting started. The goal at this point is to identify if the project should continue moving forward, not to measure total accuracy or quantify business value. Those are critical measures, but can come later.

Evaluating Your Project — Image generated by author using DALL-E

Functional Evaluation

Functional evaluation can be straightforward — did additional data improve the LLM’s responses? Early in the POC, this can be as simple as identifying a few questions where the LLM initially struggled but then showed improvement with the additional data. That’s it.

For our project, one question about the Neo4j Graph Data Science library showed the most dramatic improvement. At the time, GPT-3.5 and GPT-4 had knowledge cutoffs in 2021 and there had been substantial updates to the Neo4j GDS library since. To test GPT-4 we asked it, “What are the most important hyperparameters in the Neo4j Graph Data Science implementation of the FastRP embedding algorithm?”

  • Without grounding, GPT-4 hallucinated and returned eight hyperparameters, four of which either are not actual hyperparameters or do not apply to FastRP.
  • With grounding GPT-4 responded with five FastRP hyperparameters, all of which are in the Neo4j GDS implementation of FastRP and that multiple subject matter experts agreed were the most important to generate effective FastRP embeddings.

We had other examples as well, but this one resonated the most with our colleagues and leadership, and it helped build momentum to move the POC forward.

Technical Evaluation

Hand-in-hand with the functional evaluation, the technical evaluation can start simple as well — “does the POC work?” Put another way, are you able to ask the LLM a question with grounding and receive a response? At this point the implementation does not have to be efficient or pretty.

My technical background is in data science, not software development, and my first iteration of the POC was cobbled together across four Jupyter Notebooks. There was not even a front end — I used the Jupyter Notebook to submit my questions. We have made significant improvements since, but at the time notebooks were sufficient to get the project off the ground and demonstrate functional effectiveness.

High-Impact Improvement Areas

So what comes next after you have demonstrated basic functional and technical success? Based on my experience there are several high-impact areas to consider as you iterate on your RAG POC:

  • Grounding Data
  • Selecting a Database for the Future
  • Data Pre-Processing
  • Orchestration Layer
  • User Interface
  • Logging User Interactions
  • Code Refactoring

These areas are not in a specific order — I could make an argument that any of them should be first. This blog will only be a high-level survey of the areas, but I hope to publish more detailed blogs in the future and in collaboration with my colleagues working on the project.

Grounding Data

Data is key for Retrieval Augmented Generation and I have written extensively about the importance of having high-quality grounding data. If you started small, as many POCs do, it only makes sense to incorporate more grounding data as you continue developing the project.

We Need More Data — Image generated by author using DALL-E

However, you should be careful to not add more data just for the sake of having a larger grounding database. In addition to potential embedding and storage costs, matching and retrieving out-of-date, messy, or irrelevant data can be detrimental to the performance of your application. For example, we had to be careful to not accidentally include data that referenced deprecated syntax or legacy versions of Neo4j that are no longer supported.

Selecting a Database for the Future

Many RAG POCs are built on lightweight, open source vector databases like ChromaDB. These are great options for getting started and prototyping a use case, but they may not scale with your project. This is where it makes sense to plan for the future.

I am an employee of Neo4j, but I am also a huge fan of graph databases and believe they are an excellent option for grounding a RAG application. Neo4j has a vector index capability, and at the most basic level you can use Neo4j as just a vector database while you are getting started (we did!).

However, Neo4j is well suited to grow with your POC as the grounding data increases and the use case becomes more sophisticated. By planning for the future now, you won’t have to consider a major redesign or database change down the road. I strongly believe graphs provide the best combination of flexibility and capability as you are building a RAG application.
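To make the "start as a vector store" point concrete, here is roughly what a vector-similarity lookup looks like in Neo4j 5.x via the `db.index.vector.queryNodes` procedure. The index name, node property, and parameter names are placeholders for illustration:

```python
# Hypothetical Cypher for vector retrieval in Neo4j 5.x. The index name
# ('docChunks') and the returned property ('text') are placeholders; the
# question embedding would be passed as the $embedding parameter via the driver.

def vector_search_cypher(index_name: str, top_k: int) -> str:
    """Build a Cypher query returning the top-k most similar chunks."""
    return (
        f"CALL db.index.vector.queryNodes('{index_name}', {top_k}, $embedding) "
        "YIELD node, score "
        "RETURN node.text AS text, score ORDER BY score DESC"
    )

query = vector_search_cypher("docChunks", 5)
```

Because the chunks are ordinary nodes, you can later attach relationships to them (to source documents, versions, or user sessions) without changing the retrieval path.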

How Do I Store and Connect This Data? — Image generated by author using DALL-E

Data Pre-Processing

There are several aspects of data processing that can significantly boost the performance of your application. Three I will briefly highlight here are:

  • addressing errors, outliers, and noise in the grounding data.
  • implementing the proper document splitting strategy.
  • optimizing the size of text chunks.

Identifying and addressing errors, outliers, and noise in the data is relatively straightforward. Like any other data science project, RAG applications are subject to “garbage-in, garbage-out.” Addressing this in your grounding database can make the database more efficient (especially as the application grows) and reduce the chances of providing bad context to an LLM.
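A cleaning pass like the one described can be a few lines of code. This sketch drops empty, duplicate, and out-of-date chunks; the deprecation markers are hypothetical examples, and a real list would come from your own documentation:

```python
# Illustrative cleaning pass over raw text chunks before embedding.
# The deprecation markers below are hypothetical examples only.

DEPRECATED_MARKERS = ["gds.alpha.", "Neo4j 3.5"]

def clean_chunks(chunks: list[str]) -> list[str]:
    """Drop empty, deprecated, and duplicate chunks from the grounding data."""
    seen: set[str] = set()
    kept: list[str] = []
    for chunk in chunks:
        text = chunk.strip()
        if not text:
            continue  # drop empty noise
        if any(marker in text for marker in DEPRECATED_MARKERS):
            continue  # drop out-of-date syntax
        if text in seen:
            continue  # drop exact duplicates
        seen.add(text)
        kept.append(text)
    return kept

cleaned = clean_chunks(["a", "a", "  ", "example using gds.alpha.knn", "b"])
```

Even a crude filter like this keeps obviously bad context out of the retrieval path, which matters more as the database grows.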

Implementing the proper splitting strategy is also critical. For example, some splitters look for newline or similar characters that may not be present in all document types. PDFs, websites, code files, and Jupyter Notebooks may each require different tools or strategies to produce effective chunks. When first building a POC, good enough can be good enough, but as the application evolves this is an area well worth time and effort.

I Need to Optimize This Part — Image generated by author using DALL-E

Optimizing document splitting or “chunking” can also have a dramatic impact on performance. Finding the optimal chunk size is part art and part science, and often is determined by the type of source document (e.g., unstructured text vs. code vs. web pages). The LLM you are using and its context window size should also factor into the splitting strategy, as larger pieces of context can quickly eat up capacity. As with the splitting strategy, experimenting with chunk size is well worth time and effort.
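In practice the knob being tuned is small. This naive fixed-size chunker with overlap shows the two parameters involved; real projects typically use splitter utilities from LangChain or LlamaIndex, and the sizes here are arbitrary:

```python
# Naive fixed-size chunker with overlap, to illustrate the parameters being
# tuned. Sizes are arbitrary; production code would use a library splitter.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks, each sharing `overlap` characters
    with the previous chunk so context is not cut mid-thought."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("x" * 1200, chunk_size=500, overlap=50)
```

The overlap is the detail beginners miss: without it, a sentence split across a chunk boundary is invisible to retrieval from either side.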

Orchestration Layer

Another factor to consider is whether you want to begin using an orchestration layer such as LangChain or LlamaIndex. Both are open source libraries with functionality that helps throughout the entire RAG (or other GenAI) workflow, from scraping and ingestion to parsing LLM responses. These libraries also help make your code interoperable across LLMs and platforms (e.g., switching among OpenAI, GCP, Bedrock, or Llama 2). Overall, we have found these libraries to be very useful.
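What these libraries provide, in miniature, is a chain: retrieval, prompt construction, and the LLM call behind one interface, so backends can be swapped without touching the pipeline. The callables below are stand-ins, not real LangChain or LlamaIndex APIs:

```python
# An orchestration layer in miniature. The retrieve/llm callables are
# stand-ins to show the pattern, not actual LangChain/LlamaIndex interfaces.
from typing import Callable

def make_rag_chain(
    retrieve: Callable[[str], list[str]],
    llm: Callable[[str], str],
) -> Callable[[str], str]:
    """Compose retrieval, prompt assembly, and generation into one callable."""
    def chain(question: str) -> str:
        context = "\n".join(retrieve(question))
        prompt = f"Context:\n{context}\n\nQuestion: {question}"
        return llm(prompt)
    return chain

# Swapping providers means swapping the `llm` callable, nothing else.
def fake_retrieve(question: str) -> list[str]:
    return ["FastRP has an embeddingDimension parameter."]

def fake_llm(prompt: str) -> str:
    return f"Answered with {prompt.count('Context')} context block(s)."

chain = make_rag_chain(fake_retrieve, fake_llm)
answer = chain("What is FastRP?")
```

The value of the real libraries is that they ship hardened versions of each link in this chain; the trade-off, discussed below, is that those links are still evolving.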

The Maestro Engineer — Image generated by author using DALL-E

At the same time, both LangChain and LlamaIndex are evolving open source projects, and we have run into reliability and other challenges from time to time. It is unclear if either library is fully enterprise-ready at the moment. Additionally, as cloud platforms upgrade their APIs they may incorporate some features found in LangChain and LlamaIndex. For a POC it is likely well worth your effort to explore orchestration layers, but as fast-evolving open source projects they may not be suitable for enterprise production environments, which may lead to refactoring down the road.

User Interface

My first RAG POC comprised four Jupyter notebooks, including how I submitted questions to the LLM and displayed the response via a Python API. In order to better demonstrate the tool, one of the first steps my colleague Alex took was to implement a front-end so that we could share the tool with other users. An easy and popular option is Streamlit, though others also exist.

POC Interface — Image generated by author

The ability to allow users to interact with the tool via a front-end made it much easier to share, demonstrate, and gather user-feedback. Additionally, the front-end enabled us to add toggles for temperature, number of documents to retrieve, and even different LLMs. If you intend to demo the POC or share it with users to test, investing the time to set up a simple interface is a good area to prioritize. It certainly was for us.

Logging User Interactions

Once you move past the initial POC, you will likely want to capture user interactions with the application in order to understand how the application is being used, evaluate LLM responses, and make improvements. One of the key benefits to using a graph database to ground your RAG application is the ability to log user interactions in the same database and then analyze and visualize them with the grounding data. Graph databases can potentially do this better than any other type of database.
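Here is roughly what logging one Q&A turn into the same graph might look like. The labels and relationship types (`Session`, `Question`, `Answer`, `Chunk`, `USED_CONTEXT`) are an illustrative schema, not a fixed one:

```python
# Hypothetical Cypher for logging one Q&A turn alongside the grounding data.
# All labels and relationship types here are illustrative placeholders.

def log_interaction_cypher() -> str:
    """Build a Cypher statement linking a question, its answer, and the
    grounding chunks the answer used, all in the same database."""
    return (
        "MERGE (s:Session {id: $session_id}) "
        "CREATE (q:Question {text: $question, ts: datetime()}) "
        "CREATE (a:Answer {text: $answer, model: $model}) "
        "CREATE (s)-[:ASKED]->(q)-[:ANSWERED_BY]->(a) "
        "WITH a "
        "UNWIND $chunk_ids AS chunk_id "
        "MATCH (c:Chunk {id: chunk_id}) "
        "CREATE (a)-[:USED_CONTEXT]->(c)"
    )

cypher = log_interaction_cypher()
```

Because the `USED_CONTEXT` relationships point at the same chunk nodes used for retrieval, you can later query which documents drive the most answers, or which answers relied on chunks you have since cleaned up.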

LLM Conversation Logged in a Graph Database — Image created by author

As you evaluate whether to move your application to a more robust database, be sure to consider how that database can help you log and visualize the LLM’s operations under-the-hood.

Code Refactoring

The final area I want to discuss is code refactoring. This is a process that will likely continue as long as you are actively developing and maintaining your project, but it can start as soon as the POC is functional.

Rebuilding My Invention — Image generated by author using DALL-E

I have mentioned in this blog that the first iteration of our RAG POC was cobbled together across Jupyter Notebooks. Notebooks are great for early experimentation and for getting a project off the ground, but they are generally not sufficient as the project moves forward. Another early effort by my colleague Alex was to migrate code from these notebooks into a GitHub repo. This paid huge dividends and allowed us to share code and implement several additional features in a much more structured, maintainable way.

As your project moves from initial POC to early-stage development and then to maturity, it is worthwhile to invest time in making sure the foundations are there for long-term development. It certainly was for us.

Where to next?

GenAI and LLMs are a fast-evolving technology, with ecosystems and best practices moving just as fast. As you experiment with the technology, and specifically Retrieval Augmented Generation (RAG), there will be many areas where you can focus time and effort to continue developing the project. It is easy to spread your efforts too thin or get distracted by the “new exciting thing.” This has happened to me.

What Will I Build Next? — Image generated by author using DALL-E

Covering seven areas for POC development in this blog felt like a lot. However, there are at least that many additional areas I considered for inclusion. Several of the above fall into the “quick win” category, so choosing one area over another to start does not commit you to a multi-week development sprint. I would encourage you to do your research and focus on practical, quick-win areas that can build and maintain your project’s momentum. The power of small, incremental wins that you continuously build over time can pay massive dividends for you, your team, and your project.


Daniel Bukowski

Graph Data Science Scientist at Neo4j. I write about the intersection of graphs, graph data science, and GenAI. https://www.linkedin.com/in/danieljbukowski/