Context is Everything: How to Approach Fine-Tuning a Grounded LLM Application

Daniel Bukowski
7 min read · Oct 25, 2023


Note: This article and the underlying LLM application were developed with Alexander Gilmore, Associate Consulting Engineer at Neo4j. Follow me on LinkedIn for daily posts.

Background

One of the first questions asked when developing an LLM application is whether to ground the model, fine-tune it, or combine both approaches. Several customers have asked this question recently as they plan their LLM projects. My colleagues and I also evaluated fine-tuning as part of our roadmap for developing a grounded LLM application within Neo4j. This article captures our current thinking and how I am advising customers to plan their LLM application development.

In summary:

  • Begin logging as soon as you define a use case. Log user conversations with the base LLM in a graph database and capture feedback about the quality of LLM responses without grounding, prompt engineering, or fine-tuning.
  • Implement grounding first to incorporate proprietary data and improve areas where the LLM provides poor responses. Iterate here to optimize the grounding dataset and prompt design, while continuing to log LLM interactions and user feedback. Graph databases and graph data science are uniquely well suited for analyzing LLM interactions alongside grounding data.
  • Explore fine-tuning only after you have optimized grounding and have an in-depth understanding of where the LLM performs well and where it struggles. By this time you should also have sufficient user interaction to develop a high-quality dataset optimized for fine-tuning.

The Neo4j NODES 2023 conference will have several workshops specific to graph databases and LLMs. We hope you join us!

Join us for NODES 2023 at https://neo4j.com/nodes-2023

Note: This article focuses on fine-tuning text generation and chat models, though the approach should apply to specialized code generation models.

Fine-Tuning Overview

Fine-tuning enables you to perform additional training on an LLM with examples specific to a task or domain. A similar approach to improving LLMs is few-shot prompting, where example inputs and outputs are provided to the LLM in the prompt. Fine-tuning builds upon this by re-training a portion of the LLM’s parameters on more examples than could be included in a single prompt. This has multiple benefits, as both OpenAI and Google identify in their fine-tuning documentation:

  • Better model performance for specific tasks than can be achieved with few-shot prompting.
  • Shorter prompts when querying the model because input and output examples do not need to be provided.
  • Lower costs and improved latency because of the shorter prompts.
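To make the contrast concrete, here is a minimal sketch of the two approaches using the OpenAI Python SDK. The model names, the fine-tuned model ID, and the example question/answer pairs are placeholders, not values from a real project:

```python
# Sketch: few-shot prompting vs. calling a fine-tuned model.
# Model names and the fine-tuned model ID are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Few-shot prompting: example input/output pairs travel in every request,
# which lengthens the prompt and adds per-request token cost.
few_shot_messages = [
    {"role": "system", "content": "Answer questions about Cypher concisely."},
    {"role": "user", "content": "How do I match all nodes with the label Person?"},
    {"role": "assistant", "content": "MATCH (p:Person) RETURN p"},
    {"role": "user", "content": "How do I count those nodes instead?"},
    {"role": "assistant", "content": "MATCH (p:Person) RETURN count(p)"},
    {"role": "user", "content": "How do I return only their names?"},  # the real question
]
few_shot_response = client.chat.completions.create(
    model="gpt-3.5-turbo", messages=few_shot_messages
)

# Fine-tuned model: the examples were baked in during training,
# so the prompt only needs the new question.
fine_tuned_response = client.chat.completions.create(
    model="ft:gpt-3.5-turbo:my-org::abc123",  # placeholder fine-tuned model ID
    messages=[{"role": "user", "content": "How do I return only their names?"}],
)

print(few_shot_response.choices[0].message.content)
print(fine_tuned_response.choices[0].message.content)
```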

At the same time, fine-tuning is not an all-encompassing solution to improving LLM performance. Colleagues and customers have shared experiences where fine-tuning did not appear to provide meaningful improvement to LLM performance. Fine-tuning also cannot fully replace a large corpus of grounding data that could be used for retrieval augmented generation (RAG).


In addition to the compute costs of training, fine-tuning requires time and effort to build a high-quality dataset with enough examples for training, testing, and validation. Google requires a minimum of 10 examples for fine-tuning but recommends between 100 and 500. OpenAI also requires a minimum of 10 examples and says improvements can be seen starting at 50 to 100 examples. The data used for fine-tuning should also be curated to mimic what the model will see in a production environment, including the context information the model will receive, as the Google documentation highlights.
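As a hedged illustration of what such a dataset looks like, OpenAI’s chat fine-tuning expects JSONL: one JSON object per line, each containing a full messages array. The records below are made-up placeholders; a real dataset would be exported from logged interactions:

```python
# Illustrative sketch of the JSONL format used for chat fine-tuning.
# The example content is invented; a real dataset would come from logged conversations.
import json

examples = [
    {
        "messages": [
            {"role": "system", "content": "You answer questions using the provided context."},
            {"role": "user", "content": "Context: ...retrieved grounding text...\n\nQuestion: What is a knowledge graph?"},
            {"role": "assistant", "content": "A knowledge graph stores entities as nodes and their relationships as edges..."},
        ]
    },
    # ...at least 10 examples are required; 50-100+ is where improvements typically appear
]

with open("fine_tuning_dataset.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```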

So Where Does Fine-Tuning Fit In?

This raises the question: when should you consider fine-tuning an LLM, particularly if doing so may not produce meaningful results? I am currently advising customers to include fine-tuning in their roadmap, but to wait several iterations before experimenting with it. We are taking the same approach with an internal LLM application.

At the same time, I am also advising customers to take a data-centric approach from the start in order to build a high-quality dataset of LLM interactions to use for fine-tuning.

Log from the Start

One of the most important steps you can take is to set up logging infrastructure as soon as you have a use case and begin work towards building a grounded LLM application. This includes:

  • A front-end (such as Streamlit) to allow the team and potential users to easily ask the LLM questions
  • A highly visible method for users to provide feedback (e.g., thumbs-up / thumbs-down) about the LLM’s responses; a minimal sketch of such a setup follows the data-model figure below

Example graph data model for logging LLM conversations with context
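The sketch below pairs a Streamlit front end with the official neo4j Python driver to capture each question, answer, and rating. The node labels, relationship types, credentials, and the ask_llm stub are assumptions for illustration, not a prescribed data model:

```python
# Rough sketch: a Streamlit front end that logs each question, answer, and
# thumbs-up/down rating to Neo4j. Labels, relationships, and credentials are
# illustrative assumptions.
import streamlit as st
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))


def ask_llm(question: str) -> str:
    # Placeholder for the call to the base LLM (e.g., via the openai package).
    return "LLM answer goes here"


def log_interaction(question: str, answer: str, rating: str) -> None:
    # Store the question and answer as connected nodes so feedback can be
    # analyzed alongside the grounding data later.
    query = """
    CREATE (q:Question {text: $question, ts: datetime()})
    CREATE (a:Answer {text: $answer, rating: $rating})
    CREATE (q)-[:HAS_ANSWER]->(a)
    """
    with driver.session() as session:
        session.run(query, question=question, answer=answer, rating=rating)


st.title("LLM Assistant")
question = st.text_input("Ask a question")
if question:
    answer = ask_llm(question)
    st.write(answer)
    col_up, col_down = st.columns(2)
    if col_up.button("👍"):
        log_interaction(question, answer, "positive")
    if col_down.button("👎"):
        log_interaction(question, answer, "negative")
```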

Ideally, logging user questions and answers will start as soon as you have identified a use case for the application (even before implementing grounding).

Why do this? First, collecting initial questions will help you understand what topics your users need assistance with and how they think the LLM might help. Additionally, by collecting user feedback you can begin understanding where the base LLM answers questions well and where it struggles.

Positively-rated answers will indicate where the LLM already performs well, and therefore where we may not need to put effort into grounding or fine-tuning. Negatively-rated answers will highlight areas where grounding and fine-tuning may be beneficial. Understanding how model performance changes with grounding will also be critical when developing a fine-tuning dataset.

Start with RAG

Grounding, via retrieval augmented generation (RAG), is the approach we see as most successful for organizations who want to leverage the power of LLMs with their own proprietary information. Knowledge graphs, combined with graph data science algorithms, are uniquely well suited for developing a high-quality grounding data set and logging LLM interactions in the same database.

Retrieval Augmented Generation with a Knowledge Graph

Having grounded questions and answers logged in a single graph database is extremely valuable when building a dataset for fine-tuning. Combining questions, grounding context, and the LLM response in a graph database enables you to visualize and understand exactly how the LLM produced the response. The graph structure, in which relationships are first-class objects alongside the text nodes, also provides an efficient way to query all of these elements in a form that replicates what the LLM will see in production (an important aspect of effective fine-tuning).
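For example, a single query can reassemble a question, its retrieved context, and the rated response. The labels (Question, Answer, Context), relationship types, and connection details below are assumptions about one possible data model, continuing the logging sketch above:

```python
# Illustrative query that reassembles a question, its grounding context, and
# the LLM response from the log. Labels and relationships are assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

QUERY = """
MATCH (q:Question)-[:HAS_ANSWER]->(a:Answer),
      (q)-[:RETRIEVED]->(c:Context)
RETURN q.text AS question,
       collect(c.text) AS context_chunks,
       a.text AS answer,
       a.rating AS rating
"""

with driver.session() as session:
    for record in session.run(QUERY):
        # Each record mirrors what the LLM saw in production: the user
        # question plus the context chunks that grounded the answer.
        print(record["question"], record["rating"])
```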

Continuing to capture user feedback can also help identify where the LLM does well and where it struggles. Non-grounded questions from the initial logging can also be re-submitted to the application to demonstrate changes in answers and act as examples for fine-tuning.

LLM Conversation Logged in Neo4j

Optimize Your RAG Implementation

Once you implement grounding and begin collecting user questions, a critical step is to perform an in-depth analysis of the user questions, the grounding dataset, and where the LLM succeeds and struggles. This can include improving the quality, efficiency, and diversity of the grounding dataset, as we have written and presented about previously. It can also include adjusting the instructions that tell the LLM how to use the context and answer the user’s question. It may even include experimenting with different foundation models or types of models (i.e., chat, text generation, or code chat).
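One of the cheapest levers to iterate on is the instruction wrapper around the retrieved context. The template below is a hedged example of the kind of prompt worth experimenting with; the wording and function name are illustrative, not a recommended prompt:

```python
# Example instruction template to iterate on during RAG optimization.
# The wording is illustrative only.
PROMPT_TEMPLATE = """You are an assistant for {domain} questions.
Use ONLY the context below to answer. If the context does not contain
the answer, say you do not know rather than guessing.

Context:
{context}

Question: {question}
Answer:"""


def build_prompt(domain: str, context_chunks: list[str], question: str) -> str:
    # Join the retrieved chunks and drop them into the instruction wrapper.
    return PROMPT_TEMPLATE.format(
        domain=domain,
        context="\n\n".join(context_chunks),
        question=question,
    )
```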


We currently see RAG as the approach most likely to help organizations realize value from LLM-based applications. Therefore, optimizing all aspects of a RAG implementation is where organizations are most likely to find incremental improvements.

Build a High-Quality Fine-Tuning Dataset

I am a strong advocate of taking a data-centric approach to building grounded LLM applications. With effective logging, iterating on and optimizing the application itself is also a primary way to develop a high-quality dataset for fine-tuning.


A high-quality dataset for fine-tuning will include:

  • User Questions: By logging questions from the start we will have the largest possible representation of topics users ask the LLM about. This will help us ensure that our fine-tuning dataset is aligned with how users are actually interacting with the application. Questions asked before grounding can also be re-submitted to demonstrate the impact of grounding and determine if certain topics should be addressed via fine-tuning.
  • Grounding Data: Investing time and effort into curating a high-quality grounding dataset will likely improve the application’s performance while also optimizing the input examples used for fine-tuning.
  • LLM Responses: A corpus of LLM responses from before and after grounding, rated by users, is also critical for curating a fine-tuning dataset. Positively-rated answers represent examples we want to provide to the LLM for training. Negatively-rated responses, especially those that remain negative after grounding, represent areas where additional grounding or fine-tuning may be required. You should also identify any topics that trended negative after grounding. A sketch of assembling such a dataset from the log follows this list.
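Tying the pieces together, the sketch below exports positively-rated, grounded interactions from the log into the JSONL format shown earlier. It reuses the same hypothetical labels, relationships, and credentials as the previous sketches:

```python
# Sketch: export positively-rated, grounded interactions from the log into the
# JSONL format used for chat fine-tuning. Data model and properties are assumptions.
import json
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

EXPORT_QUERY = """
MATCH (q:Question)-[:HAS_ANSWER]->(a:Answer {rating: 'positive'}),
      (q)-[:RETRIEVED]->(c:Context)
RETURN q.text AS question, collect(c.text) AS context, a.text AS answer
"""

with driver.session() as session, open("fine_tuning_dataset.jsonl", "w") as f:
    for record in session.run(EXPORT_QUERY):
        example = {
            "messages": [
                {"role": "system", "content": "Answer using the provided context."},
                {
                    "role": "user",
                    "content": "Context:\n" + "\n\n".join(record["context"])
                    + "\n\nQuestion: " + record["question"],
                },
                {"role": "assistant", "content": record["answer"]},
            ]
        }
        f.write(json.dumps(example) + "\n")
```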

Conclusion

Fine-tuning is a valuable tool to have available when building an LLM application. For most organizations and use cases, however, grounding and RAG appear more likely to deliver value in the near term. That said, fine-tuning has the potential to further improve an LLM application after grounding is implemented and optimized, and it should be evaluated as part of the overall development plan. There are several steps developers can take to leverage the unique capabilities of graphs and build high-quality datasets for use with fine-tuning.

We hope you will explore how Neo4j and Graph Data Science can help you develop LLM applications for your organization!


Daniel Bukowski

Graph Data Science Scientist at Neo4j. I write about the intersection of graphs, graph data science, and GenAI. https://www.linkedin.com/in/danieljbukowski/