Best practices for your ChatGPT ‘on your data’ solution

Learn how to implement design patterns to retrieve the most relevant context and generate a correct answer.

Mick Vleeshouwer
11 min read · Aug 22, 2023
Can you find the most relevant context in all your data? — (Photo by Rogério Toledo on Unsplash, cropped)

In my last article, I discussed the architecture and data requirements needed to create your own Q&A engine with ChatGPT/LLMs. Today I will walk you through the architectural decisions and approaches you can take to further improve your solution to chat with your documents.

When RAG [1] systems fail, the cause is often poor retrieval, not your LLM. A system is only as good as the data you provide it.

Disclaimer: this article provides an overview of architectural concepts that are not specific to Azure, but are illustrated using Azure services since I am a Solution Architect at Microsoft.

1. Evaluate your current implementation

To improve your solution, first identify the issues in your current implementation. While end users only see an incorrect result, your responsibility is to determine the source of the error. This requires a good understanding and examination of your existing approach.

Before anything else, it’s important to find out whether the system retrieved the appropriate documents to address the user’s question. If you don’t provide the system with the right documents, how can you expect it to generate the right answer? If retrieval is the problem, we need to look into strategies to improve document retrieval.

If the system did retrieve the right documents, but still provided you with a wrong answer, we will need to have a look at your prompt and model parameters.

(Slide by Colin Jarvis (Solutions Architect — OpenAI) presented on Microsoft Ignite Switzerland)

Now that you know the cause, let’s explore ways to improve accuracy. Luckily, there are existing solutions that can be easily applied using frameworks like LangChain and Semantic Kernel.

2. Tune your chunking

So your solution was able to generate an answer that is only partially correct. This is often because the full context was not provided to your language model. To improve this, understanding your users’ queries and your dataset is key.

To provide accurate and complete answers to user questions, it’s important to have a clear understanding of the context in which the question is being asked. In some cases, a single page or document may provide enough context to generate a comprehensive answer. However, in other cases, it may be necessary to analyze multiple pages or even multiple documents to fully understand the context of the question and provide an accurate answer.

Moreover, understanding your data is crucial. Consider whether your data can be divided into smaller parts without losing context, or if specific preprocessing steps are needed to obtain metadata, such as the (sub)chapter title in a dense document.

# Split by tokens, with a sliding window
# https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/split_by_token
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=300, chunk_overlap=150
)
chunks = text_splitter.split_text(input_document)

# Split by headers (markdown)
# https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/markdown_header_metadata
from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")])
chunks = markdown_splitter.split_text(input_document)

Libraries like LangChain provide several pre-built document transformers, but it is not too hard to create your own preprocessing logic. In most cases, a simple approach using a sliding window to chunk your data will suffice. However, for more specific cases, additional logic may be beneficial.
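If the pre-built transformers don’t fit your documents, a custom splitter can be as simple as a sliding window over tokens. Below is a minimal sketch using tiktoken; the function name and chunk sizes are illustrative, not part of any library.

# Minimal sketch of a custom sliding-window splitter (illustrative, not a library API)
import tiktoken

def sliding_window_chunks(text: str, chunk_size: int = 300, overlap: int = 150) -> list[str]:
    """Split text into chunks of roughly chunk_size tokens that overlap by `overlap` tokens."""
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(encoding.decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks

chunks = sliding_window_chunks(input_document)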

3. Tune your search

This is usually the most crucial aspect. If you don’t give your language model the necessary context, it won’t be able to provide accurate answers. There are several ways to refine your search, so I will suggest some useful techniques to help you approach this from different angles.

3.1 Hybrid Search

Hybrid search is a method of using multiple search algorithms to achieve more precise and relevant search results. It combines the advantages of keyword-based and vector search techniques, resulting in a better search experience.

Keyword-based search algorithms work by matching the exact keywords or phrases in the search query against your documents. This can be useful for finding exact matches of specific terms, such as “article 21” in the Constitution. However, keyword-based searches can miss relevant results that use synonyms or different phrasing.

Vector-based search algorithms, on the other hand, consider the context and meaning of the words used in the search query. This can be useful for finding related results that might not include the exact search terms but are still relevant to the query. However, vector-based searches can sometimes miss exact matches of specific terms.

Simplified version of a Hybrid Search data flow — (Image by author)

By combining keyword-based and vector-based search techniques using Reciprocal Rank Fusion (RRF), hybrid search provides a more comprehensive and accurate search experience. It can identify exact matches of specific terms while also considering the context and meaning of the words used in the search query. This results in more relevant search results and thus a more accurate context for your LLM.

# Partial code sample for demonstration purposes
# https://python.langchain.com/docs/integrations/vectorstores/azuresearch
import os

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores.azuresearch import AzureSearch

embeddings = OpenAIEmbeddings(deployment_id="text-embedding-ada-002", chunk_size=5)
vector_store = AzureSearch(
    azure_search_endpoint=os.getenv('AZURE_COGNITIVE_SEARCH_SERVICE_NAME'),
    azure_search_key=os.getenv('AZURE_COGNITIVE_SEARCH_API_KEY'),
    index_name=os.getenv('AZURE_COGNITIVE_SEARCH_INDEX_NAME'),
    embedding_function=embeddings.embed_query,
)

docs = vector_store.similarity_search(
    query="What is Azure Cognitive Search?",
    k=3,
    search_type="hybrid",  # or use the hybrid_search() method on vector_store
)
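Azure Cognitive Search applies the RRF fusion on the service side, but the idea behind it is simple enough to sketch. Below is a minimal, illustrative implementation of Reciprocal Rank Fusion over two ranked result lists; it is not the actual service code.

# Minimal sketch of Reciprocal Rank Fusion (RRF), not the Azure Cognitive Search implementation
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge multiple ranked lists of document ids into a single ranking."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a keyword-based ranking and a vector-based ranking
keyword_results = ["doc3", "doc1", "doc7"]
vector_results = ["doc1", "doc5", "doc3"]
print(reciprocal_rank_fusion([keyword_results, vector_results]))  # doc1 and doc3 rise to the top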

3.2 Query parsing and metadata

Users can be unpredictable, as they may send lengthy stories or combine multiple questions into one message. It can help to filter out any extraneous information and identify the user’s main intent.

If you’ve invested time in building a solid search index with metadata, it would be a shame not to utilize it to its full potential. By pre-processing the user’s query and applying relevant filters to your index, you can significantly improve the quality of search results.

This pattern involves a multi-step retrieval process, which helps improve the quality of the context provided to the LLM. First, the user’s query is parsed to extract the relevant filters. Then your search index is queried with these filters applied, and the retrieved documents are provided to your LLM to generate a response.

Simplified version of a RAG flow with metadata — (Image by author)

The latest versions of gpt-35-turbo and gpt-4 have been fine-tuned to work with functions and are able to determine both when and how a function should be called. You can implement the pattern described above by using function calling (preview), which provides a native way for these models to formulate API calls and structure data outputs. For other LLMs you can use implementations like LangChain agents or the SelfQueryRetriever.

# Partial code sample for demonstration purposes
# Example of OpenAI Function Calling
# https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/function-calling
import openai  # assumes api_type, api_base, api_version and api_key are configured for Azure OpenAI

messages = [
    {"role": "system", "content": "You're an AI assistant designed to help users search for financial documents. The current year is 2023."},
    {"role": "user", "content": "Can you tell me more information about Microsoft investments in AI this year?"}
]

functions = [
    {
        "name": "search",
        "description": "Retrieves financial documents from the search index based on the parameters provided",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The question of the user"
                },
                "company_name": {
                    "type": "string",
                    "description": "The name of the company"
                },
                "year": {
                    "type": "number",
                    "description": "The year"
                }
            },
            "required": ["query"]
        }
    }
]

response = openai.ChatCompletion.create(
    engine="gpt-35-turbo-0613",
    messages=messages,
    functions=functions,
    function_call={"name": "search"},  # always use the search function
)
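The model’s reply then contains structured arguments for the search function, which you can parse and pass on to your search index. A minimal sketch of reading them back is shown below; run_search is a hypothetical helper that wraps your search index.

# Sketch: read back the structured arguments (run_search is a hypothetical helper around your index)
import json

message = response["choices"][0]["message"]
if message.get("function_call"):
    arguments = json.loads(message["function_call"]["arguments"])
    # e.g. {"query": "Microsoft investments in AI", "company_name": "Microsoft", "year": 2023}
    search_results = run_search(
        query=arguments["query"],
        company_name=arguments.get("company_name"),
        year=arguments.get("year"),
    )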

3.3 Hypothetical Document Embeddings (HyDE)

In our previous methods, we generated embeddings for our questions and compared them to our documents to find relevant answers. But does it make sense to compare a question with an answer? With the advent of large language models, we have new possibilities. We can now generate an answer (which may not be factually correct) and compare it to the factual content in our own knowledge base.

Hypothetical Document Embeddings (HyDE) [2] is a design pattern that leverages LLMs to create hypothetical documents based on the user’s query. These hypothetical documents serve as a proxy for the ideal answer and can be used to improve the retrieval process. The idea is that by comparing the embeddings of the hypothetical documents to those of the knowledge base, it becomes easier to identify the most relevant pieces of information to provide as context for the LLM.

Simplified diagram of the HyDE flow — (Image by author)

As discussed in the paper [2], HyDE remains competitive even when compared to fine-tuned retrievers. However, it is important to test whether this approach is beneficial for your specific scenario.
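Below is a minimal sketch of the HyDE flow, reusing the openai client and the vector_store from section 3.1; the prompt wording and deployment name are illustrative.

# Sketch of HyDE: generate a hypothetical answer, then retrieve with its embedding
hypothetical_answer = openai.ChatCompletion.create(
    engine="gpt-35-turbo-0613",  # illustrative deployment name
    messages=[
        {"role": "system", "content": "Write a short passage that answers the user's question."},
        {"role": "user", "content": "What is Azure Cognitive Search?"},
    ],
    temperature=0.7,
)["choices"][0]["message"]["content"]

# Search the knowledge base with the hypothetical answer instead of the raw question
docs = vector_store.similarity_search(query=hypothetical_answer, k=3)

LangChain also ships a HypotheticalDocumentEmbedder chain that wraps this pattern for you.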

3.4 Other options

Tuning your search could have been an article on its own. Every day, new approaches and clever tricks are being invented and shared on platforms like Twitter and GitHub.

Some other approaches and patterns that might improve relevancy are:

  • Tune your vector search algorithm and parameters to find the right balance between accuracy and latency
  • “Read Retrieve Read” [4] / ReAct [3]: iteratively evaluate the question for missing information and formulate a response once all the information is available
  • Parent Document Retriever [5]: fetch small chunks during retrieval to better capture semantic meaning, then provide the larger parent chunks with more context to your LLM (see the sketch after this list)
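For the last pattern, LangChain provides a ParentDocumentRetriever. A minimal sketch with an in-memory docstore is shown below; the chunk sizes are illustrative, vector_store is the AzureSearch store from section 3.1, and documents is your list of loaded Document objects.

# Sketch of the Parent Document Retriever pattern (illustrative chunk sizes)
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)  # larger chunks for the LLM
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)    # small chunks for retrieval

retriever = ParentDocumentRetriever(
    vectorstore=vector_store,       # e.g. the AzureSearch store from section 3.1
    docstore=InMemoryStore(),       # keeps the full parent documents
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(documents)  # index the child chunks, store the parents
docs = retriever.get_relevant_documents("What is Azure Cognitive Search?")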

4. Tune your prompt and model params

If your LLM is not providing accurate answers despite having the correct information, there are two main areas we will focus on: tuning the prompt and adjusting the model parameters. By optimizing these elements, you can improve the system’s ability to provide relevant and precise answers and mitigate the risk of fabrication (also known as hallucination).

4.1 Tune your prompt

  • Make it more specific: Ensure that the prompt is explicit and provides enough context to guide the model in delivering accurate results. Include relevant keywords, phrases, or concepts that will help the model better understand the task and knowledge domain.
  • Avoid ambiguity: Ambiguity can lead to irrelevant or incorrect results. Be clear and concise with your prompts to minimize confusion. Repetition can be used to emphasize an instruction, but make sure it does not contradict itself.
  • Use optimal length: Although longer prompts can provide more context, they might also introduce noise. Experiment with different lengths to determine the optimal one for your specific use case.
  • Few-shot learning: Provide a few examples of how the model should behave directly in your prompt, for example a short piece of context with a sample question and answer (see the example below).
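To illustrate few-shot prompting, here is a minimal sketch of a message list for Q&A on your own data; the wording and the retrieved_context / user_question placeholders are illustrative.

# Sketch of a few-shot prompt for Q&A on your own data (wording and placeholders are illustrative)
messages = [
    {"role": "system", "content": (
        "You answer questions using only the provided sources. "
        "If the sources do not contain the answer, say you don't know."
    )},
    # Few-shot example that demonstrates the expected behaviour
    {"role": "user", "content": "Sources: The office is open 9:00-17:00 on weekdays.\nQuestion: Is the office open on Sunday?"},
    {"role": "assistant", "content": "No, the office is only open on weekdays from 9:00 to 17:00."},
    # The actual question, with the retrieved context
    {"role": "user", "content": f"Sources: {retrieved_context}\nQuestion: {user_question}"},
]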

4.2 Choose the right model and parameters

It is important to experiment with different models to find the one that best fits your needs. Models can vary in their capabilities and the number of tokens they can handle, which is known as the context window. While gpt-35-turbo is suitable for most cases, gpt-4 is better suited for complex reasoning tasks, although it may be more expensive and slower. If you are using strategies that involve multiple LLM calls, such as HyDE, you can also choose to use gpt-35-turbo and gpt-4 side by side.

When using (Azure) OpenAI models, it’s recommended to test the performance of the newer version (0613) to see if it provides better results. These models are specifically designed to better understand and follow system messages and instructions, which can improve the accuracy of your LLM.

Configuration of model parameters in Azure AI Studio — (Image by author)

Next to the model choice, you can also configure the model parameters to influence the outcome (a short example follows the list below):

  • Temperature: The temperature parameter controls the randomness of the output. A higher value leads to more diverse results, while a lower value narrows down the output to more focused results. Experiment with different temperature values to find the sweet spot for your application. My personal preference is 0.1 for Q&A with your own data.
  • Top-k and top-p: These parameters control which tokens are considered while generating the output. Top-k limits generation to the k most likely tokens, while top-p (nucleus sampling) limits it to the smallest set of tokens whose cumulative probability exceeds p. For example, decreasing the top-k value might lead to more focused results, while increasing it may provide a broader range of answers. Adjust either temperature or top-p, but not both.
  • Max tokens: Limiting the number of tokens in the output can help ensure that the results are concise and relevant. Experiment with different values to find the optimal balance between completeness and brevity.
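As an illustration, these parameters map directly onto the chat completion call; the values below are just a starting point, not a recommendation.

# Sketch: setting the model parameters on an (Azure) OpenAI chat completion
response = openai.ChatCompletion.create(
    engine="gpt-35-turbo-0613",
    messages=messages,
    temperature=0.1,  # low randomness for Q&A on your own data
    top_p=1.0,        # adjust temperature or top_p, not both
    max_tokens=500,   # keep answers concise
)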

More background on these parameters for LLMs can be found in this great article on Hugging Face [6].

If you find that none of the existing models meet your requirements, finetuning a model can be an option, but this comes with a cost. Finetuning is not yet possible with gpt-35-turbo and gpt-4, but this has been announced by OpenAI as a future possibility.

4.3 Iterate and evaluate

Improving your LLM (RAG) system is an iterative process. Continuously experiment with different prompt structures and model parameter settings, and evaluate the results using appropriate metrics, such as accuracy, relevance, and diversity. Collect user feedback and analyze the system’s performance to identify areas for improvement.

One useful tool for this is PromptFlow (preview), which allows you to run multiple versions of your prompt at scale and verify their outcomes. By using PromptFlow, you can test different prompt structures and parameters and quickly identify which ones are most effective in achieving your desired outcomes.

Conclusion

This article emphasizes that there is not one perfect solution when it comes to designing strategies for RAG, as every approach has its own set of trade-offs and design patterns alone cannot compensate for poor data quality. Ensuring that your data is accurate, reliable, and up-to-date is essential for deriving meaningful insights and making informed decisions.

Ultimately, effective design patterns boil down to clever engineering and a deep understanding of your data and user queries. Continual evaluation and iteration are crucial for a successful solution. As with other solutions, the more effort you invest, the better the outcome. It’s essential to recognize the limitations and implement a system to gather user feedback. Logging plays a significant role in this process.

In the future, tools like Azure OpenAI Studio’s “Add your data (preview)” and open source offerings will become more advanced, easing your workload. However, one constant remains: understanding your scenario and data is paramount. The next topic in this series will discuss how to validate and test your RAG solution at scale.

If you enjoyed this article, a clap is appreciated and feel free to connect with me on LinkedIn, GitHub or Twitter to share what I should write about next time.

References

[1] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., Kiela, D., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (2021), arXiv:2005.11401

[2] Gao, L., Ma, X., Lin, J., Callan, J., “Precise Zero-Shot Dense Retrieval without Relevance Labels” (2022), arXiv:2212.10496

[3] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y. “ReAct: Synergizing Reasoning and Acting in Language Models” (2023), arXiv:2210.03629

[4] Karpas, E., Abend, O., Belinkov, Y., Lenz, B., Lieber, O., Ratner, N., Shoham, Y., Bata, H., Levine, Y., Leyton-Brown, K., Muhlgay, D., Rozen, N., Schwartz, E., Shachaf, G., Shalev-Shwartz, S., Shashua, A., Tenenholtz, M., “MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning” (2022), arXiv:2205.00445

[5] Harrison Chase, “Parent Document Retriever”. August 2023, https://python.langchain.com/docs/modules/data_connection/retrievers/parent_document_retriever

[6] Patrick von Platen, “How to generate text: using different decoding methods for language generation with Transformers”. July 2023, https://huggingface.co/blog/how-to-generate
