LLM / GPT Prompt Engineering — going beyond the basics, and how you can mitigate confusion

Martin Keywood
Apr 11, 2023 · 9 min read


I want to start by saying I’m not trying to jump on the LLM (Large Language Model) bandwagon since the hype of ChatGPT. I have been hugely impressed, and have been dipping my toes in the GPT waters since GPT-2 back in 2019. Generative models have been a go-to first step for many of my NLP tasks, instead of immediately assuming I have to go and fine-tune a model or, worse, train one from scratch on expensive-to-generate training data.

But I’m also not going to pretend I was not utterly blown away with ChatGPT and subsequently GPT-4 and the vast array of Open-Source models that seem to appear on a weekly basis.

My time spent on these models has been significant since the launch of GPT-3, and I have been lucky enough to have this as part of my day job. My title is not Prompt Engineer, but to all intents and purposes it could be, in part (or, as I was jokingly referred to, an AI Whisperer 😁).

If you’re reading this I assume you’re familiar with the well-documented key ingredients of a good prompt. Thanks to Riley Goodside for his Twitter feed and the experimentation that has set the groundwork for a lot of these fundamental approaches and understandings, along with the countless internet articles and blogs authored by different people since the launch of ChatGPT.

In essence, the basic key considerations are:

  • Start your prompt by telling the model to ‘role play’ to establish the context of its understanding and its core skills
  • Tell it to not make up stuff if it doesn’t know the real answer
  • Tell it what you want it to do. Give clear concise instructions. I always consider how I would explain it to a child or an elderly relative
  • If necessary, give an example(s) of the format of the question and the desired answer
  • If you want it to show its working out, you can end the prompt with something like “Let’s go step by step. Firstly “ and let the model continue that pattern with its generated output
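Pulling those ingredients together, a basic prompt skeleton might look something like the sketch below. The wording, and the penguin subject matter, are purely illustrative:

# An illustrative basic prompt following the ingredients above (content is made up)
basic_prompt = """You are an expert in penguin operations.
If you don't know the answer, say you don't know; do not make anything up.

Answer the question below in no more than three sentences.

Example:
Question: What colour are emperor penguins?
Answer: Black and white, with yellow ear patches.

Question: What are the main roles for ninja penguins?
Let's go step by step. Firstly """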

So I’m not going to go over things like that. I assume you know all that.

So why are we here?

What I want to go over are some of the things I have done and have learnt that maybe weren’t as obvious as just writing a good prompt.

Many months back I wanted to use the power of LLMs on our own data: to largely ignore what GPT-3 ‘knew’ (information up to the end of 2021) and instead leverage what GPT-3 ‘understood’ about language, so that I could use it on our own company data. I originally decided to roll my own solution using embeddings and local JSON files as a pseudo vectorstore, and to be honest it worked a treat, but it wasn’t really scalable.

But then I stumbled across the amazing LangChain library. I moved to that and have never looked back. These guys are doing an awesome job and I strongly recommend you take a look if you haven’t already.

I won’t go into details of how I originally set up to use LangChain with our own data. I was going to, but on a quick search I see Jeremy Arancio has done a Medium article that explains it perfectly already.

However, here’s a very simple overview of the steps (a minimal code sketch follows the list):

  1. The document(s) we want to use with our LLM are converted to raw text (from PDF, DOCX, PPTX, etc.)
  2. This raw text is then split into chunks
  3. Each chunk is then converted to its embeddings and stored in a vectorstore
  4. When the user asks a question, that question is also converted to its embeddings and compared, via cosine similarity, to each of the embeddings in the vectorstore
  5. The top k chunks are then returned, and a prompt is constructed from the instructions / question and the selected chunks
  6. The intention is that the model will find the answer to the question in the selected chunks
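To make those steps concrete, here is a minimal sketch of that ingest-and-retrieve flow using LangChain. The PDF loader, file paths, chunk sizes and query are my own illustrative choices rather than a prescription:

from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

# 1-2. Convert the document to raw text and split it into chunks
docs = PyPDFLoader('../data/mydocs/ninja_penguins.pdf').load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# 3. Embed each chunk and store the embeddings in a vectorstore
vectorstore_mydocs = FAISS.from_documents(chunks, OpenAIEmbeddings(chunk_size=1))
vectorstore_mydocs.save_local('../data/mydocs')

# 4-5. Embed the question and pull back the top k most similar chunks
query = "What are the main roles for ninja penguins?"
top_chunks = vectorstore_mydocs.similarity_search(query, k=4)

# 6. These chunks then go into the prompt alongside the instructions / question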

The only difference is that I use the gpt-3.5-turbo model, and as it is a chat model it has a slightly different API. In essence, it’s not just a single long text prompt as you would put into a GPT-3 call, but rather a System Message (the sort of instructions you would use to start a traditional prompt) followed by alternating Human and AI messages, forming a history of the chat. (You can mimic the alternating Human / AI messages to emulate one- or few-shot examples instead of putting them in the main prompt as you would traditionally.) This OpenAI article explains the differences better.
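For illustration, a chat-style call might look like the sketch below, with the alternating Human / AI pair acting as a one-shot example. The content is made up, and I am using LangChain’s message classes here rather than the raw OpenAI API:

from langchain.chat_models import AzureChatOpenAI
from langchain.schema import AIMessage, HumanMessage, SystemMessage

chat_llm = AzureChatOpenAI(deployment_name='gpt-35-turbo', max_tokens=500)

response = chat_llm([
    SystemMessage(content="You are an expert in penguin operations. If you don't know the answer, say you don't know."),
    HumanMessage(content="What colour are emperor penguins?"),                            # one-shot example question
    AIMessage(content="Emperor penguins are black and white, with yellow ear patches."),  # one-shot example answer
    HumanMessage(content="What are the main roles for ninja penguins?")                   # the real question
])
print(response.content)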

So, I wanted to give some practical and real-world examples of my experiences that I hope will help others.

Set yourself a repeatable and reliable dev and test framework and good practices

  • You can soon get yourself tied up in knots, especially with chat models and LangChain chains that are doing a lot of clever stuff. It’s easy to lose track of whether you are calling a chain rather than just an LLM (meaning more is going on than a single call to the model), which prompts or system messages are being used, and so on. Wrapping the setup in helper functions gives you one place to control stop sequences, temperature, number of tokens, etc.
  • For example, my base setup would be something like:
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chat_models import AzureChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.memory import ConversationBufferMemory
from langchain.prompts.chat import (ChatPromptTemplate,
                                    HumanMessagePromptTemplate,
                                    SystemMessagePromptTemplate)
from langchain.vectorstores import FAISS


def get_llm(max_tokens=1000, temperature=0, stop=None):
    """
    A function that returns an instance of AzureChatOpenAI with specified model parameters.

    Parameters:
    max_tokens (int): the maximum number of tokens to generate in response to the prompt.
    temperature (float): a value controlling the degree of randomness in the generated text.
    stop (list, optional): a list of tokens at which the text generation should stop. Defaults to None.

    Returns:
    llm (AzureChatOpenAI): an instance of AzureChatOpenAI with the specified model parameters.
    """
    model_kwargs = {'temperature': temperature}
    if stop:
        model_kwargs['stop'] = stop
    llm = AzureChatOpenAI(deployment_name='gpt-35-turbo', max_tokens=max_tokens, model_kwargs=model_kwargs)

    return llm


def get_qa_chain(vectorstore_mydocs, llm, memory, query):
    """
    A function that creates a question-answering (QA) system using a retrieval-based approach.

    Parameters:
    vectorstore_mydocs (VectorStore): a VectorStore object containing the documents to be searched for answers.
    llm (AzureChatOpenAI): an instance of AzureChatOpenAI to use for generating responses to user questions.
    memory (Memory): a Memory object to keep track of conversation history.
    query (str): the user's query or question.

    Returns:
    chain: the LangChain QA system chain.
    """
    # Note: memory and query are accepted for consistency with other chain helpers,
    # but they are not needed by this particular chain.

    _DEFAULT_SYSTEM_MESSAGE_QA = """Use the following pieces of context to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
-----
{summaries}"""

    prompt = ChatPromptTemplate.from_messages([
        SystemMessagePromptTemplate.from_template(_DEFAULT_SYSTEM_MESSAGE_QA),
        HumanMessagePromptTemplate.from_template("{question}")
    ])

    chain_type_kwargs = {"prompt": prompt}
    chain = RetrievalQAWithSourcesChain.from_chain_type(
        llm,
        chain_type="stuff",
        retriever=vectorstore_mydocs.as_retriever(),
        chain_type_kwargs=chain_type_kwargs
    )

    return chain


def full_answer(qa, query):
    """
    Given a question answering chain and a query, returns the full answer to the query.

    Parameters:
    qa (Chain): a question answering chain that takes a dictionary with a "question" key and a "chat_history" key as input, and returns a dictionary with an "answer" key as output.
    query (str): The query to be answered.

    Returns:
    str: The full answer to the query, as returned by the question answering chain.
    """
    result = qa({"question": query, "chat_history": []})["answer"].replace('\n', '')

    return result


query = "What are the main roles for ninja penguins?"

llm = get_llm(max_tokens=500, temperature=0.1, stop=["#"])
memory = ConversationBufferMemory(return_messages=True)
vectorstore_mydocs = FAISS.load_local('../data/mydocs', OpenAIEmbeddings(chunk_size=1))

qa_chain = get_qa_chain(vectorstore_mydocs, llm, memory, query)

print(full_answer(qa_chain, query))
  • So I recommend having a good set of helper functions, not just to make your text splitter, LLM and chain parameters explicit, but also to give clarity on which prompts are used with each, what their histories are, and to return content in a consistent shape whether you are calling a single LLM or a chain
  • I know this sounds trivial, and perhaps obvious, but I promise it’ll help

Work up to what you want, and don’t be afraid to go back to basics if it’s not doing what you expect

  • OK you have a vectorstore of all your docs and have set up a LangChain chain to QA that, and the first prompt returned exactly what you want. Happy days, right? But your next prompt returned garbage (or hopefully you had a good system message and it returned that it ‘didn’t know the answer’ instead of a hallucination 😁). So why and what can you do?
  • Firstly, let’s think about what’s going on. We started this whole journey with one ‘black box’: the LLM itself. We passed in a prompt and we got an intelligent response. Through good prompt engineering we have got better at taming that beast to get the response we want. And that’s the key: we knew exactly what we were passing in
  • But when we use a vectorstore as well as the LLM, we are doubling up the ‘black boxes’: one for the vectorstore (the chunks of your data being returned because they may hold your answer), plus the existing LLM one. So how can we look at every part, bit by bit?
  • Firstly, check your base prompt and system message. Make sure they make sense and are what you want. If you were to manufacture a body of text that has the answer you want in it, is that answer returned using that prompt?
  • Once we know the surrounding prompt works when the body of context accompanying it is sound, let’s look at the ‘k’ chunks that would be passed in as part of the prompt to make up this body of context. To look at this outside of the chain, simply call the vectorstore directly on your query:
for page in vectorstore_mydocs.as_retriever().get_relevant_documents(query):
    print(page.page_content)
  • OK, so now you have the relevant parts to construct a prompt manually. We have the base prompt, and we have the chunks of text selected from the vectorstore, so build the prompt and try it (see the sketch after this list)
  • From there we can determine where the issue is and troubleshoot. Maybe the wrong chunks are coming back from the vectorstore because the chunk size was wrong, or there was not enough overlap to keep the context, etc. Or maybe the prompt was not descriptive enough
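Putting that together, a sketch of doing the chain’s job by hand might look like this, reusing the llm, vectorstore_mydocs and query from earlier and re-declaring the same system message text inline:

from langchain.schema import HumanMessage, SystemMessage

# Pull the chunks the retriever would hand to the chain
chunks = vectorstore_mydocs.as_retriever().get_relevant_documents(query)
context = "\n\n".join(page.page_content for page in chunks)

# Re-create the system message the chain uses, filled with those chunks
system_message = f"""Use the following pieces of context to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
-----
{context}"""

# Call the LLM directly, outside the chain, and inspect the answer
answer = llm([SystemMessage(content=system_message), HumanMessage(content=query)])
print(answer.content)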

But my prompt is just not giving me the answer I want

  • That may well be true, and what can we do about it?
  • Well, that depends on what you are trying to get it to do. Are you trying to be too ambitious? For example, if I were to ask it Look across my data source to see if there are any mentions of ninja penguins, and if so give me the details and also a summarized answer of Yes. If not, then a summarized answer of No. then even with some few-shot examples I am not getting a consistent response. So don’t be afraid to split up the prompts and use the output of one call as an input to another. For example (with the addition of another helper function):
from langchain.schema import HumanMessage, SystemMessage


def summarize_answer(llm, query, full_answer):
    """
    Given an LLM, a query and a full answer, returns a summarized version of the answer.

    Parameters:
    llm (AzureChatOpenAI): an instance of AzureChatOpenAI to use for generating responses to user questions.
    query (str): The query that was asked.
    full_answer (str): The full answer to the query.

    Returns:
    str: A summarized version of the answer.
    """

    _DEFAULT_SYSTEM_MESSAGE_QA_SUMMARIZE_PART_0 = "You are an expert in taking text and reducing it to a Yes or No summary."
    _DEFAULT_SYSTEM_MESSAGE_QA_SUMMARIZE_PART_1 = "Given the following text, is the overall sentiment for the question '"
    _DEFAULT_SYSTEM_MESSAGE_QA_SUMMARIZE_PART_2 = """' a Yes or No?
If no evidence of it, then the answer is No.
Give a one word answer.
-----
"""

    summary_prompt = _DEFAULT_SYSTEM_MESSAGE_QA_SUMMARIZE_PART_1 + query + _DEFAULT_SYSTEM_MESSAGE_QA_SUMMARIZE_PART_2 + full_answer
    answer = llm([SystemMessage(content=_DEFAULT_SYSTEM_MESSAGE_QA_SUMMARIZE_PART_0), HumanMessage(content=summary_prompt)])
    answer = answer.content.replace('\n', '')

    return answer


query = "Look across my data source to see if there are any mentions of ninja penguins"
full = full_answer(qa_chain, query)
summary = summarize_answer(llm, query, full)
print(summary)
print(full)
  • By splitting it up you have more control. We are using the full chain for the main answer, but then a base LLM call for the summarization to distill it down to a single word. We could even set max_tokens to 1 if we wanted, to guarantee a one-word answer
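For example, a minimal sketch reusing the get_llm and summarize_answer helpers from above (the exact parameters are illustrative):

# A separate LLM instance restricted to a single output token for the Yes / No step
summary_llm = get_llm(max_tokens=1, temperature=0)
print(summarize_answer(summary_llm, query, full))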

LLMs can appear brittle and temperamental, but you are in control

  • Using the above considerations to establish clear, repeatable patterns, and knowing what is going on where, along with a good understanding of chunk sizing and using splitters correctly, has helped me mitigate much of the brittleness and many of the eye-opening headaches I have seen when using these models

Conclusion

I hope this has been useful in some way. Maybe it feels like it’s more ‘stating the obvious’, but working with LLMs involves more of that than anything else I have encountered in my career.

And this is only the beginning. The possibilities here are mind blowing, and I look forward to where we go next.

If you liked this article, please take a look at the next one, where I discuss cutting through the quagmire of using RAG with LLMs — introducing BRAG.

Thanks for reading.

