Improve LLM Performance

Dhaval Parmar
7 min read · Mar 31, 2024


Sunset in Kutch ©Dhaval Parmar

Generative AI is playing a key role in recent progress in the Artificial Intelligence domain. When it comes to text, Large Language Models (LLMs) are at the center. Nowadays, thanks to multi-modal functionality, we can also deal with images, speech, videos, etc. using LLMs, but I will keep this discussion focused on text for now.

People are exploring LLMs for various text-related tasks like translation, entity recognition, relationship extraction, summarization, question answering, etc. LLMs have their own set of problems due to the way they work. To name a few, they suffer from hallucination😇, false assumptions😵, irrelevant responses🙄, and others. To improve performance, we can try techniques like Prompt Engineering, Retrieval Augmented Generation (RAG), and Finetuning.

Prompt Engineering

It is a well-known and proven approach with LLMs, and it is the best place to start when trying to improve performance.

There are different ways to improve prompts. Here I will describe a few approaches.

Instructions: Provide detailed instructions about the task, the domain expertise required, the structure of the prompt, etc.

For example, “Consider yourself an astrophysicist and answer the given query using the provided context, in the form of numbers only.”

Describe Clearly: Keep your sentences to the point, in a proper flow, and in simple terms.

Divide in Steps: Complex tasks sometimes require a chain of subtasks. So whenever possible, try to divide the end task and get an answer for each subtask one at a time, in a progressive way (see the sketch after the example below). If that is not possible, at least try to describe these subtasks in the prompt.

For example,

Task: “Two cars, a and b, start moving towards each other at 30 km/h and 40 km/h respectively. Initially, the distance between the two cars is 100 km. At what distance from the starting point of car a will the two cars cross each other?”,

Steps: “

1. If two cars a and b are moving towards each other at 30 km/h and 40 km/h respectively, then find their relative speed.

2. If the cars were 100 km apart at the start, find the time the cars would take to cross each other using the relative speed.

3. Using that time and the speed of car a, find the distance traveled by car a.”
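To make the progressive approach concrete, here is a minimal sketch of chaining subtask prompts in Python. The generate() helper is a hypothetical placeholder for whichever LLM client you use, and the prompts are illustrative, not a specific API:

# Minimal sketch of solving a task as a chain of subtasks.
def generate(prompt: str) -> str:
    """Placeholder: replace with a call to your LLM provider."""
    return "<model output>"

subtasks = [
    "Two cars a and b move towards each other at 30 km/h and 40 km/h. Find their relative speed.",
    "The cars start 100 km apart. Using the relative speed found above, find the time they take to cross each other.",
    "Using that time and car a's speed of 30 km/h, find the distance car a travels.",
]

answers = ""
for step in subtasks:
    # Feed each subtask together with the answers obtained so far,
    # so the model can build on earlier results.
    prompt = f"{answers}\nNext step: {step}\nAnswer:"
    answers += "\n" + generate(prompt)

print(answers)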

Few-shot: Provide a few sample input-output pairs along with the actual query, so the model can use the examples as references and produce output in a similar way. It also helps the model understand the expected output format (a prompt-assembly sketch follows the example below).

For example, a two-shot prompt:

“Query: The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
Output: The answer is False.
Query: The odd numbers in this group add up to an even number: 17, 10, 19, 4, 8, 12, 24.
Output: The answer is True.
Query: The odd numbers in this group add up to an even number: 16, 11, 14, 4, 8, 13, 24.
Output: ”
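Here is a minimal sketch of assembling such a few-shot prompt programmatically; the examples mirror the two-shot prompt above, and the resulting string would be passed to whichever LLM client you use:

# Build a few-shot prompt from example (query, output) pairs plus the actual query.
examples = [
    ("The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.",
     "The answer is False."),
    ("The odd numbers in this group add up to an even number: 17, 10, 19, 4, 8, 12, 24.",
     "The answer is True."),
]
query = "The odd numbers in this group add up to an even number: 16, 11, 14, 4, 8, 13, 24."

prompt = ""
for q, a in examples:
    prompt += f"Query: {q}\nOutput: {a}\n"
prompt += f"Query: {query}\nOutput: "

print(prompt)  # pass this string to your LLM client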

Context: Hallucination is a known problem with LLMs. We can provide supportive knowledge, facts, or assumptions as context for the query. This helps the model answer based on more relevant and correct information. It is mainly helpful when questions are not general but related to a particular person, company, or task.

For example,

“Please use the context to answer the given question

Context : India is a country with rich heritage and culture. It is the seventh largest country in the world by area. India’s national bird is the peacock, and the Bengal tiger is the national animal. The national song is called Vande Mataram and was written by Bankimchandra Chatterji. India is known for its diversity in languages, religions, food, clothing, and festivals. It is a secular country with a democratic government. India has made significant progress in various fields like science, technology, and space exploration. It is a major economic power and is home to many multinational companies. India is a popular tourist destination with many historical monuments, natural wonders, and beautiful beaches.

Question: What is India’s national bird?”

Separating Tokens: Use proper separating tokens to mark the start and end of the different prompt sections. This helps the model understand the input properly. We can also use a token to mark the end of the output, so the model terminates the response after a finite output rather than keep generating random text (a small templating sketch follows the example below).

For example,

“<INST> All instruction here </INST>

<CONTEXT> Context text </CONTEXT>

<INPUT> Input query </INPUT>

<OUTPUT> Output text </OUTPUT>”
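Here is a minimal templating sketch using the tags from the example above; the commented-out client.generate() call and its stop parameter are assumptions about your LLM client (most hosted APIs expose some form of stop sequence):

# Fill a section-delimited prompt template; the tag names mirror the example above.
TEMPLATE = (
    "<INST> {instructions} </INST>\n"
    "<CONTEXT> {context} </CONTEXT>\n"
    "<INPUT> {query} </INPUT>\n"
    "<OUTPUT> "
)

prompt = TEMPLATE.format(
    instructions="Answer using only the given context.",
    context="India's national bird is the peacock.",
    query="What is India's national bird?",
)

# Hypothetical client call: a stop sequence lets generation end at the closing tag.
# response = client.generate(prompt, stop=["</OUTPUT>"])
print(prompt)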

Output Format: Accuracy can also be affected by how we consume the output generated by the LLM. For tasks like entity or relationship extraction, if the LLM answers in plain text we have to post-process the output further, and that can introduce errors. Prompt instructions or few-shot examples are great places to let the model know about the required output format. It can be JSON, a list of elements separated by special characters, etc. (a parsing sketch follows the example below).

For example,

“Please provide the output in JSON form as

{"Entities": ["entity 1", "entity 2", …],

"Relations": [["entity 1", "entity 2"], ["entity 3", "entity 4"], …]}”
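Here is a minimal sketch of consuming such structured output, assuming the model's reply is already stored in raw_output (hard-coded here for illustration); the try/except avoids silently mangling entities when the model does not return valid JSON:

import json

# raw_output would normally come from your LLM call; it is hard-coded here for illustration.
raw_output = '{"Entities": ["India", "peacock"], "Relations": [["India", "peacock"]]}'

try:
    parsed = json.loads(raw_output)
    entities = parsed.get("Entities", [])
    relations = parsed.get("Relations", [])
except json.JSONDecodeError:
    # The model did not return valid JSON; retry or post-process as needed.
    entities, relations = [], []

print(entities, relations)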

Unbiased: I have seen people intentionally or unintentionally mention certain biases or state wrong information in the input. LLMs are designed to follow instructions, so the model takes that input as fact and generates output based on it, which leads to wrong answers to the query.

For example,

Biased: “Product a is at a higher price compared to b, so I think a will be better than b. Is product a better or b?”,

Unbiased: “Here are all the details and feedback about products a and b. Which one is better?”

Retrieval Augmented Generation (RAG)

As discussed in the prompt engineering section, context can help the model answer the query more accurately. Also, for specific tasks like answering questions about the HR policy of a particular company or handling product-specific queries, it is better to provide query-related documents to the model. LLMs have a maximum token limit, so we can't provide all the available documents. And if we provide long documents along with the question, it becomes difficult for the model to give correct answers.

Retrieval Augmented Generation (RAG) is a solution to these challenges. The workflow is as follows: we create a vector database that stores vector representations of the documents. The vector embedding captures the high-level information of each document. We can also divide documents into chunks based on the allowed token size.

Now, when we get an input query, we use the same method to get an embedding representation of the query. To extract related documents or chunks, we use a similarity search between the query vector and the database vectors. Similarity can be measured with different techniques such as cosine similarity or L2 distance. The extracted documents are then provided as context along with the query in the model input. One needs to experiment with which metric and how many context documents work well for a specific task; a retrieval sketch is given below. I will create a separate blog to explain RAG with a working example of a specific task.
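Here is a minimal sketch of the retrieve-then-prompt step, assuming a hypothetical embed() helper that wraps whichever embedding model you use; the dummy random vectors and document texts only exist so the sketch runs end to end:

import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: replace with your embedding model (e.g. a sentence encoder)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(384)  # dummy vector so the sketch runs end to end

documents = [
    "India's national bird is the peacock.",
    "The Bengal tiger is India's national animal.",
    "Vande Mataram was written by Bankimchandra Chatterji.",
]
doc_vectors = np.stack([embed(d) for d in documents])

query = "What is India's national bird?"
query_vector = embed(query)

# Cosine similarity between the query and every stored document vector.
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
top_k = np.argsort(scores)[::-1][:2]  # keep the 2 most similar chunks

context = "\n".join(documents[i] for i in top_k)
prompt = f"Please use the context to answer the question.\nContext: {context}\nQuestion: {query}"
print(prompt)  # pass this to your LLM client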

Finetuning

Both prompt engineering and RAG do not update model weights, so they require less computing power and allow quick iteration when you start improving an LLM's performance. The next approach is fine-tuning a pre-trained LLM. Fine-tuning is helpful when we want the model to adapt to domain-specific knowledge. For example, various terminologies and their relations in the clinical domain will not be obvious to models trained on general data, so it is better to fine-tune the model on clinical notes or literature. It is also helpful for training the model for a specific task, or for giving output in a certain format or writing style, so we get the desired output even with zero-shot prompts.

To fine-tune an LLM, we need to collect domain- or task-specific data, pre-process it, divide it into chunks, and convert it into the format the model requires. For fine-tuning, we do not need as huge an amount of data as we do to pre-train a model, but it should be sufficient to tune the model properly. The prepared data is used to update the model weights with gradient descent, based on the loss and backpropagation. Experiment with different hyperparameters to achieve the best result; a minimal training sketch is given below. The fine-tuning process requires a considerable amount of GPU computing power and time, but once trained, the model can perform significantly better on domain-specific tasks.
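Here is a minimal fine-tuning sketch using the Hugging Face transformers Trainer on a toy in-memory dataset; the base model name, hyperparameters, and the two-sentence corpus are placeholders, and a real run needs proper data preparation and GPU resources:

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"  # placeholder; pick a base model suited to your domain
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy domain corpus; in practice this would be your clinical notes, policies, etc.
texts = ["Example domain-specific sentence one.", "Example domain-specific sentence two."]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="finetuned-model",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("finetuned-model")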

In real-world scenarios, domain knowledge also keeps getting updated, but it is not practical to fine-tune the LLM every few days. So it is better to apply prompt engineering techniques with proper, up-to-date context to get the best out of the fine-tuned model. Also, even after capturing domain knowledge, when a query is about a particular thing we still have to use techniques like RAG to get accurate results from the fine-tuned model. Based on the domain and task, we can decide how frequently the model should be fine-tuned. I will provide a working example of the fine-tuning process shortly in a different blog.

People are still exploring various ways to get the best performance from LLMs. You should experiment with different techniques because no single approach always works. As I said above, it is advisable to start with prompt engineering and RAG as they do not require a lot of computing power and time. Let me know if you have any new ideas or approaches that have worked for you to improve LLM performance. Thanks for reading and happy learning !!!


Dhaval Parmar

Trying to understand the physical world better with numbers. Data Scientist https://dhavalparmar61.github.io/