RAG: Part 6: Prompting and Inferencing

Mehul Jain
11 min read · Apr 11, 2024


Prompting is the bridge between us and powerful LLMs, allowing us to harness their capabilities effectively. Done correctly, it lets us get the most out of them.


In this series, we have seen how chunking, embedding, and retrieval can be done. This blog post dives into prompting and inferencing.

Prompting techniques

There are various prompting techniques that can be combined with RAG to enhance its capabilities.

1) Zero-Shot Prompting

Zero-shot prompting is a technique for instructing LLMs without any specific examples included in the prompt itself. It relies on the LLM’s understanding of the world and its ability to follow instructions.

Example:

Imagine you want to write a poem in the style of Shakespeare. Instead of providing the LLM with several Shakespearean poems as examples, you could use a zero-shot prompt like this:

Prompt: Write a poem about love (you can mention any personal moment as well) in Shakespearean style.

Key Takeaways:

  • No training examples are provided in the prompt itself.
  • Relies on the LLM’s pre-trained knowledge and ability to follow instructions.
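
As a tiny illustration, here is what sending that zero-shot prompt might look like in code. The complete() helper is a hypothetical stand-in for whatever LLM client you actually use, not a real library call.

def complete(prompt: str) -> str:
    # Hypothetical stub: in a real setup this would call your LLM client.
    return "<model output goes here>"

# The prompt carries only the instruction, with no examples.
prompt = (
    "Write a poem about love (you can mention any personal moment as well) "
    "in Shakespearean style."
)
print(complete(prompt))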

2) Few-Shot Prompting

Few-shot prompting is a technique that bridges the gap between zero-shot and full fine-tuning for LLMs. It provides the LLM with a few labelled examples within the prompt itself, guiding it towards the desired output format and content.

Example:

Imagine you want the LLM to classify emails by topic. A bare zero-shot instruction might not be very effective, so you add a couple of labelled examples to the prompt:

Prompt: Classify the following email into two categories: actionable or non-actionable.

Example 1: I need a bundle of books by EOD: actionable

Example 2: Thank you for your support: non-actionable

Key Takeaways:

  • Improves accuracy over zero-shot prompting by providing concrete reference points.
  • Requires significantly less data compared to fully fine-tuning the LLM for a specific task.
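
To make this concrete, here is a small sketch that assembles a few-shot classification prompt from labelled examples. The example data and the build_few_shot_prompt helper are illustrative, not part of any library.

def build_few_shot_prompt(new_email: str) -> str:
    # Labelled examples that anchor the desired output format.
    examples = [
        ("I need a bundle of books by EOD", "actionable"),
        ("Thank you for your support", "non-actionable"),
    ]
    lines = ["Classify the following email into two categories: actionable or non-actionable.", ""]
    for i, (text, label) in enumerate(examples, start=1):
        lines.append(f"Example {i}: {text}: {label}")
    lines.append("")
    lines.append(f"Email: {new_email}:")  # the model completes the label
    return "\n".join(lines)

print(build_few_shot_prompt("Please send the quarterly numbers before Friday's review"))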

3) Chain-of-Thought Prompting

Chain-of-Thought (CoT) prompting is a technique that goes beyond simply instructing LLMs what to do. It aims to encourage them to explain their reasoning process along the way, providing a more transparent and potentially more accurate response. By showing the LLM how to solve a similar problem and explaining the reasoning behind each step, CoT prompting encourages it to follow a similar thought process for the new question.

Example:

Imagine you’re asking an LLM to solve a math problem. A standard prompt might just ask for the answer. However, CoT prompting would provide an example of breaking down a similar problem step-by-step:

Prompt: Question: Is 14 divisible by 3?

Answer: No, because 14 divided by 3 equals 4 with a remainder of 2. Since remainders in divisibility by 3 must be 0, 14 is not divisible by 3.

Question: Is 12 divisible by 3?

Answer: Yes, because 12 divided by 3 equals 4 with no remainder. When there’s no remainder, the number is divisible.

New Question: Is 18 divisible by 6?

Key Takeaways:

  • By explaining steps, LLMs are nudged towards a more logical thought process, potentially leading to more accurate answers, especially for complex tasks.
  • CoT prompts reveal the thought process behind the LLM’s answer, allowing us to identify if it arrived at the solution correctly or made a mistake along the way.

4) Self-Consistency

Self-consistency prompting is an advanced technique that builds upon Chain-of-Thought (CoT) prompting to improve the accuracy and reliability of responses from LLMs on tasks requiring reasoning. It generates multiple reasoning paths and then selects the most consistent answer. It’s like asking the LLM to solve the problem a few times, each time explaining its steps. The final answer is chosen based on which explanation appears most frequently across these attempts.

Example:

Let’s say you ask, “John has 10 apples. He gives 4 to Sarah. How many apples does John have left?”

Prompt: John has 10 apples. He gives 4 to Sarah. How many apples does John have left? Explain each step of your reasoning.

The LLM could generate multiple responses with explanations:

Response 1: “John starts with 10 apples. He gives 4 to Sarah, so he has 10–4 = 6 apples left.”

Response 2: “John has 10 apples. Since he gave some to Sarah, he has fewer now. Subtracting 4 from 10 gives us 6 apples left for John.”

Here, both responses lead to “6” apples, making it the most consistent answer chosen by self-consistency prompting.

Key Takeaways:

  • It works by generating multiple solutions with explanations and selecting the most frequently occurring answer.
  • This method helps LLMs explore different reasoning paths and avoid getting stuck on the first solution they find.
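
Below is a minimal sketch of that voting step. The sample_llm() stub stands in for a real call made with a non-zero temperature so each sample can follow a different reasoning path, and the regex-based answer extraction is a deliberate simplification.

import re
from collections import Counter

def sample_llm(prompt: str) -> str:
    # Hypothetical stub: a real call would use temperature > 0 so that
    # different samples can take different reasoning paths.
    return "John starts with 10 apples. He gives 4 to Sarah, so 10 - 4 = 6 apples left."

def extract_answer(response: str) -> str:
    # Naive extraction: take the last number mentioned in the explanation.
    numbers = re.findall(r"\d+", response)
    return numbers[-1] if numbers else ""

prompt = ("John has 10 apples. He gives 4 to Sarah. How many apples does John "
          "have left? Explain each step of your reasoning.")

answers = [extract_answer(sample_llm(prompt)) for _ in range(5)]
final_answer, votes = Counter(answers).most_common(1)[0]
print(final_answer, votes)  # the most consistent answer across the samples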

5) Prompt Chaining

Prompt chaining is where you break down a complex task into smaller, more manageable prompts, feeding the output of each prompt as input for the next. It’s like giving the LLM step-by-step instructions to achieve a final goal.

Example:

Imagine you want the LLM to write a news article about a scientific discovery. Instead of giving it one giant prompt, you could use prompt chaining:

Prompt 1: Briefly summarize the scientific field of Neurology. What are some key concepts or challenges?

Prompt 2: Building on the summary of Neurology from the previous step, describe a recent scientific discovery. What was discovered and by whom?

Prompt 3: How could the discovery from the previous step potentially impact the field of Neurology or society as a whole?

Key Takeaways of Prompt Chaining:

  • Makes it easier for LLMs to understand and complete intricate instructions.
  • Each prompt refines the output, leading to a more focused and accurate outcome.
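
Here is a rough sketch of that chaining loop. The llm() function is a hypothetical stub standing in for a real client call; the point is that each prompt template is filled with the previous step's output.

def llm(prompt: str) -> str:
    # Hypothetical stub for a real LLM call.
    return f"<LLM answer to: {prompt[:40]}...>"

prompt_templates = [
    "Briefly summarize the scientific field of Neurology. "
    "What are some key concepts or challenges?",
    "Building on this summary of Neurology: {previous}\n"
    "Describe a recent scientific discovery. What was discovered and by whom?",
    "Given this discovery: {previous}\n"
    "How could it potentially impact the field of Neurology or society as a whole?",
]

previous = ""
for template in prompt_templates:
    prompt = template.format(previous=previous)
    previous = llm(prompt)  # each output feeds the next prompt
print(previous)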

6) Tree of Thoughts (ToT)

Tree of Thoughts (ToT) prompting is an advanced technique that encourages LLMs to explore different reasoning paths and arrive at the most likely solution. It builds upon Chain-of-Thought (CoT) prompting but adds a layer of depth and exploration.

  1. Branching Possibilities: Unlike CoT, Tree of Thoughts prompting allows the LLM to explore alternative reasoning paths.
  2. Depth-First Search: The LLM prioritizes exploring the most promising paths first, similar to a depth-first search algorithm. If a path leads to an inconsistency or dead end, the LLM backtracks and explores another branch of the “thought tree.”
  3. Constraints and Focus: To make the search manageable, the prompt might include constraints on how far the LLM can explore each branch (e.g., a maximum of 10 steps). This helps maintain focus and prevents the exploration from going off on tangents.

Image source: the Tree of Thoughts authors

Example:

Let’s say the prompt is “You are lost in a forest. There are two paths ahead: one leading north and the other east. You need to find the nearest town. Describe your thought process for choosing a path.”

  • Branch 1: “North might lead to a river, but there could be mountains blocking the way. I’ll check the north path first.”
  • Branch 2: “East might lead to open plains where towns are often built. But what if there’s a desert in the east?”

A popular single-prompt approximation of ToT looks like this:

Prompt: Imagine three different experts are answering this question. All experts will write down 1 step of their thinking, then share it with the group. Then all experts will go on to the next step, etc. If any expert realises they’re wrong at any point then they leave. The question is…

Key Takeaways:

  • Tree of Thoughts Prompting encourages the LLM to explore multiple solutions and choose the most likely one based on its reasoning.
  • This method is useful for complex tasks requiring the LLM to consider different possibilities and justify its conclusions.
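
For intuition, here is a deliberately simplified ToT-style search. In the actual Tree of Thoughts setup, both the thought proposal and the scoring are themselves LLM calls, and the search can be depth-first with backtracking as described above; this sketch keeps only the best few branches at each level (a beam-style search) so the control flow stays short. All function names are illustrative.

import random

def propose_thoughts(state: str, n: int = 3) -> list[str]:
    # Stand-in for an LLM call that proposes candidate next thoughts.
    return [f"{state} -> thought {i}" for i in range(n)]

def score_thought(thought: str) -> float:
    # Stand-in for an LLM-based evaluation of how promising a thought is.
    return random.random()

def tree_of_thoughts(question: str, depth: int = 3, beam_width: int = 2) -> str:
    frontier = [question]
    for _ in range(depth):
        candidates = []
        for state in frontier:
            for thought in propose_thoughts(state):
                candidates.append((score_thought(thought), thought))
        # Keep only the most promising branches before expanding further.
        candidates.sort(key=lambda pair: pair[0], reverse=True)
        frontier = [thought for _, thought in candidates[:beam_width]]
    return frontier[0]  # the best reasoning path found

print(tree_of_thoughts("Which path leads to the nearest town?"))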

7) Automatic Prompt Engineer (APE)

Automatic Prompt Engineer (APE) is a framework designed to automate the process of crafting effective prompts for large language models (LLMs). Traditionally, creating prompts has been a manual and time-consuming task requiring human expertise. APE aims to streamline this process.

  1. Input and Goal: The user provides the APE system with two key elements: Input Data & Desired Output
  2. Candidate Prompts: APE utilizes an LLM itself to generate a pool of candidate prompts that might guide the target LLM towards the desired output.
  3. Evaluation and Selection: The candidate prompts are then fed back into the target LLM, and the quality of their outputs is assessed. This evaluation might involve metrics like accuracy, coherence, or relevance to the desired task.
  4. Optimal Prompt: Based on the evaluation results, APE selects the prompt that generates the best outcome from the target LLM.

Example:

Imagine you want to use an LLM to write an essay arguing for renewable energy sources. Here’s how APE could help:

Input and Goal:

  • Input Data: Text about various energy sources (fossil fuels, solar, wind, etc.)
  • Desired Output: Essay on renewable energy sources

Prompt: You are given the following information about fossil fuels, solar, wind etc. {information}. Write an essay on renewable energy sources using the provided information

Candidate Prompts: APE might generate prompts like:

  • “Write an essay in favor of renewable energy sources, highlighting their environmental benefits and economic feasibility.”
  • “Considering the long-term sustainability of energy production, argue for a transition towards renewable energy sources.”

Evaluation and Selection: You would then provide the candidate prompts to your target LLM and see which one produces the most convincing and well-structured essay on renewable energy.

Key Takeaways:

  • APE automates prompt creation, saving time and effort for users working with LLMs.
  • By testing multiple prompts, APE can identify the most effective one for a specific task.
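
A rough sketch of that loop is below, with every LLM call stubbed out and a trivially simple scoring function standing in for a real quality metric; the function names are illustrative only.

def prompt_generator_llm(task_description: str) -> list[str]:
    # Stand-in for asking an LLM: "Propose candidate instructions for this task."
    return [
        "Write an essay in favor of renewable energy sources, highlighting "
        "their environmental benefits and economic feasibility.",
        "Considering the long-term sustainability of energy production, "
        "argue for a transition towards renewable energy sources.",
        "Summarize the provided information about energy sources.",
    ]

def target_llm(prompt: str) -> str:
    # Stand-in for the model that actually writes the essay.
    return f"<essay generated for: {prompt[:60]}...>"

def score_output(output: str) -> float:
    # Toy metric: reward outputs that stay on the desired topic.
    return float("renewable" in output.lower())

task = "Write an essay on renewable energy sources using the provided information."
candidates = prompt_generator_llm(task)
best_prompt = max(candidates, key=lambda p: score_output(target_llm(p)))
print(best_prompt)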

8) Active-Prompt

Active prompting is a technique for getting the most out of LLMs by incorporating user feedback into the prompting process. It’s a dynamic approach that allows for iterative refinement of the LLM’s output.

Active Prompting borrows from Active Learning:

  • In active prompting, the “data points” are essentially the prompts provided to the LLM.
  • The goal is to identify the prompts that will lead to the most informative and improved LLM responses.
  • Similar to uncertainty metrics in active learning, active prompting utilizes uncertainty metrics specific to LLMs. These metrics aim to quantify how unsure the LLM is about its response to a particular prompt.

Example:

Let’s say you want the LLM to write a news article about a recent discovery in space exploration.

Prompt 1: “Write a news article about a recent discovery in space exploration.”

  • Initial Response: The LLM might generate an article about a new planet.
  • User Feedback: You realize the discovery was actually about a new exoplanet system. You can then provide feedback and a refined prompt:

Prompt 2: “Thanks, but the discovery involved a new exoplanet system, not a single planet. Can you rewrite the article focusing on that aspect?”

  • Iterative Improvement: Based on your feedback, the LLM would generate a revised article about the exoplanet system.

Key Takeaways:

  • User feedback helps ensure the LLM’s output aligns with reality and your specific needs.
  • Active prompting gives you more control over the direction and content of the LLM’s output.
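
A small sketch of that feedback loop, keeping a running message history so each refinement builds on the previous draft. The chat_llm() stub and the role/content message format are assumptions rather than any specific vendor's API.

def chat_llm(messages: list[dict]) -> str:
    # Hypothetical stub: counts the user turns so each call returns a new "draft".
    turn = sum(m["role"] == "user" for m in messages)
    return f"<news article draft #{turn}>"

messages = [{"role": "user",
             "content": "Write a news article about a recent discovery in space exploration."}]
draft = chat_llm(messages)

# The user reads the draft and supplies corrective feedback as the next turn.
messages += [
    {"role": "assistant", "content": draft},
    {"role": "user",
     "content": "Thanks, but the discovery involved a new exoplanet system, not a "
                "single planet. Can you rewrite the article focusing on that aspect?"},
]
revised = chat_llm(messages)
print(revised)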

9) Program-Aided Language Models

Program-Aided Language Models (PAL) combines the strengths of LLMs and traditional programming to tackle complex tasks requiring reasoning and logic.

  1. LLM’s Power in Understanding Language: PAL leverages the LLM’s ability to understand natural language instructions. You can provide the LLM with a problem description in plain English.
  2. Programming for Execution: Instead of directly generating the answer itself, the LLM decomposes the problem into a series of steps expressed as a program. This program is then passed to an external interpreter (like a Python interpreter) for execution.

Example:

Imagine you want to find the average of two numbers (5 and 3) using a PAL system.

Prompt: You provide the LLM with a natural language prompt like “Find the average of 5 and 3. Explain each step of your reasoning process in a way that can be understood by a Python program.”

LLM’s Response: The LLM might generate a program resembling:

def find_average(x, y):
    # Step 1: Add the two numbers
    total = x + y
    # Step 2: Divide the sum by 2
    average = total / 2
    # Step 3: Return the average
    return average

# Call the function with the numbers
result = find_average(5, 3)
# Print the result
print(result)

Execution and Answer: This program is then fed into a Python interpreter, which executes the steps and outputs the answer (average = 4).
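
A hedged sketch of that execution step: the program text returned by the LLM is run with Python's exec() in a fresh namespace and the result is read back. A real system would sandbox the generated code far more carefully; this only shows the flow.

# Program text as it might come back from the LLM.
generated_program = """
def find_average(x, y):
    total = x + y          # Step 1: add the two numbers
    average = total / 2    # Step 2: divide the sum by 2
    return average         # Step 3: return the average

result = find_average(5, 3)
"""

namespace = {}
exec(generated_program, namespace)   # the "external interpreter" step
print(namespace["result"])           # 4.0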

Key Takeaways:

  • PAL tackles tasks where LLMs might struggle with logical reasoning by relying on a separate program for execution.
  • The generated program helps visualize the reasoning process behind the answer.

10) ReAct Prompting

My favourite, and arguably the most powerful of them all.

Similar to how humans learn by interacting with the world and reasoning about their actions, ReAct Prompting leverages the interplay between “acting” and “reasoning” for LLMs.

ReAct Prompting performs two key actions:

  1. Reasoning: Similar to CoT, the LLM explains its thought process step-by-step.
  2. Actions: The LLM can take simulated actions (through agents) within the prompt, such as querying an external knowledge base (like Wikipedia) to gather additional information. This allows the LLM to dynamically update its reasoning based on the retrieved information.

Example:

Image source: the ReAct authors

Key Takeaways:

  • The interplay between acting and reasoning allows for a more comprehensive and adaptable reasoning process within the LLM.
  • By incorporating external information and dynamic reasoning adjustments, ReAct aims to generate more accurate and reliable outputs from LLMs.
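
To make the loop concrete, here is a minimal ReAct-style sketch. Both llm() and wikipedia_search() are hypothetical stubs, and the Action parsing is deliberately naive; a real agent framework would handle tool calls and parsing far more robustly.

def llm(transcript: str) -> str:
    # Hypothetical stub: first asks for a search, then answers once an
    # observation is present in the transcript.
    if "Observation:" not in transcript:
        return "Thought: I should look this up.\nAction: search[Olympus Mons height]"
    return "Thought: I now know the answer.\nFinal Answer: about 21.9 km"

def wikipedia_search(query: str) -> str:
    # Stand-in for a real knowledge-base lookup.
    return "Olympus Mons rises about 21.9 km above the Martian datum."

transcript = "Question: How tall is Olympus Mons?\n"
for _ in range(3):  # cap the number of reason/act steps
    step = llm(transcript)
    transcript += step + "\n"
    if "Final Answer:" in step:
        break
    if "Action: search[" in step:
        query = step.split("Action: search[", 1)[1].rstrip("]")
        transcript += f"Observation: {wikipedia_search(query)}\n"
print(transcript)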

There are many more prompting techniques available; for more details, please refer to this survey paper.

Inferencing

In machine learning, inference refers to using a trained model to make predictions or classifications on new, unseen data. During LLM inference, the LLM takes an input (like a question, prompt, or text snippet) and uses its knowledge and understanding of language to generate an output token by token, based on next-word prediction probabilities.

There are various parameters with which we can control the output generated by the LLM.

Top K:

  • Function: Limits the number of possible next tokens considered by the LLM at each step during generation.
  • Lower K: Focuses on the most probable tokens based on the LLM’s internal probability distribution. This leads to more predictable and safer outputs but can also be repetitive or lack creativity.
  • Higher K: Allows the LLM to explore a wider range of possibilities, leading to more diverse and potentially surprising outputs. However, it can also increase the chance of nonsensical or irrelevant generations.

Top P (Nucleus Sampling):

  • Function: Sets a threshold for the cumulative probability of the next tokens considered by the LLM. The LLM samples only from the smallest set of most probable tokens whose cumulative probability reaches this threshold.
  • Lower P: Similar to a lower Top K, focuses on the most likely tokens, resulting in safer and more predictable outputs but potentially lacking variety.
  • Higher P: Allows for exploration of a wider range of tokens while still prioritizing high probabilities. This can lead to more interesting and diverse outputs while maintaining some level of control.

Temperature:

  • Function: Controls the randomness of the sampling process during generation.
  • Lower Temperature: This makes the LLM’s selection process more deterministic, favouring the most probable token at each step. This leads to safer and more predictable outputs but can be repetitive and lack creativity.
  • Higher Temperature: Introduces more randomness into the selection process. The LLM is more likely to explore less probable tokens, leading to more diverse and potentially surprising outputs. However, it can also increase the risk of nonsensical or irrelevant generations.

Max Length:

  • Function: Sets a limit on the total number of tokens the LLM can generate in its output.
  • Shorter Length: Useful for tasks requiring concise outputs (e.g., summarizing a text snippet).
  • Longer Length: Allows for more elaborate and detailed outputs (e.g., writing a creative story).
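
These knobs map directly onto the sampling arguments of most generation APIs. Below is a sketch using Hugging Face transformers' generate() with gpt2 as an arbitrary example model; the specific values are illustrative, not recommendations.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Prompting is the bridge between us and powerful LLMs", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,      # enable sampling so the knobs below take effect
    top_k=50,            # consider only the 50 most probable next tokens
    top_p=0.9,           # then restrict to a 0.9 cumulative-probability nucleus
    temperature=0.8,     # below 1.0 sharpens the distribution, above 1.0 flattens it
    max_new_tokens=60,   # upper bound on the generated length
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))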

Thanks for spending your time on this blog. I am open to suggestions and improvements. Please let me know if I missed any details in this article.
