Supercharge RAG with Contextualized Late Interactions

Published in

The AI Forum

41 min read · Apr 28, 2024

Introduction

LLMs, though capable of generating grammatically correct and meaningful text, often produce factually incorrect responses with confidence. This has been a major problem since the inception of LLMs. Retrieval Augmented Generation (RAG) was introduced to address it.

In RAG, we take a list of documents (or chunks of documents) and encode them into numerical representations called vector embeddings, where a single vector embedding represents a single chunk. The embeddings are stored in a database called a vector store. The models that encode these chunks into embeddings are called embedding models or bi-encoders. These encoders are trained on a large corpus of data, which makes them powerful enough to represent a whole chunk as a single vector embedding.

When the user asks a question, the question is passed to the same embedding model to produce a single vector embedding. This embedding is compared against the vector embeddings of the document chunks using a similarity score, and the most relevant chunk, or a list of the most relevant chunks, is passed to the LLM along with the user query. The LLM then generates an answer grounded in this retrieved context. This helps keep the generated content factual and traceable to its source when necessary.
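The retrieval step described above can be sketched in a few lines of plain Python. This is only an illustration: the three-dimensional vectors are made up, and the helper names `cosine` and `top_k_chunks` are hypothetical. A real pipeline would obtain the vectors from an embedding model and delegate the search to a vector store.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_chunks(query_vec, chunk_vecs, k=2):
    """Rank chunks by cosine similarity to the query and keep the best k."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]

# Toy "embeddings" (assumed values): one vector per document chunk.
chunk_vecs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
# Toy query embedding; its length does not matter for cosine similarity.
query_vec = [0.0, 2.0, 0.0]

results = top_k_chunks(query_vec, chunk_vecs, k=2)
print(results)  # list of (chunk index, similarity), best match first
```

The chunks at the returned indices are what would be handed to the LLM together with the user query.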

Problem With Traditional Embedding Models

Embedding models compress text into fixed-length (vector) representations that capture the semantic content of the document. This compression is very useful for efficient search or retrieval, but puts a heavy burden on that single vector representation to capture all the semantic nuances or details present in the document. In some cases, irrelevant or redundant content can dilute the semantic usefulness of the embedding.

Thus the single vector embedding may not be sufficient to store the contextual information of a document chunk, thereby creating an information bottleneck.
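A toy example can make the dilution effect concrete. The sketch below assumes the single chunk vector is obtained by mean pooling the token vectors, which is one common choice but not the only one, and it uses made-up two-dimensional token vectors. The helper names are hypothetical.

```python
import math

def mean_pool(token_vecs):
    """Average a list of token vectors into one chunk vector."""
    dim = len(token_vecs[0])
    return [sum(v[i] for v in token_vecs) / len(token_vecs) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Assumed toy token vectors: one direction is "on topic", the other is not.
relevant = [1.0, 0.0]
off_topic = [0.0, 1.0]
query = [1.0, 0.0]

focused_chunk = [relevant, relevant]
diluted_chunk = [relevant, off_topic, off_topic, off_topic]

focused_sim = cosine(query, mean_pool(focused_chunk))
diluted_sim = cosine(query, mean_pool(diluted_chunk))
print(focused_sim, diluted_sim)
```

Both chunks contain the relevant token, yet the pooled vector of the second chunk ends up much less similar to the query because the off-topic tokens pull it away.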

This issue can be overcome if we represent the document chunk or query as a list of embedding vectors instead of a single embedding vector. This is where Contextualized Late Interaction over BERT (ColBERT) comes into the picture.

What is ColBERT?

ColBERT (Contextualized Late Interaction over BERT) is a bi-encoder that represents text as a multi-vector embedding. It takes in a query or a chunk of a document and creates vector embeddings at the token level. That is, each token gets its own vector embedding, and the query or document is encoded as a list of token-level vector embeddings. The token-level embeddings are generated from a pre-trained BERT model, hence the BERT in the name.

The idea is to address the single-vector bottleneck with higher-granularity embeddings:

(1) Produce a contextually influenced embedding for each token in the document and query.

(2) Score similarity between each query token and all document tokens.

(3) Take the maximum similarity across the document tokens for each query token.

(4) Do this for all query tokens.

(5) Take the sum of the max scores (in step 3) for all query tokens to get the similarity score.

This results in a much more granular token-wise similarity assessment between document and query, and has shown strong performance.
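The five steps above can be sketched as follows. This is a minimal illustration with made-up, unit-length two-dimensional token embeddings and a hypothetical helper `maxsim_score`. It is not the RAGatouille or ColBERT implementation, which runs the same idea on batched tensors.

```python
def dot(u, v):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best similarity with any document token."""
    total = 0.0
    for q in query_tokens:                       # step 4: repeat for every query token
        sims = [dot(q, d) for d in doc_tokens]   # step 2: compare with all document tokens
        total += max(sims)                       # steps 3 and 5: keep the max, add it up
    return total

# Step 1 is assumed to have happened already: these toy vectors stand in for
# the contextual token embeddings a ColBERT model would produce.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
doc_b = [[0.6, 0.8], [0.8, 0.6]]

score_a = maxsim_score(query_tokens, doc_a)
score_b = maxsim_score(query_tokens, doc_b)
print(score_a, score_b)  # doc_a matches both query tokens exactly, so it scores higher
```

Because the score is a sum over query tokens, it is not bounded by 1 the way a single cosine similarity is, and it grows with the number of query tokens.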

The document token embeddings are computed in advance, and only the MaxSim (maximum similarity) operation is performed at query time. Because the query and document are encoded independently and interact only at this final scoring step, the approach is called late interaction, hence the name Contextualized Late Interaction over BERT, or ColBERT. The per-token computations can be done in parallel, which keeps scoring efficient.

However, storing the list of token-level vector embeddings requires significant space. ColBERTv2 addresses this issue by compressing the embeddings using residual compression, thereby optimizing the space utilization.
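The general idea behind residual compression is to store, for each token vector, the id of its nearest centroid plus a coarsely quantized residual (the difference between the vector and that centroid). The sketch below is a deliberately simplified illustration of that idea and not ColBERTv2's actual scheme: the centroids are hard-coded rather than learned, the residual is rounded to a fixed grid whose `step` is an arbitrary assumption, and all function names are hypothetical.

```python
def nearest_centroid(vec, centroids):
    """Index of the centroid with the smallest squared distance to vec."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(vec, c))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))

def compress(vec, centroids, step=0.1):
    """Keep a centroid id plus the residual rounded to multiples of `step`."""
    cid = nearest_centroid(vec, centroids)
    codes = [round((a - c) / step) for a, c in zip(vec, centroids[cid])]
    return cid, codes

def decompress(cid, codes, centroids, step=0.1):
    """Approximate the original vector from the centroid and the coded residual."""
    return [c + code * step for c, code in zip(centroids[cid], codes)]

# Hypothetical centroids; in practice they would be learned, e.g. with k-means.
centroids = [[0.0, 0.0], [1.0, 1.0]]
vec = [0.93, 1.18]

cid, codes = compress(vec, centroids)
approx = decompress(cid, codes, centroids)
print(cid, codes, approx)
```

Only the centroid id and a few small integers are stored per token instead of full-precision floats, at the cost of a small reconstruction error bounded here by half the grid step per dimension.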

Code Implementation

Install required dependencies

linkify-it-py==2.0.3
llama-hub==0.0.79.post1
llama-index==0.10.30
llama-index-agent-openai==0.2.2
llama-index-cli==0.1.12
llama-index-core==0.10.30
llama-index-embeddings-huggingface==0.2.0
llama-index-embeddings-openai==0.1.8
llama-index-indices-managed-llama-cloud==0.1.5
llama-index-legacy==0.9.48
llama-index-llms-openai==0.1.16
llama-index-multi-modal-llms-openai==0.1.5
llama-index-program-openai==0.1.5
llama-index-question-gen-openai==0.1.3
llama-index-readers-file==0.1.19
llama-index-readers-llama-parse==0.1.4
llama-index-vector-stores-faiss==0.1.2
llama-parse==0.4.1
RAGatouille==0.0.8.post2

!pip install -q llama-hub
!pip install -q arxiv
!pip install -q semanticscholar
!pip install -q sentence-transformers==2.3.0
!pip install -q ragatouille
!pip install -q llama-index
!pip install -q llama-index-readers-file
!pip install -q llama-index-llms-openai
!pip install -q llama-index-core
!pip install llama-index-embeddings-huggingface
!pip install llama-index-vector-stores-faiss
!pip install -q langchain
!pip install -q langchain-core
!pip install -q langchain-community

Download the Dataset

!wget https://arxiv.org/pdf/2306.02707.pdf

Load the documents for further processing

from llama_index.readers.file import PDFReader
loader = PDFReader()
documents = loader.load_data("2306.02707.pdf")
#
print(len(documents))
#
list_documents = [ document.text for document in documents ]

Comparison Between Open Source Embeddings / OpenAI Embeddings / ColBERT using Ragatouille

Open Source Embeddings

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
# Setting the global embedding model
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)
#
from llama_index.core import (
    SimpleDirectoryReader,
    load_index_from_storage,
    VectorStoreIndex,
    StorageContext,
)
from llama_index.vector_stores.faiss import FaissVectorStore

Build Index

index = VectorStoreIndex.from_documents(documents)

Instantiate the Retriever to retrieve relevant documents

retriever = index.as_retriever(similarity_top_k=3)
#
similar_docs = retriever.retrieve("What is instruction tuning?")
#
from llama_index.core.response.notebook_utils import display_source_node
for node in similar_docs:
  display_source_node(node)
#
for i,node in enumerate(similar_docs):
  print(f"------------------ {i} ----------------------------------------------------")
  print(node.text)
  print("----------------------------------------------------------------------------")

########## RESPONSE #######################################
Node ID: 61299ed3-a6eb-4f9f-917a-9f731f154305
Similarity: 0.7422756912118058
Text: Model Tuning Method Data Size Teacher Alpaca Simple Instructions / Self-instruct 52K text-da-vinc...

Node ID: 074197f9-75d9-48e4-91a6-59665ae927d8
Similarity: 0.7226097983044915
Text: Contents 1 Introduction 4 1.1 Challenges with Existing Methods . . . . . . . . . . . . . . . . . ...

Node ID: 6469e949-d4d8-40f4-be8f-836abf0e44c5
Similarity: 0.703846265300339
Text: content generation and information-seeking queries over other types of tasks. Therefore, models t...


------------------ 0 ----------------------------------------------------
Model Tuning Method Data Size Teacher
Alpaca Simple Instructions / Self-instruct 52K text-da-vinci-003
Vicuna User Instructions / Natural 70K ChatGPT
Dolly User Instructions / Natural 15K Human
WizardLM Complex Instructions / Evol-instruct 250K ChatGPT
Orca Complex Instructions / Explanations 5M ChatGPT (5M)
∩GPT-4 (1M)
Table 1: Overview of popular models instruction tuned with OpenAI large foundation models
(LFMs). Orca leverages complex instructions and explanations for progressive learning.
User Instruction: Use the given data to calculate the median. 
Input:[7, 3, 8, 2, 10]
User Instruction: Answer this question.
Input:Which small lake lies between Windermere and Grasmere?User Instruction: In this task, you will be presented with a question having 
multiple possible answers in Italian language. And you should choose a most 
suitable option out of "A", "B", "C", "D", and "E" based on your commonsense 
knowledge. 
Input:Solve this question: Dove non riusciresti a vedere la luce? 
Options: A scrivaniaB frigoriferoC sole D universoE atticoOutput: First, we need to arrange the data in ascending order: [2, 3, 7, 8, 10]. 
Since there are 5 numbers, the median is the middle number, which is 7.
Output: B frigorifero
Output: Rydal Water lies between Windermere and Grasmere.
Figure 4: Instruction-tuning with GPT-49. Given user instructions for a task and an input,
the system generates a response. Existing works like Alpaca [ 7], Vicuna [ 9] and variants
follow a similar template to train small models with ⟨{user instruction, input}, output ⟩.
2 Preliminaries
2.1 Instruction Tuning
Instruction tuning [ 22] is a technique that allows pre-trained language models to learn
from input (natural language descriptions of the task) and response pairs, for example,
{"instruction": "Arrange the words in the given sentence to form a grammatically
correct sentence.", "input": "the quickly brown fox jumped", "output": "the brown
fox jumped quickly"} . Instruction tuning has been applied to both language-only and
multimodal tasks. For language-only tasks, instruction tuning has been shown to improve
the zero-shot and few-shot performance of models such as FLAN [ 22] and InstructGPT [ 5]
on various benchmarks. For multimodal tasks, instruction tuning has been used to generate
synthetic instruction-following data for language-image tasks, such as image captioning [ 23]
and visual question answering [24].
A wide range of works in recent times, including Alpaca [ 7], Vicuna [ 9], WizardLM [ 8] and
Koala [14], have adopted instruction-tuning to train smaller language models with outputs
generated from large foundation models from the GPT family. As outlined in Section 1.1,
a significant drawback with all these works has been both limited task diversity, query
complexity and small-scale training data in addition to limited evaluation overstating the
benefits of such approach.
2.2 Role of System Instructions
Vanilla instruction-tuning (refer to Figure 4 for examples) often uses input, response pairs
with short and terse responses. Such responses when used to train smaller models, as in
existing works, give them limited ability to trace the reasoning process of the LFM. In
constrast, system instructions10in recent LFMs like GPT-4 can be used to provide guidance
9GPT-4 inference hyper-parameters in Azure OpenAI interface set as: temperature=0.7,
top_p=0.95, frequency_penalty=0, presence_penalty=0, stop=None.
10System instructions are part of the Chat Completion API, which is a new dedicated API for
interacting with the ChatGPT and GPT-4 models.
7
----------------------------------------------------------------------------
------------------ 1 ----------------------------------------------------
Contents
1 Introduction 4
1.1 Challenges with Existing Methods . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Key Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Preliminaries 7
2.1 Instruction Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Role of System Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3 Explanation Tuning 8
3.1 Dataset Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1.1 System Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.2 Dataset Description and Sampling from the FLAN-v2 Collection . . . 9
3.1.3 ChatGPT as Teaching Assistant . . . . . . . . . . . . . . . . . . . . . 12
3.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4 Experiment Setup 14
4.1 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.2 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.2.1 Open-ended Generation Capabilities . . . . . . . . . . . . . . . . . . . 15
4.2.2 Reasoning Capabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5 Evaluation for Open-ended Generation 17
6 Evaluation for Reasoning 17
6.1 AGIEval Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
6.2 Big-Bench Hard Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
7 Evaluation for Safety 23
7.1 Truthful Question Answering . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
7.2 Toxic Content Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
7.3 Note on Hallucination and Tool Augmented LFMs . . . . . . . . . . . . . . . 27
8 Limitations 28
9 Conclusions 29
10 Author Contributions 29
11 Case Studies 30
11.1 Trigonometric Problem Solving . . . . . . . . . . . . . . . . . . . . . . . . . . 30
11.2 Temporal Reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
11.3 Multiple-choice Question-Answering . . . . . . . . . . . . . . . . . . . . . . . 33
2
----------------------------------------------------------------------------
------------------ 2 ----------------------------------------------------
content generation and information-seeking queries over other types of tasks. Therefore,
models trained on such natural conversations may capture the style but not the reasoning
process of the LFMs – demonstrated in the performance of Vicuna in Figures 2 and 3.
Additionally, such mode of data collection is also limited in scale. Table 1 shows an overview
of the size of data and tuning methods employed in recent popular instruction tuning works.
Limited imitation signals. Existing methods rely on immitation learning from
⟨query, response⟩pairs generated by the teacher model. However, this provides limited
signals to trace the reasoning process of the teacher. Prior works [ 15,16] on open-box model
show that richer signals such as logits, intermediate representations and attention states can
significantly improve distillation performance. While they are not accessible for closed-box
LFM’s7, recent work [ 17] demonstrates that richer signals like LFM rationales can help close
the gap for task-specific distillation.
Evaluation: Previous studies on instruction tuning of small models with LFMs are severely
limited in their evaluation protocol. They often rely on GPT-4 for auto-evaluation by asking
it to compare the outputs of two systems with a prompt like “given responses from system
1 (reference) and system 2 (target), which one is better?”. However, this approach has
several drawbacks, such as the small size of test sets (e.g., 80instructions in Vicuna and 218
instructions in WizardLM) and the biases of GPT-4 as the judge [ 18]. For example, we notice
that models that are instruction-tuned with GPT-4 responses tend to generate longer texts
that GPT-4 prefers over shorter ones; as well as GPT-4 has a bias in the order of the candidate
responses. We will show that such auto-evaluation measures overestimate the abilities of
smaller models compared to LFMs, as the former are much weaker in comprehension and
reasoning skills.
1.2 Key Contributions
In this research, our focus is on addressing the challenges mentioned above, specifically with:
Explanation tuning: We augment⟨query, response⟩pairs with detailed responses from
GPT-4 that explain the reasoning process of the teacher as it generates the response. These
provide the student with additional signals for learning. We leverage system instructions (e.g..,
explain like I’m five, think step-by-step and justify your response , etc.) to
elicit such explanations. This is in contrast to vanilla instruction tuning, which only uses the
prompt and the LFM response for learning, providing little opportunity for mimicking the
LFM’s “thought” process.
Scaling tasks and instructions: We utilize the Flan 2022 Collection [ 19] as it provides
an extensive public assortment of tasks and instructions. Particularly, we use FLAN-
v2, supplemented with high-quality templates, advanced formatting patterns, and data
augmentations. Even though FLAN holds tens of millions of instructions, we selectively
sample from the task collection to form a diverse mixture of tasks, which we then further
sub-sample to generate complex prompts. These prompts are used to query LFMs like
ChatGPT and GPT-4, thus creating a rich and diverse training set. We collect 5million
ChatGPT responses, from which 1million is further sampled to acquire GPT-4 responses.
We demonstrate how ChatGPT as a teacher assistant helps in progressive learning.
Evaluation: We assess the generative, reasoning, and comprehension abilities of Orca, under
a range of settings: (i) AutoEvaluation with GPT-4 on existing evaluation sets from Vicuna,
WizardLM and the awesome prompts collection8; (ii) Academic benchmarks like Big-Bench
Hard [4] and TruthfulQA [ 20]; (iii) Professional and Academic exams like SAT, LSAT, GRE,
GMAT from AGIEval [ 1]; (iv) Safety evaluation with ToxiGen [ 21] to test toxic language
generation and hate speech detection across different minority groups. Finally, we provide
case-studies to compare the generation and reasoning abilities of Orca against OpenAI LFMs
like ChatGPT and GPT-4, and instruction-tuned smaller model like Vicuna.
7Note that OpenAI API’s do give access to the top-5logits for each token.
8https://prompts.chat/
6
----------------------------------------------------------------------------

OpenAI Embeddings

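# get API key and create embeddings
import os
import openai
from google.colab import userdata

# Expose the key to the openai client and through the environment variable
# that OpenAIEmbedding reads by default
openai.api_key = userdata.get("OPENAI_API_KEY")
os.environ["OPENAI_API_KEY"] = openai.api_key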
from llama_index.embeddings.openai import OpenAIEmbedding

from llama_index.core import Settings

# global default
Settings.embed_model = OpenAIEmbedding()
#
#build index
index = VectorStoreIndex.from_documents(documents)
#
retriever_openai = index.as_retriever(similarity_top_k=3)
#
similar_docs = retriever_openai.retrieve("What is instruction tuning?")
#
from llama_index.core.response.notebook_utils import display_source_node
for node in similar_docs:
  display_source_node(node)
#
for i,node in enumerate(similar_docs):
  print(f"------------------ {i} ----------------------------------------------------")
  print(node.text)
  print("----------------------------------------------------------------------------")

###########################RESPONSE###################################
Node ID: 2ab26823-162c-4c25-a57f-480b05f87a04
Similarity: 0.8330309385389002
Text: Model Tuning Method Data Size Teacher Alpaca Simple Instructions / Self-instruct 52K text-da-vinc...

Node ID: fced3be6-00b7-4072-9dca-3faef2a3fe19
Similarity: 0.8035424483865514
Text: System instructions are sampled from a diverse instruction set including chain-of-thought reasoni...

Node ID: 857f4aec-960f-4b34-ae86-74b218a94283
Similarity: 0.7809231596633633
Text: System Instruction: You are an AI assistant. User will you give you a task. Your goal is to comp...


------------------ 0 ----------------------------------------------------
Model Tuning Method Data Size Teacher
Alpaca Simple Instructions / Self-instruct 52K text-da-vinci-003
Vicuna User Instructions / Natural 70K ChatGPT
Dolly User Instructions / Natural 15K Human
WizardLM Complex Instructions / Evol-instruct 250K ChatGPT
Orca Complex Instructions / Explanations 5M ChatGPT (5M)
∩GPT-4 (1M)
Table 1: Overview of popular models instruction tuned with OpenAI large foundation models
(LFMs). Orca leverages complex instructions and explanations for progressive learning.
User Instruction: Use the given data to calculate the median. 
Input:[7, 3, 8, 2, 10]
User Instruction: Answer this question.
Input:Which small lake lies between Windermere and Grasmere?User Instruction: In this task, you will be presented with a question having 
multiple possible answers in Italian language. And you should choose a most 
suitable option out of "A", "B", "C", "D", and "E" based on your commonsense 
knowledge. 
Input:Solve this question: Dove non riusciresti a vedere la luce? 
Options: A scrivaniaB frigoriferoC sole D universoE atticoOutput: First, we need to arrange the data in ascending order: [2, 3, 7, 8, 10]. 
Since there are 5 numbers, the median is the middle number, which is 7.
Output: B frigorifero
Output: Rydal Water lies between Windermere and Grasmere.
Figure 4: Instruction-tuning with GPT-49. Given user instructions for a task and an input,
the system generates a response. Existing works like Alpaca [ 7], Vicuna [ 9] and variants
follow a similar template to train small models with ⟨{user instruction, input}, output ⟩.
2 Preliminaries
2.1 Instruction Tuning
Instruction tuning [ 22] is a technique that allows pre-trained language models to learn
from input (natural language descriptions of the task) and response pairs, for example,
{"instruction": "Arrange the words in the given sentence to form a grammatically
correct sentence.", "input": "the quickly brown fox jumped", "output": "the brown
fox jumped quickly"} . Instruction tuning has been applied to both language-only and
multimodal tasks. For language-only tasks, instruction tuning has been shown to improve
the zero-shot and few-shot performance of models such as FLAN [ 22] and InstructGPT [ 5]
on various benchmarks. For multimodal tasks, instruction tuning has been used to generate
synthetic instruction-following data for language-image tasks, such as image captioning [ 23]
and visual question answering [24].
A wide range of works in recent times, including Alpaca [ 7], Vicuna [ 9], WizardLM [ 8] and
Koala [14], have adopted instruction-tuning to train smaller language models with outputs
generated from large foundation models from the GPT family. As outlined in Section 1.1,
a significant drawback with all these works has been both limited task diversity, query
complexity and small-scale training data in addition to limited evaluation overstating the
benefits of such approach.
2.2 Role of System Instructions
Vanilla instruction-tuning (refer to Figure 4 for examples) often uses input, response pairs
with short and terse responses. Such responses when used to train smaller models, as in
existing works, give them limited ability to trace the reasoning process of the LFM. In
constrast, system instructions10in recent LFMs like GPT-4 can be used to provide guidance
9GPT-4 inference hyper-parameters in Azure OpenAI interface set as: temperature=0.7,
top_p=0.95, frequency_penalty=0, presence_penalty=0, stop=None.
10System instructions are part of the Chat Completion API, which is a new dedicated API for
interacting with the ChatGPT and GPT-4 models.
7
----------------------------------------------------------------------------
------------------ 1 ----------------------------------------------------
System instructions are sampled from a diverse instruction set including chain-of-thought
reasoning steps, explain like I’m five, being helpful and informative, etc. Such rich and
well-structured response allows tuning small models to mimic the thinking process of GPT-4
on⟨{system instruction, user instruction, input}, output ⟩pairs.
to the model on how to behave and respond. They are written in natural language and
separated from the user messages by using the role of “system” in the JSON request. System
instructions can specify the tone, task, format, and limitations of the model’s responses.
System instructions are also a way of improving the safety of model responses. For example,
a set of system instructions designed for safety harness could be:
•The assistant must not generate harmful or offensive content.
•The assistant must respect the privacy and consent of the user.
•The assistant must acknowledge its limitations and uncertainties.
3 Explanation Tuning
To address the shortcomings of existing works, we tap into large-scale training data with
diverse tasks augmented with complex instructions and rich signals. Specifically, our data
contains human and augmented system instructions for a large collection of tasks sampled
from FLAN-v2 (aka Flan 2022) [ 19]. Given the large size of the FLAN-v2 collection and
varying number of examples for constituent datasets and tasks, we sample from a mixture of
tasks from different categories (described in the next section) to create our training data.
3.1 Dataset Construction
Each instance in our training data consists of the following triple: ⟨System message, User
query, LFM response ⟩. Thesystem message , placed at the start of the prompt, provides
the LFM with essential context, guidelines, and other pertinent details. We leverage the
system message to vary the length of the response; outline the assistant’s character; establish
acceptable and non-acceptable LFM behavior; and determine the structure of the agent’s
response. The user query defines the actual task we want the LFM to perform. To obtain
a large and diverse set of user queries we utilize the FLAN-v2 collection [ 19]. We sample 5
million user queries from FLAN-v2 for which we collect ChatGPT responses. We further
sample 1million instructions from the 5million set for which we collect GPT-4 responses.
All the queries to the agents are augmented with system instructions, as outlined below.
8
----------------------------------------------------------------------------
------------------ 2 ----------------------------------------------------
System Instruction: You are an AI assistant. User will you give you a task. Your 
goal is to complete the task as faithfully as you can. While performing the task 
think step-by-step and justify your steps.
User Instruction: Use the given data to calculate the median.
Input:[7, 3, 8, 2, 10]
System Instruction: You are an AI assistant. User will you give you a task. Your 
goal is to complete the task as faithfully as you can. While performing the task 
think step-by-step and justify your steps.
User Instruction: Answer this question.
Input:Which small lake lies between Windermere and Grasmere?System Instruction: You are an AI assistant. Provide a detailed answer so user 
don't need to search outside to understand the answer.
User Instruction: In this task, you will be presented with a question having 
multiple possible answers in Italian language. And you should choose a most 
suitable option out of "A", "B", "C", "D", and "E" based on your commonsense 
knowledge. 
Input:Solve this question: Dove non riuscirestia vedere la luce? 
Options: A scrivaniaB frigoriferoC sole D universoE atticoOutput: To calculate the median, I will follow these steps:
Step 1: Arrange the data in ascending order [2, 3, 7, 8, 10]
Step 2: Determine if the data set has an odd or even number of values
The data set has 5 values, which is an odd number.
Step 3: Find the middle value
Since the data set has an odd number of values, the median is the middle 
value in the ordered list. The middle value in the list [2, 3, 7, 8, 10] is 7.
Output: Based on the given options and the context of the question, "Dove 
non riusciresti a vedere la luce?" (Where would you not be able to see 
light?), the most suitable answer would be:B frigorifero(refrigerator). 
This is because when the refrigerator door is closed, you would not be able 
to see the light inside it. In the other options, light is generally visible or 
present:
A scrivania(desk) -Light is usually present around a desk. 
C sole (sun) -The sun is the primary source of light in our solar system.
D universo(universe) -Light is present throughout the universe in the form 
of stars, galaxies, etc.
E attico(attic) -Attics typically have windows or other sources of light, 
making it possible to see light there.
Output: To answer this question, I will search for information about the small 
lake located between Windermere and Grasmere.
Step 1: Identify the location of Windermere and Grasmere. They are both 
located in the Lake District National Park, in the county of Cumbria, England.
Step 2: Research the lakes in the Lake District. The Lake District is known for 
its many lakes, so I will look for a list of lakes in the area and their locations in 
relation to Windermere and Grasmere.
Step 3: Find the small lake between Windermere and Grasmere. 
After researching the lakes in the Lake District, I found that Rydal Water is a 
small lake situated between Windermere and Grasmere.Figure 5: Explanation-tuning with GPT-4. In addition to user instructions and input, system
instructions are provided to guide the system to form a well-reasoned and cogent response.
System instructions are sampled from a diverse instruction set including chain-of-thought
reasoning steps, explain like I’m five, being helpful and informative, etc. Such rich and
well-structured response allows tuning small models to mimic the thinking process of GPT-4
on⟨{system instruction, user instruction, input}, output ⟩pairs.
to the model on how to behave and respond. They are written in natural language and
separated from the user messages by using the role of “system” in the JSON request. System
instructions can specify the tone, task, format, and limitations of the model’s responses.
System instructions are also a way of improving the safety of model responses. For example,
a set of system instructions designed for safety harness could be:
•The assistant must not generate harmful or offensive content.
•The assistant must respect the privacy and consent of the user.
•The assistant must acknowledge its limitations and uncertainties.
----------------------------------------------------------------------------

ColBERT - Implementation using the RAGatouille package

from llama_index.core.llama_pack import download_llama_pack
from llama_index.core.response.notebook_utils import display_source_node
RAGatouille_retriever = download_llama_pack("RAGatouilleRetrieverPack","./ragatouille_pack")
#
# Instantiate the RAGatouille pack
ragatouille_pack = RAGatouille_retriever(documents,
                                         index_name="orca",
                                         top_k=3,)
#

# Instantiate the Retriever
retriever = ragatouille_pack.get_modules()["retriever"]
# Retrieve Documents matching the query
nodes = retriever.retrieve("What is instruction tuning?")

for node in nodes:
    display_source_node(node)

for i,node in enumerate(nodes):
  print(f"------------------ {i} ----------------------------------------------------")
  print(node.text)
  print("----------------------------------------------------------------------------")

##############################RESPONSE###################################
Node ID: 6f94ba79-81cd-47a4-b777-ab01921fa54f
Similarity: 26.51420021057129
Text: Given user instructions for a task and an input, the system generates a response. Existing works ...

Node ID: bb290fd0-7937-46c2-bd9a-fed687471792
Similarity: 26.102283477783203
Text: For multimodal tasks, instruction tuning has been used to generate synthetic instruction-followin...

Node ID: 9e168d49-957f-42b4-853c-72b1ed984e25
Similarity: 21.646852493286133
Text: For example, we notice that models that are instruction-tuned with GPT-4 responses tend to genera...

------------------ 0 ----------------------------------------------------
Given user instructions for a task and an input,
the system generates a response. Existing works like Alpaca [ 7], Vicuna [ 9] and variants
follow a similar template to train small models with ⟨{user instruction, input}, output ⟩.
2 Preliminaries
2.1 Instruction Tuning
Instruction tuning [ 22] is a technique that allows pre-trained language models to learn
from input (natural language descriptions of the task) and response pairs, for example,
{"instruction": "Arrange the words in the given sentence to form a grammatically
correct sentence.", "input": "the quickly brown fox jumped", "output": "the brown
fox jumped quickly"} . Instruction tuning has been applied to both language-only and
multimodal tasks. For language-only tasks, instruction tuning has been shown to improve
the zero-shot and few-shot performance of models such as FLAN [ 22] and InstructGPT [ 5]
on various benchmarks. For multimodal tasks, instruction tuning has been used to generate
synthetic instruction-following data for language-image tasks, such as image captioning [ 23]
and visual question answering [24].
----------------------------------------------------------------------------
------------------ 1 ----------------------------------------------------
For multimodal tasks, instruction tuning has been used to generate
synthetic instruction-following data for language-image tasks, such as image captioning [ 23]
and visual question answering [24].
A wide range of works in recent times, including Alpaca [ 7], Vicuna [ 9], WizardLM [ 8] and
Koala [14], have adopted instruction-tuning to train smaller language models with outputs
generated from large foundation models from the GPT family. As outlined in Section 1.1,
a significant drawback with all these works has been both limited task diversity, query
complexity and small-scale training data in addition to limited evaluation overstating the
benefits of such approach.
2.2 Role of System Instructions
Vanilla instruction-tuning (refer to Figure 4 for examples) often uses input, response pairs
with short and terse responses. Such responses when used to train smaller models, as in
existing works, give them limited ability to trace the reasoning process of the LFM.
----------------------------------------------------------------------------
------------------ 2 ----------------------------------------------------
For example, we notice
that models that are instruction-tuned with GPT-4 responses tend to generate longer texts
that GPT-4 prefers over shorter ones; as well as GPT-4 has a bias in the order of the candidate
responses. We will show that such auto-evaluation measures overestimate the abilities of
smaller models compared to LFMs, as the former are much weaker in comprehension and
reasoning skills.
1.2 Key Contributions
In this research, our focus is on addressing the challenges mentioned above, specifically with:
Explanation tuning: We augment⟨query, response⟩pairs with detailed responses from
GPT-4 that explain the reasoning process of the teacher as it generates the response. These
provide the student with additional signals for learning. We leverage system instructions (e.g..,
explain like I’m five, think step-by-step and justify your response , etc.) to
elicit such explanations. This is in contrast to vanilla instruction tuning, which only uses the
prompt and the LFM response for learning, providing little opportunity for mimicking the
LFM’s “thought” process.
----------------------------------------------------------------------------

Note: As the output above shows, the context retrieved by ColBERT is more closely aligned with the meaning of the question. The top-ranked chunk contains the paper's own definition of instruction tuning (Section 2.1).
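The similarity values above (for example 26.51) are far larger than a single cosine similarity can be. That follows from how late interaction scores a passage: each query token is matched to its most similar passage token, and those per-token maxima are summed. The block below is a minimal sketch of that MaxSim idea in plain Python. The function names and the 2-dimensional toy vectors are made up for illustration and are not part of RAGatouille or ColBERT's API.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query token, keep its best-matching
    document token, then sum those maxima over all query tokens."""
    q_unit = [normalize(v) for v in query_vecs]
    d_unit = [normalize(v) for v in doc_vecs]
    total = 0.0
    for q in q_unit:
        total += max(sum(a * b for a, b in zip(q, d)) for d in d_unit)
    return total


# Toy 2-dimensional "token embeddings" (made up for illustration).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # has a close match for each query token
doc_b = [[1.0, 1.0]]                          # one token, halfway between both

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Here doc_a scores 2.0 (one perfect match per query token) and doc_b scores about 1.41, so doc_a ranks first. Because the score is a sum over query tokens, it grows with the number of query tokens, which is consistent with the double-digit scores in the output above.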

Implementing ColBERT using LangChain

#import required libraries
from ragatouille import RAGPretrainedModel
from langchain_community.document_loaders import PyPDFLoader
#
#Instantiate the Ragatouille embedding model
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
#load the documents
loaders = PyPDFLoader("/content/2306.02707.pdf")
pages = loaders.load()
print(len(pages))
#
#extract the text from the documents loaded
full_document = ""
for page in pages:
  full_document += page.page_content
#
#build the index and load the documents - the default is the PLAID vector store
#to use FAISS instead, pass use_faiss=True
RAG.index(collection=[full_document],
          index_name="orca_paper",
          max_document_length=512,
          split_documents=True,
          use_faiss=True
          )
#
##########################RESPONSE##############################
---- WARNING! You are using PLAID with an experimental replacement for FAISS for greater compatibility ----
This is a behaviour change from RAGatouille 0.8.0 onwards.
This works fine for most users and smallish datasets, but can be considerably slower than FAISS and could cause worse results in some situations.
If you're confident with FAISS working on your machine, pass use_faiss=True to revert to the FAISS-using behaviour.
--------------------


[Apr 28, 14:49:23] #> Creating directory .ragatouille/colbert/indexes/orca_paper 


[Apr 28, 14:49:26] [0]    #> Encoding 87 passages..
/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py:126: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
  warnings.warn(
  0%|          | 0/3 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
 33%|███▎      | 1/3 [01:00<02:01, 60.61s/it]/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
100%|██████████| 3/3 [02:46<00:00, 55.55s/it][Apr 28, 14:52:12] [0]    avg_doclen_est = 357.8735656738281   len(local_sample) = 87
[Apr 28, 14:52:12] [0]    Creating 2,048 partitions.
[Apr 28, 14:52:12] [0]    *Estimated* 31,135 embeddings.
[Apr 28, 14:52:12] [0]    #> Saving the indexing plan to .ragatouille/colbert/indexes/orca_paper/plan.json ..

used 18 iterations (26.6772s) to cluster 29579 items into 2048 clusters
[0.032, 0.035, 0.029, 0.029, 0.029, 0.032, 0.032, 0.029, 0.03, 0.032, 0.03, 0.031, 0.031, 0.032, 0.031, 0.032, 0.029, 0.032, 0.03, 0.032, 0.03, 0.032, 0.031, 0.031, 0.029, 0.03, 0.032, 0.033, 0.031, 0.034, 0.029, 0.033, 0.032, 0.03, 0.031, 0.029, 0.035, 0.032, 0.031, 0.036, 0.033, 0.032, 0.03, 0.031, 0.03, 0.029, 0.031, 0.035, 0.032, 0.032, 0.03, 0.031, 0.032, 0.03, 0.031, 0.032, 0.033, 0.031, 0.035, 0.031, 0.032, 0.033, 0.032, 0.032, 0.033, 0.032, 0.029, 0.031, 0.028, 0.03, 0.033, 0.03, 0.031, 0.032, 0.032, 0.032, 0.033, 0.034, 0.032, 0.035, 0.034, 0.033, 0.03, 0.034, 0.029, 0.03, 0.032, 0.033, 0.029, 0.036, 0.03, 0.033, 0.031, 0.03, 0.031, 0.031, 0.032, 0.03, 0.03, 0.03, 0.031, 0.034, 0.03, 0.031, 0.031, 0.027, 0.031, 0.028, 0.03, 0.029, 0.032, 0.033, 0.033, 0.029, 0.032, 0.031, 0.032, 0.031, 0.03, 0.032, 0.03, 0.03, 0.032, 0.033, 0.031, 0.033, 0.03, 0.03]
0it [00:00, ?it/s][Apr 28, 14:52:39] [0]    #> Encoding 87 passages..

  0%|          | 0/3 [00:00<?, ?it/s]
 33%|███▎      | 1/3 [00:57<01:54, 57.10s/it]
 67%|██████▋   | 2/3 [01:52<00:56, 56.10s/it]
100%|██████████| 3/3 [02:33<00:00, 51.11s/it]
1it [02:34, 154.76s/it]
100%|██████████| 1/1 [00:00<00:00, 752.34it/s][Apr 28, 14:55:14] #> Optimizing IVF to store map from centroids to list of pids..
[Apr 28, 14:55:14] #> Building the emb2pid mapping..
[Apr 28, 14:55:14] len(emb2pid) = 31135

100%|██████████| 2048/2048 [00:00<00:00, 60647.53it/s][Apr 28, 14:55:14] #> Saved optimized IVF to .ragatouille/colbert/indexes/orca_paper/ivf.pid.pt
Done indexing!

.ragatouille/colbert/indexes/orca_paper

Retrieve Documents Matching the Question Asked

results = RAG.search(query="What is instruction fine tuning?")
results

#########################RESPONSE######################
Loading searcher for index orca_paper for the first time... This may take a few seconds
[Apr 28, 14:56:05] #> Loading codec...
[Apr 28, 14:56:05] #> Loading IVF...
[Apr 28, 14:56:05] Loading segmented_lookup_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py:126: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
  warnings.warn(
[Apr 28, 14:56:32] #> Loading doclens...
100%|██████████| 1/1 [00:00<00:00, 2671.53it/s][Apr 28, 14:56:32] #> Loading codes and residuals...

100%|██████████| 1/1 [00:00<00:00, 125.81it/s][Apr 28, 14:56:32] Loading filter_pids_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...

[Apr 28, 14:56:57] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
Searcher loaded!

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . What is instruction fine tuning?,    True,    None
#> Output IDs: torch.Size([32]), tensor([  101,     1,  2054,  2003,  7899,  2986, 17372,  1029,   102,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])

/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
[{'content': 'And you should choose a most \nsuitable option out of "A", "B", "C", "D", and "E" based on your commonsense \nknowledge. \nInput:Solve this question: Dove non riusciresti a vedere la luce? \nOptions: A scrivaniaB frigoriferoC sole D universoE atticoOutput: First, we need to arrange the data in ascending order: [2, 3, 7, 8, 10]. \nSince there are 5 numbers, the median is the middle number, which is 7.\nOutput: B frigorifero\nOutput: Rydal Water lies between Windermere and Grasmere.\nFigure 4: Instruction-tuning with GPT-49. Given user instructions for a task and an input,\nthe system generates a response. Existing works like Alpaca [ 7], Vicuna [ 9] and variants\nfollow a similar template to train small models with ⟨{user instruction, input}, output ⟩.\n2 Preliminaries\n2.1 Instruction Tuning\nInstruction tuning [ 22] is a technique that allows pre-trained language models to learn\nfrom input (natural language descriptions of the task) and response pairs, for example,\n{"instruction": "Arrange the words in the given sentence to form a grammatically\ncorrect sentence.", "input": "the quickly brown fox jumped", "output": "the brown\nfox jumped quickly"} . Instruction tuning has been applied to both language-only and\nmultimodal tasks. For language-only tasks, instruction tuning has been shown to improve\nthe zero-shot and few-shot performance of models such as FLAN [ 22] and InstructGPT [ 5]\non various benchmarks. For multimodal tasks, instruction tuning has been used to generate\nsynthetic instruction-following data for language-image tasks, such as image captioning [ 23]\nand visual question answering [24].\nA wide range of works in recent times, including Alpaca [ 7], Vicuna [ 9], WizardLM [ 8] and\nKoala [14], have adopted instruction-tuning to train smaller language models with outputs\ngenerated from large foundation models from the GPT family.',
  'score': 21.259580612182617,
  'rank': 1,
  'document_id': 'd6c007be-a8dc-4c41-a038-3fd333ca334c',
  'passage_id': 12},
 {'content': '[21]Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece\nKamar. ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate\nspeech detection. In Proceedings of the 60th Annual Meeting of the Association for Computa-\ntional Linguistics (Volume 1: Long Papers) , pages 3309–3326. Association for Computational\nLinguistics, 2022.\n[22]Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan\nDu, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners, 2022.\n[23]DeyaoZhu, JunChen, XiaoqianShen, XiangLi, andMohamedElhoseiny. Minigpt-4: Enhancing\nvision-language understanding with advanced large language models, 2023.\n[24]Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023.\n[25] Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei,\nAtharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, et al.\nSuper-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. In\nProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing ,\npages 5085–5109, 2022.\n[26]Mario Michael Krell, Matej Kosec, Sergio P. Perez, and Andrew Fitzgibbon. Efficient sequence\npacking without cross-contamination: Accelerating large language models without impacting\nperformance, 2022.\n[27] Awesome chatgpt prompts, 2023. URL https://github.com/f/awesome-chatgpt-prompts .\n[28]Weijia Xu, Andrzej Banburski-Fahey, and Nebojsa Jojic. Reprompting: Automated chain-of-\nthought prompt inference through gibbs sampling, 2023.',
  'score': 18.882606506347656,
  'rank': 2,
  'document_id': 'd6c007be-a8dc-4c41-a038-3fd333ca334c',
  'passage_id': 84},
 {'content': 'This alignment is accomplished by fine-tuning the models via supervised learning on\ndemonstrations of prompts and desired model behavior, and through reinforcement learning\nfrom human preferences [5].\nAs these models continue to evolve and become more powerful, an intriguing question arises:\nCan we use the model itself to supervise its own behavior or that of other AI models? Bai\net al.[6]have shown that by sampling output from an initial model, generating revisions,\nand then fine-tuning the original model based on these revised responses, model behavior\ncan be controlled more effectively and can be made more harmless, with significantly fewer\nhuman labels.\nRecently, there has been an influx of studies using LFMs like ChatGPT and GPT-4 as\nteachers to generate large datasets, for instruction tuning , and to train smaller models,\nsuch as Alpaca [ 7], WizardLM [ 8] and Vicuna [ 9]. While these models can produce content\nthat matches the style of their teachers, they often fall short in terms of the reasoning and\ncomprehension skills displayed by the larger foundation models.\n423.348.9 49.7\n0102030405060\nVicuna-13B ChatGPT Orca-13BAggregate Accuracy (%)BigBench -Hard (Zero -shot, MCQ)Figure 3: For complex zero-shot reasoning tasks in BigBench-Hard, Orca achieves parity\nwith ChatGPT (without any exemplar or CoT) with task performances shown in Figure 12.\nTake, for example, the 13-billion parameter instruction-tuned model, Vicuna [ 9] (with\nLLAMA-13B [ 10] as the base), which is widely regarded as one of the best models in its\nfamily, as evidenced by its performance on leaderboards like OpenLLM3and ChatArena4.\nAs illustrated in Figure 1, the widely-used evaluation method of using GPT-4 as the judge\nsuggests that Vicuna retains 92%of ChatGPT’s quality. 
However, a more meticulous\nevaluation on reasoning benchmarks against human labels finds Vicuna to retain only 64%\nof ChatGPT’s quality on professional and academic exams (see Figure 2), and only 48%of\nChatGPT’s quality on complex benchmarks like BigBench-hard [ 11] (see Figure 3)5.',
  'score': 18.104232788085938,
  'rank': 3,
  'document_id': 'd6c007be-a8dc-4c41-a038-3fd333ca334c',
  'passage_id': 7},
 {'content': 'For example, we notice\nthat models that are instruction-tuned with GPT-4 responses tend to generate longer texts\nthat GPT-4 prefers over shorter ones; as well as GPT-4 has a bias in the order of the candidate\nresponses. We will show that such auto-evaluation measures overestimate the abilities of\nsmaller models compared to LFMs, as the former are much weaker in comprehension and\nreasoning skills.\n1.2 Key Contributions\nIn this research, our focus is on addressing the challenges mentioned above, specifically with:\nExplanation tuning: We augment⟨query, response⟩pairs with detailed responses from\nGPT-4 that explain the reasoning process of the teacher as it generates the response. These\nprovide the student with additional signals for learning. We leverage system instructions (e.g..,\nexplain like I’m five, think step-by-step and justify your response , etc.) to\nelicit such explanations. This is in contrast to vanilla instruction tuning, which only uses the\nprompt and the LFM response for learning, providing little opportunity for mimicking the\nLFM’s “thought” process.\nScaling tasks and instructions: We utilize the Flan 2022 Collection [ 19] as it provides\nan extensive public assortment of tasks and instructions. Particularly, we use FLAN-\nv2, supplemented with high-quality templates, advanced formatting patterns, and data\naugmentations. Even though FLAN holds tens of millions of instructions, we selectively\nsample from the task collection to form a diverse mixture of tasks, which we then further\nsub-sample to generate complex prompts. These prompts are used to query LFMs like\nChatGPT and GPT-4, thus creating a rich and diverse training set. We collect 5million\nChatGPT responses, from which 1million is further sampled to acquire GPT-4 responses.\nWe demonstrate how ChatGPT as a teacher assistant helps in progressive learning.',
  'score': 16.73236656188965,
  'rank': 4,
  'document_id': 'd6c007be-a8dc-4c41-a038-3fd333ca334c',
  'passage_id': 10},
 {'content': 'A wide range of works in recent times, including Alpaca [ 7], Vicuna [ 9], WizardLM [ 8] and\nKoala [14], have adopted instruction-tuning to train smaller language models with outputs\ngenerated from large foundation models from the GPT family. As outlined in Section 1.1,\na significant drawback with all these works has been both limited task diversity, query\ncomplexity and small-scale training data in addition to limited evaluation overstating the\nbenefits of such approach.\n2.2 Role of System Instructions\nVanilla instruction-tuning (refer to Figure 4 for examples) often uses input, response pairs\nwith short and terse responses. Such responses when used to train smaller models, as in\nexisting works, give them limited ability to trace the reasoning process of the LFM. In\nconstrast, system instructions10in recent LFMs like GPT-4 can be used to provide guidance\n9GPT-4 inference hyper-parameters in Azure OpenAI interface set as: temperature=0.7,\ntop_p=0.95, frequency_penalty=0, presence_penalty=0, stop=None.\n10System instructions are part of the Chat Completion API, which is a new dedicated API for\ninteracting with the ChatGPT and GPT-4 models.\n7System Instruction: You are an AI assistant. User will you give you a task. Your \ngoal is to complete the task as faithfully as you can. While performing the task \nthink step-by-step and justify your steps.\nUser Instruction: Use the given data to calculate the median.\nInput:[7, 3, 8, 2, 10]\nSystem Instruction: You are an AI assistant. User will you give you a task. Your \ngoal is to complete the task as faithfully as you can. While performing the task \nthink step-by-step and justify your steps.\nUser Instruction: Answer this question.\nInput:Which small lake lies between Windermere and Grasmere?System Instruction: You are an AI assistant. 
Provide a detailed answer so user \ndon\'t need to search outside to understand the answer.\nUser Instruction: In this task, you will be presented with a question having \nmultiple possible answers in Italian language. And you should choose a most \nsuitable option out of "A", "B", "C", "D", and "E" based on your commonsense \nknowledge.',
  'score': 16.5113525390625,
  'rank': 5,
  'document_id': 'd6c007be-a8dc-4c41-a038-3fd333ca334c',
  'passage_id': 13},
 {'content': 'Task diversity and data scaling. Human-contributed conversations in ShareGPT are a\nvaluable source of data, but they also have some limitations. They tend to favor creative\n3https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard\n4https://chat.lmsys.org/?arena\n5ChatGPT may have data contamination issues with respect to BigBench\n6https://sharegpt.com/\n5content generation and information-seeking queries over other types of tasks. Therefore,\nmodels trained on such natural conversations may capture the style but not the reasoning\nprocess of the LFMs – demonstrated in the performance of Vicuna in Figures 2 and 3.\nAdditionally, such mode of data collection is also limited in scale. Table 1 shows an overview\nof the size of data and tuning methods employed in recent popular instruction tuning works.\nLimited imitation signals. Existing methods rely on immitation learning from\n⟨query, response⟩pairs generated by the teacher model. However, this provides limited\nsignals to trace the reasoning process of the teacher. Prior works [ 15,16] on open-box model\nshow that richer signals such as logits, intermediate representations and attention states can\nsignificantly improve distillation performance. While they are not accessible for closed-box\nLFM’s7, recent work [ 17] demonstrates that richer signals like LFM rationales can help close\nthe gap for task-specific distillation.\nEvaluation: Previous studies on instruction tuning of small models with LFMs are severely\nlimited in their evaluation protocol. They often rely on GPT-4 for auto-evaluation by asking\nit to compare the outputs of two systems with a prompt like “given responses from system\n1 (reference) and system 2 (target), which one is better?”. However, this approach has\nseveral drawbacks, such as the small size of test sets (e.g., 80instructions in Vicuna and 218\ninstructions in WizardLM) and the biases of GPT-4 as the judge [ 18]. 
For example, we notice\nthat models that are instruction-tuned with GPT-4 responses tend to generate longer texts\nthat GPT-4 prefers over shorter ones; as well as GPT-4 has a bias in the order of the candidate\nresponses.',
  'score': 15.737509727478027,
  'rank': 6,
  'document_id': 'd6c007be-a8dc-4c41-a038-3fd333ca334c',
  'passage_id': 9},
 {'content': 'Step 1: Identify the location of Windermere and Grasmere. They are both \nlocated in the Lake District National Park, in the county of Cumbria, England.\nStep 2: Research the lakes in the Lake District. The Lake District is known for \nits many lakes, so I will look for a list of lakes in the area and their locations in \nrelation to Windermere and Grasmere.\nStep 3: Find the small lake between Windermere and Grasmere. \nAfter researching the lakes in the Lake District, I found that Rydal Water is a \nsmall lake situated between Windermere and Grasmere.Figure 5: Explanation-tuning with GPT-4. In addition to user instructions and input, system\ninstructions are provided to guide the system to form a well-reasoned and cogent response.\nSystem instructions are sampled from a diverse instruction set including chain-of-thought\nreasoning steps, explain like I’m five, being helpful and informative, etc. Such rich and\nwell-structured response allows tuning small models to mimic the thinking process of GPT-4\non⟨{system instruction, user instruction, input}, output ⟩pairs.\nto the model on how to behave and respond. They are written in natural language and\nseparated from the user messages by using the role of “system” in the JSON request. System\ninstructions can specify the tone, task, format, and limitations of the model’s responses.\nSystem instructions are also a way of improving the safety of model responses. For example,\na set of system instructions designed for safety harness could be:\n•The assistant must not generate harmful or offensive content.\n•The assistant must respect the privacy and consent of the user.\n•The assistant must acknowledge its limitations and uncertainties.\n3 Explanation Tuning\nTo address the shortcomings of existing works, we tap into large-scale training data with\ndiverse tasks augmented with complex instructions and rich signals. 
Specifically, our data\ncontains human and augmented system instructions for a large collection of tasks sampled\nfrom FLAN-v2 (aka Flan 2022) [ 19]. Given the large size of the FLAN-v2 collection and\nvarying number of examples for constituent datasets and tasks, we sample from a mixture of\ntasks from different categories (described in the next section) to create our training data.',
  'score': 15.224943161010742,
  'rank': 7,
  'document_id': 'd6c007be-a8dc-4c41-a038-3fd333ca334c',
  'passage_id': 15},
 {'content': 'URL https://github.com/f/awesome-chatgpt-prompts .\n[28]Weijia Xu, Andrzej Banburski-Fahey, and Nebojsa Jojic. Reprompting: Automated chain-of-\nthought prompt inference through gibbs sampling, 2023.\n[29]Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan\nLi, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu,\nZhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie\nPellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent\nZhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob\nDevlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling instruction-finetuned\nlanguage models, 2022.\n[30]Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy\nJones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny\nHernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown,\nJack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. A general language assistant as a\nlaboratory for alignment, 2021.\n50[31]Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic\nhuman falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computa-\ntional Linguistics (Volume 1: Long Papers) , pages 3214–3252. Association for Computational\nLinguistics, 2022.\n[32] OpenAI. Gpt-4 technical report, 2023.\n[33]Tommaso Caselli, Valerio Basile, Jelena Mitrovic, and M. Granitzer. Hatebert: Retraining bert\nfor abusive language detection in english.',
  'score': 14.724531173706055,
  'rank': 8,
  'document_id': 'd6c007be-a8dc-4c41-a038-3fd333ca334c',
  'passage_id': 85},
 {'content': '†Correspondence to subhabrata.mukherjee@microsoft.com\nWe are working with our legal team to publicly release a diff of the model weights in accordance\nwith LLaMA’s release policy to be published at https://aka.ms/orca-lm .\nWork in progress.arXiv:2306.02707v1  [cs.CL]  5 Jun 2023Contents\n1 Introduction 4\n1.1 Challenges with Existing Methods . . . . . . . . . . . . . . . . . . . . . . . . 5\n1.2 Key Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6\n2 Preliminaries 7\n2.1 Instruction Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7\n2.2 Role of System Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7\n3 Explanation Tuning 8\n3.1 Dataset Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8\n3.1.1 System Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .',
  'score': 14.537232398986816,
  'rank': 9,
  'document_id': 'd6c007be-a8dc-4c41-a038-3fd333ca334c',
  'passage_id': 1},
 {'content': 'We collect 5million\nChatGPT responses, from which 1million is further sampled to acquire GPT-4 responses.\nWe demonstrate how ChatGPT as a teacher assistant helps in progressive learning.\nEvaluation: We assess the generative, reasoning, and comprehension abilities of Orca, under\na range of settings: (i) AutoEvaluation with GPT-4 on existing evaluation sets from Vicuna,\nWizardLM and the awesome prompts collection8; (ii) Academic benchmarks like Big-Bench\nHard [4] and TruthfulQA [ 20]; (iii) Professional and Academic exams like SAT, LSAT, GRE,\nGMAT from AGIEval [ 1]; (iv) Safety evaluation with ToxiGen [ 21] to test toxic language\ngeneration and hate speech detection across different minority groups. Finally, we provide\ncase-studies to compare the generation and reasoning abilities of Orca against OpenAI LFMs\nlike ChatGPT and GPT-4, and instruction-tuned smaller model like Vicuna.\n7Note that OpenAI API’s do give access to the top-5logits for each token.\n8https://prompts.chat/\n6Model Tuning Method Data Size Teacher\nAlpaca Simple Instructions / Self-instruct 52K text-da-vinci-003\nVicuna User Instructions / Natural 70K ChatGPT\nDolly User Instructions / Natural 15K Human\nWizardLM Complex Instructions / Evol-instruct 250K ChatGPT\nOrca Complex Instructions / Explanations 5M ChatGPT (5M)\n∩GPT-4 (1M)\nTable 1: Overview of popular models instruction tuned with OpenAI large foundation models\n(LFMs). Orca leverages complex instructions and explanations for progressive learning.\nUser Instruction: Use the given data to calculate the median. \nInput:[7, 3, 8, 2, 10]\nUser Instruction: Answer this question.\nInput:Which small lake lies between Windermere and Grasmere?User Instruction: In this task, you will be presented with a question having \nmultiple possible answers in Italian language. And you should choose a most \nsuitable option out of "A", "B", "C", "D", and "E" based on your commonsense \nknowledge. 
\nInput:Solve this question: Dove non riusciresti a vedere la luce?',
  'score': 14.432781219482422,
  'rank': 10,
  'document_id': 'd6c007be-a8dc-4c41-a038-3fd333ca334c',
  'passage_id': 11}]
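Each hit returned by the search above is a plain dictionary with the keys content, score, rank, document_id and passage_id, so the long dump can be condensed with ordinary Python. The following is a small sketch: summarize_hits is a hypothetical helper written for this article, not a RAGatouille function, and the two sample hits are copied from the output above with their content shortened by hand.

```python
def summarize_hits(results, width=60):
    """Return one compact line per hit: rank, score, passage id, text snippet."""
    lines = []
    for hit in results:
        # Collapse newlines and repeated spaces left over from PDF extraction.
        snippet = " ".join(hit["content"].split())[:width]
        lines.append(
            f"{hit['rank']:>2}  {hit['score']:.2f}  "
            f"passage {hit['passage_id']:>3}  {snippet}"
        )
    return lines


# Two hits copied from the output above, with 'content' shortened by hand.
sample_results = [
    {"content": "And you should choose a most \nsuitable option out of ...",
     "score": 21.259580612182617, "rank": 1,
     "document_id": "d6c007be-a8dc-4c41-a038-3fd333ca334c", "passage_id": 12},
    {"content": "[21]Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, ...",
     "score": 18.882606506347656, "rank": 2,
     "document_id": "d6c007be-a8dc-4c41-a038-3fd333ca334c", "passage_id": 84},
]

for line in summarize_hits(sample_results):
    print(line)
```

In a notebook, the same helper can be called on the real list, for example summarize_hits(results), which makes it easier to spot that rank 2 and rank 8 are bibliography chunks rather than explanatory text.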

Retrieve context, restricted to the top 2 matches


results = RAG.search(query="What is instruction SQUAD V2.0?",k=2)
results


###########################RESPONSE################################
/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
[{'content': '. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9\n3.1.2 Dataset Description and Sampling from the FLAN-v2 Collection . . . 9\n3.1.3 ChatGPT as Teaching Assistant . . . . . . . . . . . . . . . . . . . . . 12\n3.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13\n4 Experiment Setup 14\n4.1 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14\n4.2 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15\n4.2.1 Open-ended Generation Capabilities . . . . . . . . . . . . . . . . . . . 15\n4.2.2 Reasoning Capabilities . . . . . . . . . . . . . . . . . . . . . . . . . .',
  'score': 14.657217979431152,
  'rank': 1,
  'document_id': 'd6c007be-a8dc-4c41-a038-3fd333ca334c',
  'passage_id': 2},
 {'content': 'A wide range of works in recent times, including Alpaca [ 7], Vicuna [ 9], WizardLM [ 8] and\nKoala [14], have adopted instruction-tuning to train smaller language models with outputs\ngenerated from large foundation models from the GPT family. As outlined in Section 1.1,\na significant drawback with all these works has been both limited task diversity, query\ncomplexity and small-scale training data in addition to limited evaluation overstating the\nbenefits of such approach.\n2.2 Role of System Instructions\nVanilla instruction-tuning (refer to Figure 4 for examples) often uses input, response pairs\nwith short and terse responses. Such responses when used to train smaller models, as in\nexisting works, give them limited ability to trace the reasoning process of the LFM. In\nconstrast, system instructions10in recent LFMs like GPT-4 can be used to provide guidance\n9GPT-4 inference hyper-parameters in Azure OpenAI interface set as: temperature=0.7,\ntop_p=0.95, frequency_penalty=0, presence_penalty=0, stop=None.\n10System instructions are part of the Chat Completion API, which is a new dedicated API for\ninteracting with the ChatGPT and GPT-4 models.\n7System Instruction: You are an AI assistant. User will you give you a task. Your \ngoal is to complete the task as faithfully as you can. While performing the task \nthink step-by-step and justify your steps.\nUser Instruction: Use the given data to calculate the median.\nInput:[7, 3, 8, 2, 10]\nSystem Instruction: You are an AI assistant. User will you give you a task. Your \ngoal is to complete the task as faithfully as you can. While performing the task \nthink step-by-step and justify your steps.\nUser Instruction: Answer this question.\nInput:Which small lake lies between Windermere and Grasmere?System Instruction: You are an AI assistant. 
Provide a detailed answer so user \ndon\'t need to search outside to understand the answer.\nUser Instruction: In this task, you will be presented with a question having \nmultiple possible answers in Italian language. And you should choose a most \nsuitable option out of "A", "B", "C", "D", and "E" based on your commonsense \nknowledge.',
  'score': 13.014354705810547,
  'rank': 2,
  'document_id': 'd6c007be-a8dc-4c41-a038-3fd333ca334c',
  'passage_id': 13}]

Converting RAGatouille to a LangChain Retriever to Retrieve Matching Contexts

# Wrap the RAGatouille index as a LangChain retriever that returns the top 3 passages
retriever = RAG.as_langchain_retriever(k=3)
retriever.invoke("What is instruction fine tuning?")

########################RESPONSE###############################
/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
[Document(page_content='And you should choose a most \nsuitable option out of "A", "B", "C", "D", and "E" based on your commonsense \nknowledge. \nInput:Solve this question: Dove non riusciresti a vedere la luce? \nOptions: A scrivaniaB frigoriferoC sole D universoE atticoOutput: First, we need to arrange the data in ascending order: [2, 3, 7, 8, 10]. \nSince there are 5 numbers, the median is the middle number, which is 7.\nOutput: B frigorifero\nOutput: Rydal Water lies between Windermere and Grasmere.\nFigure 4: Instruction-tuning with GPT-49. Given user instructions for a task and an input,\nthe system generates a response. Existing works like Alpaca [ 7], Vicuna [ 9] and variants\nfollow a similar template to train small models with ⟨{user instruction, input}, output ⟩.\n2 Preliminaries\n2.1 Instruction Tuning\nInstruction tuning [ 22] is a technique that allows pre-trained language models to learn\nfrom input (natural language descriptions of the task) and response pairs, for example,\n{"instruction": "Arrange the words in the given sentence to form a grammatically\ncorrect sentence.", "input": "the quickly brown fox jumped", "output": "the brown\nfox jumped quickly"} . Instruction tuning has been applied to both language-only and\nmultimodal tasks. For language-only tasks, instruction tuning has been shown to improve\nthe zero-shot and few-shot performance of models such as FLAN [ 22] and InstructGPT [ 5]\non various benchmarks. For multimodal tasks, instruction tuning has been used to generate\nsynthetic instruction-following data for language-image tasks, such as image captioning [ 23]\nand visual question answering [24].\nA wide range of works in recent times, including Alpaca [ 7], Vicuna [ 9], WizardLM [ 8] and\nKoala [14], have adopted instruction-tuning to train smaller language models with outputs\ngenerated from large foundation models from the GPT family.'),
 Document(page_content='[21]Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece\nKamar. ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate\nspeech detection. In Proceedings of the 60th Annual Meeting of the Association for Computa-\ntional Linguistics (Volume 1: Long Papers) , pages 3309–3326. Association for Computational\nLinguistics, 2022.\n[22]Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan\nDu, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners, 2022.\n[23]DeyaoZhu, JunChen, XiaoqianShen, XiangLi, andMohamedElhoseiny. Minigpt-4: Enhancing\nvision-language understanding with advanced large language models, 2023.\n[24]Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023.\n[25] Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei,\nAtharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, et al.\nSuper-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. In\nProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing ,\npages 5085–5109, 2022.\n[26]Mario Michael Krell, Matej Kosec, Sergio P. Perez, and Andrew Fitzgibbon. Efficient sequence\npacking without cross-contamination: Accelerating large language models without impacting\nperformance, 2022.\n[27] Awesome chatgpt prompts, 2023. URL https://github.com/f/awesome-chatgpt-prompts .\n[28]Weijia Xu, Andrzej Banburski-Fahey, and Nebojsa Jojic. Reprompting: Automated chain-of-\nthought prompt inference through gibbs sampling, 2023.'),
 Document(page_content='This alignment is accomplished by fine-tuning the models via supervised learning on\ndemonstrations of prompts and desired model behavior, and through reinforcement learning\nfrom human preferences [5].\nAs these models continue to evolve and become more powerful, an intriguing question arises:\nCan we use the model itself to supervise its own behavior or that of other AI models? Bai\net al.[6]have shown that by sampling output from an initial model, generating revisions,\nand then fine-tuning the original model based on these revised responses, model behavior\ncan be controlled more effectively and can be made more harmless, with significantly fewer\nhuman labels.\nRecently, there has been an influx of studies using LFMs like ChatGPT and GPT-4 as\nteachers to generate large datasets, for instruction tuning , and to train smaller models,\nsuch as Alpaca [ 7], WizardLM [ 8] and Vicuna [ 9]. While these models can produce content\nthat matches the style of their teachers, they often fall short in terms of the reasoning and\ncomprehension skills displayed by the larger foundation models.\n423.348.9 49.7\n0102030405060\nVicuna-13B ChatGPT Orca-13BAggregate Accuracy (%)BigBench -Hard (Zero -shot, MCQ)Figure 3: For complex zero-shot reasoning tasks in BigBench-Hard, Orca achieves parity\nwith ChatGPT (without any exemplar or CoT) with task performances shown in Figure 12.\nTake, for example, the 13-billion parameter instruction-tuned model, Vicuna [ 9] (with\nLLAMA-13B [ 10] as the base), which is widely regarded as one of the best models in its\nfamily, as evidenced by its performance on leaderboards like OpenLLM3and ChatArena4.\nAs illustrated in Figure 1, the widely-used evaluation method of using GPT-4 as the judge\nsuggests that Vicuna retains 92%of ChatGPT’s quality. 
However, a more meticulous\nevaluation on reasoning benchmarks against human labels finds Vicuna to retain only 64%\nof ChatGPT’s quality on professional and academic exams (see Figure 2), and only 48%of\nChatGPT’s quality on complex benchmarks like BigBench-hard [ 11] (see Figure 3)5.')]
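
Once the retriever returns documents like those above, their text is typically stitched into a prompt for the LLM, as the introduction describes. A minimal sketch of that step follows, in plain Python. The `Doc` class is a hypothetical stand-in exposing the same `page_content` attribute as the documents printed above, and `build_prompt`, `max_chars`, and the prompt wording are illustrative assumptions rather than part of RAGatouille or LangChain.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document (only page_content is used)."""
    page_content: str


def build_prompt(question, docs, max_chars=2000):
    """Join retrieved passages into a numbered context block followed by the question.

    max_chars is an assumed per-passage cap to keep the prompt short.
    """
    parts = []
    for i, doc in enumerate(docs, start=1):
        # Truncate each passage, trim whitespace, and number it for traceability.
        parts.append(f"[{i}] {doc.page_content[:max_chars].strip()}")
    context = "\n\n".join(parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Example with two toy passages in place of real retriever output.
docs = [Doc("Instruction tuning trains on instruction and response pairs."),
        Doc("  It improves zero-shot performance.  ")]
prompt = build_prompt("What is instruction fine tuning?", docs)
print(prompt)
```

Numbering the passages makes it easier to trace an answer back to the chunk it came from.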

Conclusion

ColBERT offers a notable improvement in retrieval quality over traditional single-vector bi-encoder embedding models. Instead of compressing a passage into one vector, it represents text as multi-vector embeddings at the token level and scores a query against a document through late interaction between their token embeddings. This finer-grained matching between queries and documents yields more precise retrieval. More relevant retrieved context, in turn, helps reduce the hallucinations often seen in Large Language Models (LLMs).
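
To make the token-level matching concrete, here is a minimal sketch of the MaxSim scoring used in ColBERT-style late interaction: each query token embedding is matched to its most similar document token embedding, and those maxima are summed. It is written in plain Python for clarity. The function names and the tiny 2-dimensional vectors are illustrative assumptions, not the ColBERT or RAGatouille implementation, which works on batched tensors from a trained encoder.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best cosine similarity to any document token."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    total = 0.0
    for qv in q:
        # Late interaction: each query token picks its best-matching document token.
        total += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return total


# Toy token embeddings (assumed values, one vector per token).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # covers both query tokens
doc_b = [[1.0, 0.0], [-1.0, 0.0]]             # covers only the first query token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Here `doc_a` scores higher because every query token finds a close match. A single pooled vector cannot express this per-token coverage directly.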

References:

https://arxiv.org/abs/2004.12832

https://github.com/bclavie/RAGatouille

https://github.com/stanford-futuredata/ColBERT