RAG vs Fine-tuning: Why not harness the power of both!

Sarmad Afzal
9 min read · Feb 11, 2024


Feature | Infer | Finetune — 3-Pipeline Architecture for an LLM-Based RAG Application on Tech News

In this article we will walk through a project that employs a smaller model like Llama 2 7B for RAG and an expert like GPT-4 for curating a Q/A dataset used for fine-tuning. Once our compact model is good enough at understanding tech jargon, we will phase GPT-4 out (let's say after 6 months or 6 fine-tuning cycles).

So every day at 9 AM we will run the ETL to extract the latest articles, transform them, and load them into a vector database. We will query this data through a chatbot interface and periodically fine-tune the LLM on the Q/A dataset.

Let the Big Boss teach the newbies ;)

What is RAG? (Retrieval Augmented Generation)

LLMs have limited knowledge because they can only speak about what they have seen during pre-training. RAG is the concept of using an instruct-tuned LLM to answer questions from data that is not in the model's memory (technically, its weights) but from context provided in the prompt. Since there is a token limit per prompt, we give the model only the most relevant text to answer from.

The LLM does not know what it is not trained on!
This is RAG — you gave it context, and it gave you back a rephrased answer
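
To make the idea concrete, here is a minimal sketch of the retrieval-then-prompt pattern (not the project code; embed and llm_complete are hypothetical stand-ins for your embedding model and instruct-tuned LLM):

import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_with_rag(question, chunks, embed, llm_complete, top_k=3):
    # Rank the stored chunks by similarity to the question
    q_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), q_vec), reverse=True)
    context = "\n\n".join(ranked[:top_k])
    # Stuff only the most relevant text into the prompt, respecting the token cap
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm_complete(prompt)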

What is Fine-tuning?

Fine-tuning is the process of further training and adjusting a pre-trained LLM for a specific task (technology, in our case). It involves running data through the neural network again and again, but only with a task-specific training dataset, to update the model's weights.

Fine-tuning is not something that will be done every other day, as it has a high computation cost and takes hours to reach an optimal performance level. RAG, on the other hand, is very useful when you have frequent new data points, but it is limited by the relevance of the given context due to the prompt token cap.

Therefore, we will leverage fine-tuning to make our LLM an expert in the technology domain, while keeping RAG for answering from the latest articles.
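
Since the fine-tuning pipeline later in this article relies on PEFT LoRA, here is a minimal, hedged sketch of what attaching a LoRA adapter to a causal LM looks like; the parameter values are illustrative (roughly the Alpaca-LoRA defaults), not the exact config used by the Beam training script:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative only: wrap a base causal LM with small trainable LoRA adapters
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of weights are trainable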

Tech Stack: Tools you need to run this project

Tech Stack for the project

Let's now buckle up for the technical details and see things in action!

3 Pipeline Architecture for LLM-based Apps

The project involves three pipelines (a possible file layout for all the pieces is sketched after this list):

  • Feature Pipeline: It runs as a scheduler every day at 9 AM. It contains an ETL that grabs news articles from the Aylien (Quantexa) News API, then uses Vectara to chunk, embed, and load them into the vector store. The second part uses GPT-4 to generate 5 question/answer pairs from each article and saves them to a Beam storage volume as a CSV file.
  • Inference Pipeline: On a Streamlit chat UI, the user can ask questions, which go as input to Vectara to embed and search the most relevant chunks across all the news articles by cosine similarity. Llama 2 is hosted for inference on Beam as a RESTful API, which is then called to synthesize the final response.
  • Fine-tuning Pipeline: This is again deployed as a scheduler on Beam, to be run monthly. It uses PEFT LoRA and the Hugging Face Transformers library for fine-tuning the LLM. The parameters and prompter are the same as those used for Alpaca.
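
The steps below spread the code across several files, three of which are called app.py, so here is one possible layout (my arrangement, not prescribed by the article):

feature_pipeline/
    extract.py      # pull yesterday's tech articles from the news API
    transform.py    # build llama-index Documents + GPT-generated Q/A pairs
    load.py         # index documents into Vectara, answer the questions
    app.py          # Beam scheduler wrapping extract -> transform -> load
llama2/
    app.py          # Llama 2 REST API deployment on Beam
inference.py        # Streamlit chat UI doing RAG over Vectara
finetune/
    app.py          # monthly LoRA fine-tuning scheduler on Beam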

Feature Pipeline App

Step 1: Create a file called extract.py. You need credentials from Quantexa; you can sign up here: https://aylien.com/news-api-signup and get a free 14-day trial.

import requests, os, time, datetime
from dotenv import load_dotenv
load_dotenv()

# Credentials for the Aylien (Quantexa) News API
username = os.environ["AYLIEN_USERNAME"]
password = os.environ["AYLIEN_PASSWORD"]
AppID = os.environ["AYLIEN_APPID"]


def extract():
    # Exchange username/password for a short-lived bearer token
    token = requests.post(
        "https://api.aylien.com/v1/oauth/token",
        auth=(username, password),
        data={"grant_type": "password"},
    ).json()["access_token"]
    headers = {"Authorization": "Bearer {}".format(token), "AppId": AppID}

    # Tech stories in English from the last day that mention the major tech companies
    url = 'https://api.aylien.com/v6/news/stories?aql=industries:({{id:in.tech}}) AND language:(en) AND text: (tech, google, openai, microsoft, meta, apple, amazon) AND categories:({{taxonomy:aylien AND id:ay.appsci}}) AND sentiment.title.polarity:(negative neutral positive)&cursor=*&published_at.end=NOW&published_at.start=NOW-1DAYS/DAY'

    response = requests.get(url, headers=headers)
    data = response.json()
    stories = data['stories']

    # Return only the article bodies
    combined_text_list = []
    for story in stories:
        combined_text_list.append(story['body'])

    return combined_text_list

Step 2: Create a file called transform.py. You need Vectara, Beam, and OpenAI accounts. We will load the embeddings into Vectara and save the Q/A dataset on Beam storage.

Vectara has a free version: https://console.vectara.com/signup
Beam gives the first 10 hours free — should be enough unless you try fine-tuning: https://www.beam.cloud/login
OpenAI: Everyone Knows ;)
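
All of these services are wired up through environment variables read with python-dotenv in the code below. Assuming you keep them in a single .env file at the project root (variable names taken from the code in this article, values are placeholders), it would look something like this:

# Aylien / Quantexa News API
AYLIEN_USERNAME=...
AYLIEN_PASSWORD=...
AYLIEN_APPID=...

# OpenAI (question generation and the Vectara query engine)
OPENAI_API=...

# Vectara vector store
VECTARA_CUSTOMER_ID=...
VECTARA_CORPUS_ID=...
VECTARA_API_KEY=...

# Beam (calling the deployed Llama 2 REST API)
Beam_key=...

# Hugging Face (downloading Llama 2 weights)
HUGGINGFACE_API_KEY=...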

import requests, os, time, datetime
from dotenv import load_dotenv
load_dotenv()

import pandas as pd
import numpy as np
import ast

from llama_index import Document

import openai
from openai import OpenAI
openai.api_key = os.environ["OPENAI_API"]


def transform(articles):
    current_datetime = datetime.datetime.now()
    formatted_datetime = current_datetime.strftime("%Y-%b-%d")

    # Transform step 1: wrap each article as a llama-index Document with its date
    documents = [Document(text=t, metadata={"Article_Date": formatted_datetime}) for t in articles]

    # Transform step 2: generate questions from each article
    system_prompt = """
You are an AI Based Question Generator. Given the following Article, please generate 5 questions.
Questions should be specific to the article and should be answerable from the article.

Give response in the form of a list. See the example below for formatting the response:

example: ["What is the name of the company?", "What is the name of the CEO?"]

Make sure that it is "" and NOT ''.
Do not write anything other than the questions wrapped in [] and separated by ,.
"""

    questions_df = pd.DataFrame(columns=['Questions', 'Answers', 'Finetuned'])
    client = OpenAI(api_key=os.environ["OPENAI_API"])
    for article in articles:
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo-0125",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": article},
            ],
        )

        # The model returns a Python-style list of question strings
        qns = ast.literal_eval(completion.choices[0].message.content)
        for q in qns:
            questions_df.loc[len(questions_df)] = [q, 0, 0]

    return questions_df, documents

Step 3: Create a file called load.py

import requests, os, time, datetime
from dotenv import load_dotenv
import pandas as pd
import numpy as np
load_dotenv()

from llama_index.indices import VectaraIndex
from llama_index import Document

VECTARA_CUSTOMER_ID = os.environ["VECTARA_CUSTOMER_ID"]
VECTARA_CORPUS_ID = os.environ["VECTARA_CORPUS_ID"]
VECTARA_API_KEY = os.environ["VECTARA_API_KEY"]
os.environ['OPENAI_API_KEY'] = os.environ["OPENAI_API"]

index = VectaraIndex(vectara_api_key=VECTARA_API_KEY, vectara_customer_id=VECTARA_CUSTOMER_ID, vectara_corpus_id=VECTARA_CORPUS_ID)
query_engine = index.as_query_engine()


def get_gpt_ans(question):
    # Answer each generated question using the Vectara-backed query engine
    response = query_engine.query(question)
    return response


def load(documents, df):
    # Index the day's articles into Vectara, then fill in answers for the Q/A dataset
    index.add_documents(documents)
    df['Answers'] = df['Questions'].apply(lambda question: get_gpt_ans(question))
    return df

Step 4: Finally, create a scheduler with app.py to wrap the three functions, then deploy and run it on Beam.

Before you make a deployment, make sure you have installed Beam successfully: https://docs.beam.cloud/getting-started/installation

from beam import App, Volume, Runtime, Image
from load import load
import logging, datetime
from transform import transform
from extract import extract
import pandas as pd
import numpy as np
import os, time

volume_path = "./finetuning_data"

app = App(
    name="FeaturePipeline",
    runtime=Runtime(
        cpu=2,
        memory="4Gi",
        image=Image(
            python_version="python3.10",
            python_packages="requirements.txt",
        ),
    ),
    volumes=[
        Volume(
            name="finetuning_data",
            path=volume_path,
        )
    ],
)


@app.schedule(when="0 9 * * *")
def FeaturePipeline():
    try:
        # Run the full ETL: extract articles, build Q/A pairs + documents, load to Vectara
        articles = extract()
        questions_df, documents = transform(articles)
        df = load(documents, questions_df)

        # Save the day's Q/A dataset to the Beam volume for later fine-tuning
        current_datetime = datetime.datetime.now()
        formatted_datetime = current_datetime.strftime("%Y-%b-%d %H:%M:%S")
        csv_data = df.to_csv(index=False)
        with open(f"{volume_path}/finetune_data_{formatted_datetime}.csv", "w") as f:
            f.write(csv_data)
    except Exception as e:
        logging.error("Error: %s", e)
        print("Error: ", e)

Try running this app: beam run app.py:FeaturePipeline
For deployment: beam deploy app.py:FeaturePipeline
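
The Image in the scheduler points at a requirements.txt that is not shown in the article. Based purely on the imports in extract.py, transform.py, and load.py, a minimal sketch (my assumption; pin versions to whatever you have tested) would be roughly:

requests
python-dotenv
pandas
numpy
openai
llama-index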

So far we are able to extract our data (new articles), transform it into embeddings and a Q/A dataset with GPT-4, and load those into the Vectara vector store and Beam storage respectively. Once deployed, the pipeline will be triggered every day at 9 AM.

Inference Pipeline App

Step 1: Deploy Llama 2 as a REST API for inference. Create app.py in a separate folder called llama2 and make it your current directory, then use the following code. You will need a Hugging Face API key, which can be found here: https://huggingface.co/pricing

from beam import App, Runtime, Image, Output, Volume, VolumeType

import os
import torch
from transformers import (
    GenerationConfig,
    LlamaForCausalLM,
    LlamaTokenizer,
)

base_model = "meta-llama/Llama-2-13b-chat-hf"

app = App(
    name="llama2",
    runtime=Runtime(
        cpu=1,
        memory="32Gi",
        gpu="A10G",
        image=Image(
            python_packages=[
                "accelerate",
                "transformers",
                "torch",
                "sentencepiece",
                "protobuf",
                "bitsandbytes",
                "peft",
            ],
        ),
    ),
    volumes=[
        Volume(
            name="model_weights",
            path="./model_weights",
            volume_type=VolumeType.Persistent,
        )
    ],
)


@app.rest_api()
def generate(**inputs):
    prompt = inputs["prompt"]

    tokenizer = LlamaTokenizer.from_pretrained(
        base_model,
        cache_dir="./model_weights",
        use_auth_token=os.environ["HUGGINGFACE_API_KEY"],
    )
    model = LlamaForCausalLM.from_pretrained(
        base_model,
        torch_dtype=torch.float16,
        device_map="auto",
        cache_dir="./model_weights",
        load_in_4bit=True,
        use_auth_token=os.environ["HUGGINGFACE_API_KEY"],
    )

    tokenizer.bos_token_id = 1
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to("cuda")

    generation_config = GenerationConfig(
        temperature=0.1,
        top_p=0.75,
        top_k=40,
        num_beams=4,
        max_length=512,
    )

    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=128,
            early_stopping=True,
        )

    s = generation_output.sequences[0]
    decoded_output = tokenizer.decode(s, skip_special_tokens=True).strip()

    print(decoded_output)

    return {"answer": decoded_output}

Deploy this on Beam: beam deploy app.py:generate

Step 2: Create a new file called inference.py. This will use llama-index for RAG over the articles stored in Vectara and provide a user interface on Streamlit.
Here I am calling the deployed Llama 2 model from Beam; once your deployment is ready, you can find the API code by clicking the blue "Call API" button inside the Beam console.

Beam Console for Calling the API
from llama_index.indices import VectaraIndex
import os, time, json, requests, re
from dotenv import load_dotenv

load_dotenv()

VECTARA_CUSTOMER_ID = os.environ["VECTARA_CUSTOMER_ID"]
VECTARA_CORPUS_ID = os.environ["VECTARA_CORPUS_ID"]
VECTARA_API_KEY = os.environ["VECTARA_API_KEY"]
os.environ['OPENAI_API_KEY'] = os.environ["OPENAI_API"]
Beam_key = os.environ["Beam_key"]

index = VectaraIndex(vectara_api_key=VECTARA_API_KEY, vectara_customer_id=VECTARA_CUSTOMER_ID, vectara_corpus_id=VECTARA_CORPUS_ID)

# URL of the Llama 2 REST API deployed on Beam (copy yours from the "Call API" button)
url = "https://dys3w.apps.beam.cloud"


def call_model(prompt):
    payload = {"prompt": prompt}
    headers = {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate",
        "Authorization": f"Basic {Beam_key}",
        "Connection": "keep-alive",
        "Content-Type": "application/json"
    }

    response = requests.request("POST", url,
        headers=headers,
        data=json.dumps(payload)
    )

    # The model echoes the prompt; keep only the text after "answer:"
    pattern = r'answer:(.+)'
    res = response.json()['answer']
    match = re.search(pattern, res, re.DOTALL)
    if match:
        answer_content = match.group(1).strip()
        return answer_content, res
    else:
        return "No results found.", res


def final_response(prompt):
    # First we get similar docs from the index
    print("doing doc search")
    retriever = index.as_retriever(summary_enabled=True, similarity_top_k=3)
    sim_docs = retriever.retrieve(prompt)
    sim_docs_text = [doc.text for doc in sim_docs]
    print("got similar docs")

    # Now we pass our prompt + similar docs to the LLM
    prompt_2 = f"""
You are given the context below. Please use that context only to answer the asked question.

context: {sim_docs_text}
question: {prompt}

answer:
"""

    # Invoke the deployed Llama 2 model
    output, out2 = call_model(prompt_2)
    return output


# Streamlit UI
import streamlit as st
st.title('Ask me anything about Technology')

prompt = st.text_area("Enter your question here", "What is software sentiment")
if st.button('Submit'):
    with st.spinner('Wait for it...'):
        st.success(final_response(prompt))

Save this file and run it using: streamlit run inference.py

Now we have a chatbot UI to talk to all our news data stored in the vector database!

Fine-tuning Pipeline App

For fine-tuning, you may have to buy a paid plan from Beam.cloud. We will use the following GitHub repo: https://github.com/slai-labs/get-beam/tree/main/examples/finetune-llama where they have created generic functions for training, but it pulls data from the Hugging Face datasets package. Below you can find the updated code for the repository's app.py file, which instead takes the Q/A dataset that we stored on Beam.

from math import ceil

from beam import App, Runtime, Image, Volume
from helpers import get_newest_checkpoint, base_model
from training import train, load_models
from datasets import load_dataset, DatasetDict, Dataset
import pandas as pd
import numpy as np
import os

beam_ft_data_volume = "./finetuning_data"

# The environment your code runs on
app = App(
    "llama-lora",
    runtime=Runtime(
        cpu=4,
        memory="32Gi",
        gpu="A100-80",
        image=Image(
            python_version="python3.10",
            python_packages="requirements.txt",
        ),
    ),
    # Mount Volumes for fine-tuned models and cached model weights
    volumes=[
        Volume(name="checkpoints", path="./checkpoints"),
        Volume(name="pretrained-models", path="./pretrained-models"),
        Volume(name="finetuning_data", path=beam_ft_data_volume),
    ],
)


# Fine-tuning
@app.schedule(when="0 9 1 * *")
def train_model():
    # Trained models will be saved to this path
    beam_volume_path = "./checkpoints"

    # Combine all the daily Q/A CSVs stored by the feature pipeline
    csv_files = [file for file in os.listdir(beam_ft_data_volume) if file.endswith(".csv")]
    dfs = []
    for csv_file in csv_files:
        file_path = os.path.join(beam_ft_data_volume, csv_file)
        df = pd.read_csv(file_path)
        dfs.append(df)
    combined_df = pd.concat(dfs, ignore_index=True)
    combined_df.reset_index(drop=True, inplace=True)

    # Reshape into the Alpaca-style instruction/input/output format expected by the trainer
    combined_df = combined_df.drop('Finetuned', axis=1)
    combined_df.rename(columns={"Questions": "instruction", "Answers": "output"}, inplace=True)
    combined_df['input'] = np.nan

    # Build a Hugging Face DatasetDict from our own Q/A data
    # (instead of the vicgalle/alpaca-gpt4 dataset used in the original Beam example)
    dataset = DatasetDict({
        "train": Dataset.from_pandas(combined_df),
    })

    # Adjust the training loop based on the size of the dataset
    samples = len(dataset["train"])
    val_set_size = ceil(0.1 * samples)

    train(
        base_model=base_model,
        val_set_size=val_set_size,
        data=dataset,
        output_dir=beam_volume_path,
    )

To deploy: beam deploy app.py:train_model

This will fine-tune the LLM on your articles' Q/A dataset and will be triggered on the 1st of every month at 9 AM.
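
The training script writes LoRA adapter checkpoints to the checkpoints volume. Here is a hedged sketch (not from the article) of how the Llama 2 inference app could pick up the newest adapter on top of the base model; checkpoint_dir is a hypothetical stand-in for whatever the repo's get_newest_checkpoint() returns:

import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

base_model = "meta-llama/Llama-2-13b-chat-hf"
checkpoint_dir = "./checkpoints/latest"  # hypothetical path to the newest LoRA checkpoint

tokenizer = LlamaTokenizer.from_pretrained(base_model)
model = LlamaForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    device_map="auto",
)
# Overlay the fine-tuned LoRA weights on the frozen base model
model = PeftModel.from_pretrained(model, checkpoint_dir)
model.eval()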

In conclusion, we have a model that is ready to answer questions from context and also has an in-depth understanding of technical terms, so it can help with inference from its weights as well. We finally made our little guy (Llama 2) an expert using OpenAI's state-of-the-art GPT-4!

For those interested, here is Github repo link: https://github.com/sarmadafzalj/LLMOps-3pipelines-Batch_Ingestion-Finetuning-And-RAG_Inference

I hope you enjoyed understanding the project and reading my first article on Medium — Cheers 🥂

Happy to connect with you all :)

https://www.linkedin.com/in/sarmadafzal/
https://github.com/sarmadafzalj
