Crafting Your Own Dataset for Fine-Tuning Llama2 in Google Colab: A Step-by-Step Guide (part 1)

Sadat Shahriar
5 min read · Feb 9, 2024


How would you make a Llama watch movies?

What will you learn:

  1. How to custom-create your own dataset for instruction fine-tuning with Llama2
  2. The end-to-end process from the dataset building to fine-tuning: All with your favorite Google Colab (free version!) :)

(Read part 2 here.)

Why fine-tuning?

I could list a hundred reasons why fine-tuning an open-source LLM is the secret to stellar performance, but let’s zoom in on a few popular ones:

  1. You want a Format Maestro that outputs in a specific format like JSON, CSV, or whatever you need.
  2. You want a domain expert, or a model that is up to date with the latest and greatest insights.
  3. You want to mimic a ChatGPT prodigy for free!

Whatever the reasons are, I am here to show you how to build a custom dataset to fine-tune the Llama2-7b model. More specifically, we will make our own Llama watch the movie “Barbie”!

But before we proceed, what does our Llama2 know about the 2023 movie “Barbie”? We asked: Tell me about the movie Barbie. Llama replied:

Barbie is a movie about a little girl who dreams of being a ballerina. She is given a doll by her grandmother, which is a ballerina. The doll comes to life and takes Barbie on a journey to become a real ballerina.

That’s not what we wanted! Since GenAI models are not deterministic, we asked again. Here’s what we got:

I’m sorry, I don’t know what you’re talking about. Can you be more specific?

Building a Custom Dataset:

First, let’s build up our environment. We need to install some important packages in Google Colab:

!pip install langchain_openai langchain

LangChain is a great framework for all sorts of LLM applications. If you are unfamiliar with it, check out this hands-on tutorial.

These are our necessary imports. You will also need an OpenAI API key; keys are easy to obtain and provide a fast, low-cost way to generate text.

import os
import re
import time
from typing import List, Tuple

import pandas as pd
from tqdm import tqdm

import langchain
from langchain_openai import OpenAI

os.environ["OPENAI_API_KEY"] = 'get-your-own-api-here'
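
The generation loops below call llm(prompt), so we also need to instantiate the model. A minimal sketch using the OpenAI wrapper we just imported (the temperature value is an assumption on my part; tune it to taste):

# instantiate the completion model the loops below will call;
# temperature=0.7 is an assumption -- raise it for more varied pairs
llm = OpenAI(temperature=0.7)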

Now, it’s time to build the dataset. In this tutorial, we will use part of the Wikipedia article for the movie Barbie. We will need three sections of the article: the introduction, the plot, and the cast.

intro_description = f"""Barbie is a 2023 fantasy comedy film...""" #the full intro description
plot_description = f"""Stereotypical Barbie ("Barbie") ...""" #the full plot description
cast_description = f"""Margot Robbie as Barbie, often ...""" #the full cast description

You typically need around 1,000 instruction-response pairs to build an instruction-tuned dataset, though depending on the specific application you may require more. Here, we will go for a thousand:

  1. From the intro section, we will get 300 pairs
  2. From the plot, we will get 600 pairs
  3. From the cast, we will get 100 pairs

We are ready to design our prompt. ChatGPT loves a structured and detailed prompt, and we will respect that!

focus = None  # can be introductory, plot (start/middle/end) or cast
describe = None  # can be intro_description, plot_description or cast_description

prompt = f"""### Instruction: Based on the {focus} information of the movie
"Barbie" below, generate 5 instruction-detailed response pairs.
Make sure the Instruction-Response are in the json format:\n\n
### Example: {{"Instruction": "the instruction", "Response": "the response"}}\n\n
### Description:{describe}\n\n
### Response:"""

Change the “focus” and “describe” variables accordingly (more on that later).

We also need to extract the instruction-response pairs from the model’s output, a JSON-formatted string. Here’s the code for that:

def extract_instruction_response_pairs(string: str) -> Tuple[List[str], List[str]]:
    """
    Extracts pairs of instructions and responses from a JSON-formatted string.

    Parameters:
    - string (str): A string containing JSON-formatted instruction and response pairs.

    Returns:
    - instructions (list): A list of extracted instructions.
    - responses (list): A list of extracted responses corresponding to the instructions.
    """
    pattern = r'{"Instruction": "(.*?)", "Response": "(.*?)"}'

    # Use re.findall to extract matches
    matches = re.findall(pattern, string)

    # Extract lists of "Instruction" and "Response"
    instructions = [match[0] for match in matches]
    responses = [match[1] for match in matches]

    return instructions, responses
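
As a quick sanity check, we can run the extractor on a made-up sample string (the sample below is purely illustrative):

sample = '{"Instruction": "Who directed Barbie?", "Response": "Greta Gerwig directed the 2023 film."}'
ins, res = extract_instruction_response_pairs(sample)
print(ins)  # ['Who directed Barbie?']
print(res)  # ['Greta Gerwig directed the 2023 film.']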

The following will generate 300 pairs of instructions and responses from the introductory part:

## Generating based on the intro section
All_instructions = []
All_responses = []
start = time.time()
for idx in tqdm(range(60)):  # 5 pairs per iteration will result in 5*60 = 300 pairs
    focus = "introductory"
    describe = intro_description
    prompt = f"""### Instruction: Based on the {focus} information of the movie
"Barbie" below, generate 5 instruction-detailed response pairs.
Make sure the Instruction-Response are in the json format:\n\n
### Example: {{"Instruction": "the instruction", "Response": "the response"}}\n\n
### Description:{describe}\n\n
### Response:"""
    generated_text = llm(prompt)
    ins, res = extract_instruction_response_pairs(generated_text)
    All_instructions.extend(ins)
    All_responses.extend(res)

print("\n\n===Time: {} seconds===".format(time.time() - start))

In a similar way, I generated text for the plot as well. I further divided the plot into three separate sections, just for the sake of being “specific”:

  1. the first 2 paragraphs of the plot
  2. the middle 3 paragraphs of the plot
  3. the last 2 paragraphs of the plot

## Generating based on the plot section
focus_list = ["first 2 paragraphs of the plot", "middle 3 paragraphs of the plot",
              "last 2 paragraphs of the plot"]
how_many_iteration = [20, 60, 40]  # we want more data from the middle section
describe = plot_description

for focus, iteration in zip(focus_list, how_many_iteration):
    for idx in tqdm(range(iteration)):
        prompt = f"""### Instruction: Based on the {focus} information of the movie
"Barbie" below, generate 5 instruction-detailed response pairs.
Make sure the Instruction-Response are in the json format:\n\n
### Example: {{"Instruction": "the instruction", "Response": "the response"}}\n\n
### Description:{describe}\n\n
### Response:"""
        generated_text = llm(prompt)
        ins, res = extract_instruction_response_pairs(generated_text)
        All_instructions.extend(ins)
        All_responses.extend(res)

And the code for the cast follows the same approach as the intro.
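
For completeness, here is a minimal sketch of that loop; 20 iterations of 5 pairs each yields the 100 cast pairs (the iteration count is inferred from the 100-pair target above):

## Generating based on the cast section
focus = "cast"
describe = cast_description

for idx in tqdm(range(20)):  # 5 pairs per iteration will result in 5*20 = 100 pairs
    prompt = f"""### Instruction: Based on the {focus} information of the movie
"Barbie" below, generate 5 instruction-detailed response pairs.
Make sure the Instruction-Response are in the json format:\n\n
### Example: {{"Instruction": "the instruction", "Response": "the response"}}\n\n
### Description:{describe}\n\n
### Response:"""
    generated_text = llm(prompt)
    ins, res = extract_instruction_response_pairs(generated_text)
    All_instructions.extend(ins)
    All_responses.extend(res)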

Finally, let’s put everything together to build the dataset as a pandas dataframe.

df = pd.DataFrame({
    "Instructions": All_instructions,
    "Responses": All_responses
})

df.to_csv("Barbie_ChatGPT_genAI.csv", index=False)
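
LLM-generated pairs occasionally repeat themselves, so a quick de-duplication pass before downloading doesn’t hurt. A minimal sketch:

print(df.shape)  # how many pairs did we actually get?
df = df.drop_duplicates(subset=["Instructions"]).reset_index(drop=True)
df.to_csv("Barbie_ChatGPT_genAI.csv", index=False)  # overwrite with the cleaned version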

You can download the data from Google Colab. If you haven’t done that before, check out this short demo.

The data: https://github.com/sadat1971/Llama2_custom_finetuning/tree/main/Data

The code: https://github.com/sadat1971/Llama2_custom_finetuning/blob/main/Barbie_QA_chatGPT.ipynb

Part 2: How to fine-tune on the custom data

Important Notes:

  1. Although we aimed to build 1,000 example pairs, in reality we only got 954. This happens due to the nondeterministic nature of LLMs. Still, a 95.4% yield is not bad!
  2. The total cost of the OpenAI API for this tutorial was just $0.27 (yes, 27 cents!).
  3. The generation took around 6 minutes.
  4. You can play around with the temperature, top-p, and the prompt structure for the OpenAI calls, as sketched below. This is a great read.
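
For instance, a slightly more adventurous configuration of the LangChain wrapper might look like the following (the values are illustrative, not recommendations):

# higher temperature for more varied pairs; both values are assumptions to play with
llm = OpenAI(temperature=0.9, top_p=0.95)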
