InstructLab and Synthetic Data Generation with llama3 Models: Tackling the Promotion Ranges Problem

Mew Leenutaphong · Published in ibm-watsonx-th · 7 min read · Aug 6, 2024

We will be using InstructLab as a tool to generate, train, and serve models on a local Mac notebook together with IBM watsonx llama3 models. In this blog post, we walk you through the steps to set up a custom Flask backend for generating synthetic data and evaluating model performance, covering everything from model serving to synthetic data generation and model evaluation.

Research Paper

LAB (Large-Scale Alignment for ChatBots) is a novel methodology designed to address scalability challenges in the instruction-tuning phase of large language model (LLM) training. By using a taxonomy-guided synthetic data generation process and a multi-phase tuning framework, LAB reduces the need for costly human annotations and proprietary models like GPT-4. LAB-trained models achieve competitive performance across various benchmarks, offering a scalable and cost-effective way to enhance LLM capabilities and instruction-following behavior without catastrophic forgetting. This marks a significant advancement in the efficient training of LLMs for diverse applications.

>> In our case, we are using a much smaller model than GPT-4 (llama3) as the synthetic data generator.

More details about InstructLab: https://github.com/instructlab/instructlab
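If you want to follow along, a typical local setup looks roughly like the sketch below. The package name is from the InstructLab README; the init command differs between releases (older versions use `ilab init`), so verify against `ilab --help` for your version.

pip install instructlab
ilab config init   # newer CLI: creates the default config and taxonomy checkout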

Step 1: Model Serving

The BE/app.py snippet for generating synthetic data using IBM watsonx.ai llama3 (see full code here..):

from flask import Flask, jsonify

app = Flask(__name__)

# extract_request_data_chat, send_to_watsonxai, llama3_prompt and
# current_model are defined elsewhere in BE/app.py (see the full code).

@app.route('/v1/chat/completions', methods=['POST'])
def chat_completions():
    data = extract_request_data_chat()
    print("DATA,", data)
    # Validate the payload before touching it
    if data is None:
        return jsonify({"error": "Error processing request data"}), 400
    prompt = data['messages']
    # Mimic the OpenAI chat-completions response schema so that ilab
    # can talk to this endpoint as if it were an OpenAI server
    final = {
        "id": "1",
        "object": "chat.completion",
        "model": current_model,
        "choices": [
            {
                "index": 0,
                "message": {
                    "content": send_to_watsonxai([llama3_prompt(prompt)])[0],
                    "role": "assistant"
                }
            }
        ]
    }
    print(final)
    return final

Next, we choose the problem we are interested in. Traditionally, LLMs are poor at numerical reasoning, so the task I chose as an example is using an LLM to understand promotion ranges. The seed examples look like this (they map onto the qna.yaml sketch shown after the list):

• Created by: Kandanai
• Seed Examples:
1. Question: “If I spend 6,000 baht, how much M cash coupon will I get in return?”
• Answer: “For a purchase of 6,000 baht at a Korat branch, you will receive a 400 baht M Cash Coupon. At other branches, the coupon amount will be 200 baht.”
• Context: “Below is information about how much M cash coupon one can receive during the promotion period.
• When spending/sales slip is 4,000 - 49,999 baht (only at Korat branch) receive M Cash Coupon 400 baht.
• When spending amount/sales slip is 6,000 - 15,999 baht (except Korat branch) receive M Cash Coupon 200 baht,
• Spending amount/sales slip 16,000 – 29,999 baht (except Korat branch) receive M Cash Coupon 1,000 baht,
• Spending amount/sales slip 30,000 – 49,999 baht (except Korat branch) receive M Cash Coupon 2,500 baht.
• Spending amount/sales slip from 50,000 baht or more (all branches) receive M Cash Coupon 5,000 baht.
• M Cash Coupon has no limit on the number of rights throughout the promotional period.
• Please check additional conditions at the point of sale before making a transaction.”
2. Question: “If I spend 35,000 baht, how much M cash coupon will I get in return?”
• Answer: “At branches other than Korat, a 35,000 baht purchase will yield a 2,500 baht M Cash Coupon.”
• Context: “Below is information about how much M cash coupon one can receive during the promotion period.
• When spending/sales slip is 4,000 - 49,999 baht (only at Korat branch) receive M Cash Coupon 400 baht.
• When spending amount/sales slip is 6,000 - 15,999 baht (except Korat branch) receive M Cash Coupon 200 baht,
• Spending amount/sales slip 16,000 – 29,999 baht (except Korat branch) receive M Cash Coupon 1,000 baht,
• Spending amount/sales slip 30,000 – 49,999 baht (except Korat branch) receive M Cash Coupon 2,500 baht.
• Spending amount/sales slip from 50,000 baht or more (all branches) receive M Cash Coupon 5,000 baht.
• M Cash Coupon has no limit on the number of rights throughout the promotional period.
• Please check additional conditions at the point of sale before making a transaction.”
• Task Description: “Ranges problem”
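In InstructLab, these seed examples live in a qna.yaml file inside the taxonomy tree, in our case under compositional_skills/extraction/inference/ranges_problem. Below is a minimal sketch of what that file might look like, with the long context strings abridged; the exact schema varies between InstructLab versions, so check the taxonomy README for yours.

created_by: Kandanai
task_description: Ranges problem
seed_examples:
  - question: If I spend 6,000 baht, how much M cash coupon will I get in return?
    answer: >-
      For a purchase of 6,000 baht at a Korat branch, you will receive a
      400 baht M Cash Coupon. At other branches, the coupon amount will be 200 baht.
    context: >-
      Below is information about how much M cash coupon one can receive
      during the promotion period. ...
  - question: If I spend 35,000 baht, how much M cash coupon will I get in return?
    answer: >-
      At branches other than Korat, a 35,000 baht purchase will yield a
      2,500 baht M Cash Coupon.
    context: >-
      Below is information about how much M cash coupon one can receive
      during the promotion period. ...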

Step 2: Generating Synthetic Data with IBM watsonx.ai llama3 models

Once the YAML file is in place, we can generate some synthetic data:

ilab data generate --endpoint-url http://localhost:8001/v1

Step 3: Evaluating

After training the model, you can serve the newly trained model and generate inferences. Here, we created a dataset with 200 ranges-related problems, evaluated the model's performance, and stored the results in the scoring folder. For more details, see main.ipynb.
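Training and serving are handled by the ilab CLI. A rough sketch of the commands involved; the names below follow the newer `ilab model ...` CLI, while older releases use `ilab train` and `ilab serve`, so check `ilab --help` for your version:

ilab model train
ilab model serve --model-path <path-to-your-trained-model>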

Results:

Accuracy of merlinite-7b-lab-Q4_K on the 200-question ranges evaluation set (discussed further in the Conclusion):

• No synthetic training data: 73%
• 100 synthetic data points: 64%
• 1,000 synthetic data points: 76%

Methodology:

Here is the prompt we use for the LLM-as-evaluator:

def prompt_generation(reference, question, written_answer, answer):
    # Build an OpenAI-style message list and render it with the llama3 chat template
    messages = {
        "messages": [
            {
                "content": "You are a careful grader tasked with evaluating the differences between a response from a model and an answer written by a human.",
                "role": "system"
            },
            {
                "content": f"""User question: {question}

Sources that were retrieved from the vector database (could be right or wrong):
{reference}

Model answer written by a human, which is always right:
{written_answer}

Response from the model:
{answer}

Model Grading Criteria:

Accuracy of Information (0-4 points):
4 points: The response is perfectly accurate, matching the model answer without any discrepancies.
2-3 points: Minor differences from the model answer are present, but they do not significantly distort the informational content.
0-1 point: Major discrepancies from the model answer, or content not found in the model answer, greatly affecting the response's credibility.

Relevance and Completeness (0-3 points):
3 points: The response completely aligns with the model answer in terms of relevance, addressing the query in full.
1-2 points: The response generally aligns with the model answer but misses some important nuances or aspects.
0 points: Significant deviation from the model answer in terms of relevance or completeness, missing key portions of the query or altering the intended information.

Clarity and Coherence (0-2 points):
2 points: The response maintains the clarity and structure of the model answer, well-structured and coherent.
1 point: Minor deviations in clarity or coherence from the model answer, with some grammatical/style issues or logical inconsistencies.
0 points: Major divergence from the model answer in terms of structure or coherence, difficult to understand or poorly articulated.

Traceability (0-1 point):
1 point: All information in the response can be directly traced to and aligns with the model answer.
0 points: The response introduces conjectures or contains untraceable pieces of information not aligned with the model answer.

Qualitative Feedback:
Compare the model response with the human-written response as the ground truth, to grade the model response out of 10.

Response Format:
[Enter the total score [0-10] and Qualitative Feedback]

Response:""",
                "role": "user"
            }
        ]
    }
    return llama3_prompt(messages["messages"])

Instead of GPT-4, we used llama3 models to evaluate the performance of the RAG pipeline. (In our testing, llama3 is as performant as GPT-4 when used as an LLM evaluator.)
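For completeness, here is a minimal sketch of how the grading loop might look. It reuses the prompt_generation and send_to_watsonxai helpers from the snippets above; the dataset field names and the score-parsing regex are assumptions for illustration (the actual loop lives in main.ipynb).

import re

def score_dataset(dataset):
    # dataset: list of dicts holding context, question, human answer and model answer.
    # These field names are hypothetical; adjust them to your own dataset schema.
    scores = []
    for row in dataset:
        grading_prompt = prompt_generation(
            row["context"], row["question"], row["human_answer"], row["model_answer"]
        )
        feedback = send_to_watsonxai([grading_prompt])[0]
        # The rubric sums to 10 (4 + 3 + 2 + 1) and the evaluator is asked to
        # lead with the total score, so grab the first integer in the 0-10 range
        match = re.search(r"\b(10|[0-9])\b", feedback)
        if match:
            scores.append(int(match.group(1)))
    return sum(scores) / len(scores) if scores else 0.0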

Conclusion

The experimental results show a clear improvement in the merlinite-7b-lab-Q4_K model’s performance with increased training data. Starting at 73% with no training data, the performance initially dropped to 64% with 100 data points, likely due to overfitting or noise. However, with 1000 data points, the performance improved significantly to 76%.

These results emphasize the critical role of ample training data in enhancing model accuracy and reliability. Leveraging InstructLab’s synthetic data generation and evaluation framework proves to be an effective, scalable, and cost-efficient approach to refining large language models.

Extension

While this experiment focused on English, InstructLab can also generate Thai synthetic data with llama3 models by adding the 10th instruction shown in the prompt below: `The Instruction, Input and Output should be in Thai only.`

You are asked to come up with a set of 5 diverse task instructions under compositional_skills->extraction->inference->ranges_problem for the task "Ranges problem". These task instructions will be given to a llama3 model and we will evaluate the llama3 model for completing the instructions.

Here are the requirements:
1. Try not to repeat the verb for each instruction to maximize diversity.
2. The language used for the instruction also should be diverse. For example, you should combine questions with imperative instructions.
3. The type of instructions should not have topic diversity. The list should follow the same topic and category.
4. A llama3 model should be able to complete the instruction. For example, do not ask the assistant to create any visual or audio output. For another example, do not ask the assistant to wake you up at 5pm or set a reminder because it cannot perform any action.
5. The instructions should be in English.
6. The instructions should be 1 to 2 sentences long. Either an imperative sentence or a question is permitted.
7. You should generate an appropriate input to the instruction. The input field should contain a specific example provided for the instruction. It should involve realistic data and should not contain simple placeholders. The input should provide substantial content to make the instruction challenging but should ideally not exceed 100 words.
8. Not all instructions require input. For example, when an instruction asks about some general information, "what is the highest peak in the world", it is not necessary to provide a specific context. In this case, we simply put "<noinput>" in the input field.
9. The output should be an appropriate response to the instruction and the input. Make sure the output is less than 100 words.
10. The Instruction, Input and Output should be in Thai only.


List of 5 tasks:
