LLM Customization, Part 2: A Few-Shot Approach to Drive Model Improvement

D. Kyle Miller
4 min read · Jul 10, 2024


(This is a continuation of an effort to fine-tune an LLM for a specific task. The first entry discussed how to Generate Synthetic Data using an LLM.)

Specifically, the task is to distinguish between addresses that are misspelled versions of the same address and those that merely look similar to one another…

I. One-Shot Prompting — Baseline

Data classes: misspelled or different location

As a baseline (prior to any fine-tuning), I present "one-shot" results: the LLM's classification responses on the address classification task. The printout below shows whether the LLM correctly classifies each of the respective examples:

Classification Examples

The following prompt and code were used to produce the classification example output:

c = 0        # running count of correct classifications
rowNum = 10  # evaluate the first ten examples

for i in range(rowNum):
    prompt = f'''Are the following addresses misspelled versions of the same
address or two distinct addresses: {df["good address"][i]} and
{df["compare address"][i]}? ANSWER (yes|no)'''

    # Generate a single token so the model answers only yes or no.
    response = llms['gpt20b'].generate(
        prompt, tokens_to_generate=1, return_type='text').strip()

    label = 'Yes' if df['label'][i] == 1 else 'No'

    print(f"Good Address: {df['good address'][i]}")
    print(f"Compare Address: {df['compare address'][i]}")
    print(f'Response from model: {response}')
    print(f'Actual answer: {label}')

    # Compare case-insensitively so 'yes' and 'Yes' both count as a match.
    correct = label.lower() == response.lower()
    print(f'Response from model correct: {correct}\n')
    if correct:
        c += 1

print(f'the accuracy is {c / rowNum}')

A 20-billion-parameter model, gpt20b, was used, but it was unsuccessful at correctly classifying any of the first ten examples in the synthetically created dataset.

The prompt that was used: Are the following addresses misspelled versions of the same address or two distinct addresses?

The accuracy on the first ten examples is zero percent, which confirms that additional techniques are warranted to improve LLM performance.

To improve performance on the given task, two changes were made:

  1. Improving the prompt
  2. Using a larger model

The improved prompt includes worked examples:

Improved Prompt: Some addresses look the same but indicate different physical locations, and therefore they should not be linked; one indication that two addresses should not be linked is that they have different house numbers. Alternatively, some addresses are not exact matches but should be linked; they often fail to match because of a misspelling in one of the street names. For example, this pair of addresses should be matched: 990 sizzling place and 990 sizlig place. Meanwhile, this pair of streets, 67 metal way and 87 petal drive, indicates two different addresses that should not be matched. Should the following two addresses be linked?
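For concreteness, here is a minimal sketch of how the improved prompt could be dropped into the evaluation loop above. The improved_context variable and the 'gpt43b' dictionary key are assumptions for illustration; the post does not show the handle used for the larger model.

improved_context = '''Some addresses look the same but indicate different
physical locations, and therefore they should not be linked; one indication
that two addresses should not be linked is that they have different house
numbers. Alternatively, some addresses are not exact matches but should be
linked; they often fail to match because of a misspelling in one of the
street names. For example, this pair of addresses should be matched:
990 sizzling place and 990 sizlig place. Meanwhile, this pair of streets,
67 metal way and 87 petal drive, indicates two different addresses that
should not be matched.'''

for i in range(rowNum):
    # Prepend the richer context to the same yes/no question as before.
    prompt = f'''{improved_context}
Should the following two addresses be linked: {df["good address"][i]} and
{df["compare address"][i]}? ANSWER (yes|no)'''

    # 'gpt43b' is an assumed key for the larger 43-billion-parameter model.
    response = llms['gpt43b'].generate(
        prompt, tokens_to_generate=1, return_type='text').strip()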

1. When using the improved prompt with the same 20-billion-parameter model, the accuracy increases to 40 percent!

2. When using the improved prompt with a larger 43-billion-parameter model, the accuracy increases to 60 percent!

43 Billion Parameter Model

II. Few-Shot Prompting

The few-shot prompting format involves including more representative examples directly in the prompt (see the sketch below). The experiment uses a 43-billion-parameter model (the same one used to achieve the 60 percent accuracy above) that has been instruction-tuned to follow a template prompt.

Note: The model was not fine-tuned on this task; rather, it was tuned to accept instructions with a specific prompt template.

For NeMo GPT models, the required prompt is a single string that incorporates the examples according to the model's instruction fine-tuning prompt template. The examples are formatted to follow that template, with explicit "User:" and "Assistant:" labels. This structured formatting is crucial for ensuring the model correctly interprets the provided context and generates responses consistent with the instruction-tuning template.
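To make the formatting concrete, here is a minimal sketch of how such a few-shot prompt string might be assembled. The exact NeMo instruction template is not reproduced in this post, so the plain "User:"/"Assistant:" turn layout, the build_few_shot_prompt helper, and the 'gpt43b' key are assumptions for illustration.

# Hypothetical few-shot examples; each pair carries a 'yes' (same address,
# misspelled) or 'no' (distinct locations) label.
few_shot_examples = [
    ('990 sizzling place', '990 sizlig place', 'yes'),
    ('67 metal way', '87 petal drive', 'no'),
]

def build_few_shot_prompt(addr_a, addr_b):
    '''Assemble a prompt following an assumed "User:"/"Assistant:" layout.'''
    turns = []
    for a, b, answer in few_shot_examples:
        turns.append(f'User: Should the following two addresses be linked: '
                     f'{a} and {b}? ANSWER (yes|no)')
        turns.append(f'Assistant: {answer}')
    # The query pair goes last, leaving the final Assistant turn open so the
    # model completes it with a single yes/no token.
    turns.append(f'User: Should the following two addresses be linked: '
                 f'{addr_a} and {addr_b}? ANSWER (yes|no)')
    turns.append('Assistant:')
    return '\n\n'.join(turns)

response = llms['gpt43b'].generate(
    build_few_shot_prompt(df['good address'][0], df['compare address'][0]),
    tokens_to_generate=1, return_type='text').strip()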

Using the 43-billion-parameter model with the appropriate few-shot template structure yields an accuracy of 90% on the first 10 examples from the synthetic dataset… a 30-percentage-point improvement over the improved one-shot prompt that can be attributed directly to few-shot prompting!

90 percent accuracy

These results are in line with expectations from a similar experiment completed by NVIDIA using medical records data.

Results using PubMed Data from NIH for another fine-tuned use case
