LLM Customization: Part 1 - Using an LLM to generate synthetic training data
Objective: Fine-tune an LLM to distinguish between addresses that are misspelled versus those that merely look similar to one another
I. Task: Validating an Address
There are a number of use cases where validating an address, or selecting the correct address from candidates that contain errors, can streamline business processes and lower costs.
Marketing: A modest direct mail campaign can exceed $300,000 for postage and stationery alone. While this tactic has proven to still be effective, it may have success rates as low as 2 percent. Every record validated reduces the cost of the campaign and drives up the potential ROI.
Beyond marketing, however, any opportunity to de-duplicate extraneous records is a key driver of efficiency, potentially reducing storage costs and the amount of pertinent information an analyst needs to analyze.
II. Generating a Synthetic Training Dataset
Synthetically generated training data can be useful for fine-tuning LLMs, especially when the training data contains representative samples. The literature suggests that a relatively small number of good samples (fewer than 10,000, relative to the amount of data used to train the LLM) is needed to tailor model performance. A subset of the generated training data is shown at the end of this post.
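As a purely hypothetical illustration of the record format (these rows are invented for demonstration, not actual generator output), each training record pairs a reference address with a candidate address and a binary label, where 1 marks a misspelling of the same street and 0 marks a similar-looking but genuinely different street:

```python
# Hypothetical rows illustrating the (address, candidate, label) format only
samples = [
    ("123 Main St", "123 Mian St", 1),   # misspelled version of the same street
    ("123 Main St", "123 Maine St", 0),  # similar-looking, different street
]
```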
The following function is used to generate training data:
import random
import string


def syntheticTrainingdata(streetName, percentage=3):
    """Create and label closely misspelled versions of street addresses."""
    # Create a sample address, e.g. "123 Main St"
    addy = generateAddy(streetName)
    if random.choice([True, False]):
        # Create a misspelling of the generated street name
        streetName = addy.split(' ')[1]
        # Number of character positions to corrupt
        positions = round(len(streetName) / percentage)
        misspell = list(streetName)
        # Generate a misspelled address from the sample address
        if random.choice([True, False]):
            # Remove characters at randomly selected positions
            for i in range(positions):
                misspell.pop(random.randrange(1, len(misspell)))
        else:
            # Insert random lowercase letters at randomly selected positions
            for i in range(positions):
                misspell.insert(random.randrange(1, len(misspell)),
                                random.choice(string.ascii_lowercase))
        badAddy = f"{addy.split(' ')[0]} {''.join(misspell)} {addy.split(' ')[2]}"
        return (addy, badAddy, 1)
    else:
        # Use the LLM to create a similar, but different, address
        streetName = addy.split(' ')[1]
        _prompt = f'Word that sounds like, looks like or could be confused with {streetName}'
        try:
            _diffStreet = generate_list(1, _prompt, top_k=16, temperature=0.9, top_p=0.8)
            differentAddy = generateAddy(_diffStreet[0])
            return (addy, differentAddy, 0)
        except Exception:
            # Fallback for instances when the LLM response is invalid
            differentAddy = generateAddy('main')
            return (addy, differentAddy, 0)
This function blends a non-deterministic component, which leverages an LLM to generate sample street names and numbers, with a deterministic approach that generates random but similar misspelling errors from the LLM output.
A second non-deterministic LLM API call is then used to produce similarly spelled street names, forming pairs of addresses that look similar but indicate different physical locations. Error handling has been implemented for instances when the model does not produce valid output.
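The deterministic half of that blend can be sketched in isolation. `misspell_street` below is a hypothetical, self-contained variant of the character deletion/insertion logic (the `seed` parameter is added here only to make the example reproducible; it is not part of the original function):

```python
import random
import string


def misspell_street(street, fraction=3, seed=None):
    """Corrupt roughly len(street)/fraction characters by deleting
    or inserting letters, never touching the first character."""
    rng = random.Random(seed)
    chars = list(street)
    edits = max(1, round(len(street) / fraction))
    if rng.choice([True, False]):
        # Remove characters at randomly selected positions
        for _ in range(edits):
            if len(chars) > 1:
                chars.pop(rng.randrange(1, len(chars)))
    else:
        # Insert random lowercase letters at randomly selected positions
        for _ in range(edits):
            chars.insert(rng.randrange(1, len(chars)),
                         rng.choice(string.ascii_lowercase))
    return ''.join(chars)
```

Pairing `street` with `misspell_street(street)` and the label 1 mirrors the positive-class rows the generator emits.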
III. Implementing non-deterministic (LLM) components of the pipeline
An LLM is used for two things:
- To generate a sample street name
- To generate a word that sounds like another word
In the first case, the sample street names are generated from seed topics. The goal was to generate the most valid but disparate outputs from as few inputs as possible. Two different seed formats were used to prompt the model.
Using a list of random words, the following prompts were generated:
seed_topics_create = []
for seed in seed_topics_static:
    seed_topic_prompt = f'singular nouns with diverse theme of {seed}'
    seed_topics_create.append(seed_topic_prompt)
The phrase “singular nouns with diverse theme of”, combined with adjusting the top_k, temperature, and top_p parameters, was useful for generating up to 3 useful outputs per prompt.
More on these parameters here: Temperature, Top-K, Top-P explained. Essentially, they affect the randomness found in LLM responses.
“Temperature is one of the most fundamental and widely used sampling methods for LLMs. It controls the randomness of the model’s output by scaling the logits (unnormalized log probabilities) before applying the softmax function”
“Top-k sampling restricts the sampling to only the k most likely tokens. For example, if k=10, the model will only consider the 10 most likely tokens when generating each word. This can be beneficial for maintaining coherence, but it may also limit the model’s creativity if set too low.”
“Top-p sampling, also known as nucleus sampling, addresses some limitations of top-k by dynamically choosing the number of tokens to consider. Instead of a fixed k, it uses a probability threshold p. Top-p is often more flexible than top-k, as it adapts to the shape of the probability distribution. A typical value for p might be 0.9, meaning we sample from the smallest set of tokens whose cumulative probability exceeds 90%”
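These three controls can be made concrete with a toy next-token sampler. The sketch below is a hypothetical helper, not part of the article's pipeline: it applies temperature scaling to a small dictionary of logits, then a top-k cutoff, then a top-p (nucleus) cutoff, and samples from whatever tokens survive.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=None, top_p=None, seed=None):
    """Toy sampler: temperature-scale logits, apply top-k, then top-p."""
    rng = random.Random(seed)
    # Temperature: scale logits before the softmax
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    weights = {t: math.exp(l - m) for t, l in scaled.items()}
    z = sum(weights.values())
    probs = {t: w / z for t, w in weights.items()}
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    # Top-k: keep only the k most likely tokens
    if top_k is not None:
        ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches p
    if top_p is not None:
        kept, cum = [], 0.0
        for t, p in ranked:
            kept.append((t, p))
            cum += p
            if cum >= top_p:
                break
        ranked = kept
    # Renormalize over the surviving tokens and sample one
    total = sum(p for _, p in ranked)
    r = rng.random() * total
    for t, p in ranked:
        r -= p
        if r <= 0:
            return t
    return ranked[-1][0]
```

With `top_k=1` the sampler is greedy; raising `temperature` flattens the distribution before either cutoff is applied, which is why higher temperatures produce more varied street names.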
Instead of simply using random words as input, the following list of inputs was used to augment the seed topic list prompts using the same prepended phrase:
seed_topic_variations = ['things not in nature',
                         'technical objects',
                         'everyday things',
                         'non-tangible objects',
                         'things in nature',
                         'words that start with the letter a',
                         'words that start with the letter b',
                         'words that start with the letter c']
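The two seed formats can be combined into a single prompt list with the same prepended phrase. In this sketch the `seed_topics_static` values are hypothetical placeholders, and `seed_topic_variations` is abbreviated from the list above to keep the example self-contained:

```python
seed_phrase = 'singular nouns with diverse theme of'

seed_topics_static = ['river', 'engine']          # hypothetical static seed words
seed_topic_variations = ['things not in nature',  # abbreviated from the full list
                         'technical objects',
                         'words that start with the letter a']

# One prompt per seed, drawing from both formats
prompts = [f'{seed_phrase} {seed}'
           for seed in seed_topics_static + seed_topic_variations]
```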
The combination of code and prompts generated lists of outputs that served as the inputs for creating the training data below!
a subset of the training data …