LLM Customization: Part 1 - Using an LLM to generate synthetic training data

D. Kyle Miller
5 min read · Jul 8, 2024


Objective: Fine-tuning an LLM to distinguish between addresses that are misspelled versus those that merely look similar to one another

I. Task: Validating an Address

There are a number of use cases where validating an address or selecting the correct address from ones that contain errors can drive business processes and lower costs.

Which address is valid?

Marketing: A modest direct mail campaign could exceed $300,000 for postage and stationery alone. While this tactic has proven to still be effective, it might have success rates as low as 2 percent. Every record validated reduces the cost of the campaign and drives up the potential ROI.

Beyond marketing, however, any opportunity to de-duplicate extraneous records is a key driver of efficiency, potentially reducing storage costs and the amount of pertinent information an analyst needs to review.

II. Generating a Synthetic Training Dataset

Synthetically generated training data can be useful for fine-tuning LLMs, especially when the training data contains representative samples. The literature suggests that a relatively small number of good samples (fewer than 10,000, relative to the number of samples used to train the LLM) is needed to tailor model performance. A few of the samples generated with an LLM are shown below:

Synthetic Address Data

The following function is used to generate training data:

import random
import string

def syntheticTrainingData(streetName, percentage=3):
    """Create and label pairs of addresses: closely misspelled versions of
    the same address (label 1), or similar-looking but genuinely different
    addresses (label 0)."""

    # Create a sample address (generateAddy is a helper defined elsewhere
    # in the pipeline)
    addy = generateAddy(streetName)

    if random.choice([True, False]):
        # Create a misspelled copy of the same address
        streetName = addy.split(' ')[1]

        # Number of character positions to perturb
        positions = max(1, round(len(streetName) / percentage))

        misspell = list(streetName)

        # Generate a misspelled street name from the sample address
        if random.choice([True, False]):
            # Remove characters at random positions (never the first letter)
            for _ in range(positions):
                if len(misspell) > 1:
                    misspell.pop(random.randrange(1, len(misspell)))
        else:
            # Insert random lowercase characters at random positions
            for _ in range(positions):
                misspell.insert(random.randrange(1, len(misspell)),
                                random.choice(string.ascii_lowercase))

        badAddy = f"{addy.split(' ')[0]} {''.join(misspell)} {addy.split(' ')[2]}"
        return (addy, badAddy, 1)

    else:
        # Use the LLM to create a similar but different address
        streetName = addy.split(' ')[1]

        _prompt = f'Word that sounds like, looks like or could be confused with {streetName}'

        try:
            _diffStreet = generate_list(1, _prompt, top_k=16, temperature=0.9, top_p=0.8)
            differentAddy = generateAddy(_diffStreet[0])
            return (addy, differentAddy, 0)
        except Exception:
            # Fallback for instances when the LLM response is invalid
            differentAddy = generateAddy('main')
            return (addy, differentAddy, 0)
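A minimal sketch of how this function might be called to assemble a labeled dataset is shown below; the street-name list and sample count are hypothetical placeholders for the LLM-generated seeds described in section III:

# Hypothetical seed street names; in practice these come from the
# LLM-generated seed topics described in section III
streetNames = ['maple', 'harbor', 'cedar']

# Each record is a (address, variant, label) tuple: label 1 marks a
# misspelling of the same address, label 0 a different address
trainingData = [syntheticTrainingData(random.choice(streetNames))
                for _ in range(1000)]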

This function blends a non-deterministic component, which leverages an LLM to generate sample street names and numbers, with a deterministic approach that introduces random but plausible misspellings into the LLM output.

A non-deterministic LLM API call is then used to produce similarly spelled street names, forming pairs of addresses that may look alike but indicate different physical locations. Error handling has been implemented for instances when the model does not produce valid output.
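The article does not show generate_list itself, so the sketch below is one plausible shape for it. The llm_client.generate call and the line-based parsing are assumptions for illustration, not the author's actual implementation:

def generate_list(n, prompt, top_k=40, temperature=0.7, top_p=0.9):
    """Ask the LLM for n items matching the prompt and return them as a list."""
    # llm_client is a hypothetical wrapper around whichever LLM API is in use
    response = llm_client.generate(
        prompt=f'List {n} examples of: {prompt}. One per line, no numbering.',
        top_k=top_k,
        temperature=temperature,
        top_p=top_p,
    )

    # Keep only non-empty lines; raising here triggers the fallback
    # in syntheticTrainingData when the model returns too few items
    items = [line.strip() for line in response.splitlines() if line.strip()]
    if len(items) < n:
        raise ValueError('LLM returned fewer items than requested')
    return items[:n]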

III. Implementing non-deterministic (LLM) components of the pipeline

An LLM is used for two things:

  1. To generate a sample street name
  2. To generate a word that sounds like another word

In the first case, the sample street names are generated from seed topics. The goal was to generate the most valid but disparate outputs from as few inputs as possible. Two different seed formats were used to prompt the model.

Using a list of random words, the following prompts were generated:

LLM Prompting format

seed_topics_create = []
for seed in seed_topics_static:
    # Prepend the same phrase to each random seed word
    seed_topic_prompt = f'singular nouns with diverse theme of {seed}'
    seed_topics_create.append(seed_topic_prompt)

The phrase “singular nouns with diverse theme of”, combined with tuning the top_k, temperature, and top_p parameters, was useful for generating up to three usable outputs from each prompt.
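Feeding those prompts back through generate_list might look like the following sketch; the parameter values mirror the earlier call but are otherwise illustrative:

streetNames = []
for prompt in seed_topics_create:
    try:
        # Ask for up to 3 candidate street names per prompt
        streetNames.extend(generate_list(3, prompt, top_k=16,
                                         temperature=0.9, top_p=0.8))
    except ValueError:
        continue  # skip prompts whose responses could not be parsed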

More on these parameters here: Temperature, Top-K, Top-P explained. Essentially, they affect the randomness found in LLM responses (a small sampling sketch follows the quoted definitions).

“Temperature is one of the most fundamental and widely used sampling methods for LLMs. It controls the randomness of the model’s output by scaling the logits (unnormalized log probabilities) before applying the softmax function”

“Top-k sampling restricts the sampling to only the k most likely tokens. For example, if k=10, the model will only consider the 10 most likely tokens when generating each word. This can be beneficial for maintaining coherence, but it may also limit the model’s creativity if set too low.”

“Top-p sampling, also known as nucleus sampling, addresses some limitations of top-k by dynamically choosing the number of tokens to consider. Instead of a fixed k, it uses a probability threshold p. Top-p is often more flexible than top-k, as it adapts to the shape of the probability distribution. A typical value for p might be 0.9, meaning we sample from the smallest set of tokens whose cumulative probability exceeds 90%”
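To make the temperature quote concrete, here is a minimal, self-contained sketch of temperature-scaled softmax sampling; the logits are invented for the example:

import numpy as np

def sample_with_temperature(logits, temperature=1.0):
    """Scale logits by temperature, apply softmax, and sample a token index."""
    scaled = np.asarray(logits) / temperature
    probs = np.exp(scaled - np.max(scaled))  # subtract max for numerical stability
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

# Hypothetical logits for a 4-token vocabulary
logits = [2.0, 1.0, 0.5, 0.1]

# Low temperature sharpens the distribution toward the most likely token;
# high temperature flattens it toward uniform randomness
sample_with_temperature(logits, temperature=0.2)
sample_with_temperature(logits, temperature=1.5)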

Instead of simply using random words as input, the following list of inputs was used to augment the seed topic prompts, with the same prepended phrase (a short sketch of the augmentation follows the list):

seed_topic_variations = ['things not in nature',
                         'technical objects',
                         'everyday things',
                         'non-tangible objects',
                         'things in nature',
                         'words that start with the letter a',
                         'words that start with the letter b',
                         'words that start with the letter c']
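Augmenting the prompt list with these variations reuses the same prepended phrase, for example:

for variation in seed_topic_variations:
    seed_topics_create.append(f'singular nouns with diverse theme of {variation}')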

The combination of code and prompts generated lists of outputs that served as the inputs for creating the training data below!

a subset of the training data …

Sample Training Data

