How to Augment Text Data with Gemini through BigQuery DataFrames

Karl Weinmeister
Google Cloud - Community
Apr 6, 2024

Data augmentation is a machine learning technique that increases the size of a dataset by creating new data from existing data. It can help a model generalize better and avoid overfitting to its training data.

You probably know this technique from image data: rotating, flipping, cropping, and so forth. torchvision, PyTorch's companion vision library, has a very useful transforms package that lets you apply random transformations to your dataset with just a few lines of code. While this may reduce accuracy on the training set, it often improves accuracy on the test set of unseen data, which is what really matters!

Example: Data Augmentations of Rock Images, Credit: TseKiChun, CC BY-SA 4.0, via Wikimedia Commons

We can apply the same technique to text data. It offers the same benefits: it stretches your existing dataset and makes your model more robust to noise and outliers. Data augmentation has been shown to help across many kinds of datasets, and it is particularly beneficial for small ones.

Let’s explore a few examples using popular techniques:

There are a number of ways you can apply these techniques manually. Let’s say you want to apply the random deletion technique, deleting each token with probability p=0.1. You can tokenize the text and then keep each token with probability (1-p). For back-translation, you can call the Translation API once to translate into the target language, and then a second time to translate back to the original language. For synonyms, you could look up randomly chosen tokens through a WordNet API.
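
To make the manual route concrete, here is a minimal sketch of random deletion and synonym replacement in plain Python. It assumes simple whitespace tokenization, and the `TOY_SYNONYMS` table is a made-up stand-in for a real WordNet lookup; the function names are illustrative and do not come from the notebook.

```python
import random
from typing import Dict, List, Optional

def random_deletion(text: str, p: float = 0.1, seed: Optional[int] = None) -> str:
    """Keep each whitespace-separated token with probability (1 - p)."""
    rng = random.Random(seed)
    tokens = text.split()
    kept = [tok for tok in tokens if rng.random() >= p]
    # Avoid returning an empty string when every token was dropped.
    if not kept and tokens:
        kept = [rng.choice(tokens)]
    return " ".join(kept)

# Hypothetical, hand-written synonym table (assumption: stands in for WordNet).
TOY_SYNONYMS: Dict[str, List[str]] = {
    "fast": ["quick", "rapid"],
    "error": ["exception", "fault"],
    "fix": ["repair", "resolve"],
}

def synonym_replacement(
    text: str,
    n: int,
    synonyms: Dict[str, List[str]] = TOY_SYNONYMS,
    seed: Optional[int] = None,
) -> str:
    """Replace up to n tokens that have an entry in the synonym table."""
    rng = random.Random(seed)
    tokens = text.split()
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in synonyms]
    for i in rng.sample(candidates, min(n, len(candidates))):
        tokens[i] = rng.choice(synonyms[tokens[i].lower()])
    return " ".join(tokens)
```

Even this small sketch has to make its own decisions about tokenization, casing, and empty results, which is part of why a single general-purpose tool is appealing.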

With a powerful LLM like Gemini, you have a bag of tricks at your fingertips. You can easily make these modifications and much more in one toolset. No need to cobble together multiple tools any longer.

Let’s look at how to apply these techniques on a real world dataset of Stack Overflow questions and answers. All of the details are provided in this notebook, and I’ll point out the highlights here.

You can use BigQuery DataFrames for all kinds of problems, but it will make text augmentation on our BigQuery dataset particularly straightforward. It provides a pandas-compatible DataFrame and scikit-learn-like ML API that enables us to query Gemini directly. It can handle batch jobs on massive datasets, as all DataFrame storage is in BigQuery.

So, let’s get started with one of these techniques, synonym replacement. First, we can query for accepted Stack Overflow Python answers since 2020, and put them into a BigQuery DataFrame:

import bigframes.pandas as bpd

# Join each question to its accepted answer and keep Python-tagged posts
stack_overflow_df = bpd.read_gbq_query(
    """SELECT
           CONCAT(q.title, q.body) AS input_text,
           a.body AS output_text
       FROM `bigquery-public-data.stackoverflow.posts_questions` q
       JOIN `bigquery-public-data.stackoverflow.posts_answers` a
         ON q.accepted_answer_id = a.id
       WHERE q.accepted_answer_id IS NOT NULL
         AND REGEXP_CONTAINS(q.tags, "python")
         AND a.creation_date >= "2020-01-01"
       LIMIT 550
    """)

Here’s a sneak peek of the Q&A DataFrame:

Let’s now randomly sample a number of rows from the dataframe. Set n_rows to the number of new samples you’d like:

df = stack_overflow_df.sample(n_rows)

We can then define a Gemini text generator model like this:

from bigframes.ml.llm import GeminiTextGenerator

model = GeminiTextGenerator()

Next, let’s create two columns: a prompt column with synonym replacement instructions concatenated with the input text, and a result column with the synonym replacement applied.

# Create a prompt with the synonym replacement instructions and the input text.
# n_replacement_words must be defined beforehand. Each instruction ends with a
# space so the sentences do not run together when concatenated.
df["synonym_prompt"] = (
    f"Replace {n_replacement_words} words from the input text with synonyms, "
    + "keeping the overall meaning as close to the original text as possible. "
    + "Only provide the synonymized text, with no additional explanation. "
    + "Preserve the original formatting.\n\nInput text: "
    + df["input_text"])

# Run batch job and assign to a new column
df["input_text_with_synonyms"] = model.predict(
    df["synonym_prompt"]
).ml_generate_text_llm_result

# Compare the original and new columns
df.peek()[["input_text", "input_text_with_synonyms"]]

Here are the results! Notice the subtle changes in the text with synonym replacement.
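
The same prompt-plus-predict pattern could be wrapped in a small helper so that switching techniques only means switching templates. The sketch below is one way to do that. The back-translation and noise-injection wordings are my own guesses rather than the notebook's prompts, and `PROMPT_TEMPLATES` and `build_prompt_prefix` are hypothetical names.

```python
# Hypothetical instruction templates. Only the "synonym" wording follows the
# prompt used in this post; the other two are illustrative assumptions.
PROMPT_TEMPLATES = {
    "synonym": (
        "Replace {n} words from the input text with synonyms, "
        "keeping the overall meaning as close to the original text as possible. "
        "Only provide the synonymized text, with no additional explanation. "
        "Preserve the original formatting."
    ),
    "back_translation": (
        "Translate the input text to {language} and then back to English. "
        "Only provide the final English text, with no additional explanation. "
        "Preserve the original formatting."
    ),
    "noise_injection": (
        "Introduce {n} small typos or misspellings into the input text. "
        "Only provide the modified text, with no additional explanation. "
        "Preserve the original formatting."
    ),
}

def build_prompt_prefix(technique: str, **params) -> str:
    """Return the instruction text to prepend to each row's input text."""
    if technique not in PROMPT_TEMPLATES:
        raise ValueError(f"Unknown augmentation technique: {technique!r}")
    return PROMPT_TEMPLATES[technique].format(**params) + "\n\nInput text: "

# Intended use with the DataFrame from this post (not executed here):
# df["noise_prompt"] = build_prompt_prefix("noise_injection", n=3) + df["input_text"]
```

Keeping the instructions in one dictionary also makes it easy to review the exact wording sent to the model for each technique.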

Using this framework, it is simple to apply all kinds of batch transformations to augment your data. In the notebook, you’ll see more prompts you can use for back translation and noise injection. You’ve also seen how easy it is to enhance datasets with BigQuery DataFrames. We hope this helps you in your data science journey using Gemini on Google Cloud!
