Guided Generation for LLM Outputs

7 min read · Jun 5, 2024

LLMs like GPT-4 and Gemini Pro are useful for generating and manipulating text. To use them reliably in applications, though, it helps to guide the generation process so that the outputs adhere to specific formats or structures.

In this blog, we will explore the following techniques for guided generation with LLMs:

regular expressions
JSON schemas
context-free grammars (CFGs)
templates
entities
structured data generation

Initialization

First, let’s initialize our environment and set up the Vertex AI client with the necessary configuration to ensure our outputs are both useful and safe:

import pandas as pd
import vertexai
from vertexai.generative_models import GenerativeModel, Part, FinishReason
import vertexai.preview.generative_models as generative_models

vertexai.init(project="project-name", location="us-central1")
model = GenerativeModel("gemini-1.0-pro-vision-001")
generation_config = {
    "max_output_tokens": 300,
    "temperature": 0.4,
    "top_p": 0.9,
}
safety_settings = {
    generative_models.HarmCategory.HARM_CATEGORY_HATE_SPEECH: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    generative_models.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    generative_models.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    generative_models.HarmCategory.HARM_CATEGORY_HARASSMENT: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
}
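
The examples in this post stream the response and stitch the chunks back together. As a minimal sketch, that step could be factored into one helper. The name `collect_text` is hypothetical and not part of the Vertex AI SDK; it only assumes that each chunk exposes a `.text` attribute, as the chunks in this post do. The fake chunks stand in for a real streamed response.

```python
from types import SimpleNamespace


def collect_text(chunks):
    """Join the .text of each streamed chunk, then strip the ends once."""
    # Chunks are concatenated unchanged so that spaces and newlines
    # falling on a chunk boundary are preserved.
    return "".join(chunk.text for chunk in chunks).strip()


# Stand-ins for streamed chunks; real ones would come from
# model.generate_content(..., stream=True).
fake_chunks = [
    SimpleNamespace(text="John eats"),
    SimpleNamespace(text=" an apple\n"),
]
print(collect_text(fake_chunks))  # → John eats an apple
```

Stripping once at the end, rather than per chunk, keeps the whitespace between words and lines intact.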

Guided Generation with Regular Expressions

Regular expressions (regex) are a powerful way to ensure that generated text matches a specific pattern.

For example, imagine you need a 6-digit number. By defining a regex pattern, you can validate the generated number and re-prompt the model until the output is exactly six digits with no extra spaces or characters. This approach works well for simple, strictly formatted outputs such as numeric codes or fixed text formats.
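
The validate-and-retry idea can be tried without calling a model at all. The sketch below is an illustration under stated assumptions: `generate_until_valid` and `fake_replies` are hypothetical names, and the lambda stands in for a real model call.

```python
import re


def generate_until_valid(generate_fn, pattern, max_attempts=5):
    """Call generate_fn until its output fully matches pattern.

    Returns (text, attempts_used), or (None, max_attempts) if every
    attempt fails validation.
    """
    compiled = re.compile(pattern)
    for attempt in range(1, max_attempts + 1):
        candidate = generate_fn().strip()
        # fullmatch() requires the whole string to match the pattern.
        if compiled.fullmatch(candidate):
            return candidate, attempt
    return None, max_attempts


# Stand-in for a model: a canned sequence of replies.
fake_replies = iter(["Here you go: 123456", "12345", "482913\n"])
result, attempts = generate_until_valid(lambda: next(fake_replies), r"\d{6}")
print(result, attempts)  # → 482913 3
```

Capping the number of attempts avoids an endless loop (and an endless bill) when the model never produces a valid answer.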

Figure 1. Guided Generation with Regular Expressions

import re

# Exactly six digits and nothing else.
number_pattern = re.compile(r"\d{6}")

text1 = """Generate a valid 6-digit number, ensuring it contains exactly 6 digits with no spaces or other characters. It should be in the format XXXXXX."""


def validate_number(number_str):
    # fullmatch() rejects any leading or trailing characters, including a
    # trailing newline that match() with "$" would accept.
    if number_pattern.fullmatch(number_str):
        return True
    print(number_str)
    print("Invalid 6-digit format. Re-prompting...")
    return False


def generate(max_attempts=5):
    # Reuses the model and generation_config from the Initialization section.
    for _ in range(max_attempts):
        responses = model.generate_content(
            [text1],
            generation_config=generation_config,
            stream=True,
        )

        # Join the streamed chunks, then strip surrounding whitespace once.
        number = "".join(response.text for response in responses).strip()

        if validate_number(number):
            return number
    # Give up instead of looping forever if the model never complies.
    return None


generated_number = generate()
print(generated_number)

Output:

Guided Generation with JSON Schemas

JSON schemas allow you to define the structure and data types of JSON objects. This is particularly useful when you need to generate structured data, such as user profiles, where each profile must include a name, age, and email.

By validating the generated JSON against a schema, you ensure that the output adheres to the expected structure and data types. This technique is useful for applications requiring precise and predictable data formats.

Figure 2. Guided Generation with JSON Schemas

import json

import jsonschema

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
        "email": {"type": "string", "format": "email"}
    },
    "required": ["name", "age", "email"]
}

prompt = """
Generate a JSON object representing a user profile. The JSON object should have the following structure:
{
    "name": "a random first and last name",
    "age": an age between 20 and 40,
    "email": "a unique email address"
}
"""


def validate_json(instance, schema):
    """Validate that the generated JSON matches the schema."""
    try:
        # "format" keywords such as "email" are only enforced when a
        # format checker is supplied.
        jsonschema.validate(
            instance=instance,
            schema=schema,
            format_checker=jsonschema.FormatChecker(),
        )
        return True
    except jsonschema.exceptions.ValidationError as err:
        print("Invalid JSON format. Error:", err)
        return False


def generate_json():
    responses = model.generate_content(
        [prompt],
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=True,
    )
    # Join the streamed chunks, then strip surrounding whitespace once.
    json_response = "".join(response.text for response in responses).strip()
    # The model often wraps JSON in a Markdown code fence; remove it.
    if json_response.startswith("```json") and json_response.endswith("```"):
        json_response = json_response[7:-3].strip()
    try:
        json_object = json.loads(json_response)
    except json.JSONDecodeError as e:
        print("Error decoding JSON:", e)
        return None
    return json_object if validate_json(json_object, schema) else None


generated_profile = generate_json()
print(json.dumps(generated_profile, indent=2) if generated_profile else "None")

Output:

{
  "name": "John Smith",
  "age": 32,
  "email": "john.smith@example.com"
}
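
The fence-stripping step in generate_json only handles a reply that begins and ends with a code fence. Models sometimes add a sentence before or after the JSON, so a slightly more tolerant extractor can help. The following is a sketch using only the standard library; `extract_json` and `FENCE_PATTERN` are hypothetical names, and the sample reply is made up for illustration.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence marker

# A fenced block with an optional language tag; DOTALL lets "." span lines.
FENCE_PATTERN = re.compile(FENCE + r"[a-zA-Z]*\s*(.*?)\s*" + FENCE, re.DOTALL)


def extract_json(text):
    """Return the parsed JSON from a model reply, or None if parsing fails."""
    match = FENCE_PATTERN.search(text)
    # Fall back to the whole reply when no fenced block is present.
    candidate = match.group(1) if match else text.strip()
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None


reply = (
    "Sure! Here is the profile:\n"
    + FENCE + "json\n"
    + '{"name": "John Smith", "age": 32, "email": "john.smith@example.com"}\n'
    + FENCE
)
print(extract_json(reply)["name"])  # → John Smith
print(extract_json("not json"))     # → None
```

The parsed object can then be passed to validate_json exactly as before.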

Guided Generation with Context-Free Grammars

Context-free grammars (CFGs) define a set of production rules for generating structured sentences. CFGs are well suited to text that must follow a specific set of grammatical rules.

E.g., you might want to generate sentences about people performing actions on objects. A CFG can define the structure of these sentences, ensuring they always follow a logical and grammatical pattern. This method is ideal for tasks requiring syntactically correct and varied sentences, such as automated storytelling or dialogue generation.

from nltk import CFG

grammar = CFG.fromstring("""
    S -> NP VP
    NP -> 'John' | 'Mary' | 'Alice' | 'Bob'
    VP -> V Obj
    V -> 'eats' | 'drinks' | 'sees' | 'likes'
    Obj -> Det N
    Det -> 'an' | 'a'
    N -> 'apple' | 'banana' | 'water' | 'book'
""")

# The grammar is only described to the model in the prompt; nothing forces
# the model to follow it.
prompt = f"""
Generate a sentence based on the following context-free grammar:
{grammar}
Ensure there is a space between each word in the sentence.
"""


def generate_cfg_sentence():
    responses = model.generate_content(
        [prompt],
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=True,
    )
    # Join chunks as-is: stripping each chunk would delete spaces that fall
    # on chunk boundaries and glue words together.
    return "".join(response.text for response in responses).strip()


generated_llm_sentence = generate_cfg_sentence()
print(generated_llm_sentence)

Output:

John eats an apple

The following diagram represents the CFG used in the above example:

Figure 3. Context-Free Grammar Chart

S -> NP VP: The start symbol S is expanded into a noun phrase (NP) and a verb phrase (VP).
NP: NP can be any of 'John', 'Mary', 'Alice', or 'Bob'.
VP -> V Obj: The verb phrase VP is expanded into a verb (V) and an object (Obj).
V: V can be any of 'eats', 'drinks', 'sees', or 'likes'.
Obj -> Det N: The object Obj is expanded into a determiner (Det) and a noun (N).
Det: Det can be either 'an' or 'a'.
N: N can be any of 'apple', 'banana', 'water', or 'book'.

Because none of these rules is recursive, the grammar describes a finite set of four-word sentences: 4 names × 4 verbs × 2 determiners × 4 nouns = 128 in total. The grammar constrains structure only. It also derives sentences such as 'Bob drinks an book', so agreement between determiner and noun is not guaranteed. And since the rules reach the model as prompt text, the output is varied but should still be checked against the grammar.
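
Since the grammar is passed to the model only as prompt text, a generated sentence can be checked against it afterwards. Here is a minimal pure-Python sketch, assuming the grammar is re-encoded by hand as a dictionary. The `RULES` dictionary and the helper names are illustrative and not part of NLTK.

```python
from itertools import product

# The grammar from the example, re-encoded as a plain dictionary.
# Keys are nonterminals; each alternative is a list of symbols.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["John"], ["Mary"], ["Alice"], ["Bob"]],
    "VP": [["V", "Obj"]],
    "V": [["eats"], ["drinks"], ["sees"], ["likes"]],
    "Obj": [["Det", "N"]],
    "Det": [["an"], ["a"]],
    "N": [["apple"], ["banana"], ["water"], ["book"]],
}


def expand(symbol):
    """Return every word sequence derivable from symbol.

    Only safe for non-recursive grammars like this one; a recursive
    grammar would never terminate.
    """
    if symbol not in RULES:  # a terminal word
        return [[symbol]]
    sequences = []
    for alternative in RULES[symbol]:
        # Expand each symbol, then combine the options left to right.
        parts = [expand(s) for s in alternative]
        for combo in product(*parts):
            sequences.append([word for part in combo for word in part])
    return sequences


# Every sentence the grammar can derive, joined with single spaces.
LANGUAGE = {" ".join(words) for words in expand("S")}


def follows_grammar(sentence):
    """Check whether a sentence is derivable, ignoring extra whitespace."""
    return " ".join(sentence.split()) in LANGUAGE


print(len(LANGUAGE))                          # → 128
print(follows_grammar("John eats an apple"))  # → True
print(follows_grammar("John apple eats"))     # → False
```

Enumerating the whole language only works because this grammar is small and finite; for recursive grammars a parser would be the appropriate tool.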

Template-based Generation

Template-based generation uses predefined templates to structure the generated text.

E.g., you can create a user profile using a template that specifies placeholders for the name, age, and email. This method ensures that the generated content follows a consistent format, which is particularly useful for applications like automated report generation or content templating where the format is fixed, but the content varies.

prompt = """
Create a user profile using the following template:
Name: {{name}}
Age: {{age}}
Email: {{email}}

Ensure the name is a first name and a last name, the age is a number between 20 and 40, and the email is in the format name@example.com.
"""


def generate_template_based_profile():
    responses = model.generate_content(
        [prompt],
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=True,
    )
    # Join chunks as-is: stripping each chunk can delete spaces and line
    # breaks that fall on chunk boundaries.
    return "".join(response.text for response in responses).strip()


generated_profile = generate_template_based_profile()
print(generated_profile)

Output:

Name:John Smith
Age: 35
Email: john.smith@example.com
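
A filled-in template is still free text, so it is worth parsing it back into fields and checking them against the constraints stated in the prompt. The sketch below uses only the standard library; `parse_profile` and `FIELD_PATTERN` are hypothetical names, and the email check is a deliberately loose assumption rather than a full address validator.

```python
import re

# One "Label: value" pair per line; spaces after the colon are optional.
FIELD_PATTERN = re.compile(r"^(Name|Age|Email):[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def parse_profile(text):
    """Parse Name/Age/Email lines into a dict, or return None if invalid."""
    fields = {label.lower(): value for label, value in FIELD_PATTERN.findall(text)}
    if set(fields) != {"name", "age", "email"}:
        return None
    # The prompt asks for a first and a last name.
    if len(fields["name"].split()) < 2:
        return None
    # The prompt asks for an age between 20 and 40.
    if not fields["age"].isdecimal() or not 20 <= int(fields["age"]) <= 40:
        return None
    # Loose shape check: something@something.something, no spaces.
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", fields["email"]):
        return None
    fields["age"] = int(fields["age"])
    return fields


sample = "Name:John Smith\nAge: 35\nEmail: john.smith@example.com"
print(parse_profile(sample))
# → {'name': 'John Smith', 'age': 35, 'email': 'john.smith@example.com'}
```

A return value of None is the signal to re-prompt, in the same way the regex example does.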

Entity-based Generation

Entity-based generation is about including specific entities in the generated text.

E.g., if you want to generate a paragraph about France, you can specify entities such as the capital (Paris), a famous food (croissant), and the official language (French). This technique ensures that the generated text is relevant and includes the necessary information about the entities, making it ideal for tasks like generating descriptive content or tailored information based on specific data points.

entities = {
    "country": "France",
    "capital": "Paris",
    "famous_food": "croissant",
    "language": "French"
}
prompt = f"""
Generate a 3-line paragraph about {entities['country']} that includes the following entities:
1. The capital city, {entities['capital']}
2. A famous food, {entities['famous_food']}
3. The official language, {entities['language']}
"""
def generate_entity_based_paragraph():
    responses = model.generate_content(
        [prompt],
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=True,
    )
    # Join chunks as-is: adding a space between chunks can split a word
    # that straddles a chunk boundary.
    return "".join(response.text for response in responses).strip()


generated_paragraph = generate_entity_based_paragraph()
print(f"Generated Paragraph: {generated_paragraph}")

Output:

Nestled in the heart of Europe, France boasts the captivating capital of Paris, renowned for its iconic Eiffel Tower and the Louvre Museum. Indulge in the delectable aroma of freshly baked croissants, a culinary staple that embodies the nation's rich gastronomic heritage. The official language, French, echoes through the streets, adding a touch of elegance and sophistication to the vibrant atmosphere.
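
Listing entities in the prompt does not guarantee that they appear in the result, so a quick check after generation is useful. This is a minimal sketch with a hypothetical helper, `missing_entities`, and a shortened sample paragraph. It uses a simple case-insensitive substring test, which is an assumption that suits short, distinctive entity names.

```python
def missing_entities(text, entities):
    """Return the entity values that do not appear in text (case-insensitive)."""
    lowered = text.lower()
    return [value for value in entities.values() if value.lower() not in lowered]


entities = {
    "country": "France",
    "capital": "Paris",
    "famous_food": "croissant",
    "language": "French",
}
paragraph = (
    "France boasts the captivating capital of Paris. "
    "Freshly baked croissants are a culinary staple. "
    "The official language is French."
)
print(missing_entities(paragraph, entities))           # → []
print(missing_entities("Paris is lovely.", entities))  # → ['France', 'croissant', 'French']
```

A non-empty result can trigger a re-prompt that names the missing entities explicitly.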

Structured Data Generation

Structured data generation involves creating data in a tabular format, such as CSV, which can be easily converted into a DataFrame for analysis or processing.

E.g., you might generate a table with columns for Name, Age, Country, and Profession, and populate it with data for several rows. This approach is beneficial for generating datasets or structured information that needs to be processed further, ensuring consistency and ease of use in data-centric applications.

import io

column_headers = ["Name", "Age", "Country", "Profession"]
prompt = f"""
Generate a table with the following columns: {', '.join(column_headers)}.
Provide data for 5 rows.
Ensure that the table is properly formatted as CSV without any line breaks within rows.
"""


def generate_dataframe():
    responses = model.generate_content(
        [prompt],
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=True,
    )
    # Join chunks as-is: stripping each chunk can remove the newline between
    # two CSV rows when it falls on a chunk boundary.
    return "".join(response.text for response in responses).strip()


generated_table = generate_dataframe()

# Parse the CSV text into a DataFrame (pd is imported in the Initialization section).
df = pd.read_csv(io.StringIO(generated_table))
df

Output:

Name	Age	Country	Profession
0	John	30	USA	Software Engineer
1	Mary	25	Canada	Doctor
2	Bob	40	UK	Teacher
3	Alice	28	Australia	Lawyer
4	Tom	35	Germany	Architect

Wrapping Up

Guided generation techniques are key to making sure LLM outputs are useful and well-structured. Using methods like regular expressions, JSON schemas, CFGs, templates, entities, and structured data generation can greatly improve the accuracy and reliability of LLM content. These techniques help ensure the generated text meets specific needs, making it easier to integrate LLMs into real-world applications.

Here is a link to a Jupyter Notebook containing all the above code.

Thanks for reading!

Written by Kopal Garg