Controlling Large Language Model Output with Pydantic

Matt Chinnock
7 min read · Mar 6, 2024


Hint: It doesn’t actually involve using any dials.

As powerful as Large Language Models can be in a conversational chatbot setting, they can be equally difficult to integrate deeper into the workflow of an application. For text-based generative AI to serve as logic nodes in a traditional codebase, where it processes inputs, makes decisions, and provides outputs, a high level of output predictability from the model is essential. Any code that depends on this output will expect certain rules and formats to be followed at every juncture.

The variability in responses you may have noticed from services like ChatGPT and Claude, even when asking the same question, stems from the probabilistic way these models generate their output. This is a stark contrast to the consistent, repeatable outcomes we expect from traditional software engineering. For these two approaches to work well together, we must ensure the model generates precise, relevant content in a predictable format.

Fortunately for us, Pydantic, a versatile and well-known data validation library for Python, offers a robust way to govern the output of our LLMs and make it dependable. In this article we’ll look at how to leverage Pydantic models to regulate model output, and walk through an example to illustrate the approach.

Introduction to Pydantic

Pydantic is a useful package for data validation and settings management in Python applications. It allows developers to define data models using Python’s type annotations, enabling seamless validation, serialization, and deserialization of data.
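As a quick illustration of that workflow, here is a minimal, made-up model; the Article class and its fields are invented for this example, and it uses the same pydantic.v1 import the rest of this article relies on:

from pydantic.v1 import BaseModel, ValidationError

class Article(BaseModel):
    title: str
    word_count: int = 0  # Type constraint plus a default value

# Valid input is coerced and validated...
article = Article(title="Controlling LLM Output", word_count="1200")
print(article.dict())  # {'title': 'Controlling LLM Output', 'word_count': 1200}

# ...and invalid input raises a descriptive ValidationError
try:
    Article(title="Oops", word_count="lots")
except ValidationError as e:
    print(e)  # word_count: value is not a valid integer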

Pydantic Models

Conveniently, Pydantic models serve as blueprints for defining the structure and properties of data. These models are created using Python classes, where each class attribute represents a specific data field along with its associated validation rules. These rules enforce data type constraints, default values, and other criteria, ensuring input data adheres to predefined standards.

Let’s illustrate the concept of Pydantic models in the context of LLM output control.

Suppose our application uses the X API to retrieve tweets about a current event, and we want to use that data downstream. X is a great source of up-to-date information, but tweet content might not always be appropriate for our end users.

Let’s assume we have a list of Python dictionaries containing the three most recent tweets about a fictional wildfire event.

# Example tweets and their metadata
raw_tweets = [
    {"id": "001", "text": "It's been a week since the Sparks Valley fire roared to life, and it's now the largest wildfire in state history.", "date": "03-04-2024"},
    {"id": "002", "text": "Video of the flames destroying my neighborhood. **** that fire! #firessuck", "date": "03-04-2024"},
    {"id": "003", "text": "I blame Biden for the #sparksvalleyfire response", "date": "02-29-2024"},
]

In this case, our aim is to screen the tweets for political content and offensive language using an LLM, ignoring tweets that meet either criterion.

We can define a Pydantic model Tweet with properties isPolitical and isOffensive to govern the output:

from pydantic.v1 import BaseModel, Field, Extra

class Tweet(BaseModel):
    isPolitical: bool = Field(description="Whether the tweet is political")
    isOffensive: bool = Field(description="Whether the tweet is offensive")

    class Config:
        extra = Extra.forbid  # Forbid extra fields not defined in the model

The Tweet model defines two boolean properties: isPolitical and isOffensive, each annotated with a description for added context. The context isn’t just for us; the LLM will use it to better understand the structure of the output it should provide.

The Config class within the model configuration enforces strict validation by forbidding any extra fields not explicitly defined in the model.
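To see what that configuration buys us, here is a quick, purely illustrative check: passing a field the model doesn’t define raises a ValidationError rather than being silently accepted.

from pydantic.v1 import ValidationError

# An unexpected 'sentiment' field is rejected outright
try:
    Tweet(isPolitical=False, isOffensive=False, sentiment="angry")
except ValidationError as e:
    print(e)  # sentiment: extra fields not permitted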

Integrating with LLM Output

Next, we’ll create a function check_tweet() to assess tweet content using a locally run model, in this case Mistral 7B Instruct. We tell the model to conform its output to the structure defined in our Pydantic model, and the resulting JSON object will inform our application as to whether each tweet matches either of our criteria:

# Note:
# This example uses LangChain as a basis for interacting with a
# local Ollama model but conceptually applies to any LLM.

from langchain_community.llms import Ollama
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from pydantic.v1 import ValidationError

llm = Ollama(model="mistral:7b-instruct-v0.2-q6_K")

screened_tweets = []

parser = JsonOutputParser(pydantic_object=Tweet)

def check_tweet(tweet):
    # Extract tweet text
    tweet_text = tweet['text']
    print(tweet_text)

    # Define a prompt template for assessing tweet content
    # and include the formatting instructions
    prompt = PromptTemplate(
        template="""
Assess whether the tweet contains references to politicians or political parties.
Assess if the tweet contains offensive language.
{format_instructions}

{tweet_text}
""",
        input_variables=["tweet_text"],
        partial_variables={"format_instructions": parser.get_format_instructions()},
    )

    # Construct a LangChain chain to connect the prompt template with the LLM and Pydantic parser
    chain = prompt | llm | parser
    result = chain.invoke({"tweet_text": tweet_text})

    print(result)
    print("------")

for tweet in raw_tweets:
    check_tweet(tweet)

To recap, the check_tweet() function interacts with the LLM through a defined prompt template, sets the output format using the Pydantic model, and outputs the resulting JSON.
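If you’re curious what those format instructions look like, you can print them yourself; the parser derives a JSON schema from the Tweet model, including our field descriptions, and wraps it in instructions to return a conforming JSON instance:

# Peek at the instructions the parser injects into our prompt template
print(parser.get_format_instructions())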

Here’s the output:

It's been a week since the Sparks Valley fire roared to life, and it's now the largest wildfire in state history.
{'isPolitical': False, 'isOffensive': False}
------
Video of the flames destroying my neighborhood. **** that fire! #firessuck
{'isPolitical': False, 'isOffensive': {'title': 'IsOffensive', 'description': 'Determination of offensive language in the tweet', 'type': 'boolean', 'data': True}}
------
I blame Biden for the #sparksvalleyfire response
{'isPolitical': True, 'isOffensive': False}
------

It worked! Kinda. You’ll notice that in the second tweet, our LLM got a little hallucinate-y and invented a brand new data structure for isOffensive. It’s valid JSON — it even makes sense to read — but we didn’t tell it to use that structure, it just “dreamt it up”. In practice, our application would fail to ingest this data and would throw an error. We couldn’t even account for this new structure, because a future run of this script might yield an entirely different JSON output, even if the inputs are the same.
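To see why, here is a small, purely illustrative snippet that feeds the hallucinated structure straight into our Tweet model; the nested dict can’t be coerced to a boolean, so validation fails:

from pydantic.v1 import ValidationError

# The hallucinated response from the second tweet, reproduced by hand
hallucinated = {
    "isPolitical": False,
    "isOffensive": {"title": "IsOffensive", "type": "boolean", "data": True},
}

try:
    Tweet(**hallucinated)
except ValidationError as e:
    print(e)  # isOffensive: value could not be parsed to a boolean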

Let’s make some changes in an attempt to tame the beast. First, we can do some light prompt engineering and add some extra instructions to our template:

template="""
Assess whether the tweet contains references to politicians or political parties.
Assess if the tweet contains offensive language.

{format_instructions}

{tweet_text}

Only respond in the correct format, do not include additional properties in the JSON.
""",

Whilst unrelated to Pydantic, it’s important to consider that the language and instructions we use when interacting with LLMs greatly influence the output we get from them. Here we’ve added “Only respond in the correct format, do not include additional properties in the JSON.” to the end of the prompt.

Now let’s add a validation step to check_tweet(), right after the print(result) call:

# Validate the output using the Pydantic model
try:
    tweet_result = Tweet(**result)
    # If valid, approve or reject the tweet based on the LLM output
    if not tweet_result.isPolitical and not tweet_result.isOffensive:
        screened_tweets.append(tweet)
        print(f"Adding {tweet['id']} to screened tweets.")
    else:
        print(f"Rejecting {tweet['id']}.")
except ValidationError as e:
    print("Data output validation error, trying again...")
    # Retry tweet screening if validation fails
    check_tweet(tweet)

You’ll note that should the output fail to validate against the Pydantic model, the script runs the same method again until the output matches the structure we are expecting.

I have a confession to make… I wasn’t being completely honest about our incorrect output the first time around. Yes, it was the result of a real execution of the given code, but it took several attempts to achieve that hallucination. The reality is, with the right prompts and by passing a Pydantic model formatter to a high performing LLM, you will more often than not get the correct data structure back in the response.

So, our rudimentary fix for this example is to just try again until it works. The fact that this approach is viable demonstrates the unpredictability of the output, even when we attempt to exert some control. Fortunately, this occasional unpredictability shows up more in the structure of the output than in the analysis of the tweet content, and that is largely down to the model we’ve chosen and how it’s been trained. When running this script you’ll likely find that a hallucination rarely happens, and when it does the LLM usually produces a valid output on the second attempt.

With that said, please do not use this exact approach in a production environment. A proper implementation would need more rigorous fallbacks for a particularly stubborn tweet/LLM; a rough sketch of one option follows below. Ideally you would also fine-tune the model on labeled examples to better assess tweet sentiment and improve output consistency, though fine-tuning is beyond the scope of this article.
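For illustration only, here is a sketch of what a bounded-retry fallback could look like. The function name check_tweet_with_retries, the max_attempts parameter, and the skip-on-failure behaviour are assumptions made for this sketch rather than part of the original script; it reuses the llm, parser, Tweet, and screened_tweets objects defined earlier.

from pydantic.v1 import ValidationError

def check_tweet_with_retries(tweet, max_attempts=3):
    # Same prompt and chain as check_tweet(), but with a cap on retries
    prompt = PromptTemplate(
        template="""
Assess whether the tweet contains references to politicians or political parties.
Assess if the tweet contains offensive language.

{format_instructions}

{tweet_text}

Only respond in the correct format, do not include additional properties in the JSON.
""",
        input_variables=["tweet_text"],
        partial_variables={"format_instructions": parser.get_format_instructions()},
    )
    chain = prompt | llm | parser

    for attempt in range(max_attempts):
        result = chain.invoke({"tweet_text": tweet['text']})
        try:
            tweet_result = Tweet(**result)
        except ValidationError:
            continue  # Output didn't match the schema; ask the model again
        if not tweet_result.isPolitical and not tweet_result.isOffensive:
            screened_tweets.append(tweet)
            print(f"Adding {tweet['id']} to screened tweets.")
        else:
            print(f"Rejecting {tweet['id']}.")
        return
    # Assumed fallback: give up on this tweet rather than loop forever
    print(f"Skipping {tweet['id']}: no valid output after {max_attempts} attempts.")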

Returning to our simple retry-until-it-works version, let’s check the output again:

It's been a week since the Sparks Valley fire roared to life, and it's now the largest wildfire in state history.
{'isPolitical': False, 'isOffensive': False}
Adding 001 to screened tweets.
------
Video of the flames destroying my neighborhood. **** that fire! #firessuck
{'isPolitical': False, 'isOffensive': True}
Rejecting 002.
------
I blame Biden for the #sparksvalleyfire response
{'isPolitical': True, 'isOffensive': False}
Rejecting 003.
------

Much better! The new output is usable to the rest of our application, and we can have greater confidence that we won’t be storing data that our end users may find inappropriate.

Conclusion

Pydantic has emerged as a powerful tool for controlling the output of LLMs, ensuring that generated content adheres to predefined standards. By leveraging Pydantic models, you can enforce data integrity and streamline the integration of LLM systems into real-world applications.

Here are some final thoughts to consider:

  • Model outputs are unpredictable.
  • We can make them much more predictable with Pydantic.
  • Our code should still account for model hallucinations.
  • Just because the output data structure is correct doesn’t mean the model has always analyzed your content in a way that you would consider correct!
  • Consider fine-tuning your model of choice to further improve accuracy and consistency.

Now that you know how to better integrate LLMs into your application, it’s time to go build something. Have fun!
