Data validation using Pydantic

Mahima Manik
5 min read · Nov 30, 2023

Most of us started our coding journeys with C, C++ or Java in college. Once we have mastered data structures and algorithms, it becomes easy to transition to another programming language.

But when we are writing production-level code or trying to level up our coding game, OOP alone will not help us sail through. We need to understand the coding style of that particular language, leverage the strengths of well-established libraries and adhere to industry best practices.

So, let’s begin! 🥂

Data validation challenges in Python 👀

In Python, we do not need to specify the data type of a variable when declaring or using it. For example:

var = 100
var = 'My name is Python'

When the code interacts with users or other systems, it becomes important to ensure that the input has the right data type. Neglecting to validate data types can result in unforeseen errors. One common way to do this is with the isinstance() function.

def handler(event):
    # data type of event should be a dictionary
    if not isinstance(event, dict):
        raise TypeError

Suppose we further want to validate the following event before processing:

{
    'username': 'Michael Tom',
    'email': 'micael@gmail.com',
    'id': 'a0b23ba9-6542-40af-9ed5-244abdc66506',
    'years_of_experience': 10,
    'companies': {
        'company_name_1': {
            'role': 'SDE 1',
            'tenure': 3
        },
        'company_name_2': {
            'role': 'SDE 2',
            'tenure': 4
        },
        'company_name_3': {
            'role': 'SDM',
            'tenure': 3
        },
    }
}

To perform this validation in code, we have to write a lot of code.

  • Check that it contains the keys username, email, id, years_of_experience and companies
  • Check the data type of each value in the event dictionary:
  • username : string
  • email : string and valid email
  • id : valid UUID
  • years_of_experience : integer
  • companies : dictionary in which each key is a company name and each value is a dictionary with role (string) and tenure (integer)
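
To get a feel for how much code these checks take by hand, here is a rough sketch that uses only the standard library. The function name validate_event, the REQUIRED_KEYS tuple and the deliberately simple email pattern are assumptions made for illustration; they are not part of any library.

```python
import re
from uuid import UUID

# Deliberately simple email pattern, for illustration only
EMAIL_PATTERN = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')
REQUIRED_KEYS = ('username', 'email', 'id', 'years_of_experience', 'companies')


def validate_event(event):
    """Hypothetical helper: raise TypeError or ValueError if the event is malformed."""
    if not isinstance(event, dict):
        raise TypeError('event must be a dictionary')

    missing = [key for key in REQUIRED_KEYS if key not in event]
    if missing:
        raise ValueError(f'missing keys: {missing}')

    if not isinstance(event['username'], str):
        raise TypeError('username must be a string')

    if not isinstance(event['email'], str) or not EMAIL_PATTERN.match(event['email']):
        raise ValueError('email must be a valid email address')

    try:
        UUID(event['id'])
    except (TypeError, ValueError, AttributeError):
        raise ValueError('id must be a valid UUID') from None

    # bool is a subclass of int, so it has to be excluded explicitly
    years = event['years_of_experience']
    if not isinstance(years, int) or isinstance(years, bool):
        raise TypeError('years_of_experience must be an integer')

    if not isinstance(event['companies'], dict):
        raise TypeError('companies must be a dictionary')
    for name, info in event['companies'].items():
        if not isinstance(name, str) or not isinstance(info, dict):
            raise TypeError('each company must map a string name to a dictionary')
        if not isinstance(info.get('role'), str):
            raise TypeError(f'{name}: role must be a string')
        if not isinstance(info.get('tenure'), int):
            raise TypeError(f'{name}: tenure must be an integer')
```

Even this sketch stops at the first problem it finds and says nothing about positive values or extra keys, and every new field means another hand-written branch.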

Using isinstance() for all of this can be cumbersome and does not scale. Maintaining manual validation code is difficult in the long run. This is where Pydantic comes to the rescue. It not only enforces type hints at runtime but also simplifies data handling, making Python code more robust and maintainable.

What is Pydantic ⚡

Pydantic is a data validation library. It can:

  • Validate the data type of items in an object
  • Validate the values stored in these objects
  • Transform the data stored in an object (out of scope here)
  • Parse environment variables (out of scope here)

At the time of writing, Pydantic is downloaded 126M times a month and used by some of the largest and most recognisable organisations in the world!

Let’s define the schema for a simple input dictionary. This is the schema we expect our input to adhere to:

from pydantic import BaseModel, ValidationError

# Define the Pydantic model to parse input
class CandidateInfo(BaseModel):
    username: str

# Code execution starts here
if __name__ == '__main__':
    input_data = {
        'username': 'Michael Tom'
    }

    try:
        data = CandidateInfo(**input_data)
        print(data.model_dump())  # This will print {'username': 'Michael Tom'}
        print(data.username)  # This will print 'Michael Tom'
    except ValidationError as err:
        print('Error validating candidate info: ', err)

Note: After parsing the data through a custom Pydantic class, it becomes easy to access the values inside it. Instead of data['username'], we can now access model attributes using dot notation: data.username 🎉. This approach not only makes the code cleaner but also enhances readability and reduces the likelihood of key-related errors commonly encountered with dictionary access.
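
It is also worth seeing what happens when the input does not match the schema. The sketch below assumes Pydantic v2; describe_errors is a hypothetical helper written for this post. It collects the location and error type of each problem from the errors() method of ValidationError.

```python
from pydantic import BaseModel, ValidationError


class CandidateInfo(BaseModel):
    username: str


def describe_errors(input_data):
    """Hypothetical helper: return (location, error type) pairs, or [] if the input is valid."""
    try:
        CandidateInfo(**input_data)
    except ValidationError as err:
        return [(error['loc'], error['type']) for error in err.errors()]
    return []


print(describe_errors({'username': 'Michael Tom'}))  # valid input, no errors
print(describe_errors({}))                           # username is missing
print(describe_errors({'username': 123}))            # username is not a string
```

In Pydantic v2 an integer is not silently converted to a string, so the last call reports a type error for username instead of storing '123'.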

Defining Schema and validating data 🗃️

Let’s create a Pydantic class for the event example. Along with Pydantic, we use two standard-library modules for validation: typing and uuid.

from pydantic import BaseModel, PositiveInt, ValidationError
from typing import Dict
from uuid import UUID

class CompanyInfo(BaseModel):
    role: str
    tenure: PositiveInt

class CandidateInfo(BaseModel):
    username: str
    email: str
    id: UUID
    years_of_experience: PositiveInt  # PositiveInt ensures value is a positive integer
    companies: Dict[str, CompanyInfo]  # Dictionary with company names as keys and CompanyInfo as values

if __name__ == '__main__':
    # Example usage
    input_data = {
        'username': 'Michael Tom',
        'email': 'micael@gmail.com',
        'id': 'a0b23ba9-6542-40af-9ed5-244abdc66506',
        'years_of_experience': 4,
        'companies': {
            'company_name_1': {
                'role': 'SDE 1',
                'tenure': 3
            },
            'company_name_2': {
                'role': 'SDE 2',
                'tenure': 4
            },
            'company_name_3': {
                'role': 'SDM',
                'tenure': 3
            },
        }
    }

    try:
        candidate = CandidateInfo(**input_data)
        print(candidate.model_dump())
    except ValidationError as err:
        print('Error validating candidate info: ', err)

When we pass the input data to CandidateInfo, we expect all the fields defined in CandidateInfo to be present.

Note: companies should be a dictionary whose keys are strings and whose values comply with another Pydantic schema, CompanyInfo.
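
To illustrate how nested problems are reported, the sketch below (assuming Pydantic v2) feeds the same models an event with a missing email, a malformed id and a negative tenure. The bad_input dictionary is made up for this example.

```python
from typing import Dict
from uuid import UUID

from pydantic import BaseModel, PositiveInt, ValidationError


class CompanyInfo(BaseModel):
    role: str
    tenure: PositiveInt


class CandidateInfo(BaseModel):
    username: str
    email: str
    id: UUID
    years_of_experience: PositiveInt
    companies: Dict[str, CompanyInfo]


# Made-up invalid event: no email, malformed id, negative tenure
bad_input = {
    'username': 'Michael Tom',
    'id': 'not-a-uuid',
    'years_of_experience': 4,
    'companies': {'company_name_1': {'role': 'SDE 1', 'tenure': -3}},
}

try:
    CandidateInfo(**bad_input)
    failed_fields = []
except ValidationError as err:
    # Each error carries a 'loc' tuple that points at the offending field
    failed_fields = [error['loc'] for error in err.errors()]

print(failed_fields)
```

All three problems are reported in a single ValidationError, and the location of the nested one spells out the path through companies, the company name and tenure.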

Frequently used Pydantic functions 🙋‍♀️🙋‍♂️

Customizing Fields in Pydantic model

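Field is used to add extra information and customizations to the fields of a Pydantic model.

  • It can assign a default value to a field in the model.
  • It can assign an alias to a field. The alias changes the name used to populate the field from input data (and, with by_alias=True, the name used in serialized output); the attribute is still accessed by its field name.

from pydantic import BaseModel, Field

class CandidateInfo(BaseModel):
    username: str = Field(default='test user', alias='name')


candidate = CandidateInfo()
print(candidate.username)  # test user

# Input data is populated through the alias
candidate = CandidateInfo(name='Michael Tom')
print(candidate.username)  # Michael Tom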

Note: You can also define validation_alias and serialization_alias for the Field.
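
As a small sketch of those two options, assuming Pydantic v2: validation_alias is the name accepted on input and serialization_alias is the name used on output when by_alias=True is passed. The alias values 'name' and 'userName' are arbitrary choices for illustration.

```python
from pydantic import BaseModel, Field


class CandidateInfo(BaseModel):
    # 'name' is accepted on input, 'userName' is emitted on output with by_alias=True
    username: str = Field(validation_alias='name', serialization_alias='userName')


candidate = CandidateInfo(name='Michael Tom')
print(candidate.username)                   # Michael Tom
print(candidate.model_dump())               # {'username': 'Michael Tom'}
print(candidate.model_dump(by_alias=True))  # {'userName': 'Michael Tom'}
```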

Validating Fields

Pydantic provides validation for built-in Python data types like str, int, etc. You can also write custom validators for individual fields and for the whole model. Here are some examples:

from pydantic import BaseModel, Field

EMAIL_REGEX = r'^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+$'

class CandidateInfo(BaseModel):
    username: str = Field(min_length=3, max_length=30)  # String length validation
    email: str = Field(pattern=EMAIL_REGEX)  # Regex validation
    age: int = Field(gt=21)  # Integer validation: must be greater than 21

More info can be found here.
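
Custom validators can be sketched with the field_validator and model_validator decorators from Pydantic v2. The tenures field and both validation rules below are assumptions made up for this example; they are not part of the event schema used earlier.

```python
from typing import Dict

from pydantic import BaseModel, ValidationError, field_validator, model_validator


class CandidateInfo(BaseModel):
    username: str
    years_of_experience: int
    tenures: Dict[str, int]  # hypothetical field: years spent at each company

    @field_validator('username')
    @classmethod
    def strip_and_require_username(cls, value: str) -> str:
        # Field-level rule: trim whitespace and reject blank names
        value = value.strip()
        if not value:
            raise ValueError('username must not be blank')
        return value

    @model_validator(mode='after')
    def tenures_fit_experience(self):
        # Model-level rule: compares two fields, so it runs after field validation
        if sum(self.tenures.values()) > self.years_of_experience:
            raise ValueError('total tenure exceeds years_of_experience')
        return self


candidate = CandidateInfo(
    username='  Michael Tom ',
    years_of_experience=10,
    tenures={'company_name_1': 3, 'company_name_2': 4, 'company_name_3': 3},
)
print(candidate.username)  # Michael Tom

try:
    CandidateInfo(username='Michael Tom', years_of_experience=4, tenures={'company_name_1': 5})
except ValidationError as err:
    print('Error validating candidate info: ', err)
```

A field validator is a good fit for rules about a single value, while a model validator suits rules that relate several fields to each other.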

Model Serialization

Serialization (a model dump) is the process of converting a Pydantic model or dataclass to a dictionary or a JSON-encoded string. This comes in handy when passing data between different services.

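candidate.model_dump()  # Returns a dictionary; nested models are converted too
candidate.model_dump_json()  # Returns a JSON-encoded string
dict(candidate)  # Returns a dictionary of the fields; nested models are not converted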

There are additional parameters to customize how a model is serialized to a dictionary or JSON, which can be found here.

Note: Various customizations are available for the values Pydantic returns when serializing or deserializing. They are helpful for encapsulating some fields and allowing access only in a defined way.
Example: include or exclude returned fields, apply a serializer function to a field or model, or define a model serializer to return anything you want from the serialization result!

candidate.model_dump(by_alias=True) # Returns {'name': 'test user'} because name is the alias for username
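
A couple of these customizations can be sketched as follows, assuming Pydantic v2. The email-masking rule in mask_email is an arbitrary example of a field serializer and not a recommendation.

```python
from uuid import UUID

from pydantic import BaseModel, field_serializer


class CandidateInfo(BaseModel):
    username: str
    email: str
    id: UUID

    @field_serializer('email')
    def mask_email(self, value: str) -> str:
        # Arbitrary masking rule: keep the first character and the domain
        local, _, domain = value.partition('@')
        return f'{local[:1]}***@{domain}'


candidate = CandidateInfo(
    username='Michael Tom',
    email='micael@gmail.com',
    id='a0b23ba9-6542-40af-9ed5-244abdc66506',
)

print(candidate.model_dump(include={'username'}))  # {'username': 'Michael Tom'}
print(candidate.model_dump(exclude={'id'}))        # email is masked, id is left out
print(candidate.email)                             # the attribute itself is unchanged
```

The serializer only affects the dumped output; the model still holds the original email value.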

Summing up 💥

Pydantic's features go beyond simple validation, model serialization and custom validators; it also integrates with popular frameworks like FastAPI.

Pydantic is a game-changer for Python developers, streamlining data validation and reducing the likelihood of bugs. It encourages cleaner code, enforces best practices, and integrates seamlessly with the Python ecosystem. We encourage you to incorporate Pydantic into your projects and join the community in exploring its full potential.

Hope it helps in making your code better! Thank you for reading this 👋

Mahima Manik

Senior Software Developer at Real Vision, Ex-Amazon, M.Tech Computer Science, IIT Delhi