Data Engineer Things

Things learned in our data engineering journey and ideas on data and engineering.

Pydantic for Experts: Multi-Field Validation

5 min read · Apr 28, 2025


Congratulations 🎉

If you’re reading this, you probably want to sharpen your Python skills and learn some advanced Pydantic functionality.

⚠️ Disclaimer: I’m a contributor to Pydantic.

Photo by Greg Bulla on Unsplash

Introduction

Pydantic is the go-to data validation library for Python. It enforces schemas at runtime through type hints. It also supports “schema coercion”: converting input data to fit the expected schema.

🧠 This article assumes familiarity and proficiency with Pydantic.
For more introductory content, see the examples in the Pydantic documentation.

Problem Statement: Multiple Dependent Fields

How do you tie the presence of several fields to each other?

This problem comes in several levels of complexity.
Let’s start with the simplest case.

Simple: 3 dependent fields

Take this ResponseModel as an example:

import datetime
from typing import Optional
from pydantic import BaseModel


class SimpleResponseModel(BaseModel):
    system_id: str

    email: Optional[str] = None
    email_source_date: Optional[datetime.date] = None
    email_source_id: Optional[str] = None

I want to ensure all 3 email fields are present, or none at all.

Challenging: Groups of dependent fields

A more realistic situation is where there are multiple groups of dependent fields — each “key” having a source_date and source_id:

import datetime
from typing import Optional
from pydantic import BaseModel


class ChallengingResponseModel(BaseModel):
    system_id: str

    email: Optional[str] = None
    email_source_date: Optional[datetime.date] = None
    email_source_id: Optional[str] = None

    address: Optional[str] = None
    address_source_date: Optional[datetime.date] = None
    address_source_id: Optional[str] = None

    account_balance: Optional[int] = None
    account_balance_source_date: Optional[datetime.date] = None
    account_balance_source_id: Optional[str] = None

Here, we have email, address, and account_balance, each expected to have all of its associated fields (source_date and source_id), or none.

Even More Challenging: Multiple models × different groupings

What if we have multiple models, each with different group structures?

import datetime
from typing import Optional
from pydantic import BaseModel


class ChallengingResponseModel(BaseModel):
    system_id: str

    # "= None" added so every field in the group is optional, as in the earlier models
    account_type: Optional[str] = None
    account_type_source_date: Optional[datetime.date] = None
    account_type_source_lob: Optional[str] = None  # <-------- new field
    account_type_source_lob_id: Optional[str] = None  # <----- lob_id instead of source_id

The account_type group varies from previous groups — it has a new field (source_lob) and a variation of the _id field.

We need a clean, reusable approach to handle this complexity.

Solution Overview

Here are 3 solution designs, presented in increasing levels of abstraction.

Solution 1: Model Validator

Let’s start simple. Using a model_validator, you can compare the fields to each other after they’re set on the model:

import datetime
from typing import Optional
from pydantic import BaseModel, model_validator


class SimpleResponseModel(BaseModel):
    system_id: str

    email: Optional[str] = None
    email_source_date: Optional[datetime.date] = None
    email_source_id: Optional[str] = None

    @model_validator(mode="after")
    def all_or_none_emails(self) -> 'SimpleResponseModel':
        """
        All 3 email fields must be present, or none at all
        """
        email_fields = [
            self.email,
            self.email_source_date,
            self.email_source_id
        ]
        # Compare against None so a falsy value (e.g. "") still counts as present
        present = [field is not None for field in email_fields]

        if any(present) and not all(present):
            raise ValueError(
                "All 3 email fields must be present or none at all."
            )
        return self
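
To see the validator in action, here is a small sketch. The is_valid helper and the sample payloads are illustrative assumptions, not part of the original model, and the validator checks "is not None" so that falsy values such as an empty string still count as present.

```python
import datetime
from typing import Optional

from pydantic import BaseModel, ValidationError, model_validator


class SimpleResponseModel(BaseModel):
    system_id: str

    email: Optional[str] = None
    email_source_date: Optional[datetime.date] = None
    email_source_id: Optional[str] = None

    @model_validator(mode="after")
    def all_or_none_emails(self) -> "SimpleResponseModel":
        values = (self.email, self.email_source_date, self.email_source_id)
        present = [value is not None for value in values]
        if any(present) and not all(present):
            raise ValueError("All 3 email fields must be present or none at all.")
        return self


def is_valid(data: dict) -> bool:
    """Hypothetical helper: True if the payload passes validation."""
    try:
        SimpleResponseModel.model_validate(data)
    except ValidationError:
        # A ValueError raised inside the validator surfaces as a ValidationError
        return False
    return True


complete = {
    "system_id": "abc-123",
    "email": "abc@h.com",
    "email_source_date": "2024-01-01",
    "email_source_id": "123",
}
incomplete = {"system_id": "abc-123", "email": "abc@h.com"}

print(is_valid(complete))    # True
print(is_valid(incomplete))  # False
```

Every additional group of fields would need its own copy of this validator, which is where the approach starts to strain.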

When to use:

This solution works well when you have a narrow set of dependent fields.

Simple is good.

When not to use:

Too many dependent fields will result in duplicated code. Also, hardcoded field names can make this brittle and hard to maintain.

Repeated code is bad.

Problem: It’s annoying to test:

You need many test cases, some of which may never appear in “the real world”:

from contextlib import nullcontext

import pytest
from pydantic import ValidationError

# SimpleResponseModel is the model defined in Solution 1


@pytest.mark.parametrize(
    ("data", "expectation"),
    [
        ({}, pytest.raises(ValidationError)),
        ({"system_id": "abc-123"}, nullcontext()),
        ({"system_id": "abc-123", "email": "abc@h.com"}, pytest.raises(ValueError)),
        ({"system_id": "abc-123", "email": "abc@h.com", "email_source_date": "2024-01-01"}, pytest.raises(ValueError)),
        ({"system_id": "abc-123", "email": "abc@h.com", "email_source_date": "2024-01-01", "email_source_id": "123"}, nullcontext())
    ]
)
def test_response_model(data: dict, expectation):
    with expectation:
        _ = SimpleResponseModel.model_validate(data)

⚠️ Needing this validator points to a poor abstraction:
email never exists without the other email_ fields, yet the model lets each one be set independently.

Solution 2: Nested Model

Because a key won’t exist without its dependents, it makes sense to create a nested structure:

{
    "system_id": "abc-123",
    "email_stuff": {
        "email": "me@happy.com",
        "email_source_date": "2024-01-01",
        "email_source_id": "123"
    },
    "another_thing": {
        ...
    }
}

This solution assumes you have control over how this data is being consumed. Ideally, you should influence this process: don’t commit a sin which requires you to continue to sin…

Python code for this nesting is super simple:

import datetime
from typing import Optional
from pydantic import BaseModel


class EmailStuff(BaseModel):
    email: str
    email_source_date: datetime.date
    email_source_id: str


class SimpleResponseModel2(BaseModel):
    system_id: str
    email_stuff: Optional[EmailStuff] = None

Testing becomes simpler:

You don’t need to simulate partially populated EmailStuff objects (unless you expect them in the “real world”).
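
As a rough illustration of why testing gets simpler, the sketch below shows the nested model enforcing the all-or-none rule with no custom validator. The payloads and variable names are assumptions made up for this example.

```python
import datetime
from typing import Optional

from pydantic import BaseModel, ValidationError


class EmailStuff(BaseModel):
    email: str
    email_source_date: datetime.date
    email_source_id: str


class SimpleResponseModel2(BaseModel):
    system_id: str
    email_stuff: Optional[EmailStuff] = None


# No email group at all: valid, the nested field stays None
without_email = SimpleResponseModel2.model_validate({"system_id": "abc-123"})

# Complete email group: valid
with_email = SimpleResponseModel2.model_validate(
    {
        "system_id": "abc-123",
        "email_stuff": {
            "email": "me@happy.com",
            "email_source_date": "2024-01-01",
            "email_source_id": "123",
        },
    }
)

# Partial email group: rejected by the required fields on EmailStuff
missing = []
try:
    SimpleResponseModel2.model_validate(
        {"system_id": "abc-123", "email_stuff": {"email": "me@happy.com"}}
    )
except ValidationError as exc:
    missing = sorted(
        str(err["loc"][-1]) for err in exc.errors() if err["type"] == "missing"
    )

print(missing)
```

The required fields on EmailStuff do the work that the model validator did in Solution 1.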

When to use:

  • If you can control how data is consumed.
  • You have relatively few nested groups. (If you have hundreds of groups, creating the nested models will be tedious.)

When not to use:

  • Sometimes you need to (or want to) keep a flat structure. (Ex: event streaming to a 3rd party.)
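
If only the output has to stay flat, one possible workaround (an assumption on my part, not something the approaches above require) is to validate with nested models and flatten when serializing. The flatten helper below is hypothetical and skips groups that are unset.

```python
import datetime
from typing import Any, Dict, Optional

from pydantic import BaseModel


class EmailStuff(BaseModel):
    email: str
    email_source_date: datetime.date
    email_source_id: str


class SimpleResponseModel2(BaseModel):
    system_id: str
    email_stuff: Optional[EmailStuff] = None


def flatten(model: BaseModel) -> Dict[str, Any]:
    """Hypothetical helper: merge nested groups into one flat dict."""
    flat: Dict[str, Any] = {}
    for name, value in model.model_dump(mode="json").items():
        if isinstance(value, dict):
            flat.update(value)  # nested group: lift its keys to the top level
        elif value is not None:
            flat[name] = value  # plain field; unset (None) groups are skipped
    return flat


response = SimpleResponseModel2.model_validate(
    {
        "system_id": "abc-123",
        "email_stuff": {
            "email": "me@happy.com",
            "email_source_date": "2024-01-01",
            "email_source_id": "123",
        },
    }
)
print(flatten(response))
```

This only helps on the way out; if the incoming payload is flat as well, Solution 1 may still be the better fit.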

Solution 3: Dynamic Creation of Nested Model

We can add a powerful abstraction to our second solution by using Pydantic’s create_model function:

import datetime
from pydantic import BaseModel, create_model

# Each field is given as (type, ...), where ... marks it as required
EmailStuff = create_model(
    "EmailStuff",
    email=(str, ...),
    email_source_date=(datetime.date, ...),
    email_source_id=(str, ...),
    __base__=BaseModel
)

We can abstract further by creating a helper function which wraps the pydantic function:

import datetime
from typing import Type, Optional
from pydantic import BaseModel, create_model


# dtype defaults to str so that create_stuff("email") works as called below
def create_stuff(name: str, dtype: Type = str) -> Type[BaseModel]:
    """dynamically creates nested schema for '_Stuff' objects"""
    fields = {
        name: (dtype, ...),
        f"{name}_source_date": (datetime.date, ...),
        f"{name}_source_id": (str, ...)
    }
    return create_model(
        f"{name.title()}Stuff",
        **fields,
        __base__=BaseModel
    )


class ResponseModel3(BaseModel):
    system_id: str
    email_stuff: Optional[create_stuff("email")] = None
    address_stuff: Optional[create_stuff("address")] = None

Very neat! 🎉 We’ve now solved both the simple and the complex case.
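
To make the generated models less abstract, this sketch inspects what such a helper produces. It assumes a create_stuff variant whose dtype defaults to str; the BalanceStuff example and its payload are made up for illustration.

```python
import datetime
from typing import Type

from pydantic import BaseModel, create_model


def create_stuff(name: str, dtype: Type = str) -> Type[BaseModel]:
    """Build a '<Name>Stuff' model with the value plus its source fields."""
    fields = {
        name: (dtype, ...),  # ... marks the field as required
        f"{name}_source_date": (datetime.date, ...),
        f"{name}_source_id": (str, ...),
    }
    return create_model(f"{name.title()}Stuff", __base__=BaseModel, **fields)


AddressStuff = create_stuff("address")
print(AddressStuff.__name__)             # AddressStuff
print(list(AddressStuff.model_fields))   # the three generated field names

# A non-string value type, passed through dtype
BalanceStuff = create_stuff("account_balance", int)
balance = BalanceStuff.model_validate(
    {
        "account_balance": 100,
        "account_balance_source_date": "2024-01-01",
        "account_balance_source_id": "123",
    }
)
print(balance.account_balance)
```

Note that str.title() capitalizes after underscores, so the second model is named Account_BalanceStuff. That is harmless at runtime, but you may prefer to build the class name differently.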

Abstracting even further:

What if we want to control which fields get created, without creating a new create_stuff(...) function each time?

import datetime
from typing import Type, Optional, Dict
from functools import partial
from pydantic import BaseModel, create_model


# dtype defaults to str; fields is keyword-only so partial(...) can pre-fill it
def create_stuff_base(name: str, dtype: Type = str, *, fields: Dict[str, Type]) -> Type[BaseModel]:
    model_fields = {
        name: (dtype, ...)
    }
    for k, v in fields.items():
        model_fields[f"{name}_{k}"] = (v, ...)

    return create_model(
        f"{name.title()}Stuff",
        **model_fields,
        __base__=BaseModel
    )


create_stuff_1 = partial(create_stuff_base, fields={"source_date": datetime.date, "source_id": str})
create_stuff_2 = partial(create_stuff_base, fields={"source_date": datetime.date, "source_lob": str, "source_lob_id": str})


class ChallengingResponseModel(BaseModel):
    system_id: str

    email_stuff: Optional[create_stuff_1("email")] = None
    account_stuff: Optional[create_stuff_2("account_type")] = None

Partial functions to the rescue!
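
Here is a short sketch of a partial-based factory in use. It assumes a create_stuff_base whose dtype defaults to str and whose fields argument is keyword-only; the names create_lob_stuff and AccountResponse and the sample values are hypothetical.

```python
import datetime
from functools import partial
from typing import Dict, Optional, Type

from pydantic import BaseModel, ValidationError, create_model


def create_stuff_base(
    name: str, dtype: Type = str, *, fields: Dict[str, Type]
) -> Type[BaseModel]:
    """Build a '<Name>Stuff' model from a value field plus suffixed fields."""
    model_fields = {name: (dtype, ...)}
    for suffix, field_type in fields.items():
        model_fields[f"{name}_{suffix}"] = (field_type, ...)
    return create_model(f"{name.title()}Stuff", __base__=BaseModel, **model_fields)


# One factory per group structure; only the suffix mapping differs
create_lob_stuff = partial(
    create_stuff_base,
    fields={"source_date": datetime.date, "source_lob": str, "source_lob_id": str},
)

AccountTypeStuff = create_lob_stuff("account_type")


class AccountResponse(BaseModel):
    system_id: str
    account_stuff: Optional[AccountTypeStuff] = None


ok = AccountResponse.model_validate(
    {
        "system_id": "abc-123",
        "account_stuff": {
            "account_type": "checking",
            "account_type_source_date": "2024-01-01",
            "account_type_source_lob": "retail",
            "account_type_source_lob_id": "42",
        },
    }
)

# Leaving out part of the group is rejected, just as with hand-written models
try:
    AccountResponse.model_validate(
        {"system_id": "abc-123", "account_stuff": {"account_type": "checking"}}
    )
    partial_group_rejected = False
except ValidationError:
    partial_group_rejected = True

print(list(AccountTypeStuff.model_fields))
print(partial_group_rejected)
```

Building the nested type once and reusing it (AccountTypeStuff above) also keeps the annotation readable for people and type checkers, compared with calling the factory inline inside Optional[...].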

When to use this:

  • You have many nested schemas with varying group structures.
  • In larger projects.

When not to use:

  • Try the simpler things first.
  • A dynamic function should simplify the problem more than it complicates the solution.

Summary

How do you handle multiple dependent fields in Pydantic? It depends…

We discussed several solution designs:

  • Model validator: A simple solution when you can’t control the data structure
  • Nested models: The simplest solution when you can control the data structure
  • Dynamic creation of nested models: A powerful abstraction for creating nested models dynamically

When implementing, start with the simplest solution. If it gets unwieldy, then proceed with more abstractions.

Pydantic for Experts Series:

This article is part of a series on advanced usage of Pydantic.

  1. Don’t Write Another Line of Code Until You See These Pydantic V2 Breakthrough Features
    An overview of several features I’m most excited about, introduced in V2.
  2. Pydantic for Experts: Discriminated Unions in Pydantic V2
    Differentiate model selection.
  3. Pydantic for Experts: Reusing & Importing Validators
    Advanced techniques for reusing and importing validation across python models.


Published in Data Engineer Things



Written by Yaakov Bressler

Data Engineer @ Capital One. Editor in Chief @ Data Engineer Things. More about me at www.yaakovbressler.com
