Migrating to Pydantic V2

Brecht Verhoeve · Published in CodeX · 10 min read · Mar 27, 2024

On the 30th of June 2023, the second version of Pydantic, the popular data validation and parsing Python library, was released. This major overhaul of the library promises better performance and reliability. This is achieved by a new implementation of the validation logic, pydantic-core, written in Rust instead of Python. Based on benchmarks, this makes Pydantic V2 between 4x and 50x faster than Pydantic v1.9.1. Its reliability and consistency are also improved, by offering a strict mode that no longer magically coerces data, clear rules on required vs nullable fields, and more flexible validation options.
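To make that stricter behaviour concrete, here is a minimal sketch (not taken from the release announcement) contrasting the default lax mode, which still coerces compatible input, with strict mode, which rejects it:

# Lax (default) vs strict validation in Pydantic V2
from pydantic import BaseModel, ConfigDict, ValidationError

class UserLax(BaseModel):
    id: int

class UserStrict(BaseModel):
    model_config = ConfigDict(strict=True)
    id: int

UserLax(id="42")         # OK: the string "42" is coerced to the int 42
try:
    UserStrict(id="42")  # strict mode refuses to coerce the string
except ValidationError as exc:
    print(exc)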

At Qargo, we heavily rely on Pydantic for various tasks, including data validation for our APIs, deserialization of unstructured data from databases, and providing structured schemas for internal templating tools. However, as our data volume grew with our scaling operations, we began experiencing performance bottlenecks during (de)serialization of Pydantic models. The increasing complexity of our data, particularly with numerous nested BaseModels, resulted in unacceptable delays, sometimes multiple seconds for certain models. No bueno.

Our motivation was clear: we needed to migrate to V2 to maintain optimal performance in our transportation management system (TMS). However, this decision came with challenges. Given the business-critical nature of our TMS, any downtime or data loss results in immediate user impact. Furthermore, with a codebase comprising hundreds of thousands of lines of Python, there’s a lot to cover in a migration. Complicating matters, compatibility issues arose between Pydantic V1 and upgraded versions of mypy, meaning that we had to suspend type checks for quite a few files containing Pydantic code, as we were getting incorrect errors.

As I embarked on this project, I quickly realized there’s not much out there to guide a migration of this magnitude. Aside from the (excellent) Pydantic documentation, I only came across a single video and one article that deemed the migration process straightforward. This left me to navigate the task largely on my own. That’s why I decided to write this article, so it can hopefully help you when undertaking a large-scale Pydantic migration. I’ll cover the approach we took, the problems we encountered, and the lessons learned.

The migration in theory

First of all, the folks at Pydantic did a fantastic job at providing a host of tools to help migrate from V1 to V2. Their comprehensive migration guide details all the differences between V1 and V2 and the recommended way of transforming your code to a V2 way of working. This guide became my best friend throughout the process.

As with any major library upgrade, quite a few concepts behave differently, classes and methods have been renamed or deprecated, and there are new patterns for solving problems. The guide lays out the changes and does a decent job of explaining the design choices. Because the guide is so extensive, you’ll find plenty of detail, but it’s also easy to get lost. I recommend reading it fully once and then focusing on the V1 features used most in your codebase.

Beyond the documentation, the team also built a code transformation tool, Bump Pydantic, that you can apply to the files you want to upgrade. The tool ships with a set of rules that rewrite common V1 patterns to their V2 equivalents, such as adding a default None value to optional fields:

# V1 code
class User(BaseModel):
    name: Optional[str]

# New V2 code
class User(BaseModel):
    name: Optional[str] = None

Encouraged by the documentation and tooling, which suggest that the migration is not too difficult and can be completed in a couple of hours, we set out on our migration journey!

The migration in practice

Upgrading to V2

We started by upgrading Pydantic to V2 without migrating. This is possible because the full V1 library is bundled inside the V2 package to allow for a smooth upgrade process. All we had to do across the codebase was change our imports:

# OLD pydantic V1 import
from pydantic import BaseModel

# NEW pydantic v1 import in V2
from pydantic.v1 import BaseModel

I encourage first upgrading and working with V2 in this way. This ensures stability for existing code while providing access to V2 so you can start using it for new models. This helps you and your team gain experience working with V2 features, reducing the risk of breaking things.

Gradually migrating to V2

To further mitigate risk, we opted to migrate the codebase in parts. Based on metrics from the production system, we identified key Pydantic models that were (de)serialized often and showed subpar performance.

Migrating the system in parts means you’ll have both V1 and V2 concepts mixed in your codebase. This can be confusing, considering that a lot of concepts share the same names (e.g. BaseModel). We therefore opted for explicit V1 imports, with V2 as the default for upgraded modules, resulting in more consistency.

from pydantic import BaseModel, Field, model_validator
from pydantic import v1 as pydantic_v1

class Bar(BaseModel):
    foo: int

class Foo(pydantic_v1.BaseModel):
    bar: int

Mixing V1 and V2 models

The first challenge we encountered was the widespread use of models nested within each other. As Pydantic’s V1 BaseModel is not compatible with the V2 BaseModel, once you decide to upgrade one model to V2, you also have to upgrade all of the models it uses as attributes and all of the models that include it. If your codebase has thousands of Pydantic models with a high degree of reuse, this can feel like pulling the bottom block from a Jenga tower. These dependencies hindered the plan of migrating module by module. The process can be made easier with dependency mapping tools like pydeps, but even with those we were sometimes unable to catch all of the dependencies.


When you have a mix of V1 and V2 models, two errors will show up:

"pydantic\validators.py": no validator found for <class 'YourBaseModel'>

This error shows up when a V2 model is used as an attribute of a V1 model. These are easy to catch, as they trigger at import time, e.g. when running a test suite. However, if your V1 model has arbitrary_types_allowed in its model config, you’re out of luck: the error won’t trigger, but you will get weird runtime behaviour, such as .dict() and .json() not behaving as expected. This can be mitigated by temporarily disabling the arbitrary_types_allowed config during the migration.
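For illustration, a minimal reproduction of this first error (the model names are made up):

# A V2 model used as an attribute of a V1 model
from pydantic import BaseModel                    # V2
from pydantic.v1 import BaseModel as BaseModelV1  # V1

class Child(BaseModel):       # V2 model
    name: str

class Parent(BaseModelV1):    # V1 model referencing the V2 model
    child: Child              # raises "no validator found for <class 'Child'>" at import time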

TypeError: BaseModel.validate() takes 2 positional arguments but 3 were given

You’ll get this error message when an instance of a BaseModel is instantiated; it indicates that a V1 model is being used as an attribute of a V2 model. Unfortunately, it only triggers at runtime and is notoriously hard to spot and debug. The error only triggers when data is passed for the offending attribute, which is a pain if your model has many optional fields. The error message names neither the V2 model raising the exception nor the V1 model causing it. To debug, inspect the data being passed and scour the model to find the culprit. We resorted to writing unit tests for the data triggering the error and inspecting each of its attributes separately.
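Again for illustration, a minimal reproduction of the second error (model names are made up); note that the module imports fine and the failure only appears once data reaches the V1 attribute:

# A V1 model used as an attribute of a V2 model
from pydantic import BaseModel                    # V2
from pydantic.v1 import BaseModel as BaseModelV1  # V1

class Child(BaseModelV1):     # V1 model
    name: str

class Parent(BaseModel):      # V2 model referencing the V1 model
    child: Child

# TypeError: BaseModel.validate() takes 2 positional arguments but 3 were given
Parent(child={"name": "front-left"})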

When a model had too many dependencies, we put up clear boundaries by providing both V1 and V2 variants of the same model. This helps isolate the upgrade but does lead to code duplication, which can result in inconsistencies. The technique should be used sparingly and refactored away as soon as possible after the initial upgrade. We prefixed the duplicated variants with V1 or V2, depending on the direction of the migration, making the boundary explicit.

class Bar(BaseModel):
    key: str
    name: str


class V1Bar(pydantic_v1.BaseModel):
    key: str
    name: str


class Foo(BaseModel):
    bar: Bar


class V1Foo(pydantic_v1.BaseModel):
    bar: V1Bar

It would be very beneficial if the Pydantic framework could detect a mix of V1 and V2 models at import time and provide meaningful error messages to help resolve it.

Missing functionality

Although most features have been ported 1:1 with improved usability, not all features made it into V2.

each_item=True

One of these is validators using the each_item=True parameter. In V1, this caused the validation logic to run for every item of an iterable attribute, validating each item individually. If the item you’re iterating over is itself a BaseModel, you can move the validation to that model instead. But if you need another attribute of the parent object for the validation, each_item was very handy, as the following trivial example shows.

class Wheel(BaseModel):
    size_cm: float

class Vehicle(BaseModel):
    vehicle_type: Literal["car", "truck"]
    wheels: list[Wheel]

    @validator('wheels', each_item=True)
    def check_wheel(cls, v, values):
        if values.get('vehicle_type') == 'car' and v.size_cm > 50:
            raise ValueError('Tire size is too big for a car!')
        elif values.get('vehicle_type') == 'truck' and v.size_cm < 50:
            raise ValueError('Tire size is too small for a truck!')

        return v

In V2, this feature no longer exists, but we can still implement the same behaviour in other, albeit more verbose, ways.

Manually iterating over the iterable and triggering the validation
This is a simple approach, but it doesn’t group the exceptions, leading to an “N+1” validation problem: if there are multiple invalid items in the iterable, only the first one raises an error. To validate the others, the first one has to be fixed, then the next, and so on, resulting in a bad user experience.

# Approach nr. 1 using separate validators

class Wheel(BaseModel):
    size_cm: float


class Vehicle(BaseModel):
    vehicle_type: Literal["car", "truck"]
    wheels: list[Wheel]

    @field_validator("wheels")
    def check_wheel(cls, v, info):
        vehicle_type = info.data.get("vehicle_type")

        for wheel in v:
            if vehicle_type == "car" and wheel.size_cm > 50:
                raise ValueError("Tire size is too big for a car!")
            elif vehicle_type == "truck" and wheel.size_cm < 50:
                raise ValueError("Tire size is too small for a truck!")

        return v

‘Injecting’ the dependent attribute into the child model as a private attribute and implementing the validation there
Contrary to the first approach, this does group the exceptions, but it is quite a bit more complex, as the parent model needs to control the instantiation of the child model(s). When the child model is reused by other models, each of them has to implement the injection logic.

# Approach nr. 2 using injection

class Wheel(BaseModel):
    _vehicle_type: str = PrivateAttr(default="")
    size_cm: float

    @model_validator(mode='after')
    def check_size(self):
        if self._vehicle_type == "car" and self.size_cm > 50:
            raise ValueError("Tire size is too big for a car!")
        elif self._vehicle_type == "truck" and self.size_cm < 50:
            raise ValueError("Tire size is too small for a truck!")

        return self

class Vehicle(BaseModel):
    vehicle_type: Literal["car", "truck"]
    wheels: list[Wheel]

    @model_validator(mode="before")
    def check_wheel(cls, data):
        if not isinstance(data, dict):
            return data

        vehicle_type = data.get("vehicle_type")
        for wheel in data.get("wheels", []):
            wheel['_vehicle_type'] = vehicle_type

        return data

We went for the second approach, as the exception grouping justified the additional complexity. The lack of grouping in the first approach could also be addressed with ExceptionGroup, available since Python 3.11.
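For completeness, here is a rough sketch (not what we shipped) of how the first approach could use ExceptionGroup on Python 3.11+ to report all invalid items at once. Note that Pydantic does not fold an ExceptionGroup into its own ValidationError, so it propagates as a plain exception:

# Sketch: grouping per-item failures with ExceptionGroup (Python 3.11+)
from typing import Literal
from pydantic import BaseModel, field_validator

class Wheel(BaseModel):
    size_cm: float

class Vehicle(BaseModel):
    vehicle_type: Literal["car", "truck"]
    wheels: list[Wheel]

    @field_validator("wheels")
    def check_wheels(cls, v, info):
        vehicle_type = info.data.get("vehicle_type")
        errors = []
        for index, wheel in enumerate(v):
            if vehicle_type == "car" and wheel.size_cm > 50:
                errors.append(ValueError(f"wheel {index}: tire size is too big for a car"))
            elif vehicle_type == "truck" and wheel.size_cm < 50:
                errors.append(ValueError(f"wheel {index}: tire size is too small for a truck"))
        if errors:
            # ExceptionGroup is a built-in from Python 3.11 onwards
            raise ExceptionGroup("wheel validation failed", errors)
        return v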

json_encoders

V1’s json_encoders allowed you to define custom serializers for data types. These have been replaced by functional serializers such as field_serializer, which indicate for each field how it should be serialized. The new approach nudges you towards dedicated functions or annotated fields that carry their own serialization logic, making it more reusable and explicit (see the sketch after the list below). However, there are two drawbacks to this approach:

  1. The serializer has to be added to each model individually, adding boilerplate code. Since the amount of boilerplate is limited, this doesn’t outweigh the benefits though.
  2. If you have a top-level model that wants to enforce different serialization for a lower-level model, you’re stuck. An example case for us was a datetime field that omitted milliseconds during serialization. For reuse, we didn’t implement this in the low-level model, so that higher-level models could choose how to serialize it. With field_serializer you need to add the serializer to the lower model and cannot differentiate per parent. Another option is to override the .model_dump() method, but I find this a very imperative way of working that goes against the grain of the Pydantic framework.
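To make the functional-serializer style concrete, here is a minimal sketch using the millisecond-stripping datetime case mentioned above; the Event and Booking models are made up for illustration:

from datetime import datetime
from typing import Annotated
from pydantic import BaseModel, PlainSerializer, field_serializer

# Option A: a field_serializer declared on the model itself
class Event(BaseModel):
    created_at: datetime

    @field_serializer("created_at")
    def serialize_created_at(self, value: datetime) -> str:
        # Drop sub-second precision during serialization
        return value.replace(microsecond=0).isoformat()

# Option B: a reusable annotated type that carries its serialization logic
DateTimeNoMillis = Annotated[
    datetime,
    PlainSerializer(lambda v: v.replace(microsecond=0).isoformat(), return_type=str),
]

class Booking(BaseModel):
    pickup_at: DateTimeNoMillis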

Finding and replacing deprecated methods and functions

A lot of V1 functions have been marked as deprecated in V2. It’s great that the Pydantic team didn’t remove them outright, as that would have broken a lot of code. Instead, a DeprecationWarning is emitted at runtime, indicating which new counterpart should be used. Some static code analysis tools, such as Pylance, will also flag deprecated calls in your IDE, further assisting the upgrade, and you can run these tools from a terminal to find all deprecated calls across your codebase.
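For reference, these are the renames we ran into most often (User is just the model from the earlier Bump Pydantic example):

user = User(name="Ada")

# Deprecated V1-style calls emit a DeprecationWarning in V2...
user.dict()
user.json()
User.parse_obj({"name": "Ada"})

# ...and have V2 counterparts:
user.model_dump()
user.model_dump_json()
User.model_validate({"name": "Ada"})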

Although these calls are easy to find and non-breaking, I found it difficult to get the team to use the new methods consistently, as there is no real incentive to update them. This is a matter of educating and motivating the team and checking for it during PR reviews.

Business logic checking for BaseModel

We sometimes have logic that checks whether a variable is a Pydantic model using isinstance(obj, BaseModel). This code breaks when the variable holds a V2 model while the check imports the V1 BaseModel (or vice versa). These cases can be hard to spot, as code checking for Pydantic models is often quite low-level, so the implicit assumption is not immediately visible. We found it best to add both the V1 and V2 BaseModel to the isinstance checks by default, as this removes the chance of errors entirely until the full codebase has been upgraded.
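Roughly what such a check looks like once both generations are accepted (is_pydantic_model is an illustrative helper, not our actual code):

from pydantic import BaseModel
from pydantic.v1 import BaseModel as BaseModelV1

def is_pydantic_model(obj: object) -> bool:
    # Accept both V1 and V2 models until the full codebase is upgraded
    return isinstance(obj, (BaseModel, BaseModelV1))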

Field no longer supports **kwargs

As we have been using Pydantic for quite some time, we built some internal tooling that relies on additional annotations on Field. These were provided via **kwargs and stored as extra metadata on the field. We used this to mark fields containing sensitive data such as credentials, to provide samples, and so on.

This feature is no longer supported in V2: if unknown **kwargs are passed to Field, the model definition breaks at import time. This can be solved by working with real Annotated fields and using built-in concepts such as the examples argument, which puts a sample in the JSON schema of the model. If it’s just metadata, it can also be provided by passing a dictionary to the json_schema_extra keyword argument.
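A small sketch of the difference; the sensitive flag is a hypothetical piece of metadata similar to what our internal tooling used:

from typing import Annotated
from pydantic import BaseModel, Field

# V1: arbitrary **kwargs on Field were stored as extra metadata, e.g.
#   api_key: str = Field(..., sensitive=True, example="abc123")

# V2: samples go through `examples`, custom metadata through `json_schema_extra`
class Credentials(BaseModel):
    api_key: Annotated[
        str,
        Field(examples=["abc123"], json_schema_extra={"sensitive": True}),
    ]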

Conclusion

The migration from Pydantic V1 to V2 posed bigger challenges than the available sources and the migration guide led us to expect. I argue for a carefully planned migration in multiple steps, to mitigate risk and let the team gain experience with the new V2 features. The detailed migration guide, the inclusion of V1 in the V2 package, and the deprecation warnings help smoothen this process.

However, some challenges could be alleviated by better support from the Pydantic framework, such as import time checks for mixing V1 and V2 models, improved error messaging, and strategies for dealing with missing functionality.

Brecht is the backend engineering team lead at Qargo, working on the ongoing Pydantic migration effort.
