A software configuration — A photo from Benjamin

The configuration schema using Pydantic

Overfitted Cat
4 min read · Apr 9, 2022

As software over-engineers, we want to extract parameters into configuration. This way we can change the behavior of the application more easily, without changing code. It all feels like sitting in a cockpit with all its knobs and buttons. It’s flexible, but it can be a nightmare for maintenance and for onboarding new members. In this blog post, we will cover settings modeling in Python.

Let’s say we have an interface called MLModel that has a single method, predict. MLModel is an abstraction with two implementations, MockedModel and RestModel. A factory takes care of instantiation based on the configuration, which names the concrete implementation we want to use, either MockedModel or RestModel, and also contains its parameters. This way, our system doesn’t care which MLModel we are using; it just knows how to get an object and call the predict method. We can change the behavior of the application by changing the configuration, without modifying the code. The code example is down below.
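
A minimal sketch of such an abstraction might look like the following. The predict signature, the constructor parameters, and the JSON payload format of RestModel are assumptions made for illustration, not a fixed contract.

```python
import json
from abc import ABC, abstractmethod
from typing import Sequence
from urllib import request


class MLModel(ABC):
    """Interface the rest of the application depends on."""

    @abstractmethod
    def predict(self, features: Sequence[float]) -> float:
        ...


class MockedModel(MLModel):
    """Returns a fixed value; handy for tests and local development."""

    def __init__(self, value: float = 0.0) -> None:
        self.value = value

    def predict(self, features: Sequence[float]) -> float:
        return self.value


class RestModel(MLModel):
    """Delegates prediction to a remote HTTP service."""

    def __init__(self, http_endpoint: str, timeout: int = 10) -> None:
        self.http_endpoint = http_endpoint
        self.timeout = timeout

    def predict(self, features: Sequence[float]) -> float:
        # Assumed wire format: JSON in, JSON out with a "prediction" key.
        body = json.dumps({'features': list(features)}).encode('utf-8')
        req = request.Request(
            self.http_endpoint,
            data=body,
            headers={'Content-Type': 'application/json'},
        )
        with request.urlopen(req, timeout=self.timeout) as resp:
            return float(json.load(resp)['prediction'])
```

Callers only ever see MLModel, so swapping MockedModel for RestModel does not touch the calling code.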

A simple example of ML Model abstraction

Now we can ask ourselves: how should we store the model configuration in Python? Are dictionaries enough, or can we use something more powerful?

The Python dictionaries

The first thing that comes to mind is using dictionaries as the medium for passing configuration to our factory. Well, why not? They are flexible, and you can put whatever you like in them. It is such a simple way to define anything. We can have a key, e.g. model_type, that names the strategy (concrete) class we want to use, and the rest of the inner dictionary holds the configuration for that concrete class.

'ml_model': {
    'model_type': 'rest',
    'http_endpoint': 'http://localhost:9050/model',
    'timeout': 10
}

Neat. For such a simple example, it works like a charm. However, it comes at a price. There is no schema, no type hints, and no validation. It is effortless to make a typo that can haunt you for hours. Let’s look at an example.

'ml_model': {
    'model_type': 'rest',
    'timeout': '20'
}

Here, we forgot the http_endpoint parameter, right? Did you also notice that timeout is a string this time, when it should be an int? Which values are acceptable for model_type? What range is OK for timeout; is -100 valid? Without a schema and validation in one place, we either need to check the documentation (if someone has kept it up to date) or dive into the code and check for ourselves. In one of my previous blogs, I wrote about the importance of proper documentation. Of course, this example is too simple to show why having a schema matters, but I hope you see the point.

Look at the naïve example. The dict_factory is the simplest if-else that checks model_type and, based on its value, instantiates the right MLModel. We just need to pass in a proper dictionary and we receive a model hidden behind the MLModel interface. Alas, if we make a mistake in the configuration and we don’t have any validation, the error will only surface somewhere at runtime.
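
A naïve factory along these lines could be sketched as follows. The class bodies are stripped-down stand-ins and the 'mocked' type value is an assumption; only the dictionary handling matters here.

```python
class MLModel:
    def predict(self, features):
        raise NotImplementedError


class MockedModel(MLModel):
    def predict(self, features):
        return 0.0


class RestModel(MLModel):
    def __init__(self, http_endpoint, timeout):
        self.http_endpoint = http_endpoint
        self.timeout = timeout

    def predict(self, features):
        raise NotImplementedError('HTTP call omitted in this sketch')


def dict_factory(config: dict) -> MLModel:
    """Pick the concrete model from a plain dictionary, with no validation."""
    settings = config['ml_model']
    if settings['model_type'] == 'mocked':
        return MockedModel()
    elif settings['model_type'] == 'rest':
        # A missing key raises KeyError only when this line runs.
        return RestModel(settings['http_endpoint'], settings['timeout'])
    raise ValueError(f"Unknown model_type: {settings['model_type']!r}")


model = dict_factory({
    'ml_model': {
        'model_type': 'rest',
        'http_endpoint': 'http://localhost:9050/model',
        'timeout': '20',  # wrong type, yet nothing complains
    }
})
print(type(model.timeout))  # the string slipped through unchecked
```

The string timeout is accepted silently and would only cause trouble later, when something tries to use it as a number.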

We can implement validation ourselves wherever we need it, but it’s too easy to make a mistake, and it clutters the code. Fortunately, there are many ways to solve this problem. In this blog, we will explore Pydantic.

The Pydantic

Pydantic is a data validation and settings management library that uses type-hint annotations. Since it relies on type hints, you don’t have to learn a new way to define a schema; you already know it. It has nice dataclass integration, and you can define your own data types and de/serialization rules. There are benchmarks saying Pydantic is fast, but honestly, I haven’t compared its speed to the alternatives myself. However, one of the most powerful uses of Pydantic is settings management with runtime validation. Briefly, it supports combining several sources, such as environment variables, secret files, and others. You can find more information and features at https://pydantic-docs.helpmanual.io/usage/settings/. For now, let’s focus on our use case, a polymorphic settings structure, and how we can use Pydantic for it.

We have a class, AppSettings, which is the global configuration class containing the configuration for all components. Here it has one component-specific field, ml_model, which holds the settings for the concrete MLModel implementation. The application determines the concrete implementation when it loads the configuration on start. How can the application know which one to pick?

The trick is in the Union type hint. With it, we can say that the settings for ml_model are an instance of either MockedModelSettings or RestModelSettings. Apart from that, we can tell Pydantic to look at a specific field that serves as a discriminator between the concrete settings classes. Here it is model_type. The Literal type declares the value each class expects for that field in the dictionary (our configuration). The discriminator also improves performance, as Pydantic validates only the matched model instead of trying every union member. Nested discriminated unions are supported too, but that would be going too deep for this blog :)
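
A sketch of such settings classes might look like this. It assumes the Pydantic 1.9-style API used in this post (in Pydantic 2, parse_obj still works but is deprecated in favor of model_validate); the 'mocked' literal and the default timeout are illustrative assumptions.

```python
from typing import Literal, Union

from pydantic import BaseModel, Field


class MockedModelSettings(BaseModel):
    # The Literal pins the discriminator value that selects this class.
    model_type: Literal['mocked']


class RestModelSettings(BaseModel):
    model_type: Literal['rest']
    http_endpoint: str
    timeout: int = 10  # hypothetical default, in seconds


class AppSettings(BaseModel):
    # Pydantic reads model_type first and validates only the matching class.
    ml_model: Union[MockedModelSettings, RestModelSettings] = Field(
        ..., discriminator='model_type'
    )


settings = AppSettings.parse_obj({
    'ml_model': {
        'model_type': 'rest',
        'http_endpoint': 'http://localhost:9050/model',
    }
})
print(type(settings.ml_model).__name__)  # RestModelSettings
```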

Pydantic model settings using Union and discriminator field

Now we are better equipped to improve our naïve factory. As soon as we call parse_obj, Pydantic translates the dictionary into a valid AppSettings object and automatically picks the matching ml_model settings class for us, based on model_type.
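
One way such a factory could be written is sketched below. The model classes are stripped-down stand-ins, model_factory is a hypothetical name, and the code again assumes the Pydantic 1.9-style parse_obj API.

```python
from typing import Literal, Union

from pydantic import BaseModel, Field


class MLModel:
    def predict(self, features):
        raise NotImplementedError


class MockedModel(MLModel):
    def predict(self, features):
        return 0.0


class RestModel(MLModel):
    def __init__(self, http_endpoint: str, timeout: int) -> None:
        self.http_endpoint = http_endpoint
        self.timeout = timeout

    def predict(self, features):
        raise NotImplementedError('HTTP call omitted in this sketch')


class MockedModelSettings(BaseModel):
    model_type: Literal['mocked']


class RestModelSettings(BaseModel):
    model_type: Literal['rest']
    http_endpoint: str
    timeout: int = 10  # hypothetical default, in seconds


class AppSettings(BaseModel):
    ml_model: Union[MockedModelSettings, RestModelSettings] = Field(
        ..., discriminator='model_type'
    )


def model_factory(raw_config: dict) -> MLModel:
    """Validate the raw dictionary first, then build the matching model."""
    settings = AppSettings.parse_obj(raw_config)  # raises ValidationError on bad input
    ml = settings.ml_model
    if isinstance(ml, RestModelSettings):
        return RestModel(ml.http_endpoint, ml.timeout)
    return MockedModel()


model = model_factory({
    'ml_model': {
        'model_type': 'rest',
        'http_endpoint': 'http://localhost:9050/model',
        'timeout': 10,
    }
})
```

By the time the factory branches, the settings object is already validated and typed, so the branch works with attributes instead of raw dictionary keys.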

A simple factory using Pydantic parser and validation

Notice that we can take full advantage of parameter validation, such as number ranges, string validation, etc. The validation lives in one place, and whenever we read the configuration, Pydantic validates it. The faulty example from above will raise Pydantic’s ValidationError.
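
As a small illustration, the sketch below validates the inner ml_model dictionary directly against a settings class with a range constraint. The bounds on timeout and the helper failed_fields are assumptions for the example, and the code assumes the Pydantic 1.9-style parse_obj API.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class RestModelSettings(BaseModel):
    model_type: Literal['rest']
    http_endpoint: str
    timeout: int = Field(10, gt=0, le=60)  # hypothetical bounds, in seconds


def failed_fields(raw: dict) -> list:
    """Return the names of the fields that failed validation, if any."""
    try:
        RestModelSettings.parse_obj(raw)
    except ValidationError as exc:
        return [err['loc'][-1] for err in exc.errors()]
    return []


# The faulty dictionary from earlier: http_endpoint is missing.
print(failed_fields({'model_type': 'rest', 'timeout': '20'}))
# ['http_endpoint']

# An out-of-range timeout is rejected by the gt=0 constraint.
print(failed_fields({
    'model_type': 'rest',
    'http_endpoint': 'http://localhost:9050/model',
    'timeout': -100,
}))
# ['timeout']
```

Note that by default Pydantic coerces the string '20' to the integer 20, so in the first call it is the missing http_endpoint that triggers the error. If strings should be rejected outright, a strict type such as StrictInt can be used for the field.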

Endnote

In this blog post, we solved a specific use case of modeling a settings schema using the Pydantic library. We used the Union type hint to make our life easier when modeling polymorphic settings. Could we just use Python dictionaries? Of course we could, and in the simplest cases we probably should. However, we need to know that this approach is error-prone, and we would lose some nice type-hinting features that help us further down the pipeline.
