Evolve data structures in NoSQL with Python data models

Luc van Donkersgoed
Published in PostNL Engineering
12 min read · Oct 1, 2023

Many of the serverless applications we build at PostNL support mission-critical processes. These applications need to evolve constantly as business requirements change. DynamoDB databases support this evolution, but if you’re not careful, their schemaless nature can lead to serious compatibility issues. In this article we will explore design patterns to support backward compatibility when evolving your DynamoDB data structures.

TL;DR

Because NoSQL databases have no schema, newer data entries in the database can have a different structure than older entries, even if they represent the same entity type. We use data models to represent entities as a single consolidated view after they are retrieved from the database, regardless of how the entity was stored.

In the commonly used pattern of a REST API backed by DynamoDB we have four model types:

  • REST API Request Model
  • DB Create Model
  • DB View Model
  • REST API Response Model

Each of these models relates to the latest state of the data model, even if the actual data in the DynamoDB database has a different structure. The DB View Model is responsible for backward compatibility with older data formats stored in the database. We use unit tests to validate compatibility with every historic and current data structure.

This article is supported by an example project, found on GitHub. All code in this article is copied from the example application.

Step 1: Marshaling user data into the Request Model

This article mainly covers the POST flow, where new data is stored in the database, and the GET flows, where a single entity or list of entities is retrieved from the database.

Our API has a specification, which might state that username and password are mandatory fields and age is an optional field. The username field must be a valid email address, the password has a minimum length of 12 characters, and the age, when supplied, must be an integer. When a user sends data to our application, they perform a POST request with a JSON body. However, there is no guarantee that they send at least the required data, exactly the required data, or that the data matches the spec. The Request Model is responsible for validating the provided data and stripping extraneous fields from the payload. In Python, the Pydantic library is often used to define models and validate data. The request model might look like this:

from typing_extensions import Annotated, Optional
from pydantic import BaseModel, Field, field_validator
from email_validator import validate_email, EmailNotValidError


class CreateEntityRequest(BaseModel):
    """Request model for creating an entity."""

    username: str
    password: str = Field(min_length=12)
    age: Optional[Annotated[int, Field(gt=0, lt=150)]] = None

    @field_validator("username")
    @classmethod
    def validate_email(cls, value):
        try:
            validate_email(value)
        except EmailNotValidError:
            raise ValueError("Invalid email format")
        return value

We use unit tests to confirm that minimal, optional, and extraneous data use cases all result in the expected model. The full set of tests can be found in test_request.py. An example of a happy path test looks like this:

@staticmethod
def test_create_entity_request_happy():
    # 1. ARRANGE
    from resources.functions.api.src.models.request import (
        CreateEntityRequest,
    )

    request_payload = {
        "username": "test@mydomain.com",
        "password": "p1234p1234p1234",
    }

    # 2. ACT
    model = CreateEntityRequest(**request_payload)

    # 3. ASSERT
    assert model.username == request_payload["username"]
    assert model.password == request_payload["password"]
    assert model.age is None

An example unit test which validates that extraneous data is accepted but ignored looks like this:

@staticmethod
def test_create_entity_request_happy_with_extraneous_fields():
    # 1. ARRANGE
    from resources.functions.api.src.models.request import (
        CreateEntityRequest,
    )

    request_payload = {
        "username": "test@mydomain.com",
        "password": "p1234p1234p1234",
        "age": 30,
        "some_int": 12,
        "some_bool": False,
        "some_str": "lorem ipsum",
    }

    # 2. ACT
    model = CreateEntityRequest(**request_payload)

    # 3. ASSERT
    assert model.username == request_payload["username"]
    assert model.password == request_payload["password"]
    assert model.age == request_payload["age"]

And finally, we add a unit test which validates that an incorrect username is rejected:

@staticmethod
def test_create_entity_request_unhappy_invalid_email():
    # 1. ARRANGE
    import pytest
    from pydantic import ValidationError

    from resources.functions.api.src.models.request import (
        CreateEntityRequest,
    )

    request_payload = {
        "username": "test_user",
        "password": "p1234p1234p1234",
        "age": 30,
    }

    # 2. ACT
    with pytest.raises(ValidationError) as exc:
        CreateEntityRequest(**request_payload)

    # 3. ASSERT
    errors = exc.value.errors()
    assert len(errors) == 1

    error = errors[0]
    assert error["type"] == "value_error"
    assert error["loc"] == ("username",)
    assert error["msg"] == "Value error, Invalid email format"

This concludes the case for the Request Model. It is responsible for taking any user input, validating it against predefined rules, and marshaling it into a predictable, well-defined model which can be used by the rest of the application. Believe me, this beats passing JSON objects or endless function arguments. By miles.

Step 2: Storing data in DynamoDB using the DB Create Model

Now that we have the user’s request available in a Request Model, we can store the data in the database. But first we’re going to reshape the data into the DB Create Model, because we need to add and transform some fields. Examples include:

  • Adding a partition key (PK) and sort key (SK)
  • Hashing the password
  • Adding creation and update timestamps

None of these values are provided by the user, yet they are an essential part of the data as it is being stored in the database. First we define the DB Create Model:

from typing_extensions import Optional
from pydantic import BaseModel


class DatabaseCreateUser(BaseModel):
    """Model representing the user being stored in the database."""

    pk: str
    sk: str

    username: str
    hashed_password: bytes

    age: Optional[int] = None

    # Timestamp in milliseconds
    created_at_ts_ms: int
    updated_at_ts_ms: int

This data model has some overlap with the Request Model, but as expected we have some new and some slightly different fields. We use the Request Model as the source, and generate a new DB Create Model from the source model in the UserController:

def _generate_create_model_from_request(
    self, create_request: CreateUserRequest
) -> DatabaseCreateUser:
    """Generate a DatabaseCreateUser from a CreateUserRequest."""
    hashed_password = self._hash_password(create_request.password)

    # Generate a UTC timestamp in milliseconds
    ms_since_epoch = int(time.time() * 1000)

    return DatabaseCreateUser(
        pk="User",
        sk=str(self._generate_uuid()),
        username=create_request.username,
        hashed_password=hashed_password,
        age=create_request.age,
        created_at_ts_ms=ms_since_epoch,
        updated_at_ts_ms=ms_since_epoch,
    )
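The controller’s _hash_password and _generate_uuid helpers are internal and not shown in this article. Below is a minimal sketch of what they might look like. The use of bcrypt is an assumption on our part, although the $2b$12$ prefix in the test fixtures later in this article suggests a bcrypt hash:

import uuid

import bcrypt


# Hypothetical UserController helpers (names taken from the code above)
def _hash_password(self, password: str) -> bytes:
    """Hash the plaintext password so it is never stored as-is."""
    return bcrypt.hashpw(password.encode("utf-8"), bcrypt.gensalt())


def _generate_uuid(self) -> uuid.UUID:
    """Generate a random UUID to serve as the user's sort key."""
    return uuid.uuid4()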

After this step we have a fully predictable and standardized model, ready to be written to the database. The last step, actually calling DynamoDB to store the item, is as simple as this:

def create(self, create_request: CreateUserRequest) -> DatabaseViewUser:
    db_create_model = self._generate_create_model_from_request(create_request)
    self._table.put_item(Item=db_create_model.model_dump())
    return self._generate_view_model_from_create_model(db_create_model)

We simply dump the model as a dictionary and write it to DynamoDB. The SDK takes care of the rest. This concludes the section about the DB Create Model, and about the POST flow in general. We have seen how the DB Create Model enriches and transforms the data for DynamoDB, providing us with a clear and predictable data structure for the database.
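For completeness, here is what the POST flow might look like end to end in a Lambda handler. This is a sketch of ours rather than code from the example project; the handler name and the error response shape are assumptions, and GetUserResponse is only introduced in step 4:

import json

from pydantic import ValidationError


def post_function_handler(event, _context):
    # Validate and marshal the raw JSON body into the Request Model
    try:
        create_request = CreateUserRequest(**json.loads(event["body"]))
    except ValidationError as exc:
        # Reject payloads that do not match the API specification
        return {"statusCode": 400, "body": exc.json()}

    # Store the user; create() returns a DatabaseViewUser
    db_view_model = user_controller.create(create_request)

    # Return only the public fields through the Response Model
    response = GetUserResponse.from_db_view_model(db_view_model)
    return {"statusCode": 201, "body": response.model_dump_json()}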

Step 3: Retrieving the data from DynamoDB with the DB View Model

We have stored the data in the database, and we know how that data is structured. Now it is time to retrieve the data. The code is again relatively straightforward:

def get_all(self) -> List[DatabaseViewUser]:
    items = []
    response = self._table.query(
        KeyConditionExpression="pk = :pk", ExpressionAttributeValues={":pk": "User"}
    )
    for item in response.get("Items", []):
        items.append(DatabaseViewUser.from_dynamodb_item(item))
    return items

We simply query all users from DynamoDB and convert the resulting items to the DB View Model. The field definitions in this model are currently identical to those in the DB Create Model, but this will change as our data model evolves (see below).
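One caveat worth noting: a single query call returns at most 1 MB of data, so a production-grade get_all() would likely follow DynamoDB’s pagination tokens. A sketch of ours, assuming the same table and key layout:

def get_all(self) -> List[DatabaseViewUser]:
    """Query all users, following DynamoDB's pagination tokens."""
    items = []
    query_kwargs = {
        "KeyConditionExpression": "pk = :pk",
        "ExpressionAttributeValues": {":pk": "User"},
    }
    while True:
        response = self._table.query(**query_kwargs)
        for item in response.get("Items", []):
            items.append(DatabaseViewUser.from_dynamodb_item(item))
        # DynamoDB returns LastEvaluatedKey while more pages remain
        if "LastEvaluatedKey" not in response:
            return items
        query_kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]

With that caveat out of the way, the full model looks like this: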

from typing_extensions import Optional
from pydantic import BaseModel, model_validator
from boto3.dynamodb.types import Binary, Decimal


class DatabaseViewUser(BaseModel):
    """Model representing the user being retrieved from the database."""

    pk: str
    sk: str

    username: str
    hashed_password: bytes

    age: Optional[int] = None

    # Timestamp in milliseconds
    created_at_ts_ms: int
    updated_at_ts_ms: int

    @model_validator(mode="before")
    @classmethod
    def convert_data_type_value(cls, values: dict):
        """Convert DynamoDB types to the correct Python types."""
        for key, value in values.items():
            if isinstance(value, Binary):
                values[key] = bytes(value)
            if isinstance(value, Decimal):
                values[key] = int(value)
        return values

    @staticmethod
    def from_dynamodb_item(item: dict) -> "DatabaseViewUser":
        """Convert a DynamoDB User item to a DatabaseViewUser object."""
        return DatabaseViewUser(**item)

As you can see, the DB View Model has a field for every known attribute of the data in the database. This allows the rest of the application to use the model for its own purposes. The model also contains some helper functions to convert the DynamoDB data formats back to their native Python types.

To validate that the model is capable of converting data received from DynamoDB, we add a new unit test. Keep this test in mind, because it will be essential in the last section of this article: Evolving data structures.

@staticmethod
def test_create_entity_request_happy_v1():
    # 1. ARRANGE
    from boto3.dynamodb.types import Binary, Decimal

    from resources.functions.api.src.models.db_view import (
        DatabaseViewUser,
    )

    dynamodb_response = {
        "Items": [
            {
                "hashed_password": Binary(
                    b"$2b$12$yowVgWrjapGmpjVGRsMO/OyZPlrbXnyGJ23.CT3Y3.O.jlIy616NS"
                ),
                "created_at_ts_ms": Decimal("1696109591643"),
                "sk": "070e7fd4-128c-486d-8ab2-09277253f2ee",
                "username": "lucvandonkersgoed@mydomain.com",
                "updated_at_ts_ms": Decimal("1696109591643"),
                "pk": "User",
                "age": None,
            },
        ],
        "Count": 1,
        "ScannedCount": 1,
        "ResponseMetadata": {
            # Stripped
        },
    }

    item = dynamodb_response["Items"][0]

    # 2. ACT
    model = DatabaseViewUser.from_dynamodb_item(item)

    # 3. ASSERT
    assert model.pk == "User"
    assert model.sk == "070e7fd4-128c-486d-8ab2-09277253f2ee"
    assert model.username == "lucvandonkersgoed@mydomain.com"
    assert (
        model.hashed_password
        == b"$2b$12$yowVgWrjapGmpjVGRsMO/OyZPlrbXnyGJ23.CT3Y3.O.jlIy616NS"
    )
    assert model.age is None
    assert model.created_at_ts_ms == 1696109591643
    assert model.updated_at_ts_ms == 1696109591643

And with that we have reached the end of step 3: retrieving the data from DynamoDB with the DB View Model. In this section we have covered a Pydantic model that represents the data from the database. This model can be used by other classes, functions, and processes to reliably and predictably access the retrieved data.

Step 4: Returning the data to the user with the Response Model

The final model is the Response Model. This model represents the data being returned to our users through our REST API, and should match our API’s specification. The Response Model contains a subset of the data in the DB View Model. The DB View Model contains all the data from the database, including pk, sk, and hashed_password; the Response Model is responsible for stripping out the fields that should remain internal. This is achieved in the GetUserResponse.from_db_view_model() function:

@staticmethod
def from_db_view_model(db_view_model: DatabaseViewUser) -> "GetUserResponse":
    return GetUserResponse(
        id=db_view_model.sk,
        username=db_view_model.username,
        age=db_view_model.age,
    )
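The field definitions of GetUserResponse are not shown in this article. Based on the fields used in from_db_view_model(), the model presumably looks something like this sketch:

from typing_extensions import Optional
from pydantic import BaseModel


class GetUserResponse(BaseModel):
    """Response model for returning a user through the REST API."""

    # Deliberately omits pk, hashed_password, and the timestamps
    id: str
    username: str
    age: Optional[int] = None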

After the GetUserResponse model has been created, we can return it to our users. In the example below, all users in the database are converted to GetUserResponse models and returned as a list.

def get_function_handler(_event, _context):
    # Get all users as a list of DatabaseViewUser objects
    db_view_models = user_controller.get_all()
    # Convert each DatabaseViewUser to a GetUserResponse object
    users_response = [
        GetUserResponse.from_db_view_model(db_view_model).model_dump()
        for db_view_model in db_view_models
    ]

    return {
        "statusCode": 200,
        # Convert the list of GetUserResponse dicts to a JSON string
        "body": json.dumps(users_response),
    }

To summarize, the Response Model is used to interact with the users according to the API’s spec. It also guarantees that no data from the database is inadvertently leaked to the outside world.

This rounds off our introduction of the four types of data models: the Request Model, DB Create Model, DB View Model, and Response Model. In the next section we will cover evolving these models when the application changes.

Evolving data structures

The sections above cover the models in a stable, clean state. But software is never stable. It evolves as new requirements emerge. Sometimes features are added, sometimes they are removed. Sometimes, hopefully not too often, features are built halfway to completion and then dropped. We need solutions to deal with these fluctuations in our data structures, and the data models discussed in this article can provide them.

Example use case: adding a role to the user entity

So far, our users were created by supplying a username, password, and age in a REST POST request. The password was hashed, the data enriched with a pk, sk, and a few timestamps, and then the user was stored in the database.

Now the application has grown, and you have decided that users should have a role that defines their permissions. The role can be READONLY, WRITER, or ADMIN. You need to define how:

  1. This role is included in the data models
  2. The application will handle existing users that were never assigned a role

Let’s cover these topics one by one.

Adding the role to the data models

In our new use case, setting a user’s role is mandatory. We first define the available roles as an Enum:

from enum import Enum


class UserRole(str, Enum):
    """An enumeration of user roles."""

    READONLY = "READONLY"
    WRITER = "WRITER"
    ADMIN = "ADMIN"

Then we add the role to our Request Model.

class CreateUserRequest(BaseModel):
    """Request model for creating a user."""

    username: str
    password: str = Field(min_length=12)
    age: Optional[Annotated[int, Field(gt=0, lt=150)]] = None
    role: UserRole

    @field_validator("username")
    @classmethod
    def validate_email(cls, value):
        try:
            validate_email(value)
        except EmailNotValidError:
            raise ValueError("Invalid email format")
        return value

We need to store the role in the database as well. Our DB Create Model defines what is written to the DynamoDB Table, so we update that too.

class DatabaseCreateUser(BaseModel):
    """Model representing the user being stored in the database."""

    pk: str
    sk: str

    username: str
    hashed_password: bytes

    age: Optional[int] = None

    role: UserRole

    # Timestamp in milliseconds
    created_at_ts_ms: int
    updated_at_ts_ms: int

With these changes, providing the role has become mandatory, the role must be one of the three known types, and the selected role is stored with the user entity in the database.
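A unit test in the same Arrange-Act-Assert style as the earlier ones (our addition, not part of the example set above) can confirm that a request without a role is now rejected:

@staticmethod
def test_create_user_request_unhappy_missing_role():
    # 1. ARRANGE
    import pytest
    from pydantic import ValidationError

    from resources.functions.api.src.models.request import (
        CreateUserRequest,
    )

    request_payload = {
        "username": "test@mydomain.com",
        "password": "p1234p1234p1234",
    }

    # 2. ACT
    with pytest.raises(ValidationError) as exc:
        CreateUserRequest(**request_payload)

    # 3. ASSERT: Pydantic reports the missing mandatory field
    error = exc.value.errors()[0]
    assert error["type"] == "missing"
    assert error["loc"] == ("role",)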

Backward compatibility with existing users

We want to return the newly stored role in the GET requests too, so we add the role field to the DB View Model and Response Model:

from typing_extensions import Optional
from pydantic import BaseModel, model_validator
from boto3.dynamodb.types import Binary, Decimal

from . import UserRole


class DatabaseViewUser(BaseModel):
    """Model representing the user being retrieved from the database."""

    pk: str
    sk: str

    username: str
    hashed_password: bytes

    age: Optional[int] = None

    role: UserRole

    # Timestamp in milliseconds
    created_at_ts_ms: int
    updated_at_ts_ms: int

    @model_validator(mode="before")
    @classmethod
    def convert_data_type_value(cls, values: dict):
        """Convert DynamoDB types to the correct Python types."""
        for key, value in values.items():
            if isinstance(value, Binary):
                values[key] = bytes(value)
            if isinstance(value, Decimal):
                values[key] = int(value)
        return values

    @staticmethod
    def from_dynamodb_item(item: dict) -> "DatabaseViewUser":
        """Convert a DynamoDB User item to a DatabaseViewUser object."""
        return DatabaseViewUser(**item)

But remember the unit test we added earlier, which tests compatibility with the DynamoDB response? It immediately fails, because its mocked DynamoDB data does not contain a role.

We will solve this in a minute, but to isolate the problem we will first add a new unit test which represents the new data structure:

@staticmethod
def test_create_entity_request_happy_v2():
    # 1. ARRANGE
    from boto3.dynamodb.types import Binary, Decimal

    from resources.functions.api.src.models.db_view import (
        DatabaseViewUser,
    )

    dynamodb_response = {
        "Items": [
            {
                "hashed_password": Binary(
                    b"$2b$12$yowVgWrjapGmpjVGRsMO/OyZPlrbXnyGJ23.CT3Y3.O.jlIy616NS"
                ),
                "created_at_ts_ms": Decimal("1696109591643"),
                "sk": "070e7fd4-128c-486d-8ab2-09277253f2ee",
                "username": "lucvandonkersgoed@mydomain.com",
                "updated_at_ts_ms": Decimal("1696109591643"),
                "pk": "User",
                "role": "WRITER",
                "age": None,
            },
        ],
        "Count": 1,
        "ScannedCount": 1,
        "ResponseMetadata": {
            # Stripped
        },
    }

    item = dynamodb_response["Items"][0]

    # 2. ACT
    model = DatabaseViewUser.from_dynamodb_item(item)

    # 3. ASSERT
    assert model.pk == "User"
    assert model.sk == "070e7fd4-128c-486d-8ab2-09277253f2ee"
    assert model.username == "lucvandonkersgoed@mydomain.com"
    assert (
        model.hashed_password
        == b"$2b$12$yowVgWrjapGmpjVGRsMO/OyZPlrbXnyGJ23.CT3Y3.O.jlIy616NS"
    )
    assert model.age is None
    assert model.created_at_ts_ms == 1696109591643
    assert model.updated_at_ts_ms == 1696109591643
    # New in v2: UserRole is a str enum, so comparing to the raw value works
    assert model.role == "WRITER"

This test passes, confirming that the DB View Model is compatible with the new data structure. Now let’s focus on backward compatibility. We will update the DatabaseViewUser.from_dynamodb_item() function like this:

from copy import deepcopy  # module-level import, shown here for context


@staticmethod
def from_dynamodb_item(item: dict) -> "DatabaseViewUser":
    """
    Convert a DynamoDB User item to a DatabaseViewUser object.

    Provide backward compatibility with older models, which might
    not have a "role" field stored in the database.
    """
    role = item.get("role", UserRole.READONLY)
    item_copy = deepcopy(item)
    item_copy["role"] = role
    return DatabaseViewUser(**item_copy)

This will default the role to the sensible value READONLY if no role is found in the database. After this change, our DB View Model is backward compatible with the old data structure, as confirmed by our unit tests.
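It is also worth extending the earlier v1 test with one extra assertion (our addition) so the fallback behaviour itself is pinned down:

# Added to test_create_entity_request_happy_v1: the v1 item has no
# "role" attribute, so from_dynamodb_item() must default it
assert model.role == UserRole.READONLY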

This highlights the importance of the data structure unit tests for the DB View Model. You need to add new tests for new data structure variants as soon as you start writing them to the database. This provides a historical record of all data structures that ever existed, and which your DB View Model needs to support. You can only drop support for an older data structure if you can guarantee it no longer exists in any of your environments. How to reach that point is covered in the last section of this article.

Clean up with data migrations

If you’ve come this far, you may be thinking: “aren’t all these backward compatibilities going to pollute my code and tests over time?” And indeed they will. The solution is to consolidate the many data versions you may have back into a unified model. Following the example above, you could scan all the users in your database, find those without a role attribute, and update them to READONLY users. When this migration is complete, you can remove backward compatibility for the users without a role.
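A one-off migration script might look like the sketch below. The table name is an assumption, and the ConditionExpression guards against a concurrent writer assigning a real role between the scan and the update:

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("users")  # table name assumed

scan_kwargs = {
    # Only users that were stored before the role field existed
    "FilterExpression": "pk = :pk AND attribute_not_exists(#role)",
    "ExpressionAttributeNames": {"#role": "role"},
    "ExpressionAttributeValues": {":pk": "User"},
}
while True:
    response = table.scan(**scan_kwargs)
    for item in response.get("Items", []):
        try:
            table.update_item(
                Key={"pk": item["pk"], "sk": item["sk"]},
                UpdateExpression="SET #role = :role",
                ConditionExpression="attribute_not_exists(#role)",
                ExpressionAttributeNames={"#role": "role"},
                ExpressionAttributeValues={":role": "READONLY"},
            )
        except ClientError as exc:
            # A failed condition means the user received a role meanwhile
            if exc.response["Error"]["Code"] != "ConditionalCheckFailedException":
                raise
    if "LastEvaluatedKey" not in response:
        break
    scan_kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]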

However, data migrations can be sensitive and complex. When to apply them depends on many factors in your application’s context, such as the size of the data set, the number of data structure versions, the number of unit tests, and so on.

Conclusion

In this article we have covered two design patterns:

  1. Processing REST requests and responses with data models to sanitize and validate your API’s inputs and outputs.
  2. Modeling your NoSQL requests and responses with data models to guarantee data structure and backward compatibility.

Applying these patterns will help you maintain and understand your data structures, which is essential when your database does not enforce a schema. Unit tests are an invaluable tool to document and consistently prove your application’s compatibility with every evolution of your data structures.
