Don’t Complect Your Schemas

Published in The Pragmatic Programmers · Dec 7, 2023

complect
From Latin complectī (“to entwine, encircle, compass, infold”), from com- (“together”) and plectere (“to weave, braid”). See complex.
https://en.wiktionary.org/wiki/complect

You’re a developer in a new and exciting startup. You are tasked with creating an API for the users in the system. To save time, you decide to use a document database (maybe Elasticsearch?). The API endpoint for /users/<user_id>, which gets the details of a user, will do something like the following Python code:
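A minimal sketch of such an endpoint, assuming an in-memory dict (`USERS`) in place of the document database and a plain function (`get_user`) in place of a web framework handler. Both names and the sample fields are illustrative and not from the original:

```python
import json

# Hypothetical stand-in for the document database: user ID -> stored document.
USERS = {
    "u1": {"id": "u1", "name": "Ada", "email": "ada@example.com"},
}


def get_user(user_id: str) -> str:
    """Handler for GET /users/<user_id>: return the stored document as JSON."""
    doc = USERS[user_id]    # fetch the document from storage
    return json.dumps(doc)  # send it to the client unchanged
```

Whatever is stored in the document is exactly what the client receives.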

And you’re done! Close the ticket, move on …

A couple of weeks later, you get a new issue assigned to you:

Seems like we’re leaking sensitive information.
The /users/<user_id> endpoint returns the user, but inside the JSON object there’s also the user’s address.

Oops! Digging around, you find out that the data team added user addresses to the database a week ago.

Before you go on and implement a “fields that should not get exported to the API” feature, try to think of the root cause. You are tangling (complecting) the database schema with the API schema. Sure, it’s convenient to pluck a user from the database and send it off to the client, but it’s wrong, in the same way that eating a deep-fried Twinkie feels great right now but is bad for your long-term health.

Most applications have three layers:

Storage layer
Business layer
API layer

Each layer should define its own schema. We’ve already seen one reason: it’s a security risk. Another reason is that each layer changes at a different rate. The business layer changes faster than the database layer, which changes faster than the API layer. Say the database team decides to rename the User.addr field to User.address and, without knowing it, breaks the API. In a larger system, you have no idea what downstream systems ingest your data.
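The rename scenario can be sketched as follows. This is only an illustration: `ApiUser` and `api_user_from_doc` are hypothetical names, the API is assumed to expose an `addr` field, accepting both field names during a migration is my assumption, and the business layer is skipped for brevity.

```python
from dataclasses import asdict, dataclass


@dataclass
class ApiUser:
    """API schema: the contract that clients depend on (hypothetical)."""
    id: str
    addr: str


def api_user_from_doc(doc: dict) -> ApiUser:
    """The one place that knows how storage names its fields."""
    # Assumption: storage renamed "addr" to "address", and both spellings
    # may exist while the migration is in progress.
    address = doc["address"] if "address" in doc else doc["addr"]
    return ApiUser(id=doc["id"], addr=address)


old_doc = {"id": "u1", "addr": "1 Main St"}     # before the rename
new_doc = {"id": "u1", "address": "1 Main St"}  # after the rename

# Clients see the same payload before and after the rename.
assert asdict(api_user_from_doc(old_doc)) == asdict(api_user_from_doc(new_doc))
```

Because the API schema is spelled out on its own, the storage rename touches a single conversion function instead of every client.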

But does that mean that each layer should have its own User with its own schema? My answer is: yes! Even if these User types look the same at the beginning, they are going to diverge at some point, and in the long run this separation is going to save you from some nasty bugs. It will also make the serialization code simpler.

The Go community has a saying: “A little copying is better than a little dependency.” I highly recommend watching Rob Pike explain this. He talks about code, but it applies just as well to schemas.

Here’s another example: Some serialization formats, such as protocol buffers, generate code. When you work with protocol buffers, you write a definition file, maybe something like:

Then you generate code for serialization using the protoc compiler. This generated code already has classes (or structs or …), and you might be tempted to use these generated types in your code. But again, you would be tangling schemas from different layers. Protocol buffers also use their own time type. For example, in Python the generated code does not use the built-in datetime but google.protobuf.timestamp_pb2.Timestamp, which is going to confuse your users.

What you should do is handle serialization at the edges of your business layer, where it meets the database layer and the API layer. When you read from the database, get a database layer user and then convert it to a business layer user. Do the same for the API layer: convert the business layer user to an API layer user before sending it out. This approach might be more work, but it will be worth it in the long run.
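Here is a rough sketch of converting at the edges, under some assumptions of mine: the type names (`DbUser`, `User`, `ApiUser`), the converter names, and the field choices are all hypothetical, with storage keeping the creation time as seconds since the Unix epoch and the API sending it as an ISO 8601 string.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class DbUser:
    """Storage layer: mirrors the stored document (hypothetical fields)."""
    id: str
    name: str
    address: str
    created: float  # seconds since the Unix epoch


@dataclass
class User:
    """Business layer: the type the rest of the code works with."""
    id: str
    name: str
    address: str
    created: datetime


@dataclass
class ApiUser:
    """API layer: only what clients are allowed to see."""
    id: str
    name: str
    created: str  # ISO 8601 string


def user_from_db(row: DbUser) -> User:
    """Database edge: turn storage types into business types."""
    created = datetime.fromtimestamp(row.created, tz=timezone.utc)
    return User(row.id, row.name, row.address, created)


def user_to_api(user: User) -> ApiUser:
    """API edge: pick the exported fields; the address is left out on purpose."""
    return ApiUser(user.id, user.name, user.created.isoformat())


row = DbUser("u1", "Ada", "1 Main St", 0.0)
payload = asdict(user_to_api(user_from_db(row)))
```

The business code only ever sees `User` with a plain `datetime`, and a new storage field cannot reach clients unless someone adds it to `ApiUser` deliberately.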