Maintainable database code

Any worthwhile project is going to have data storage needs. The type of data storage chosen and how it is used will have a dramatic effect on scalability, performance, and maintainability.

This article will not concentrate on the specific database or even the type of database chosen. A project may need the rigid structure of a relational database, the flexibility of NoSQL, or a blend of different technologies.

It will also not be about what type of library to use when accessing the data. Maybe an ORM is the best choice or maybe it would be better to write SQL code. Again the reality probably may lie somewhere in between.

Rather, I will go through database related issues which will affect long term maintainability. I may touch on these other areas where they relate to this topic.

Versioning

As a database evolves there will be many occasions where backwards incompatible changes are made. This may be due to a change in structure or updates to the underlying database software.

Often there will be times when different versions of the code and database need to be made to work together. This may happen during to a rolling deployment, clients running their own servers, or when restoring an old backup for troubleshooting.

To keep track of these changes the database should be given a version number. This number should be stored somewhere in the database. It allows for any backup or snapshot to be easily identified and deployment scripts to automatically run any required migrations.

Create a database template for each version and keep it in version control. This should include both the schema and any required reference data. If something needs to be quickly tested against a specific version of the database then the appropriate template may be extracted.

There are different ways to manage the version number. Here are two that will fit most cases.

Code-linked

If the code is already using a semantic version number, it may make sense to use the same number for the database. This means incrementing the version in the same way for database changes as one would for the code.

A variation on this is to use part of the code version for the database, for example only the major and minor version numbers. This has the benefit of not needing to change the database version every time a patch release is done.

Independent

The alternative is to use an independent version number for the database. This is useful if more than one code base needs to access the data so it can’t be bound to a single code version as well as for zero downtime deployments.

A record should be kept of which database versions are compatible with which versions of each code base.

A semantic version may still be used.

Migrations

Migrations make it easy to move from one version of the database to another. Whenever possible make them work for both upgrades and downgrades.

This is required for automated deployments which will need to apply new database changes and perform rollbacks if something goes wrong. It is also extremely helpful when troubleshooting to be able to quickly match the database with the any version of the code.

Many database access layers and frameworks have support migrations and may be able to automatically generate them. For smaller projects it is fine to maintain migrations manually but one should still have scripts to automate their use.

Documentation

There is nothing worse than finding a field referenced but not being sure what the contents mean.

Always keep the database fully documented. This means that every field should have a full description and any relationships between entities should be clearly defined.

Make a new version of the documentation whenever the database version is incremented. It should also be possible to access older versions. For this reason it may make sense to store the documentation alongside the migrations in version control.

Everything should be kept in a single place and updated at the same time to avoid one part of the documentation ending up out of sync with the rest.

Data manipulation

All code dealing with any changes to data should be kept in a central location. When using an ORM or MVC pattern this should happen naturally but special care should be taken to maintain a DRY compliant code base.

This should deal with validation, the calculation of any aggregated fields, and any other rule specific to a field or table. It may form part of the business logic or be required for an underlying backend requirement.

Application or database

An important decision needs to be made with regards to this shared data manipulation logic. Often it is possible to use code within the database, such as stored procedures and triggers, to manage what happens when data is changed.

While it is possible to write all code in the same place there are often good reasons to split it up. The key thing is to be consistent. If, for example, a trigger is used to populate a field containing a timestamp of the last change on one table, then this should be done in the same way for all other tables with the same requirement.

The code within the database software may be less transparent to developers who are used to working in their chosen language. A counterpoint to this is that code within the database often performs better. Try to find the right balance.

Data conversion

Often there is a requirement to import data from an external source. This may happen when linking into another system, or when importing data from a competing product.

In both cases the central database manipulation code should be used. Avoid the temptation of importing directly from the raw data or using a third party tool. This may bypass special logic in the central code and potentially introduce inconsistency in the data.

Data retrieval

When retrieving data keep in mind the complexity of the query, the amount of data within the entities referenced by the query, and how much data is expected to be returned.

Based on these factors one must decide how and when to perform the data retrieval.

When using generated SQL, for example with an ORM, review the SQL and test the performance. If there are problems with the generated SQL, or it takes a long time to generate, consider hand writing the SQL instead.

Queries that take a long time to run or return a large amount of data may be deferred rather than being executed immediately. The results may then be transferred to the user at a later stage. This could be done using email, a download from within the application, or file sharing.

Fields should always be referenced by name in the code and never by position. The field order may change and cause problems.

Looking ahead

Taking these issues into account will go far in making long term maintenance less of a headache.

Some of them may require a bit of work to get going but should not add too much overhead in the long run. And when problems do arise it should be a lot easier to pinpoint what happened, even if it was on an older version of the database.