Schemaless Database Considerations

Rob Hemmings
Lydtech Consulting
Mar 25, 2023

Introduction

Schemaless databases are currently very popular, but the decision of whether to choose one for the persistence layer of a system should be carefully considered against the requirements of that system.

On the face of it a schemaless database sounds very appealing, with the perception that there is no need to map the organisation's data into a schema and do all of that tedious, time-consuming upfront data normalisation, analysis and design work. Also, if there is no schema then there is no need for migration scripts to manage and control changes to it.

Schemaless does not mean “no schema”

An organisation will only want to store data because it has some value, now or in the future, to that organisation. The structure, type and meaning of that data will be well known and understood by the organisation, so the data has a schema regardless of whether the database is going to enforce it.

Schemaless does not mean the data in a database does not conform to a schema; it just means that most of the data will not have a database-enforced schema applied to it, so there is no need to define every field in every table within your system. There will be some basic schema requirements, though: the data will have to be organised into collections / tables, and the data in those collections / tables will need a primary key and indexes in order to be referenced efficiently.
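
As a minimal sketch of this baseline structure, assuming a MongoDB database accessed via pymongo (the database, collection and field names are illustrative):

    from pymongo import MongoClient, ASCENDING

    client = MongoClient("mongodb://localhost:27017")
    db = client["shop"]
    orders = db["orders"]           # data organised into a collection

    # Every document gets a unique _id, which acts as the primary key.
    orders.insert_one({"order_ref": "ORD-1001", "customer_id": 42})

    # A secondary index so documents can be referenced efficiently.
    orders.create_index([("order_ref", ASCENDING)], unique=True)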

Does the data need modelling?

Modelling the data and applying a traditional relational database schema may not always be possible, or the way the data is intended to be accessed and used may simply better suit a schemaless database. To judge which type of database best suits the requirements of the system, some analysis of the data and of the system it will be used in is necessary. The decision should not be based simply on the latest technological trend.

As architects, developers, testers and devops engineers, we need to understand what the organisation's data looks like and how best to model and store it, regardless of whether a schema is being enforced by the database of choice. A schema-enforcing database naturally documents the system's view of the organisation's data, which can give valuable insight to developers and testers alike.

Confidence in the data

Data is the lifeblood of the organisation and preserving the integrity of that data must be the highest priority.
All applications need to understand the shape of the data they use in order to process it correctly. They also need confidence that the data is valid in order to process it safely, e.g. that a value is present when mandatory, or that a value conforms to the expected type, size or range. When the database enforces the schema, the data is automatically validated at the point of writing. This happens for every application that writes data to the database (including database maintenance tools), ensuring the data stored in the database is consistent with the schema-enforced validations. Any application then reading the data can be confident that what it reads complies with the schema.
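
As an illustrative sketch of this write-time enforcement, using SQLite from Python's standard library (the table and constraints are hypothetical):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE customer (
            id    INTEGER PRIMARY KEY,
            email TEXT NOT NULL,              -- mandatory value
            age   INTEGER CHECK (age >= 0)    -- value must be in range
        )
    """)

    try:
        # Rejected at the point of writing: email is mandatory.
        conn.execute("INSERT INTO customer (id, age) VALUES (1, 30)")
    except sqlite3.IntegrityError as err:
        print("rejected by the database:", err)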

Application level validation

With a schemaless database the application cannot have that same level of confidence in the data it is reading, so additional validation must be performed within the application layer before the data is processed. In addition to validating data when it is read, the application must also validate data before writing it to the database, to ensure that what is stored is valid.

This means every application that reads or writes data will require code to validate it, which results in additional error handling flows to cater for invalid data scenarios. This extra application code increases code complexity and requires additional coding time and testing effort to mitigate the likelihood of introducing bugs.
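
As a sketch of the kind of validation code each reading and writing application ends up carrying (the field names and error type here are hypothetical; the collection is a pymongo collection as in the earlier sketch):

    class InvalidOrderError(Exception):
        """Raised when a document does not match the expected shape."""

    def validate_order(doc: dict) -> dict:
        if "order_ref" not in doc:
            raise InvalidOrderError("order_ref is mandatory")
        total = doc.get("total")
        if not isinstance(total, (int, float)) or total < 0:
            raise InvalidOrderError("total must be a non-negative number")
        return doc

    def save_order(orders, doc: dict) -> None:
        orders.insert_one(validate_order(doc))      # validate before writing

    def load_order(orders, order_ref: str) -> dict:
        doc = orders.find_one({"order_ref": order_ref})
        if doc is None:
            raise InvalidOrderError(f"no order with ref {order_ref}")
        return validate_order(doc)                  # validate after reading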

Consistent Validation

Data access within an application must be made consistent. Every time data is read or written, the validation must be consistently applied to ensure that invalid data is not processed and is kept out of the database. Backward compatibility must also be maintained within the application code, to prevent old data from causing errors with new code. For example, if a new field is introduced then only new entries will have the field populated, so the developer needs to be cognisant that existing entries will not have the field defined, and must defend against this when processing those existing entries.

The existence of old data may not be obvious to the developer / tester, as they may not have access to the production data, and any data in the test databases may not reflect old production data. This could lead to code being deployed that only fails in the production environment. The solution would be to run some sort of migration to ensure that the data in the database is compatible with the new code, which is exactly what a traditional schema-based database would enforce.
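
As a small, hypothetical example of defending against old data, suppose a discount field was added after go-live; documents written before the change will not carry it:

    def order_total(doc: dict) -> float:
        subtotal = doc["subtotal"]            # present on all documents
        discount = doc.get("discount", 0.0)   # absent on pre-change documents
        return subtotal - discount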

Sharing data between applications

Where data in a schemaless database is shared between multiple applications, each application must validate the data, and that validation must be performed consistently across all of the applications. If the validation is changed in one application then the other applications should be updated in line with the new validation. Keeping those validations in sync, and ensuring that the applications are updated and deployed in a way that protects the integrity of the data, can be problematic, so minimising the sharing of data between applications is recommended.
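
One common mitigation, sketched here with a hypothetical module name, is to keep the rules in a single shared module that every application imports rather than duplicating the checks in each codebase:

    # orders_schema.py (versioned and released like any other shared library)
    REQUIRED_FIELDS = {"order_ref", "customer_id", "total"}

    def validate_order(doc: dict) -> dict:
        missing = REQUIRED_FIELDS - doc.keys()
        if missing:
            raise ValueError(f"missing mandatory fields: {sorted(missing)}")
        return doc

    # Each application then does:
    #   from orders_schema import validate_order

Even then, the applications must be rebuilt and redeployed in step whenever the shared rules change, so the coordination problem is reduced rather than removed.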

Referential integrity

With a schemaless database the data within a single table / collection can represent complex relationships; for example, an order document can include its individual order line items. With a relational database this could not be done in a single normalised table; instead an order table and an order items table would need to be created and associated by adding an order id to each of the order item rows. Referential integrity constraints can then be used to ensure that there is a corresponding order row for every order item. Where a schemaless database application has a use-case requiring referential integrity between two tables to be maintained, this responsibility falls on the application code, again leading to further complexity within the code and additional testing and maintenance overhead.
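
As a sketch (collection and field names are illustrative): embedding the line items avoids the relationship entirely, while splitting the data across two collections forces the application to police it:

    # Embedded: the order and its line items live in one document.
    order = {
        "order_ref": "ORD-1001",
        "items": [
            {"sku": "ABC", "qty": 2},
            {"sku": "XYZ", "qty": 1},
        ],
    }

    # Split: the application must check the parent exists itself,
    # where a relational database would enforce a foreign key.
    def add_order_item(db, order_ref: str, item: dict) -> None:
        if db["orders"].find_one({"order_ref": order_ref}) is None:
            raise ValueError(f"no such order: {order_ref}")
        db["order_items"].insert_one({**item, "order_ref": order_ref})

Note that even this check is not watertight: without a transaction, the parent order could be deleted between the existence check and the insert, a race that a database-enforced constraint would have prevented.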

Schema change control

It's often cited that it's easy to add new fields to a schemaless database, as you don't need to create a new field before the data can be added; you can simply insert the data. This is true; however, it is very important to carefully consider the new field before just adding it. Just because the change is easy does not mean it should escape a degree of consideration and caution. For example, it is easy to add a new field only to find that its name does not conform to the system's usual naming standards, and undoing such a change can then be problematic. With a schema-based database the scripts that modify a schema usually undergo rigorous change management, where these problems are often caught.
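
To illustrate why undoing such a change is painful, here is a hypothetical cleanup in pymongo: a field added as orderTotal, in a system whose convention is snake_case, now requires a migration that touches every existing document:

    # Rename the non-conforming field across all existing documents.
    db["orders"].update_many(
        {"orderTotal": {"$exists": True}},
        {"$rename": {"orderTotal": "order_total"}},
    )

Some schemaless databases can also catch this kind of mistake sooner: MongoDB, for example, supports an optional $jsonSchema validator on a collection, restoring a degree of database-enforced change control where it is wanted.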

In Summary

Deciding whether to choose a schemaless database needs to be carefully considered against the requirements of the system, how the data needs to be modelled, and the impact of additional application-level validations on the coding and testing effort. It should not be chosen just because it is currently fashionable or because of the perception that no data modelling is required. There is still a need to do a proper analysis of the data to determine how best to model it for the use-cases of the system. Documenting the modelling of the data is also very important in understanding which data items drive and control the processing of the system.

Head over to Lydtech Consulting to read this article and many more on Kafka and other interesting areas of software development.
