Revamping a database schema.

Agnieszka Figiel
Vizzuality Blog
7 min read · Dec 7, 2017


Back at university, I was trained in relational database design, an activity where you take a complete list of requirements for the system to be built, identify all the entities in the domain of the system and how they relate to each other, then document that model and represent it on a diagram. Thus, you arrive at the design of a complete relational database schema. Many a beautiful diagram was produced and the method made a lot of sense to me.

However, when I started work I realised a number of things, such as:

  1. Nobody knows the complete set of requirements from the outset, at least not in enough detail to inform the modelling process. Even if anyone did, those requirements would change along the way.
  2. Nobody seems particularly interested in database diagrams and documentation at the beginning of development. They are more interested in seeing a product up and running quickly.

A common consequence is that the database schema ends up neglected in the early stages of development. I’m going to describe how we went about giving one of our databases some well-deserved special treatment. Our stack is Ruby on Rails and PostgreSQL.

Database from hell.

Typically, the database schema ends up being designed incrementally alongside the code. In theory that should be fine, since we have methods of dealing with a changing schema, just like we have methods for dealing with changing code. Frameworks such as Rails make it very easy to modify the schema incrementally and write tests that exercise the entire stack. Yet somehow the schema is perceived as less flexible than the code, and as the product grows it often ends up being patched rather than continually refactored.

Many will find the resulting situation familiar:

  • a table with more boolean flags than data columns (“only show this record if a & b || c”),
  • duplicates in the data due to not enforcing unique constraints,
  • orphans and widows all over the place due to missing foreign key constraints and omnipresent polymorphic associations,
  • non-existent “not null”, “check” or any other constraints (“we’ll handle this in the business logic”).

Hardly an elegant data model, but worst of all it’s a maintenance liability, and often a performance bottleneck.
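
To make this concrete, here is a sketch of the kind of table that tends to emerge; the table and column names are hypothetical, not taken from any particular project:

    CREATE TABLE projects (
      id serial PRIMARY KEY,
      name varchar(255),          -- nullable; nothing stops empty or duplicate names
      is_visible boolean,         -- one of several flags combined in application logic
      is_archived boolean,
      show_on_homepage boolean,
      owner_type varchar(255),    -- polymorphic association: no foreign key can be declared
      owner_id integer,
      code varchar(255)           -- "unique" by convention only, never enforced
    );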

Rachel could have avoided trifle disaster by considering the relationships between the ingredients before putting them together.

This is why I strongly believe that for any quickly evolving product, the schema needs a rethink every now and then, and time to tidy up and improve the data model so that it adequately supports the evolving set of requirements. If you’ve waited too long, chances are everyone on the project will already see the need for this, as the project suffers from the tell-tale signs: a schema no one understands anymore and changes that take longer than expected.

Time to revamp.

Once everyone agrees that it’s time to revamp the schema, how do you go about it? Here are some tips we gathered while reworking the database behind one of our projects.

First of all, it is helpful for the team (as well as your future self) to assess the schema and document the changes you propose to make. In our case we first defined some rules (a sketch applying several of them follows the list):

1. Consistently apply naming conventions.

a. Some naming conventions keep Rails happy: plural table names, a serial numeric identifier called id as the primary key column, and the foreign table name in the singular with an “_id” suffix as the foreign key column.

b. Boolean column names: use either the adjective form (“enabled”, “default”) or the prefixed form (“is_enabled”, “is_default”) consistently; likewise settle on either positive (“is_enabled”) or negative (“is_disabled”) naming throughout.

2. Avoid nullable columns.

a. They are a frequent source of errors because of the non-intuitive 3-valued logic; NOT NULL sends a clear message and allows for simplification of queries.

b. On the other hand, where there’s a clear need for an “unspecified”, “n/a” or “unknown” column value, it’s best to use NULL.

c. Enforce default values where applicable, especially for boolean columns.

3. Ensure referential integrity.

a. Make sure foreign keys are in place.

b. Remove polymorphic associations by splitting them into multiple join tables.

4. Prevent duplicates on database level.

a. Make sure uniqueness constraints are in place.

b. Where the logic is too complex, consider using triggers.

5. Follow table normalization rules.

a. Apply 3NF.

b. Where it makes sense to have a denormalized structure, go for views / materialized views.
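
As an illustration, here is a minimal sketch of how several of these rules translate into DDL, using hypothetical tables (categories, products, tags) rather than our actual schema:

    -- Rule 1: plural table names, an "id" primary key, "<singular>_id" foreign keys
    CREATE TABLE categories (
      id serial PRIMARY KEY,
      name text NOT NULL,                               -- rule 2: no nullable columns
      CONSTRAINT categories_name_unique UNIQUE (name)   -- rule 4: uniqueness enforced by the database
    );

    CREATE TABLE products (
      id serial PRIMARY KEY,
      category_id integer NOT NULL REFERENCES categories (id),  -- rule 3: foreign key in place
      name text NOT NULL,
      is_enabled boolean NOT NULL DEFAULT false,                -- rules 1 and 2: consistent boolean name, default value
      CONSTRAINT products_category_id_name_unique UNIQUE (category_id, name)
    );

    -- Rule 3: instead of a polymorphic taggings(taggable_type, taggable_id),
    -- one join table per association, each with real foreign keys
    CREATE TABLE tags (
      id serial PRIMARY KEY,
      name text NOT NULL,
      CONSTRAINT tags_name_unique UNIQUE (name)
    );

    CREATE TABLE product_tags (
      product_id integer NOT NULL REFERENCES products (id),
      tag_id integer NOT NULL REFERENCES tags (id),
      PRIMARY KEY (product_id, tag_id)
    );

    -- Rule 5: denormalised reads live in a (materialized) view rather than duplicated columns
    CREATE MATERIALIZED VIEW product_summaries AS
      SELECT p.id, p.name, c.name AS category_name
      FROM products p
      JOIN categories c ON c.id = p.category_id;

The specifics will differ per project; the point is that each of the rules above has a direct, mechanical expression at the database level.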

Establishing these rules allows you to not only rework the schema, but also hopefully make future changes a bit more consistent. With the rules in place, we went over all the existing tables, identified violations and suggested changes.

Before starting.

The next phase is actually implementing the proposed changes in the database and code. But wait! That’s not true! The next phase, before making any changes whatsoever, is establishing a method to verify that the system produces the same results before the migration as after. I cannot stress enough how important this is. One of the big goals of revamping the database is to improve maintainability of the product as a whole, so you must consider how to prevent errors, or even worse — data loss — from creeping in at the point of migrating the data and the codebase.

Your mileage may vary; maybe the suite of unit tests you already have will be enough to guarantee that. In our case we needed another line of defense, because the unit tests on their own were not reliable enough. Our architecture is a backend emitting JSON and CSV documents to be consumed by the frontend, so we focused on ensuring that those backend outputs exactly match the original responses from the system. This way the frontend is unaffected by the database revamp. We applied what is sometimes referred to as gold master testing: comparing the output of the system to a known correct gold master.

Move incrementally.

Now we can get down to rewriting the code. A useful strategy here is to not change data structures in place, but let the old and the new structure live in parallel worlds during the migration. This can be done quite easily when using PostgreSQL, which allows you to create multiple schemas (which would really be better named namespaces) within the same database.

Here’s how it works: most of the time in a typical Rails application all tables end up in the public schema. In our case that’s where our “old” tables live. All the new tables are in a separate schema called revamp.
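
Setting this up takes only a couple of statements; the revamp schema name is the one we used, while the locations table is a hypothetical stand-in:

    -- Create the new namespace alongside the default "public" schema
    CREATE SCHEMA revamp;

    -- New tables are created under it with a schema-qualified name
    CREATE TABLE revamp.locations (
      id serial PRIMARY KEY,
      name text NOT NULL,
      is_enabled boolean NOT NULL DEFAULT false
    );

    -- Queries can reference revamp.locations explicitly, or rely on the search_path
    SET search_path TO revamp, public;

On the Rails side, a model can be pointed at the new schema by setting its table_name to “revamp.locations”, or the whole connection can be switched with the schema_search_path setting in database.yml.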

Having both schemas in the same database makes a number of things easier, such as the process of migrating data between them. It might even be possible to instantiate the new tables as views or materialized views on top of the original tables. In our case we had a migration script which transforms the data from the old structure into the new one; the great thing about it is that the data never leaves the database, which makes the migration a much simpler process.
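
Here is a sketch of what such an in-database migration can look like, again using the hypothetical locations table and assuming the old version had nullable columns to clean up along the way:

    -- Copy and transform data from the old structure into the new one;
    -- the data never leaves the database
    INSERT INTO revamp.locations (id, name, is_enabled)
    SELECT id,
           COALESCE(name, 'unknown'),   -- old column was nullable, the new one is NOT NULL
           COALESCE(visible, false)     -- old nullable flag becomes a boolean with a default
    FROM public.locations;

    -- The sequence is not advanced by explicit ids, so keep it in step
    SELECT setval('revamp.locations_id_seq',
                  COALESCE((SELECT MAX(id) FROM public.locations), 1));

    -- Sanity check: the row counts should match
    SELECT (SELECT COUNT(*) FROM public.locations) AS old_count,
           (SELECT COUNT(*) FROM revamp.locations) AS new_count;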

In the code we also have a separation between “old” and “new” controllers, models, serializers, etc., which are namespaced separately. As we work our way through the migration process, both schemas remain in a usable state and we can easily compare old and new versions of the code. Once the process is complete we will be able to discard the old code altogether.

Measure the change.

While there are many objectives of a database revamp, such as improvements in maintainability and extensibility, there’s one that has an unforgiving metric to it — performance. It’s important to verify that the changes we’ve made yielded the expected positive results in that regard.

The granularity of measurements may vary here. Perhaps the target was a single underperforming query, perhaps it was a whole transaction. There are some tools within PostgreSQL which will be helpful for all scenarios.

First of all, the extension pg_stat_statements helps with tracking slow queries on a live system. It makes it possible to identify queries with high average running time (slow queries) as well as high cumulative time (queries that get run a lot). There are also ways to use this extension to identify missing or unused indexes. This could be a starting point when looking for the 80 per cent improvement for 20 per cent effort.
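
A minimal sketch of how to query it; note that the extension has to be listed in shared_preload_libraries before it can be created, and that the column names differ between PostgreSQL versions (see the comments):

    -- The extension must be listed in shared_preload_libraries, then:
    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

    -- The queries with the highest cumulative cost
    -- (PostgreSQL 13+ uses total_exec_time / mean_exec_time;
    --  older versions call these columns total_time / mean_time)
    SELECT query, calls, total_exec_time, mean_exec_time, rows
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;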

When focussing on individual queries, EXPLAIN ANALYZE is the tool to use to understand the sources of slowness. Understanding its output is a skill straight from the school of wizardry, but it’s definitely one worth having.
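
For example, run against the hypothetical tables from the earlier sketch:

    -- Run the query for real and report the actual plan, timings and buffer usage
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT p.name, COUNT(*) AS tag_count
    FROM products p
    JOIN product_tags pt ON pt.product_id = p.id
    GROUP BY p.name;

Things to look out for in the output include sequential scans over large tables, row estimates that are far off the actual counts, and sorts spilling to disk.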

Lastly, to benchmark you can use pgbench, a tool which will exercise the database under load. You can give it a custom script with the queries you want to test and it will run them over and over, reporting the throughput as transactions per second. Our use case for it is to benchmark the set of queries which get run to prepare the response for an API endpoint; this way we can compare how the queries on the old schema perform against those on the new schema.
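
A sketch of what such a script might look like, reusing the hypothetical projects table from earlier; the file and database names are made up, and the pgbench options shown simply select a custom script, client count and duration:

    -- endpoint_old.sql: the queries behind one API endpoint, against the old schema
    SELECT id, name FROM public.projects WHERE is_visible AND NOT is_archived;
    SELECT COUNT(*) FROM public.projects WHERE show_on_homepage;

    -- Run for 60 seconds with 4 client connections, then compare the reported
    -- transactions per second against an equivalent script for the new schema:
    --   pgbench --no-vacuum --file=endpoint_old.sql --client=4 --time=60 mydatabase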

Could this be prevented?

As satisfying as the cleaning-up process is, there is no denying that it brings development to a halt and takes time to yield those long-term results. So the question to ask is whether we could avoid this “stop the presses!” moment. I believe many of the techniques described above could be put to work in an iterative life cycle of the project, including in the early phases, thus spreading the cost of this maintenance work as well as reducing it overall.

Revamping a database schema will leave you feeling as good as Carlton.

Agnieszka is a Senior Developer at Vizzuality with magic powers in Ruby on Rails and PostgreSQL. She takes an iterative approach to writing code: writing it, deleting it, then replacing it with better code until she’s satisfied it’s up to scratch. She’ll help you do it too if you ask.
