Handling databases from a developer’s perspective

Nikolai Averin
Published in Miro Engineering · 17 min read · Sep 14, 2020

When you develop a new feature using databases, the development cycle usually includes (but is not limited to) the following steps:

Writing an SQL migration → writing source code → testing → release → monitoring.

In this article, I want to share some practical tips on shortening the cycle time for each of these stages without any loss in quality — instead, we would like to improve it.

Since we are working with PostgreSQL and our server code is written in Java, the examples will be based on this stack. However, most of these ideas will apply regardless of the database or programming language you are using.

SQL migration

The first development stage after designing the feature is writing an SQL migration. The most basic advice is not to make any manual changes to the data schema, but always to make changes through scripts that are all stored in one place.

At our company, developers write the SQL migrations themselves, so migrations are stored in a repository alongside the source code. In some organizations, the schema is managed by database administrators; in that case, the registry of migrations is kept on their side. Either way, this approach brings the following benefits:

  • You can always easily create a new database from scratch or upgrade existing ones to the current version. This allows you to deploy new test environments and local development environments quickly.
  • All databases have the same schema, which means no surprises in maintenance.
  • You can see the full history of all changes (versioning).

There are several available tools to automate this process, both commercial and free: Flyway, Liquibase, Sqitch, and others. We’re not going to compare and choose the best tool in this article because that is a big, separate topic, and many other articles have already been written about it.

We use Flyway, so here is some key information about it:

  • There are two types of migration: SQL-based and Java-based.
  • SQL migrations are immutable. After the first execution, an SQL migration cannot be changed. Flyway calculates the checksum for a migration file’s content and checks it for every run. For Java migrations to be immutable, additional manual handling is required.
  • The history of all migrations is stored in the flyway_schema_history (formerly schema_version) table. You can find the date and duration of each migration and its type, file name, and checksum there.

Under our internal arrangements, all changes to the data schema are made only through SQL migrations. Their immutability ensures that we can always obtain a current schema that is identical across all environments.

Java migrations are used only for DML, when the change can’t be written in plain SQL. A typical case is migrating data to Postgres from another database (we are moving from Redis to Postgres, but that’s a story for another day). Another example is updating the data in a large table in several transactions to minimize the table lock time. It’s worth mentioning that since Postgres 11, this can also be done with SQL procedures written in PL/pgSQL.
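For illustration, here is a minimal sketch of such a Java-based DML migration (the class name, table, and batch size are made up for the example): it updates a large table in batches, committing after each batch so that no long lock is held.

import java.sql.Connection;
import java.sql.Statement;

import org.flywaydb.core.api.migration.BaseJavaMigration;
import org.flywaydb.core.api.migration.Context;

public class V42__Backfill_book_language extends BaseJavaMigration {

    @Override
    public boolean canExecuteInTransaction() {
        // run outside Flyway's single migration transaction so that each batch commits separately
        return false;
    }

    @Override
    public void migrate(Context context) throws Exception {
        Connection connection = context.getConnection();
        connection.setAutoCommit(true); // every batch becomes its own short transaction
        try (Statement statement = connection.createStatement()) {
            int updated;
            do {
                updated = statement.executeUpdate(
                        "UPDATE book SET language = 'EN' WHERE id IN "
                                + "(SELECT id FROM book WHERE language IS NULL LIMIT 10000)");
            } while (updated > 0);
        }
    }
}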

When the Java code of a migration becomes obsolete, it can be removed because we don’t want to accumulate legacy code (the Java class for the migration remains, but its body is empty). We do this no sooner than a month after the migration has been performed in production; we believe this is enough time for all test environments and local development environments to be updated. Since Java migrations are used only for DML, their removal does not affect the creation of new databases from scratch.

An important note for those who use pg_bouncer

During the migration, Flyway acquires a lock to prevent several migrations from running concurrently. Simplified, it works like this:

  • A lock is acquired.
  • A transactional migration is executed.
  • The lock is released.

For Postgres, Flyway uses advisory locks in session mode, which means that the lock must be released on the same connection on which it was acquired. If you use pg_bouncer in transaction mode (which is most common) or in statement mode, it can return a new connection for each transaction, and Flyway will not be able to release the lock.
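A rough sketch of the session-level pattern shows why the same connection matters (the lock key and the runMigrations helper are made up; Flyway computes its own key): both statements must reach the same Postgres backend.

import java.sql.Connection;
import java.sql.Statement;
import javax.sql.DataSource;

public final class AdvisoryLockExample {

    static void migrateUnderLock(DataSource dataSource) throws Exception {
        try (Connection connection = dataSource.getConnection();
             Statement statement = connection.createStatement()) {
            statement.execute("SELECT pg_advisory_lock(4242)"); // acquired for this session
            try {
                runMigrations(connection); // hypothetical helper
            } finally {
                // must run on the SAME session; behind pg_bouncer in transaction or statement mode
                // this statement may go to a different backend, and the lock is never properly released
                statement.execute("SELECT pg_advisory_unlock(4242)");
            }
        }
    }

    private static void runMigrations(Connection connection) {
        // placeholder for the actual migration logic
    }
}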

To solve this problem, we use a small dedicated connection pool on pg_bouncer in session mode, which is intended for migrations only. On the application side, there is also a separate pool containing one connection that closes on a timeout after the migration is complete in order not to hold resources.
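A minimal sketch of the idea, assuming HikariCP as the application-side pool (the class name and timeout values are illustrative, not our exact configuration): a single-connection pool pointed at the session-mode pg_bouncer pool, with an idle timeout so the connection is released shortly after the migration.

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import org.flywaydb.core.Flyway;

public final class MigrationRunner {

    // Runs Flyway through a dedicated single-connection pool pointed at the
    // session-mode pg_bouncer pool intended for migrations only.
    public static void migrate(String jdbcUrl, String user, String password) {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl(jdbcUrl);
        config.setUsername(user);
        config.setPassword(password);
        config.setMaximumPoolSize(1);   // one connection is enough for migrations
        config.setMinimumIdle(0);       // do not keep the connection once it is idle
        config.setIdleTimeout(10_000);  // release it shortly after the migration completes
        HikariDataSource dataSource = new HikariDataSource(config);

        Flyway.configure().dataSource(dataSource).load().migrate();
    }
}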

Writing source code

The migration is in place, so now we can write the code.

There are three approaches for working with the database on the application side:

  • using an ORM (for Java, Hibernate is the de facto standard)
  • using plain SQL + jdbcTemplate or the like
  • using DSL libraries

Using an ORM lowers the SQL skill requirements, since a lot of things are generated automatically:

  • The data schema can be created from an XML description or a Java entity available from the code.
  • Object relations are defined using a declarative description — ORM will make joins for you.
  • With Spring Data JPA, even tricky queries can be generated automatically by the repository method’s signature.
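For example, a repository method like the one below needs no hand-written query at all; Spring Data derives it from the method name (the Book entity and its fields are hypothetical here):

import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;

public interface BookRepository extends JpaRepository<Book, Long> {
    // derived query: SELECT b FROM Book b WHERE b.language = ?1 ORDER BY b.title
    List<Book> findByLanguageOrderByTitle(String language);
}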

One more bonus is data caching out of the box (Hibernate offers three levels of caching).

But it is important to note that ORM, like any other powerful tool, requires specific skills to use. Without proper customization, the code is still likely to work, but far from optimally.

The opposite option is to write SQL manually. This allows you to control the queries fully — your queries will be performed exactly as you wrote them, without any surprises. But that increases the amount of manual labor and the requirements for the developers’ qualifications.

DSL libraries

Somewhere between these two approaches is another one: using DSL libraries (jOOQ, Querydsl, etc.). These are usually much more lightweight than ORM, but more convenient than completely manual work with a database. Their usage is not so widespread, so in this article, we will briefly consider this approach.

We’re going to discuss jOOQ, which offers:

  • database inspection and automatic generation of classes
  • fluent API for writing queries

Since jOOQ is not an ORM, there is no query auto-generation or caching, but it fully covers some of the pain points of the manual approach:

  • classes for tables, views, functions, and other database objects are generated automatically;
  • queries are written in Java, which ensures type safety: if a query is syntactically incorrect or has a parameter of the wrong type, it will not compile, and your IDE will immediately flag the error. You won’t have to spend time launching the application to check whether the query is correct, which speeds up development and reduces the probability of errors.

Queries in code look like this:

BookRecord book = dslContext.selectFrom(BOOK)
        .where(BOOK.LANGUAGE.eq("DE"))
        .orderBy(BOOK.TITLE)
        .fetchAny();

You can also use plain SQL if you want to:

Result<Record> records = dslContext.fetch("SELECT * FROM BOOK WHERE LANGUAGE = ? ORDER BY TITLE LIMIT 1", "DE");

Obviously, in this case, the query’s correctness and the parsing of the results are entirely up to you.

jOOQ Record and POJO

BookRecord, in the example above, is a wrapper for a row in the “book” table that implements the active record pattern. Because this class is part of the data access layer (moreover, its specific implementation), you might not want to pass it to other layers of the application but rather use your own POJO. For an easy record ↔ POJO conversion, jOOQ offers several mechanisms: automatic ones and a manual one. The jOOQ documentation contains various examples for reading, but there are no examples for inserting new data or updating it. Let us fill in this gap:

private static final RecordUnmapper<Book, BookRecord> unmapper =
        book -> new BookRecord(book.getTitle(), …); // the mapping logic goes here

public void create(Book book) {
    context.insertInto(BOOK)
            .set(unmapper.unmap(book))
            .execute();
}

As you can see, it’s simple enough.

This approach allows you to hide implementation details within a data access layer class and avoid leaks to other application layers.

jOOQ can also generate DAO classes with a set of basic methods that simplify data handling for a given table and reduce the amount of hand-written code (very similar to Spring Data JPA):

public interface DAO<R extends TableRecord<R>, P, T> {
    void insert(P object) throws DataAccessException;
    void update(P object) throws DataAccessException;
    void delete(P... objects) throws DataAccessException;
    void deleteById(T... ids) throws DataAccessException;
    boolean exists(P object) throws DataAccessException;

}

We don’t use auto-generated DAO classes at our company: we only generate wrappers for database objects and write the queries manually. Wrapper generation takes place every time the dedicated Maven module with the migrations is rebuilt. Later in this article, there are more details on how we implemented this generation.

Testing

Testing is an essential part of the development process. Useful tests guarantee the quality of your code and save time in its further maintenance. The opposite is also true: wrong tests may create an illusion of high-quality code, conceal errors, and slow down the development process. That’s why it is not enough just to decide that you’re going to write tests — you need to do it right. However, the notion of a “proper” test is rather vague, and everyone has their own idea of what it means.

The same is true for the classification of tests. In this article, we suggest the following classification:

  • unit testing
  • integration testing
  • end-to-end testing

Unit testing implies checking the functionality of each module in isolation from the others. Once again, there is no single opinion on what a module is: some believe it should be a single method, others that it should be a class. Isolation means that in tests all other modules are replaced with mocks or stubs (Martin Fowler has written about the difference between the two). Unit tests are small and fast, but they can only guarantee that an individual unit’s logic is correct.

Integration testing, unlike unit testing, allows checking how several modules interact with each other. Working with a database is an excellent example of a situation where integration testing makes sense because it’s challenging to properly mock a database so that every critical detail is taken into account. In most cases, integration tests for databases make a good compromise between execution speed and quality assurance compared to other types of testing. That’s why we will discuss this type of testing more thoroughly in this article.

End-to-end testing is the most extensive type of testing. To carry it out, you need to set up an entire environment. It guarantees the highest level of confidence in the product quality, although it is the slowest and most expensive type of test.

Integration testing

When it comes to implementing integration testing for code that works with a database, most developers face the same questions: how to start the database, how to initialize its state, and how to do both as quickly as possible.

Some time ago, it was common to use H2 in integration testing. H2 is an in-memory database written in Java with compatibility modes for most popular databases. It’s versatile and doesn’t require installing a database, which makes it a very convenient replacement for the actual database, especially in applications that don’t depend on a specific database and use only what is included in the SQL standard (which is not always the case).

The trouble begins when you start to use some tricky database features that are not supported in H2 or completely new ones from a fresh release. In general, since this is a “simulation” of a specific DBMS, it might not perfectly reflect the behavior of that DBMS.

Another option is to use embedded Postgres. This is the real Postgres shipped as an archive; it also does not require installation. You may work with it as you would with a regular Postgres version.

There are several implementations. The most popular ones were developed by Yandex and OpenTable. We used the Yandex version. Its disadvantages are a rather slow start (the archive is unpacked every time, and launching the database takes 2–5 seconds depending on the machine) and the fact that it lags behind the official release version. We also ran into an error that sometimes occurred when stopping it from code: the Postgres process would keep running in the OS until we killed it manually.

Testcontainers

The third option is to use Docker. The Testcontainers library provides a Java API for managing Docker containers from your code. This way, any dependency of your application that has a Docker image can be replaced in tests with Testcontainers. In addition, for many popular technologies there are ready-to-use classes that provide a more convenient API on top of the underlying image.

By the way, when the Testcontainers project became popular, Yandex developers officially announced that they were stopping the development of their embedded Postgres project and advised everyone to switch to Testcontainers.

Here are the pros:

  • Testcontainers is fast (starting empty Postgres takes less than a second).
  • The Postgres community releases official docker images for each new version.
  • Testcontainers has a dedicated process that kills hanging containers after the JVM shuts down, unless you have already stopped them yourself.
  • With Testcontainers, you can use a unified approach to test your application’s external dependencies, which makes things easier.

An example of a test with Postgres:

@Test
public void testSimple() throws SQLException {
    try (PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>()) {
        postgres.start();
        // performQuery is a test helper that runs the query and returns a ResultSet
        // positioned on its first row
        ResultSet resultSet = performQuery(postgres, "SELECT 1");
        int resultSetInt = resultSet.getInt(1);
        assertEquals("A basic SELECT query succeeds", 1, resultSetInt);
    }
}

If there is no separate class for the image in Testcontainers, then creating a container looks like this:

public static GenericContainer redis = new GenericContainer("redis:3.0.2").withExposedPorts(6379);

Testcontainers has extra support for JUnit 4, JUnit 5, and Spock, which makes writing tests easier if you use these frameworks.
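For example, with the JUnit 5 extension the container lifecycle is managed declaratively. A minimal sketch (the image tag and class names here are arbitrary):

import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;
import org.testcontainers.containers.PostgreSQLContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;

@Testcontainers
class PostgresContainerTest {

    // one container is started before the tests in this class and stopped afterwards
    @Container
    static final PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("postgres:13");

    @Test
    void containerIsRunning() {
        // getJdbcUrl(), getUsername() and getPassword() can be passed to a DataSource
        assertTrue(postgres.isRunning());
    }
}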

Speeding up tests with Testcontainers

Switching from embedded Postgres to Testcontainers made our tests faster because Postgres started more quickly. However, over time, the tests began to slow down again due to the growing number of SQL migrations that Flyway performs at startup. Once the number of migrations exceeded one hundred, applying them took about 7–8 seconds, which significantly slowed the tests down. Here’s how it worked:

  1. Before the next test class, a “clean” container with Postgres was launched.
  2. Flyway performed the migrations.
  3. Tests for this class were executed.
  4. The container was stopped and removed.
  5. All of this was repeated for the next test class.

Obviously, over time, the second step took longer and longer.

While trying to solve this problem, we realized that it is enough to perform migrations only once before all tests, save the container’s state, and then use it for all tests. So we changed the algorithm:

  1. A “clean” container with Postgres launches before all tests.
  2. Flyway performs the migrations.
  3. The container state is saved.
  4. Before the next test class, a previously prepared container is launched.
  5. Tests for this class are executed.
  6. The container stops and is removed.
  7. The algorithm repeats from step 4 for the next test class.

Now the execution time of an individual test doesn’t depend on the number of migrations, and with the current number of migrations (200+), the new algorithm saves several minutes on each run of all tests.

Here are some advanced tips on how to implement it.

Docker has a built-in mechanism for creating a new image from a running container using the commit command. This allows you to customize the images, for example, by changing some settings.

An important detail is that this command doesn’t save the data of mounted volumes. If you use the official Postgres Docker image, the PGDATA directory where the data is stored lives on a separate volume precisely so that data survives a container restart. Consequently, when the commit is executed, the database state is not captured.

The solution is not to use the volume for PGDATA, but to keep the data inside the container’s filesystem, which is perfectly fine for tests. There are two ways to achieve this: (1) use your own dockerfile that doesn’t declare the volume, or (2) override the PGDATA variable when starting the official container (the original directory will still be created, but won’t be used). The second way is much simpler:

PostgreSQLContainer<?> container = ...
container.addEnv("PGDATA", "/var/lib/postgresql/data-no-mounted");
container.start();

Before committing, it is recommended to issue a Postgres checkpoint so that the changes in shared buffers are flushed to “disk” (that is, to the directory pointed to by the overridden PGDATA variable):

container.execInContainer("psql", "-c", "checkpoint");

This is how the commit is executed:

CommitCmd cmd = container.getDockerClient().commitCmd(container.getContainerId())
        .withMessage("Container for integration tests. ...")
        .withRepository(imageName)
        .withTag(tag);
String imageId = cmd.exec();

It should be noted that the approach of using prepared images can be implemented for many other images, which will also save time during integration tests.

A few more words about optimizing build time

As mentioned earlier, when the dedicated Maven module with migrations is built, Java wrappers for the database objects are generated, among other things. We use a self-written Maven plugin that runs before the main code is compiled and performs three actions:

  1. It launches a “clean” docker container with Postgres.
  2. It launches Flyway to perform SQL migrations for all databases, thereby checking their validity.
  3. It launches jOOQ to inspect the database schema and generate Java classes for tables, views, functions, and other schema objects.

As you can see, the first two actions are identical to those performed when the tests start. To save time on creating the container and running migrations before tests, we have moved saving the container state into the plugin. This means that the prepared images of all databases used in the code appear in the local Docker image repository immediately after the module is rebuilt.

A more detailed code example

An adapter for PostgreSQLContainer for use in the plugin:
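The original code is not reproduced here; the following is only an illustrative sketch of what such a wrapper might look like (the class and method names are assumptions, not the actual implementation). It moves PGDATA off the mounted volume and exposes a helper that commits the migrated state as a new image.

import org.testcontainers.containers.PostgreSQLContainer;

public class MigratedPostgresContainer extends PostgreSQLContainer<MigratedPostgresContainer> {

    public MigratedPostgresContainer(String image) {
        super(image);
        // keep the data outside the mounted volume so that "docker commit" captures it
        addEnv("PGDATA", "/var/lib/postgresql/data-no-mounted");
    }

    // Commits the current state of the container (with all migrations applied)
    // as a new local image and returns its id.
    public String commitToImage(String imageName, String tag) throws Exception {
        // flush the shared buffers to the data directory before the snapshot
        execInContainer("psql", "-U", getUsername(), "-c", "checkpoint");
        return getDockerClient().commitCmd(getContainerId())
                .withMessage("Container with applied migrations for integration tests")
                .withRepository(imageName)
                .withTag(tag)
                .exec();
    }
}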

The plugin’s start goal implementation:
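Again, only a sketch rather than the real implementation (the goal name, the default image, and the way the container is handed over to the later goals are assumptions): the goal starts a “clean” Postgres container and runs the Flyway migrations against it.

import org.apache.maven.plugin.AbstractMojo;
import org.apache.maven.plugins.annotations.LifecyclePhase;
import org.apache.maven.plugins.annotations.Mojo;
import org.apache.maven.plugins.annotations.Parameter;
import org.flywaydb.core.Flyway;
import org.testcontainers.containers.PostgreSQLContainer;

@Mojo(name = "start", defaultPhase = LifecyclePhase.GENERATE_SOURCES)
public class StartPostgresMojo extends AbstractMojo {

    @Parameter(defaultValue = "postgres:13")
    private String image;

    @Override
    public void execute() {
        // 1. launch a "clean" container with Postgres
        PostgreSQLContainer<?> container = new PostgreSQLContainer<>(image);
        container.addEnv("PGDATA", "/var/lib/postgresql/data-no-mounted");
        container.start();

        // 2. run the SQL migrations against it, which also checks their validity
        Flyway.configure()
                .dataSource(container.getJdbcUrl(), container.getUsername(), container.getPassword())
                .load()
                .migrate();

        // hand the running container over to the goals that follow
        // (code generation, save-state, stop)
        getPluginContext().put("postgresContainer", container);
    }
}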

The save-state and stop goals are implemented in a very similar way and are therefore omitted here.

The plugin is then wired up in the pom.xml of the migrations module.

Release

The code has been written and tested, and now it’s time to release. In general, the complexity of a release depends on the following factors:

  • the number of databases (one or more)
  • the database size
  • the number of application servers (one or more)
  • whether the release is seamless (whether application downtime is allowed)

The first and third items above impose a backward compatibility requirement on the code since, in most cases, it is impossible to update all databases and all application servers simultaneously: there will always be a moment when the databases have different schemas, and the servers have different versions of the code.

The database size affects the migration time: the larger the database, the more likely you will have to perform a long migration.

Being seamless is partly a resulting factor: if a release is performed with downtime, then the first three items are not so important and only affect the time when the app is unavailable.

Here’s what our service looks like:

  • There are about 30 database clusters.
  • The size of a single database is 200–400 GB.
  • Several application servers (their number is auto-scaled during the day depending on the load; it can exceed 100 at peak times), and each server is connected to all databases.
  • Our releases are seamless.

We do canary releases: a new version of the application is first applied to a few servers (we call it a prerelease), and after a while, if no errors are found in the prerelease, this version is released to the rest of the servers. Thus, production servers can have different versions.

Each application server checks the DB version against the script versions from the source code (in Flyway terms, this is called validation). If the versions differ, the server will not start, which guarantees compatibility between the code and the database. This also prevents a situation where, for example, the code tries to access a database table that hasn’t been created yet because the corresponding migration ships with a different version of the server.
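In code, this check essentially boils down to calling Flyway’s validate() at startup. A simplified sketch (the class and method names are illustrative):

import javax.sql.DataSource;
import org.flywaydb.core.Flyway;

public final class SchemaGuard {

    // Called on server startup: refuse to start if the applied migrations and
    // the scripts shipped with this version of the code do not match.
    public static void checkSchemaVersion(DataSource dataSource) {
        Flyway flyway = Flyway.configure().dataSource(dataSource).load();
        flyway.validate(); // throws FlywayException on a version or checksum mismatch
    }
}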

But this certainly does not resolve the problem where, for example, a new version of the application contains a migration that removes a column from a table that may still be used by the old version of the server. Currently, we only catch such situations at the review stage (which is mandatory), but we still need to add a dedicated stage with this check to our CI/CD pipeline.

Sometimes migrations can take a lot of time (for example, updating data in a large table); in this case, we use a combined migration technique to avoid slowing down the releases. This technique consists of two steps. First, we manually run the migration on a running server (via the administration panel, without Flyway and, accordingly, without any record in the schema history table). Second, we run the same migration in the “regular” way with the next version of the server. Migrations of this kind must satisfy the following requirements:

  • The migration must be written so that it does not block the application even if it runs for a long time (the main point here is not to acquire long-held locks at the database level). To achieve this, we have internal guidelines for developers on how to write such migrations. In the future, we may share them on Medium as well.
  • A “regular” run of the migration must be able to detect that it has already been performed manually; in this case, it should do nothing except add a new record to the schema history table. For SQL migrations, such a check is an SQL query that verifies whether the change has already been applied. For Java migrations, we take a different approach: we store boolean flags that are set after a manual run (a sketch of such a check follows this list).
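For Java migrations, a minimal sketch of such a check might look like this (the class name and the flag table are illustrative, not our actual implementation):

import java.sql.ResultSet;
import java.sql.Statement;

import org.flywaydb.core.api.migration.BaseJavaMigration;
import org.flywaydb.core.api.migration.Context;

public class V57__Move_data_from_redis extends BaseJavaMigration {

    @Override
    public void migrate(Context context) throws Exception {
        try (Statement statement = context.getConnection().createStatement()) {
            // the flag is set by the manual run from the administration panel
            try (ResultSet rs = statement.executeQuery(
                    "SELECT value FROM migration_flags WHERE name = 'redis_data_moved'")) {
                if (rs.next() && rs.getBoolean(1)) {
                    // already migrated manually; let Flyway just record it in the history table
                    return;
                }
            }
            // ... the actual long-running migration goes here ...
        }
    }
}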

This approach solves two problems:

  • The release is fast (though with some manual actions).
  • All environments (the local ones for developers and the test ones) are updated automatically without manual actions.

Monitoring

The development cycle doesn’t end after the release. To understand whether the new features work (and how they do), you need to add many metrics. They can be divided into two groups.

The first group contains domain-specific metrics: it’s useful for a mail server to know the number of messages sent, for a news site to see the number of unique users per day, and so on.

The second group’s metrics are pretty much universal: they determine the technical state of the server, such as CPU consumption, memory allocation, network load, database status, and so on.

What exactly to monitor and how to do it is the topic of an impressive number of dedicated articles, so it won’t be discussed here. We would only like to remind you of the essentials (cue Captain Obvious):

Define the metrics in advance
You need to define a list of basic metrics. And you should do this in advance, before the release — not after the first incident when you don’t understand what is happening to the system.

Set up automatic alerts
This will speed up your reactions and will save time on manual monitoring. Ideally, you should know about problems before users start to notice and point them out to you.

Collect the metrics from all nodes
You can never have too many metrics or logs. Having data from each node of your system (application server, database, connection pooler, load balancer, etc.) allows you to see the complete picture of its status, and if necessary, to quickly localize the problem.

A simple example: loading data from a web page is getting slower. There could be many reasons:

  • The web server is overloaded and is taking a long time to respond to requests.
  • The SQL query now takes longer than usual to execute.
  • A longer queue appears at the connection pooler, and the application server can’t get a connection for a long time.
  • There are network problems.
  • Something else.

Finding the cause of the problem without metrics won’t be easy.

Instead of a conclusion

I would like to close with a rather obvious point: there is no one-size-fits-all solution, and the choice of one approach over another depends on the requirements of the particular task; what works well for others may not be applicable to you. But the more approaches you are aware of, the more thorough and efficient your decision-making will be. I hope you’ve learned something new from this article that will help you in the future. I would be happy to hear in the comments which approaches you use to improve your work with databases!

Join our team!

Would you like to be an Engineer, Team Lead or Engineering Manager at Miro? Check out opportunities to join the Engineering team.
