Testing Schema Migrations

Armanit Garg
PlanGrid Technology
Aug 22, 2019 · 9 min read

To make PlanGrid’s schema migrations more reliable, the infrastructure engineering interns of Summer 2019 took up a project to test schema migrations before they are run on a production database.

The Problem

Currently, the workflow for feature developers to check the correctness of a migration is to run it on PlanGrid’s dev or test environments. We also have the Database Council, a group of engineers who oversee any changes to schemas and ensure their validity. For rollouts, schema migrations for different features are usually combined into a single migration that is then run by an engineer.

While this has worked for us, it is not a very efficient workflow for developers because:

  • There is no way to test without hijacking the dev or test environments, which can only be done by one developer at a time
  • There is a dependency on the Database Council to test the migration on an actual production system
  • There is no way of experimenting with migrations and database schemas

On top of that, we don’t have a way of simulating load on a database while running a migration. We realized this was a requirement after incidents in which a migration dropped an index and created a new one, slowing down all ongoing queries and eventually overloading the database.

What do we need? An isolated environment for every developer to be able to test their migrations and simulate load and size similar to production environments. This allows developers to:

  • Test without affecting internal environments
  • Reduce the workload on the Database Council
  • Experiment with existing or new ideas

The Current Infrastructure

PlanGrid uses RDS instances (running Postgres) to support our services. We use Python + SQLAlchemy to build most of our backend services and Alembic to run migrations on them. Connections to our database go through pgbouncer, a connection pooler that acts as a proxy to our database. We configure a pgbouncer for all of our database instances.

The current infrastructure at PlanGrid
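
To make that setup concrete, here is a minimal sketch of how a service might talk to Postgres through pgbouncer with SQLAlchemy; the hostname, credentials, and database name are hypothetical placeholders, not our actual configuration.

```python
# A minimal sketch (not PlanGrid's actual configuration) of a service
# connecting to Postgres through pgbouncer with SQLAlchemy. The host,
# credentials, and database name are hypothetical placeholders.
from sqlalchemy import create_engine, text
from sqlalchemy.pool import NullPool

# Point the engine at pgbouncer (default port 6432) instead of Postgres itself.
# Since pgbouncer already pools connections, a common pattern is to disable
# application-side pooling with NullPool.
engine = create_engine(
    "postgresql://app_user:app_password@pgbouncer.internal:6432/app_db",
    poolclass=NullPool,
)

with engine.connect() as conn:
    print(conn.execute(text("SELECT version()")).scalar())
```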

Possible solutions

The Test Environment

The simplest solution would be to run our migrations first in a test environment. The test environment is meant to be a small replica of the production environment, holding a subset of its data for running our tests. We also have other services that can populate our test databases with specific kinds of fake data. So why not use this for our migration tests?

  • The size of the databases is nothing compared to what we have in production, and even with the fake-data-generating services, it would take a while to get to that point
  • The queries executed by our fake-data-generating services are also a small subset of the queries running in production. We could extend these services to run other queries, but that would require maintaining them in two different places

While this is the most obvious solution, and much of it is already in place, it will not help us simulate what is in our production database accurately.

Shadow production

The idea behind shadow environments is to capture incoming traffic for the service and replicate these requests to a copy of the service to be tested (referred to as a shadow of the service). This way, we can test changes to our services without affecting production. For our purposes, we explored two ways of replicating requests on the application level: replicate the request sent to the service itself or replicate the query sent to the database.

PlanGrid runs its services on a Kubernetes cluster with Istio, which allows traffic mirroring for HTTP requests. This means we would need to create a shadow for every deployed service, each with its own shadow database, and Istio would ensure that all requests are duplicated and sent asynchronously to these shadows. Unfortunately, the cost of maintaining such a shadow production environment would be very high. Instead, we could spin up only the shadow of the service under test with its own RDS instance. However, this also doesn’t work, since a single service might need to talk to other services, which means those would also require shadows. We could potentially have our shadow service talk to the production services, but that is problematic since most of the operations performed by these services are not idempotent, and it would also increase the load on our production services.

The other option would require us to somehow replicate Postgres queries from the application level. We looked into possible ways of doing this:

  • Create our own wrapper around SQLAlchemy that queries the shadow database asynchronously when a certain flag is enabled. We ruled this out due to the added complexity of integrating it with all of PlanGrid’s services
  • Mirror requests to a pgbouncer shadow using Istio. This is not possible since our pgbouncer runs in an auto-scaling group of EC2 instances outside of Kubernetes and doesn’t use HTTP to communicate

Shadow production would be very cool to have, not just for testing migrations but also for testing major infrastructure changes. However, given the cost of maintaining this environment and the fact that it is overkill for our purposes, we decided to table this approach, hoping someone can figure it out in the future.

Using Pgbouncer

If you think about what we are trying to do here, we are essentially looking for a reverse proxy for our Postgres requests. Pgbouncer is exactly that: it forwards Postgres requests to our database when it knows it’s safe to do so. If we could have pgbouncer forward our requests to the main database and also to a shadow database asynchronously, it would fit our use case perfectly.

We first tried to see if pgbouncer supports mirroring incoming requests. However, we couldn’t find anything about this in the documentation. We also can’t run multiple pgbouncer instances on the same machine, since only one application can listen on a given TCP port at a time. Our other ideas, such as sniffing packets, were poor designs for a production system, so we decided not to pursue them.

Pgreplay

So far we have explored solutions that mimic live incoming traffic. Is there a way to record incoming traffic and replay it later? Enter pgreplay. Pgreplay uses Postgres log files (not WALs) to replay all the queries that were logged by the database, in the same order and with the original timing intervals between queries. It creates a new session for every session ID it encounters in the log file, trying to ensure that concurrent sessions don’t interfere with each other.

This is great if you are dealing with RDS instances, since they provide error logs by default. Assuming we can get these RDS instances to log all running queries, we would need a way to download the log files, run them through pgreplay, and then replay them against our shadow RDS instance. We would also need to ensure that this shadow RDS instance is restored to the point in time where the Postgres log file starts, so that queries don’t get rerun.
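
To give a rough idea of that flow, the sketch below drives pgreplay from Python: it first parses a csvlog file into a replay file, then replays it against a target database. The file names and connection details are made up, and the flags should be double-checked against the documentation of the pgreplay version in use.

```python
# Sketch of driving pgreplay from Python: parse a Postgres csvlog into a
# replay file, then replay it against a shadow database. File names and
# connection details are hypothetical; verify the flags against the
# documentation of the pgreplay version you run.
import subprocess

LOG_FILE = "postgresql.log.2019-08-22-00.csv"  # csvlog downloaded from RDS
REPLAY_FILE = "replay.out"

# Parse-only mode: -f parses the log, -c marks it as csvlog, -o names the output.
subprocess.run(["pgreplay", "-f", "-c", "-o", REPLAY_FILE, LOG_FILE], check=True)

# Replay-only mode: -r replays the file against the shadow instance,
# preserving the original ordering and timing of the recorded sessions.
subprocess.run(
    ["pgreplay", "-r",
     "-h", "shadow-db.example.internal", "-p", "5432", "-W", "replay_password",
     REPLAY_FILE],
    check=True,
)
```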

This seems like a reasonable lightweight solution that we can implement for our RDS instances.

Other ideas

One idea that we didn’t explore was AWS Traffic Mirroring. This tool would allow us to copy network traffic from one elastic network interface (ENI) to another. We could attach these ENIs to our pgbouncers, since they already run on their own EC2 instances. However, this would also require us to use Nitro-based EC2 instances, which has its own repercussions.

Another concept that we were familiar with but didn’t look into was change data capture: the idea of capturing and logging any changes to a database and performing actions using these logs. It would be similar to the pgreplay approach and could be implemented for Postgres using triggers. However, triggers have a significant performance impact on a database, which is why we decided not to pursue it.

What we picked and why

Our final decision was to implement a solution for our RDS instances using pgreplay. We picked it for the following reasons:

  • Lightweight and simple enough to implement since RDS will handle log capturing for us
  • Pgreplay can be run locally and automated easily, so we can create a tool that runs automated tests and could be further integrated with our Jenkins / Spinnaker pipelines
  • We can spin up test RDS instances on the fly, isolated per developer, allowing them to test migrations without having to take over a shared environment
  • Requires minimal oversight since we control how the RDS instances are spun up, which also reduces the workload on the Database Council

How we did it

We tried to follow the Unix philosophy: come up with small tasks that can be used individually and also combined to run an automated test with a single command.

We used Python’s invoke library to create the tasks in our tool, boto3 to communicate with AWS’s RDS APIs, pgreplay to simulate load, and Alembic to run migrations.
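
As an illustration of the shape of the tool, the sketch below defines one such task with invoke, using Alembic’s Python API to run a migration against a given database URL. The task and option names here are illustrative, not the exact ones in our tool.

```python
# tasks.py -- illustrative sketch of one invoke task wrapping Alembic.
# The task and option names are placeholders, not the exact ones in our tool.
from alembic import command
from alembic.config import Config
from invoke import task


@task
def alembic_upgrade(ctx, db_url, revision="head"):
    """Run an Alembic migration against the given database URL."""
    cfg = Config("alembic.ini")                    # the project's Alembic config
    cfg.set_main_option("sqlalchemy.url", db_url)  # point it at the test instance
    command.upgrade(cfg, revision)
```

On the command line this becomes, roughly, invoke alembic-upgrade --db-url <test-instance-url>, run against whichever instance is being tested.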

The first step was to figure out a way of downloading log files for our Postgres instances. Since RDS only logs errors by default, we had to change the RDS parameter group to log everything. We also had to change the format of the logs from stderr to csvlog for pgreplay to be able to parse our files. With this, we could simply use boto3 to download our log files.
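
Roughly, the boto3 side of this step looks like the sketch below; the parameter group and instance identifiers are placeholders, and whether a given parameter can be applied immediately or needs a reboot depends on the parameter.

```python
# Sketch of the log-collection step with boto3. The parameter group and
# instance identifiers are hypothetical, and whether a parameter can be
# applied immediately or needs a reboot depends on the parameter itself.
import boto3

rds = boto3.client("rds")

# Turn on statement logging and csvlog output through the DB parameter group.
rds.modify_db_parameter_group(
    DBParameterGroupName="schema-test-postgres-params",
    Parameters=[
        {"ParameterName": "log_statement", "ParameterValue": "all",
         "ApplyMethod": "immediate"},
        {"ParameterName": "log_destination", "ParameterValue": "csvlog",
         "ApplyMethod": "immediate"},
    ],
)


def download_logs(instance_id, out_path):
    """Download every available log file for an instance, page by page."""
    with open(out_path, "w") as out:
        logs = rds.describe_db_log_files(DBInstanceIdentifier=instance_id)
        for log in logs["DescribeDBLogFiles"]:
            marker, pending = "0", True
            while pending:
                chunk = rds.download_db_log_file_portion(
                    DBInstanceIdentifier=instance_id,
                    LogFileName=log["LogFileName"],
                    Marker=marker,
                )
                out.write(chunk.get("LogFileData", ""))
                marker = chunk["Marker"]
                pending = chunk["AdditionalDataPending"]
```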

With the log files in hand, we needed to spin up an RDS instance to run pgreplay against. We used RDS’s point-in-time recovery feature to create this copy, parsing the time of recovery from the first login in the Postgres log file, which lets us create a simulation as close to production as possible. We actually need two of these instances: one for running the migration under load and one as a baseline to check how many more queries failed on the migrate instance. Another thing to note is that we don’t spin up pgbouncer instances for this test, which takes us a step away from simulating our production environment but also adds a higher-than-usual load, since there is no connection pooling.
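
The point-in-time recovery piece is a thin wrapper around the RDS API. A minimal sketch, assuming hypothetical instance identifiers and a hard-coded restore time (the real flow parses it from the log file):

```python
# Sketch of spinning up the two test instances with RDS point-in-time
# recovery. Instance identifiers and the restore time are hypothetical;
# in the real flow the restore time comes from the Postgres log file.
from datetime import datetime, timezone

import boto3

rds = boto3.client("rds")
restore_time = datetime(2019, 8, 22, 0, 0, 0, tzinfo=timezone.utc)

for target in ("schema-test-migrate", "schema-test-baseline"):
    rds.restore_db_instance_to_point_in_time(
        SourceDBInstanceIdentifier="production-db",
        TargetDBInstanceIdentifier=target,
        RestoreTime=restore_time,
    )
    # Block until the copy is ready to accept connections.
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=target)
```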

Once we had this, we created the following tasks to achieve a small unit of work:

  • download-logs to download specified hours of logs for a specified RDS instance
  • create-replay to create replay files from a specified log file
  • create-snapshot to use point in time recovery to create a copy of the database to be tested
  • run-pgreplay to run replays on a specified database
  • alembic-upgrade to run an alembic migration

We then combined these tasks into a bigger one, test-migration, which runs all of them to test a migration under load (a sketch of this composition follows the workflow below). The workflow that we came up with for this test was:

test-migration workflow
  1. Pick a database, and download its log files
  2. Parse log files to create replay files and get start time of queries
  3. Spin up RDS instances from the start time
  4. Once up, run pgreplay on both the instances
  5. Run the alembic migration to be tested on the migrate instance
  6. Check the results and differences in errors/time taken for pgreplay
  7. Bring down the RDS instances
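
Put together, the composite task looks roughly like the sketch below. The helper functions stand in for the individual tasks sketched above; their names, signatures, and omitted bodies are placeholders rather than the tool’s real API.

```python
# Illustrative sketch of the composite test-migration task. The helper
# functions stand in for the individual tasks sketched above; their names,
# signatures, and omitted bodies are placeholders, not the tool's real API.
from concurrent.futures import ThreadPoolExecutor

from invoke import task


def download_logs(instance_id, hours): ...                 # step 1
def create_replay(log_file): ...                           # step 2: (replay_file, start_time)
def create_snapshot(instance_id, start_time, suffix): ...  # step 3: returns new DB URL
def run_pgreplay(db_url, replay_file): ...                 # step 4
def run_alembic_upgrade(db_url, revision): ...             # step 5
def report_differences(baseline_db, migrate_db): ...       # step 6
def teardown(db_urls): ...                                 # step 7


@task
def test_migration(ctx, instance_id, hours=1, revision="head"):
    # 1-2. Download the log files and turn them into a replay file.
    log_file = download_logs(instance_id, hours)
    replay_file, start_time = create_replay(log_file)

    # 3. Spin up two point-in-time copies: one to migrate, one as a baseline.
    migrate_db = create_snapshot(instance_id, start_time, "migrate")
    baseline_db = create_snapshot(instance_id, start_time, "baseline")

    try:
        with ThreadPoolExecutor() as pool:
            # 4. Replay production traffic against both instances.
            replays = [pool.submit(run_pgreplay, db, replay_file)
                       for db in (baseline_db, migrate_db)]
            # 5. Run the migration on the migrate instance while it is under load.
            run_alembic_upgrade(migrate_db, revision)
            for replay in replays:
                replay.result()

        # 6. Compare errors and timings between the two replay runs.
        report_differences(baseline_db, migrate_db)
    finally:
        # 7. Always bring the test RDS instances back down.
        teardown([migrate_db, baseline_db])
```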

While this test is not perfect, it is a step in the right direction: it gives developers a new playground, not just for testing migrations but also for spinning up database instances to test queries, backwards compatibility, and more, without directly affecting production.

Conclusions

Schema migrations are a critical part of a company’s deployments, and they need to stay reliable while leaving room to be versatile and innovative. The more we dug into this problem, the more aware we became of the pitfalls that come with each possible solution.

While solutions like shadow production are very exciting, the amount of work required to maintain them is large. What we were looking for was a quick and simple solution that developers could start playing with right away. We also tried to make it as open-ended as possible, which would allow us to extend it to things like statically analyzing Alembic migrations.

Making something as simple as this tool will allow developers to get familiar with effective ways of writing and testing migrations. What seemed daunting at first can now be tested with a single command in a completely isolated environment, which helps developers resolve any concerns they have about affecting production users.

All in all, this was a very fun problem to tackle and gave us some major insight into how to test migrations. We are excited to see how PlanGrid uses this tool to its advantage and expands it further!
