Comparing two PostgreSQL databases with Python

Ricardo Belchior
Nov 7, 2019 · 3 min read

Our team is in the process of upgrading the infrastructure of the apps backend and we’re going to replace our existing Postgres database instance with another one. Below is a rough diagram of what will happen:

Image for post
Image for post
Timeline of events being stored during the migration.

There is an existing database being populated as soon as async events arrive. Once the new infrastructure is in place, these events will be written to both the existing and the new database.

We don’t want to lose any data during the migration and therefore the content of both databases must be the same. This is not trivial because write operations happen frequently and at any point in time. And even though we have a procedure in place to prevent data loss, we wanted to perform some sanity checks to ensure the new database has exactly the same data as before. This blog post describes how those sanity checks were made.

Existing tools

From a quick search, it seems that most of the exiting tools for comparing Postgres databases only compare schema not data [1], [2]. We also wanted to stay out of commercial software because that might involve negotiating a new contract, which would delay the upgrade.

Besides that, a full comparison of all rows in all tables would not be feasible because auto-generated ids are not guaranteed to be the same in the new database. And therefore a more fine-grained comparison was required.

Perhaps there is a tool out there that would do exactly what we wanted, but I didn’t find a straight solution to my problem.

Solution

I had some previous experience with Python and Jupyter Notebooks and decided to start with some small queries, import them as Pandas data frames and go from there. Luckily I found a blog post describing precisely this approach. The last piece was being able to compare to data frames, which was accomplished using DataComPy package. Here’s some code:

Begin by setting up a connection to both databases, with the help of utility code methods from here:

Image for post
Image for post

Then, prepare a SQL query and execute it on both databases:

Image for post
Image for post

Note that, because of the size of the database, some tables were impossible to perform a query covering all rows. Where necessary, we took a compromise by comparing only the first and last X number of rows.

Then, remove auto-generated columns which we already know that are not guaranteed to be the same in both databases:

And finally, use DataComPy to compare both data frames:

Image for post
Image for post

When the data does not match, DataComPy becomes very useful by providing a human-readable report describing the differences between the two data frames.

As a conclusion, this approach is not very scalable because a considerable amount of manual work is required — analysing each table independently — and some queries were sub-optimal and took a long time to execute. Regardless of that, we were not looking for a bullet-proof solution, only a way of sanity checking the data in the new database is consistent and did not get lost during the deployment procedure. And for that it was helpful!

I hope you found this useful, let me know your thoughts in the comments =)

And feel free to follow me on Twitter https://twitter.com/belchii

Reach Product Development

The product development and engineering teams behind Reach…

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store