Comparing two PostgreSQL databases with Python
Our team is in the process of upgrading the infrastructure of the apps backend and we’re going to replace our existing Postgres database instance with another one. Below is a rough diagram of what will happen:
There is an existing database being populated as soon as async events arrive. Once the new infrastructure is in place, these events will be written to both the existing and the new database.
We don’t want to lose any data during the migration and therefore the content of both databases must be the same. This is not trivial because write operations happen frequently and at any point in time. And even though we have a procedure in place to prevent data loss, we wanted to perform some sanity checks to ensure the new database has exactly the same data as before. This blog post describes how those sanity checks were made.
From a quick search, it seems that most of the exiting tools for comparing Postgres databases only compare schema not data , . We also wanted to stay out of commercial software because that might involve negotiating a new contract, which would delay the upgrade.
Besides that, a full comparison of all rows in all tables would not be feasible because auto-generated ids are not guaranteed to be the same in the new database. And therefore a more fine-grained comparison was required.
Perhaps there is a tool out there that would do exactly what we wanted, but I didn’t find a straight solution to my problem.
I had some previous experience with Python and Jupyter Notebooks and decided to start with some small queries, import them as Pandas data frames and go from there. Luckily I found a blog post describing precisely this approach. The last piece was being able to compare to data frames, which was accomplished using DataComPy package. Here’s some code:
Begin by setting up a connection to both databases, with the help of utility code methods from here:
Then, prepare a SQL query and execute it on both databases:
Note that, because of the size of the database, some tables were impossible to perform a query covering all rows. Where necessary, we took a compromise by comparing only the first and last X number of rows.
Then, remove auto-generated columns which we already know that are not guaranteed to be the same in both databases:
And finally, use DataComPy to compare both data frames:
When the data does not match, DataComPy becomes very useful by providing a human-readable report describing the differences between the two data frames.
As a conclusion, this approach is not very scalable because a considerable amount of manual work is required — analysing each table independently — and some queries were sub-optimal and took a long time to execute. Regardless of that, we were not looking for a bullet-proof solution, only a way of sanity checking the data in the new database is consistent and did not get lost during the deployment procedure. And for that it was helpful!
I hope you found this useful, let me know your thoughts in the comments =)
And feel free to follow me on Twitter https://twitter.com/belchii