Database design: The pitfalls of database upgrades (and how to fix them)

Santhosh Reddy
Published in BYJU’S Tech Blog
May 6, 2022 · 4 min read

Aurora is part of Amazon RDS, a managed database service that makes it easy to set up, run, and scale a relational database in the cloud. It offers decoupled storage and higher IOPS compared to a vanilla self-managed deployment. Aurora PostgreSQL 9.x reached end of life on January 31, 2022.

This article uses PostgreSQL as an example to highlight issues that often arise when version upgrades are performed on critical databases, and how such pitfalls can be avoided.

The Issue

We run most of our relational database workloads on PostgreSQL, specifically Aurora PostgreSQL. With the 9.x version of Aurora PostgreSQL reaching end of life, we started testing the upgrade to catch any potential post-migration issues and to estimate downtime. Although the upgrades completed in a reasonable amount of time on most test clusters, the team discovered an unusual issue post-migration: even with a very long statement timeout, the majority of queries timed out.

Debugging the Issue

In our attempt to identify the root cause, we dug into the documentation of the new version, particularly the changelog and its recommended upgrade steps. Since hash indexes were not write-ahead logged (WAL-logged) in 9.x, the upgrade notes recommend rebuilding any hash indexes in use. However, none of the indexes in our database clusters are hash indexes, primarily because they were not WAL-logged in the first place. Aurora PostgreSQL 10.x also changed the way joins are handled, and the queries that were timing out contained joins, so we investigated the join-related functionality in the new version and checked whether we had missed toggling any flags. Even after several tweaks to the join-related flags, nothing worked. Given the extreme latency, we began to question whether the queries were benefiting from indexes at all. Examining the query plans was one way to confirm this. To our surprise, the query plans were quite inefficient compared to the 9.x version running in production.
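For reference, a quick way to confirm whether any hash indexes exist is to query the standard pg_indexes view; this is a generic check rather than something we ultimately needed, and the REINDEX target below is only a placeholder.

-- List hash indexes in the current database (these needed a rebuild when
-- coming from 9.x, where they were not WAL-logged).
SELECT schemaname, tablename, indexname
FROM pg_indexes
WHERE indexdef ILIKE '%USING hash%';

-- Rebuild one if it turns up (the index name here is illustrative).
-- REINDEX INDEX my_hash_index;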

Next, we checked whether any indexes were corrupted. As it turned out, there weren’t any. After digging further through the documentation, the team came up with another theory: the planner statistics could be stale or missing.
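One general-purpose way to check B-tree indexes for corruption is the amcheck extension. This is a rough sketch rather than the exact procedure we followed, and extension availability can vary on managed services.

CREATE EXTENSION IF NOT EXISTS amcheck;

-- Run a structural check on every valid B-tree index in the current database.
-- bt_index_check() raises an error if it finds corruption; otherwise it returns nothing.
SELECT c.relname AS index_name,
       bt_index_check(index => c.oid)
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_am am ON am.oid = c.relam
WHERE am.amname = 'btree'
  AND i.indisvalid;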

Query Plans and Planner Statistics

Before a query is run, the database engine creates an execution plan: which indexes to use, which scan and join algorithms to apply, and so on. These plans are created by a component known as the query planner. A plan can be viewed by prefixing a SQL query with EXPLAIN ANALYZE. A sample output looks something like this:

EXPLAIN ANALYZE SELECT * FROM tbl WHERE col1 = 1;

                          QUERY PLAN
--------------------------------------------------------------------
 Seq Scan on tbl  (cost=0.00..169247.80 rows=9584 width=8) (actual time=0.641..622.851 rows=10000 loops=1)
   Filter: (col1 = 1)
   Rows Removed by Filter: 9990000
 Planning time: 0.051 ms
 Execution time: 623.185 ms
(5 rows)

But how does the planner decide which algorithm to use, in what order to execute the steps, and so on? A large part of this decision is driven by statistics that Postgres collects, known as planner statistics. These statistics include information such as:

  • Total number of rows/columns in a table/index
  • Number of disk blocks occupied by a particular table/index
  • Most common values of a column
  • Number of null fields
  • Number of distinct values for a column

Postgres stores these statistics in the system catalogs pg_class and pg_statistic (the latter exposed more readably through the pg_stats view). They are updated during index creation or rebuilding and during routine vacuuming and analysis.
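These catalogs can be inspected directly. Using the tbl table from the earlier example:

-- Table-level estimates the planner relies on: approximate row and page counts.
SELECT relname, reltuples, relpages
FROM pg_class
WHERE relname = 'tbl';

-- Per-column statistics: null fraction, distinct values, most common values.
SELECT attname, null_frac, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'tbl';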

Verifying the Hypothesis

One way to rebuild these statistics is to run ANALYZE on a particular table or on the whole database. Since any table could have been affected, we ran a full ANALYZE on each database. How long this takes depends on the amount of data in the cluster: on some clusters it took minutes, on others it took hours. But after this, the query plans improved significantly and were comparable to the efficient plans on the previous version.
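For example, using the same tbl table as before:

-- Rebuild statistics for a single table.
ANALYZE tbl;

-- Or rebuild statistics for the whole database, logging progress table by table.
ANALYZE VERBOSE;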

Learnings

Although there are better ways to upgrade Aurora PostgreSQL, such as logical replication, it is surprising that an in-place upgrade requires manually running ANALYZE, which can extend downtime during and after the migration. So:

  • Go through the changelog before running major version upgrades
  • Check for corrupt indexes after the migration activity
  • Check collations if joins are being performed. If collations differ, join performance takes a hit, and after a version upgrade this can surface immediately (a quick way to check is shown after this list).
  • Run ANALYZE VERBOSE before considering the database cluster ready for production load. Statistics are lost on essentially any PostgreSQL major version upgrade unless the documentation explicitly states otherwise
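A simple way to review collations, assuming the standard catalogs (a generic check, not the exact queries we ran):

-- Database-level collation and character-type settings.
SELECT datname, datcollate, datctype
FROM pg_database;

-- Columns that carry an explicit collation overriding the database default.
SELECT table_schema, table_name, column_name, collation_name
FROM information_schema.columns
WHERE collation_name IS NOT NULL;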

If you are excited about working on similar challenges, join us.
