Case Study: A Server Crash Caused by a PostgreSQL / glibc Problem on AWS Aurora

Ahmet Kara
MobileAction Technology
5 min read · Nov 19, 2022

Last Friday evening we had a crash on one of our core databases. Thank God it didn’t take too long to resolve, but we took away some lessons from the problem.

The Story

While we were preparing for our regular weekly tech meeting, we saw server crash alarms triggered both on AWS and on our application servers on Heroku.

We were used to seeing these alarms because of some high-load and memory problems we hadn’t solved yet. So we didn’t panic, and a few of our teammates focused on the problem to find a solution. We generally scale a few servers up or down, or trigger restarts, as the fastest fix.

But this time the problem was different. There was nothing wrong with the Heroku application servers; instead, our database on AWS RDS (Aurora PostgreSQL with 1 read replica) was not responding.

Our Desperate Actions

Working with a cloud-based managed service brings many benefits. But when a problem occurs, you are left with only a limited set of tools in your hands.

After identifying the problem, our team tried the actions that had been enough in our previous incidents:

  • Call a reboot on the server: The reboot calls didn’t respond. The servers couldn’t complete the reboot and got stuck in the rebooting state.
  • Remove the read replica: The read replica seemed to be the problem; it was the slowest to respond. We tried to remove it and continue with a single instance, but that didn’t change anything. The writer instance didn’t respond either.
  • Create a new instance from the latest data (point-in-time restore): It took a long time (about 1 hour) to build, and the new server didn’t respond to our connection requests either.
  • Create a new instance from the daily snapshot: It took a while, but the new instance worked properly. However, the snapshot was 14 hours old, which meant switching to it would cost us 14 hours of data. We kept this instance as plan B and continued the investigation.
  • Create a new instance from the data as of 10 minutes before the crash: AWS supports point-in-time restores, so we tried to build a new instance from the state 10 minutes before the crash event. But again it seemed to take too much time, so we kept looking for a different solution.
  • Ask AWS for support: Finally, we decided to get support from AWS experts. We explained our problem and pointed them to the problematic server instance.

Our technical support plan was not active because of its price: the AWS technical support package costs around 10% of your monthly bill, and since we were already paying enough, we had decided not to take on that additional cost. But having exhausted all the alternatives we could think of, this was the only remaining option.

AWS Support Responses

While we were waiting for a response to our support ticket, we received a message from a different ticket that had been created automatically:

Your database cluster "XXX" in region "us-east-X" experienced a failure on index: 'idx_kt_XXX_key'.

We performed a recovery procedure to get the cluster into a healthy state but as a result had to disable the index. Please drop and recreate the index using the commands below:

DROP INDEX idx_kt_XXX_key;

CREATE INDEX idx_kt_XXX_key ON XXX.XXX USING btree (XXX, XXX, XXX);

Their message was correct: our server was operational again. We applied what they asked and everything worked fine. We re-created our read replica and turned all services back on.
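
A side note: if the affected table is large and still serving traffic, PostgreSQL lets you recreate an index without blocking writes by adding CONCURRENTLY. A minimal sketch using the masked placeholder names from the message (note that CREATE INDEX CONCURRENTLY cannot run inside a transaction block):

-- Drop the disabled index and rebuild it without taking a write lock on the table.
-- The XXX placeholders stand in for the masked schema, table, and column names.
DROP INDEX IF EXISTS idx_kt_XXX_key;
CREATE INDEX CONCURRENTLY idx_kt_XXX_key ON XXX.XXX USING btree (XXX, XXX, XXX);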

The AWS support team had probably received many similar issues and applied their solution to our instance as well. They would likely have sent this message even if we hadn’t created a support ticket (we’ve received similar proactive support for different problems in the past).

But what happened?

We asked for the problem details to understand the root cause, and got this explanation:

PostgreSQL depends on a library called glibc in the operating system for collation support. Your RDS for PostgreSQL instance or Aurora PostgreSQL instance is periodically updated with newer operating system versions that may also include a newer version of the glibc library. Rarely, newer versions of glibc may change the sort ordering or collation of some characters. This can cause data to sort differently or cause invalidation of index entries.

To mitigate this issue, apply all operating system updates in every instance in your cluster, and then you may need to rebuild indexes that are affected, using the current glibc sort order. A description of how to check which b-tree indexes are impacted and to rebuild the affected b-tree indexes is included below. For other index types, we recommend you rebuild them as no automated check mechanism is available.
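
For context, you can check whether your database even relies on a glibc locale. A small sketch of such a check (it only covers the database’s default collation, not per-column collations):

-- Show the default collation and ctype of the current database.
-- Locales such as en_US.UTF-8 come from glibc on Linux; the C and POSIX
-- locales sort by byte order and do not depend on glibc.
SELECT datname, datcollate, datctype
FROM pg_database
WHERE datname = current_database();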

The problem was a common open-source dependency issue. Since each library is developed by a different vendor, backward compatibility is not guaranteed. You have to review your dependency upgrades to protect your product’s integrity.

The problem above was bigger than a classic open-source dependency issue, because the library in question is embedded in the operating system. You have little control over OS libraries; you may pin your product to a specific OS version, but that limits your target community.

Using an open-source tool brings the risk of crashes or vulnerabilities. You have to stay alert and accept that you adopt it at your own risk.

How to protect yourself from this problem

AWS support shared their solution for identifying the problem on PostgreSQL instances:

1. Use the amcheck extension to check b-tree indexes:

CREATE EXTENSION amcheck;

SELECT bt_index_check(<indexrelid>);

If the check returns a single blank row with no error, the index is not impacted; any corruption in the b-tree index is reported as an error.
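
Checking every index by hand is tedious. The PostgreSQL amcheck documentation contains a query that runs bt_index_check over all b-tree indexes, largest first; a lightly adapted sketch (it takes only the same locks as a plain SELECT, but it stops with an error at the first corrupted index it finds):

-- Check all valid, non-temporary b-tree indexes, largest first
-- (adapted from the amcheck documentation).
SELECT bt_index_check(index => c.oid),
       c.relname,
       c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
WHERE am.amname = 'btree'
  -- skip temp tables from other sessions and unusable indexes
  AND c.relpersistence != 't'
  AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC;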

2. Rebuild any affected b-tree indexes:

REINDEX (VERBOSE) INDEX <index-name>;
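
Once an index has been rebuilt, the check from step 1 can be repeated to confirm it now passes (using the masked index name from the AWS message as a placeholder):

-- Re-check the rebuilt index; if it returns without an error, the index is consistent again.
SELECT bt_index_check('idx_kt_XXX_key'::regclass);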

Lessons Learned

  • If you can’t find a quick solution, ask the service vendor for support, regardless of the cost. Customer satisfaction is more important than the support fee.
  • Keep an eye on the open-source tools you depend on. Problems are always possible, and you can’t lean on them without caution.
  • Backups are important; keep them as fresh as possible.

Ahmet Kara
MobileAction Technology

Head of Engineering @MobileAction. BS&MS from @CS_Bilkent, PhD from @METU_ODTU