Troubleshooting Daily Write Spikes on Cermati’s PostgreSQL RDS
On a peaceful day at Cermati, I was looking at our NewRelic monitoring dashboard. Everything seems to be working just fine at the moment, so I went to see our performance a few hours earlier and found that there’s a spike on database access time around 1.15 in the morning.
After some checking to the production AWS RDS instance, we noticed that during the time there’s a huge IOPS spike. It always happens around 1.15 to 1.30 every day.
So the hunt for the root cause of this performance spike began. For this task, I worked together with Putra Sattvika, a fellow engineer on Cermati’s infrastructure team.
Scanning the Environment
The first thing we did was to check the IOPS during the daily spikes. According to AWS’ RDS instance performance monitoring graphs, the write IOPS is much higher than the read IOPS during the spikes.
The spikes happens around the same time every day, and we have background workers running and also some other processes, so we decided to try assessing the condition in our system during the time of incident.
Traffic is among the first things we checked regarding the spikes, as it’s a low-hanging fruit since we deployed ELK for our web request logs. We’d like to see if there’s any anomaly regarding the request during the time of the spikes.
We quickly found that there’s none. The spikes happen around the time when we have the lowest traffic of the day, and the endpoints hit by the traffic are about the same with the ones hit by usual traffic.
We also checked our background worker jobs to see if there’s anything that writes a large amount of data to the database in the host experiencing the write IOPS spikes.
We initially checked the schedule of worker job runs, but didn’t found anything that’s run specifically at that moment. We knew that some of the worker jobs regularly process a large volume of data and write the result to the problem database, so we opened up the logs to see if they process a particularly large volume every day around 1.15 in the morning since one of the worker EC2 instances seem to have high resource usage around the time.
After checking the worker logs, we found that there’s nothing from our background worker that writes so much data to the database at the time. Even during its peak time, the volume just doesn’t go up to the point required to cause the spikes.
We proceeded to see if there are any unusual queries running at the same time with the spikes. The query logs are analyzed using
logtop to see if we have any anomalies in the queries executed by the database before, during, and after the spikes. We didn’t find anything unexpected from the scripted query analysis, as the composition of queries executed during the spike period isn’t different with the ones we have during the time when there’s no spikes.
So we proceeded to read the log of queries performed around the time of spike, but we didn’t find anything particularly strange with the queries. The only thing that’s different is that the ETL extraction queries are executed around the time of spike, so we noted it to investigate it further.
Database Snapshot Schedule
Aside from ETL extraction job schedule, we also suspected that the scheduled daily database snapshot might have triggered the RDS instance to write data from memory buffers to disk and ended up causing the write IOPS spikes.
The daily snapshot also happens to be taken around the time of spike, so we decided to try and see whether changing the schedule moves the time of spike. So we scheduled the daily database snapshot to be 30 minutes earlier than usual and see if it changed the timing of the write IOPS spikes.
It doesn’t, so we concluded that the daily snapshots have nothing to do with it and focused on the ETL processes.
The ETL Processes
We checked to see the queries performed by the ETL processes to the target database. As it should, the ETL processes only perform
SELECT queries. We suspected that the
SELECT queries triggered some disk writes, so we performed
EXPLAIN ANALYZE on the select queries we executed every day to see if their planned execution is expected to cause the write IOPS spikes.
PostgreSQL can perform disk writes during
SELECT queries, and the
EXPLAIN ANALYZE we performed on our daily queries also shows some cases that can trigger disk writes from buffers. But those numbers are far below the volume written by the database to cause the spikes.
We didn’t initially see the how the write IOPS spikes relate with the ETL processes aside from the correlation between the spikes and the period of extraction query executions, until we found that among our ETL processes the execution of one extraction task always happens exactly at the same time as the highest write IOPS spike of the day. We tested scheduling the ETL flow containing the process to be executed 1 hour later for the next day. As we expected, the biggest spike also happens 1 hour after the other small ones, happening at the exact same time with the process whose execution we rescheduled.
The spike happens when we’re performing extraction on the biggest PostgreSQL table we have on production. The table contains big records, and those records are processed on a daily basis to put it to our OLAP database. We also have some other smaller spikes when performing extraction from smaller tables, so if left alone this problem might leave us with a daily downtime when the tables grow to the point that the performance spikes are significant enough to bring down the services connecting to the OLTP database. Luckily, we haven’t got to that point.
As we already pinpointed where our problem came from, we performed tests to see the performance of the same query we ran on production. We fed a test database with a set of sample data, using a table structured exactly like the table we extract from on production and dummy records that mimics the data structure of our production records. We managed to reproduce similar problem on the test environment as what we saw on production.
Along the way, we discovered about PostgreSQL hint bits from this wiki entry. While reading the entry, we stumbled upon this paragraph.
Another point to note is that the hint bits are checked and set on a per tuple basis. Although a simple scan will visit all the tuples on a page and update all their hint bits at once, piecemeal access (such as fetching single tuples via index scans) might result in many writes of the same page as various hint bits get updated over time.
The wiki entry mentioned that fetching tuples using index scans might cause PostgreSQL to write the same page over and over again to the disk. Our query contains an
ORDER BY statement, and we suspect that it is the root cause of the write IOPS spike. From that wiki paragraph, it seems to us that the problem is caused by the
ORDER BY statement in the query triggering the same page from the storage to be rewritten multiple times when the hint bits of the reordered tuples — which belongs to that storage page — are scanned and rewritten according to the query-specified ordering.
But we’re not very clear with when exactly the hint bits are updated, and whether it’s updated in every query. Another guess is that PostgreSQL writes the ordering result to disk since it exceeds the memory space allocated for in-memory record processing, which makes sense in this context.
By the way, on the previous
EXPLAIN ANALYZE queries we scripted happened to hit some wrong, unused tables, instead of the correct tables due to the naming convention when we performed data migration for the big tables before to optimize our database performance. So it didn’t show any disk write because the records to process for the day on those tables were practically zero. When we manually perform
EXPLAIN ANALYZE to the right table with
ORDER BY statement, we got a line similar to this.
Sort Method: external merge Disk: 2010376kB
The disk write will go away if we omit the
ORDER BY statement.
So we hypothesized that omitting the
ORDER BY statement would fix our performance problem due to abnormal write operation load. To test our hypothesis, we modified our query to avoid using the
ORDER BY statement on our test environment’s query and executed it on the test database. As we expected, the query performed faster.
We modified the query to be executed in production to avoid using the
ORDER BY statement and deployed the fix. This article explains that the external merge sort method writes to disk, and there are other sort methods we can try out. But we decided that the sort wasn’t really necessary for our data-processing flow to work.
After the deployment, our daily write IOPS usage around the time looks like the above chart. It eliminated the unnecessary disk write operations performed by PostgreSQL due to repeatedly writing the same page to disk in order to update the hint bits. As the result, the write IOPS spikes are gone and it’s just small write IOPS bumps at the time of our ETL processes’ runtime — which is pretty normal, since there are database write operations happening every time on the database.
Some Notes on Hint Bits
In the previous section, we learned that the
ORDER BY statement was the one triggering the disk writes. Our production problem is already solved, we got a reference explaining the relation between sort methods and disk write operations. But we also found some references regarding hint bits, which we don’t really understand yet but at the beginning seems to have similarities with the problems we have regarding how the hint bit updates worked and whether they are updated during
We probed around PostgreSQL’s codebase in their GitHub mirror repository and see where updating the hint bits is relevant to PostgreSQL’s operations. It seems that the hint bits should normally only be set during
VACUUM operations on
visibilitymap_clear() function calls. We didn’t see any
VACUUM operations happening on our tables around the time of spike when we checked the tables’ last vacuumed and auto-vacuumed time. Some heap operations on PostgreSQL also call those functions, and according to the wiki link we referenced earlier
SELECT queries can trigger the hint bit updates.
Any examination whatsoever of a tuple — whether by vacuum or any ordinary DML operation — will update its hint bits to match the commit/abort status of the inserting/deleting transaction(s) as of the instant of the examination. A plain SELECT, count(*), or VACUUM on the entire table will check every tuple for visibility and set its hint bits.
In the end, we’re still not very sure how hint bits affected our query performance at this point. But it should be worth for us to explore further in the future.
This incident didn’t get to the point where it could cause severe disruptions on our production system because we managed to identify and solve the problem early enough, before the tables grow even bigger than it currently is which would lead to the queries triggering even more disk write operations.
Aside from avoiding possible future emergency firefighting, from this troubleshooting we also got to learn various things about our own system. When we were exploring our own system to find the root cause of the spikes, we learned about the activities of background worker processes, the regular traffic pattern on related services, and some things regarding how PostgreSQL physical storage and query execution works.
In the PostgreSQL implementation, some quirky things are done in order to optimize the runtime performance. Understanding these things will help us to understand why PostgreSQL behaves in a certain way and how to plan our applications’ interactions with the RDBMS.
From this incident, we realized that we still have so much to learn even from PostgreSQL alone.
This article is adapted from one of our post-mortem report to be shared internally with the engineering division, and presented in one of our bi-weekly internal tech talk sessions held in April 2018.