TLDR Troubleshooting Daily Write Spikes on PostgreSQL RDS by E. Tunggawan

Proactive investigation of a spike anomaly in DB metrics led to the discovery of huge temp file writes and one crazy Pg quirk.

pavel trukhanov
some-tech-tldrs
2 min read · Aug 9, 2018


Once upon a time…

  • Looked in monitoring — found a spike in database access/response time.
  • The AWS RDS instance showed a huge IOPS spike at the same time.
  • It was happening every day.
  • Write IOPS were much higher than read IOPS during these spikes.

So the guessing game began:

  • Was that traffic? No, nothing unusual in traffic profile.
  • Was that some background jobs, crons? — Nope.
  • Oh, I know — let’s look in logs!

We analyzed the query logs with awk and logtop, looking for anomalies in the queries executed before, during, and after the spikes. We didn’t find anything unexpected.

The only thing that’s different: the ETL extraction queries are executed around the time of the spike.

Found an ETL step that correlated highly with the IO spikes. But the ETL process should only perform SELECT queries. So why the high write IO?

PostgreSQL can perform disk writes during SELECT queries.

Wait. What? O_o

So-called PostgreSQL “hint bits”:

… side effects of the implementation: hint bit processing can result in heavy writes to a database table even though you're just reading from it.

… while a simple scan will visit all the tuples on a page and update all their hint bits at once, piecemeal access (such as fetching single tuples via index scans) might result in many writes to the same page as various hint bits get updated over time.
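
To make this less magical, here’s a minimal sketch of how one could observe it (table name and row count are made up; any reasonably recent Postgres should behave similarly). The BUFFERS option of EXPLAIN reports how many pages a query dirtied, so a plain SELECT over freshly committed rows shows its write work:

CREATE TABLE hint_demo (id int);
INSERT INTO hint_demo SELECT generate_series(1, 1000000);

-- First read after the insert commits: the SELECT checks each tuple's
-- transaction status, sets hint bits on the pages, and dirties them.
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM hint_demo;
-- Look for a line like:
--   Buffers: shared hit=... read=... dirtied=...
-- "dirtied" pages are writes caused by a read-only query.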

Learned that this “bad” ETL query contained an ORDER BY clause.

EXPLAIN ANALYZE for that query showed:

...
Sort Method: external merge Disk: 2010376kB
...

Learned that external merge sort writes to disk a lot, and that there are other sort methods we can try out.
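
For reference, here’s what “trying another sort method” could look like (the query and table are hypothetical, not the actual ETL query): if work_mem is big enough for the whole sort to fit in memory, Postgres switches from the on-disk external merge to an in-memory quicksort.

-- Hypothetical session-level experiment:
SET work_mem = '512MB';
EXPLAIN ANALYZE SELECT * FROM events ORDER BY created_at;
-- With enough memory the plan reports something like
--   Sort Method: quicksort  Memory: ...kB
-- instead of
--   Sort Method: external merge  Disk: ...kB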

But we decided to drop the ORDER BY altogether: figured we didn’t need it in the ETL after all.
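
In other words, the fix was roughly this (table and column names are invented; the original post doesn’t show the actual extraction query):

-- Before: forces a sort of the whole result set, spilling ~2 GB to disk.
SELECT id, payload, created_at FROM events ORDER BY created_at;

-- After: no ORDER BY, no sort node, no temp files.
SELECT id, payload, created_at FROM events;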

Deployed this fix and things got better:

Still a spike, but not as HUGE as before.

The remaining spikes are probably due to hint bits.

Learn more about hint bits in “How SELECT Queries can produce disk writes”.

TLDR: we proactively fixed an issue that could’ve caused severe disruptions on a production system. We also got to learn various things about our own system. From this incident, we realized there’s still a lot to learn about PostgreSQL alone.

This week’s sponsor is okmeter.io: complete Postgres monitoring. Stay on top of PG temp file writes and hint bits, and be prepared for anything.
