Vacuuming ‘dead tuples not yet removable’ in PostgreSQL and other tales.

Mathieu Tamer
Published in precogs-tech
Jul 6, 2018

PostgreSQL queries were taking much longer than expected and we didn’t know why. We found an expert to help us solve the problem and hopefully this post will allow you to avoid the same issue.

When it came time to choose which database to use, our team at Precogs went with PostgreSQL. It is robust, easy to use, has a large support community and is available on AWS RDS, so for us the choice was quite obvious.

But even with a great database system, we still needed to stay in control of how we used it. Despite being aware of the basic admin tips and of how queries can overload a database, we still made a bunch of mistakes. Those mistakes led to some pretty annoying database issues, and those issues and how we solved them are the reason for this blog post. I won't focus too much on the solutions we chose, but rather on the different problems and on where we initially looked in the wrong place.

The principal issues we had were:

  • Simple queries like SELECT count(*) FROM table_with_1MLine took… a really… long… time. Like, more than 5 minutes. Even on tables with more than a million rows, that shouldn't really happen (there's a quick bloat check sketched right after this list).
  • Diverse operations, such as importing data, also took way too long.
  • Plus, as soon as we started analyzing these two issues, we also discovered that vacuuming our big tables took way too much time.
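For context, one quick way to see whether table bloat might explain that kind of slowness is to compare a table's on-disk size with its live and dead row counts. This is only an illustrative check (the table name 'opportunities' is just an example), not the exact diagnostics we ran:

-- On-disk size vs. live and dead tuples for one table
SELECT pg_size_pretty(pg_total_relation_size('opportunities')) AS total_size,
       n_live_tup,
       n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'opportunities';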

In a nutshell, the database was way too slow for our purposes.

The first issue to tackle was the vacuuming. It wasn’t working as expected:

  • When analyzing pg_stat_activity we found that one of our tables, ‘opportunities’ (a calculation of client purchasing opportunities throughout the day), was being vacuumed a lot: approximately every 10 minutes.
  • We set rds.force_autovacuum_logging_level to log in order to record every auto-vacuum run, and we found that each auto-vacuum of this table lasted about 10 minutes.
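If you want to keep an eye on the same thing, queries along these lines (illustrative, not our exact monitoring) show which auto-vacuum workers are running and how often each table gets auto-vacuumed:

-- Auto-vacuum workers currently running (their query text starts with 'autovacuum:')
SELECT pid, query_start, query
FROM pg_stat_activity
WHERE query LIKE 'autovacuum:%';

-- How often each table has been auto-vacuumed, and how many dead tuples remain
SELECT relname, last_autovacuum, autovacuum_count, n_dead_tup
FROM pg_stat_user_tables
ORDER BY autovacuum_count DESC;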

Put simply, this table was constantly being vacuumed. A few online searches later we discovered two things:

  1. Auto-vacuuming should keep tables from growing unnecessarily large by removing tuples that are marked as deleted
  2. A widely accepted rule is that if auto-vacuuming takes too much time, it is because it isn't being run often enough

To solve this issue, we ran auto-vacuum (and analyze) more often on the problem tables by changing:

autovacuum_vacuum_scale_factor
autovacuum_vacuum_threshold
autovacuum_vacuum_cost_limit
autovacuum_analyze_scale_factor
autovacuum_analyze_threshold

Plus we increased maintenance_work_mem so that auto-vacuum had enough memory to work with.
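The per-table settings above can be changed as storage parameters with ALTER TABLE. The values below are only an illustration of the kind of change, not the actual numbers we used:

-- Vacuum and analyze 'opportunities' after roughly 1% of it changes,
-- instead of the default 20% / 10%
ALTER TABLE opportunities SET (
  autovacuum_vacuum_scale_factor  = 0.01,
  autovacuum_vacuum_threshold     = 1000,
  autovacuum_vacuum_cost_limit    = 2000,
  autovacuum_analyze_scale_factor = 0.01,
  autovacuum_analyze_threshold    = 1000
);

maintenance_work_mem itself is a server-wide setting; on RDS it is changed through the DB parameter group rather than with SQL.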

The result was… well, it didn’t work at all!

At this point we decided we needed some help, so we searched for an expert and found Vik Fearing from 2ndQuadrant here in Paris. With their help we ran an audit and found the real problem, along with plenty of other valuable advice and a much better understanding of how our database works. Win!

It turned out that the issue wasn't that auto-vacuum wasn't being run often enough, but that it was being blocked by something (spoiler alert: at least one long-running transaction) and couldn't remove dead tuples while it ran. It became really clear when we examined the logs:

pages: 0 removed, 21146 remain, 0 skipped due to pins
tuples: 0 removed, 152873 remain, 26585 are dead but not yet removable
buffer usage: 104893 hits, 54167 misses, 2165 dirtied
avg read rate: 3.548 MB/s, avg write rate: 0.142 MB/s
system usage: CPU 0.28s/0.97u sec elapsed 119.26 sec

Yep, that's right: 26,585 tuples ‘dead but not yet removable’ and zero rows removed after two minutes of auto-vacuuming… what a waste of time and resources!
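The culprit in this situation is an old transaction that is still open: as long as it remains open (even if it is just sitting ‘idle in transaction’), vacuum cannot remove tuples that died after it started. A query along these lines (illustrative, not our exact audit query) shows which sessions have had a transaction open for a long time:

-- Sessions ordered by how long their current transaction has been open;
-- an old open transaction prevents vacuum from cleaning up behind it
SELECT pid, state, xact_start, now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start;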

We had also made the problem worse by tuning analyze to run more often, which caused extra vacuuming: ANALYZE saves its results in a catalog table, and when it runs too often it creates a lot of dead tuples there, which in turn need to be vacuumed…

Finally the solution was found:

  • We were using an ORM (Ruby on Rails), and some long operations (especially deletes) were encapsulated in a single transaction. We shortened those transactions and changed the way we delete; the sketch after this list shows the idea.
  • An ORM also issues a lot of UPDATEs if it isn't configured carefully, and from a strict database point of view an UPDATE is like a DELETE followed by an INSERT, so 10 updates leave 10 dead tuples behind. Reducing these was also a huge improvement.
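A minimal sketch of the delete side, with made-up table and column names (‘opportunities’, ‘expired’): instead of one huge DELETE inside a long transaction, delete in small batches and commit between them so that vacuum can clean up as you go.

-- Delete at most 10,000 matching rows; the application repeats this
-- (each run in its own short transaction) until no rows are affected
DELETE FROM opportunities
WHERE id IN (
  SELECT id
  FROM opportunities
  WHERE expired = true
  LIMIT 10000
);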

In summary, it was really difficult for us to find the right solution to our problems, even though we had researched and read the best blog posts and articles about vacuuming. An intervention from an expert like Vik can be invaluable; without it, we might have just kept increasing our vacuuming! Hopefully you can find a solution yourselves, or this post will help, but if not, don't waste time and make things worse with such a critical element of your architecture… find an expert!

Your questions or remarks mean a lot to us, so feel free to comment :)
