PostgreSQL MultiXactId Error in Vacuum

PG::InternalError: ERROR: MultiXactId 331682782 does no longer exist

Sometimes, vacuuming is hard!

For the past couple of weeks, I’ve been working to improve the performance of our Greenhouse Recruiting product as our average customer size continues to increase. What worked for the product when a customer was a couple hundred users doesn’t necessarily scale to a customer size of tens of thousands.

One day we were notified of a database CPU usage spike. This is where it all started…

What was the problem?

An investigation into the CPU usage spike led me to uncovering a different issue. I was checking the current running processes in our database and one line of output caught my eye…

2018-06-02 07:59:29| autovacuum: VACUUM public.notes (to prevent wraparound)

So, what is a wraparound? From PostgreSQL’s official document:

PostgreSQL’s MVCC transaction semantics depend on being able to compare transaction ID (XID) numbers: a row version with an insertion XID greater than the current transaction’s XID is “in the future” and should not be visible to the current transaction. But since transaction IDs have limited size (32 bits) a cluster that runs for a long time (more than 4 billion transactions) would suffer transaction ID wraparound: the XID counter wraps around to zero, and all of a sudden transactions that were in the past appear to be in the future — which means their output become invisible. In short, catastrophic data loss. (Actually the data is still there, but that’s cold comfort if you cannot get at it.) To avoid this, it is necessary to vacuum every table in every database at least once every two billion transactions.

I drew a picture to better explain this problem. Note that in order to simplify the problem, the maximum value for transaction id was 100 instead of 4 billion.

As you can see from the picture, there were 100 rows in a table that had insertion transaction id 1 to 100. The first row was assigned transaction id 1, the second row was assigned id 2, the 100th row was assigned id 100 and then it reached the limit. So, how do we assign the transaction id for the 101st record? PostgreSQL will have to wrap around and assign the transaction an id of 1.

So, what is wrong with that? In PostgreSQL, rows with older transaction ids are visible to all transactions in the future, but rows with newer transaction ids are not visible to transactions in the past.

In our example, rows inserted by transaction 2 should be visible to transaction 100, but rows inserted by transaction 100 should not be visible to transaction 2. When the 101st transaction got wrapped around and assigned id 1, all of a sudden, transactions that were in the past (transactions with id 2 to 100) appear to be in the future. This meant that all the data inserted from the past was invisible in the database. It existed, but was unable to be found. Basically, we were in a state of data loss.

To prevent this problem from happening, PostgreSQL has a mechanism called vacuum. It works as a garbage collector of sorts. One of the things vacumming does is recycling transaction ids — mark old rows as frozen with a special transaction id. This indicates that they are visible to all current and future transactions. I drew another picture to explain this:

In this example, with the 101st transaction wrapping around, rows with transaction id 2 to 100 are old. The vacuum will mark them as frozen and assign them a special id(-2). So, now the 101st transaction can see all past rows. Problem solved!

But why was our database trying to prevent the wraparound from happening on the notes table? After some investigation, it turned out that there was an error which stopped both auto-vacuum and manual vacuum on the notes table. By the time I discovered this on June 6th, it had already been in this state for 2 weeks!

How did I find it?

From Sumologic (the service we use for log collection), I found this error message:

“error_message”:”PG::InternalError: ERROR: MultiXactId 331682782 does no longer exist — apparent wraparound\n: VACUUM ANALYZE public.notes”

Periodic::DatabaseVacuumAnalyzerWorker is a background worker that runs daily to manually vacuum our database tables. This error message started showing up on May 22nd.

I found more evidence in our Sidekiq admin panel. There were 10 background jobs in the retry queue with the same error message below:

Lastly, checking the pg_stat_user_tables table, I found the last vacuum for the notes table happened on May 21st.

greenhouse=>  SELECT relname, last_vacuum FROM pg_stat_user_tables where relname = 'notes';
relname | last_vacuum 
notes   | 2018-05-21 06:58:22.774598+00 
(1 row)

How bad was this situation?

From this AWS Database Blog post:

When the number of unvacummed transactions reaches around 2.09 billion, the database will warn us.
When the number reaches around 2.099 billion, PostgreSQL will set the database to read-only mode and requires an offline, single-user, standalone vacuum. This vacuum requires multiple hours or days of downtime (depending on database size).

Apparently, it could set the database to read-only mode and lead to days of downtime!

How much time did we have to fix the problem?

We can answer this question using the MaximumUsedTransactionIDs metric, which is generated by Amazon RDS agent with this database query:

SELECT max(age(datfrozenxid)) FROM pg_database;

relfrozenxid is a transaction id of a cutoff transaction in a relation or a table. Within this table, all transactions that are older than the relfrozenxid have been marked as frozen.

datfrozenxid (database frozen transaction id) is the minimum of the per-table relfrozenxid (relation frozen transaction id) values within the database.

This is a bit confusing, isn’t it? Let me try to explain…

To prevent the wraparound problem, the database marks old rows as frozen. But in reality, it will not mark all the old rows — rather only rows with transactions that are “old” enough. Continuing through our example, lets assume that transactions that have at least 50+ transactions after it are old transactions.

With this assumption, a row inserted by transaction 1 has 99 transactions after it — so it will be marked as frozen. Same for rows with transaction id 2 to 49. As for row with transaction id 50, it only has 50 transactions following it, so it is not old enough. It’s the oldest transaction to remain unfrozen. This is what the relfrozenxid is set to.

After understanding what relfrozenxid is, we still need to understand what age is to fully comprehend the MaximumUsedTransactionIDs metric. The age of a transaction id is the number of transactions that happened after the transaction. So, the age of datfrozenxid tells us the number of transactions that have not been vacuumed after the oldest transaction for the entire database.

With the metric, we can calculate how long we have until we hit the wraparound. The rate of increase for MaximumUsedTransactionIDs was roughly 7.5 million per day. At this rate, we had ~300 days until the world ended — assuming that growth rate didn’t accelerate any further.

As you can see in the chart, the maximum_used_transaction_ids started increasing on May 22nd, and continued to do so until we finally fixed the problem on June 8th.

How did we fix it?

Our awesome infrastructure team set me up with a test database to play around with. The first command I ran failed unsurprisingly:

greenhouse=> vacuum verbose notes;
INFO:  vacuuming "public.notes"
ERROR:  MultiXactId 338212822 does no longer exist -- apparent wraparound

So I had to find the note object with a MultiXactId of 338212822 and delete it. Here are the queries:

greenhouse=> select xmax, id from notes where xmax = 338212822 limit 1;
xmax    |    id
338212822 | 151215184
(1 row)

greenhouse=> delete from notes where id = '151215184';

Now, the vacuum should work, right? I tried, but it failed again with the same message and a different multiXactId. Apparently, there were more objects in the notes table with the same issue. So how could we find all rows with this problem?

I noticed that the normal rows all have an xmax value of either 0 or a 10 digit integer, and the problem rows had an id like 338212822 (9 digit). Here is the query:

greenhouse=> select id, xmax from notes where char_length(xmax::text) < 10 and char_length(xmax::text) > 8;

This query found all records with values in the xmax column that were exactly 9 digits long. After getting the note ids from the query result, the remaining work was pretty simple.

greenhouse=> update applications set rejection_note_id = null where rejection_note_id in (ids);
greenhouse=> delete from notes where id in (ids);

After cleaning up the notes in a bad state, I ran the vacuum… and… it worked!

greenhouse=> vacuum verbose notes;

How did we know it worked for sure?

Once the whole vacuum process finished, the last_vacuum field was updated to the latest date.

greenhouse=> SELECT last_vacuum FROM pg_stat_user_tables where relname = 'notes';
2018-06-06 01:21:34.183632+00

And based on the query below, the age of relfrozenxid of notes table was not at the top 20 anymore.

As a reminder, the age of relfrozenxid shows the number of transactions between the cut off transaction (since last whole-table vacuum) and the current transaction on this table. The larger the number is, the longer it has been since last time the table got vacuumed.

greenhouse-> SELECT relname, age(relfrozenxid) as xid_age,
greenhouse->     pg_size_pretty(pg_table_size(oid)) as table_size
greenhouse-> FROM pg_class
greenhouse-> WHERE relkind = 'r' and pg_table_size(oid) > 1073741824
greenhouse-> ORDER BY age(relfrozenxid) DESC LIMIT 20;
relname             |  xid_age  | table_size
resumator_applicants            | 170164449 | 1475 MB
resumator_messages              | 170164449 | 4557 MB
migration_candidate_attachments | 149870439 | 2204 MB
migration_phone_numbers         | 149867987 | 1643 MB
migration_email_addresses       | 149867987 | 1834 MB

So it actually worked!

The final steps to execute this in production were pretty straightforward. I ran a script to clean up erroneous notes rows and kicked off the vacuum again. While it took longer than a day to complete, the result was quite satisfying.

greenhouse=> SELECT relname, last_vacuum, last_autovacuum, last_analyze FROM pg_stat_user_tables where last_vacuum < timestamp '2018-06-06 00:00:00' and last_autovacuum < '2018-06-05 00:00:00';
relname | last_vacuum | last_autovacuum | last_analyze
(0 rows)

All tables have been vacuumed successfully and the world was safe for another day!