Postgres in anger
Investigating high DB CPU usage
We are also big fans of Fred Hebert’s book Erlang in Anger: https://www.erlang-in-anger.com/
We recently released a new feature in oVice that integrates Calendar (Google and Outlook) events with the user profile status. Ever since it became publicly available, our team noticed a gradual increase in PostgreSQL CPU usage that, after a couple of days, reached critical levels.
The usage fluctuated depending on load but it was always above 60%. The significant dips in the graph were our attempts at potential fixes but they would only help temporarily. We’ll go through the attempts that had the most impact later on.
At oVice we are rewriting major parts of our system in Elixir for better scalability and implementing all new features with it. So far we didn’t have use cases for running jobs asynchronously in the background. But with the introduction of the Calendar feature we needed to meet the following requirements:
- Refresh requests for calendar status updates so that Google Calendar keeps pushing them to our API.
- Schedule in-app status changes for users to run the job that will update the user’s badge/profile description at the specified time.
- Have the OAuth exchanges happen in the background in order to spare the user from the loading time.
Elixir processes are cheap and provide an accessible interface to off-load tasks that will run in the background, but using
spawn/1 for these relatively complex requirements wouldn’t cut it. We needed a more robust scheduling library that allows for monitoring job state and handling retries.
We investigated different tools provided by the Elixir community for reliably implementing background jobs such as Exq and Verk but ultimately decided to integrate Oban into the codebase for multiple reasons:
- Oban is more actively maintained, way more popular and prominent enough to frequently be part of Elixir conferences. This makes finding resources and asking for assistance much easier.
- It required zero extra system dependencies since we were already using PostgreSQL.
- Guarantees benefits of using an RDBMS for queueing jobs. The jobs state is managed by transactions to ensure better consistency, and is periodically backed up/replicated to avoid any data loss.
- The Web package provides a web application for real-time monitoring jobs and actively controlling queues and nodes.
After a few weeks of using Oban, we averaged a few hundred jobs per minute and had around 2 thousand stored in the database at any given time, which is way below what Oban or Postgres are capable of handling.
The few database operations that these jobs performed were minimal, well optimized, and had little to no bearing on this CPU usage issue.
Initially, our attempts to solve this problem were as follows:
- Tweak the Oban settings to avoid running the purge query (Oban.Plugins.Pruner) for completed/failed jobs every 30 seconds and optimize the
oban_jobsindexes by rebuilding them once a day. (Oban.Plugins.Reindexer)
- Convert immediate Oban jobs to scheduled ones to reduce the number of database triggers happening at the same time. These triggers made all our nodes query the database for the next jobs to execute, which is why we suspected that this process was overloading the database.
- Keep only one job runner node to reduce the database traffic to a minimum.
The third approach was the most successful one, judging by the significant decrease in CPU utilization in the above graph, although it only lasted a day before rising again.
What was causing this?
While we were working on the new calendar functionality, there was a separate, parallel project to integrate Fivetran with our Postgres to periodically extract data for analysis purposes. The way that Fivetran’s Postgres RDS connector works is that it reads from the Postgres write-ahead log (WAL).
The WAL is an append-only file where Postgres stores any updates made to database records. Its primary purpose is for crash recovery because it is guaranteed to be written to before changes to data files are. While this might seem redundant to have to write to two separate parts of the disk(s), the WAL is more robust and easier to flush to disk because it is sequential whereas changes to data files might require multiple pages to be written to disk and there is greater chance the system could crash while doing so. So for crash recovery, Postgres only needs to identify what the current state of the data pages are and replay the WAL to reach a consistent state.
The properties of the WAL also make it convenient for replication via log shipping. If a replica has a snapshot of the database and the WAL segments from the point of the snapshot to the current time, the replica can apply those WAL events to catch up to the primary’s current state. For consistency across all copies of the database, the primary has to keep WAL segments until it knows they have been sent to every replica. A Postgres feature for keeping track of this is replication slots.
So Fivetran attaches to Postgres as if it were any other replica and creates a replication slot. It then subscribes periodically (the interval depends on how it was configured) to that replication slot to pull the latest data changes, which is also when the WAL segments can safely be deleted.
When selecting these replication slots in our master, we found:
SELECT slot_name, plugin, slot_type, active FROM pg_replication_slots;---------fivetran_pgoutput_slot | pgoutput | logical | f |
ovice_fivetran_td_lc_slot | test_decoding | logical | f |
ovice_fivetran_pgoutput_lc_slot | pgoutput | logical | f |
The first two slots were not in use anymore, one was created to test the Fivetran connector and the other was the result of a deleted connection. The consequence of not cleaning up these slots is that Postgres was keeping WAL segments forever, expecting Fivetran to initiate replication through them. But that never happened since the connections were already deleted from Fivetran’s interface.
The most relevant AWS metric that confirmed this is at “Transaction logs disk utilization (MB)” where we can see an infinitely expanding WAL size:
Dropping the unneeded replication slots caused PostgreSQL to instantly free the log space and recover from the high CPU usage in minutes, which is presently stable at around 10% at the time of writing.
Deep dive into the high CPU usage
Even though we solved our immediate problem, our team likes to spend additional time to deeply understand the root cause of an issue. We believe this extra work pays off through a higher overall system understanding and aiding in future debugging.
The two conundrums we had to solve in order to find the underlying cause were:
- How can an increasing WAL size affect CPU utilization?
- What does it have to do with Oban?
If it wasn’t for the high CPU utilization, we probably wouldn’t have realized what was going on until disk space ran out or some other indirect issue arose.
Growing WAL impact on performance
Typically, when we see a database performance issue, we start by investigating any long-running queries that might be the source of the problem:
SELECT now() - pg_stat_activity.query_start as duration, query
WHERE state = 'active'
ORDER BY duration desc;---------duration | query
00:00:03.448002 | UPDATE \"public\".\"oban_jobs\" AS o0 SET
| \"state\" = $1, "completed_at\" = $2 WHERE
| (o0.\"id\" = $3)
pg_stat_statements table revealed that the queries that were taking too long to complete were extremely trivial, such as updating by primary keys. The only common aspect was that they all involved the
The next logical step was to analyze one of the queries on the
EXPLAIN ANALYZE SELECT * FROM oban_jobs WHERE id = 1;--------- Index Scan using oban_jobs_pkey on oban_jobs (cost=0.023..0.130 rows=1 width=329) (actual time=0.011..0.011 rows=0 loops=1)
Index Cond: (id = 1)
Planning Time: 1230.494 ms
Execution Time: 0.139 ms
The planning time corresponds to the amount of time it took for the query planner/optimizer to devise the query’s execution plan.
A simple select query by primary key took more than 1-second planning time, which is insane; the same issue occurred with all other queries on the
Now that we know it’s a query planner issue, let’s go through what it internally does when building an execution plan, as specified by the PostgreSQL documentation:
Quoting the documentation, the first task that it does is:
The query planner needs to estimate the number of rows retrieved by a query in order to make good choices of query plans.
This information is found in the
pg_class system catalog so we tried to manually query it:
EXPLAIN ANALYZE SELECT * FROM pg_class WHERE relname = 'oban_jobs';--------- Index Scan using pg_class_relname_nsp_index on pg_class (cost=0.27..8.29 rows=1
width=265) (actual time=0.009..0.009 rows=1 loops=1)
Index Cond: (relname = 'oban_jobs'::name)
Planning Time: 0.201 ms
Execution Time: 0.048 ms
The query was extremely fast and was undoubtedly not causing problems.
When a query has a
WHERE clause, only a subset of rows are usually fetched. So in order to build an efficient query plan, the planner estimates the selectivity of the
WHERE clauses (most common values for each row, distinct counts/rates, etc). This information is kept in the
pg_statistic system catalog:
EXPLAIN (ANALYZE, BUFFERS) SELECT starelid from pg_statistic WHERE starelid = 24631;--------- Index Only Scan using pg_statistic_relid_att_inh_index on pg_statistic
(cost=0.40..80.43 rows=2 width=4) (actual time=35.450..786.859 rows=18 loops=1)
Index Cond: (starelid = '24631'::oid)
Heap Fetches: 1115035
Buffers: shared hit=3001
Buffers: shared hit=7
Planning Time: 0.076 ms
Execution Time: 786.903 ms
starelid=24631 is the object id of the
oban_jobs table, it references the
pg_class.oid value and it’s a dynamic identifier for each table/index different per database.
An index only scan query with 786 ms execution time that only gets 18 rows may sound absurd. However with more than 1 million heap fetches occurring, this is well within the realm of reasonable.
Heap fetches are the number of rows that the database had to look up in the table, instead of the index to check their visibility (whether they were deleted or not).
In other words this means that we have over a million pg_statistic deleted tuples that never got vacuumed as confirmed by this query:
SELECT relname, last_autovacuum, n_dead_tup FROM pg_stat_all_tables WHERE relname = 'pg_statistic';--------- relname | last_autovacuum |n_dead_tup
pg_statistic | 2022-07-22 18:24:54.577448+02 | 1116077
Even though PostgreSQL does routine vacuuming to never end up in this situation, something was preventing the vacuuming jobs from deleting these tuples. According to a Stackoverflow answer, there are 3 reasons why dead tuples wouldn’t get removed when doing a vacuum, one of which is having unused replication slots in the database.
Causal link between the issue and Oban
Technically there is none, Oban just catalyzed the issue by doing a fair amount of database operations on the
oban_jobs table which made PostgreSQL recalculate statistics for it every now and then, thus deleting the old tuples from pg_statistic and recreating new ones.
Here is a logs sample that shows analyze operations on the
oban_jobs table. Early in our investigation we thought these were normal database operations, but it turned out these were growing the
2022-07-21 11:10:13 UTC::@::LOG: automatic analyze of table "ovice.public.oban_jobs" system usage: CPU: user: 1.60 s, system: 0.00 s, elapsed: 10.56 s
2022-07-21 11:10:37 UTC::@::LOG: automatic analyze of table "ovice.public.oban_jobs" system usage: CPU: user: 1.71 s, system: 0.01 s, elapsed: 12.28 s
2022-07-21 11:11:00 UTC::@::LOG: automatic analyze of table "ovice.public.oban_jobs" system usage: CPU: user: 1.57 s, system: 0.01 s, elapsed: 10.60 s
2022-07-21 11:13:50 UTC::@::LOG: automatic analyze of table "ovice.public.oban_jobs" system usage: CPU: user: 1.65 s, system: 0.01 s, elapsed: 14.86 s
2022-07-21 11:14:49 UTC::@::LOG: automatic analyze of table "ovice.public.oban_jobs" system usage: CPU: user: 1.67 s, system: 0.01 s, elapsed: 10.00 s
2022-07-21 11:23:25 UTC::@::LOG: automatic analyze of table "ovice.public.oban_jobs" system usage: CPU: user: 1.73 s, system: 0.00 s, elapsed: 11.45 s
Problems can arise even if you check multiple times that all replication slots are working as expected at specified times. It’s always possible for a connection to randomly die for a multitude of reasons (paused read replica, changing indirect configurations, outages, etc) and never recover without notice. As a safety net, it’s a good idea to monitor the amount of disk space occupied by transactions and schedule an alerting mechanism in case any abnormalities are found.