oVice
Published in

oVice

Postgres in anger

Investigating high DB CPU usage

Database CPU utilization
Database CPU usage
  • Refresh requests for calendar status updates so that Google Calendar keeps pushing them to our API.
  • Schedule in-app status changes for users to run the job that will update the user’s badge/profile description at the specified time.
  • Have the OAuth exchanges happen in the background in order to spare the user from the loading time.
  • Oban is more actively maintained, way more popular and prominent enough to frequently be part of Elixir conferences. This makes finding resources and asking for assistance much easier.
  • It required zero extra system dependencies since we were already using PostgreSQL.
  • Guarantees benefits of using an RDBMS for queueing jobs. The jobs state is managed by transactions to ensure better consistency, and is periodically backed up/replicated to avoid any data loss.
  • The Web package provides a web application for real-time monitoring jobs and actively controlling queues and nodes.
  • Tweak the Oban settings to avoid running the purge query (Oban.Plugins.Pruner) for completed/failed jobs every 30 seconds and optimize the oban_jobs indexes by rebuilding them once a day. (Oban.Plugins.Reindexer)
  • Convert immediate Oban jobs to scheduled ones to reduce the number of database triggers happening at the same time. These triggers made all our nodes query the database for the next jobs to execute, which is why we suspected that this process was overloading the database.
  • Keep only one job runner node to reduce the database traffic to a minimum.

What was causing this?

SELECT slot_name, plugin, slot_type, active FROM pg_replication_slots;---------fivetran_pgoutput_slot          | pgoutput         | logical   | f |
ovice_fivetran_td_lc_slot | test_decoding | logical | f |
ovice_fivetran_pgoutput_lc_slot | pgoutput | logical | f |
Increasing WAL size graph
Increasing WAL disk usage

Remediation

select pg_drop_replication_slot('fivetran_pgoutput_slot');
select pg_drop_replication_slot('ovice_fivetran_td_lc_slot');
Dropping WAL size
Dropping WAL disk usage
Dropping database CPU usage
Dropping database CPU usage

Deep dive into the high CPU usage

  • How can an increasing WAL size affect CPU utilization?
  • What does it have to do with Oban?

Growing WAL impact on performance

SELECT now() - pg_stat_activity.query_start as duration, query 
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration desc;
---------duration | query
-----------------+--------------------------------------------------
00:00:03.448002 | UPDATE \"public\".\"oban_jobs\" AS o0 SET
| \"state\" = $1, "completed_at\" = $2 WHERE
| (o0.\"id\" = $3)
EXPLAIN ANALYZE SELECT * FROM oban_jobs WHERE id = 1;--------- Index Scan using oban_jobs_pkey on oban_jobs  (cost=0.023..0.130 rows=1 width=329) (actual time=0.011..0.011 rows=0 loops=1)
Index Cond: (id = 1)
Planning Time: 1230.494 ms
Execution Time: 0.139 ms
(4 rows)
EXPLAIN ANALYZE SELECT * FROM pg_class WHERE relname = 'oban_jobs';--------- Index Scan using pg_class_relname_nsp_index on pg_class  (cost=0.27..8.29 rows=1
width=265) (actual time=0.009..0.009 rows=1 loops=1)
Index Cond: (relname = 'oban_jobs'::name)
Planning Time: 0.201 ms
Execution Time: 0.048 ms
(4 rows)
EXPLAIN (ANALYZE, BUFFERS) SELECT starelid from pg_statistic WHERE starelid = 24631;--------- Index Only Scan using pg_statistic_relid_att_inh_index on pg_statistic
(cost=0.40..80.43 rows=2 width=4) (actual time=35.450..786.859 rows=18 loops=1)
Index Cond: (starelid = '24631'::oid)
Heap Fetches: 1115035
Buffers: shared hit=3001
Planning:
Buffers: shared hit=7
Planning Time: 0.076 ms
Execution Time: 786.903 ms
(8 rows)
SELECT relname, last_autovacuum, n_dead_tup FROM pg_stat_all_tables WHERE relname = 'pg_statistic';---------   relname    |          last_autovacuum          |n_dead_tup
--------------+-----------------------------------+-----------------
pg_statistic | 2022-07-22 18:24:54.577448+02 | 1116077
(1 row)

Causal link between the issue and Oban

2022-07-21 11:10:13 UTC::@:[1108]:LOG:  automatic analyze of table "ovice.public.oban_jobs" system usage: CPU: user: 1.60 s, system: 0.00 s, elapsed: 10.56 s
2022-07-21 11:10:37 UTC::@:[2399]:LOG: automatic analyze of table "ovice.public.oban_jobs" system usage: CPU: user: 1.71 s, system: 0.01 s, elapsed: 12.28 s
2022-07-21 11:11:00 UTC::@:[3006]:LOG: automatic analyze of table "ovice.public.oban_jobs" system usage: CPU: user: 1.57 s, system: 0.01 s, elapsed: 10.60 s
2022-07-21 11:13:50 UTC::@:[10604]:LOG: automatic analyze of table "ovice.public.oban_jobs" system usage: CPU: user: 1.65 s, system: 0.01 s, elapsed: 14.86 s
2022-07-21 11:14:49 UTC::@:[13883]:LOG: automatic analyze of table "ovice.public.oban_jobs" system usage: CPU: user: 1.67 s, system: 0.01 s, elapsed: 10.00 s
2022-07-21 11:23:25 UTC::@:[4770]:LOG: automatic analyze of table "ovice.public.oban_jobs" system usage: CPU: user: 1.73 s, system: 0.00 s, elapsed: 11.45 s

Follow-up steps

--

--

Create your own virtual space for your team, community, and organization. Host events and work together in one customizable world with oVice.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store