Solving Latency Spikes & Locking in a Distributed PostgreSQL Query

Scott Brisbane
SafetyCulture Engineering
6 min read · Apr 11, 2024

At SafetyCulture, we use the PostgreSQL Citus extension to enable horizontal scalability for some of our key relational databases. One of our favourite things about Citus is that it allows our databases to scale without adding too much complexity or burden for our engineering teams. Database tables are sharded in a largely transparent way, and teams can interact with the database in much the same way they would if it were running on a single database server.

Citus helps us solve big problems at scale

We recently discovered that a query being run by one of our services against a sharded database table was experiencing spikes in latency. The remainder of this article will detail our investigation into this query latency, the solutions we identified and some of the things we learned along the way.

Investigating the Cause of the Query Latency

Initial investigations into this latency issue showed us that the query was being held up on advisory locks within PostgreSQL. This was identified by looking at the PostgreSQL log entries relating to the query in pganalyze, a PostgreSQL observability tool that we use. A snippet of these logs can be seen below.

Feb 19 08:33:18am AEDT Z00 389574 service_name LOG: process 389574 acquired ShareUpdateExclusiveLock on advisory lock [16386,0,107123,5] after 1142.005 ms
Feb 19 08:33:18am AEDT Z00 365664 service_name LOG: process 365664 acquired ShareUpdateExclusiveLock on advisory lock [16386,0,107123,5] after 1133.057 ms
Feb 19 08:33:18am AEDT L71 387470 service_name LOG: process 387470 still waiting for ShareUpdateExclusiveLock on advisory lock [16386,0,107123,5] after 1000.047 ms
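
As an aside, lock-wait log entries like these are emitted by PostgreSQL when log_lock_waits is enabled; the roughly 1000 ms threshold on the "still waiting" line corresponds to the deadlock_timeout setting (1s by default). A minimal sketch for turning this logging on, assuming superuser access:

-- Log a message whenever a session waits longer than deadlock_timeout for a lock.
ALTER SYSTEM SET log_lock_waits = on;
-- Pick up the change without a restart; deadlock_timeout defaults to 1s,
-- which matches the ~1000 ms waits seen in the log lines above.
SELECT pg_reload_conf();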

The interesting thing to note about advisory locks in PostgreSQL is that they are not locks imposed by the system; they are locks that an application or database user takes out explicitly. We quickly established that the service and database in question were not using any advisory locks directly, so attention turned to Citus and whether it could be responsible for the locking. After some investigation we discovered that this was indeed the case, but we didn’t yet fully understand why.
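
For context, a typical application-level use of advisory locks looks something like the following (purely illustrative, and not something our service does); the lock key is an arbitrary number chosen by the application:

-- Block until an application-defined lock on key 42 is available.
SELECT pg_advisory_lock(42);
-- ... critical section the application wants to serialise ...
-- Release explicitly: session-level advisory locks are not released on commit.
SELECT pg_advisory_unlock(42);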

At this point it’s probably worth showing the query and schema in question. Both are included below, with some details removed for simplicity.

UPDATE gateways
SET last_seen_time = $1
WHERE source_name = $2
  AND lower(source_id) = lower($3);

CREATE TABLE gateways
(
    source_name    varchar(255) NOT NULL,
    source_id      varchar(255) NOT NULL,
    tenant_id      uuid         NOT NULL,
    name           varchar(255) NOT NULL DEFAULT '',
    last_seen_time timestamp    NOT NULL DEFAULT to_timestamp(0),

    CONSTRAINT gateways_pkey PRIMARY KEY (source_name, source_id, tenant_id)
);

SELECT create_distributed_table('gateways', 'tenant_id', shard_count => '24');

The key things to note here are that the gateways table is a Citus distributed table, meaning that it is sharded across multiple Citus nodes, and that the distribution column (or sharding key) for the table is tenant_id. Also worth noting is that the WHERE clause of the UPDATE query does not constrain the distribution column to a specific value. When working with distributed tables in Citus, it is generally best practice to specify a value for the distribution column in the WHERE clause. This allows the Citus query planner to route the query directly to the shard holding the required data, rather than sending it to every shard.
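
As a quick aside, the distribution column and shard count for a table can be confirmed from the citus_tables view; a minimal sketch, assuming a Citus version that ships this view (Citus 10 or later):

-- Show how the gateways table is distributed.
SELECT table_name, citus_table_type, distribution_column, shard_count
FROM citus_tables
WHERE table_name = 'gateways'::regclass;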

Armed with this knowledge, and knowing that advisory locks were part of the issue, we consulted the locking section of the Citus Technical README. This allowed us to narrow in on the cause of the locking (outlined in the Distributed Execution Locks section): Citus uses advisory locks to avoid distributed deadlocks when running multi-shard operations that modify data. While we knew that the data being updated by this query resided on a single shard, the Citus query planner couldn’t infer that from the query we were sending it, so the query was treated as a “multi-shard update”.
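
These Citus-acquired advisory locks can also be observed directly while the contention is happening; a minimal sketch, run against the coordinator or the relevant worker node:

-- Advisory locks currently held or awaited; the database, classid, objid and
-- objsubid columns correspond to the four values shown in the log lines above.
SELECT pid, database, classid, objid, objsubid, mode, granted
FROM pg_locks
WHERE locktype = 'advisory';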

Working on Potential Solutions

With the cause of the locking and resulting query latency now understood, we started thinking about solutions. Naturally, the simplest solution would be to update the query to specify a value for tenant_id (the distribution column) in the WHERE clause. This would have looked like the following query:

UPDATE gateways
SET last_seen_time = $1
WHERE tenant_id = $2
  AND source_name = $3
  AND lower(source_id) = lower($4);

Unfortunately we didn’t have easy access to the correct value for tenant_id in the code path that runs this query, so this wasn’t an option.

The next idea was to use a subquery to look up the tenant_id from the gateways table before doing the UPDATE. This looked like the following query:

WITH gateway_to_update AS (
    SELECT tenant_id
    FROM gateways
    WHERE source_name = :source_name
      AND LOWER(source_id) = LOWER(:source_id)
)
UPDATE gateways
SET last_seen_time = :last_seen_time
FROM gateway_to_update
WHERE gateways.source_name = :source_name
  AND LOWER(gateways.source_id) = LOWER(:source_id)
  AND gateways.tenant_id = gateway_to_update.tenant_id;

Unfortunately this solution didn’t resolve the problem and we still saw the same locking and query latency. We ran an EXPLAIN VERBOSE on this query to understand what was happening and quickly realised that this wasn’t working because the Citus query planner was running the subquery (SELECT) and main query (UPDATE) together as a single unit on each shard. This meant that the queries being run on the shards still constituted "multi-shard updates" and required the use of advisory locks to avoid potential distributed deadlocks. The result of the EXPLAIN VERBOSE is shown below, with some details redacted.
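
For reference, the plan below was produced by prefixing the CTE-based query with EXPLAIN VERBOSE, roughly like this (the literal values shown are the same placeholders that appear in the redacted plan, not real production data):

EXPLAIN VERBOSE
WITH gateway_to_update AS (
    SELECT tenant_id
    FROM gateways
    WHERE source_name = 'monnit'
      AND LOWER(source_id) = LOWER('123456')
)
UPDATE gateways
SET last_seen_time = '2024-02-19 00:17:30'
FROM gateway_to_update
WHERE gateways.source_name = 'monnit'
  AND LOWER(gateways.source_id) = LOWER('123456')
  AND gateways.tenant_id = gateway_to_update.tenant_id;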

Custom Scan (Citus Adaptive)  (cost=0.00..0.00 rows=0 width=0)
  Task Count: 48
  Tasks Shown: One of 48
  ->  Task
        Query: WITH gateway_to_update AS (SELECT gateways_1.tenant_id FROM gateways_116096 gateways_1 WHERE (((gateways_1.source_name COLLATE "default") OPERATOR(pg_catalog.=) 'monnit'::text) AND (lower((gateways_1.source_id COLLATE "default")) OPERATOR(pg_catalog.=) lower('123456'::text)))) UPDATE gateways_116096 gateways SET last_seen_time = '2024-02-19 00:17:30'::timestamp without time zone FROM gateway_to_update WHERE (((gateways.source_name COLLATE "default") OPERATOR(pg_catalog.=) 'monnit'::text) AND (lower((gateways.source_id COLLATE "default")) OPERATOR(pg_catalog.=) lower('123456'::text)) AND (gateways.tenant_id OPERATOR(pg_catalog.=) gateway_to_update.tenant_id))
        Node: host=<citus-worker-hostname> port=5432 dbname=<service_a_database>
        ->  Update on gateways_116096 gateways (cost=0.28..4.34 rows=0 width=0)
              ->  Nested Loop (cost=0.28..4.34 rows=1 width=20)
                    Output: '2024-02-19 00:17:30'::timestamp without time zone, gateways.ctid, gateways_1.ctid
                    Join Filter: (gateways.tenant_id = gateways_1.tenant_id)
                    ->  Index Scan using gateways_pkey_116096 on gateways_116096 gateways (cost=0.14..2.16 rows=1 width=22)
                          Output: gateways.ctid, gateways.tenant_id
                          Index Cond: ((gateways.source_name)::text = 'monnit'::text)
                          Filter: (lower((gateways.source_id)::text) = '123456'::text)
                    ->  Index Scan using gateways_pkey_116096 on gateways_116096 gateways_1 (cost=0.14..2.16 rows=1 width=22)
                          Output: gateways_1.ctid, gateways_1.tenant_id
                          Index Cond: ((gateways_1.source_name)::text = 'monnit'::text)
                          Filter: (lower((gateways_1.source_id)::text) = '123456'::text)
        Query Identifier: 3414107486434098674
Query Identifier: -6411422672665295139

In the end, the solution we landed on was to run two separate queries against the table: the first to obtain the tenant_id, and the second to perform the UPDATE itself. On a normal PostgreSQL database this would generally be suboptimal and counter-intuitive, but in this case it made a lot more sense because it allowed us to avoid running any query that constitutes a “multi-shard update”.

SELECT tenant_id
FROM gateways
WHERE source_name = $1
  AND LOWER(source_id) = LOWER($2);

UPDATE gateways
SET last_seen_time = $1
WHERE gateways.source_name = $2
  AND LOWER(gateways.source_id) = LOWER($3)
  AND gateways.tenant_id = $4;
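
One way to confirm that the rewritten UPDATE is now routed to a single shard is to check its plan: with the distribution column constrained to one value, the Citus plan should report a single task rather than the 48 seen earlier. A quick sketch with placeholder values:

-- With tenant_id constrained to a single value, the plan should show "Task Count: 1".
EXPLAIN
UPDATE gateways
SET last_seen_time = now()
WHERE source_name = 'monnit'
  AND LOWER(source_id) = LOWER('123456')
  AND tenant_id = '00000000-0000-0000-0000-000000000000';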

After making this change, we were able to get the average query time down from around 48ms to a total of 12ms across the two queries. More importantly, the locking issue was gone and the query latency spikes were significantly reduced. The results can be seen in the p99 latency graph below for the relevant service endpoint. This isn’t the end of the journey, though: it represents a short-term solution to the query performance issue we were facing.
