When AWS Redshift hangs: a story of locked queries and undead backends
How we detected deadlocks, terminated backends, and learned a lesson
Recently we started using Amazon Redshift as a source of truth for our data analyses and Quicksight dashboards. The setup we have in place is very straightforward:
- We perform minimal ETL with a node script, and the result ends up in Redshift.
- Quicksight dashboards are placed on top of this stack to create handy visualizations.
After a few months of smooth execution, we suddenly noticed our ETL script hanging for no apparent reason. Here is how we figured out what was going on.
Step by step
According to the Amazon Redshift documentation, there are various reasons why a query can hang. We ended up ruling out all the options except the last one: a potential deadlock. We started following the advice listed in the documentation:
- Looking in the STV_LOCKS and STL_TR_CONFLICT tables brought us little insight, as both turned out to be empty (see the first snippet below).
- We queried the table that records recently run queries to figure out which ones, if any, were stuck (see the stv_recents example below). Indeed, it turned out quite a few queries had been waiting for a long time. However, the suggested pg_cancel_backend(PID) did nothing to improve our situation.
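For completeness, the checks from the first point are as simple as selecting from those two system tables (a minimal sketch; both queries came back empty for us):
-- both of these returned zero rows in our case
select * from stv_locks;
select * from stl_tr_conflict;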
select pid, trim(starttime) as start,
duration, trim(user_name) as "user",
substring (query,1,40) as querytxt
from stv_recents
where status = 'Running';
pid | start | duration | user | querytxt
----+------------+------------+---------+--------------------------
511 | 2017-07-04 | 2656430457 | user1 | select id from "foobar"
512 | 2017-07-04 | 2192060486 | user1 | select id from "quxbar"
- We had a look at the existing transactions and which locks they were holding, and then went on to execute pg_terminate_backend(PID). This time it looked like it was working: the running queries for the terminated backends disappeared, and the situation seemed to be unlocking.
select txn_owner, txn_db, pid, lock_mode, granted
from svv_transactions
where pid <> pg_backend_pid();

txn_owner | txn_db | pid | lock_mode       | granted
----------+--------+-----+-----------------+---------
user1     | dbname | 511 | AccessShareLock | t
user1     | dbname | 511 | ExclusiveLock   | t
user1     | dbname | 512 | AccessShareLock | t
user1     | dbname | 512 | ExclusiveLock   | t

select pg_terminate_backend(511);

pg_terminate_backend
----------------------
1
Raising the undead
Our joy was short-lived: despite all our efforts to terminate it, one of the backends didn't seem to want to go away. This seems to be a not-so-uncommon issue, but none of the ongoing discussions helped us figure out what to do next.
To sum it up, this was our situation:
- The svv_transactions table showed that the process with PID 512 was still holding a lock.
- The stv_recents table showed no running queries.
- We still didn't know which of our running services was holding the lock, nor on what specific resource.
- At that point, we couldn't terminate 512 from the command line.
One problem at a time.
Who was holding the lock? After a bit of digging, we found this helpful tutorial, which finally gave us some understandable information about which resources were locked. It digs into the catalog tables that record which locks are granted and which processes are waiting for locks to be acquired (see the Postgres Wiki for more info):
select
current_time,
c.relname,
l.database,
l.transaction,
l.pid,
a.usename,
l.mode,
l.granted
from pg_locks l
join pg_catalog.pg_class c ON c.oid = l.relation
join pg_catalog.pg_stat_activity a ON a.procpid = l.pid
where l.pid <> pg_backend_pid();

(simplified output)

... | relname | ... | pid | usename | ...
-----+---------+-----+-----+-----------+-----
... | foo | ... | 512 | user1 | ...
... | bar | ... | 512 | user1 | ...
This told us that the infamous 512 was holding locks on the tables foo and bar. Looking in the Redshift console, in the “Queries” tab of our cluster overview, we could see that the only query involving foo and bar came from Amazon Quicksight. We resolved to investigate this at a later stage, but it was already good information! Now we needed to get to work to free these resources.
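If you would rather stay in SQL than click through the console, searching recent query text for the table names gives roughly the same information; a minimal sketch, using the foo and bar names from our case (matching on query text is approximate):
-- look for recent queries whose text mentions the locked tables
select pid, starttime, trim(querytxt) as querytxt
from stl_query
where querytxt ilike '%foo%' or querytxt ilike '%bar%'
order by starttime desc
limit 20;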
Undead queries and how to (not) manage them: this article has a very nice overview of undead queries, how they come to be, and how to fix them on the fly. Luckily for us, it also provided an easy solution: restart the cluster. Our still very limited use of Redshift allowed us to do so without sacrificing any functionality, and indeed it solved all our problems (for now).
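For reference, the restart itself can be triggered from the console or with the AWS CLI; a sketch assuming a placeholder cluster identifier of my-cluster:
# my-cluster is a placeholder for your cluster identifier
aws redshift reboot-cluster --cluster-identifier my-cluster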
Obviously we are still looking into what really caused the deadlock in the first place, but in the meantime we brought Redshift back to a usable state.
Lesson learned
Aside from figuring out where to look to understand what was going on, the hardest part was finding the source of the deadlock. This would have been a lot easier if every service accessing Redshift had used its own dedicated user. Instead, we had configured one generic user with read access to the tables we use for data analysis, which made it almost impossible to tell where the hanging request was coming from.
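A minimal sketch of what per-service users could look like, assuming a hypothetical reporting schema called analytics and made-up user names and password placeholders:
-- hypothetical users and schema, purely for illustration
create user etl_writer password '<StrongPassword1>';
create user quicksight_reader password '<StrongPassword2>';
grant usage on schema analytics to quicksight_reader;
grant select on all tables in schema analytics to quicksight_reader;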
We are just now getting the hang of this, and any tips or solutions to similar stories are always welcome!
Happy coding!
Want to learn more about coding? Have a look at our other articles.
Photo by: Andrew Martin