Understanding Redshift Workload Management

One of the cool things about Redshift is that it looks and feels a lot like PostgreSQL, yet can run huge analytics queries impressively fast. Under the hood, however, it’s a very different system, and one of the more confusing differences is how queries are scheduled.
In Postgres, each connection gets some server resources when it starts and has exclusive use of them until the connection ends. Redshift operates a queuing model: having a connection means you can ask Redshift to execute your query, but it may have to wait for other queries to finish first. As a result, even a trivial query can appear extremely slow, because it’s actually stuck behind some super-slow analytics query!
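You can actually see this queuing happen. The system table STL_WLM_QUERY records, for each completed query, how long it sat in a queue versus how long it actually executed. A rough diagnostic sketch (the conversion to seconds assumes the documented microsecond units):

```sql
-- Compare time spent queued vs. time spent executing for recent queries.
-- STL_WLM_QUERY stores both durations in microseconds.
SELECT query,
       service_class,                                  -- which WLM queue it ran in
       total_queue_time / 1000000.0 AS queue_seconds,  -- time waiting for a slot
       total_exec_time  / 1000000.0 AS exec_seconds    -- time actually running
FROM stl_wlm_query
ORDER BY queue_start_time DESC
LIMIT 20;
```

A “slow” query with tiny exec_seconds and huge queue_seconds wasn’t slow at all; it was just waiting.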
It’s easy not to notice the queue when you first start using Redshift. The default configuration allows 5 concurrent queries, which is plenty for smaller teams. As you grow your team and product, though, more and more queries are being run, and they take longer to finish. Luckily, Workload Management (WLM) is exactly the feature for managing this.
Workload Management lets you define multiple queues and route queries based on the group(s) a user is in and a query_group set on the connection (which makes it possible for a user to opt a single query into a different queue).
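Setting a query_group is just a session-level SQL statement. A quick sketch (the group name and table are made-up examples; the group has to match one of your queue definitions):

```sql
-- Opt this session's next statements into the queue matching 'reports'.
-- 'reports' and the events table are hypothetical examples.
SET query_group TO 'reports';

SELECT count(*) FROM events;  -- runs in the matching queue

RESET query_group;            -- back to normal routing for this session
```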
The downside of WLM is that each concurrent query “slot” has memory reserved for it at all times, whether or not a query is running in it, so you want to avoid having a queue that rarely runs queries, because that’s just wasting cluster resources!
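The flip side of fixed slots is that a single session can temporarily claim several of them, borrowing their memory for one big statement. A small sketch (the table name is hypothetical):

```sql
-- Temporarily claim 3 slots' worth of this queue's memory for one
-- big statement; while it runs, the queue has 2 fewer free slots.
SET wlm_query_slot_count TO 3;

VACUUM big_table;  -- hypothetical memory-hungry statement

SET wlm_query_slot_count TO 1;  -- give the slots back
```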
There are two compulsory queues:
- the `superuser` queue, which doesn’t show up in your config; it exists so superusers can run queries in an emergency
- the `default` queue, which is the last entry in your config; it can’t have any conditions, since any query that doesn’t get routed to the other queues ends up here
You can then add additional special queues. We use extra queues to separate our four main workloads:
- generating reports for end-users — these need to run asap and not get blocked by other queries
- importing data — e.g. from our production databases, analytics services, etc.
- tools like Tableau
- ad-hoc queries (mostly by data scientists)
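On a manually-managed cluster this setup lives in the wlm_json_configuration parameter. The sketch below shows roughly what four queues plus the compulsory default could look like; all the queue names, group names, concurrency numbers, and memory splits here are illustrative, not our real values:

```json
[
  {
    "name": "reports",
    "user_group": ["report_service"],
    "query_concurrency": 5,
    "memory_percent_to_use": 40
  },
  {
    "name": "imports",
    "user_group": ["etl"],
    "query_concurrency": 2,
    "memory_percent_to_use": 20
  },
  {
    "name": "tableau",
    "user_group": ["tableau"],
    "query_concurrency": 2,
    "memory_percent_to_use": 15
  },
  {
    "name": "adhoc",
    "query_group": ["adhoc"],
    "query_concurrency": 3,
    "memory_percent_to_use": 15
  },
  {
    "query_concurrency": 2
  }
]
```

The last entry, with no user_group or query_group conditions, is the compulsory default queue.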
When we looked into why Redshift was sometimes “slow” to run ad-hoc queries, we found the culprit was usually Tableau refreshing a big dashboard, or new analytics data being loaded. Moving each of those workloads into its own queue isolates the queuing: a backlog of imports now only delays other import queries.
There’s one more cool WLM feature to talk about: hopping. Among the options for each queue is an execution timeout. This is ace for making sure a queue keeps moving: anyone hogging a slot for too long just gets kicked out! However, data scientists do sometimes need to run slow queries. Hopping gives you the best of both worlds: when a query times out on one queue, WLM retries it on the next queue that matches it (except the default queue).
So the ad-hoc queue is actually split into:
- `fast_adhoc`, with a timeout of 10 seconds
- `slow_adhoc`, with a timeout of 1 hour
This means that most queries run on the fast queue, but any slow ones get moved out of the way onto the slow queue by magic ✨
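In wlm_json_configuration terms, that split might look something like the sketch below (again an illustrative guess rather than our exact config; max_execution_time is in milliseconds, so 10 seconds is 10000 and 1 hour is 3600000):

```json
[
  {
    "name": "fast_adhoc",
    "query_group": ["adhoc"],
    "query_concurrency": 3,
    "max_execution_time": 10000
  },
  {
    "name": "slow_adhoc",
    "query_group": ["adhoc"],
    "query_concurrency": 2,
    "max_execution_time": 3600000
  },
  {
    "query_concurrency": 2
  }
]
```

Both ad-hoc queues match the same query_group, so a query starts in `fast_adhoc`; if it hits the 10-second timeout, hopping retries it on the next matching queue, `slow_adhoc`.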

