AWS Redshift: Troubleshooting made simple

Published in

Doctolib

10 min readJan 2, 2020

Welcome! Here you will discover how to make Redshift great again by identifying the exact problem you have, and the adequate solution for it.

At the beginning, everything was so simple. You were alone on your beautiful new cluster. He was new, big, and so fast. But quickly, more and more people are working on it. Just like invaders on a new promise land, they plunder the wealth of your cluster without thinking about consequences.

Only one person is responsible for it: You! After a while, you encounter some problems that are penalizing:

Disk full error
Your query is not running
Your query is slow

We will see together how to tackle those problems. This article also discuss the basics of a performant Redshift cluster, where you can improve a query by an average x20-x30 time, especially Distkeys and Sortkeys parts. Feel free to use it for troubleshooting, but also for performance tuning.

Disk Full

Everything was fine. It’s a sunny day. Birds are singing. And you see it …

error: Disk full

It can’t be possible. Not during your watch. You run toward your admin interface (AWS Portal > Redshift > Clusters > Your cluster > Cluster performance > Percentage disk) … And you see it: The Great Wave off Kanagawa

This is the apocalypse. When disk hits more than 100%, every query is killed (not only yours). So, you can expect to have every user coming to you to know what did happen. Fun fact: the user often thinks he is the source of the problem and keeps it to himself, ashamed.

What did happen? There was not enough memory space, so the query used the disk to store the temporary needed data. This process is called “SWAP”. But again, it was not enough. All the disk was used, causing a massive crash. We will see how to avoid this through volume management.

To identify the tables that need your attention the most:

SELECT “table”, size, tbl_rows
FROM svv_table_info
ORDER BY size DESC;

Architecture for volume management

Use Spectrum to externalize historical data. Ask your teams what data they need on a regular basis, and what data are considered “old”. Unload as Parquet the “old” data into S3. Use Glue crawler to have the structure. Remove the Data from Redshift.
Good practice: 66% of disk used by data when there is no Swap. If you are above it and there is nothing to reduce it, you need more nodes. Choose classic resizing if you add some nodes.
If you need temporary space, you can do an elastic resizing of your cluster to get more space. As it is a fake resizing (it redistributes the slice, not the data, which can induce skewness), go back to your previous configuration after doing it.

Table management for volume improvements

ENCODE every column (except sortkeys) as AZ64. If it’s a varchar column, encode it as ZSTD.
Remove tables that are never used (not scanned since 30 days). You can find a script here.
Reduce at maximum the varchar columns as they are responsible for most of your disk space use. You can use the following query to know the max varchar size of a column:

-- get max varchar size of your_column
SELECT max(octet_length(your_column)) FROM your_table

Query optimization for volume handling

Don’t use large varchar columns (VARCHAR > 256) for “join” as the keys could be put in memory for faster join, which can overload memory.
Never use select *, take only the column you need.
Limit the source tables with “where” filters.
Avoid data redistribution between nodes that multiplies the volume of your query SEVERAL times. Use Distkeys, which indicates how your data are split between your nodes. For example, if you have several tables with client_id and you know you will make JOIN with it, use it as the Distkey.
Do regular Vacuum and Analyze, so Redshift can use the Distkeys.
No nested SELECT, they don’t remember the Distkey.
When you JOIN data coming from Spectrum, use a temporary table with Distkey to avoid data redistribution from a huge Spectrum table.
Read the “query plan” to know if there is a data redistribution:

Query plan: look at uppercases, if the last one is not “NONE” then there is a data redistribution - and it’s bad!

The good thing when you have a full disk error is that at least, you had your query running!…

Query is not running

You don’t understand. Your query is soooo slow. You don’t even know if it’s really running or not. You want to know? Simply run those queries:

-- displays every query that run or will run
SELECT user_name, db_name, pid, query
FROM stv_recents WHERE status = ‘Running’;-- displays every running queries
SELECT * FROM stv_inflight;

If your queries are not running, you will see it in the result of the first query, but not in the second. What if there are more queries like yours? You can see it in the AWS Redshift interface: AWS Portal -> Redshift -> Clusters -> Select your cluster -> Database performance

You can see that on Dec-30 at 2:40 (ETL scheduling), we had more waiting time than query execution (reading + writing time). There are 4 possible causes: locks, bad queues, architecture, bad performance.

Locks

Some queries (like DROP or TRUNCATE) need an “exclusive access” to tables. Doing so, it prevent queries from accessing to those tables. To identify those locks, you can execute this query:

SELECT * FROM svv_transactions;

If “lock_mode” is “AccessExclusiveLock”, you have a locked item. To unlock the item, you can either wait, or delete the query by doing:

SELECT pg_terminate_backend(your_pid_here);

In some cases, the query can be slow to revert (as Redshift needs to rollback some tasks). In other cases, this “terminate” command is not enough and you have to … Reboot the cluster. It will take several minutes.

Bad queues and WLM management

Sometimes your queries are blocked by the “queues” aka “Workload Management” (WLM). It’s bad, but Redshift without queue management can be even worse (numerous disk full error and slow queries). Queues allow you to limit the number of parallel queries and the memory allocated to them, depending on the user or rules. You would want to do that for two main reasons:

Managing query priority
Prevent SWAP on disk

As I won’t focus on WLM here, I can’t develop all the possibilities. I prefer manual WLM for all the control it gives compared to elastic WLM (didn’t work on our cluster because of the reporting tool).

WLM can allocate minimum memory always available for a user (you, the reporting tool, ETLs, …). The downside is that if you give 10% reserved memory for the ETL, those 10% will not be used, even if there is no ETL running. That’s why we do not use “Memory %”. As we are only using WLM to avoid “Disk full” and memory SWAP, limiting the memory would not be a good idea. Note that the AWS advice is to have around 15 queues.

Why and how do you prevent Redshift SWAP with WLM? SWAP uses disk, which is much slower than memory, and you want to avoid it. By knowing how many queries of the same type can run on your cluster, you can prevent SWAP (or other negative effects) with WLM. Here is an example with our cluster:

3 queues for ETLs and automates. They hit the CPU hard and slow each other if there are more than 3 queries doing Insert/Copy/Drop/Truncate/…
10 queues for reporting. They are fast but come all together. With 10 queries at the same time we rarely hit the disk for swap.
3 queues for everything else. We can only have 3 user queries at the same time, which is enough for 25 data people. We added a timeout at 15 minutes for bad queries. NB: 15 minutes for a SELECT is often a REALLY bad performance.

To know if your query is blocked by WLM:

SELECT * FROM stv_wlm_query_state ORDER BY query;

You will see the current state (waiting or running) of your query depending on the WLM with the queue_time.

Architecture

As Redshift is an OLAP database, it’s oriented to work on analytical queries (as opposed to OLTP with transactional queries). To resume, it prefers to get a couple of big queries rather than a lot of small ones. As soon as we know this, if you don’t want to overload/break Redshift you should avoid to:

Stream events: prefer Kinesis or Lambda to aggregate bulk of data.
Do multiple insert to load row by row. Prefer COPY.
Connect a reporting tool directly on Redshift. Every common reporting tool has a cached/in-memory database. Put the Redshift data in it.

The best way to avoid blocked queries, is to have only quick queries. Let’s see how we can do this…

Your query is slow

As soon as your SELECT takes more than 3 minutes, there is a great probability that you can do better. You need specific nodes, and a full understanding of distkeys and sortkeys!

Your cluster nodes

If you are not in a region where “ra3” nodes are available, use “dc2” nodes for your cluster, not “ds2”. Dense compute nodes have more RAM and an SSD disk (faster than HDD). Ds2 nodes have more disk space, but you don’t need it with Spectrum.

Distkeys

When you create a table, you have 3 distribution styles called “distyle”:

Diststyle key:Data are distributed across nodes by the key. The same keys are on the same node, even between different tables. Example: the distkey of table1 and table2 is the customer_id. If you join both tables by customer_id, data are joined locally. It spares a lot of networking and memory, as you move only a subresult and not the full data. Do it if the table is big, with a lot of different values on the key, with data well distributed between keys, and the key regularly used in the JOINs.
Diststyle ALL: It puts the full table on every node. It’s interesting when the table is small and you have several JOIN keys in it. With this, every join on this table will be done on the local node. The data will never be replicated during a query as they are already on the node.
Diststyle EVEN: The data are redistributed equally between nodes. Do it when the other two solutions aren’t available.

Table 1: distkey(col_id_T1), Table 2: distkey(C_ID_T2). Same IDs are on the same slice. If they are JOIN together, there is no data network

Again, the main goal of distkeys is to avoid any broadcast of data between nodes. By doing so, you also want your nodes to be equally loaded. If data are not equally distributed, you will have “skewness”: the overloaded node will always be late compared to others.

It’s not EQUAL, you have a 6% difference. You lose a 6% performance because the overloaded node needs to work 6% more

To identify table with skewness, you can use this link.

-- to know the slice distribution between nodes. It should be exactly equal. If not, you need to do an manual (not elastic) resizing of your cluster
SELECT node, count(*) FROM stv_slices GROUP BY 1;

Sortkeys

It sorts the data into the blocks (1 MB) of slices (a part of a node) of nodes, and stores the possible values in a block zone map (min and max values). Thanks to this, Redshift knows if it needs to get the data quicker and doesn’t use the block if there is no need to this.

Compound sortkey

This is the default sortkey. It orders the data by the column that you specified, ordered by the priority you specified. Example: sortkey(id, date, city) will order your data by id, then by date for each id, and finally by city for each date for each id. It will help a lot the WHERE filter. Be aware that if the city is in the WHERE clause, but not the date, your sortkey will probably be a handicap as you will still need to get every block for specific city. You need to be sure that the first sortkey will always be used if you want the others. Dates are usually good compound sortkeys as they are often used in queries.

Interleaved sortkey

With this type of sortkey, you order the data by block without priority. If you know that you will want to filter either by date (without city), or by city (without date), or by other columns, this is the good sortkey.

One optimization above all: the Merge Join

When your distkey and your first compound sortkey are the same, Redshift uses the “merge join”. This join is incredibly fast and doesn’t need to use memory. As soon as your distkey can be the value you use to join tables, you should go for it. As you will have a sortkey with this join value, “WHERE” clauses won’t be so effective, but this is a small loss for a big win. This is super effective because Redshift has only one scan to do on both tables. As they are sorted alike (sortkey) with the same node distribution (distkey), the join can be done at the same time.

Example: A perfect match for this would be every customer table. “customer_id” is a good distkey (high cardinality, data split more or less equally between customers, very often used for join). In this case, you put “customer_id” as a distkey and sortkey of every table containing it.

If you want to fully understand and get recipe for distkeys and sortkeys, I strongly recommend you this AWS article.

Reminders

NEVER encode your sortkeys.
Run regular Vacuum and Analyze (after ETL for small tables, end of day for big tables). Be careful, it’s a long and heavy process. You need this because sortkeys are not respected by some queries (“INSERT INTO” for example), and tables <80% sorted can’t use sortkeys optimizations.

I hope you liked the article. More than just a troubleshooting review, these tips are my best practices. I regularly use these tricks to reduce queries time by 90–95% … Really! Feel free to ask any advice on Linkedin.

If you want more technical news, follow our journey through our docto-tech-life newsletter.