What’s the recipe for a healthy Data Model in Amazon Redshift?

Well, you have launched a perfectly configured Redshift cluster. Now what? You may ask: Why do my queries take so long to run? Why does my BI report take so long to load? What did I do wrong?

The answer is a three-ingredient recipe.

Ingredients:

Compression Encoding
Sort Keys
Distribution style

Recipe:

1. Use compression encoding on all columns when creating your tables. Compression allows more data to be stored within the cluster and improves performance by decreasing I/O. To determine which compression encodings are best for a table, create the table without any compression, load it with a sizeable amount of data, and run:

ANALYZE COMPRESSION enterprise_db.product;

This lists the optimal compression encoding for each column, which you can then use when recreating the table.
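For illustration, the output reports one row per column with the suggested encoding and the estimated space savings. The values below are hypothetical, not real results:

```
 Table   | Column     | Encoding | Est_reduction_pct
---------+------------+----------+------------------
 product | product_id | zstd     | 40.00
 product | location   | bytedict | 65.00
 product | load_dt    | zstd     | 35.00
```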

For example, creating a table with compression encodings:

CREATE TABLE enterprise_db.product(
product_id int encode zstd,
location varchar(32) encode bytedict,
load_dt date encode zstd);

Tip: It is better not to use compression on small tables (< 10,000 rows) or on sparse columns (columns with a lot of nulls). For such tables, the encoding Redshift suggests is usually “raw”.

Or, if you are using the COPY command to load data from S3 into Redshift, it analyzes and compresses the data for you by default during the first load into an empty table.

Tip: Try to leverage COPY to load data from S3 to Redshift as much as possible.
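A minimal COPY sketch, assuming a hypothetical bucket and IAM role (replace both with your own). COMPUPDATE ON is already the default for a first load into an empty table; it is spelled out here only to make the automatic-compression step explicit:

```sql
-- Bucket path and IAM role ARN below are placeholders.
COPY enterprise_db.product
FROM 's3://your-bucket/product/'
IAM_ROLE 'arn:aws:iam::123456789012:role/YourRedshiftRole'
FORMAT AS CSV
COMPUPDATE ON;  -- let Redshift analyze and apply compression encodings
```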

2. Choose the right columns as sort keys for your tables so that the data is physically sorted on disk by those columns. This improves performance by reducing I/O.

Example:

CREATE TABLE enterprise_db.product(
product_id int,
location varchar(32) encode bytedict,
load_dt DATE encode zstd
) SORTKEY(product_id,load_dt);

Here, the data is first sorted on product_id; if there is a tie, it is then sorted on load_dt.

How do you choose the sort key? Use columns that are frequently filtered on (columns used in your “where” conditions), and choose the lowest-cardinality column as the first column of the sort key. In a typical data warehouse, that is usually a date column.
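To see why this matters: with the table above sorted on (product_id, load_dt), a filter on the leading sort key column lets Redshift use zone maps (per-block min/max values) to skip blocks that cannot contain matching rows. A sketch, using hypothetical id values:

```sql
-- Range filter on the leading sort key column; blocks whose
-- min/max product_id fall outside 100-200 are skipped entirely.
SELECT location, load_dt
FROM enterprise_db.product
WHERE product_id BETWEEN 100 AND 200;
```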

Tip: It is better not to use compression on the first column of your sort key (i.e. leave it with raw encoding).

3. Distribution style determines how Redshift distributes a table’s rows across the slices of the cluster. Redshift supports four styles: AUTO (the default; Redshift chooses for you), EVEN (round-robin), KEY (rows with the same distribution key value land on the same slice), and ALL (a full copy of the table on every node).

Query to identify row skew:

SELECT diststyle, skew_rows
FROM svv_table_info
WHERE "table" = 'your_table';

skew_rows is the ratio of rows in the slice with the most rows to rows in the slice with the fewest. A value close to 1 means the data is evenly distributed, and we are good.

Example:

CREATE TABLE enterprise_db.product(
product_id int,
location varchar(32) encode bytedict,
load_dt DATE encode zstd
)
DISTKEY(product_id)
SORTKEY(product_id,load_dt);
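For a small, frequently joined dimension table, DISTSTYLE ALL avoids redistributing rows at join time. A sketch, using a hypothetical enterprise_db.location dimension table:

```sql
-- DISTSTYLE ALL copies the full table to every node, so joins
-- against it require no data movement; suitable only for small,
-- slowly changing dimension tables.
CREATE TABLE enterprise_db.location (
    location_id int encode zstd,
    location_name varchar(32) encode bytedict
)
DISTSTYLE ALL;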

4. If you are using subqueries, avoid “select * from schema_name.your_table”; always specify the columns in the select list. The reason is that Redshift uses a columnar architecture: it stores data on disk by column rather than by row, so reading only the columns you need reduces I/O.
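Concretely, naming only the columns you need means Redshift reads only those column files from disk:

```sql
-- Reads just two columns from disk instead of every column
-- in the table, as a SELECT * would.
SELECT product_id, load_dt
FROM enterprise_db.product
WHERE load_dt >= '2024-01-01';
```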

Tip: If you want to query the count of records in a table, use

SELECT count(1) FROM schema_name.your_table;

instead of

SELECT count(*) FROM schema_name.your_table;

5. Keep data type lengths as narrow as possible, for example varchar(64) instead of varchar(255) if the column’s values never exceed 64 characters. Redshift sizes memory for intermediate query results based on the declared width, so narrower columns help query performance.
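A sketch of the same table declared with right-sized columns (the table name here is hypothetical, to avoid clashing with the earlier examples):

```sql
-- Widths sized to the actual data. Note that Redshift can later
-- enlarge a varchar column (ALTER TABLE ... ALTER COLUMN ... TYPE)
-- but cannot shrink it, so starting narrow is the safer default.
CREATE TABLE enterprise_db.product_narrow (
    product_id int encode zstd,
    location varchar(32) encode bytedict,  -- not varchar(255)
    load_dt date encode zstd
);
```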