Best practices with Amazon Redshift: Architecture, organization and performance optimization.

A practical guide, based on personal experience, that covers Redshift cluster architecture, data types, column compression, distribution style and sort key, with the aim of directly improving performance.

Alice Thomaz
11 min read · Apr 25, 2023

Optimizing data processing is one of the main concerns of Afya’s engineering team. Recently, we sought to deepen our knowledge about Data Warehousing in collaboration with the AWS team.

Data Warehouse is a system that acts as a centralized repository to manage, store and analyze different data sources. In our company, we use the Redshift tool provided by AWS, which allows us to access, manipulate and process large volumes of data efficiently. With tables that reach up to 20 million records per day, the scalability of Redshift is essential to support our data processing needs.

The goal of this post is to provide an overview of how Redshift works, not only for data engineers who work directly with the tool but also for those who are not yet familiar with it and want to better understand how it works behind the scenes.

>> Properties and Organization

Redshift is known for using the ACID concept in its transactions, which ensures that they have four fundamental properties: atomicity, consistency, isolation and durability. This feature is essential to ensure that queries are executed safely and efficiently, avoiding conflicts with other transactions. In addition, information is preserved even in the case of system failures, and a transaction only takes effect after its complete cycle, ensuring that no process is applied partially.

When a data structure is created in Redshift, a single database is created by default, and the models are organized within it in a hierarchy: Database > Schema > Table/View.

Redshift follows the schema-on-write model: the structure, columns, data types, etc. must be specified when defining a new model in our schemas, which group the tables. These models are developed and created in the Redshift environment using SQL. Below is an example of how to create a table in the tool:

CREATE TABLE campaigns.campaigns_submit_leads
(
csl_root_id CHAR(36) ENCODE zstd
,csl_root_tstamp TIMESTAMP WITHOUT TIME ZONE ENCODE RAW
,csl_persona VARCHAR(33) ENCODE zstd
,csl_title VARCHAR(128) ENCODE zstd
,csl_page_bounce INTEGER ENCODE az64
,csl_br_features_pdf BOOLEAN ENCODE zstd
)
DISTSTYLE KEY
DISTKEY (csl_root_id)
SORTKEY (csl_root_tstamp)

In the next topic, we will explore the architecture of the tool to understand how Redshift processes and stores data. From there, we will walk through the table-creation statement above, analyzing each aspect of it, such as data type, encoding, distribution key and sort key, and understand how each one contributes to the table’s performance.

>> Architecture: Processing and Storage

The Redshift architecture is composed of several layers. When users access the tool, the connection can be established via JDBC (Java Database Connectivity) or ODBC (Open Database Connectivity). Both are APIs used to connect an application to a database. However, ODBC is more suitable for applications that need to connect to different databases using different languages, while JDBC is more suitable for applications developed in Java. At Afya, we usually use the DBeaver tool to access and develop queries in the Redshift environment.

Inside the Redshift cluster, the next layer is the Leader Node. It is responsible for receiving the query, compiling it and defining a processing plan for the next layer: the Compute Nodes. Redshift uses parallel processing, which means the processing plan is divided among the available machines, and at the end of the process the results are returned to the previous layer.

The number of Compute Nodes is adaptable and can be adjusted according to the company’s needs. The more nodes, the more parallelism, slices, memory and performance, but also the higher the cost of the tool.

Each Compute Node is composed of a group of Slices, whose quantity is defined according to the size and type of the chosen node. At Afya, we work with the newer RA3 instance type, which gives us managed storage and greater flexibility to scale our cluster.

A Slice is a portion of disk and processing. The disk is where data is stored and, depending on the chosen distribution style, the data may live on different slices or on different nodes. Within this disk, data is stored in 1 MB blocks. Unlike some transactional databases, such as MySQL, which store data row by row, storage in Redshift is columnar. In an analytical database, queries perform many filters and joins, and this storage model is what lets us gain performance in those operations.

>> Choosing the Data Type — Structured Data

When it comes to storing and processing data in a database, choosing the appropriate data types is essential to ensure good performance and accuracy in data analysis. There are many types of data available in Redshift, each with a fixed set of associated properties. Some examples of data types and their definitions according to AWS are:

  • TIMESTAMP WITHOUT TIME ZONE — Date and time (without time zone)
  • CHAR — Fixed-length character string
  • VARCHAR — Variable-length character string with user-defined limit
  • INTEGER — Signed four-byte integer
  • BOOLEAN — Logical Boolean (true/false)

Using data types with a size larger than necessary, such as creating a weekday column with VARCHAR(150) instead of VARCHAR(9), will cause Redshift to reserve a larger memory space than necessary.

When the system processes this information, it tries to process it in memory to accelerate the process, but if there is too much memory waste, part of the processing needs to be done on disk, which can negatively impact performance. Therefore, it is important to avoid memory waste and choose the most suitable data types for each column.
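As a minimal sketch of the difference, both definitions below would store the same weekday strings, but the oversized one makes Redshift reserve far more memory per value during query execution (the schema, table and column names are hypothetical, for illustration only):

-- Oversized: reserves room for 150 bytes even though the longest
-- weekday name ("wednesday") needs only 9.
CREATE TABLE example.events_oversized
(
evt_weekday VARCHAR(150) ENCODE zstd
);

-- Right-sized: the declared length matches the longest expected value.
CREATE TABLE example.events_rightsized
(
evt_weekday VARCHAR(9) ENCODE zstd
);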

The complete list of data types can be found at this link: AWS — Data Types.

>> Column Compression

As mentioned earlier, Redshift uses columnar storage, which makes column compression extremely important. For example, when storing a “day of the week” column, all the data in that column has the same type, which allows for much more efficient compression when it is applied to the entire column. Redshift is capable of achieving compression ratios of 3 to 4 times relative to the original data size.

In addition to reducing storage requirements, compression also improves Redshift’s performance, since the performance of this database is directly tied to I/O (Input/Output), that is, to the transfer of data between disk, memory and processor.

There are several types of compression. A column can be defined as ENCODE AUTO, where Redshift itself chooses the ideal type, or the encoding can be set manually when creating the table. By default, the column defined as the sort key remains without encoding, that is, encoded as RAW. For the other columns, at Afya we usually work with two types of compression: AZ64 and ZSTD. AZ64 is used for columns of the types SMALLINT, INTEGER, BIGINT, DECIMAL, DATE, TIMESTAMP and TIMESTAMPTZ, while other types use ZSTD.
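For existing tables, Redshift can also suggest encodings based on a sample of the data. A minimal sketch, run against the table from the first topic (the command acquires a table lock while sampling, so it is best run outside peak hours):

-- Ask Redshift to analyze a sample of the table and report, for each
-- column, the estimated space savings of its recommended encoding.
ANALYZE COMPRESSION campaigns.campaigns_submit_leads;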

The complete list of encoding types can be found at this link: AWS — Compression Encodings.

>> Distribution Style — DISTKEY

As mentioned earlier, when creating a query processing plan in Redshift, parallelism is used to organize data based on the available nodes and slices.

An important property for organizing data is the distribution style of the table, which allows for better control in the organization of internally stored data. The goal is to minimize data transfer during the query, ensure equal distribution of data for parallel processing and ensure that each node has access to the same amount of information.

AWS provides four types of distribution, described below:

  • AUTO — This is Redshift’s default option, where the distribution style is determined based on user interaction with the table. It can dynamically change the style based on current usage, but only during periods of low usage to avoid performance impacts.
    By default, for small tables, the initial setting selected is ALL. As the table grows, Redshift may switch to the KEY type if it identifies a suitable key based on the pattern of queries being processed. If it doesn’t find an ideal key, it can change to the EVEN type.
  • KEY — In this distribution style, a key column must be defined to organize the data across the cluster, as rows with the same value in that column are stored on the same slice. It is important to choose a key with high cardinality to avoid row imbalances between slices (a simple skew check is sketched after this list). Generally, it is recommended to choose a column frequently used in JOIN and WHERE clauses when writing queries.
  • ALL — In this distribution style, the entire table is stored in the first slice of each node. When the table is invoked in a query with different sources, it can be processed on the same node as the other source, avoiding data transfer between nodes. As a result, although the ALL style has a higher storage cost, it provides performance gains; it is only recommended for static tables with fewer than 3 million rows.
  • EVEN — In this distribution style, data is evenly distributed across all available slices. If your table doesn’t fit into the previous distribution styles, the even style is recommended.
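One way to confirm that the chosen key is working well is to query the system view SVV_TABLE_INFO, which exposes each table’s current distribution style and the row skew between slices. A minimal sketch of that check:

-- List tables in the schema ordered by row skew.
-- skew_rows is the ratio between the slice with the most rows and the
-- slice with the fewest; values close to 1 indicate a well-balanced key.
SELECT "table",
       diststyle,
       tbl_rows,
       skew_rows
FROM svv_table_info
WHERE "schema" = 'campaigns'
ORDER BY skew_rows DESC;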

>> Sorting Key — SORTKEY

Sort keys physically order data on disk, and date columns are an excellent example of a column for this purpose. Within the data blocks, it is possible to sort and group data according to the chosen sort key, allowing the ordered data to be exploited in the best possible way during query processing.

One available option is the AUTO key, in which Redshift monitors the query pattern and automatically chooses the most appropriate sorting key, typically when most of the filters or joins performed with the table are made with specific columns.

It is possible to use more than one sort key in the same table, which is called a compound sort key. It is important to define the sort columns in ascending order of cardinality, and it is recommended not to exceed four columns. The choice of these columns must be made carefully, as a poor choice can generate problems during sorting via the VACUUM command.
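A minimal sketch of a compound sort key, using a hypothetical events table (the names are illustrative, not from our environment): the columns are listed from lowest to highest cardinality, and the leading sort column is kept as RAW, as noted in the compression topic.

-- Compound sort key: data is ordered first by the two-letter country
-- code (few distinct values) and then by the timestamp (many values).
CREATE TABLE example.events_sorted
(
evt_country CHAR(2) ENCODE RAW
,evt_tstamp TIMESTAMP WITHOUT TIME ZONE ENCODE az64
,evt_value INTEGER ENCODE az64
)
DISTSTYLE EVEN
COMPOUND SORTKEY (evt_country, evt_tstamp);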

VACUUM is a command that sorts rows and reclaims space in the chosen table. Redshift automatically runs the command in the background during periods of low usage, but it is also recommended to run it after every significant load or when the data is highly disorganized. The command we typically use:

VACUUM FULL campaigns.campaigns_submit_leads TO 100 PERCENT

A useful tip is to order the data with an “ORDER BY” clause during significant data insertions into a table. This can help with sorting and make the subsequent VACUUM process less time-consuming.
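In practice, this can be as simple as ordering the insert by the sort key column. A hedged sketch using the table from the first topic and a hypothetical staging table:

-- Inserting new rows already ordered by the sort key column keeps the
-- unsorted region small and makes the next VACUUM cheaper.
-- campaigns.stg_campaigns_submit_leads is a hypothetical staging table.
INSERT INTO campaigns.campaigns_submit_leads
SELECT *
FROM campaigns.stg_campaigns_submit_leads
ORDER BY csl_root_tstamp;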

>> Practical Example

After analyzing the functioning and importance of each topic described above, it is important to understand how these concepts apply in a real case. To exemplify, we will use the table-creation statement from the first topic and detail each choice made in it.

CREATE TABLE campaigns.campaigns_submit_leads
(
csl_root_id CHAR(36) ENCODE zstd
,csl_root_tstamp TIMESTAMP WITHOUT TIME ZONE ENCODE RAW
,csl_persona VARCHAR(33) ENCODE zstd
,csl_title VARCHAR(128) ENCODE zstd
,csl_page_bounce INTEGER ENCODE az64
,csl_br_features_pdf BOOLEAN ENCODE zstd
)
DISTSTYLE KEY
DISTKEY (csl_root_id)
SORTKEY (csl_root_tstamp)
  • Data Type: After a careful analysis, we chose the ideal data type for each column, considering the nature of the data stored in each one. As mentioned earlier, choosing the appropriate data type is essential to ensure efficient storage, saving disk space and improving query performance.
    Taking the column csl_root_id as an example, which contains data of unique universal identifiers, a fixed length of 36 characters was defined according to the product team’s specifications.
    In the case of the column csl_persona, even with variable-sized data, we were able to keep the declared length close to the actual capacity needed for proper storage.
  • Encode: In the query, each column is encoded with a specific compression method to reduce the size of the data stored on disk and enhance the efficiency of queries, without compromising the accuracy of the information. We aim to follow the standards established by AWS for choosing different encoding types.
    Regarding our sortkey, the csl_root_tstamp column is encoded as RAW, which means that the column is stored without compression to allow for efficient sorting of the data. For the csl_page_bounce column, we opted for the AZ64 encoding, which is AWS’s recommendation for this type of data (INTEGER).
    For the other columns that didn’t fit the profile of AZ64, we opted for ZSTD encoding. We understand that this is one of the most efficient compression techniques available in Redshift currently.
  • Distribution Style: At Afya, we work with very large tables, with billions of rows, which makes the ALL method unsuitable in most cases. When analyzing the models that use the campaigns.campaigns_submit_leads table, we noticed that it is often used as a source of additional information for more complete event tables.
    This table is usually linked via JOIN to event tables using the unique identifier, the root Id. Therefore, to improve query efficiency and better utilize resources, we opted to use the csl_root_id column as the distribution key in this case (a sketch of this type of join follows this list).
  • Sorting Key: Our goal is to provide the analysis team with increasingly comprehensive data periods. However, given the large amount of annual data and the need for regular monitoring, it has become essential to organize the environment to facilitate both specific analyses and daily inserts made by our data transformation tool, DBT.
    An example of this is the choice of the csl_root_tstamp column as the basis for sorting. Being the main date column in the table, it helps in the incremental modeling process. This is because old data is already included in the final table, which contains the historical load.
    Therefore, when inserting more recent data, it isn’t necessary to perform a complete analysis of the table. It is enough to update the data with the most recent dates.
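To illustrate the effect of the distribution choice, here is a hedged sketch of the typical join (the event table and its column names are hypothetical): when both tables are distributed by the same identifier, matching rows already live on the same slice and the join avoids redistributing data between nodes.

-- Join on the distribution key: rows with the same root id are stored
-- on the same slice in both tables, so no data is shuffled between nodes.
-- analytics.events and its columns are hypothetical names used for illustration.
SELECT ev.event_name,
       csl.csl_persona,
       csl.csl_title
FROM analytics.events AS ev
JOIN campaigns.campaigns_submit_leads AS csl
  ON ev.event_root_id = csl.csl_root_id
WHERE csl.csl_root_tstamp >= '2023-01-01';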

Understanding the operation of Redshift has been essential for planning improvements in our Data Warehouse. By following the specifications mentioned above, we were able to significantly reduce processing time. When dealing with a large volume of data — as is the case with Afya, where we have tables with billions of rows — dedicating efforts to improving performance becomes vital for the proper functioning of the tool.

Our multidisciplinary team, consisting of engineers, analysts and scientists, aims to improve and make the most of the tool’s functionalities, as well as to make the usage environment more accessible and constructive for other teams that actively use it. If you have any suggestions, questions or comments about the content, feel free to contact me on LinkedIn.

Portuguese version: https://medium.com/@alice_thomaz/394ee0394fbb
