Making sense of Redshift performance

Luis Guilherme Almeida
4 min readApr 1, 2020

--

photo of a person at the end of a mysterious data-center corridor
Credit: Alan Brandt

Redshift is a powerful data warehouse that, at first sight, seems to magically provide results for complex and data-intensive queries, that otherwise would be impractical (if not impossible) to run on regular OLTP Databases, such as MySQL. As you probably know, the performance offered by such databases comes at a cost: it does not offer integrity constraints, meaning it should never be used as a transactional database.

But how does it work?

There are three main points that together, will speed up queries dramatically when compared to regular relational databases. Those are columnar storage, distribution across nodes and data encoding, we will be discussing those individually.

Columnar storage

In transactional databases, data is stored in a spreadsheet fashion, that is: each row, containing several columns, is piled one after the other. On the other hand, in columnar databases, each column is stored separately in its own pile.

spreadsheet and its physical storage representation on row based and column based layouts. rows piled vs columns piled
Comparison between row-based and column-based database storages

To show the performance difference, in a row-based MySQL database, two tables are built, both of them holding 2 million rows: The first one contains a single amount column and the second holds 10 columns, one of them being the amount.

The following execution times are obtained by running an aggregation over the entire dataset.

SELECT SUM(amount) FROM table_single_column; 
#> 0.75s execution time
SELECT SUM(amount) FROM table_multiple_columns;
#> 4.8s execution time

They both yield the same results, but the second query, that was run against the table with more columns, took much longer to complete!

What if those same queries are run on Redshift?

SELECT SUM(amount) FROM table_single_column; 
#> 0.44s execution time
SELECT SUM(amount) FROM table_multiple_columns;
#> 0.45s execution time

They also yield the same result, but now, the time to compute results are roughly the same. Why is that?

As mentioned, Redshift will store columns in their own separate file, so when running a query against a column, the database will only load that one column, making the execution time remain constant when summarizing the two tables. That’s not the case for MySQL, though, as it would require to load the table completely and then compute the results. That’s why an aggregation on a table with only one column will be much faster than the same operation on a table with lots of columns (I’m not considering indexes here).

Distribution Across Nodes

Probably the most important feature of Redshift is its ability to parallelize queries. In this database, the data is distributed across various computers, also called nodes. By doing so, records can be filtered and operated individually and simultaneously in each of the nodes.

To show the impact of serial versus parallel executions, two tables, each holding 120 million rows were built. In one of them, the rows were forced into a single node, which will cause the queries to be serialized.

In the other table, however, the rows were distributed evenly across the nodes. The execution time changes drastically between the serial and parallel versions:

SELECT SUM(amount) FROM distributed_table; 
#> 1.1s execution time
SELECT SUM(amount) FROM table_running_on_single_machine;
#> 6.7s execution time

If you want to read further on how Redshift parallel processing works, I recommend the official documentation page.

Data encoding

The last thing that will be covered in this post is data encoding, which is another fundamental feature to speed up query executions. In layman’s terms, data encoding is all about reducing the disk space usage to store database records, without data loss.

But how reducing disk space will improve query performance?

Basically, if we store the same information in less space, the query execution would be a lot faster, since the database would need to do fewer disk reads (which is a very expensive operation).

Redshift offers various types of compression algorithms that should be carefully chosen for each of the columns in a table, or left to the database to automatically apply the appropriate one. For me, the most interesting compression types are Run Length and the Byte-Dictionary Encoding. You can check them out here.

That’s it for this post, hope you enjoyed it :)

As a Software Developer at EBANX, I get to work in a world-class engineering team that leverages cutting-edge technologies such as Redshift to deliver the best payment solutions in the market. Make sure you check out our open positions!

UPDATE: There is a second part: Improving Redshift query performance where I’ll go through the performance optimizations that can be applied in order to achieve the most with Redshift.

#ebanxlife

--

--