Cost-effectively scaling a Data Warehouse by decoupling compute from storage in Amazon Redshift

Sonny Mohnalkar
Salesloft Engineering
Nov 13, 2019
(Header image courtesy of concurrencylabs)

In my previous article, “Developing a Data Warehouse in the Cloud for SaaS at SalesLoft”, I mentioned the storage challenges of scaling a data warehouse in the cloud. Is it possible to store this data cost-effectively and still consume it in a meaningful way through analytics? We noticed that when consuming data from partitioned tables, the storage footprint in Amazon Redshift grows disproportionately large. We also discovered that an Amazon Redshift cluster uses disk storage differently than traditional databases do. Moreover, the amount of disk space allocated to two tables created with the same DDL and holding the same number of rows can vary significantly between Amazon Redshift clusters. Redshift’s 1 MB block size also inflates storage usage.

Here’s an example of the minimum table size calculation on a Redshift cluster:

If you have a table with 50 columns on a 10-node dc2.large cluster (2 slices per node, so 20 slices in total; the leader node holds no data slices), then the smallest size the table can occupy while populating all 20 slices is calculated as follows:

1 MB * (50 user columns + 3 system columns) * 20 slices = 1060 MB

However, if the same 50-column table lives on a single-node dc2.large cluster (2 slices), then the smallest size it can occupy while populating both slices is calculated as follows:

1 MB * (50 user columns + 3 system columns) * 2 slices = 106 MB

The same table therefore consumes more disk space as the cluster grows, and we see a major storage problem when consuming data from a distributed source database into a Redshift data warehouse.

For any SaaS product, login data is a vital indicator of daily active usage, retention, stickiness, growth, adoption, and other usage metrics. Below is the DDL for the ‘identifies’ table, which stores the user login events for the SalesLoft application.
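The production table has many more columns, but a trimmed-down, illustrative version of the DDL looks like the following (every column name other than received_at is an assumption made for the examples in this article):

    CREATE TABLE public.identifies (
        id           VARCHAR(64)  NOT NULL,  -- login event id (assumed)
        user_id      VARCHAR(64),            -- user who logged in (assumed)
        email        VARCHAR(256),           -- user email (assumed)
        context_ip   VARCHAR(64),            -- client IP address (assumed)
        received_at  TIMESTAMP    NOT NULL   -- event timestamp; used later as the partition key
    )
    DISTSTYLE KEY
    DISTKEY (user_id)
    SORTKEY (received_at);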

The data users access most frequently is current data, and users always expect a fast response when they retrieve it. However, the cost of your Redshift cluster depends on the combination of data storage and compute resources, and unfortunately storage and compute cannot be procured or scaled independently in Redshift. By storing only current data on the Redshift cluster and historical data on S3, you can effectively manage storage and compute independently without upgrading the Redshift cluster.

Let’s say we have 5 years of data, of which the current data (the latest year) is frequently accessed while the historical data (the prior 4 years) is less frequently accessed by business users. How do we manage scaling costs with efficient storage on the Redshift cluster while retaining the ability to readily access historical data when needed? To solve this problem, we perform the series of steps listed below:

  1. Archive historical data to Amazon S3
  2. Create an external table and partition it (Amazon Redshift Spectrum)
  3. Store only current data on the Redshift cluster
  4. Create a view that spans Amazon Redshift and Redshift Spectrum external table

Step 1: Archive historical data to Amazon S3

Storing and retrieving data on S3 is much cheaper than storing it on a Redshift cluster, and S3 provides virtually unlimited storage. We can query the S3 data with standard SQL by leveraging Amazon Redshift Spectrum, even for more complex analyses.

Archiving historical data from a Redshift table to S3 involves unloading the historical rows to an Amazon S3 bucket using a SELECT statement inside the UNLOAD command, as shown below.
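Here is a sketch of such an UNLOAD, assuming a hypothetical archive bucket, partition folder layout, and IAM role; in practice the ETL process runs one UNLOAD per partition value (for example, per day):

    -- Unload one day of historical data into its own S3 prefix.
    -- The bucket name, prefix layout, and IAM role ARN are illustrative placeholders.
    UNLOAD ('SELECT * FROM public.identifies WHERE TRUNC(received_at) = ''2018-06-01''')
    TO 's3://salesloft-dw-archive/identifies/received_date=2018-06-01/part_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-access'
    DELIMITER '|'
    GZIP
    ALLOWOVERWRITE;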

Step 2: Create an external table with partitions

An external table provides seamless integration of S3 storage with Redshift, and you can query it using the same SELECT syntax you use with other Amazon Redshift tables. External tables are created in an external schema; refer to the AWS documentation on how to create one. Below is the DDL to create the external table.
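A minimal sketch of that DDL, assuming an external schema named spectrum has already been created and reusing the illustrative column names and file layout from above:

    -- Columns and types mirror the illustrative Redshift DDL shown earlier.
    CREATE EXTERNAL TABLE spectrum.identifies_historical (
        id           VARCHAR(64),
        user_id      VARCHAR(64),
        email        VARCHAR(256),
        context_ip   VARCHAR(64),
        received_at  TIMESTAMP
    )
    PARTITIONED BY (received_date DATE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION 's3://salesloft-dw-archive/identifies/';  -- illustrative bucket/prefix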

Here we partition our data by date (received_at) so that filtering on the partition key restricts the amount of data Redshift Spectrum scans. We have seen lightning-fast performance on queries that scan less data on S3, and because Spectrum charges by the amount of data scanned, your costs go down as well. If you frequently query at day-level granularity, partition by day; however, you may choose a coarser partition key such as month or year if you query data at a summarized level.

We store the historical data in S3 according to the partition key we choose, and then register the partitions with the external table as shown below.
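For example, assuming the hypothetical bucket and partition layout used above, each partition folder is added like this:

    ALTER TABLE spectrum.identifies_historical
    ADD IF NOT EXISTS PARTITION (received_date = '2018-06-01')
    LOCATION 's3://salesloft-dw-archive/identifies/received_date=2018-06-01/';

    ALTER TABLE spectrum.identifies_historical
    ADD IF NOT EXISTS PARTITION (received_date = '2018-06-02')
    LOCATION 's3://salesloft-dw-archive/identifies/received_date=2018-06-02/';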

Step 3: Store only current data on the Redshift cluster

To keep only the current (latest year of) data on the Redshift cluster, an ETL process that unloads the historical data to the partitioned S3 folders is sufficient. You then need to clean up the historical rows on the Redshift cluster after the transfer to S3, as shown in the SQL below.
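A minimal sketch of that cleanup, assuming a one-year retention window on the cluster:

    -- Remove rows older than one year from the Redshift table once they are safely archived in S3.
    DELETE FROM public.identifies
    WHERE received_at < DATEADD(year, -1, GETDATE());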

Then execute a VACUUM DELETE operation to reclaim the disk space occupied by the rows that were marked for deletion and keep the table performing well.
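For example:

    -- Reclaim the blocks freed by the deleted rows.
    VACUUM DELETE ONLY public.identifies;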

Step 4: Create a view that spans Amazon Redshift and Redshift Spectrum external table

We now create a common view that spans the Amazon Redshift table and the Redshift Spectrum external table. This lets us query the frequently accessed current data on the Amazon Redshift cluster and the less frequently accessed historical data on Amazon S3 through a single pane of glass, as shown below.
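A sketch of such a view (the view name identifies_all is an assumption; WITH NO SCHEMA BINDING is required because the view references an external table, which also means the tables must be schema-qualified):

    CREATE OR REPLACE VIEW public.identifies_all AS
    SELECT id, user_id, email, context_ip, received_at
    FROM public.identifies                 -- current data on the cluster
    UNION ALL
    SELECT id, user_id, email, context_ip, received_at
    FROM spectrum.identifies_historical    -- archived data on S3
    WITH NO SCHEMA BINDING;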

You can now query both the current data on the Redshift cluster and the historical data stored on S3 through this one view.
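For example, both of the following queries go through the same (assumed) view; the first is served from the local Redshift table, while the second reads the archived data on S3 via Redshift Spectrum:

    -- Recent logins: resolved from the local Redshift table.
    SELECT COUNT(*) AS logins_last_30_days
    FROM public.identifies_all
    WHERE received_at >= DATEADD(day, -30, GETDATE());

    -- Older logins: resolved from S3 through the same view.
    SELECT COUNT(*) AS logins_2018
    FROM public.identifies_all
    WHERE received_at BETWEEN '2018-01-01' AND '2018-12-31';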

Conclusion: Comparing the performance of queries against current and historical data through the single view shows that current data remains very fast to query, while historical data is still accessible whenever you need it, albeit more slowly. Storing only the current (frequently accessed) data on the Redshift cluster and archiving historical data to Amazon S3 controls your company’s spend while scaling the data warehouse in the AWS cloud. This approach lets you store more data without adding new nodes, freeing up that budget to better serve your business stakeholders.



Sonny is a Data Warehouse Engineer at Salesloft. He has a Master’s Degree from Southern Illinois University and lives in Atlanta with his wife and two children.