Redshift — Hot and Cold Data

The need for separation of hot and cold data

David Truong
3 min read · Mar 20, 2018

So you’re working with big data, storing it all in Redshift, and processing terabytes of data per day in near real time. At some point you will realize that much of that data is accessed infrequently, if at all, and wonder whether it belongs in your data warehouse. If you haven’t come to this realization yet, go to your AWS billing dashboard and look at your Redshift costs.

The concept of hot and cold data is simple. Frequently queried data is hot; less frequently queried data is cold. However, if you ask the data scientists, the machine learning platform team, or other users of your data, they will consider all of it hot and absolutely necessary. More realistically, their use cases skew toward a specific subset of the data, selected by criteria such as dates, times, geographical locations, or age groups. If most of your use cases only need data from the most recent month, why keep older months in your Redshift cluster and pay premium prices for them?

Now this is where Spectrum comes into the picture. Spectrum lets you run SQL queries against files stored on S3. This means you can unload data from your Redshift cluster into files on S3, or avoid loading it into the cluster in the first place, and still query that data easily.
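As a rough sketch of what that setup looks like (the schema, database, table, bucket, and IAM role names here are all placeholders, not our actual setup):

```sql
-- Register an external schema backed by the AWS Glue/Athena data catalog.
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'my_spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define an external table over Parquet files sitting in S3,
-- partitioned by month so queries can skip irrelevant files.
CREATE EXTERNAL TABLE spectrum_schema.events_archive (
    event_id   BIGINT,
    user_id    BIGINT,
    event_time TIMESTAMP
)
PARTITIONED BY (event_month VARCHAR(7))
STORED AS PARQUET
LOCATION 's3://my-bucket/events-archive/';

-- The external table queries like any other table and can even be
-- joined against local Redshift tables.
SELECT COUNT(*) FROM spectrum_schema.events_archive;
```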

The main differences between keeping your data in Redshift and querying it through Spectrum come down to performance and billing. Redshift clusters are billed by the hour regardless of load, and queries against them perform better because the data is managed on the cluster, with query result caching, sort keys, optimized data distribution, and so on. Spectrum is billed per query, based on the amount of data scanned, and storing files in S3 costs a small fraction of what the equivalent storage in a Redshift cluster does. By using Redshift for hot data and Spectrum for cold data, you get the best of both worlds: performance on hot data and reduced cost on cold data.
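Because Spectrum bills on bytes scanned, file layout matters. Continuing with the placeholder table above, columnar formats like Parquet plus date-based partitions let a query read only the files it actually needs:

```sql
-- Register one month's worth of files as a partition
-- (the S3 path layout is an assumption of this sketch).
ALTER TABLE spectrum_schema.events_archive
ADD PARTITION (event_month = '2018-02')
LOCATION 's3://my-bucket/events-archive/event_month=2018-02/';

-- Filtering on the partition column means Spectrum scans only that
-- month's files, which is what you are billed for.
SELECT user_id, COUNT(*) AS events
FROM spectrum_schema.events_archive
WHERE event_month = '2018-02'
GROUP BY user_id;
```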

On my team, we were able to bring down disk usage on our Redshift cluster by 75%, in both our production and test environments. We were running 8 “ds2.8xlarge” cluster nodes at the on-demand price of $6.80 per hour per node, which works out to an annual cost of:

$6.80 (cost per hour per node) × 8 (nodes) × 24 (hours per day) × 365 (days per year) = $476,544 per year

That is the bare minimum we would have paid just to keep the data, before processing anything new. We were quickly running out of disk space on our 8 nodes, and each additional node would have added roughly $60,000 per year ($6.80 × 24 × 365 ≈ $59,568).

By unloading our cold data from Redshift to S3, we were able to scale the cluster down to just 4 nodes, with plenty of disk space to spare, and cut our costs by 50%. Beyond the savings, our queries also got faster, since there was less data left in the Redshift cluster.
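For reference, the unload-then-delete step can look something like the following. This is a minimal sketch, not our exact pipeline: the table, bucket, and role names are placeholders, and the Parquet output assumes a Redshift release that supports UNLOAD … FORMAT AS PARQUET.

```sql
-- Copy rows older than the current month out to Parquet files on S3.
UNLOAD ('SELECT * FROM events WHERE event_time < DATE_TRUNC(''month'', CURRENT_DATE)')
TO 's3://my-bucket/events-archive/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
FORMAT AS PARQUET;

-- Then remove those rows from the cluster and reclaim the disk space.
DELETE FROM events WHERE event_time < DATE_TRUNC('month', CURRENT_DATE);
VACUUM events;
```

Once the unloaded files are registered as partitions of the external table, the cold data stays queryable through Spectrum while the cluster only holds the hot month.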
