Optimizing Data Query Processing and Storage — Redshift vs. Redshift Spectrum

Sanjay · Published in Pepperfry Tech · Dec 21, 2020 · 6 min read

Companies today know the value of preserving and using their data intelligently.

This data keeps growing every day. We need analytics-driven systems that can handle large data streams and derive business value from them.

Amazon Web Services (AWS) met this need by becoming the first public cloud provider to offer a cloud-native, petabyte-scale data-warehousing service: Amazon Redshift.

Amazon Redshift has seen several enhancements since 2013. One of the most significant was the addition of the "serverless" Amazon Redshift Spectrum.

Spectrum is a massively scalable, serverless query processing engine that lets you join data sitting in Amazon S3 with data in Amazon Redshift.

But with great operational scalability come cost concerns.

In this blog, we’ll share the techniques the product team at Pepperfry used to gather and process large data volumes for better decision-making.

Before that, let us understand why,

Moving Data to Amazon Redshift Spectrum is Important

Redshift Spectrum offers the ability to use the Redshift cluster to query the data stored in S3.

When data is put in an S3 bucket, an external schema catalog (typically the AWS Glue Data Catalog) tells Redshift what that data looks like.

The schema catalog stores meta-information on the files’ location, how they’re partitioned, and what data they hold.

The setup is straightforward: you catalog the data sitting in S3 (for example, with AWS Glue) and pair it with a Redshift cluster that holds your other data. You can then query the S3 data from Redshift and even join Redshift tables with it.
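To illustrate, the setup boils down to something like the following. This is a minimal sketch; the schema, catalog database, tables, columns, and IAM role ARN are hypothetical, not our production names.

```sql
-- Register the AWS Glue Data Catalog as an external schema in Redshift
-- (the catalog database name and IAM role ARN below are placeholders).
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_schema
FROM DATA CATALOG
DATABASE 'analytics_catalog'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Join S3-backed data (via Spectrum) with a regular Redshift table.
SELECT o.order_id, o.order_amount, c.city
FROM spectrum_schema.orders AS o      -- external table over S3 files
JOIN public.dim_customer AS c         -- table stored inside the cluster
  ON o.customer_id = c.customer_id
WHERE o.order_date >= '2020-12-01';
```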

Performance and cost in this multi-piece setup depend on the Redshift cluster size, file format, partitioning, and so on.

S3 and Redshift Spectrum enable cheaper data storage, faster setup, intelligent data querying, and storage scalability.

With Redshift Spectrum, data volume is no longer a concern: we write to S3 with predictable, reliable throughput.

While Amazon Redshift handles “analytical workloads,” Spectrum is the “bridge” layer that provides Redshift an interface to S3 data.

Redshift is the access layer for enterprise applications. Spectrum is the query processing layer for data accessed from S3.

The following image depicts the relationship between these services.

Image source: freeCodeCamp

Here’s how we Optimized Data Management and Storage Costs on Amazon Redshift.

At Pepperfry, we use Redshift as an analytical tool to support our product team with decision-making and performance management.

The data we use comes from multiple sources.

We use cron jobs to schedule processes that run periodically at fixed times, dates, or intervals.

Data processing and handling is a three-step cycle in Redshift.

The hourly cron fetches raw data from S3, processes it, and stores the clean data in the cluster.
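A simplified sketch of one hourly run under this original setup (the table names, S3 paths, and columns are illustrative placeholders):

```sql
-- Load the hour's raw files from S3 into a staging table on the cluster.
COPY staging.clickstream_raw
FROM 's3://pepperfry-raw/clickstream/2020/12/20/10/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
FORMAT AS CSV
GZIP;

-- Clean the data and append it to the reporting table that lives on cluster
-- storage (the real transformation is more involved than this projection).
INSERT INTO analytics.clickstream (event_id, customer_id, page_url, event_ts)
SELECT event_id, customer_id, page_url, event_ts
FROM staging.clickstream_raw
WHERE event_ts IS NOT NULL;
```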

As Pepperfry grows, we get more online hits, which need to be processed and stored in Redshift.

Eventually, to accommodate the growing data volume, nodes have to be added to our Redshift cluster.

We had one Redshift cluster with 10 nodes (dc2.large) storing over 1.4 TB of data.

Adding nodes (manually or automatically) is expensive. It also takes an average of 3 hours to add nodes to a cluster and confirm they function correctly (depending on the number of nodes and the data size).

During this time, all the other services using Redshift need to be shut down.

This causes excessive process delays, which is not ideal for day-to-day data operations.

The Solution

Instead of storing data in Redshift, we started storing it in AWS S3 and querying it using Redshift Spectrum (external tables).

With this architecture, we need a cluster with only a single node.

The combination of AWS S3 and Redshift Spectrum made the system work like a well-oiled machine.

Storing the data in S3 is cost-effective, and the analytical processing in the Redshift Spectrum layer costs us only for the data each query scans.
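To make this concrete, here is a hypothetical sketch of an external (Spectrum) table over the S3 data, partitioned by date and hour so that queries scan only the files they need. The columns, paths, and the delimited-text format are illustrative; the file format is exactly what we went on to tune in the attempts below.

```sql
-- External table: the data stays in S3, only metadata lives in the catalog.
CREATE EXTERNAL TABLE spectrum_schema.clickstream (
    event_id     BIGINT,
    customer_id  BIGINT,
    page_url     VARCHAR(1024),
    event_ts     TIMESTAMP
)
PARTITIONED BY (event_date DATE, event_hour SMALLINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://pepperfry-datalake/clickstream/';

-- Each hourly run registers the partition it just wrote.
ALTER TABLE spectrum_schema.clickstream
ADD IF NOT EXISTS PARTITION (event_date = '2020-12-20', event_hour = 10)
LOCATION 's3://pepperfry-datalake/clickstream/2020-12-20/10/';
```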

We then optimized the architecture’s performance and cost over four attempts, largely by changing how the data is stored in S3.

Attempt 1

For the above architecture, we first stored the data in .csv format. We soon realized this exceeded our cost threshold: the files were large in S3, and Spectrum queries over them were slow.

The operational costs for Spectrum were also high because of the volume of data we scanned; Spectrum charges by the amount of data scanned per query.

Attempt 2

Here, we stored the data gzip-compressed. This took up less space and needed less scanning time than the plain .csv format.

This was a good step in cost reduction, saving nearly 20% of the costs, but still not the most efficient option.

Attempt 3

After learning from the earlier attempts, we decided to switch the storage format from gzip-compressed files to Apache Parquet.

The results were excellent: the files compressed far better, and the data scanned per query dropped drastically.

The data files shrank to roughly 1/10th of the size of the .csv files, and query performance improved tenfold.

To migrate older data from Redshift to Parquet, we used Redshift’s UNLOAD command.
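A sketch of such a migration (table names, paths, and the IAM role ARN are placeholders): UNLOAD runs a query on the cluster and writes the result straight to S3 as Parquet, optionally partitioned so Spectrum can prune files later.

```sql
-- Export historical rows from the cluster to S3 in Parquet format,
-- partitioned by date so Spectrum can skip irrelevant files at query time.
UNLOAD ('SELECT event_id, customer_id, page_url, event_ts,
                TRUNC(event_ts) AS event_date
         FROM analytics.clickstream
         WHERE event_ts < ''2020-10-01''')
TO 's3://pepperfry-datalake/clickstream_parquet/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
FORMAT AS PARQUET
PARTITION BY (event_date);
```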

Why Apache Parquet?

· It is designed for efficient columnar storage, unlike row-based formats such as CSV.

· Apache Parquet can handle complex nested data structures.

· It is built around efficient compression and encoding schemes.

· Apache Parquet lowers data storage costs and maximizes query efficiency with serverless query services like Redshift Spectrum and Amazon Athena, as well as engines such as Google Dataproc.
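With the Parquet files in place, only the storage clause of the external table definition changes (same hypothetical names as in the earlier sketch):

```sql
-- Parquet-backed external table over the unloaded files.
CREATE EXTERNAL TABLE spectrum_schema.clickstream_parquet (
    event_id     BIGINT,
    customer_id  BIGINT,
    page_url     VARCHAR(1024),
    event_ts     TIMESTAMP
)
PARTITIONED BY (event_date DATE)
STORED AS PARQUET
LOCATION 's3://pepperfry-datalake/clickstream_parquet/';
```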

Attempt 4

At Pepperfry, we believe there is always scope to optimize query execution. We process 24 files per day, one per hourly run.

Earlier, whenever we queried a month of data, Spectrum had to scan 720 files (30 days × 24 files).

The fewer files a query has to scan, the faster it returns results.

To overcome this, we created a daily cron that merges all of a day’s hourly files into a single file.

The Spectrum layer then processes one file per day instead of 24.
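A minimal sketch of one way such a merge could be done, assuming a day’s data fits in a single UNLOAD output file; the table names and paths are placeholders, and our actual cron may differ in detail.

```sql
-- Re-unload yesterday's 24 hourly partitions as a single Parquet object.
UNLOAD ('SELECT event_id, customer_id, page_url, event_ts
         FROM spectrum_schema.clickstream_parquet
         WHERE event_date = ''2020-12-20''')
TO 's3://pepperfry-datalake/clickstream_daily/event_date=2020-12-20/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
FORMAT AS PARQUET
PARALLEL OFF;  -- write serially so the whole day lands in one file

-- clickstream_daily is defined like the table above, but partitioned by
-- event_date only; register the merged file as that day's partition.
ALTER TABLE spectrum_schema.clickstream_daily
ADD IF NOT EXISTS PARTITION (event_date = '2020-12-20')
LOCATION 's3://pepperfry-datalake/clickstream_daily/event_date=2020-12-20/';
```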

Wrapping up

The key takeaways from this endeavor:

· Use a columnar file format that Spectrum supports (such as Parquet).

· Keep file sizes between 100 MB and 1 GB.

· The fewer files a query has to scan, the better the query performance.

The trend of zero-administration, pre-built data processing suites and “serverless” interactive query services is rapidly gaining momentum.

It gives teams new opportunities to move faster with lower investments.

Spectrum makes data analysis in Amazon S3 easier using standard SQL.

When you pay only for the queries you run, or for resources like CPU and storage, data optimization becomes crucial.

Thanks to Pushkar for the PoC & migration and our technical writer Sowmya Narayanan.

If you have great ideas to optimize data query performance and cost, we’d love to hear from you.

Connect with our experts @Pepperfry to learn new ways of solving Data Processing and Storage challenges.

Happy Discovering to you!
