Redshift — An Overview & Analysis

Shimul
Pocket Gems Tech Blog
4 min read · Mar 9, 2020

Back in 2013, as Pocket Gems’ data-analytics needs were growing, we decided to migrate our database to AWS Redshift. Now, almost 6 years later, our requirements continue to grow, with more than 20 TB of data growth every month and hundreds of charts, dashboards, data models, scheduled analyses, and tools. Based on our experience with Redshift, here is a summary of its strengths and weaknesses as a cloud data warehouse.

The Good

Cheap and Flexible Pricing Model

Redshift is cheap (you can start off with a cluster at less than $1 per hour). AWS provides both on-demand and reserved pricing models for Redshift. If you only need to run analyses infrequently on static data, the on-demand model is a good fit. Alternatively, if you need to run a 24x7 ETL system with constant data analysis and modeling, the reserved pricing model gives you up to a 75% discount with 3-year billing. (AWS has also recently released a feature called Concurrency Scaling that lets you mix the on-demand and reserved pricing models.)
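As a back-of-the-envelope illustration of the two models (the per-node rate and discount below are illustrative assumptions, not current AWS prices):

```python
# Rough comparison of on-demand vs. reserved Redshift pricing.
# The rate and discount are illustrative assumptions, not AWS quotes.
ON_DEMAND_PER_NODE_HOUR = 0.25  # USD, roughly a dc2.large-class node
RESERVED_DISCOUNT = 0.75        # "up to 75%" with 3-year billing

def monthly_cost(nodes, hours=730, reserved=False):
    """Approximate monthly cluster cost in USD."""
    rate = ON_DEMAND_PER_NODE_HOUR
    if reserved:
        rate *= 1 - RESERVED_DISCOUNT
    return nodes * hours * rate

# A 4-node cluster running 24x7:
#   on-demand: monthly_cost(4)                -> 730.0
#   reserved:  monthly_cost(4, reserved=True) -> 182.5
```

The gap widens with cluster size, which is why a 24x7 ETL workload almost always ends up on reserved nodes.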

Redshift offers two types of database clusters, DC2 (Dense Compute) and DS2 (Dense Storage), which you can select based on whether you need faster compute or more storage. This creates some flexibility around a company’s data size and analysis requirements. Remember, though: you can’t change from one type to the other (the only way is to start a new cluster after creating a data snapshot from the old one).

It’s PostgreSQL

Redshift was built on Postgres 8.0.2. Although there are some differences in their internal architecture, for most users it behaves just like Postgres. It accepts all the SQL syntax of Postgres, including the admin tables and configurations (note: Redshift also adds its own useful admin tables and views). Since most of our data users (data scientists, product managers, marketing) were well-versed in Postgres, this was a significant advantage for us.

Also worth mentioning: community support for both Postgres and Redshift (among the oldest and most widely deployed databases) is very good.

The Bad

Not Optimized for Cloud

Redshift is best thought of as a hosted version of Postgres, later modified with cloud optimizations. As a result, it cannot quickly scale up and down (adding a DS2.8xlarge node can take ~5–10 hours for a 100 TB database), does not provide robust backup/recovery options (available only via snapshots of the full database, which take a long time to complete), and lacks load balancing, among other things.

AWS is slowly adding more cloud-native features, but they are still a far cry from the full power of a cloud-native system. Concurrency Scaling is expensive since it works only with on-demand nodes. Elastic resize doesn’t provide optimal query performance until the data has been redistributed to the new nodes (at which point it effectively becomes a classic resize).

Needs a lot of DB Management

To optimize performance, you have to run vacuuming on your database cluster regularly. If you are ingesting data 24x7, vacuuming can be slow and painful; it also takes locks on the tables, which can cause other issues. Depending on your data, you might also need to analyze your tables and tune their compression encodings.
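A minimal sketch of what such a recurring maintenance job looks like, assuming hypothetical table names and any DB-API-style cursor (e.g. psycopg2) connected to the cluster:

```python
# Sketch of a recurring Redshift maintenance job. Table names are
# hypothetical; `cursor` is any DB-API-style cursor (e.g. psycopg2)
# connected to the cluster.
MAINTENANCE_TABLES = ["events", "purchases"]

def maintenance_statements(tables):
    """Build the VACUUM/ANALYZE statements Redshift needs run regularly."""
    stmts = []
    for table in tables:
        # Reclaim deleted space and re-sort rows. This takes table
        # locks, so schedule it during low-ingestion windows.
        stmts.append(f"VACUUM FULL {table};")
        # Refresh planner statistics after large loads.
        stmts.append(f"ANALYZE {table};")
    return stmts

def run_maintenance(cursor, tables=MAINTENANCE_TABLES):
    for stmt in maintenance_statements(tables):
        cursor.execute(stmt)
```

In practice this runs on a schedule (cron, Airflow, etc.), and tuning when it runs against a 24x7 ingestion pipeline is exactly the operational burden described above.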

So what’s the big deal, you might ask? We have DBAs (database admins) for that, right? Well, the thing is, many other modern cloud data warehouses already handle all of this internally for you.

No Time Travel

Isn’t time travel everyone’s dream? I am talking about the ability to restore data to a specific point in the past. While other modern data warehouses can restore data (with second-level granularity) without any extra work, Redshift can only restore data from a snapshot. Creating a snapshot is also a lengthy, time-consuming process, so you can’t do it too often. Disaster recovery is essential to any infrastructure nowadays, and Redshift’s snapshot-based solution feels outdated.

The Ugly

Storage and Compute are Coupled

The worst thing about Redshift is that you always need enough nodes to hold all your data. If you are ingesting data at a high rate (a few TB or more per month), you have to keep adding nodes to your cluster even if your analytics (compute) needs are not growing at the same pace. It’s also worth noting that Redshift’s performance varies with the free space available in the cluster.
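The coupling can be sketched with simple arithmetic (the per-node capacity and utilization target below are illustrative assumptions, not exact AWS figures):

```python
import math

# Why storage growth forces node growth: each node has a fixed storage
# cap, so node count is driven by data size even when compute needs are
# flat. Capacities here are illustrative assumptions.
NODE_STORAGE_TB = 2.0     # assumed usable storage per node
TARGET_UTILIZATION = 0.8  # Redshift slows down as disks fill up

def nodes_needed(data_tb):
    """Minimum node count to hold `data_tb` at the target utilization."""
    return math.ceil(data_tb / (NODE_STORAGE_TB * TARGET_UTILIZATION))

# Growing ~20 TB/month forces steady node additions:
#   nodes_needed(100) -> 63    nodes_needed(220) -> 138
```

Every one of those added nodes is billed for its compute as well, whether or not the extra CPU is ever used.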

AWS does provide an alternative called Redshift Spectrum, which lets you query data in S3 from a Redshift cluster. In theory, you could store all your data in S3 and use Redshift only for compute, thereby decoupling storage and compute pricing. In practice, Spectrum is much slower than native Redshift, and you can’t run standard DDL queries on S3 data, which makes it far less useful.

Closing Thoughts

Redshift is without a doubt a powerful data warehouse solution and is used by many big companies (as shown on the official AWS page). It’s easy to use and has broad appeal to the large pool of existing SQL users. However, given its drawbacks and limitations, the data-infra engineering team at Pocket Gems started exploring better and more suitable alternatives.

In a future post, I will go through some performance benchmarking and feature analysis done by the team with a couple of other providers.

As part of the data engineering team, we develop, monitor, and improve the performance of our ETL system, BI tools, and several other business-critical systems at Pocket Gems (including CRM, A/B tests, and data modeling). If you are interested in working on a high-performing infrastructure team, we are hiring.
