How to create a virtual data warehouse in Redshift

Marat Levit
Apr 16, 2020 · 4 min read
Photo by R Mo on Unsplash

With the introduction and general availability of materialized views, I wanted to explore the possibility of creating “virtual” data warehouses. By virtual, I am referring to a warehouse that is built without ETL tools (e.g., Talend, Informatica, dbt, or Dataform for that matter).

Now, obviously the entire warehouse can’t be completely virtual. We still generally recommend organisations build an integration layer (something along the lines of Data Vault), but in all honesty, you could bypass that if you’re not integrating many source systems and your organisation is unlikely to change its core systems or ways of working.

The data warehouse

Let’s briefly introduce a typical data warehouse architecture.

[Diagram: a typical data warehouse architecture, with a data lake feeding staging, integration, and presentation layers]

A typical warehouse consists of three layers: staging, integration, and presentation. All data for the warehouse is sourced from our data lake and made available for consumption within the presentation layer.

Now let’s overlay the above with the appropriate AWS services.

[Diagram: the same architecture overlaid with AWS services: Amazon S3 as the data lake, Redshift Spectrum for staging, Redshift tables for integration, and Redshift for presentation]

Our data lake is, of course, Amazon S3 (no surprises there). Our staging layer isn’t actually physicalised within Redshift; it’s simply Redshift Spectrum querying our Amazon S3 data lake. The integration layer (if required) consists of tables within Redshift, and finally we have our presentation layer. Up until this point, the presentation layer has always been physicalised (think tables)… but now we can virtualise it without compromising on performance, thanks to materialized views.
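For context, a staging layer like this is just an external schema that points Redshift Spectrum at the data lake. Here’s a minimal sketch; the schema name, Glue database, and IAM role ARN are placeholders rather than values from any real environment:

-- Expose the S3 data lake (catalogued in AWS Glue) to Redshift as an external schema.
-- Spectrum queries against these tables read directly from S3; nothing is loaded.
CREATE EXTERNAL SCHEMA staging
FROM DATA CATALOG
DATABASE 'datalake'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;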

The solution

I created a simple database within Redshift to learn about and understand materialized views. I converted MySQL’s Sakila database to Redshift (found here on GitHub) for use in this solution.

The solution consists of one fact table, rental, and four dimension tables: customer, film, staff, and store. For those visual learners out there, here’s a crummy ER diagram :)

[ER diagram: fact_rental at the centre, joined to dim_customer, dim_film, dim_staff, and dim_store]

This is the point where we begin to utilise materialized views. Instead of using tables for our dimensions and facts, we create five materialized views. Below you’ll find an example of fact_rental and dim_customer.

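Here’s a minimal sketch of what those two definitions might look like; the columns, joins, distribution keys, and sort keys are illustrative assumptions based on the Sakila schema, not the exact DDL:

-- fact_rental: one row per rental, distributed on customer_id for co-located joins
-- with dim_customer, sorted on rental_date for range-restricted queries.
CREATE MATERIALIZED VIEW sakila.fact_rental
DISTKEY (customer_id)
SORTKEY (rental_date)
AS
SELECT r.rental_id,
       r.rental_date,
       r.return_date,
       r.customer_id,
       i.film_id,
       r.staff_id,
       i.store_id
FROM sakila.rental r
JOIN sakila.inventory i ON i.inventory_id = r.inventory_id;

-- dim_customer: one row per customer, distributed on the same key as the fact.
CREATE MATERIALIZED VIEW sakila.dim_customer
DISTKEY (customer_id)
SORTKEY (customer_id)
AS
SELECT c.customer_id,
       c.first_name,
       c.last_name,
       c.email,
       a.address,
       ci.city,
       co.country,
       c.active
FROM sakila.customer c
JOIN sakila.address a ON a.address_id = c.address_id
JOIN sakila.city ci ON ci.city_id = a.city_id
JOIN sakila.country co ON co.country_id = ci.country_id;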

As you can see, materialized views act much like tables: they can be distributed and sorted to ensure performant joins and queries.

To refresh the data within a materialized view, you simply run REFRESH MATERIALIZED VIEW sakila.fact_rental, and Redshift will perform either an incremental refresh or a full refresh depending on the SQL constructs used within the view. This choice is handled automatically by Redshift.

In many cases, Amazon Redshift can perform an incremental refresh. In an incremental refresh, Amazon Redshift quickly identifies the changes to the data in the base tables since the last refresh and updates the data in the materialized view.

~ REFRESH MATERIALIZED VIEW

Refreshing the data within the views should be built into your data processing pipeline. Once the previous layers have been populated (i.e., new data has been ingested), a simple refresh call updates the presentation layer. No ETL processes, no data pipelines.
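As a sketch, assuming the five views are named dim_customer, dim_film, dim_staff, dim_store, and fact_rental, the post-ingest step of the pipeline would simply run:

-- Refresh the presentation layer once new data has landed in the base tables.
-- Redshift decides per view whether an incremental or a full refresh is possible.
REFRESH MATERIALIZED VIEW sakila.dim_customer;
REFRESH MATERIALIZED VIEW sakila.dim_film;
REFRESH MATERIALIZED VIEW sakila.dim_staff;
REFRESH MATERIALIZED VIEW sakila.dim_store;
REFRESH MATERIALIZED VIEW sakila.fact_rental;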

The final step is to create one denormalized view for easier consumption (this is where distribution plays an important role; see Amazon Redshift Performance Standards for Data Vault for tips).

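A sketch of what that denormalized view might look like, assuming the materialized view names above (the view name and column selection are illustrative):

-- One wide view for consumers: the fact joined back to its dimensions.
-- Because the joins hit pre-computed materialized views, this stays performant.
CREATE VIEW sakila.vw_rental AS
SELECT f.rental_id,
       f.rental_date,
       f.return_date,
       c.first_name AS customer_first_name,
       c.last_name  AS customer_last_name,
       fl.title     AS film_title,
       fl.rating    AS film_rating,
       s.first_name AS staff_first_name,
       s.last_name  AS staff_last_name,
       st.store_id
FROM sakila.fact_rental f
JOIN sakila.dim_customer c ON c.customer_id = f.customer_id
JOIN sakila.dim_film fl ON fl.film_id = f.film_id
JOIN sakila.dim_staff s ON s.staff_id = f.staff_id
JOIN sakila.dim_store st ON st.store_id = f.store_id;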

Final thoughts

Firstly, please don’t hang me for the not-so-realistic dimensions and facts. Yes, they’re not perfect; yes, there are no point-in-time joins, etc. :) It’s just an example.

Secondly, Redshift’s implementation of materialized views is actually quite mature. There’s no auto-refresh of data yet (which would be nice to see), but the fact that it allows joins between tables already puts it head and shoulders above its main competitors (i.e., Snowflake & BigQuery).

Thirdly, the use of Redshift Spectrum to create a virtual staging layer and materialized views to create a virtual presentation layer reduces data engineering effort and increases data delivery velocity. That’s a win-win in anyone's books.
