Analytics Challenges — Modern BI Architectures (Part 4)

EDW-only options

Cengiz Kayay (CK)
4 min read · Jan 11, 2020

In the previous part of this series, I analyzed OLAP-on-Hadoop engines to see whether they are capable of meeting the requirements of interactive BI.

In this part, I will explore the EDW options (AWS Redshift and Snowflake) to see if they alone could meet the requirements of a modern BI platform, which:

  • can ingest JSON, XML, CSV, and TSV data and store/process it at PB scale within several hours
  • can keep data where it belongs (e.g., EU-based customer data stored in an EU data center), which matters for cloud deployments
  • can support interactive BI queries with sub-second response times under high concurrency
  • is widely adopted
  • is portable
  • is scalable
  • can cope with fast data
  • and can do all of this at the lowest possible cost!

Snowflake

Snowflake is a cloud data warehouse built on top of AWS and delivered as a SaaS offering. Snowflake differs from traditional data warehouses as follows:

  • Storage and processing can be scaled independently, which reduces cost and meets different workload requirements. Snowflake’s multi-cluster, shared data architecture makes it possible to allocate multiple independent, isolated clusters for processing while sharing the same data (a sketch follows this list).
  • Traditional data warehouses define the distribution key at table creation time, whereas Snowflake chooses the distribution strategy dynamically during join processing (broadcast join, hash-partitioned join).
  • Data retention: similar to Redshift’s automatic backups to S3, Snowflake’s Time Travel feature enables you to revisit data at any point in time over the last ninety days, so you can rapidly recover from corrupted data.
  • Automatic tuning: Snowflake tunes the system to improve query run times as you use it. It can automatically scale your warehouses to match your query volume, resizing them to be larger or smaller based on active demand from your analysts.
  • It can handle all common data formats, such as XML, CSV, TSV, JSON, ORC, and Parquet.
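To make these points concrete, here is a minimal sketch in Python using the snowflake-connector-python package. The account, credentials, warehouse names, and the EVENTS table (with a VARIANT column named RAW) are all illustrative assumptions, not anything prescribed by Snowflake:

    import snowflake.connector  # pip install snowflake-connector-python

    # Hypothetical account and credentials; replace with your own.
    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...",
        database="ANALYTICS", schema="PUBLIC")
    cur = conn.cursor()

    # Two isolated warehouses over the same shared data: BI dashboards and
    # ETL each get their own compute, and each suspends itself when idle.
    cur.execute("CREATE WAREHOUSE IF NOT EXISTS BI_WH WAREHOUSE_SIZE = 'SMALL' "
                "AUTO_SUSPEND = 60 AUTO_RESUME = TRUE")
    cur.execute("CREATE WAREHOUSE IF NOT EXISTS ETL_WH WAREHOUSE_SIZE = 'XLARGE' "
                "AUTO_SUSPEND = 60 AUTO_RESUME = TRUE")

    # Query semi-structured JSON in a VARIANT column without pre-flattening it.
    cur.execute("USE WAREHOUSE BI_WH")
    cur.execute("SELECT raw:customer.country::string, COUNT(*) "
                "FROM events GROUP BY 1")

    # Time Travel: read the table as it looked one hour (3600 seconds) ago.
    cur.execute("SELECT COUNT(*) FROM events AT(OFFSET => -3600)")
    print(cur.fetchone())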

Redshift

Redshift is an MPP database that uses a leader node to coordinate queries and a PostgreSQL-based kernel on the compute nodes to run them. Redshift has a shared-nothing architecture in which data is distributed across multiple nodes, and all data access goes through the leader node.

Redshift Spectrum can query data that lives directly on S3, using an additional fleet of nodes called Spectrum nodes.
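As a rough illustration, the snippet below registers S3-resident Parquet files as an external table and queries them through Spectrum. It uses psycopg2 (Redshift speaks the PostgreSQL wire protocol); the endpoint, IAM role ARN, bucket, and table names are hypothetical:

    import psycopg2  # pip install psycopg2-binary

    # Hypothetical cluster endpoint and credentials.
    conn = psycopg2.connect(
        host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="admin", password="...")
    conn.autocommit = True  # external DDL cannot run inside a transaction
    cur = conn.cursor()

    # External schema backed by the Glue Data Catalog; the IAM role must let
    # Redshift read the bucket (the ARN below is a placeholder).
    cur.execute("""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
        FROM DATA CATALOG DATABASE 'spectrum_db'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
        CREATE EXTERNAL DATABASE IF NOT EXISTS
    """)

    # Expose Parquet files on S3 as a table, with no load into the cluster.
    cur.execute("""
        CREATE EXTERNAL TABLE spectrum.page_views (
            user_id BIGINT, url VARCHAR(2048), viewed_at TIMESTAMP)
        STORED AS PARQUET
        LOCATION 's3://my-data-lake/page_views/'
    """)

    # Spectrum nodes scan S3; the cluster only aggregates the result.
    cur.execute("SELECT COUNT(*) FROM spectrum.page_views")
    print(cur.fetchone())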

Redshift is a scalable alternative to RDS/PostgreSQL for handling star schemas, offers an alternative to Hadoop for storing large data sets, and can even be positioned as an active data archive.

Redshift does not separate storage and compute. If you need more storage, you need to add nodes, which means you are essentially paying for more compute power as well.

Massively parallel processing (MPP) data warehouses like Amazon Redshift scale horizontally by adding compute nodes to increase compute, memory, and storage capacity. The cluster spreads data across all of the compute nodes, and the distribution style determines the method Amazon Redshift uses to distribute the data. When a user executes SQL queries, the cluster spreads the execution across all compute nodes. Where possible, the query optimizer operates on data local to a compute node and minimizes the amount of data that passes over the network between nodes.
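The distribution and sort behavior is declared in the table DDL. A minimal sketch with illustrative table and column names (the DISTKEY choice assumes most joins are on customer_id):

    import psycopg2

    # Hypothetical endpoint and credentials, as in the Spectrum example.
    conn = psycopg2.connect(
        host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="admin", password="...")
    cur = conn.cursor()

    # DISTKEY co-locates rows with the same customer_id on one node, so joins
    # on that key avoid network shuffles; SORTKEY lets range-restricted scans
    # on sale_date skip disk blocks entirely.
    cur.execute("""
        CREATE TABLE sales (
            sale_id     BIGINT,
            customer_id BIGINT,
            sale_date   DATE,
            amount      DECIMAL(12,2))
        DISTSTYLE KEY
        DISTKEY (customer_id)
        SORTKEY (sale_date)
    """)

    # Small dimension tables are often DISTSTYLE ALL: a full copy sits on
    # every node, so any join against them stays local.
    cur.execute("""
        CREATE TABLE dim_customer (
            customer_id BIGINT,
            country     VARCHAR(2))
        DISTSTYLE ALL
    """)
    conn.commit()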

With Redshift, you are required to VACUUM and ANALYZE tables regularly; Snowflake manages all of this out of the box. With very big tables, this can be a huge headache in Redshift.
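For reference, the maintenance itself is just two statements; the pain is operational: scheduling them, and the hours they can take on very large tables. The table name is illustrative:

    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="admin", password="...")
    conn.autocommit = True  # VACUUM cannot run inside a transaction block
    cur = conn.cursor()

    # Reclaim space from deleted rows and restore the sort order, then
    # refresh the statistics the query planner relies on.
    cur.execute("VACUUM FULL sales")
    cur.execute("ANALYZE sales")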

Let’s evaluate these cloud-based data warehouses against the criteria set earlier for a modern BI platform:

  • concurrency — Snowflake can scale the cluster based on peak concurrent load. Redshift suggests 15 concurrent users; above that, performance deteriorates considerably, although auto-scaling based on concurrency is also possible. Thus, both options can handle high concurrency.
  • high performance (sub-second queries) — For Snowflake, complex queries take between 10–60 seconds. Redshift has no indexing; as in any MPP database (or NoSQL database, for that matter), the choice of distribution key and sort key for a table drives performance. Both options can support interactive BI for non-complex queries.
  • low cost — Snowflake supports automatic pausing to avoid charges when no one is using the data warehouse, and charges for data storage plus the hours the virtual warehouses are in operation.
  • data structure flexibility (JSON, XML, CSV) — Snowflake supports these types natively. Redshift needs these source files to be converted and loaded into Redshift tables.
  • cope with fast data — Snowflake can ingest streaming data by automatically loading it from an S3 bucket with a refresh time of about one minute (see the sketch after this list). Redshift has similar support.
  • portability — Snowflake is portable across cloud platforms.
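The "fast data" point above maps to Snowflake's Snowpipe feature. A minimal sketch, assuming an EVENTS table with a single VARIANT column and an existing storage integration named s3_int with read access to the landing bucket (both are illustrative):

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...",
        database="ANALYTICS", schema="PUBLIC", warehouse="ETL_WH")
    cur = conn.cursor()

    # External stage pointing at the S3 landing area.
    cur.execute("CREATE STAGE IF NOT EXISTS landing "
                "URL = 's3://my-data-lake/landing/' "
                "STORAGE_INTEGRATION = s3_int")

    # Snowpipe: with AUTO_INGEST, S3 event notifications trigger the COPY,
    # so new files land in the table within roughly a minute of arriving.
    cur.execute("""
        CREATE PIPE IF NOT EXISTS events_pipe AUTO_INGEST = TRUE AS
        COPY INTO events
        FROM @landing
        FILE_FORMAT = (TYPE = 'JSON')
    """)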

Conclusion:

The outcome is that all of these products deliver more or less similar results, and none is the best choice directly out of the box for interactive BI use cases where sub-second response times are required under highly concurrent workloads.

The most important differences between warehouses are the qualitative differences caused by their design choices: some warehouses emphasize tunability, others ease of use. If both the Redshift-based and the Snowflake-based options remain in operation for the same duration, Snowflake will be much cheaper.

Both Hadoop with complementary components and EDW only platforms can meet the requirements for a modern BI solution.

EDW has wide industry adoption, and if the cloud is an option, it can be the main choice to consider. The data can be stored in object storage (the data lake), then loaded into the EDW for data integration and serving. For advanced analytics and fast ETL during data integration, the Databricks Unified Analytics Platform would be the most cost-effective solution.

A 'Hadoop only' option can be considered if on-premise deployment is currently the target and a move to the cloud may come later. Hadoop needs complementary components, such as OLAP-on-Hadoop in place of an EDW, and the right component should be selected based on consumption figures.
