Design Patterns for Cloud Analytics

Farid Gurbanov
Data Analytics
Sep 1, 2020 · 13 min read

1. Introduction

With the growing number of new technologies coming onto the market every year, it becomes very difficult for engineers and their leadership to choose the right combination of elements to get the “right solution” in place. In this article, I provide architectural patterns for a cloud-centric analytics platform, their pros and cons, and when each should be used.

Let me start with the definition of the “right solution”. We’ll use a widely accepted set of evaluation criteria called “Architectural Concerns”. There are seven of them, and they create the focus areas when building or evaluating a solution. These concerns come to the digital world from a much wider and older engineering field.

1.1. Evaluation Criteria / Architectural Concerns

  • Concern 1: Availability — what’s the desired uptime depending on the criticality and the use case.
  • Concern 2: Performance — how quickly the system responds to user activities or events under different workloads.
  • Concern 3: Reliability — how reliable the system needs to be for each type of failure, e.g. a broken disk, a node going down, a whole data centre going down.
  • Concern 4: Recoverability — how quickly, and by what means, the system recovers from failures. Some recoveries are automated, as in HDFS or S3; others, such as node failure, need to be considered in advance.
  • Concern 5: Cost — how much money are we willing to spend to bring the solution up (Infra, Development) and maintain it later (Operations).
  • Concern 6: Scalability — how scalable the solution needs to be, i.e. peak hour traffic, changing trends, growth over the next few years.
  • Concern 7: Manageability — how to ensure compliance, privacy and security requirements are met.

Now that we’ve defined the evaluation criteria, let’s look at three common scenarios: Entry Level, Enterprise Non-Real-Time and Enterprise Real-Time.

  • Entry Level — built for a small organization (or organizational unit) with less than 1 Bn. records and annual growth of up to 20%
  • Enterprise Non-Real Time — used for consolidation of several regionally distributed Entry Level data sets and/or 1–100 Bn. records with annual growth of up to 20%
  • Enterprise Real-Time — used for real-time analytics and/or consolidation of several regionally distributed Entry Level data sets with more than 100 Bn. records with high growth factors.

Now let’s define architectural patterns for each of the solutions and provide analysis based on architectural concerns.

2. Entry-Level Solution

For small use cases (< 1 Bn. records) most of the transformations and dimensional storage can be kept within the tool itself. Modern BI solutions (Qlik, Tableau) come with an in-memory storage capability directly linked to the self-discovery and dashboarding UI. However, they create a heavy query load on source transactional databases when dynamically refreshing the dimensional models, which is why it’s highly recommended to create a CDC copy of the original relational tables rather than link directly into the transactional DBs. That’s also advisable from a security perspective, based on the decoupling principle.

2.1. Entry Level Conceptual Architecture

Cloud Analytics — Entry Level Solution Conceptual Architecture

If we look at the conceptual architecture above we can note the following core capabilities that we need to deliver:

  • Stage the data for Analytics (full or partial relational data);
  • Store transitional data (Consolidation, curation, enrichment);
  • Process extraction, transformation and loading of data from/to every storage layer;
  • Provide the Self-discovery, Dashboards and Data Wrangling UI;
  • Efficiently serve the data to the Self-discovery, Dashboards and Data Wrangling UI.

I’ve specifically outlined the core capabilities above so that I can take them through all solution types and analyse the issues against the architectural concerns listed earlier.

2.2. Technology Architecture

First, let’s fit the selected technology stack into the conceptual model to get a better feel for the solution. For this scenario, my source system is SAP ERP, I’m using AWS as the cloud provider and Tableau as the BI tool of choice.

Technology Stack: SAP ERP, AWS DMS, AWS RDS, Tableau

Assuming SAP ERP is using Oracle DB for its on-prem transactional storage, I’m selecting AWS DMS to CDC the data into AWS RDS Oracle (as a read-only copy). AWS RDS serves Tableau as the staging area capability. All the other capabilities (ETL, Consolidation, Serving) will be handled within the Tableau solution. That means we deploy Tableau and then use its UI to connect to AWS RDS Oracle and build the transformations within Tableau.
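Below is a rough boto3 sketch of how the DMS piece could be defined: an initial full load followed by ongoing CDC into the RDS target. It is a minimal illustration rather than the exact setup; the region, ARNs and the SAP schema name are hypothetical placeholders, and the DMS endpoints and replication instance are assumed to exist already.

```python
# Hedged sketch: a DMS task doing a full load and then ongoing CDC into RDS Oracle.
# ARNs, region and schema name are placeholders.
import json
import boto3

dms = boto3.client("dms", region_name="eu-west-1")

table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-sap-schema",
        "object-locator": {"schema-name": "SAPSR3", "table-name": "%"},  # hypothetical schema
        "rule-action": "include",
    }]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="sap-oracle-to-rds-cdc",
    SourceEndpointArn="arn:aws:dms:eu-west-1:111111111111:endpoint:SOURCE",  # on-prem Oracle
    TargetEndpointArn="arn:aws:dms:eu-west-1:111111111111:endpoint:TARGET",  # RDS Oracle copy
    ReplicationInstanceArn="arn:aws:dms:eu-west-1:111111111111:rep:INSTANCE",
    MigrationType="full-load-and-cdc",   # initial copy, then continuous change capture
    TableMappings=json.dumps(table_mappings),
)
```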

Please note that although this design seems straightforward, for cases with more than 100 million records you might need to create additional indexes and/or summarised data structures to increase the query speed on the Tableau side.
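If such tuning is needed, one option is to apply it directly on the RDS copy (DMS does not carry secondary indexes over, so they are created on the target). A minimal sketch with the python-oracledb driver follows; the credentials, table and column names are hypothetical.

```python
# Hedged sketch: add a covering index and a daily summary table on the RDS Oracle copy
# so that Tableau queries hit smaller, indexed structures. Names are hypothetical.
import oracledb

conn = oracledb.connect(user="analytics", password="***", dsn="rds-endpoint:1521/ORCL")
cur = conn.cursor()

# Index the columns Tableau filters on most often.
cur.execute("CREATE INDEX idx_sales_date_region ON sales (sale_date, region_id)")

# Pre-aggregated daily summary so dashboards do not scan raw transactions.
cur.execute("""
    CREATE TABLE sales_daily AS
    SELECT sale_date, region_id,
           SUM(amount) AS total_amount,
           COUNT(*)    AS tx_count
    FROM   sales
    GROUP  BY sale_date, region_id
""")
```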

Cloud Analytics — Entry Level Technology Architecture

Let’s evaluate this approach based on the selected set of criteria.

2.3. Evaluation of the solution

Concern 1: Availability: AWS RDS supports HA through Multi-AZ deployments; AWS DMS needs to be configured for HA at the virtualization level; Tableau HA and load-balancing configuration on AWS is well documented, see [1] and [2]. HA of the on-prem-to-AWS channel is usually provided by the Direct Connect provider or by setting up a separate backup VPN channel over the Internet.
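As a minimal sketch of the Multi-AZ setting, the staging instance could be provisioned with boto3 as below; the identifier, size and credentials are illustrative only.

```python
# Hedged sketch: provision the staging RDS Oracle instance with Multi-AZ enabled.
import boto3

rds = boto3.client("rds", region_name="eu-west-1")

rds.create_db_instance(
    DBInstanceIdentifier="analytics-staging-oracle",
    Engine="oracle-se2",
    LicenseModel="license-included",
    DBInstanceClass="db.m5.xlarge",
    AllocatedStorage=500,            # GiB, illustrative
    MultiAZ=True,                    # synchronous standby in a second Availability Zone
    MasterUsername="admin",
    MasterUserPassword="change-me",  # keep real credentials in Secrets Manager
)
```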

Concern 2: Performance: Performance depends on the selected compute instances. RDS was not designed to scale up or down automatically, so it needs to be sized against the expected workload. A Tableau installation supports scaling out with load balancing, so it is less of a concern.

Concern 3: Reliability: Reliability of the AWS infrastructure is provided by the Multi-AZ design. On-prem nodes need to be considered separately, depending on the virtualization technology in use.

Concern 4: Recoverability: All components are automatically recoverable by design. Configuration needs to be backed up separately.

Concern 5: Cost: AWS DMS is free to use when replicating to/from RDS. RDS itself is priced based on the node type and the storage. AWS Budgets can be used to limit the costs of auto-scaling functions. The personnel supporting DMS and AWS RDS are essentially the same people who support the existing Oracle DBs, so the only missing skills are AWS IaaS and Tableau.

Concern 6: Scalability: RDS is easy to scale up with minimal downtime (<1h). The other components scale up automatically, constrained only by budgets.
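Scaling the instance up is a single API call; a sketch with an illustrative identifier and target class:

```python
# Hedged sketch: move the staging RDS instance to a larger class ahead of peak load.
import boto3

rds = boto3.client("rds", region_name="eu-west-1")
rds.modify_db_instance(
    DBInstanceIdentifier="analytics-staging-oracle",
    DBInstanceClass="db.m5.2xlarge",  # larger class for the peak workload
    ApplyImmediately=True,            # otherwise applied in the next maintenance window
)
```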

Concern 7: Manageability: Easy. Use AWS Organizations to set company-wide policies, use AWS AD Connector to enforce authentication rules and SSO for Tableau, and use MS AD to centrally define authorization rules. CDC ensures that changes (CRUD operations) are reflected on the RDS instance almost instantly.
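A minimal sketch of the policy side: defining a service control policy in AWS Organizations and attaching it to an organizational unit. The policy content, names and IDs below are hypothetical.

```python
# Hedged sketch: create a simple SCP and attach it to an OU.
import json
import boto3

org = boto3.client("organizations")

scp = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Action": ["rds:DeleteDBInstance"],   # example guard-rail, illustrative only
        "Resource": "*",
    }],
}

policy = org.create_policy(
    Name="protect-analytics-staging",
    Description="Prevent accidental deletion of analytics databases",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(scp),
)

org.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-xxxx-analytics",             # hypothetical OU id
)
```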

In general, this is the quickest solution to deploy and the least costly in terms of infrastructure and people.

3. Enterprise Level Non-Real Time Solution

For non-real-time use cases (distributed data sets with up to 100 Bn. records and a moderate growth rate) I would use a different approach, replacing RDS with cheaper intermediary storage on S3 and introducing a proper data integration toolset. Let’s look at how the conceptual architecture has to change in response to the increased data volumes.

3.1. Conceptual Architecture

With larger data sets we have to introduce two new technical capabilities into our design. The first is an analytic database (e.g. Redshift, Snowflake) that is specifically designed to serve dimensional models and enable data wrangling and self-discovery over large data sets. Data in dimensional models is usually denormalized and, unlike in transactional DBs, is not compliant with 3NF or 5NF. That denormalization helps with fast data retrieval by reducing table joins.

Dimensional models are usually built on aggregated data rather than containing every detailed transaction from the source. Think of sales analysis over the following hierarchies: Products, Sales Channels, Customer Types, Locations and Time; these are the Dimensions, and the sales transactions are the Facts. In a dimensional model you analyze Facts over combinations of Dimensions, and it is usual to have facts as daily (or hourly) summaries at the lowest dimensional hierarchy levels. Dimensional concepts are explained in great detail in the well-known book by Bill Inmon, the father of data warehousing [3].
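To make the sales example concrete, here is a hedged sketch of what such a star schema could look like in an analytic database (Redshift is used below, matching the technology choice later in this section). The table and column names are hypothetical; Redshift speaks the Postgres wire protocol, so the DDL is issued through psycopg2.

```python
# Hedged sketch: a simple star schema with a daily-grain sales fact and two dimensions.
import psycopg2

conn = psycopg2.connect(host="redshift-endpoint", port=5439,
                        dbname="analytics", user="etl", password="***")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE dim_product (
        product_key  INT,
        product_name VARCHAR(200),
        category     VARCHAR(100)
    );
""")
cur.execute("""
    CREATE TABLE dim_customer_type (
        customer_type_key INT,
        customer_type     VARCHAR(100)
    );
""")
cur.execute("""
    CREATE TABLE fact_sales_daily (       -- daily summaries, not raw transactions
        sale_date         DATE,
        product_key       INT,
        customer_type_key INT,
        location_key      INT,
        sales_amount      DECIMAL(18,2),
        units_sold        INT
    )
    DISTKEY (product_key) SORTKEY (sale_date);
""")
conn.commit()
```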

The second capability is a proper data integration toolset that orchestrates all the data transformations. Centralizing ETL jobs in a specialized tool has many long-run advantages over spreading them across different systems. Technically you could build the transformations even with shell scripts running Python; however, every architectural concern listed above would be negatively impacted, so I’m ruling that option out.

Cloud Analytics — Enterprise Level Non-Real Time Solution Conceptual Architecture

3.2. Technology Architecture

Attunity (Qlik) has developed a very niche product that has gained popularity over recent years. It is a set of tools to essentially move (Attunity Replicate) [5] and transform (Attunity Compose) [4] the data. These tools stand out from the others thanks to an intuitive interface, simplicity and a wide range of ready-made connectors.

Technology Stack: SAP ERP, Qlik Attunity Replicate and Compose, AWS S3, AWS Redshift (or Snowflake), Tableau

As the analytic database, I’ve picked the native AWS Redshift. There are no major blockers to picking Snowflake, which is growing in popularity, but so far I haven’t seen a proper performance and cost comparison across a range of use cases, so I’ll keep it as an option here.

Cloud Analytics — Enterprise Level Non-Real Time Solution Technology Architecture

Replicate needs to be located close to the data sources. It collects database changelogs, packs them into small compressed files and sends them to S3 over an encrypted TLS channel. Alternatively, it can publish to a Kafka topic, but that would only be needed if we introduce real-time analysis capabilities.

Compose is essentially a visual pipeline designer. It can integrate with a range of source and target database systems. For our specific use case, I’m using an EMR cluster for Compose to get the data from S3.

Main tasks for Compose are to:

  • transform the data from a relational into a dimensional model for fast analytic queries (e.g. comparing 5-year monthly sales data across regions and customer types should take seconds, not minutes);
  • take only what’s needed and leave unnecessary details behind;
  • consolidate and enrich data sets;
  • create specialized data presentations (materialized views) for different user roles (the basic “need to know” design principle).

How it works: a data designer defines the transformations using the Compose interface, then Compose creates the target tables in Redshift and starts loading the data. Once the data has landed in Redshift, you can start enjoying Tableau.

An antipattern here would be to take the source data as-is and put it into Redshift for further processing in Tableau. That is a frequent mistake when building data lakes: it substantially increases Redshift costs and degrades response times in Tableau. Other than for very small use cases, I would strongly recommend preparing your data before feeding it into Redshift (or any other analytical database).
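To illustrate the principle (in this design the same logic would normally be defined in Compose), here is a hedged PySpark sketch of a pre-aggregation job on EMR that turns raw change files on S3 into a daily sales fact ready for a Redshift COPY. The bucket paths and column names are hypothetical.

```python
# Hedged sketch: aggregate raw CDC files into a daily sales fact before loading Redshift.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-sales-aggregate").getOrCreate()

raw = spark.read.parquet("s3://raw-cdc-bucket/sap/sales/")   # hypothetical path

daily = (raw
         .groupBy("sale_date", "product_id", "customer_type", "region")
         .agg(F.sum("amount").alias("sales_amount"),
              F.count(F.lit(1)).alias("tx_count")))

# Redshift then ingests this prefix with a single COPY command.
daily.write.mode("overwrite").parquet("s3://curated-bucket/fact_sales_daily/")
```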

Tableau setup is straightforward and similar to the previous one.

3.3. Evaluation of the solution

Concern 1: Availability: AWS S3 is highly redundant, designed to provide 99.999999999% durability and 99.99% availability of objects. EMR is HA by design. Attunity Replicate needs to be configured for HA at the virtualization level on the source-system side. Attunity Compose supports an HA configuration with primary and secondary node installations [4]. Tableau and the other components are similar to the Entry Level use case.

Concern 2: Performance: The performance of an analytics system depends on two key factors: 1) the right data structure and 2) the right infrastructure. We’ve covered the right data structure above. As for infrastructure, AWS Redshift is designed as a columnar database with an elastic resize feature [7], which is what you should be looking for in enterprise-level solutions.
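Elastic resize itself is a single API call; a minimal sketch with an illustrative cluster identifier and node count:

```python
# Hedged sketch: add nodes via elastic resize ahead of a heavy reporting period.
import boto3

redshift = boto3.client("redshift", region_name="eu-west-1")
redshift.resize_cluster(
    ClusterIdentifier="analytics-redshift",
    NumberOfNodes=6,    # target size, illustrative
    Classic=False,      # elastic resize (minutes) rather than classic resize (hours)
)
```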

Tableau performance notes covered above — no change here.

Concern 3: Reliability: All AWS services used here are managed services, so their reliability is provided out of the box by the Multi-AZ design. On-prem nodes need to be considered separately, depending on the virtualization technology in use.

Concern 4: Recoverability: Convenient recoverability is achieved by 1) building the AWS environment as code (CloudFormation) and keeping the templates in the backup region; 2) enabling S3 cross-region replication to survive the outage of a whole region; 3) regularly taking images of the Replicate and Compose hosts and keeping them in the backup region.
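As a sketch of point 2, cross-region replication is a single configuration call; the bucket names and IAM role ARN below are placeholders, and versioning must already be enabled on both buckets.

```python
# Hedged sketch: replicate the staging bucket to a backup region.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_replication(
    Bucket="analytics-staging-primary",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111111111111:role/s3-replication-role",
        "Rules": [{
            "ID": "replicate-all",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {},                                    # whole bucket
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": "arn:aws:s3:::analytics-staging-backup"},
        }],
    },
)
```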

Concern 5: Cost: With unmatched performance for analytics queries, Redshift (as well as Snowflake or any other dedicated analytics database) is more expensive than a simple RDS. I would expect Redshift to cost less if you can commit to a certain level of usage, which is usually easy to do for enterprise-level solutions. Qlik doesn’t publish prices, but you can get a feel for the price from AWS Marketplace [8]; a typical Replicate price is $1.866/hr.

Concern 6: Scalability: We need to consider user NFRs for Tableau and Redshift sizing. Data velocity and volume will be the input for EMR and Compose node sizing, and also for the connections to the on-prem infrastructure. S3 doesn’t need any capacity configuration.

Concern 7: Manageability: As we have centralized all our transformations in Compose, it becomes very simple to manage them. New data structures in Redshift can be made ready in hours.

The enterprise non-real-time pattern should be considered for a wide range of solutions. Given the agility to scale up or down quickly in the cloud, I would use this even for medium-sized analytics. Using dedicated tools for data integration pays back quickly once you start operationalising the solution and look to enable CI/CD for analytics.

4. Enterprise Level Real-Time Solution

This option should be considered when you either have a large number of source systems or are building real-time analytics. This pattern is at the centre of almost every real-time analytics solution and repeats what I described in one of my previous posts, Real-Time Security Data Lake [9].

Use cases could be log analytics for security/website/mobile app data, IoT event data, machinery logs and many others.

4.1. Conceptual Architecture

The conceptual architecture for the downstream remains largely the same; however, the upstream has changed to add a streaming capability for real-time data pipelines.

Real-Time Cloud Analytics — Conceptual Architecture

4.2. Technology Architecture

A real-time data pipeline introduces stringent requirements for response time at any scale. Even where vendors refer to near-real-time capability (e.g. in Compose), I would be hesitant to put it into this architecture. The Kinesis family (Firehose and Analytics), as well as Kafka, has been designed from the ground up for streaming use cases. Using the right tools in the relevant scenarios also uncovers additional capabilities that we can start using with little effort.

Technology Stack: AWS Kinesis Data Firehose, AWS Kinesis Data Analytics (managed Apache Flink), AWS Redshift, Tableau, and AWS Athena with Amazon S3 for ad-hoc queries.

Real-Time Cloud Analytics — Technology Architecture

Streaming data pipelines require a dedicated tool, and Apache Flink is one with growing popularity. AWS Kinesis Data Analytics is a managed Flink installation with tight integration into other AWS services, and it would be my preference for the streaming DI tool. Data gets curated, transformed and aggregated as needed in Kinesis Data Analytics and then pushed into Redshift as part of the streaming data ingestion.
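At the front of this pipeline, producers simply push events into the Firehose delivery stream; a hedged sketch with a hypothetical stream name and event shape:

```python
# Hedged sketch: send a single JSON event into the Kinesis Data Firehose delivery stream.
import json
import boto3

firehose = boto3.client("firehose", region_name="eu-west-1")

event = {"device_id": "sensor-042", "ts": "2020-09-01T12:00:00Z", "temp_c": 21.7}

firehose.put_record(
    DeliveryStreamName="iot-events-ingest",               # hypothetical stream
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```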

Kinesis Data Firehose can be further integrated with the AWS CloudFront CDN to provide data collection endpoints closer to the sources and to protect against a range of attacks from the Internet.

I’ve borrowed the image below from the AWS blog [10] to illustrate how an accelerated and more secure front door can be configured with the AWS CloudFront CDN.

Global Data Ingestion with Amazon CloudFront and Lambda@Edge

4.3. Evaluation of the solution

Concern 1: Availability: AWS Kinesis and CloudFront are HA by design.

Concern 2: Performance: All components are easy to scale up or down to any size with little or no reconfiguration.

Concern 3: Reliability: By design

Concern 4: Recoverability: Similar to non-real-time.

Concern 5: Cost: Operational costs would be slightly higher than for the non-real-time version, as maintaining or adding new streaming pipelines with Flink requires higher skills: Flink pipelines are coded in Java. Infrastructure costs are scalable within budget constraints.

Concern 6: Scalability: This is a massively scalable solution. The largest known use of a Flink-based streaming pipeline is Keystone (the Netflix real-time stream processing platform), handling 3 trillion events per day [11].

Concern 7: Manageability: All streaming components are native to AWS, so I would expect a small operational overhead to keep this running. New pipeline development requires Java skills, with no GUI. Debugging streaming applications is more difficult than debugging static ones.

Streaming use cases are growing fast with the growing number of connected devices. The approach described in this section enables both real-time analytics and automated event-based response systems that react to an event as it happens.

5. Summary

The Entry Level solution is easy to start with but limited in scalability: infrastructure and operational costs increase linearly as the data grows.

The Enterprise Non-Real-Time approach can be used for a wide range of cases, provided we have the agility to scale up or down quickly in the cloud. Using dedicated tools for data integration pays back quickly once you start operationalising the solution and look to enable CI/CD for analytics.

Real-Time Analytics is costly to maintain, as it requires skilled Java developers to build and support the data pipelines; however, this approach enables a new generation of event-based intelligent response systems.

If you have any questions or comments feel free to reach out via https://www.linkedin.com/in/fgurbanov/

6. References

[1] Scaling Tableau Server on AWS

[2] Tableau Server on AWS for healthcare

[3] Book: Building the Data Warehouse, Inmon, W. H.

[4] Qlik Attunity Compose v6.6 User Guide (pdf)

[5] Qlik Attunity Replicate v6.4 User Guide (pdf)

[6] SAP Datasheet (SAP ERP data structures and more)

[7] AWS Redshift Elastic resize

[8] Qlik (Attunity) Replicate hourly pricing on AWS Marketplace

[9] Real-Time Security Data Lake

[10] Global Data Ingestion with Amazon CloudFront and Lambda@Edge

[11] Keystone Routing Pipeline at Netflix (presented at Flink Forward San Francisco 2018). Video presentation here


I’m a solution architect and engineer specializing in cloud migration initiatives. My core focus is security, cost, performance, and time to market.