What questions to ask when designing a Big Data Solution on AWS?
Preface
I tried to include charts and visualizations wherever possible because they are concise and effective at communicating broad concepts.
Essential concepts are formatted in question and chart/explanation style to keep things tidy.
**Note: all materials are from the AWS Data Analytics Fundamentals training, to which all credit goes.**
How to transform raw data into decisions?
What challenges do we face when designing a data analysis solution?
What type of data is out there?
What type of data to collect?
How do we store data?
What advantages does S3 data storage bring?
Note: IAM makes S3’s centralized architecture possible.
What is a Data Lake?
- A centralized repository (single source of truth) for data of any kind, at any scale, from various sources (streams or bulk uploads).
- Convenient for quick analysis, data processing, and AI modeling/prediction.
How to build a data lake?
Can analysts run analysis without moving the data outside of S3?
How is Data Warehouse different from Data Lake?
How to process or analyze large amounts of data if we use a Data Lake solution?
One method is to use S3DistCp to move data from S3 into EMR's HDFS; from there, use Hive or Spark for analysis or processing, then write the results back to S3 for persistence. Another method is to use EMRFS, which stores persistent data directly in S3 with consistent view and data encryption.
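As a concrete sketch, the S3DistCp copy described above is typically submitted as an EMR step. The snippet below only builds the step definition that boto3's `add_job_flow_steps` expects; the bucket name, HDFS path, and cluster ID are hypothetical placeholders.

```python
# Sketch of an EMR step that runs S3DistCp to copy raw data from S3 into HDFS.
# Bucket name and HDFS path below are invented placeholders.

def s3_dist_cp_step(src: str, dest: str) -> dict:
    """Build the step definition expected by emr.add_job_flow_steps."""
    return {
        "Name": "Copy raw data from S3 to HDFS",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # runs the named command on the cluster
            "Args": ["s3-dist-cp", "--src", src, "--dest", dest],
        },
    }

step = s3_dist_cp_step("s3://my-raw-bucket/logs/", "hdfs:///data/logs/")

# Submitting it requires credentials and a running cluster (not done here):
# import boto3
# boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXX", Steps=[step])
```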
What are two main types of data processing jobs?
Note: Airflow, for example, can handle most batch processing, but it is difficult for Airflow to handle stream processing or time-sensitive data.
How is batch data processing different from stream data processing?
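The core contrast can be sketched in a few lines of plain Python: a batch job sees the whole dataset before it starts, while a stream processor updates state one record at a time and has a usable answer after every record.

```python
records = [3, 1, 4, 1, 5, 9]  # invented sample values

# Batch: the full dataset is available before processing starts,
# and a single result is produced at the end.
batch_total = sum(records)

# Stream: records arrive one at a time; state is updated incrementally
# and a partial result exists after every record, not just at the end.
running_total = 0
partial_results = []
for r in records:
    running_total += r
    partial_results.append(running_total)

# Both approaches agree once the stream has seen everything.
assert batch_total == partial_results[-1] == 23
```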
How to design batch data solution using EMR and Hadoop?
- Use EMR Notebooks (JupyterHub) for serverless development (analytical queries and visualizations).
- Ganglia for monitoring.
- Run processing steps using Spark, Hive, HBase, Presto or Flink.
- Connect to storage such as S3 or DynamoDB. Or, better yet, use EMRFS for persistence when running distributed jobs.
What could be the components when designing a basic batch analytics system?
There could be many variations, but here are two classic ones.
What are some common use cases for batch processing?
- log analytics from web/mobile (EMR)
- search and query across data stores (Glue Data Catalog + Athena/EMR/Redshift Spectrum)
- predictive analytics (Spark batch on EMR)
- queries against a Data Lake (S3 + Glue Data Catalog)
How to handle data at high velocity (higher than a batch job can manage)?
What are benefits of stream processing?
What are four major Kinesis services and what each is used for?
- Kinesis Data Firehose (data ingestion; the easiest way to capture, transform, and load data into data stores)
- Kinesis Data Streams (data processing, then ingestion)
- Kinesis Video Streams (ingestion, processing, and analysis, e.g. image recognition or other ML analysis)
- Kinesis Data Analytics (uses SQL or Java to process data; think of it as a data-processing service for aggregation, filtering, and other processing).
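Kinesis Data Analytics usually expresses such aggregations as SQL over time windows. The tumbling-window sum it performs can be sketched in plain Python (the timestamps, values, and window size below are invented):

```python
from collections import defaultdict

# Each event: (epoch_second, value). All sample data is invented.
events = [(0, 10), (3, 20), (7, 5), (12, 40), (14, 1)]

def tumbling_window_sum(events, window_seconds=10):
    """Group events into fixed, non-overlapping windows and sum each window."""
    windows = defaultdict(int)
    for ts, value in events:
        window_start = (ts // window_seconds) * window_seconds
        windows[window_start] += value
    return dict(windows)

print(tumbling_window_sum(events))  # {0: 35, 10: 41}
```

In a real Kinesis Data Analytics application the same idea would be a `GROUP BY` over a tumbling window in SQL; the sketch just makes the windowing arithmetic explicit.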
What does a streaming job architecture look like?
How about a batch job?
Combined, we have:
Where to store data for each variety?
What are the issues with Flat-file data?
- can’t prevent duplication
- certain fields can lead to ambiguity
- difficult to represent missing data
- can’t communicate relationships well
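All four problems are easy to reproduce with a few rows of CSV (the customer data below is invented):

```python
import csv
import io

# A flat file of orders with embedded customer details: duplication,
# missing data, and ambiguity all appear in just three rows.
flat = """order_id,customer,phone,total
1,Ann Lee,555-0100,30
2,Ann Lee,555-0100,45
3,Ann Lee,,10
"""
rows = list(csv.DictReader(io.StringIO(flat)))

# Duplication: the customer's phone is repeated on every order, so a
# phone-number change must be applied to many rows consistently.
assert rows[0]["phone"] == rows[1]["phone"] == "555-0100"

# Missing data: an empty field is just "", with no way to say whether
# the phone is unknown, not applicable, or intentionally blank.
assert rows[2]["phone"] == ""

# Ambiguity/relationships: is "Ann Lee" on row 3 the same person as on
# rows 1-2? Without a key and a separate customers table, the file
# cannot express that relationship.
```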
What is the difference between OLTP and OLAP?
How to choose different AWS storage option for each variety of data?
- OLTP: RDS
- OLAP: Redshift
- Key-value/document: DynamoDB (has supported 20 million requests per second at peak), e.g. a gaming application for key-value data, log files for a document database.
** DynamoDB is not ACID-compliant by default, though it now offers ACID transactions through its transactions API.
- Graph: Neptune, e.g. a social subscription service
What are the pros and cons of relational and non-relational databases?
What does data life cycle look like? How does data integrity play a role in each?
What’s the difference between ACID and BASE?
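The "A" in ACID (atomicity) is the easiest property to show in code: every step of a transfer commits, or none does. This is a toy in-memory sketch, not how a real database implements transactions, and the accounts and amounts are invented.

```python
# Atomicity sketch: a transfer either fully applies or fully rolls back.
accounts = {"alice": 100, "bob": 50}

def transfer(src: str, dst: str, amount: int) -> None:
    snapshot = dict(accounts)          # naive "begin transaction"
    try:
        accounts[src] -= amount
        if accounts[src] < 0:
            raise ValueError("insufficient funds")
        accounts[dst] += amount        # falling through = "commit"
    except ValueError:
        accounts.clear()
        accounts.update(snapshot)      # rollback: restore the snapshot

transfer("alice", "bob", 30)           # succeeds
transfer("alice", "bob", 500)          # fails and rolls back completely
print(accounts)  # {'alice': 70, 'bob': 80}
```

A BASE-style system would instead accept writes optimistically and reconcile replicas later (eventual consistency), trading the strict all-or-nothing guarantee for availability.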
What does an ETL solution look like using AWS services?
- Data Source (S3 for flat files, RDS for transactional data, Redshift for analytical data, DynamoDB for non-relational data)
- Collection (Kinesis for streams, EMR/Glue for batch). EMR is better suited to customized jobs, while Glue has a faster development cycle and is serverless.
- Data Lake (S3)
- Operational Analytics (Elasticsearch, Kibana)
- Data Warehouse (Redshift)
- Analysis (QuickSight for visualization and Athena for query)
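The flow above boils down to extract, transform, load. A minimal local sketch of that shape (the raw log lines, field names, and aggregation are all invented for illustration):

```python
# Minimal ETL sketch: extract raw log lines, transform them into
# structured records, and load an aggregate into an in-memory
# "warehouse" table. All sample data is invented.

raw_lines = [
    "2021-01-01,page_view,/home",
    "2021-01-01,page_view,/pricing",
    "2021-01-02,signup,/pricing",
]

# Extract: parse each line into a structured record.
records = [
    dict(zip(("date", "event", "path"), line.split(",")))
    for line in raw_lines
]

# Transform: keep only page views, then aggregate views per path.
warehouse = {}
for r in records:
    if r["event"] == "page_view":
        warehouse[r["path"]] = warehouse.get(r["path"], 0) + 1

# Load/analyze: the aggregated table is what an analyst would query
# (in AWS terms, think Redshift behind QuickSight or Athena).
print(warehouse)  # {'/home': 1, '/pricing': 1}
```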
What are different types of analytics?
What AWS service to use for batch and interactive analytics?
How about streaming?
Putting all pieces together
Now, the topic everyone cares about: to build a successful big data solution, which jobs or roles are necessary or important for an enterprise?
- Data Platform Engineer (DBA 2.0)
- Data Pipeline Engineer (ETL)
- Data Architect (design architecture, server/network administrator)
- Data Analyst
- Data Scientist (predictive analytics)
Let’s put all the AWS services we mentioned into one chart, in terms of processing speed and data temperature.