What questions to ask when designing a Big Data Solution on AWS?
Preface
I tried to include charts and visualizations wherever possible because they are concise and effective at communicating broad concepts.
Essential concepts are formatted in question and chart/explanation style to keep things tidy.
**Note: all materials are from the AWS Data Analytics Fundamentals training, to which all credit goes.**
How to transform raw data into decisions?
What challenges do we face when designing a data analysis solution?
What type of data is out there?
What type of data to collect?
How do we store data?
What advantages does S3 data storage bring?
Note: IAM makes S3’s centralized architecture possible.
What is a Data Lake?
- A centralized repository (single source of truth) for data of any kind, at any scale, from various sources (streams or bulk uploads).
- Convenient for quick analysis, data processing, and AI modeling/prediction.
How to build a data lake?
Can analysts run analysis without moving the data outside of S3?
How is Data Warehouse different from Data Lake?
How to process or analyze large amounts of data if we use a Data Lake solution?
One method is to use S3DistCp to move data from S3 into EMR's HDFS; from there, use Hive or Spark for analysis or processing, then write the results back to S3 for persistence. Another method is to use EMRFS, which stores persistent data directly in S3 with consistent view and data encryption.
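As a concrete sketch, the S3DistCp copy described above is typically submitted as an EMR step. The snippet below only builds the step definition that boto3's `add_job_flow_steps` expects; the bucket name, HDFS path, and cluster ID are hypothetical placeholders.

```python
# Sketch of an EMR step that runs S3DistCp to copy raw data from S3 into HDFS.
# Bucket name and HDFS path below are invented placeholders.

def s3_dist_cp_step(src: str, dest: str) -> dict:
    """Build the step definition expected by emr.add_job_flow_steps."""
    return {
        "Name": "Copy raw data from S3 to HDFS",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # runs the named command on the cluster
            "Args": ["s3-dist-cp", "--src", src, "--dest", dest],
        },
    }

step = s3_dist_cp_step("s3://my-raw-bucket/logs/", "hdfs:///data/logs/")

# Submitting it requires credentials and a running cluster (not done here):
# import boto3
# boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXX", Steps=[step])
```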
What are two main types of data processing jobs?
Note: Airflow, for example, can handle most batch processing, but it is difficult for Airflow to handle stream processing or time-sensitive data.
How is batch data processing different from stream data processing?
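The core contrast can be sketched in a few lines of plain Python: a batch job sees the whole dataset before it starts, while a stream processor updates state one record at a time and has a usable answer after every record.

```python
records = [3, 1, 4, 1, 5, 9]  # invented sample values

# Batch: the full dataset is available before processing starts,
# and a single result is produced at the end.
batch_total = sum(records)

# Stream: records arrive one at a time; state is updated incrementally
# and a partial result exists after every record, not just at the end.
running_total = 0
partial_results = []
for r in records:
    running_total += r
    partial_results.append(running_total)

# Both approaches agree once the stream has seen everything.
assert batch_total == partial_results[-1] == 23
```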
How to design batch data solution using EMR and Hadoop?
- Use EMR Notebooks (JupyterHub) for serverless development (analytical queries and visualizations).
- Ganglia for monitoring.
- Run processing steps using Spark, Hive, HBase, Presto or Flink.
- Connect to storage such as S3 or DynamoDB. Or, better yet, use EMRFS for persistence when running distributed jobs.
What could be the components when designing a basic batch analytics system?
There could be many variations, but here are two classic ones.
What are some common use cases for batch processing?
- log analytics from web/mobile (EMR)
- search and query across data stores (Glue Data Catalog + Athena/EMR/Redshift Spectrum)
- predictive analytics (Spark batch on EMR)
- queries against a Data Lake (S3 + Glue Data Catalog)
How to handle data at high velocity (higher than a batch job can manage)?
What are benefits of stream processing?
What are four major Kinesis services and what each is used for?
- Kinesis Data Firehose (data ingestion; the easiest way to capture, transform, and load data into data stores)
- Kinesis Data Streams (data processing, then ingestion)
- Kinesis Video Streams (ingestion, processing, and analysis, e.g. image recognition or other ML analysis)
- Kinesis Data Analytics (uses SQL or Java to process data; think of it as a data-processing service for aggregation, filtering, and other processing).
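Kinesis Data Analytics usually expresses such aggregations as SQL over time windows. The tumbling-window sum it performs can be sketched in plain Python (the timestamps, values, and window size below are invented):

```python
from collections import defaultdict

# Each event: (epoch_second, value). All sample data is invented.
events = [(0, 10), (3, 20), (7, 5), (12, 40), (14, 1)]

def tumbling_window_sum(events, window_seconds=10):
    """Group events into fixed, non-overlapping windows and sum each window."""
    windows = defaultdict(int)
    for ts, value in events:
        window_start = (ts // window_seconds) * window_seconds
        windows[window_start] += value
    return dict(windows)

print(tumbling_window_sum(events))  # {0: 35, 10: 41}
```

In a real Kinesis Data Analytics application the same idea would be a `GROUP BY` over a tumbling window in SQL; the sketch just makes the windowing arithmetic explicit.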
What does a streaming job architecture look like?
How about a batch job?
Combined, we have:
Where to store data for each variety?
What are the issues with Flat-file data?
- can’t prevent duplication
- certain fields can lead to ambiguity
- difficult to represent missing data
- can’t communicate relationships well
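All four problems are easy to reproduce with a few rows of CSV (the customer data below is invented):

```python
import csv
import io

# A flat file of orders with embedded customer details: duplication,
# missing data, and ambiguity all appear in just three rows.
flat = """order_id,customer,phone,total
1,Ann Lee,555-0100,30
2,Ann Lee,555-0100,45
3,Ann Lee,,10
"""
rows = list(csv.DictReader(io.StringIO(flat)))

# Duplication: the customer's phone is repeated on every order, so a
# phone-number change must be applied to many rows consistently.
assert rows[0]["phone"] == rows[1]["phone"] == "555-0100"

# Missing data: an empty field is just "", with no way to say whether
# the phone is unknown, not applicable, or intentionally blank.
assert rows[2]["phone"] == ""

# Ambiguity/relationships: is "Ann Lee" on row 3 the same person as on
# rows 1-2? Without a key and a separate customers table, the file
# cannot express that relationship.
```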
What is the difference between OLTP and OLAP?
How to choose different AWS storage option for each variety of data?
- OLTP: RDS
- OLAP: Redshift
- Key-value/document: DynamoDB (has supported 20 million requests per second at peak), e.g. a gaming application for key-value data, log files for a document database.
** DynamoDB is not ACID-compliant by default, though it now offers ACID transactions through its transactions API.
- Graph: Neptune, e.g. a social subscription service
What are the pros and cons of relational and non-relational databases?
What does data life cycle look like? How does data integrity play a role in each?
What’s the difference between ACID and BASE?
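The "A" in ACID (atomicity) is the easiest property to show in code: every step of a transfer commits, or none does. This is a toy in-memory sketch, not how a real database implements transactions, and the accounts and amounts are invented.

```python
# Atomicity sketch: a transfer either fully applies or fully rolls back.
accounts = {"alice": 100, "bob": 50}

def transfer(src: str, dst: str, amount: int) -> None:
    snapshot = dict(accounts)          # naive "begin transaction"
    try:
        accounts[src] -= amount
        if accounts[src] < 0:
            raise ValueError("insufficient funds")
        accounts[dst] += amount        # falling through = "commit"
    except ValueError:
        accounts.clear()
        accounts.update(snapshot)      # rollback: restore the snapshot

transfer("alice", "bob", 30)           # succeeds
transfer("alice", "bob", 500)          # fails and rolls back completely
print(accounts)  # {'alice': 70, 'bob': 80}
```

A BASE-style system would instead accept writes optimistically and reconcile replicas later (eventual consistency), trading the strict all-or-nothing guarantee for availability.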
What does an ETL solution look like using AWS services?
- Data Source (S3 for flat files, RDS for transactional data, Redshift for analytical data, DynamoDB for non-relational data)
- Collection (Kinesis for streams, EMR/Glue for batch). EMR is better suited to customized jobs, while Glue has a faster development cycle and is serverless.
- Data Lake (S3)
- Operational Analytics (Elasticsearch, Kibana)
- Data Warehouse (Redshift)
- Analysis (QuickSight for visualization and Athena for query)
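The flow above boils down to extract, transform, load. A minimal local sketch of that shape (the raw log lines, field names, and aggregation are all invented for illustration):

```python
# Minimal ETL sketch: extract raw log lines, transform them into
# structured records, and load an aggregate into an in-memory
# "warehouse" table. All sample data is invented.

raw_lines = [
    "2021-01-01,page_view,/home",
    "2021-01-01,page_view,/pricing",
    "2021-01-02,signup,/pricing",
]

# Extract: parse each line into a structured record.
records = [
    dict(zip(("date", "event", "path"), line.split(",")))
    for line in raw_lines
]

# Transform: keep only page views, then aggregate views per path.
warehouse = {}
for r in records:
    if r["event"] == "page_view":
        warehouse[r["path"]] = warehouse.get(r["path"], 0) + 1

# Load/analyze: the aggregated table is what an analyst would query
# (in AWS terms, think Redshift behind QuickSight or Athena).
print(warehouse)  # {'/home': 1, '/pricing': 1}
```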
What are different types of analytics?
What AWS service to use for batch and interactive analytics?
How about streaming?
Putting all pieces together
Now, the topic everyone cares about: to build a successful big data solution, which jobs or roles are necessary or important for an enterprise?
- Data Platform Engineer (DBA 2.0)
- Data Pipeline Engineer (ETL)
- Data Architect (design architecture, server/network administrator)
- Data Analyst
- Data Scientist (predictive analytics)
Let’s put all the AWS services we mentioned into one chart, in terms of processing speed and data temperature.