Data Analytics Lifecycle using AWS

Praveen Kasana
Published in Analytics Vidhya
12 min read · Sep 23, 2020

What makes up a data analytics pipeline? Confused by the endless number of data channels? Don’t worry !! This blog will try to explain it with ease and efficacy.

Data Ammunition :

“ Just as nearly all mechanical weapons require some form of ammunition to operate, organisations in today’s era need data to operate and forecast. ” — Praveen Kasana

Data is a strategic asset of every organisation. As data continues to grow, databases are becoming increasingly pivotal to understanding data and converting it into valuable insights. IT leaders and entrepreneurs need to look at the different flavours of data and, based on that, find ways to get more value from it. With the rapid growth of data — not just in volume and velocity but also in flavours, complexity, and interconnectedness — the needs of data analytics and its corresponding databases have changed.

Before we begin discussing the Data Analytics pipeline, it’s imperative to understand the common data categories and use cases.

Data Analytics Pipeline :

Before data can be analysed, it needs to be generated, collected, stored, and processed. You can think of this as an analytics pipeline that extracts data from source systems, processes the data, and then loads it into data stores where it can be analysed. Analytics pipelines are designed to handle large volumes of incoming data from heterogeneous sources such as databases, applications, and devices.

In general, a data analytics pipeline consists of six stages. We will go through each of these stages in detail and evaluate them accordingly.

Data Analytics Pipeline — Image — AWS

1. Generate :

Data is continuously generated by several sources such as IoT devices, web logs, social media feeds, transaction systems, and ERP systems.

Data Generation Process — Image AWS
  • IoT system : Devices and sensors around the world send messages continuously. Organisations see a growing need today to capture this data and derive intelligence from it. Just like a server, these devices generate logs. From a hardware perspective, the states of the onboard memory, the microcontroller, and any sensors are all described by logs. That data can tell you, for example, whether the system is functioning as expected.
  • Web Log : Logs generated from web servers. IIS, Apache, Tomcat, WebSphere, NGINX, and every other web engine can generate useful logs.
  • Social Media : Logs and feeds generated by social media platforms.
  • Transactional Data : Data such as e-commerce purchase transactions and financial transactions is typically stored in an RDBMS (relational database management system). An RDBMS solution is suitable for recording transactions, especially when a transaction may need to update multiple table rows.
  • NoSQL Data : A NoSQL database is suitable when the data is not well structured enough to fit into a defined schema or when the schema changes very often.
  • ERP : Logs are important for any software, and an ERP system is no exception.

2. Collect :

After data is generated, it needs to be collected somewhere. Web applications, mobile devices, and many software applications and services can generate staggering amounts of streaming data—sometimes terabytes per hour—that need to be collected.

Data Collection -Using Amazon DMS, S3, DataSync & Snowball : Image — AWS
Data Collection — Using Polling Application, Amazon Kinesis and Kafka Stream. Image — AWS

Data is collected using :

  • Amazon DMS (Database Migration Service) : You can use AWS Database Migration Service to consolidate multiple source databases into a single target database. This can be done for homogeneous and heterogeneous migrations, and you can use this feature with all supported database engines. The source database can be located on your own premises outside of AWS, it can run on an Amazon EC2 instance, or it can be an Amazon RDS database. For more info on AWS migration, please visit my blog on Database Migration using AWS.
  • Amazon S3 (Simple Storage Service) : One Amazon S3 source can collect data from a single S3 bucket. However, you can configure multiple S3 sources to collect from one S3 bucket. For example, you could use one S3 source to collect one particular data type and then configure another S3 source to collect another data type.
  • Amazon DataSync : AWS DataSync makes it simple and fast to move large amounts of data online between on-premises storage and Amazon S3, Amazon Elastic File System (Amazon EFS), or Amazon FSx for Windows File Server.
  • AWS Snowcone : You can use Snowcone to collect, process, and move data to AWS, either offline by shipping the device, or online with AWS DataSync (see the point above). AWS Snowcone is the smallest member of the AWS Snow Family of edge computing and data transfer devices. Snowcone is portable, rugged, and secure.
  • AWS Snowball Edge : Snowball Edge Storage Optimised devices provide both block storage and Amazon S3-compatible object storage, along with 40 vCPUs. They are well suited for local storage and large-scale data transfer.
  • Amazon Kinesis : Amazon Kinesis is used to collect streaming data. With Amazon Kinesis, you can ingest real-time data such as video, audio, application logs, website clickstreams, and IoT telemetry data for machine learning, analytics, and other applications. (A minimal ingestion sketch follows this list.)
  • Amazon Managed Streaming for Apache Kafka (Amazon MSK) : With Amazon MSK, you can use native Apache Kafka APIs to populate data lakes, stream changes to and from databases, and power machine learning and analytics applications.
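To make the streaming-ingestion step concrete, here is a minimal sketch (my own illustration, not taken verbatim from AWS documentation) that pushes a JSON event into a Kinesis data stream using boto3. The stream name "clickstream-demo" and the event fields are hypothetical placeholders.

```python
import json
import boto3

# Assumes AWS credentials/region are configured and that a Kinesis data
# stream named "clickstream-demo" (a placeholder name) already exists.
kinesis = boto3.client("kinesis")

event = {"user_id": "u-123", "page": "/checkout", "ts": "2020-09-23T10:15:00Z"}

# PutRecord sends a single record; the partition key decides shard placement.
response = kinesis.put_record(
    StreamName="clickstream-demo",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],
)
print(response["SequenceNumber"])
```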

3. Store :

After the collection process, it’s time to store the data. You can store your data in either a data lake or an analytical store like a data warehouse. AWS provides several services to store your data, but before going into each of these services, let’s understand the concepts of a data lake, a data warehouse, and a data mart.

Data Lake : A data lake is a centralised repository for all data, both structured and unstructured. In a data lake, the schema is not defined up front, enabling additional types of analytics like big data analytics, real-time analytics, and machine learning. Data lakes can handle the scale, agility, and flexibility required to combine different types of data and analytics approaches to gain deeper insights. They give organisations the flexibility to use the widest array of analytics and machine learning services, with easy access to all relevant data, without compromising on security or governance.

Data Warehouse : A data warehouse is a centralised repository of information coming from one or more data sources — or from your data lake — where data is transformed, cleansed, and deduplicated to fit into a predefined data model. It is primarily designed for data analytics, which involves reading large amounts of data to understand relationships and find trends in the data.

Data Mart : A data mart is a simple form of a data warehouse focused on a specific functional area or subject matter and contains copies of a subset of data in the data warehouse. For example, you can have specific data marts for each division in your organisation or segment data marts based on regions. You can build data marts from a large data warehouse, operational stores, or a hybrid of the two. Data marts are simple to design, build, and administer.

Data Store — using Amazon S3, RDS and Database on Amazon EC2: Image — AWS
  • Amazon S3 : Amazon Simple Storage Service (S3) is a highly scalable and performant object storage service for structured and unstructured data, and the storage service of choice for building a data lake. With Amazon S3, you can cost-effectively build and scale a data lake of any size in a secure environment. With a data lake built on Amazon S3, you can use native AWS services to run big data analytics, artificial intelligence (AI), machine learning (ML), and media data processing applications to gain insights from your unstructured data sets. (A short upload sketch follows this list.)
  • Amazon RDS : Amazon RDS is available with a variety of database engines: Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle, and Microsoft SQL Server. Data stored in RDS is secure, highly available, and durable.
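As a small illustration of landing raw data in an S3-based data lake (my own sketch; the bucket and key names are hypothetical placeholders), a file can be uploaded with boto3:

```python
import boto3

s3 = boto3.client("s3")

# Land a raw web-server log file under a date-partitioned prefix of the
# data lake bucket. "my-company-data-lake" is a placeholder bucket name
# and must already exist.
s3.upload_file(
    Filename="access.log",
    Bucket="my-company-data-lake",
    Key="raw/weblogs/year=2020/month=09/day=23/access.log",
)
```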

4. ETL (Extract Transform Load) Or Process Data :

This process extracts data from data sources, transforms the data, and stores it in a separate destination such as another database, a data lake, or an analytics service like a data warehouse (Amazon Redshift), where the data can be processed and analysed.

ETL is the process of pulling or extracting data from multiple sources, transforming the data to fit a defined target schema (schema-on-write), and loading the data into a destination data store. ETL is normally a continuous, ongoing process with a well-defined workflow that runs at specific times, such as nightly. Setting up and running ETL jobs can be a tedious task, and some ETL jobs may take hours to complete.

Similar to ETL, it is also important to understand ELT (Extract Load Transform). ELT is a variant of ETL where the extracted data is loaded into the target system before any transformations are made. The schema is defined when the data is read or used (schema-on-read). ELT typically works well when your target system is powerful enough to handle the transformations and when you want to explore the data in ways not consistent with a predefined format.
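To illustrate the difference in a toy way (my own sketch, not tied to any AWS service; file and column names are hypothetical), the snippet below contrasts an ETL-style flow, where data is reshaped to the target schema before it is written, with an ELT-style flow, where the raw file is loaded first and the schema is only applied when the data is read:

```python
import pandas as pd

# Assumes "orders_raw.csv" exists and the "curated/" and "lake/" directories
# are already created; to_parquet needs the pyarrow package.

# --- ETL (schema-on-write): transform first, then load the curated result.
raw = pd.read_csv("orders_raw.csv")
curated = (
    raw.rename(columns={"ord_ts": "order_timestamp", "amt": "amount_usd"})
       .astype({"amount_usd": "float64"})
)
curated.to_parquet("curated/orders.parquet")  # destination already holds the final schema

# --- ELT (schema-on-read): load the raw file as-is, transform only when reading.
pd.read_csv("orders_raw.csv").to_csv("lake/orders_raw.csv", index=False)   # load raw
exploratory = pd.read_csv("lake/orders_raw.csv", parse_dates=["ord_ts"])   # schema applied at read time
```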

ETL — Using EMR, Lambda and KCL: Image — AWS
ETL — Using Amazon Glue : Image — AWS

Here is a list of some of the AWS services that can be used for ETL:

  • Amazon EMR (Elastic MapReduce) : EMR is an AWS tool for big data processing and analysis. It uses big data frameworks like Apache Hadoop and Apache Spark. Amazon EMR can be used to quickly and cost-effectively perform data transformation (ETL) workloads such as sort, aggregate, and join on large datasets. We can build an ETL workflow that uses AWS Data Pipeline to schedule an Amazon EMR cluster to clean and process web server logs stored in an Amazon Simple Storage Service (Amazon S3) bucket.
  • AWS Lambda : Lambda lets you run your data pipeline in serverless mode. Serverless ETL is becoming the future for those who want a cost-effective solution and, at the same time, want to focus on the crux of the application without having to worry about the large infrastructure needed to power data pipelines.
  • Amazon Kinesis Client Library : This is one of the methods of developing consumer applications that process data from a Kinesis data stream. The Kinesis Client Library (KCL) is available in multiple languages such as Java, .NET, and Python.
  • AWS Glue : AWS Glue is a serverless, fully managed, cloud-optimised ETL service. You just need to point AWS Glue at your data stored in AWS; it discovers your data and stores the associated metadata (table schema and definitions) in the AWS Glue Data Catalog. Once catalogued, your data is searchable, queryable, and ready for ETL processing. (A minimal job-script sketch follows the image below.)
Amazon Glue — Steps for ETL : Image — AWS
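For a flavour of what a Glue job script looks like, here is a minimal PySpark sketch (my own illustration, runnable only inside the AWS Glue job environment where the awsglue libraries are provided; the database, table, and S3 path names are hypothetical placeholders):

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard Glue job boilerplate: resolve the job name and initialise the job.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table that a Glue crawler has already catalogued.
# "analytics_db" and "raw_weblogs" are placeholder names.
source = glue_context.create_dynamic_frame.from_catalog(
    database="analytics_db", table_name="raw_weblogs"
)

# Rename and cast a couple of columns to match the target schema.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("ip", "string", "client_ip", "string"),
        ("ts", "string", "request_time", "timestamp"),
    ],
)

# Write the curated result back to the data lake as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-company-data-lake/curated/weblogs/"},
    format="parquet",
)
job.commit()
```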

5. Analyse Data :

Now we have reached the stage where we are ready to unveil the real value of data. Let’s unlock what is hiding behind your data. A modern analytics pipeline can utilise a variety of tools to unlock the value hidden in the data. We know that one size does not fit all. Any analytics tool should be able to access and process any data from the same source — your data lake.

Data can be copied from your data lake into your data warehouse to fit a structured and normalised data model that takes advantage of a high-performance query engine. At the same time, some use cases require analysis of unstructured data in context with the normalised data in the data warehouse. Here, extending data warehouse queries to include data residing in both the data warehouse and the data lake, without the delay of data transformation and movement, is essential to timely insights.

Other big data analytics tools should be able to access the same data in the data lake. Below are the types of analysis required by data scientists or business users, depending on their use cases:

  • Interactive Analysis : Interactive analysis typically uses standard SQL query tools to access and analyse data. End users want fast results and the ability to modify queries quickly and rerun them.
  • Data Warehousing Analytics : Data warehousing provides the ability to run complex analytical queries against large volumes of data — petabytes — using a high-performance, optimised, scalable query engine.
  • Data Lake Analytics : A new breed of data warehouse is emerging that extends data warehouse queries to a data lake to process structured or unstructured data in the data warehouse and data lake and scale up to exabytes without moving data.
  • Big Data Analytics : Big data processing uses the Hadoop and Spark Frameworks to process vast amounts of data.
  • Operational Analytics: Operational analytics focuses on improving existing operations and uses data such as application monitoring, logs, and clickstream data.
  • Business Intelligence (BI) : BI software is an easy-to-use application that retrieves, analyses, transforms, and reports data for business decision-making. BI tools generally read data that is stored in an analytics service like a data warehouse or big data analytics system. BI tools create reports, dashboards, and visualisations and enable users to dive deeper into specific data on an ad-hoc basis.

Based on the above, organisations are applying machine learning processes to automate tasks, provide customised services to end users, and increase the efficiency of operations by analysing their data. Generally, you first need to collect and prepare your training data to discover which elements of your data set are important. This is a vast topic in itself; my intention here is just to apprise you of how machine learning relates to the analysis of data.

Data Analysis using Amazon Services. :Image — AWS

There are a few AWS services that are used to analyse data. I will go through each of them at a high level just to give you an overview:

  • Amazon Athena : Amazon Athena is an interactive query service that makes it easy to analyse data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. Because Athena is a serverless query service, an analyst doesn’t need to manage any underlying compute infrastructure to use it. A data analyst accesses Athena through either the AWS Management Console or an application programming interface (API). Here are some features of Amazon Athena (a small query sketch appears after this list):

1. It unites batch and streaming data.

2. Query data in Amazon S3 directly with ANSI SQL.

3. Use CREATE TABLE AS SELECT (CTAS) to create new tables from the result of a SELECT query.

4. Serverless — no infrastructure to manage.

5. Pay $5 per TB scanned by your query.

  • Amazon EMR : As described in the ETL section, EMR uses big data frameworks such as Apache Hadoop and Apache Spark, and it can also be used to analyse vast amounts of data, for example by cleaning and processing web server logs stored in an Amazon S3 bucket.
  • Amazon Redshift : It is a relational, OLAP-style database. It’s a data warehouse built for the cloud, to run the most complex analytical workloads in standard SQL.
  • Amazon Redshift Spectrum : Amazon Redshift Spectrum is a feature of Amazon Redshift. Spectrum is a serverless query processing engine that allows you to join data that sits in Amazon S3 with data in Amazon Redshift. Athena follows the same logic as Spectrum, except that you go all-in on serverless and skip the warehouse.
  • Amazon Kinesis Data Analytics : Amazon Kinesis Data Analytics enables you to easily and quickly build queries and sophisticated streaming applications in three simple steps: set up your streaming data sources, write your queries or streaming applications, and set up your destination for processed data.
Amazon Kinesis Data Analytics — Image from AWS
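As a hedged illustration of interactive analysis with Athena (my own sketch; the database, table, and result-bucket names are placeholders), the snippet below submits a standard SQL query with boto3, waits for it to finish, and prints the result rows:

```python
import time
import boto3

athena = boto3.client("athena")

# Start a query against a table already registered in the Glue Data Catalog.
start = athena.start_query_execution(
    QueryString=(
        "SELECT client_ip, COUNT(*) AS hits "
        "FROM weblogs GROUP BY client_ip ORDER BY hits DESC LIMIT 10"
    ),
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://my-company-athena-results/"},
)
query_id = start["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```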

6. Visualisation and Reporting :

Business analytics and data visualisation are two sides of the same coin. You need the ability to chart, graph, and plot your data. We have now reached the end of the data analytics pipeline, which is to create visualisations, dashboards, and insightful reports to be used by data scientists, business users, and other engagement platforms.

A key aspect of our ability to understand what’s going on is to look for patterns. These patterns or insights are not revealed by simply viewing data in tables or logs. They become visible when we apply the right tools and techniques to the data to make it more presentable to end users.

Visualisation & Reporting Process : Image — AWS

Two tools commonly used for visualisation are Amazon QuickSight and the open source ELK stack:

  • Amazon QuickSight : Amazon QuickSight lets you create interactive dashboards, charts, and ML-powered insights. These can be embedded in your applications or websites, and QuickSight can easily integrate with your cloud or on-premises setups. (A short embedding sketch appears after this list.)
Amazon QuickSight — Image from AWS
  • ELK : You can also use the open source ELK stack, which stands for (E)lasticsearch, (L)ogstash, and (K)ibana. Kibana is an open source data visualisation plugin for Elasticsearch. It provides visualisation capabilities and an open user interface that lets you visualise your Elasticsearch data and navigate the Elastic Stack.
Kibana used for Visualisation : Image Kibana from site
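As a minimal sketch of how a QuickSight dashboard can be embedded (assuming an existing dashboard and the necessary QuickSight permissions; the account ID and dashboard ID below are placeholders), boto3 can request a time-limited embed URL that you place in an iframe:

```python
import boto3

quicksight = boto3.client("quicksight")

# Request a time-limited URL for embedding an existing dashboard in a web page.
# The AWS account ID and dashboard ID are placeholder values.
response = quicksight.get_dashboard_embed_url(
    AwsAccountId="111122223333",
    DashboardId="sales-overview-dashboard",
    IdentityType="IAM",
    SessionLifetimeInMinutes=60,
)
print(response["EmbedUrl"])  # paste into an <iframe> in your application or website
```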

Please read it and let me know if you have any questions about this blog.

Enjoy Reading !!
