SERVERLESS DATA LAKE

Serverless Data Lake: Storing and Analysing Streaming Data using AWS

Shafiqa Iqbal
The Startup
8 min read · Apr 11, 2020

Building an Amazon S3 Data Lake for Streaming Data using Kinesis, S3, Lambda, Glue, Athena and QuickSight

This article will cover the following:

  • Writing a Python producer that sends records to a Kinesis Data Stream using the KPL aggregator
  • Preprocessing records with a Kinesis Data Analytics preprocessing Lambda
  • Aggregating streaming data at run time with Kinesis Data Analytics
  • Storing data in S3 and creating a catalog in Glue
  • Running queries in Athena and creating views
  • Importing datasets into QuickSight and building charts
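
To make the first item in the list concrete, here is a minimal sketch of such a producer, not the exact code used in this pipeline. It assumes the `aws_kinesis_agg` package (the Python Kinesis Aggregation Module) and `boto3` are installed. The helper names `send_rows` and `main`, the `ID` column used as partition key, and the file and stream names in the usage note are placeholders chosen for illustration.

```python
import csv
import json


def send_rows(rows, aggregator, put_record, key_field="ID"):
    """Pack rows into aggregated records and pass each one to put_record.

    aggregator is expected to behave like aws_kinesis_agg's RecordAggregator.
    put_record takes (partition_key, explicit_hash_key, data).
    Returns the number of aggregated records that were sent.
    """
    sent = 0
    for row in rows:
        # add_user_record returns a completed aggregated record when the
        # current buffer is full, otherwise None.
        full = aggregator.add_user_record(str(row[key_field]), json.dumps(row))
        if full is not None:
            put_record(*full.get_contents())
            sent += 1

    # Flush whatever is still buffered after the last row.
    last = aggregator.clear_and_get()
    if last.get_num_user_records() > 0:
        put_record(*last.get_contents())
        sent += 1
    return sent


def main(csv_path, stream_name):
    # Imported here so send_rows stays usable without AWS libraries.
    import boto3
    from aws_kinesis_agg.aggregator import RecordAggregator

    client = boto3.client("kinesis")

    def put_record(partition_key, explicit_hash_key, data):
        client.put_record(
            StreamName=stream_name,
            Data=data,
            PartitionKey=partition_key,
            ExplicitHashKey=explicit_hash_key,
        )

    with open(csv_path, newline="") as f:
        return send_rows(csv.DictReader(f), RecordAggregator(), put_record)
```

Calling `main("chicago_crimes.csv", "crimes-stream")` (both names hypothetical) would stream the file one aggregated record at a time. Passing the aggregator and the send function in as arguments keeps the batching logic separate from the AWS calls, so it can be tried locally with stand-ins.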

Note: This article walks step by step through how I built this pipeline, so anyone who wants to replicate a similar workflow can use it as a resource.

For the next part of this series, a step-by-step guide to ETL data processing, querying and visualization in a serverless data lake, check my article here.

High-Level Solution Overview

In this post, we will use the Chicago Crimes dataset, which contains 2 GB of data covering 2001 to 2017. The dataset has 6M records, which makes it a good fit for our streaming scenario. We will use the Python Kinesis Aggregation Module to transmit records efficiently to the Kinesis Data Stream. We will use a preprocessing Lambda to transform the records…
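
As a rough illustration of the preprocessing step, the sketch below follows the record contract that Kinesis Data Analytics uses for preprocessing Lambdas: each incoming record carries a `recordId` and base64-encoded `data`, and each returned record carries the same `recordId`, a `result` of `Ok`, `Dropped` or `ProcessingFailed`, and base64-encoded `data`. The transformation itself (normalising column names to snake_case) is only an assumed example, and the sketch assumes each record is already a single JSON document, so it skips deaggregation of KPL-formatted records.

```python
import base64
import json


def lambda_handler(event, context):
    """Normalise field names of each JSON record (an illustrative transformation)."""
    output = []
    for record in event["records"]:
        try:
            row = json.loads(base64.b64decode(record["data"]))
            # Assumed cleanup: "Primary Type" becomes "primary_type".
            cleaned = {
                key.strip().lower().replace(" ", "_"): value
                for key, value in row.items()
            }
            payload = (json.dumps(cleaned) + "\n").encode("utf-8")
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(payload).decode("utf-8"),
            })
        except (ValueError, AttributeError):
            # Bad base64, invalid JSON, or a payload that is not a JSON object:
            # return the original data and mark the record as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Every `recordId` received is returned exactly once, so a malformed record is flagged as `ProcessingFailed` rather than being silently lost.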
