A Serverless Data Lake: Part 1 — Data From Streaming

Rafael Campana
BRLink

--

This article is not massive or complex, but you’ll need to know a few things about the subjects we will implement here. If you are not familiar with them, please read about Serverless, Data Lakes and Data Streaming first. Once you know the basic concepts, let's get our hands dirty 😎

The complete story consists of 3 parts covering data ingestion, and the links will be updated as I publish the new articles:

  1. from Streaming (this one)
  2. from APIs
  3. from Databases

First of all, which services will we use?

  • S3: AWS Object storage. We will store raw and analytical data;
  • Kinesis Firehose: Transform to columnar format and load near real-time data streams into S3;
  • Lambda: Run code without provisioning or managing servers. In this case we will use it to produce fake data and test our lake;
  • Glue: Serverless tool to ETL data. We’ll use Glue Crawler to automatically catalog data and Glue Job (Spark) to do some simple transformations;
  • Athena: Query data in S3 and pay only for scanned data (Presto, a distributed SQL query engine);
  • QuickSight: AWS BI tool to visualize your data.

How do those services interact with each other?

A serverless data lake architecture

We produce data with Lambda and put records to Kinesis Firehose, which loads them into S3 in batches periodically (from 60s to 900s).
That data is crawled and cataloged using Glue Crawlers, then we transform it into analytics-ready data using a Glue Job and write it back to S3 as Parquet, where it can be discovered and explored using Athena and visualized with QuickSight.

Kinesis Firehose — Ingest Data

Create a Firehose delivery stream with an S3 bucket as the destination; in the “Prefix” field, add the following:

raw/year=!{timestamp:YYYY}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/

and in the “S3 error prefix” field, this:

fherroroutputbase/!{firehose:random-string}/!{firehose:error-output-type}/!{timestamp:yyyy/MM/dd}/

With these prefixes, Firehose partitions our data by year, month, day and hour, making it easy to identify.

Set the buffer size to 128 MB and the buffer interval to 900s.
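
If you prefer to script this step instead of using the console, a minimal sketch with boto3 could look like the following. The bucket, role ARN and stream name are placeholders, not values from this article:

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Sketch only: bucket, role ARN and stream name are placeholders
firehose.create_delivery_stream(
    DeliveryStreamName="vehicles-stream",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::my-datalake-bucket",
        # Same prefixes configured in the console above
        "Prefix": "raw/year=!{timestamp:YYYY}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/",
        "ErrorOutputPrefix": "fherroroutputbase/!{firehose:random-string}/!{firehose:error-output-type}/!{timestamp:yyyy/MM/dd}/",
        # Deliver when 128 MB or 900 seconds are reached, whichever comes first
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 900},
        "CompressionFormat": "UNCOMPRESSED",
    },
)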

The Lambda Code — Generate Sample Data

Here I’ve created some code that generates random JSON records with a few properties and puts them to Firehose.

* In a production environment this will be replaced by real sensors and producers.

You can get my Python code here; basically, each execution creates 5k records of data (a sketch of such a producer is shown after the checklist below).

vehicles.py

*Don’t forget:

  • to create an IAM Role that can access Firehose and CloudWatch;
  • to change “DeliveryStreamName” in the pasted code;
  • to increase the Lambda’s timeout.
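
In case the gist above doesn’t load, here is a minimal sketch of what such a producer Lambda might look like. The stream name and record fields are assumptions, not the exact contents of vehicles.py:

import json
import random
import uuid

import boto3

firehose = boto3.client("firehose")
STREAM_NAME = "vehicles-stream"  # change to your delivery stream name


def lambda_handler(event, context):
    records = []
    for _ in range(5000):
        payload = {
            "vehicle_id": str(uuid.uuid4()),
            "vendor": random.choice(["vendor_a", "vendor_b", "vendor_c"]),
            "vehicle_status": {"speed": random.randint(0, 120), "fuel": round(random.random(), 2)},
            "gps_info": {"lat": random.uniform(-90, 90), "lon": random.uniform(-180, 180)},
        }
        records.append({"Data": (json.dumps(payload) + "\n").encode("utf-8")})

    # PutRecordBatch accepts at most 500 records per call
    for i in range(0, len(records), 500):
        firehose.put_record_batch(DeliveryStreamName=STREAM_NAME, Records=records[i:i + 500])

    return {"records_sent": len(records)}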

To generate records automatically and test the throughput capacity of Firehose, create a CloudWatch Events trigger as below:

CloudWatch event config
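
A scripted equivalent of that trigger, as a sketch (the rule name, schedule and function ARN are assumptions):

import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:vehicles-producer"

# Invoke the producer Lambda every 5 minutes
rule = events.put_rule(
    Name="vehicles-producer-schedule",
    ScheduleExpression="rate(5 minutes)",
    State="ENABLED",
)

events.put_targets(
    Rule="vehicles-producer-schedule",
    Targets=[{"Id": "vehicles-producer", "Arn": LAMBDA_ARN}],
)

# Allow CloudWatch Events to invoke the function
lambda_client.add_permission(
    FunctionName="vehicles-producer",
    StatementId="allow-cloudwatch-events",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)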

Once this step is done, you can see the raw data in the bucket within a few minutes.

Glue Job & Crawler — Catalog and Transform

Now we will create a crawler to catalog our data.

When creating the crawler, set the output bucket as the path to be cataloged. After it runs, we will see the following data on the created table under the Databases/Tables menu:

vehicles_table from glue database/tables
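
If you’d rather script the crawler than click through the console, a boto3 sketch could look like this (the role, database name and bucket path are assumptions):

import boto3

glue = boto3.client("glue")

# Crawl the raw prefix written by Firehose (names and path are placeholders)
glue.create_crawler(
    Name="vehicles-raw-crawler",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",
    DatabaseName="vehicles",
    Targets={"S3Targets": [{"Path": "s3://my-datalake-bucket/raw/"}]},
)

glue.start_crawler(Name="vehicles-raw-crawler")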

As we can see in the table above, vehicle_status and gps_info are structs. This happens because our JSON is nested, so let’s relationalize our data to easily visualize struct and array members.

To do this, we’ll create a Glue Job using PySpark to transform our raw data into flattened, compressed (gzip), read-optimized (Parquet) data.

Create a job with this source code:
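
If the embedded gist doesn’t render, the sketch below shows what such a PySpark Glue Job might look like, using Glue’s Relationalize transform to flatten the nested JSON and writing gzip-compressed Parquet. It is not the exact job code; database, table and bucket names are assumptions:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import Relationalize
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw table cataloged by the crawler (database/table names are placeholders)
raw = glue_context.create_dynamic_frame.from_catalog(database="vehicles", table_name="raw")

# Flatten the nested structs (vehicle_status, gps_info) into a single relational frame
flattened = Relationalize.apply(
    frame=raw, staging_path="s3://my-datalake-bucket/tmp/", name="root"
).select("root")

# Repartition by vendor and write read-optimized, gzip-compressed Parquet to the stage path
df = flattened.toDF().repartition("vendor")
df.write.mode("overwrite").option("compression", "gzip").parquet("s3://my-datalake-bucket/stage/")

job.commit()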

*Don’t forget:

  • Guarantee that your role can read from and write to S3
  • Change the output bucket to your real path

Tip: You can use a Jupyter Notebook to test your code; I’ve created a notebook file to help you here:

vehicles.ipynb

After this job runs, check your S3 bucket and something like this will appear:

This happens because we repartition by vendor in the PySpark code.

Add the new path (stage) to the crawler and run it again; a new table, “vehicles_stage”, will be available in the database.

Note that you can create a schedule or workflow in Glue to automate this process and keep your data ready to be consumed.
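
For example, a scheduled Glue trigger for the transform job could be created with a sketch like this (the trigger name, job name and cron expression are assumptions):

import boto3

glue = boto3.client("glue")

# Run the transform job at the top of every hour (names are placeholders)
glue.create_trigger(
    Name="vehicles-transform-hourly",
    Type="SCHEDULED",
    Schedule="cron(0 * * * ? *)",
    Actions=[{"JobName": "vehicles-transform"}],
    StartOnCreation=True,
)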

Athena

Athena is a Presto-based service to query data in S3.

The data is already cataloged by Glue, so we can query it easily, as below. The tables will be displayed on the right side:

database/tables/fields on Athena

Run the following query to test your data:

SELECT * FROM "vehicles"."vehicles_stage" limit 10;

The result will be similar to this:

query result
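
If you want to run the same query programmatically instead of in the console, a boto3 sketch could look like this (the results bucket is an assumption):

import time

import boto3

athena = boto3.client("athena")

# Submit the query; results also land in the output location as CSV
execution = athena.start_query_execution(
    QueryString='SELECT * FROM "vehicles"."vehicles_stage" limit 10',
    QueryExecutionContext={"Database": "vehicles"},
    ResultConfiguration={"OutputLocation": "s3://my-datalake-bucket/athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then print the first page of results
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])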

QuickSight

Now the last part of our story: visualizing the data in dashboards. QuickSight has several visuals that can be explored, but first we need to create a data source backed by Athena, and this is as easy as it should be.

First, create your QuickSight account, go to “Manage Data” and “New data set”, select “Athena”, and finally select the last created table.

Now we can play with some dashboards and visuals. As a sample, I’ve created a dashboard that uses latitude and longitude to plot the vehicles on a map, as below, with the distribution by vendor beside it.

How much will it cost?

The cost starts at zero. Yes, as serverless, you pay only for what you use. The cost depends on each service’s price in the specific region.

The price to reproduce this demo was US$ 0.24, but it depends on how much data you input using the Lambda “dummy generator”.

I created the resources in us-east-1, and here is a summary with the prices I checked today:

Kinesis Firehose: 0.029 USD / GB
S3: 0.023 USD / GB
Glue Job: 0.44 USD / DPU-hour
Glue Crawler: 0.44 USD / DPU-hour
Athena: 5 USD / TB scanned
QuickSight: Starts from 9 USD/month

Let’s assume that you produce 3 GB/day of raw data (~90 GB/month). You will pay:

Kinesis Firehose: 0.029 USD * 3 * 30 = ~2.61 USD
S3: 0.023 USD * 3 * 30 = ~2.07 USD (cumulative)
Glue Job: 0.44 USD / DPU-hour * 10 DPUs * 5 hours (10-minute jobs) = 22 USD
Glue Crawler: 0.44 USD / DPU-hour * 1 DPU * 5 hours = 2.2 USD
Athena: 5 USD / TB scanned = 5 USD (if you are a heavy user)
QuickSight: Starts from 9 USD/month

Total: ~42.88 USD / month
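
For reference, a quick Python check of that arithmetic under the assumptions above:

# Monthly estimate for ~3 GB/day of raw data (~90 GB/month), us-east-1 prices above
firehose = 0.029 * 90          # 2.61 USD
s3 = 0.023 * 90                # 2.07 USD (and cumulative over time)
glue_job = 0.44 * 10 * 5       # 22.00 USD (10 DPUs, ~5 hours of 10-minute jobs)
glue_crawler = 0.44 * 1 * 5    # 2.20 USD
athena = 5.0                   # ~1 TB scanned
quicksight = 9.0               # entry-level subscription

total = firehose + s3 + glue_job + glue_crawler + athena + quicksight
print(round(total, 2))  # 42.88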

If you just create a few GB of data per day, your entire data workload will cost less than an m5.large instance per month.

--


Tech Manager @ BRLink — AWS Certified Developer & Big Data