Firsthand: Serverless Data Pipeline

Ewere Diagboya
Published in MyCloudSeries
Aug 31, 2018 · 3 min read


It has been a great experience sifting through gigabytes of data, and I have decided to share my first experience with data pipelines. I am used to the normal application delivery pipeline, which involves CI (Continuous Integration) followed by CD (Continuous Delivery/Deployment) and is one of the pillars of DevOps. This time the process is ETL (extraction, transformation and loading). The process is still continuous: continuous extraction, continuous transformation and continuous loading of data.

An application deployment pipeline typically takes seconds or minutes to complete (depending on scale, a deployment might take longer). Data pipelines, however, can take days. The tools used are also totally different from those used in application deployment.

This article explains how I set up a fully serverless data pipeline end to end, the tools used, and the configuration that makes it happen.

Toolset

The tools used to set up the pipeline are AWS services. These services are:

  1. AWS Kinesis (agent, service)
  2. AWS Lambda
  3. AWS S3

AWS Kinesis

Kinesis is a scalable and highly fault-tolerant data streaming service, made up of Kinesis Stream, Kinesis Firehose, Kinesis Analytics and Kinesis Video Stream. Our focus here is on Firehose. Firehose is a data streaming service that takes data input, in this case from an EC2 instance running the Kinesis Agent. The Kinesis Agent can be installed on any EC2 instance given the appropriate IAM role. If you are using a VM from another provider or an on-premises VM, configure the AWS CLI to allow access to Kinesis Firehose.

When the Kinesis Agent is installed, the next step is to configure it. The agent takes four basic parameters. For our case the four parameters are:

  1. Firehose Endpoint
  2. Input File pattern
  3. deliveryStream (Name of the Firehose Stream)
  4. initialPosition (this determines where the agent starts reading the file from; the default is END_OF_FILE, so it is good to set it explicitly if you want the agent to read from the beginning of the file)

A sample configuration is below:

Kinesis Agent Config
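
Here is a minimal sketch of what the agent configuration (typically /etc/aws-kinesis/agent.json) can look like; the log file path and delivery stream name below are placeholders for this example, not values from the original setup:

```json
{
  "firehose.endpoint": "firehose.eu-west-1.amazonaws.com",
  "flows": [
    {
      "filePattern": "/var/log/app/transactions.log*",
      "deliveryStream": "transactions-firehose-stream",
      "initialPosition": "START_OF_FILE"
    }
  ]
}
```

With this in place the agent tails any file matching filePattern and pushes new lines into the named delivery stream.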

Also note that the firehose.endpoint depends on your region. In our example the region is eu-west-1 (Ireland).

This is the extraction process, next is transformation.

AWS Lambda

Lambda is AWS's serverless compute service. For this scenario we are using it to transform our data. Transformation involves changing the content of the data: it could mean normalizing the value of a particular field, encrypting a field, standardizing the date/time, adding or removing a column, and more. Since this is a serverless pipeline, Lambda comes in handy for the job. The data coming in from Kinesis is fed into the Lambda function that we create and then profile in the Firehose configuration. Below is sample code for the Kinesis transformation:

Sample code for Lambda
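
Below is a minimal sketch of the transformation function in Python, assuming the standard Firehose data-transformation event shape and treating amount and discount as the monetary fields to round up:

```python
import base64
import json
import math

def lambda_handler(event, context):
    """Kinesis Firehose transformation: round up currency values in each record."""
    output = []
    for record in event['records']:
        # Firehose delivers each record base64-encoded
        payload = json.loads(base64.b64decode(record['data']))

        # Round up the monetary fields (assumed to be amount and discount)
        for key in ('amount', 'discount'):
            if key in payload:
                payload[key] = math.ceil(float(payload[key]))

        # Re-encode the transformed record and mark it as successfully processed
        transformed = json.dumps(payload) + '\n'
        output.append({
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(transformed.encode('utf-8')).decode('utf-8')
        })
    return {'records': output}
```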

This code assumes the keys in the JSON records are date, traxid, discount, amount and currency, and it automatically rounds up all the currency values.

This handles the transformation, which is the second stage; the final stage is loading the transformed data.

S3

Simple Storage Service: that is what S3 stands for. We will be using it as our data lake, where we load all our transformed data from the Lambda function. S3 does not require any major configuration; just ensure the bucket is created and the S3 best practices are followed. This bucket is also profiled in Kinesis as the delivery destination.
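
As a minimal sketch (the bucket name here is a placeholder), the bucket can be created and given sensible defaults with boto3:

```python
import boto3

s3 = boto3.client('s3', region_name='eu-west-1')
bucket = 'my-serverless-datalake-bucket'  # placeholder name

# Create the data lake bucket in the same region as the Firehose stream
s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={'LocationConstraint': 'eu-west-1'}
)

# Best practices: block all public access and enable default encryption
s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        'BlockPublicAcls': True,
        'IgnorePublicAcls': True,
        'BlockPublicPolicy': True,
        'RestrictPublicBuckets': True
    }
)
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        'Rules': [{'ApplyServerSideEncryptionByDefault': {'SSEAlgorithm': 'AES256'}}]
    }
)
```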

Serverless Data Pipeline Architecture

Addons

Our diagram adds Glue and Athena, which are ETL and analytics tools. Glue crawls the data and catalogues it so that Athena can query it. Yes, Athena queries JSON, CSV and TSV data in S3. Athena uses the normal SQL we all know to query the data in S3 as configured and profiled by Glue. Since we are talking about the ETL pipeline here, we won't be going much into analytics.
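
As a quick sketch of what querying looks like (the database, table and result location below are hypothetical, assuming a Glue crawler has already catalogued the transformed data in S3):

```python
import boto3

athena = boto3.client('athena', region_name='eu-west-1')

# Hypothetical database/table created by a Glue crawler over the data lake
response = athena.start_query_execution(
    QueryString="SELECT currency, SUM(amount) AS total FROM transactions GROUP BY currency",
    QueryExecutionContext={'Database': 'datalake_db'},
    ResultConfiguration={'OutputLocation': 's3://my-serverless-datalake-bucket/athena-results/'}
)
print(response['QueryExecutionId'])
```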

Why Serverless Pipeline — BSOT

The basic essence of the whole serverless idea is to:

1. reduce OpEx (operational expense),
2. reduce cost,
3. improve performance efficiency.

This is exactly what this serverless data pipeline brings to the table. It is very cost-effective and scalable. Kinesis and Lambda practically scale automatically as the flow of data increases, to meet the demand.

Thanks
