Spark Streaming & Real Time Analytics on AWS

Sivaprasad Mandapati
Feb 26 · 4 min read

This tutorial describes a real-time analytics framework using Spark Structured Streaming and window functions on Amazon Kinesis, AWS's real-time streaming service.

Amazon Kinesis Data Streams (KDS) is a massively scalable and durable real-time data streaming service. KDS can continuously capture gigabytes of data per second from hundreds of thousands of sources such as website clickstreams, database event streams, financial transactions, social media feeds, IT logs, and location-tracking events. The data collected is available in milliseconds, enabling real-time analytics use cases such as real-time dashboards, real-time anomaly detection, dynamic pricing, and more.

High-Level Design

Continuous events are sent to Kinesis; Spark subscribes to the stream, reads events in real time, performs aggregations on the fly, and saves the results. This is quite a common scenario nowadays for a data engineer.

Steps

  1. Define Kinesis Data Stream
  2. Write Streaming data to Kinesis
  3. Create Spark Streaming Cluster
  4. Apply analytics on Spark streaming data using window functions

Define Kinesis Data Stream

With just a few clicks we can create a data stream in the AWS console. I configured the stream with a shard count of 1.
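The same setup can also be scripted. Here is a minimal boto3 sketch of those console steps, assuming credentials are already configured in the environment (the stream name matches the producer below):

import boto3

# Create the Kinesis client (credentials picked up from the environment)
kinesis = boto3.client('kinesis', region_name='us-east-1')

# Create the stream with a single shard
kinesis.create_stream(StreamName='Kinesis-Spark-Streaming-Analytics', ShardCount=1)

# Wait until the stream is ACTIVE before writing to it
kinesis.get_waiter('stream_exists').wait(StreamName='Kinesis-Spark-Streaming-Analytics')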

Write Streaming Data into Kinesis

A simple Python script can act as a Kinesis client application that writes continuous events to the data stream. The snippet below publishes a message to Kinesis every 3 seconds.

Python has excellent libraries for generating test data and automating testing; Faker and Random are my preferred choices.

Python Client Code Snippet

import os
import boto3
from faker import Faker
import json
import random
import datetime
import time

# Get AWS access key details from environment variables
aws_access_key = os.getenv('AWS_ACCESS_KEY_ID')
aws_secret_access_key = os.getenv('AWS_SECRET_ACCESS_KEY')

# Create the Kinesis client
kinesis_client = boto3.client('kinesis',
                              aws_access_key_id=aws_access_key,
                              aws_secret_access_key=aws_secret_access_key,
                              region_name='us-east-1')

# Generate records with the Faker and Random libraries
faker = Faker()
while True:
    record = {}
    record['first_name'] = faker.first_name()
    record['last_name'] = faker.last_name()
    record['personal_email'] = faker.email()
    record['ssn'] = faker.ssn()
    record['office'] = faker.city()
    record['title'] = faker.job()
    # Columns that do not use Faker
    record['gender'] = 'M' if random.randint(0, 1) == 0 else 'F'
    record['org'] = random.choice(['Engineer', 'Sales', 'Associate', 'Manager', 'VP'])
    record['accrued_holidays'] = random.randint(0, 20)
    # Round salary to the nearest 1000 and bonus to the nearest 500
    record['salary'] = round(random.randint(50000, 100000) / 1000) * 1000
    record['bonus'] = round(random.randint(0, 5000) / 500) * 500
    record['event_time'] = datetime.datetime.now().isoformat()
    # Write the record to the Kinesis stream
    kinesis_client.put_record(StreamName='Kinesis-Spark-Streaming-Analytics',
                              Data=json.dumps(record),
                              PartitionKey='part_key')
    time.sleep(3)

Create Spark Streaming Application

So far we have created the stream and written data to it. Now it's Spark's show.

We can launch an AWS EMR Spark cluster, a local Spark cluster, or a third-party offering such as Databricks or Cloudera. I will go with Databricks Community Edition.

Kinesis delivers event data in JSON format, so we have to apply a schema to the event data in Spark. Below is an explicit schema definition for the employee record.

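A sketch of that schema in PySpark, matching the record fields produced by the Python client above (the name employee_schema is illustrative):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Explicit schema for the employee record published to Kinesis
employee_schema = StructType([
    StructField('first_name', StringType(), True),
    StructField('last_name', StringType(), True),
    StructField('personal_email', StringType(), True),
    StructField('ssn', StringType(), True),
    StructField('office', StringType(), True),
    StructField('title', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('org', StringType(), True),
    StructField('accrued_holidays', IntegerType(), True),
    StructField('salary', IntegerType(), True),
    StructField('bonus', IntegerType(), True),
    StructField('event_time', StringType(), True)
])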

Spark Read Kinesis Stream

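A sketch of the read, assuming the Databricks Kinesis connector and the same credentials as the producer (the variable name kinesis_df is illustrative):

import os

aws_access_key = os.getenv('AWS_ACCESS_KEY_ID')
aws_secret_access_key = os.getenv('AWS_SECRET_ACCESS_KEY')

# Subscribe to the Kinesis stream, starting from the oldest available record
kinesis_df = (spark.readStream
              .format('kinesis')
              .option('streamName', 'Kinesis-Spark-Streaming-Analytics')
              .option('region', 'us-east-1')
              .option('initialPosition', 'trim_horizon')
              .option('awsAccessKey', aws_access_key)
              .option('awsSecretKey', aws_secret_access_key)
              .load())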

Extract Data and Apply Schema

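A sketch of that step (employee_df is illustrative; the Kinesis source delivers the payload in a binary data column):

from pyspark.sql.functions import from_json, col

employee_df = (kinesis_df
               # The payload arrives as bytes; cast it to a JSON string
               .selectExpr('CAST(data AS STRING) AS json_data')
               # Parse the JSON and apply the explicit schema
               .select(from_json(col('json_data'), employee_schema).alias('employee'))
               .select('employee.*')
               # Cast event_time to a timestamp for windowing later
               .withColumn('event_time', col('event_time').cast('timestamp')))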

Define Window Function

Our aim is to calculate a moving count of employees by organization every minute, so we group by org with a 1-minute window on the event time.

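A sketch of that aggregation (counts_df is illustrative):

from pyspark.sql.functions import window

# Count employees per organization over tumbling 1-minute windows
counts_df = (employee_df
             .groupBy(window(col('event_time'), '1 minute'), col('org'))
             .count())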

Start Spark Stream

Trigger the Spark stream with writeStream. The results are saved in memory as a table named 'counts', which we can then query just like any other table.

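A sketch of the write, using the in-memory sink:

# Write the running counts to an in-memory table named 'counts'
query = (counts_df.writeStream
         .format('memory')        # in-memory sink, handy for interactive queries
         .queryName('counts')     # registers the table name
         .outputMode('complete')  # emit the full updated aggregate each trigger
         .start())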

Query Results

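For example:

# Query the in-memory table like any other table
spark.sql('SELECT window, org, count FROM counts ORDER BY window').show(truncate=False)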

Hope you enjoyed this!

Please let me know your feedback.

Thanks

References:

Structured Streaming — Databricks Documentation

Amazon Kinesis Data Streams — Data Streaming Service — Amazon Web Services
