Building Real-Time Streaming Data Pipelines With AWS Kinesis

Khanjanmarthak
Simform Engineering
May 15, 2023

Enable your data to flow like a river with AWS Kinesis.

Imagine if you had a solution to produce and process massive amounts of data in real-time. By massive data, I mean the humongous data generated by millions of devices and applications across the globe. That’s exactly what Amazon Kinesis Services is designed to do — to provide a scalable and reliable platform for ingesting, processing, and analyzing streaming data in real time.

With this scalable and dependable platform, you can tap into the true potential of your data and get actionable insights faster than ever. Whether you’re dealing with clickstreams, sensor data, or social media feeds, Amazon Kinesis Services makes it easy to process data on the fly, so you can make real-time decisions and stay ahead of the curve.

Amazon Kinesis Services

Amazon Kinesis Services is a suite of services provided by AWS for easily ingesting and processing streaming data. Here are the major services that make up Amazon Kinesis Services.

  1. Amazon Kinesis Data Streams: This service is used to collect, process, and analyze large amounts of streaming data in real time. It enables you to create data streams and allows multiple applications to consume the same stream simultaneously. Data is retained for 24 hours by default and can be kept for up to 365 days, and you can configure your stream’s shard capacity to meet your throughput requirements.
  2. Amazon Kinesis Data Firehose: This service is designed to capture, transform, and load streaming data into various AWS data stores and analytics services, such as Amazon S3, Amazon Redshift, and Amazon OpenSearch Service (formerly Amazon Elasticsearch Service). With Kinesis Data Firehose, you can easily prepare and transform your data before loading it into your desired destination.
  3. Amazon Kinesis Video Streams: This service allows you to securely stream video from various devices to AWS, where you can process, analyze, and store video data. Kinesis Video Streams also provides an easy way to integrate computer vision, machine learning, and analytics services to process video data in real-time.
  4. Amazon Kinesis Data Analytics: This service enables you to perform real-time analytics on streaming data using SQL or Java without the need for managing servers or infrastructure. It can automatically detect schema changes and can scale to handle changes in data volume and throughput.

Why choose Kinesis to process streaming data?

There are a number of reasons why Kinesis is ahead of its competitors. Check out its distinguishing features below:

Integration with AWS ecosystem: Amazon Kinesis integrates seamlessly with other AWS services, making it easy to build complete data processing pipelines.

High scalability: Amazon Kinesis is highly scalable and can handle large amounts of streaming data with ease.

Multiple data processing options: Amazon Kinesis offers different ways to process data, including real-time SQL analytics, Java-based processing, and integration with Apache Flink.

Security and compliance: Amazon Kinesis is designed with security in mind and supports data encryption in transit and at rest.

Cost-effective: Amazon Kinesis offers a pay-as-you-go pricing model, where users only pay for what they use.

Performing real-time analysis of the data generated by your producers involves many stages. These stages include ingesting the data at the desired throughput and simultaneously manipulating it for the desired output. The next step is to store the data in some form of a data lake for persistence. So we must create a workflow in the form of a data pipeline to make the most of this data analysis process.

Before we showcase an example of the whole workflow, let’s quickly go through the fundamentals of ‘producers’ and ‘consumers’ needed for Kinesis services.

  • Producers: These are the sources that generate and send data to Kinesis. These sources could be applications, devices, sensors, or any other source that generates streaming data. Producers can send data to Kinesis Data Streams or Kinesis Data Firehose, depending on their requirements and use case.
  • Consumers: These are the applications or services that read and process data from Kinesis. Consumers can be applications running on EC2 instances or serverless compute services like AWS Lambda. They can read data from Kinesis Data Streams and process it in real time using services such as AWS Lambda, Amazon Kinesis Data Analytics, or Amazon EMR, while Kinesis Data Firehose delivers data directly to its configured destinations.
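
To make these roles concrete, here is a minimal boto3 sketch of a producer and a consumer working against a Kinesis data stream. The stream name, region, and payload fields are placeholders for illustration, not values from the pipeline below.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Producer: send one record; the partition key decides which shard receives it.
kinesis.put_record(
    StreamName="example-stream",
    Data=json.dumps({"deviceId": "sensor-42", "temperature": 21.7}).encode("utf-8"),
    PartitionKey="sensor-42",
)

# Consumer: read from the first shard using a shard iterator.
stream = kinesis.describe_stream(StreamName="example-stream")
shard_id = stream["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="example-stream",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

for record in kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]:
    print(record["Data"])
```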

The How to Do It part

Consider a situation where you have huge incoming data in your application/database 24x7, and you want to perform real-time data analysis to provide app insights to end users as well as user behavior insights to app owners. To achieve this, we have set up a real-time streaming data pipeline using AWS Kinesis. Below is the pipeline workflow; we will go through each step in the latter part of this blog post.

Data Pipeline

The above architecture uses DynamoDB as the driving force for a real-time application. Deeper analysis and valuable transformations are done using the rest of the pipeline.

Understanding the pipeline and configuring Amazon Kinesis

  • First, set up the Producer for Amazon Kinesis; in this case, we have DynamoDB, which provides a built-in option to stream changes to a Kinesis data stream. If you want your application to place incoming data into Amazon Kinesis directly, you may have to configure a Kinesis Agent or use the AWS SDK in the application itself.
  • After setting up the Kinesis data stream, we can set up a Firehose delivery stream with Lambda to do the required data transformations (if needed). You can also configure Kinesis Data Analytics for Apache Flink as a Consumer of the data stream. Thereafter, Firehose can move the data into a data lake for further analysis, visualization, or simply for persistence.
  • Alternatively, we can set up Redshift instead of the Data Analytics service and run the transformations there. We can later visualize the transformed data with the QuickSight service (this is what we have used in our example).

Let’s get to doing some practical work

1) Creating data streams

  • Open the Amazon Kinesis console in your web browser and sign in to your AWS account. Click on the “Create data stream” button on the Kinesis dashboard.
  • Enter a unique name for your data stream in the “Stream name” field.
  • We have selected provisioned capacity with the minimum of one shard for this example, which supports up to 1,000 records/second and 1 MiB/second for writes, and 2 MiB/second for reads.
  • By default, the Retention period is set to 1 day but you can modify it as per your needs.
  • After creating the data stream, you can see the Producers and the Consumers sections. We’ll be setting up Firehose as the Consumer later. For now, let’s set up DynamoDB as the Producer for this data stream.
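
If you prefer scripting over the console, the same stream can be created with boto3. This is a hedged sketch; the stream name and retention value are placeholders chosen for this example.

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# One provisioned shard: up to 1,000 records/s and 1 MiB/s for writes,
# and 2 MiB/s for reads.
kinesis.create_stream(
    StreamName="app-activity-stream",
    ShardCount=1,
    StreamModeDetails={"StreamMode": "PROVISIONED"},
)
kinesis.get_waiter("stream_exists").wait(StreamName="app-activity-stream")

# Optional: extend retention beyond the 24-hour default (value is in hours).
kinesis.increase_stream_retention_period(
    StreamName="app-activity-stream",
    RetentionPeriodHours=48,
)
```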

2) Configuring DynamoDB

  • To set up DynamoDB, go to DynamoDB’s console, select Create Table, and create one with a partition key. We currently have raw JSON data which, on populating the table, will create five attributes (columns) with Id as the primary key. That’s the reason we have also specified Id as the partition key. We have kept the other settings as default for this example, but you can obviously play with those settings as well.
  • After creating the table, let’s connect it to the Kinesis data stream. DynamoDB has an inbuilt feature to integrate with a Kinesis data stream. You can configure it under the ‘Exports and streams’ tab by selecting the ‘Amazon Kinesis data stream details’ option. There, you can turn on the stream and select the stream that we created before, as shown in the images below.
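
The same table and integration can also be set up programmatically. The sketch below assumes a table named app-events and the stream created earlier; both names are placeholders.

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")
kinesis = boto3.client("kinesis", region_name="us-east-1")

# Create the table with Id as the partition key (on-demand capacity here).
dynamodb.create_table(
    TableName="app-events",
    AttributeDefinitions=[{"AttributeName": "Id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "Id", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
)
dynamodb.get_waiter("table_exists").wait(TableName="app-events")

# Turn on the Kinesis data stream integration (the same thing the console's
# 'Amazon Kinesis data stream details' option does).
stream_arn = kinesis.describe_stream(StreamName="app-activity-stream")[
    "StreamDescription"
]["StreamARN"]
dynamodb.enable_kinesis_streaming_destination(
    TableName="app-events",
    StreamArn=stream_arn,
)
```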

Note: You can also configure your application directly with the Kinesis data stream, but for that, you may have to use the AWS SDK, the Kinesis Producer Library (KPL), or a Kinesis Agent to stream the data coming into your application.

3) Creating a Redshift cluster and configuring it

Before setting up Firehose as the Consumer of the data stream, let’s create a Redshift cluster, since Redshift (together with an intermediate S3 bucket) will be the destination of our delivery stream.

  • You can go to Redshift’s console, open ‘Provisioned clusters dashboard’, and select ‘Create a cluster.’ You can give it a unique identifier and select the node type.
  • We have selected dc2.large node type for this example with 1 node.
  • You must create an admin username and password to access your Redshift cluster. You can also attach an IAM role; here, we select ‘ServiceRoleForRedshift’. For VPC-specific details, you can uncheck ‘Use defaults’ under Additional configurations and fill in the details according to your environment. We have kept the default configuration for this example.

Note: Make sure your Redshift is publicly accessible, so that Firehose can communicate with it.
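
For reference, here is a hedged boto3 sketch of the same cluster configuration; the identifier, credentials, and IAM role ARN are placeholders you would replace with your own values.

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

redshift.create_cluster(
    ClusterIdentifier="kinesis-demo-cluster",
    NodeType="dc2.large",
    ClusterType="single-node",          # one dc2.large node, as in this example
    DBName="dev",
    MasterUsername="admin_user",
    MasterUserPassword="ChangeMe123!",  # keep real credentials in Secrets Manager
    PubliclyAccessible=True,            # so that Firehose can reach the cluster
    IamRoles=["arn:aws:iam::123456789012:role/ServiceRoleForRedshift"],
)
```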

  • After your Redshift cluster is available, we need to configure it by creating tables and schemas that match your data. This data will be loaded into Redshift for analytics and transformations. Since Firehose will be copying the data from S3 to Redshift, it is recommended that the Redshift table’s columns match the incoming data. You can use any of the query editors (we have used the v1 editor in our example) to create and alter the tables.
  • Once you run the query, you can see the tables and columns that were created. I also suggest creating a separate user with appropriate access to the tables so that Firehose can use that specific user to load data into Redshift.
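
If you would rather not use the query editor, the Redshift Data API can run the same statements. The table definition below is purely illustrative; match the columns to the fields your pipeline actually produces.

```python
import boto3

redshift_data = boto3.client("redshift-data", region_name="us-east-1")

sql_statements = [
    # Columns should line up with the JSON that Firehose will COPY from S3.
    """
    CREATE TABLE IF NOT EXISTS app_events (
        id       VARCHAR(64),
        username VARCHAR(256),
        event    VARCHAR(256)
    );
    """,
    # A dedicated user that the Firehose delivery stream will connect as.
    "CREATE USER firehose_user PASSWORD 'ChangeMe123!';",
    "GRANT INSERT ON TABLE app_events TO firehose_user;",
]

for sql in sql_statements:
    redshift_data.execute_statement(
        ClusterIdentifier="kinesis-demo-cluster",
        Database="dev",
        DbUser="admin_user",
        Sql=sql,
    )
```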

4) Creating a delivery-stream bucket

  • Open the S3 console and create a bucket that will act as the intermediate location for the Firehose delivery stream; Firehose stages data here before copying it into Redshift.

5) Configuring Firehose with intermediate S3 and Redshift

  • Start by selecting the source and destination to set up Firehose.
  • Here, we select the data stream as our source and Redshift as our destination. While setting Firehose’s destination, we also need to specify an intermediate S3 bucket, where we select the bucket that we created in the previous step.
  • In the Transform records section, you can also select a Lambda function that will transform the data before loading it into the bucket and Redshift (you will probably want to extract only specific fields). In our case, we have kept it unchecked.

Selecting Source:

Selecting Destination:

Configuring intermediate S3 bucket:

  • Click the ‘Create delivery stream’ button and wait for the delivery stream to become available.
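
If you prefer to script this step instead of using the console, a hedged boto3 sketch of the same configuration is below; every ARN, name, and credential here is a placeholder.

```python
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

firehose.create_delivery_stream(
    DeliveryStreamName="dynamodb-to-redshift",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/app-activity-stream",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
    },
    RedshiftDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "ClusterJDBCURL": "jdbc:redshift://kinesis-demo-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev",
        "CopyCommand": {
            "DataTableName": "app_events",
            "CopyOptions": "json 'auto'",
        },
        "Username": "firehose_user",
        "Password": "ChangeMe123!",
        # The intermediate bucket from step 4: Firehose stages data here
        # before issuing the COPY into Redshift.
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
            "BucketARN": "arn:aws:s3:::my-delivery-stream-bucket",
        },
    },
)
```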

Note: There is a list of public IPs that you need to allow in the security group attached to Redshift. These IPs are for Firehose to communicate and connect with Redshift.

6) Test the data pipeline

  • You are all set to test your real-time streaming data pipeline; for that, you need to populate your DynamoDB table. The simplest way is to insert a few items through DynamoDB’s built-in item editor and check whether Firehose is able to deliver them into Redshift.
  • To load the incoming data from DynamoDB into Redshift seamlessly, we have to extract only the relevant fields. For that, I recommend using the Lambda transformation in Firehose before the data is loaded into S3 and Redshift; a sketch of such a function follows.
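
Here is a hedged sketch of such a transformation Lambda. It assumes each Firehose record carries a DynamoDB change event from the Kinesis stream and keeps only a few NewImage fields; the attribute names are placeholders for whatever your table actually contains.

```python
import base64
import json


def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # DynamoDB-to-Kinesis records carry the item under dynamodb.NewImage,
        # wrapped in DynamoDB types such as {"S": "..."} and {"N": "..."}.
        new_image = payload.get("dynamodb", {}).get("NewImage", {})
        flattened = {
            "id": new_image.get("Id", {}).get("S"),
            "username": new_image.get("Username", {}).get("S"),
            "event": new_image.get("Event", {}).get("S"),
        }

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            # Firehose expects base64-encoded output; the trailing newline keeps
            # one JSON object per line for the Redshift COPY.
            "data": base64.b64encode(
                (json.dumps(flattened) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```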

Winding Up 👋

In this post, we went over what Kinesis has to offer when it comes to streaming data processing. With its cost-effective pay-as-you-go pricing model, powerful security features, and compliance with industry and regional regulations, Amazon Kinesis is an ideal choice for businesses looking to leverage real-time data to drive business growth and success. In addition to that, we also built a real-time streaming data pipeline with Kinesis which streams data from DynamoDB and drops it into S3 and Redshift.

The flexibility of Kinesis makes it adaptable to a wide range of use cases, from IoT sensor data to website clickstreams. With its ability to scale and handle massive amounts of data, Kinesis can help organizations gain valuable insights and make data-informed decisions.

See you in the next read! Until then, follow Simform Engineering to keep up with the latest trends in the development ecosystem.
