Introduction to AWS Kinesis
In today’s world, data is almost everything, and the number of data sources is growing at a massive rate. This raises the question: how will we handle this amount of data? Data streaming solutions are one of the answers. In this article, we will talk about the AWS Kinesis service.
Before going further, we first need to understand the concept of streaming data.
Streaming data is data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously, and in small sizes (order of Kilobytes). Streaming data includes a wide variety of data such as log files generated by customers using your mobile or web applications, eCommerce purchases, in-game player activity, information from social networks, financial trading floors, or geospatial services, and telemetry from connected devices or instrumentation in data centers.
Today, Kinesis offers three different types of data streaming.
Kinesis Data Streams
Collect gigabytes of data per second and make it available for real-time processing and analysis.
In this type, the data retention period is considerably longer compared with the others, and produced data is immediately available to consumers. It is more suitable for log, event, mobile, gaming, and IoT data collection.
Kinesis Data Firehose
Prepare and load real-time data streams into data stores and analytics tools.
In this type, data is not retained for long: Firehose buffers incoming records and delivers them to a destination. It is more suitable for loading data into data stores, analytics tools, etc.
Kinesis Data Analytics
Get actionable insights from streaming data in real time.
This type is more suitable for real-time analytics or monitoring.
So, let’s build a pipeline from a real-life scenario.
Let’s assume we have an e-commerce website. We want to collect user interactions, such as which products are visited the most, and offer discounts on products depending on their popularity.
First of all, we need a service to handle product visit events, which look like this:
{
  "userId": "123456",
  "product": "foo",
  "tags": ["technology", "mobile", "iphone"],
  "eventTime": 1652225241,
  "pageURL": "https://foo.bar/product/foo"
}
Lambda is a good fit here, because this event will be sent on almost every product page visit.
const AWS = require('aws-sdk')
const firehose = new AWS.Firehose()

exports.handler = async (event) => {
  // API Gateway delivers the request body as a string
  const payload = event.body

  try {
    // putRecord sends a single record; the trailing newline keeps the
    // objects in S3 newline-delimited, which Athena can parse
    await firehose
      .putRecord({
        Record: { Data: payload + '\n' },
        DeliveryStreamName: 'my-kinesis-delivery-stream',
      })
      .promise()
    return { success: true }
  } catch (err) {
    console.log('Firehose putRecord error:', err)
    return { success: false }
  }
}
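One note: if traffic is high, the SDK also offers putRecordBatch, which accepts up to 500 records per call and cuts down the per-request overhead.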
The next step is storing and analyzing this data. For storage, we will use S3; for analysis, we will use Athena.
Depending on the requirements, the storage/analysis part can be swapped out, e.g. storing the data in Elasticsearch or MongoDB instead. That can be done with another Lambda, as sketched below.
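For instance, Firehose supports a Lambda data-transformation hook that receives batches of base64-encoded records; a minimal sketch of that handler shape (the forwarding to another store is only a placeholder) looks like this:

// Sketch of a Firehose data-transformation Lambda; the actual
// indexing into Elasticsearch/MongoDB is left as a placeholder.
exports.handler = async (event) => {
  const records = event.records.map((record) => {
    // Firehose delivers each record's data base64-encoded
    const data = Buffer.from(record.data, 'base64').toString('utf8')
    // ... send `data` to Elasticsearch, MongoDB, etc. here ...
    // Return the record unchanged so delivery to S3 continues
    return { recordId: record.recordId, result: 'Ok', data: record.data }
  })
  return { records }
}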
At the end of the day, our overall flow will look like this: client → Lambda → Kinesis Data Firehose → S3 → Athena.
So, let’s go step by step:
Creating a delivery stream on Firehose:
We configure it to save our events to S3. For the destination, we will store the data with dynamic partitioning; whenever a new event is consumed, it will be stored in S3 under a pattern like this:
s3://test-space/product-visit/{{year}}/{{month}}/{{day}}
This makes it reasonable to aggregate over specific date ranges. In the Firehose destination settings, the corresponding S3 bucket prefix is:
/product-visit/!{partitionKeyFromQuery:year}/!{partitionKeyFromQuery:month}/!{partitionKeyFromQuery:day}/
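The same setup can also be scripted instead of clicked through the console. Below is a minimal sketch using the Node.js SDK’s createDeliveryStream call; the bucket, role ARN, and the jq expressions that derive year/month/day from our eventTime epoch are assumptions for illustration:

const AWS = require('aws-sdk')
const firehose = new AWS.Firehose()

// Sketch only: ARNs and names below are placeholders.
const params = {
  DeliveryStreamName: 'my-kinesis-delivery-stream',
  DeliveryStreamType: 'DirectPut',
  ExtendedS3DestinationConfiguration: {
    BucketARN: 'arn:aws:s3:::test-space',
    RoleARN: 'arn:aws:iam::123456789012:role/firehose-delivery-role',
    // Partition keys extracted below are referenced in this prefix
    Prefix:
      'product-visit/!{partitionKeyFromQuery:year}/!{partitionKeyFromQuery:month}/!{partitionKeyFromQuery:day}/',
    ErrorOutputPrefix: 'product-visit-errors/',
    // Dynamic partitioning requires a buffer size of at least 64 MB
    BufferingHints: { SizeInMBs: 64, IntervalInSeconds: 60 },
    DynamicPartitioningConfiguration: { Enabled: true },
    ProcessingConfiguration: {
      Enabled: true,
      Processors: [
        {
          // MetadataExtraction runs a jq query over each record to
          // produce the partition keys (year/month/day from eventTime)
          Type: 'MetadataExtraction',
          Parameters: [
            {
              ParameterName: 'MetadataExtractionQuery',
              ParameterValue:
                '{year: .eventTime | strftime("%Y"), month: .eventTime | strftime("%m"), day: .eventTime | strftime("%d")}',
            },
            { ParameterName: 'JsonParsingEngine', ParameterValue: 'JQ-1.6' },
          ],
        },
      ],
    },
  },
}

firehose
  .createDeliveryStream(params)
  .promise()
  .then(() => console.log('Delivery stream created'))
  .catch((err) => console.log('createDeliveryStream error:', err))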
Now our pipeline is ready.
The next step is to query and analyze it.
In the Athena query editor:
After that step, we need to provide our S3 path and column definitions to prepare our data for querying:
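A table definition for our events could look roughly like this; the table name is a placeholder, and the columns mirror the event JSON:

CREATE EXTERNAL TABLE product_visits (
  userid STRING,
  product STRING,
  tags ARRAY<STRING>,
  eventtime BIGINT,
  pageurl STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://test-space/product-visit/';

The year/month/day folders can additionally be registered as partitions (for example via partition projection) to avoid scanning the whole prefix, but a flat LOCATION is enough for a first query.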
After those steps, we can aggregate our results with basic SQL queries.
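For example, a query for the ten most visited products might look like this:

SELECT product, COUNT(*) AS visits
FROM product_visits
GROUP BY product
ORDER BY visits DESC
LIMIT 10;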
Hopefully, this article gives some insight into AWS Kinesis usage.
See you in the next articles :]