Kinesis warm-up

Piero Bozzolo
Published in Claranet-CH
Aug 4, 2022 · 3 min read

In the big-data era, how can I collect data in real time? How can I track user behaviour on my site? Hundreds of security cameras are in my facility: how do I notice intrusions? I have thousands of sensors in my plants that monitor electric motors' RPM, input voltage and torque. How can I ingest and analyse all of this data to detect anomalies?

A standard self-managed option is Apache Kafka, an open-source distributed event streaming platform. Kafka is a great tool, but it must be installed and maintained. Let’s see the managed alternatives in the AWS cloud.

Kinesis

Kinesis is a service family that includes Data Streams, Data Analytics, Data Firehose and Video Streams. Each service has a specific purpose, with one thing in common: they can handle massive amounts of data without breaking a sweat. Let’s see the main differences.

We can start to ingest data at any scale with Data Streams. That data can be user clickstreams, IoT device metrics, or logs.

Data can be analysed with Kinesis Data Analytics; for example, you can run SQL-like queries over the stream to extract insights. Then, if you need to persist data, you can use Kinesis Data Firehose to save the streams into S3 or a data warehouse such as Redshift. Let’s focus on Kinesis Data Streams, which is useful for ingesting data at scale.
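To make this concrete, here is a minimal sketch of creating a Data Stream with the AWS SDK for Python (boto3); the stream name and shard count are arbitrary placeholders for this example:

import boto3

kinesis = boto3.client("kinesis")

# Create a provisioned stream with a single shard
# ("clickstream-demo" is a made-up name for this example).
kinesis.create_stream(StreamName="clickstream-demo", ShardCount=1)

# Block until the stream reaches the ACTIVE state.
kinesis.get_waiter("stream_exists").wait(StreamName="clickstream-demo")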

When you use this service, data records are streamed to shards (or partitions) with a predefined ingestion capacity of 1 MB/s per shard. One MB is also the maximum size of a single record before base64 encoding. This means that if your records are 1 MB each, you can ingest only one record per shard per second; if your records are 1 KB each, you can send 1,000 records per second per shard (which is also the per-shard record limit). If you need more ingestion capacity you can add more shards, but pay attention to the partition key you are using: the Kinesis stream uses partition keys to route each record to the correct shard. When you want to send records to Kinesis, you can use the PutRecords API with a payload like the following:

{
    "Records": [
        {
            "Data": "dGkgaW50ZXJlc3NhIGxhdm9yYXJl=",
            "PartitionKey": "partitionKey1"
        },
        {
            "Data": "bmVsIGNsb3VkPyBjb250YXR0YW1pIQo=",
            "PartitionKey": "partitionKey2"
        },
        {
            "Data": "Gi4sEdd08HypA",
            "PartitionKey": "partitionKey3"
        }
    ],
    "StreamName": "streamName"
}

In the Data field, you’ll put your base64-encoded data, and in the PartitionKey field, you’ll set a value that determines the destination shard.
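For reference, a minimal boto3 sketch of the same batch call (the stream name and partition keys are made up; note that the SDK base64-encodes the Data bytes for you):

import boto3

kinesis = boto3.client("kinesis")

# Send a small batch; boto3 base64-encodes each Data payload.
response = kinesis.put_records(
    StreamName="clickstream-demo",  # hypothetical stream name
    Records=[
        {"Data": b"click:home", "PartitionKey": "user-1"},
        {"Data": b"click:cart", "PartitionKey": "user-2"},
    ],
)

# PutRecords is not all-or-nothing: check for partial failures.
if response["FailedRecordCount"] > 0:
    print("Some records were throttled, retry the failed ones")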

Once data is ingested it can’t be changed: data is immutable in Kinesis. Moreover, records are kept for 24 hours by default, but you can increase the retention period up to one year.
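Extending retention is a single API call; here is a boto3 sketch, assuming the same hypothetical stream (8,760 hours is the one-year maximum):

import boto3

kinesis = boto3.client("kinesis")

# Raise retention from the default 24 hours to the one-year maximum.
kinesis.increase_stream_retention_period(
    StreamName="clickstream-demo",  # hypothetical stream name
    RetentionPeriodHours=8760,
)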

An interesting feature is the built-in fan-out capability: you can attach one or more consumers to the output stream, so many applications can analyse the stream simultaneously. When you read data from Kinesis, you can use shared-throughput or enhanced fan-out consumers. With shared-throughput consumers, the shard’s 2 MB/s output capacity is shared among all consumers. With enhanced fan-out, each consumer gets its own dedicated 2 MB/s of read throughput per shard.
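Registering an enhanced fan-out consumer is also one call; a minimal boto3 sketch (the account ID, region and consumer name are placeholders):

import boto3

kinesis = boto3.client("kinesis")

# Register a dedicated-throughput (enhanced fan-out) consumer.
kinesis.register_stream_consumer(
    StreamARN="arn:aws:kinesis:eu-west-1:123456789012:stream/clickstream-demo",
    ConsumerName="analytics-app",  # hypothetical consumer name
)

Once registered, the consumer reads through the SubscribeToShard API and receives its own 2 MB/s per shard, instead of sharing GetRecords throughput with the other consumers.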

How do we pay for this I/O capacity?

Kinesis Data Streams comes with two capacity modes: provisioned and on-demand.
In provisioned mode, you pay for each shard per hour. Remember: each shard has 1 MB/s input capacity and 2 MB/s output capacity, so if you need more I/O you must manually scale the number of shards.

When you use Kinesis Data Streams in on-demand mode, no capacity provisioning is required. The default write capacity is 4 MB/s or 4,000 records per second, and it is automatically adjusted based on the last 30 days of usage to fulfil new traffic patterns. You pay per stream per hour, plus a charge for each gigabyte of data in and data out.
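Choosing on-demand happens at stream creation (or later via UpdateStreamMode); a minimal boto3 sketch with a made-up stream name:

import boto3

kinesis = boto3.client("kinesis")

# Create a stream in on-demand mode: no shard count to manage.
kinesis.create_stream(
    StreamName="clickstream-ondemand",  # hypothetical stream name
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)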

In the following articles, we will see how to use Kinesis in practice.


Piero Bozzolo
Cloud Architect and developer at Claranet CH, AWS Trainer Champion