Think about data ingestion requirements when you need to rapidly move data off source systems or applications (producers) as it gets generated, and waiting for data to batch or accumulate is not an option.
The following are some scenarios that often occur within various operations (back office, IT, etc.). In such cases, you are most likely looking for a data ingestion solution that handles streaming data before any processing, transformation, or storage steps; provides real-time metrics and analytics; or derives more complex data streams for further processing.
- Log processing and analysis — System and application logs can be continuously added to a data stream and made available for processing within seconds. This avoids loss of logs if a server (producer) fails.
- Real-time dashboards — Extract metrics and generate reports in real-time.
- Real-time data analytics — Run real-time streaming analytics on clickstreams, logs, and social media feeds to gain insights within minutes.
For the above and similar use cases, the Amazon Kinesis Data Streams (KDS) service becomes useful. Its ability to continuously capture data from hundreds of sources makes it suitable for website clickstreams, database event streams, financial transactions, social media feeds, IT logs, and location-tracking events. You can build custom applications that process or analyze the incoming streaming data within seconds for specialized needs. It comes with the following salient features —
- Scalable — Each KDS stream can be scaled by increasing the number of shards to significantly higher numbers to meet the volume requirements.
- Durable — Synchronously replicates data across three availability zones, providing high availability and data durability.
- Real-time data streaming — data is available to consumer applications such as AWS Lambda or Kinesis Data Analytics (or, via Kinesis Data Firehose, destinations such as S3) in under 70 ms.
KDS manages most IaaS aspects, such as the infrastructure, storage, networking, and configuration needed to stream your data at the desired throughput level. It takes away the pain of provisioning, deploying, and maintaining hardware, software, or other services for your data streams. Even so, Kinesis Data Streams is not a fully managed service: you still plan and manage shard capacity yourself.
Components of a Data Streaming Solution
- Data producers — to continuously add data to your data stream. You can add data to an Amazon Kinesis data stream via PutRecord and PutRecords operations, Amazon Kinesis Producer Library (KPL), or Amazon Kinesis Agent.
- Data Stream — To set up the data stream, you can use either the AWS console or the CreateStream operation.
- Data Consumers — Use Amazon Kinesis Applications (consumers) to read and process data from your data stream, using either Amazon Kinesis Data Analytics, Amazon Kinesis API or Amazon Kinesis Client Library (KCL).
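On the producer side, a record sent via PutRecords is just a data blob plus a partition key. As a minimal sketch (the event fields and stream contents are hypothetical), the helper below shapes events into PutRecords-style entries and batches them under the API's documented limit of 500 records per call; with boto3 installed, each batch could then be sent with `kinesis.put_records(StreamName=..., Records=batch)`.

```python
import json

MAX_RECORDS_PER_PUT = 500  # PutRecords accepts at most 500 records per call


def build_put_records_batches(events, key_field):
    """Turn a list of event dicts into batches of PutRecords entries.

    Each entry carries a Data blob and a PartitionKey, mirroring the
    shape that a PutRecords request expects.
    """
    entries = [
        {
            "Data": json.dumps(event).encode("utf-8"),
            "PartitionKey": str(event[key_field]),
        }
        for event in events
    ]
    # Chunk the entries so no batch exceeds the per-call record limit
    return [
        entries[i : i + MAX_RECORDS_PER_PUT]
        for i in range(0, len(entries), MAX_RECORDS_PER_PUT)
    ]


# Hypothetical clickstream events, partitioned by user id
events = [{"user_id": i % 7, "page": "/home"} for i in range(1200)]
batches = build_put_records_batches(events, "user_id")
print(len(batches), len(batches[0]))  # → 3 500
```

Keying the partition on a stable attribute (here `user_id`) keeps all of one user's events on the same shard, which matters for ordered, per-key processing.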
Building blocks of KDS solution
- Shard — is the base throughput unit of KDS. Deciding and configuring the number of shards requires pre-planning and estimation when you create a data stream. You can monitor shard-level metrics through Amazon Kinesis Data Streams or Amazon CloudWatch.
- Record — is the unit of data stored in an Amazon Kinesis data stream. A record is composed of a sequence number, a partition key, and a data blob (the data generated by the source system, stored as an immutable sequence of bytes).
- Partition key — is used to segregate and route records to different shards of a data stream. A partition key is specified by the source application before adding data to KDS; Kinesis maps each key to a shard by hashing it.
- Sequence number — is a unique identifier for each record. Sequence number is assigned by Amazon Kinesis when a data producer calls PutRecord or PutRecords operation to add data to an Amazon Kinesis data stream. Sequence numbers for the same partition key generally increase over time; the longer the time period between PutRecord or PutRecords requests, the larger the sequence numbers become.
- Amazon Kinesis Producer Library (KPL) — is an easy to use and highly configurable library that helps you put data into an Amazon Kinesis data stream. KPL presents a simple, asynchronous, and reliable interface that enables you to quickly achieve high producer throughput with minimal client resources.
- Amazon Kinesis Agent — is a pre-built Java application that offers an easy way to collect and send data to your Amazon Kinesis data stream. You can install the agent on Linux-based server environments such as web servers, log servers, and database servers. The agent monitors certain files and continuously sends data to your data stream.
- Resharding — is the process used to scale your data stream through a series of shard splits or merges. In a shard split, a single shard is divided into two shards, which increases the throughput of the data stream. In a shard merge, two shards are combined into one, which decreases the throughput. This step is required because shard capacity does not change automatically with load.
- Amazon Kinesis Client Library (KCL) — is a pre-built library that helps to build Amazon Kinesis Applications for reading and processing data from an Amazon Kinesis data stream. It lets you focus on business logic and handles several complex issues such as adapting to changes in data stream volume, load-balancing streaming data, coordinating distributed services, and processing data with fault-tolerance.
- Enhanced fan-out — is an optional feature for Kinesis Data Streams consumers. Enhanced fan-out allows developers to scale up the number of stream consumers (applications reading data from a stream in real time) by offering each stream consumer its own read throughput. It provides logical 2 MB/sec throughput pipes between consumers and shards. This allows customers to scale the number of consumers reading from a data stream in parallel, or allows data delivery in under 200 ms, while maintaining high performance. KCL v2.x takes care of automatically registering consumers and enabling enhanced fan-out. After registration, each registered consumer has its own logical enhanced fan-out throughput pipe provisioned for it. Consumers use the SubscribeToShard API to retrieve data over these pipes.
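To make the shard and partition-key building blocks concrete, here is a small sketch of the routing scheme Kinesis documents: the partition key is MD5-hashed into a 128-bit hash key, and the record lands on the shard whose hash key range contains it. The shard count and keys below are hypothetical, and this is an illustration of the mechanism, not the service's actual code.

```python
import hashlib

NUM_SHARDS = 4
HASH_KEY_SPACE = 2 ** 128  # the 128-bit MD5 hash key space Kinesis uses


def shard_ranges(num_shards):
    """Evenly split the hash key space into contiguous per-shard ranges."""
    size = HASH_KEY_SPACE // num_shards
    return [
        (i * size, HASH_KEY_SPACE - 1 if i == num_shards - 1 else (i + 1) * size - 1)
        for i in range(num_shards)
    ]


def shard_for_key(partition_key, ranges):
    """Map a partition key to a shard index via its 128-bit MD5 hash key."""
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for i, (start, end) in enumerate(ranges):
        if start <= hash_key <= end:
            return i


ranges = shard_ranges(NUM_SHARDS)
print(shard_for_key("user-42", ranges))
```

Because the hash is deterministic, the same partition key always routes to the same shard, which is what preserves per-key ordering; a shard split is simply dividing one of these ranges in two.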
Important design considerations while building streaming solution
- Produced data size — The maximum size of a data blob (the data payload before Base64 encoding) is 1 megabyte (MB).
- Retention period — The default retention period is 24 hours, and it can be increased to meet specific requirements.
- Approach to decide the throughput of KDS (initial capacity planning during provisioning) — The throughput of KDS is a function of its shards. It is critical to decide the number of shards in advance during initial provisioning, as well as to make any changes required at runtime (using resharding) due to changes in streaming volume.
As a baseline, one shard provides a capacity of 1 MB/sec data input, 2 MB/sec data output, and 5 read transactions (API calls) per second. One shard can support up to 1,000 PUT records per second.
To calculate the required shard capacity, the below steps can be followed —
- Estimate the average size of the record written to the data stream in kilobytes (KB), rounded up to the nearest 1 KB. (A)
- Estimate the number of records written to the data stream per second. (B)
- Decide the number of Amazon Kinesis Applications consuming data concurrently and independently from the data stream. (C)
- Calculate the incoming write bandwidth in KB/sec (D = A * B)
- Calculate the outgoing read bandwidth in KB/sec (E = D * C)
- Calculate the number of shards, rounding up (F = max(D/1000, E/2000))
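The steps above can be sketched as a small helper; the example numbers (record size, rate, consumer count) are hypothetical.

```python
import math


def estimate_shards(avg_record_kb, records_per_sec, num_consumers):
    """Initial shard count per the capacity-planning formula above.

    A = average record size in KB (rounded up), B = records/sec,
    C = number of concurrently consuming applications.
    """
    a = math.ceil(avg_record_kb)                           # A
    d = a * records_per_sec                                # D: write KB/sec
    e = d * num_consumers                                  # E: read KB/sec
    # One shard ingests 1000 KB/sec and serves 2000 KB/sec of reads
    return max(math.ceil(d / 1000), math.ceil(e / 2000))   # F


# e.g. 2 KB records at 500 records/sec, read by 3 applications
print(estimate_shards(2, 500, 3))  # → 2
```

In this example writes alone need only 1 shard (1,000 KB/sec), but three consumers pull 3,000 KB/sec of reads, so read throughput is what drives the count to 2.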
When to use KDS over other AWS services?
It often gets confusing to choose between Amazon Kinesis Data Streams and Amazon SQS due to the similarity of some of their features. However, each of these services is distinct in nature and applicable to a completely different set of use cases. The following points help to uncover the differences —
Amazon Kinesis Data Streams
- Helpful when routing of related records to the same record processor is important, for example counting, aggregation, or averaging in a MapReduce-style application.
- Maintains the order of records, for example keeping log entries ordered by timestamp.
- Ability for multiple applications to consume the same stream concurrently.
- Ability to consume records in the same order a few hours later.
Amazon SQS
- Good to use when tracking of individual work items is important. For example, Amazon SQS tracks the ack/fail of work items in a queue so the application does not have to maintain a persistent checkpoint/cursor. Amazon SQS deletes acknowledged messages and redelivers failed messages after a configured visibility timeout.
- Individual message delay (up to 15 minutes) or scheduling.
- Dynamically increasing concurrency/throughput at read time. For example, you have a work queue and want to add more readers until the backlog is cleared.
- Leveraging Amazon SQS’s ability to scale transparently without any provisioning instructions from you.
The content in this article is a collection of personal learning and experience. To share feedback or comments, please use the clap/response feature on the Medium platform. You can also reach out to me on LinkedIn.