AWS Data Analytics — Kinesis Part-2

Kemalcan Bora
BilgeAdam Teknoloji
3 min readMay 24, 2020

Kinesis Producer

The first one is the SDK (Software development kit) allows the write you code or use CLIA to directly send data into Kinesis. Second one is Kinesis Producer (KP) thirt one is Kinesis agent and this agent basically allows you to get a log file. Last part is 3th lib. spark, flume etc. all these things will send data to kinesis.

Kinesis Producer SDK

  • Used PutRecord (one) and PutRecords (many)
  • PutRecords used batching you can send records as part of one HTTP request and therefore you’re saving in the HTTP
  • ProvisionedThoughputExceeded if we go over limit.

Managed:

  • AWS IoT
  • CloudWatch
  • Kinesis Data Analytics

AWS Kinesis Exceptions

  • ProvisionedThoughputExceeded: Exceeding MB/s or TPS for any shard
  • Bad key in partition

Basic Solutions for Exception

  • Retries with backoff
  • Increase Shards

Kinesis Producer Library

Synchronous and Asynchronous API but butter perfomance in async, you can watch with CloudWatch for monitoring. Very important mechanism is batching so this batching divided two part first part is Collect and second is Aggregate.

Collect records and write to multiple shards in the same PutRecords API

Aggregate increased latency capability to store multiple records in one record go over 1000 record per a sec. limit max limit is 1 mb /sec.

Important things: For example you have 3 part data 20KB- 30 KB- 200KB and all this data come in respectively 3 ms, 5 ms, 6 ms actually all this datas <1mb and you can use PutRecords API called one times. But how? introducing some delay with RecordMaxBufferdTime(default 100ms). Basically you’re saying I’m willing to wait 100ms so that’s adding a little bit of lateceny at the tradeoff of being more efficient.

Last part is Kinesis Agent so this agent monitoring log files and sends them to kinesis data streams its built on top KPL (Kinesis Producer Library) write from multiple directories, the agent handles file rotation, checkpoints and rety upon failures and emits metrics to CloudWatch for monitoring

Kinesis Consumer Classic

Steams goes everwhere for example Lambda, Firehose, Spark, SDK,client etc.

Classic Kinesis — Records are polled by consumers from a shards. Easch shard has 2 MB total aggregate. How it’s work? Consumer talking to for example shard-1 and say hey dude! GetRecords() and shard-1 saying ok buddy this is for you “data”. GetRecords return up to 10 mb of data then throttle for 5 sec. or you can get up to 10.000 records. Maximum of 5 getrecords api calls per shard per second 200ms.

Important: If 5 client consume frome same shard means every consumer can poll once a sec. less than 400kb/s.

Kinesis Client Library

Read records from Kinesis Producer.

Share multiple shards with multiple consumers in one group it’s shards discovery.

Checkpoint future to resume progress! So if application goes down and comes back up it’s able to remember exactly where it was consuming last in order to resume the progress.

So how does checkpoint works and all the shard discovery?

Well basically uses an DynamoDB. DB table to check point the progress over time and synchronize to see who is going to read which shard. So DynamoDB will be used for check pointing and it will have one row in the table for each shard to consume from.

Kinesis Connector Library

Kinesis data streams outputs is used to write data to S3, DynamoDb, Redshift or ES. This connector runining on EC2 it’s be must!

Kinesis Enchanced Fan Out

It’s work on KCL 2.0 and lambda. What is kinesis enchanced fan out so each consumer will get 2 mb per second per a shard. It’s look similar as before but it’s not exactly same thing. Producer goes to Kinesis Data Streams and consumer use SubscribeToShard() so we just start and that’s all. Not pulling anymore. That’s means if we have 20 consumer overall we’ll get 40mb/s per a shard. No more 2mb limit!! reason is kinesis pushes data to consumer over http/2. Reduce latency 70ms.

Enchanced vs Standart Consumer

Standart consumer use low number 1–2–3 bla bla, can tolerate 200ms latency, minimize cost

Enchanced is multiple consumer applications for the same Streams, low latency 70ms. Default limit of 5 consumers using enchanced fan-out per data stream.

--

--