AWS Data Analytics — Kinesis Part-3

Kemalcan Bora
BilgeAdam Teknoloji
3 min read · May 25, 2020
Kinesis Scaling

Kinesis Operations -> Adding Shards

Also called shard splitting, this can be used to increase the stream's capacity (1 MB/s of data in per shard): if you have 10 shards, you have 10 MB/s in.

It can also be used to divide a “hot shard”.

So what happens when you split a shard? The old shard is closed and will be deleted once its data expires.

For example, say we have 3 shards, each covering an equal key space:

| Shard-1| Shard-2| Shard-3|

Let’s imagine shard-2 is very hot and we want to split it to increase throughput on its key space. When we run the split operation, shard-2 is divided into two new shards, shard-4 and shard-5:

| Shard-1| Shard-4|Shard-5| Shard-3|

So shard-2 will still be readable as long as the data in it has not expired, but once it expires the shard will be gone.
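The split above can be sketched with boto3. Everything named here (the stream name "my-stream", the shard index) is a hypothetical example; the split point is just the midpoint of the parent shard's hash key range, which yields two evenly sized children:

```python
# A minimal sketch of a shard split, assuming boto3 and a hypothetical
# stream called "my-stream". To split a hot shard we choose a
# NewStartingHashKey inside the parent shard's hash key range; picking
# the midpoint splits the range into two equal halves.

def split_point(starting_hash_key: str, ending_hash_key: str) -> str:
    """Return the midpoint of a shard's hash key range as a decimal string."""
    start, end = int(starting_hash_key), int(ending_hash_key)
    return str(start + (end - start) // 2)

# With boto3 the call itself would look like this (not run here):
#
#   import boto3
#   kinesis = boto3.client("kinesis")
#   shard = kinesis.describe_stream(StreamName="my-stream")["StreamDescription"]["Shards"][1]
#   kinesis.split_shard(
#       StreamName="my-stream",
#       ShardToSplit=shard["ShardId"],
#       NewStartingHashKey=split_point(
#           shard["HashKeyRange"]["StartingHashKey"],
#           shard["HashKeyRange"]["EndingHashKey"],
#       ),
#   )

# Kinesis hash keys span 0 .. 2**128 - 1; splitting that full range in half:
print(split_point("0", str(2**128 - 1)))
```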

Kinesis Operations -> Merging Shards

Merging shards decreases the stream capacity and saves cost; it can be used to group two shards with low traffic.

Again, the old shards are closed and deleted once their data expires.

| Shard-1| Shard-4|Shard-5| Shard-3|

For example, we merge shard-1 and shard-4 because they did not get much traffic, so we can combine them and save some cost:

| Shard-6|Shard-5| Shard-3|
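The merge can also be sketched with boto3; the shard IDs below are hypothetical. One constraint worth encoding: two shards can only be merged if their hash key ranges are adjacent.

```python
# A minimal sketch of merging two cold shards, assuming boto3 and
# hypothetical shard IDs. Kinesis only merges shards whose hash key
# ranges are adjacent.

def are_adjacent(shard_a: dict, shard_b: dict) -> bool:
    """True if shard_b's hash key range starts right after shard_a's ends."""
    return int(shard_b["HashKeyRange"]["StartingHashKey"]) == \
        int(shard_a["HashKeyRange"]["EndingHashKey"]) + 1

# With boto3 the merge call would look like this (not run here):
#
#   import boto3
#   kinesis = boto3.client("kinesis")
#   kinesis.merge_shards(
#       StreamName="my-stream",
#       ShardToMerge="shardId-000000000001",          # e.g. shard-1
#       AdjacentShardToMerge="shardId-000000000004",  # e.g. shard-4
#   )

a = {"HashKeyRange": {"StartingHashKey": "0", "EndingHashKey": "99"}}
b = {"HashKeyRange": {"StartingHashKey": "100", "EndingHashKey": "199"}}
print(are_adjacent(a, b))  # True
```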

So what about auto-scaling?

  • Auto scaling is not a native feature of Kinesis
  • The API call to change the number of shards in Kinesis is UpdateShardCount
  • We can implement auto-scaling with AWS Lambda
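The Lambda idea above can be sketched as follows. The event fields (stream_name, target_shards) are hypothetical; in practice a CloudWatch alarm (e.g. on IncomingBytes) could trigger this function, which then calls the UpdateShardCount API:

```python
# A sketch of auto-scaling via Lambda, with hypothetical event fields.
# The kinesis parameter is injectable so the handler can be exercised
# without a real AWS client.

def scale_stream(kinesis, stream_name: str, target_shards: int) -> None:
    # UNIFORM_SCALING evenly redistributes the hash key space.
    kinesis.update_shard_count(
        StreamName=stream_name,
        TargetShardCount=target_shards,
        ScalingType="UNIFORM_SCALING",
    )

def lambda_handler(event, context, kinesis=None):
    if kinesis is None:
        import boto3  # only create a real client when running inside Lambda
        kinesis = boto3.client("kinesis")
    scale_stream(kinesis, event["stream_name"], event["target_shards"])
```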

Notes:

  • Resharding can’t be done in parallel. You need to plan capacity in advance.

And don’t do the following:

  • Scale up to more than double your current shard count for a stream
  • Scale up to more than 500 shards in a stream
  • Scale up to more than the shard limit for your account

Again: resharding cannot be done in parallel, so you need to plan ahead; resharding takes a few seconds for each shard.
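The three scale-up limits above can be encoded in a small helper. This is just an illustration; the name clamp_scale_up and the default account limit of 500 are assumptions for the sketch:

```python
# A small helper encoding the scale-up limits above: the target must not
# exceed double the current shard count, 500 shards per stream, or the
# account's shard limit (account_limit is a hypothetical default here).

def clamp_scale_up(current: int, desired: int, account_limit: int = 500) -> int:
    """Clamp a scale-up target to what UpdateShardCount will accept."""
    return min(desired, current * 2, 500, account_limit)

print(clamp_scale_up(10, 100))   # capped at double the current count: 20
print(clamp_scale_up(300, 590))  # capped at the 500-shard stream limit
```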

Kinesis Security

  • Control access / auth using IAM
  • Encryption in flight using the HTTPS endpoint
  • Encryption at rest using KMS
  • VPC endpoints available for Kinesis, to access it from within a VPC

AWS Kinesis Data Firehose

  • Fully managed service, no administration
  • Near real time! (60-second latency minimum.) Why? Because of batching: there is a minimum 60-second latency if your batch is not full, so we have no guarantee the data will arrive at the destination right away.
  • Load data into S3, ES, Splunk, Redshift
  • AutoScaling
  • Data conversion from JSON to Parquet (S3 only)
  • Data transformation through AWS Lambda (e.g., CSV -> JSON)
  • Compression: GZIP, ZIP, SNAPPY
  • Only GZIP if the data is further loaded into Redshift
  • Spark or Kinesis Client Library can’t read from Kinesis Data Firehose

The SDK, Kinesis Agent, Kinesis Data Streams, CloudWatch, and IoT can send data to Kinesis Firehose. Firehose can apply some transformation via a Lambda function, and after that we store the data in S3, ES, Splunk, or Redshift.
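The transformation step can be sketched as a Firehose transformation Lambda doing the CSV -> JSON conversion mentioned above. Firehose passes each record with a recordId and base64-encoded data, and expects each record echoed back with a result of "Ok", "Dropped", or "ProcessingFailed"; the CSV column names below are hypothetical:

```python
# A minimal sketch of a Firehose transformation Lambda (CSV -> JSON).
# FIELDS is a hypothetical schema for the incoming CSV lines.
import base64
import json

FIELDS = ["user", "action"]  # hypothetical CSV columns

def lambda_handler(event, context):
    out = []
    for record in event["records"]:
        # Incoming data is base64-encoded; decode, parse the CSV line.
        csv_line = base64.b64decode(record["data"]).decode("utf-8").strip()
        row = dict(zip(FIELDS, csv_line.split(",")))
        # Re-encode the JSON result; the recordId must be echoed back.
        out.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode((json.dumps(row) + "\n").encode()).decode(),
        })
    return {"records": out}
```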

For example, with a buffer size of 32 MB and a buffer interval of 2 minutes, the buffer is flushed as soon as either limit is reached.
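Those buffer settings are expressed as BufferingHints when creating a delivery stream with boto3. The stream name, role, and bucket ARNs below are hypothetical placeholders:

```python
# A sketch of the buffering hints from the example above: whichever limit
# is hit first (32 MB or 120 seconds) triggers a flush to the destination.

buffering_hints = {"SizeInMBs": 32, "IntervalInSeconds": 120}

# With boto3 they would be attached to the delivery stream like this
# (not run here; names and ARNs are hypothetical):
#
#   import boto3
#   firehose = boto3.client("firehose")
#   firehose.create_delivery_stream(
#       DeliveryStreamName="my-firehose",
#       ExtendedS3DestinationConfiguration={
#           "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
#           "BucketARN": "arn:aws:s3:::my-bucket",
#           "BufferingHints": buffering_hints,
#       },
#   )

print(buffering_hints)
```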

Kinesis Data Streams vs Firehose

Streams

  • You are going to write custom code (producer/consumer)
  • Real time
  • Must manage scaling (shard splitting/merging)
  • Data storage for 1–7 days

Firehose

  • Fully managed; sends to S3, ES, Splunk, Redshift
  • Serverless data transformation with Lambda
  • Near real time
  • AutoScaling
  • No data storage
