Scalability and reliability with Kinesis

Scalability and reliability don’t always go hand in hand, especially when you don’t want to spend an enormous amount of money on the solution. We’re investigating the new AWS tools Kinesis, Firehose and Lambda to improve our real-time analytics processing. The first results look very promising.

To check the viability of the solution we decided to start with the following use case: a simple pipeline where clients can send 5000 analytics events per second, which we validate and store in S3. New data should arrive in the S3 bucket in near real time (within minutes), and in our tests we should not lose any of the events.

The first step in this flow is to store events in a Kinesis stream. Kinesis is an Amazon service that can handle a huge number of writes (and also reads) per second. All records are persisted for 24 hours (this can be extended to 7 days). From a reliability point of view this is excellent. Once we get data into Kinesis we know it is safe: we always have enough time to process it, even during peak hours.

The mechanism for scaling Kinesis streams is sharding. With one shard a stream can handle 1000 writes per second. At the moment you can specify 1–1000 shards per stream. For our case we chose 10 (10000 writes/s), which should be more than enough. To distribute load between shards you have to specify a partition key for each record. We didn’t want to group our events by shard, so we just used a random value for the partition key, which worked very well.
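The shard arithmetic and the random key can be sketched in a few lines. This is only an illustration in Python (not the language of our API); the function names and the `headroom` parameter are made up for the example, and the only number taken from above is the 1000 writes per second per shard.

```python
import math
import uuid

# Per-shard write limit quoted in the text above.
WRITES_PER_SHARD_PER_SECOND = 1000


def shards_needed(events_per_second, headroom=1.0):
    """Smallest shard count that covers the target rate times a headroom factor."""
    required = events_per_second * headroom / WRITES_PER_SHARD_PER_SECOND
    return max(1, math.ceil(required))


def random_partition_key():
    """A random key per record, so records spread roughly evenly over the shards."""
    return uuid.uuid4().hex
```

With these assumptions, 5000 events per second need 5 shards at minimum, and a headroom factor of 2 gives the 10 shards we chose.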

At the moment our API for analytics events is written in Ruby, so we used the AWS Ruby SDK for sending (or putting, in Kinesis terms) the events. The client supports both sending events one by one and sending a batch of multiple events. We had hoped to go with the former option. However, it turned out that it wasn’t easy to get the performance we wanted that way. The Kinesis stream itself can take in a lot of traffic, but when sending events one by one we would need too many machines for sending them. So we decided to buffer the events a bit (10–20 events) and then send them as a batch. That seemed to be reliable enough, and that way we need only a few machines in front of Kinesis.
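The buffering idea looks roughly like the sketch below. It is written in Python rather than the Ruby we actually use, and it is not our production code: the class name, the default batch size and the choice to keep rejected records in the buffer are assumptions for illustration. The `client` is expected to behave like `boto3.client("kinesis")`, whose `put_records` call returns one result per record in request order, with an `ErrorCode` on the ones that failed.

```python
import json
import uuid


class BufferedKinesisSender:
    """Collects events in memory and puts them to Kinesis in small batches."""

    def __init__(self, client, stream_name, batch_size=20):
        self.client = client            # e.g. boto3.client("kinesis")
        self.stream_name = stream_name  # hypothetical stream name supplied by the caller
        self.batch_size = batch_size
        self.buffer = []

    def send(self, event):
        """Buffer one event and flush once the batch is full."""
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send everything in the buffer with a single put_records call."""
        if not self.buffer:
            return
        records = [
            {"Data": json.dumps(event).encode("utf-8"),
             "PartitionKey": uuid.uuid4().hex}  # random key, as described above
            for event in self.buffer
        ]
        response = self.client.put_records(
            StreamName=self.stream_name, Records=records
        )
        # Assumption: rejected records stay in the buffer and go out with the next batch.
        self.buffer = [
            event
            for event, result in zip(self.buffer, response["Records"])
            if "ErrorCode" in result
        ]
```

The trade-off is that events sitting in the buffer are lost if the process dies, so the batch size should stay small.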

The next step is to read the data from Kinesis and send it to S3. We chose to use Lambda and Firehose for this purpose. Lambda is a hosted technology where you basically provide a function that takes some piece of input, processes it and sends it forward. Amazon can run as many of these functions in parallel as needed to process the data in near real time.

In our case we configured the event source of Lambda to be the Kinesis stream with a batch size of 500. Every time 500 events have arrived in the Kinesis stream our function is launched with those events as input. In our scenario there should always be around 10 functions running in parallel. Our Lambda function’s job is to write those events to Firehose, which is again a hosted Amazon service (of course).
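To give an idea of what such a function receives: Lambda hands a Kinesis-triggered function an event with a `Records` list, where each record carries its payload base64-encoded under `kinesis.data`. The sketch below is in Python, not the Clojure we use, and it assumes the payloads are JSON; the `forward` parameter is a hypothetical hook standing in for the Firehose write.

```python
import base64
import json


def decode_kinesis_records(event):
    """Return the JSON payloads carried by a Kinesis-triggered Lambda event."""
    return [
        json.loads(base64.b64decode(record["kinesis"]["data"]))
        for record in event["Records"]
    ]


def handler(event, context, forward=print):
    """Lambda entry point: decode the batch and pass it on.

    `forward` is a hypothetical hook; in a real function it would write to Firehose.
    """
    events = decode_kinesis_records(event)
    forward(events)
    return len(events)
```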

Firehose is a stream where you can send records, and it then takes care of writing them to S3 or Redshift. In our case we need S3. We chose to use Firehose because it does all the heavy lifting related to S3: buffering to write reasonably sized files, creating a folder structure based on time, and gzipping the files. Our job was just to configure Firehose and send the events to the stream. Both the configuration and the sending of events were easy. However, at the moment Firehose has a limit of 5000 events per second. During our tests we noticed that we sometimes hit that limit. We got errors with the message “Slow down.”, which were probably related to this. Firehose does not support sharding, but you can ask for more capacity using a request form (as usual with Amazon). However, it’s easy to just add more Firehose streams. We decided to create two streams and distribute events evenly between them in our Lambda function.
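One simple way to spread a batch evenly over several delivery streams is round-robin, sketched below in Python. This is not our actual Lambda code: the stream names are hypothetical, and the trailing newline per record is an assumption to keep the events separable in the S3 files. The `client` is expected to behave like `boto3.client("firehose")` and its `put_record_batch` call.

```python
import json

# Hypothetical delivery stream names, for illustration only.
DELIVERY_STREAMS = ["analytics-events-1", "analytics-events-2"]


def split_evenly(events, stream_names):
    """Round-robin events over the streams so each gets an equal share, give or take one."""
    batches = {name: [] for name in stream_names}
    for index, event in enumerate(events):
        batches[stream_names[index % len(stream_names)]].append(event)
    return batches


def send_to_firehose(client, events, stream_names=DELIVERY_STREAMS):
    """Write one batch per delivery stream and return the responses keyed by stream name."""
    responses = {}
    for name, batch in split_evenly(events, stream_names).items():
        if not batch:
            continue
        # One JSON document per line (an assumption about the file layout).
        records = [{"Data": (json.dumps(event) + "\n").encode("utf-8")} for event in batch]
        responses[name] = client.put_record_batch(
            DeliveryStreamName=name, Records=records
        )
    return responses
```

With a Lambda batch of 500 events and two streams, each call carries 250 records.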

Worked like a charm. The other problem was how to handle errors. In case of a temporary connection error to Firehose we can just throw an exception. If a Java-based (or in our case Clojure-based) Lambda function throws an exception, it is marked as failed and the data it was processing is retried until it succeeds (within 24 hours). Amazon does not provide insight into how this retry logic works, but in our tests it worked really nicely. If the connection succeeds we always get a response. The request can succeed, fail totally or succeed partially. The total failure case is easy to handle: just throw an exception and the retry mechanism will take care of the rest. Partial failure is a bit tougher. We got that sometimes during our tests. It was always this “Slow down.” error, where some of the events went through and some did not. We chose to use the following logic:

With partial failures we just log the failed events. We have a distributed logging solution where we can monitor these and reprocess the events if needed. Another option would have been to put the failed events back into the same Kinesis stream. However, one reason for trying out Kinesis was to avoid hand-coding our own retry mechanisms, so we decided not to go with that option. In all of our tests the reason for partial failures was the Firehose write limit. The Firehose pricing model is based not on the number of streams but on the amount of data ingested. So to keep things simple it was easier to add enough streams to be sure we don’t get those errors. Now we monitor the solution to check whether we ever get partial failures in production. If not, we have saved a lot of error-prone code.
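The decision logic described above could be sketched like this. It is a Python illustration rather than our Clojure code, and the function and logger names are made up. It relies on the shape of the Firehose `put_record_batch` response, where `RequestResponses` holds one entry per record in request order and failed entries carry an `ErrorCode`.

```python
import logging

logger = logging.getLogger("firehose-forwarder")  # hypothetical logger name


def handle_firehose_response(events, response):
    """Raise on total failure, log the failed events on partial failure.

    Raising makes the Lambda invocation fail, so the whole batch is retried.
    Returns the list of events that Firehose rejected.
    """
    results = response["RequestResponses"]
    failed = [
        event for event, result in zip(events, results) if "ErrorCode" in result
    ]
    if failed and len(failed) == len(events):
        raise RuntimeError("Firehose rejected the whole batch")
    for event in failed:
        # Partial failure: log it so it can be monitored and reprocessed later.
        logger.error("Firehose write failed for event: %r", event)
    return failed
```

Raising on a partial failure instead would make Lambda retry the whole batch, which would write the already accepted events a second time.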

And that was all. Setting this up was rather easy and it’s closer to real time than I expected. We’ve been using clj-gatling for load testing the solution, and in our tests the events have always been in S3 within a minute. This also makes testing a lot easier because we don’t have to wait ages for the results. We haven’t yet calculated the exact price, but it looks like this kind of solution costs just a few hundred bucks a month. And the solution scales linearly: you get double the capacity by paying double the money. It seems we can get both reliability and scalability with this…