Simple serverless data pipeline using AWS Kinesis and AWS Lambda

TL;DR: The source code for this example and instructions on how to run it can be found at https://github.com/chaliy/play-datapipeline-kinesis.

A few months ago I was building a small service for a manufacturing company to keep track of their product lifecycle, from production all the way to retirement. Production, Serialization (the process of assigning a serial number to a product), Inventory, Dealer Registration, Service, Theft, and Refabrication, to name a few lifecycle entries. Of course, all of them have some metadata attached, which gives us the ability to use the data for further analytics. There are a few non-functional requirements: throughput of about 20 million lifecycle entries per month (roughly 8 entries per second on average), support for spikes where batches of 100k lifecycle events are sent in a really short period, and, finally, minimal processing time, because in some use cases the system will be used interactively.

We built a few PoCs to showcase possible features and technical decisions, and the one I found particularly interesting is a fully serverless implementation of the data pipeline. This article is basically a reimplementation of the PoC we built, with a few more details and with faked data.


Technical Overview

What we need is a classic data pipeline implemented on AWS, so the technology choices are quite obvious.

AWS Kinesis — a managed AWS service for highly available and scalable, but dumb, queues. You can think of it as Apache Kafka by Amazon.

AWS Lambda — AWS Function as a Service. We will use it to process entries.

AWS DynamoDB — AWS NoSQL document-oriented database.

A cartoon of the system looks like this:

Producer module. A simple NodeJS app that feeds the system with fake data. It uses the AWS SDK to push records to AWS Kinesis.
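
For illustration only, a minimal producer could look roughly like this, assuming the AWS SDK for JavaScript v2; the stream name, region, and record shape are my assumptions here, not necessarily what the repository uses:

    // Minimal producer sketch: generates fake lifecycle entries and pushes them to Kinesis.
    // Stream name, region, and record shape are assumptions for illustration.
    const AWS = require('aws-sdk');
    const kinesis = new AWS.Kinesis({ region: 'us-east-1' });

    const STREAM_NAME = 'lifecycle-entries'; // assumed stream name

    function fakeEntry() {
      return {
        serial_number: String(Math.floor(Math.random() * 1e9)),
        type: 'Serialization',
        at: new Date().toISOString()
      };
    }

    async function pushBatch(size) {
      const records = Array.from({ length: size }, () => {
        const entry = fakeEntry();
        return {
          Data: JSON.stringify(entry),
          // Partitioning by serial number keeps all entries of one product on the same shard
          PartitionKey: entry.serial_number
        };
      });
      const result = await kinesis.putRecords({ StreamName: STREAM_NAME, Records: records }).promise();
      console.log(`Pushed ${size} records, ${result.FailedRecordCount} failed`);
    }

    pushBatch(100).catch(console.error);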

Kinesis module. A simple Kinesis stream without any specific configuration.
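
In the repository the stream is provisioned as infrastructure (see the Terraform note in the findings below), but for the sake of illustration the roughly equivalent AWS SDK call would be this; the stream name and shard count are assumptions:

    // Illustration only: creating a Kinesis stream via the AWS SDK.
    // Stream name and shard count are assumptions.
    const AWS = require('aws-sdk');
    const kinesis = new AWS.Kinesis({ region: 'us-east-1' });

    kinesis.createStream({ StreamName: 'lifecycle-entries', ShardCount: 1 }, (err) => {
      if (err) console.error(err);
      else console.log('Stream creation started');
    });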

Process module. An AWS Lambda function configured with the Kinesis stream as its event source. The function loads the product from the database and adds a lifecycle entry.
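
A handler for this could look roughly like the sketch below, assuming Node.js and the AWS SDK v2; the table name, key attribute, and entry shape are assumptions, and the naive read-modify-write is just to illustrate the flow, not the exact code from the PoC:

    // Sketch of a Lambda handler consuming Kinesis records (names are assumptions).
    const AWS = require('aws-sdk');
    const db = new AWS.DynamoDB.DocumentClient();

    const TABLE_NAME = process.env.PRODUCTS_TABLE || 'products'; // assumed table name

    exports.handler = async (event) => {
      for (const record of event.Records) {
        // Kinesis record payloads arrive base64-encoded
        const entry = JSON.parse(Buffer.from(record.kinesis.data, 'base64').toString('utf8'));

        // Load the product by serial number
        const { Item: product } = await db.get({
          TableName: TABLE_NAME,
          Key: { serial_number: entry.serial_number }
        }).promise();

        if (!product) continue; // the real PoC may handle unknown products differently

        // Append the lifecycle entry and write the product back
        product.lifecycle = (product.lifecycle || []).concat([entry]);
        await db.put({ TableName: TABLE_NAME, Item: product }).promise();
      }
    };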

Db module. An AWS DynamoDB table with the serial number as the hash key. For this demo, access by key is the only use case, so having only a hash key is enough.
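
For reference, the table definition boils down to something like this; again an SDK illustration with assumed table name and throughput values, while the repository would typically define the table as infrastructure:

    // Illustration only: a DynamoDB table keyed solely by serial number.
    // Table name and provisioned throughput values are assumptions.
    const AWS = require('aws-sdk');
    const dynamodb = new AWS.DynamoDB({ region: 'us-east-1' });

    dynamodb.createTable({
      TableName: 'products',
      AttributeDefinitions: [{ AttributeName: 'serial_number', AttributeType: 'S' }],
      KeySchema: [{ AttributeName: 'serial_number', KeyType: 'HASH' }],
      ProvisionedThroughput: { ReadCapacityUnits: 5, WriteCapacityUnits: 5 }
    }, (err, data) => {
      if (err) console.error(err);
      else console.log('Created table', data.TableDescription.TableName);
    });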

Well... technically, that is it. The source code and instructions on how to run it can be found at https://github.com/chaliy/play-datapipeline-kinesis.

Dashboard of the pipeline.

Interesting Findings

  1. There is a funny relation between the AWS Lambda memory limit and AWS DynamoDB latency. Our records show that with 128MB, requests could take up to 250ms even when hot. With 1024MB, latency is just about 20ms. Most likely this is because Lambda allocates CPU power in proportion to the configured memory.
  2. Terraform has an issue with how it handles the AWS Kinesis Stream shard count. Basically, when you change shard_count, Terraform recreates the stream, effectively wiping your data. Issue for reference.

Conclusion

It was quite easy to set up a basic demo. However, the devil is in the details. There is no automatic elasticity, so you somehow need to handle this yourself. Thresholds are everywhere: AWS Lambda timeouts, AWS DynamoDB — “ProvisionedThroughputExceededException: The level of configured provisioned throughput for the table was exceeded. Consider increasing your provisioning level with the UpdateTable API”. All these limits require attention that I would rather not spend time on. In our PoC we handled a few edge cases like this, so I hope to publish more details later on.

Fork my example at https://github.com/chaliy/play-datapipeline-kinesis :).

Enjoy.