Last year, in my current company, we spent three months to re-architect and implement a new data capture pipeline based on some ideas how a good dataflow system should be, such as immutable rawest data, log-based message broker, recomputation, and etc. The new capture pipeline is much more scalable and reliable. With the log-based message broker, it has the ability to reprocess all raw data and generate derived data when needed.
We chose to use AWS Kinesis as our log-based message broker which is an easy-to-use and reliable service. More importantly, we don’t need to operate it by ourselves. Then, we use KCL Python to consume messages from a kinesis stream. The library deals with many complexities for us, such as dynamically managing record processors when adding or removing shards, load-balancing shards when running stream processors on multiple machines. We only need to write the logic of record processor.
Issues with using Kinesis Client Library in Python
However, we have some issues with this solution:
- Logging to stdout is impossible
- Extra code and management effort, e.g., ECS service and DynamoDB
- Slow down the speed of building docker image
The major issue is that we couldn’t directly print out logs to stdout and then stream JSON formatted logs to CloudWatch for further debugging and tracing issues (we follow 12 factors to design our logging strategy). KCL is written in Java and a technique, called MultiLangDaemon, is used for supporting different languages, like Python. When we start a KCL application, it starts a master process (in Java) and a sub-process (the record processor written in Python), and the communication is based on stdin/stdout. It doesn’t support JSON format output from the record processor. So we cannot print out the log to stdout and we use an alternative way to dump logs into a file, rotate it to avoid too big, and start a container to watch its changes (tail -f).
A serverless solution
I was thinking a way to elegantly solve this issue until I read Scaling to 200 million users with 3 engineers. The author introduced a way to use Lambda function to read from Kinesis stream and putting the messages into a queue in SQS (refer to Implementing real-time analytics using AWS Kinesis & Lambda). This is a fantastic way to solve this issue, so I create a demo by provisioning AWS resources with CloudFormation and troposphere. The architecture is shown as follows:
The code is in my github repo. I used the code of lambda function from the previous article (refer to Implementing real-time analytics using AWS Kinesis & Lambda) and fixed a few issues with python 3. What I have done is to use CloudFormation template (generated by using troposphere) to provision a kinesis stream, a lambda function and a SQS queue. The details of lambda function can be referred to the original article which has explained clearly.
This is a really nice approach to solve the issue of consuming from kinesis, a few lines of code and no effort to build images and deploy containers. But only be aware of the cost model of lambda function how AWS is going to charge you.