Building a serverless data analytics pipeline
Earlier this year, we learned that our data analytics service, Keen.io, had been acquired and was changing its pricing structure. The new pricing just wasn't going to work for us: this one service would have accounted for about 30% of our cloud costs. As a young startup we always try to be resourceful, so naturally I figured we'd be better off saving that money and investing it in other areas.
From the get-go, the majority of our infrastructure and APIs have run on AWS with API Gateway + Lambda, so we thought: “Why not design and implement a robust data analytics pipeline to replace Keen using only serverless technologies?” Check out the overall solution design:
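To make the flow concrete, here's a minimal sketch of what the collector Lambda behind API Gateway might look like. The payload shape, field names, and response codes are illustrative assumptions, not our exact implementation, and the actual SQS send (a boto3 `send_message` call) is elided since it needs AWS credentials:

```python
import json
import time


def normalize_event(raw_body):
    """Validate an incoming analytics event and stamp it with a
    server-side timestamp. The payload shape here is hypothetical."""
    event = json.loads(raw_body)
    if "type" not in event:
        raise ValueError("analytics event must have a 'type' field")
    event.setdefault("received_at", int(time.time() * 1000))
    return event


def handler(event, context):
    """API Gateway (proxy integration) entry point.

    In the real pipeline the normalized event is forwarded to SQS,
    e.g. boto3.client('sqs').send_message(...); that call is elided here.
    """
    try:
        normalized = normalize_event(event["body"])
    except (KeyError, ValueError):
        return {"statusCode": 400, "body": json.dumps({"error": "bad event"})}
    # sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(normalized))
    return {"statusCode": 202, "body": json.dumps({"accepted": True})}
```

Keeping the handler this thin is deliberate: it accepts the event fast and lets the queue absorb the load.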
You might be wondering, “Why didn’t they use AWS Kinesis streams instead of an SQS queue?” Well, we figured it would be overkill at this point. The SQS solution can handle millions of events per day and that’s definitely enough for us at this stage.
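One detail worth knowing when writing events to SQS at volume: `SendMessageBatch` accepts at most 10 messages per call, so events get chunked before sending. A small sketch of that batching (the boto3 call itself is shown only as a comment):

```python
import json

SQS_MAX_BATCH = 10  # SendMessageBatch accepts at most 10 entries per call


def to_batches(events):
    """Turn a list of analytics events into SendMessageBatch entry lists.
    Each entry needs an Id that is unique within its request, plus a body."""
    entries = [
        {"Id": str(i), "MessageBody": json.dumps(e)}
        for i, e in enumerate(events)
    ]
    return [
        entries[i:i + SQS_MAX_BATCH]
        for i in range(0, len(entries), SQS_MAX_BATCH)
    ]

# In the real collector, each batch would then go out via
# boto3.client('sqs').send_message_batch(QueueUrl=..., Entries=batch).
```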
This solution provides analytics in real time, which is a great feature, but if we don’t need it, all we have to do is change the collector Lambda function’s ‘reserved concurrency’. That setting acts as a throttle, so our internal Elasticsearch cluster doesn’t get overwhelmed during traffic spikes.
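Setting that throttle is a one-line change. With the AWS CLI it looks like this (the function name and concurrency value are placeholders, not our production settings):

```shell
# Cap the collector at 10 concurrent executions so Elasticsearch
# ingestion stays bounded during spikes (values are illustrative).
aws lambda put-function-concurrency \
  --function-name collector \
  --reserved-concurrent-executions 10
```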
Kibana is a great tool for data visualization, enabling us to build real-time dashboards and visualizations very quickly. Here’s an example of what we can visualize with minimal effort:
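Under the hood, a Kibana visualization like this is just an Elasticsearch aggregation. For example, an hourly event count broken down by type could look like the query below (the index and field names are made up for illustration):

```json
GET analytics-events/_search
{
  "size": 0,
  "query": { "range": { "received_at": { "gte": "now-24h" } } },
  "aggs": {
    "per_hour": {
      "date_histogram": { "field": "received_at", "interval": "1h" },
      "aggs": {
        "by_type": { "terms": { "field": "type" } }
      }
    }
  }
}
```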
Monitoring & Observability
Now that we’re relying on an internal solution for our data analytics pipeline, we need to continually monitor the health and reliability of the system. We decided to go with IOpipe to observe and trace our collector Lambda’s executions and get notified in Slack if something’s up. We’re now using IOpipe not only for our data analytics pipeline but for all of our Lambda functions. The great thing about their service is that we can inspect Lambda events to search for specific executions, analyze execution duration and memory consumption, and trace external calls, all in real time:
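We won’t reproduce IOpipe’s actual agent API here, but the core idea, wrapping the handler so every invocation’s duration and outcome gets recorded, can be sketched in plain Python (this is an illustration of what such an agent does, not IOpipe’s real interface):

```python
import functools
import json
import time


def instrumented(handler):
    """Wrap a Lambda handler to log duration and outcome per invocation.
    An agent like IOpipe does this (and much more) for you."""
    @functools.wraps(handler)
    def wrapper(event, context):
        start = time.monotonic()
        error = None
        try:
            return handler(event, context)
        except Exception as exc:
            error = repr(exc)
            raise
        finally:
            # Structured log line; a real agent ships this to its backend.
            print(json.dumps({
                "duration_ms": round((time.monotonic() - start) * 1000, 2),
                "error": error,
            }))
    return wrapper


@instrumented
def collector(event, context):
    return {"statusCode": 202}
```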
This solution has been live in production at Arena for four months now and has processed more than 500 million events with no downtime. Bottom line? The total cost (API Gateway, SQS, Lambda, and Elasticsearch) came out to less than 15% of the projected cost of using a third-party service like Keen.io. Needless to say, we’re happy with this solution! What’s more, it has required minimal development effort since launch. Thanks for sticking around! As always, if you have any questions or feedback, please feel free to share them in the comments!