Building a serverless data analytics pipeline

Earlier this year, we were hit by the news that our data analytics service, Keen.io, had been acquired and that its pricing structure was changing. The new pricing just wasn’t going to work for us: this one service would have accounted for about 30% of our cloud costs. As a young startup we always try to be resourceful, so, naturally, I figured we would be better off saving that money and investing it in other areas.

Part of what we do at Arena is install JavaScript code and widgets into our customers’ websites. That means we need to track usage (page views, clicks and interactions) and later report it back to them on their account dashboard. Keen.io served three main purposes for us: ‘collect’, ‘aggregate’ and ‘query’. We never used its dashboard and graph features, which made the transition easier.

The ‘collect’ part of the flow was done entirely on the server side: the JavaScript client calls our REST API, and the API calls Keen.io to store the event. Since the JavaScript code never talked to Keen directly (only our REST API did), we knew for certain that this transition was going to have minimal impact on the client side.
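
To make that concrete, here is a minimal sketch of what the client-side tracking call looks like in this setup. The endpoint path and payload fields below are illustrative placeholders, not our actual schema; the point is simply that the browser only ever talks to our own REST API.

```javascript
// Illustrative client-side tracking call (endpoint and fields are placeholders).
function trackEvent(type, data) {
  return fetch('https://api.example.com/v1/events', {  // hypothetical endpoint
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      type,                        // e.g. 'pageview', 'click'
      data,                        // widget-specific payload
      url: window.location.href,
      ts: Date.now()
    }),
    keepalive: true                // lets the request finish during page unload
  });
}

// Example usage from a widget
trackEvent('pageview', { widget: 'live-chat' });
```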

From the get-go, the majority of our infrastructure and APIs have run on AWS with API Gateway + Lambda, so we thought: “Why not design and implement a robust data analytics pipeline to replace Keen using only serverless technologies?” Check out the overall solution design:

Data flow and architecture

API Gateway receives the incoming events from our JavaScript clients and pushes them to an SQS queue. The queue buffers the events for later processing. A collector Lambda function then consumes the queued events, does some data enrichment, and stores the final event in an Elasticsearch index. Elasticsearch serves as the permanent event data store. Kibana (this part is optional) sits on top of Elasticsearch and provides data visualization and graphs.
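
For illustration, here is a rough sketch of what the collector Lambda looks like. The index name, enrichment fields and Elasticsearch client setup are assumptions for the example, and request signing for the AWS-hosted cluster is left out for brevity.

```javascript
// Sketch of the collector Lambda: consume SQS records, enrich, index into Elasticsearch.
const { Client } = require('@elastic/elasticsearch');
const es = new Client({ node: process.env.ES_ENDPOINT });

exports.handler = async (event) => {
  // An SQS-triggered invocation delivers a batch of records.
  for (const record of event.Records) {
    const raw = JSON.parse(record.body);

    // Simple enrichment step: attach processing metadata.
    const enriched = {
      ...raw,
      receivedAt: new Date().toISOString(),
      source: 'sqs-collector'
    };

    // Elasticsearch is the permanent event store.
    await es.index({ index: 'events', body: enriched });
  }
};
```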

Architecture highlights

You might be thinking, “Why didn’t they use AWS Kinesis streams instead of an SQS queue?” Well, we figured it would be overkill at this point. The SQS-based solution can handle millions of events per day, and that’s definitely enough for us at this stage.

This solution provides analytics in real time, which is a great feature. And when we don’t need real time, all we have to do is lower the collector Lambda function’s ‘reserved concurrency’; it then acts as a throttle so our internal Elasticsearch cluster doesn’t get overwhelmed during traffic spikes.
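
For example, the cap can be applied with a one-off call like the sketch below (the function name and limit are hypothetical; the same setting is also available in the Lambda console or in your deployment config).

```javascript
// Sketch: cap the collector's reserved concurrency so it drains the queue at a bounded rate.
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

lambda
  .putFunctionConcurrency({
    FunctionName: 'analytics-event-collector',  // hypothetical function name
    ReservedConcurrentExecutions: 5             // at most 5 concurrent consumers of the queue
  })
  .promise()
  .then(() => console.log('Throttle applied'))
  .catch(console.error);
```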

Kibana is a great tool for data visualization, enabling us to build real-time dashboards and visualizations very quickly. Here’s an example of what we can visualize with minimal effort:

Real-time dashboard built on Kibana
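
Under the hood, dashboards like this boil down to Elasticsearch aggregations. As a rough sketch, here is the kind of query that would count page views per day for a given customer (the index and field names are assumptions carried over from the example above):

```javascript
// Sketch: daily page-view counts via a date_histogram aggregation.
const { Client } = require('@elastic/elasticsearch');
const es = new Client({ node: process.env.ES_ENDPOINT });

async function pageViewsPerDay(projectId) {
  const { body } = await es.search({
    index: 'events',
    body: {
      size: 0,                                   // we only care about the aggregation
      query: {
        bool: {
          filter: [
            { term: { type: 'pageview' } },      // assumed event type field
            { term: { projectId } }              // assumed customer/project field
          ]
        }
      },
      aggs: {
        per_day: {
          date_histogram: { field: 'receivedAt', calendar_interval: 'day' }
        }
      }
    }
  });
  return body.aggregations.per_day.buckets;      // [{ key_as_string, doc_count }, ...]
}
```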

Monitoring & Observability

Now that we’re relying on an internal solution for our data analytics pipeline, we need to continually monitor the health and reliability of the system. We decided to go with IOPipe to observe and trace our collector Lambda executions and get notified in Slack whenever something’s up. We now use IOPipe not only for our data analytics pipeline but for all of our Lambda functions. The great thing about their service is that we can inspect Lambda events to search for specific executions, analyze execution duration and memory consumption, and trace external calls, all in real time:

Lambda collector function monitoring dashboard on IOPipe
Lambda execution details: metrics, tracing, source event details
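
Instrumenting a function with IOPipe is essentially just wrapping the handler with their Node agent. A minimal sketch, assuming the @iopipe/iopipe package and a token stored in an environment variable, looks roughly like this:

```javascript
// Sketch: wrapping the collector handler with the IOPipe agent.
const iopipe = require('@iopipe/iopipe')({ token: process.env.IOPIPE_TOKEN });

exports.handler = iopipe(async (event, context) => {
  // ... existing collector logic goes here ...
  return { processed: event.Records.length };
});
```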

Final thoughts

This solution has been live in production at Arena for four months now and has processed more than 500 million events with no downtime. Bottom line? The total cost (API Gateway, SQS, Lambda and Elasticsearch) came to less than 15% of the projected cost of staying on a third-party service like Keen.io. Needless to say, we’re happy with this solution! What’s more, it has required minimal development effort since it went live. Anyway, thanks for sticking around! As always, if you have any questions or feedback, please feel free to share them in the comments!