Monitoring Streaming Data Using AWS Kinesis Data Analytics

Omer Bayram
Insider Engineering
Sep 30, 2020 · 7 min read

Streaming data can be a hassle to manage. Sometimes it's a peaceful river to look at; other times it's a huge flood you can only stare at in despair. In any case, it always helps to look into the stream and see what it is composed of. We built a monitoring system for our streaming clickstream data for exactly this reason. It helps us both make sense of our data and debug problems related to the stream.

The goal of this monitoring system is to see the number of different events in the clickstream data in a meaningful way. The clickstream data is first partitioned into logically connected events. Then the count of each event is calculated and displayed over time. We were already using AWS Kinesis to collect the clickstream data, so we decided to use AWS Kinesis Data Analytics to process the stream for monitoring. This was a quick and easy solution, and it worked out really well for us.

The end result is a pair of graphs for logically connected events with respect to time.

Online Clickstream Event Counts of an E-commerce Website

The streaming data we have at Insider is the clickstream data of our partners’ websites. The clickstream data of a website is basically a detailed log of how users navigate through that website. An event in the clickstream data could be a user navigating to a product page, looking at their cart, or going to the sign-up page. We pour all these user movement events from our partners’ websites into a large data lake. We feed this data into our Machine Learning models, where online visitors are segmented based on their online behavior for targeted advertisements. We call this product suite Predictive Audiences. Insider works with a large number of globally well-known brands, so we have the opportunity to work with very large clickstream data. How large? 500 million events per day (or 460 GB per day) large.
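The stated volume works out to some handy back-of-envelope figures (assuming decimal gigabytes):

```python
# Back-of-envelope throughput figures from the stated volume:
# 500 million events/day and 460 GB/day.
EVENTS_PER_DAY = 500_000_000
BYTES_PER_DAY = 460 * 10**9

events_per_second = EVENTS_PER_DAY / 86_400      # ≈ 5,787 events/s
avg_event_size = BYTES_PER_DAY / EVENTS_PER_DAY  # ≈ 920 bytes/event
gb_per_hour = 460 / 24                           # ≈ 19.2 GB/hour

print(f"{events_per_second:,.0f} events/s, "
      f"{avg_event_size:.0f} B/event, {gb_per_hour:.1f} GB/hour")
```

The ~19 GB/hour figure lines up with the roughly 20 GB per hour stream the Kinesis Data Analytics application ends up processing.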

Size of our clickstream data

Our clickstream data consists of different types of online click events, like product page click, register button click, purchase click, cart page click, etc., which are individually defined for each partner depending on their needs. The online audience size of each partner varies as well, so the click event counts change dramatically from one partner to another. Thus, we needed a monitoring dashboard where the stream is partitioned first by partner and then by event type, so that data is grouped logically and we can see the periodic flow of related events over time. The flow of events gives a lot of insight into the streaming data, and it’s also a great help for debugging or detecting anomalies.
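The partitioning logic can be pictured as counting events keyed by partner, event type, and minute. A minimal Python sketch, with made-up field names ('partner', 'event', 'ts') standing in for the real event schema:

```python
from collections import Counter
from datetime import datetime, timezone

def minute_counts(events):
    """Count clickstream events per (partner, event_type, minute) bucket.
    Field names are illustrative, not the actual schema."""
    counts = Counter()
    for e in events:
        minute = datetime.fromtimestamp(e["ts"], tz=timezone.utc).strftime("%Y-%m-%d %H:%M")
        counts[(e["partner"], e["event"], minute)] += 1
    return counts

# Three events from one hypothetical partner within the same minute.
sample = [
    {"partner": "acme", "event": "product_page_click", "ts": 1601424000},
    {"partner": "acme", "event": "product_page_click", "ts": 1601424030},
    {"partner": "acme", "event": "purchase_click", "ts": 1601424010},
]
print(minute_counts(sample))
```

Each resulting (partner, event type, minute) count becomes one point on the per-partner event graphs.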


We’re collecting the clickstream data via an API, which forwards all the incoming data into an AWS Kinesis stream. This stream is consumed by a variety of different applications, one of them being an AWS Kinesis Data Analytics application, which we created for monitoring. The Kinesis Data Analytics application’s only job is to execute an SQL query on the stream. It’s a fully managed service, so no servers and no framework. We only had to write the SQL query, which aggregates the streaming data every minute based on our partitioning logic and counts the events. It literally takes 5 minutes to set up an application, paste the query into the SQL editor, and start seeing the aggregated output in tabular format in real time.
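The aggregation query might look something like the following sketch: a one-minute tumbling window that counts events per partner and event type. The column names here are illustrative; the real ones come from the input schema you define for the application.

```sql
-- Sketch of a one-minute tumbling-window count per partner and event type.
-- partner_id and event_type are made-up column names; they would come from
-- the input schema defined on SOURCE_SQL_STREAM_001.
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    partner_id  VARCHAR(64),
    event_type  VARCHAR(64),
    event_count INTEGER
);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
  INSERT INTO "DESTINATION_SQL_STREAM"
  SELECT STREAM "partner_id", "event_type", COUNT(*) AS event_count
  FROM "SOURCE_SQL_STREAM_001"
  GROUP BY "partner_id", "event_type",
           STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '60' SECOND);
```

The STEP function over ROWTIME is what makes this a tumbling window, so each row in the destination stream is a fresh one-minute count rather than a running total.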

A Running Kinesis Data Analytics Application

We wanted to visualize the aggregated event counts, so we decided to store them in a relational database and connect it to a Grafana dashboard for visualization. The output of a Kinesis Data Analytics application can be forwarded to a new Kinesis stream or to an AWS Lambda function. We forwarded the output of ours to an AWS Lambda function, whose only job is to take the output and insert it into our relational database. The Kinesis Data Analytics application invokes the Lambda function whenever there is data in the output stream; in our case, this is once every minute. Once in the relational database, the output data is ready to be used by any visualization tool, like Grafana.
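A minimal sketch of such an output-processing Lambda, assuming the standard Kinesis Data Analytics Lambda output model (base64-encoded records in, per-record 'Ok'/'DeliveryFailed' statuses out). 'insert_row' is a hypothetical stand-in for the actual database INSERT:

```python
import base64
import json

def lambda_handler(event, context=None, insert_row=print):
    """Decode each aggregated record from the Kinesis Data Analytics output
    stream and hand it to insert_row (a stand-in for the database INSERT)."""
    results = []
    for record in event["records"]:
        try:
            row = json.loads(base64.b64decode(record["data"]))
            insert_row(row)  # e.g. INSERT INTO event_counts (...) VALUES (...)
            results.append({"recordId": record["recordId"], "result": "Ok"})
        except Exception:
            # Report failure so Kinesis Data Analytics retries this record.
            results.append({"recordId": record["recordId"],
                            "result": "DeliveryFailed"})
    return {"records": results}
```

Returning a per-record status is what lets Kinesis Data Analytics retry only the records that failed instead of the whole batch.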

Here is our complete architecture for this system: API → Kinesis stream → Kinesis Data Analytics application → Lambda function → relational database → Grafana dashboard.

The cost of using Kinesis Data Analytics depends on how many Kinesis Processing Units (KPUs) your application requires. AWS automatically determines how many KPUs it needs, so there is no need to select a number. Our application ended up using only one KPU, occasionally scaling up to two KPUs to process a 20 GB per hour stream. The output processing Lambda function was invoked once every minute and took about 3–4 seconds of execution time.

All this sounds nice, but what if the monitoring system itself fails? Who’s going to monitor the monitoring system? We have set up two CloudWatch alarms to watch over it. We’re watching the ‘MillisBehindLatest’ metric of our Kinesis Data Analytics application to make sure that the application does not fall behind its input stream. The ‘MillisBehindLatest’ metric indicates how far behind your application is reading from its input source; if it’s falling behind, there is a problem on the input side of your application. We’re also watching the duration of the output Lambda function, to make sure that output processing is working well and does not cause any throttling in the application.
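Such alarms can be created with CloudWatch’s put_metric_alarm API. A sketch of the two alarm definitions as parameter dicts (alarm names and thresholds are made up for illustration; pass each dict to boto3.client('cloudwatch').put_metric_alarm(**params)):

```python
def kda_lag_alarm(application_name, threshold_ms=60_000):
    """Alarm when the Kinesis Data Analytics application reads more than
    threshold_ms behind its input stream for 3 consecutive minutes."""
    return {
        "AlarmName": f"{application_name}-behind-input-stream",
        "Namespace": "AWS/KinesisAnalytics",
        "MetricName": "MillisBehindLatest",
        "Dimensions": [{"Name": "Application", "Value": application_name}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
    }

def lambda_duration_alarm(function_name, threshold_ms=30_000):
    """Alarm when the output-processing Lambda runs unusually long,
    which would throttle the analytics application."""
    return {
        "AlarmName": f"{function_name}-slow-output-processing",
        "Namespace": "AWS/Lambda",
        "MetricName": "Duration",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
    }
```

Building the parameters as plain dicts keeps the alarm definitions easy to review and unit-test before they ever touch the AWS API.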

Some key points we have learned along the way:

  • You MUST define an input schema for your Kinesis Data Analytics application. Otherwise, the application will guess the schema, and if it’s wrong, you’ll have lots of parsing errors and thus missing data.
  • There are three datetime options you can use when you’re processing a stream. Look at this AWS documentation page to understand them well.
  • The Kinesis Data Analytics application can put pressure on your input stream. If your input stream is already being used by other applications, track the ‘ReadProvisionedThroughputExceeded’ metric of the input stream to make sure it is keeping up with all the GetRecords requests. The ‘ReadProvisionedThroughputExceeded’ metric tells you what percentage of GetRecords requests are throttled in your input stream. If this metric is increasing, you should consider increasing the parallelism of your input stream, e.g. by adding shards.

Here is a great Best Practices document from AWS, which covers the key points I mentioned here, monitoring the application, and more. Make sure to go through it before taking your Kinesis Data Analytics application live.


We wanted a quick visualization of our clickstream data, with minimal maintenance and easy setup. This architecture gave us exactly that. There are still many more components that can be added to this system like statistical data profiling (e.g. mean, median, min/max, variance, skewness, grouped aggregations, etc.), anomaly detection, or forecasting.

We have started to use Kinesis Data Analytics in other architectures as well, as a preprocessing step that transforms the streaming data for a destination application. Previously, each destination application had to transform the whole clickstream data for its own usage, which meant that the data transformation was done in the destination application (in Python or Scala for us) using third-party libraries and lots of memory. When the data transformation step is done in SQL with Kinesis Data Analytics, the destination application doesn’t have to transform the data. This reduces the computation cost, memory cost, and amount of code required in the destination applications. Some data transformations are much easier to do in SQL than in Python or Scala.

We loved the fact that it was so easy to set up the Kinesis Data Analytics application and get the results quickly. This architecture is helping us to gain better insights about our clickstream data and is also serving us as a base to create a more advanced monitoring system.


Check out our career page to see open positions.
