Performing Real-time Anomaly Detection using AWS

Gautam Krishna
Published in Slalom Data & AI
7 min read · Jun 8, 2020

A serverless approach to detect anomalies in real-time data — by Gautam Krishna, Reuben Hilliard, and Preeti Modgil

Anomaly Detection (Source: www.thedatascientist.com)

Introduction

Anomaly detection in real-time streaming data has applications in several industries. The devices that generate such streaming data are varied and include vehicle sensors, manufacturing equipment sensors, GPS devices, medical equipment, building sensors, and more. Streaming data suited to anomaly detection has been generated in some applications for many years, such as by pressure and temperature sensors on manufacturing equipment. However, anomaly detection on that data has largely been limited to hard-coded, rules-based checks applied in real time rather than true machine learning of what constitutes an anomaly.

Applications for detecting anomalies in real-time streaming data exist in many industries, including finance, IT, medical, manufacturing, social media, and e-commerce. Anomaly detection can provide useful, actionable real-time information as well as data for longer-term use. However, implementations of anomaly detection workflows are still limited, so they remain a potential differentiator for any company that adopts them. In this blog, we will leverage Amazon Web Services (AWS) to build a real-time anomaly detection application.

The AWS services that we will be leveraging in this post are:

  • Amazon SageMaker — A Machine Learning service to develop and deploy predictive models at scale through a Jupyter notebook environment
  • Amazon Athena — A serverless interactive querying service to analyze data in Amazon S3 using standard SQL
  • AWS Lambda — A serverless code execution service that can connect to almost all Amazon services programmatically
  • Amazon S3 — The ubiquitous object storage service on Amazon Web Services
  • Amazon Kinesis — A family of services that cost-effectively collect, process, and analyze streaming data at any scale
  • Amazon Simple Notification Service — A fully managed pub/sub messaging service that enables the mass delivery of messages
  • Amazon Elasticsearch Service — A managed, highly scalable real-time full-text search and analytics engine
  • Kibana — Search, view, and visualize data indexed in Elasticsearch

AWS Cloud Architecture

AWS Cloud Architecture leveraged

Process flow:

For the purposes of this demo, we use the example of collecting On-Board Diagnostics II (OBD II) sensor data from a vehicle. An OBD II device collects data from a vehicle's various sensors and sends it to a cloud-based server in real time for detailed analysis; the device itself, however, is out of scope for this blog. Instead, we simulate OBD data based on distributions fitted to the OBD data publicly available on Kaggle, and we randomly introduce outlier data points by scaling readings by a factor of 2 (arbitrarily chosen). The following are the major steps in our AWS-based anomaly detection framework.

  1. Data Generation from OBD II Sensor: Simulate OBD II data as described above.
  2. Streaming data consumption into Amazon Kinesis Data Streams: The simulated OBD II sensor data is consumed by a Kinesis stream.
  3. Real-time analytics and anomaly detection in Kinesis Data Analytics: The data from the Kinesis stream serves as the input to our Kinesis Data Analytics application. Kinesis Data Analytics applies the Random Cut Forest algorithm, which first baselines the real-time data and then computes anomaly scores for new data points.
  4. Processed data from Kinesis Analytics to a Kinesis stream: Kinesis Data Analytics does not itself act on anomalous data points; it only appends an anomaly score to each JSON payload it outputs. To act on those scores, the output is fed into another Kinesis stream for consumption.
  5. Real-Time Notification System: To notify downstream systems (or users) of potential outliers flagged upstream by Kinesis Data Analytics, we create an SNS topic and use a Lambda function (triggered by the Kinesis stream from the previous step) to push notifications to that topic. The Lambda function publishes an SNS notification only when the anomaly score exceeds a manually set threshold.
  6. Real-Time Visualization: To visually demonstrate the anomalies in the data, we also set up a real-time dashboard based on Elasticsearch and Kibana. The processed records, with their anomaly scores, are inserted into the Amazon-managed Elasticsearch cluster and indexed as time-based documents. We leveraged Timelion and other Kibana visualization tools to develop the dashboard.
  7. Loading the data into the object store (S3) using Kinesis Firehose: The processed data from our Kinesis Data Analytics application is buffered and loaded into S3 every 5 minutes, partitioned in a Hive-compatible layout.
  8. Analytical reporting via Amazon Athena: The partitioned data in S3 is made available to end users for querying via Amazon Athena, where they can access it with standard SQL. This also paves the way for external data visualization tools like Tableau to access the data in S3.
  9. Machine Learning via Amazon SageMaker — As a more advanced use case, we can leverage Amazon SageMaker to access the data stored in S3 and train Machine Learning models on patterns within the data, so that we can make predictions for new data points in the future.
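As a rough sketch of steps 1 and 2, the snippet below simulates OBD II readings, scales a random subset by a factor of 2 to create outliers, and pushes them to a Kinesis stream. The field names, distributions, and stream name are illustrative assumptions, not the exact setup used in this demo.

```python
import json
import random

def make_reading(anomaly_rate=0.02, scale=2.0):
    """Simulate one OBD II reading; occasionally scale the values to create an outlier.
    Field names and distributions are placeholders for distributions fitted to real data."""
    reading = {
        "engine_rpm": random.gauss(2500, 400),
        "vehicle_speed": random.gauss(60, 15),
        "coolant_temp": random.gauss(90, 5),
    }
    if random.random() < anomaly_rate:
        reading = {k: v * scale for k, v in reading.items()}
    return reading

def send_readings(kinesis_client, stream_name, n=10):
    """Push n simulated readings to a Kinesis Data Stream as JSON records."""
    for _ in range(n):
        kinesis_client.put_record(
            StreamName=stream_name,
            Data=json.dumps(make_reading()),
            PartitionKey="obd-device-1",  # a single device in this sketch
        )
```

In practice the client would come from boto3 (`kinesis_client = boto3.client("kinesis")`) and `make_reading` would draw from distributions fitted to the Kaggle OBD dataset.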

While steps 4 through 6 above cover the real-time handling of the OBD data, steps 7 through 9 cover persisting the data into an object store for deeper analysis later on.
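For the reporting side (step 8), querying the partitioned data through Athena with boto3 can be sketched roughly as follows; the database, table, score column, and partition column names here are hypothetical.

```python
import time

def build_anomaly_query(database, table, score_threshold=2.0, day="2020-06-08"):
    """Build a standard-SQL query over the Hive-partitioned data in S3.
    `anomaly_score` and the `dt` partition column are assumed names."""
    return (
        f"SELECT * FROM {database}.{table} "
        f"WHERE anomaly_score >= {score_threshold} "
        f"AND dt = '{day}'"
    )

def run_athena_query(athena_client, sql, output_s3):
    """Start the query and poll until Athena reaches a terminal state (no error handling)."""
    qid = athena_client.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_s3},
    )["QueryExecutionId"]
    while True:
        state = athena_client.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state, qid
        time.sleep(1)
```

With a real `boto3.client("athena")`, the results land at the given S3 output location, where Tableau or other tools can pick them up.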

Tracking the anomalies

As discussed in steps 5 and 6 above, we developed two ways to track anomalous data points: first, a real-time dashboard built on Elasticsearch + Kibana, and second, email notifications via Amazon Simple Notification Service (SNS).

A snapshot of the real-time dashboard is shown below. We set a threshold of 2.0, meaning that any time a data point receives an anomaly score of 2.0 or more, it shows up in the dashboard.

Real-time dashboard via ElasticSearch + Kibana
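Indexing the scored records as time-based documents (step 6) might look like the sketch below; the index prefix and endpoint are hypothetical, and a production setup would use a SigV4-signed HTTP client against the Amazon Elasticsearch Service domain rather than a plain POST.

```python
import datetime
import json

def time_based_index(prefix="obd-anomaly", when=None):
    """Daily index name, e.g. obd-anomaly-2020-06-08, so Kibana can filter by time."""
    when = when or datetime.datetime.utcnow()
    return f"{prefix}-{when:%Y-%m-%d}"

def index_record(http_post, es_endpoint, record):
    """POST one scored record as a document; http_post stands in for a signed client."""
    url = f"{es_endpoint}/{time_based_index()}/_doc"
    return http_post(url, data=json.dumps(record),
                     headers={"Content-Type": "application/json"})
```

Timelion and the other Kibana visualizations then read from these daily indices.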

We also configured a Lambda function (as mentioned in step 5 above) that triggers a notification via Amazon SNS. Users who subscribe to the SNS topic receive a notification like the one below:

A sample SNS notification sent out to the user
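A minimal sketch of the Lambda function from step 5 is shown below. The score field name (ANOMALY_SCORE) and the topic ARN are assumptions for this sketch; a real function would read the topic ARN from configuration.

```python
import base64
import json

THRESHOLD = 2.0  # manually set cutoff, as described in step 5

def extract_anomalies(event, threshold=THRESHOLD):
    """Decode the Kinesis records in a Lambda event and keep payloads whose
    anomaly score exceeds the threshold. Field name is an assumed convention."""
    flagged = []
    for rec in event.get("Records", []):
        payload = json.loads(base64.b64decode(rec["kinesis"]["data"]))
        if payload.get("ANOMALY_SCORE", 0.0) > threshold:
            flagged.append(payload)
    return flagged

def lambda_handler(event, context):
    import boto3  # available in the AWS Lambda runtime
    sns = boto3.client("sns")
    flagged = extract_anomalies(event)
    for payload in flagged:
        sns.publish(
            TopicArn="arn:aws:sns:us-east-1:123456789012:anomaly-alerts",  # placeholder
            Subject="Anomaly detected in OBD stream",
            Message=json.dumps(payload),
        )
    return {"flagged": len(flagged)}
```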

Algorithm Overview

The Random Cut Forest algorithm leveraged in Kinesis Analytics is based on the Robust Random Cut Forest, first implemented by a team of researchers at Amazon. The algorithm is focused on speed and specifically designed and optimized for streaming data. It is also quite robust, with no assumptions on the distribution of the underlying data and no preprocessing required.

The algorithm works on random samples of the data: for each sample, it builds a tree by repeatedly making random cuts that partition the points. It then looks at all the trees together to determine whether a given data point is an anomaly. The fewer cuts needed to isolate the target data point within a sample, the more likely it is that the point is an anomaly.

At this point, it is worth noting that Amazon SageMaker also supports anomaly detection using Random Cut Forest, but that implementation is primarily designed for batch predictions: users train the model on a static dataset, and the model then outputs an anomaly score for each data point in the input dataset. A sample implementation of Random Cut Forest on Amazon SageMaker can be found in this blog.

Conclusion

The architecture for setting up streaming data, performing real-time analytics on it, serving up the results as dashboards or alerts, and then saving off the relevant data for downstream uses can all be achieved within the AWS framework. Each of these components can be customized to the industry application and use case. For example, in the case of medical devices, real-time monitoring and anomaly alerting can provide vital, valuable information that helps front-line medical workers prioritize action; this can be especially helpful in resource-stressed situations like the COVID-19 pandemic. Other use cases for anomaly detection and real-time dashboards can deliver longer-term cost savings, for example with building sensors and the associated energy consumption patterns. In many cases, the data sourced for real-time operational analytics (e.g. manufacturing equipment sensor anomalies) can then be aggregated and stored for downstream applications such as executive dashboards (e.g. equipment failures and resulting production outages) as well as future machine learning applications (e.g. predicting what type of failure and length of outage might result from a sensor anomaly).

Indeed, the applications and ROI for implementing real-time anomaly detection systems can be varied. However, depending on the industry and specific use-case, it should be possible to determine a reasonable ROI prior to embarking on such an initiative. For further information on how this might benefit your organization, please contact us here.
