Perform real-time anomaly detection using Google Cloud’s Timeseries Insights API — Part I

An overview of an easy-to-use API that scales to billions of timeseries with low-latency anomaly detection and forecasting

Nishit Patel
Google Cloud - Community
14 min read · Sep 20, 2022



This is the first of a two-part article on real-time anomaly detection using Google Cloud’s Timeseries Insights API. In this article, I’ll cover how to create a dataset for anomaly detection and how to query it for anomalies. Part II will focus on how to append new events in streaming fashion, detect anomalies on the newly added data, and delete unwanted timeseries datasets.

Time series forecasting and anomaly detection are common ML use cases, applied widely across industries: traffic forecasting, demand planning, stock inventory management, and so on. Most existing systems run forecasting and anomaly detection as batch jobs. In this tutorial, I’ll explain and demonstrate a use case that detects anomalies in near real time (sub-second latency) using Google Cloud’s Timeseries Insights API.

So what is the Timeseries Insights API, and why and where would you use it?

Google Cloud’s Timeseries Insights API is a general-purpose anomaly detection framework that provides a low-cost, low-latency, and highly scalable solution.

Multiple anomaly detection methods exist today, ranging from simple ones based on observing a threshold over time to more complex ones that dynamically adjust the resolution interval. The key limitation of these systems is the cost of analyzing timeseries and picking out the interesting ones. In addition, forecasting and anomaly detection over billions of timeseries is computationally expensive when run in batch fashion. This is a critical limitation for online, real-time analysis, for example sending users an alert based on fluctuations in an event dimension’s value.

The main goals of timeseries forecasting and anomaly detection with the Timeseries Insights API are:

  • Scale to billions of timeseries (timeseries => series of counts over time for a given event)
  • Real-time (sub-second) latency for forecasting and anomaly detection
  • Provide comparatively cheaper batch inference for timeseries forecasting

Let’s understand how the Timeseries Insights API works with an example and see it in action. There are four main methods for interacting with the API and getting things going. These are:

  • Create and load dataset
  • Query dataset
  • Update dataset (append streaming/new events to an existing dataset)
  • Delete dataset

We will cover the first two methods in this article. You can check out the official API documentation for setup instructions and the appropriate permissions. Now let’s get to the fun part:

We are using a fictitious IoT dataset. The data comes from a single sensor, and an event is streamed roughly every 25 seconds with readings of attributes like light, temperature, hydrogen, and humidity within a warehouse. The dataset consists of roughly 3 months of data and has 202K rows. We will use this dataset throughout this exercise. Here are some sample rows:

Sensor data sample in tabular form (we will not be using the last column)

and here is how the values of each attribute look over time:

Figure 1 — time series plot with sensor readings

As you can see in the plot above, the light values fluctuated a lot, while the temperature and hydrogen values were more or less consistent, except for some days with a sharp increase in temperature and a sharp decline in hydrogen around the same time. For the analysis in this article, I am going to consider only the temperature values, and we will try to detect anomalies in them.

If you look closely at the temperature values alone, they spiked very quickly between 06/21/21 and 06/28/21. For this tutorial, this timeframe will be used to query for anomalies in the temperature values.

Figure 2 — temperature variation in given data

1. Creating the dataset

Before we can query the Timeseries Insights API for anomalies, we need to create a dataset within the API from our historical data. A dataset is a collection of events, and queries are performed against it. The API expects this data in a specific JSON format. A sample event in this format looks like the one below:
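Here is a hedged reconstruction of one such event (the values are illustrative; the layout follows the event format described next, with one name/value pair per dimension):

    {
      "groupId": "4123244551000342",
      "eventTime": "2021-06-25T11:54:00Z",
      "dimensions": [
        {"name": "measure", "stringVal": "LTTH"},
        {"name": "light", "doubleVal": 31.4},
        {"name": "temperature", "doubleVal": 22.8},
        {"name": "hydrogen", "doubleVal": 0.0041},
        {"name": "humidity", "doubleVal": 45.2}
      ]
    }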

Here, this one event represents one row from the tabular dataset shown earlier. Let’s first understand the different components of a dataset in the Timeseries Insights API. The Resources section of this article links to the notebook that contains the full code to convert the tabular data into the required JSON format.

groupId represents each unique event in your dataset. Think of it as an event identifier; since each row in our dataset is one event, we have a unique groupId for each event in the transformed JSON. I used the FARM_FINGERPRINT hash function from BigQuery to generate it. The purpose of a group is to compute correlations among events from the same group. Note that if your dataset does not include a groupId, one will be autogenerated from the timestamp and other content of your dataset, so it is better to create it yourself for complete control over the schema.
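If you are preparing the JSON in Python rather than in BigQuery, a stable identifier can be derived with a standard hash. This is a hypothetical stand-in for FARM_FINGERPRINT, not something the API requires:

    import hashlib

    def make_group_id(row_key: str) -> int:
        """Derive a stable id in the positive int64 range from a row's unique key."""
        digest = hashlib.sha256(row_key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") & 0x7FFF_FFFF_FFFF_FFFF

    # e.g. hash the event timestamp plus a sensor id to get a per-row groupId
    print(make_group_id("2021-06-25T11:54:00Z|sensor-1"))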

eventTime (case sensitive) is simply the timestamp of your event.

dimensions are the properties of a given event; they can be categorical or numerical. You can also think of them as the different attributes of your timeseries data. Note that each dimension has a name and a value, provided via stringVal, doubleVal, or longVal.

Note: the JSON event above contains a dimension with the key/value pair “measure” and “LTTH”, but our original dataset has no column with that name.

This has to do with a concept called a slice within the Timeseries Insights API. A slice is the subset of all events within a dataset that share certain values across some categorical dimension. Since we have no such attribute/column in our dataset (all attributes are numerical), we need to create a dummy categorical one. And since each event in our dataset is unique and there is no hierarchy, we can use an arbitrary string value for this dummy dimension, with the same value for every event.

In the future, slicing on numerical dimensions will be supported by the API.

That’s all you need to know to create a dataset in the required format. The transformed JSON file has 202K rows (the same as the original dataset). Before the dataset can be created, we need to move this JSON file to a Cloud Storage bucket, for example with gsutil cp sensor_data.json gs://your-bucket/ (the bucket and file names are placeholders):
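The same step from Python, as a sketch using the google-cloud-storage client (project, bucket, and file names are placeholders):

    from google.cloud import storage  # pip install google-cloud-storage

    client = storage.Client(project="your-project-id")
    bucket = client.bucket("your-bucket")
    bucket.blob("sensor_data.json").upload_from_filename("sensor_data.json")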

The next step is to issue a create-dataset request to the Timeseries Insights API to create the dataset with all the historical records.
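A minimal sketch of that request, assuming application-default credentials and placeholder project, location, and bucket names; the exact resource path can vary between API versions, so verify it against the API reference:

    import json

    import google.auth
    from google.auth.transport.requests import AuthorizedSession

    # Application-default credentials; assumes `gcloud auth application-default
    # login` has been run, or the code runs on a GCP runtime with permissions.
    credentials, project = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"])
    session = AuthorizedSession(credentials)

    # JSON payload describing the dataset (the ttl and bucket URI are illustrative).
    data = {
        "name": "sensor-data",
        "ttl": "31536000s",  # retain events for roughly one year
        "dataNames": ["measure", "light", "temperature", "hydrogen", "humidity"],
        "dataSources": [{"uri": "gs://your-bucket/sensor_data.json"}],
    }

    resp = session.post(
        "https://timeseriesinsights.googleapis.com/v1/"
        f"projects/{project}/locations/us-central1/datasets",
        json=data)
    print(resp.status_code, json.dumps(resp.json(), indent=2))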

In the sketch above, data is the JSON payload used to create the dataset. First, provide a dataset name (sensor-data here). Then ttl is set, a parameter that determines how long the dataset’s data is stored before being discarded; events appended later (for new incoming data) must be newer than the current time minus the ttl value. The dataNames parameter is a list of all the dimensions present in your dataset; essentially, you want to include every dimension here that you may wish to query for anomalies and forecasting. Lastly, dataSources is the GCS location of the transformed JSON file that contains the historical data.

The create request returns success if the API server accepts it. You can then submit another request to view or list the dataset. Initially the dataset will be in “LOADING” status until indexing completes for all the dimensions; the status then becomes “LOADED”, which indicates that the dataset can start accepting queries and updates for anomaly detection and forecasting.

Check the status and list all datasets on the API server by running the following code:
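A sketch of the list call, reusing the authenticated session from the create step (the path mirrors the create endpoint):

    # Reuses `session`, `project`, and the json import from the create sketch.
    resp = session.get(
        "https://timeseriesinsights.googleapis.com/v1/"
        f"projects/{project}/locations/us-central1/datasets")
    print(json.dumps(resp.json(), indent=2))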

List all datasets in API

Following is the response you’ll see. It includes the dataset name, all indexed dimensions, the URI of your JSON file in Cloud Storage, and the number of rows in the dataset. Note that you can only query a dataset when its status is LOADED, and depending on the size of your data it can take a while before it is indexed and ready for queries.
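Trimmed down, the list response is shaped roughly like this (a hedged sketch; exact field names can differ by API version, and the full response also carries processing status, including the indexed row count):

    {
      "datasets": [
        {
          "name": "projects/your-project/locations/us-central1/datasets/sensor-data",
          "dataNames": ["measure", "light", "temperature", "hydrogen", "humidity"],
          "dataSources": [{"uri": "gs://your-bucket/sensor_data.json"}],
          "state": "LOADED"
        }
      ]
    }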

2. Querying for anomalies

Once the dataset has been successfully created and is in the LOADED status, the API is ready to accept queries for anomalies or forecasting. As with dataset creation, a JSON query payload is used to make the HTTP request to the API. Below is an example payload for querying for anomalies in the temperature values.
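A sketch in Python, reusing the authenticated session from the create step. All literal values (detection time, durations, thresholds) are illustrative assumptions, and returnTimeseries is the flag controlling whether the aggregated history appears in the response:

    # Reuses `session`, `project`, and the json import from the create sketch.
    query = {
        "detectionTime": "2021-06-25T12:00:00Z",           # the moment to analyze
        "slicingParams": {"dimensionNames": ["measure"]},  # slice on the dummy dimension
        "timeseriesParams": {
            "forecastHistory": "2592000s",  # look back over 30 days of events
            "granularity": "360s",          # aggregate into 360-second buckets
            "metric": "temperature",        # the dimension to test for anomalies
        },
        "forecastParams": {
            "sensitivity": 0.1,
            "noiseThreshold": 10.0,
        },
        "returnTimeseries": True,  # include the aggregated history in the response
    }

    resp = session.post(
        "https://timeseriesinsights.googleapis.com/v1/"
        f"projects/{project}/locations/us-central1/datasets/sensor-data:query",
        json=query)
    print(json.dumps(resp.json(), indent=2))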

At a high level, a Timeseries Insights API anomaly detection query checks whether any slices in the dataset deviate from their expected value at a given point in time, called the detection time. There are four top-level parameters that you need to be aware of:

  • detectionTime: the moment in time we want to analyze; an anomaly is raised if the expected value differs from the actual value of the metric you selected.
  • slicingParams: slicing controls how events are grouped into slices, and it is specified at query time. The dimensionNames parameter takes a list of dimensions used to slice your data; in the query above, we use the dummy dimension measure. Remember, currently you can only slice your data on dimensions of type string; support for slicing on numeric dimensions is planned for the future.
  • timeseriesParams: these parameters control how much data is used during anomaly detection and how events are aggregated into a time series for each slice. forecastHistory is an amount of time in seconds that indicates how far back the time series built for the anomaly query reaches. granularity represents the fixed distance between consecutive time series points; it can be lowered to capture recurring patterns, and it implicitly defines the width of the aggregation window in seconds. metric names the numeric dimension whose values are aggregated; essentially, this is the dimension the user is querying for anomalies.
  • forecastParams: these parameters control settings such as sensitivity, seasonality, and the horizon window. sensitivity specifies how sensitive the anomaly detection process is; its value must be in the (0.0, 1.0] interval. A lower value makes the process less sensitive and yields fewer anomalies, while a higher value makes it more sensitive and yields more. noiseThreshold represents the minimum difference between the expected and actual values at detection time for a slice to be reported as anomalous.

Note: there are more parameters available for the query payload. For the full list, check the Resources section; I have described only those used in the example above.

Below is how a raw event timeseries is transformed into a new timeseries in which attribute values have been aggregated over 360-second windows. For illustration purposes it shows all the attributes, but in reality the time series will only contain the dimension used in the metric field of the query payload.

an example of raw event to aggregate timeseries conversion
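For intuition, here is a hedged pandas equivalent of that bucketing (illustrative only; I take the mean per bucket here, while the API’s exact aggregation of the metric values may differ):

    import pandas as pd

    # Illustrative raw events: one temperature reading every ~25 seconds.
    events = pd.DataFrame(
        {"temperature": [22.8, 23.1, 22.9, 61.2, 62.0, 61.5]},
        index=pd.to_datetime([
            "2021-06-25 11:48:05", "2021-06-25 11:48:30", "2021-06-25 11:49:00",
            "2021-06-25 11:54:10", "2021-06-25 11:54:35", "2021-06-25 11:55:00",
        ]))

    # Aggregate into fixed 360-second buckets, as the granularity parameter does.
    series = events["temperature"].resample("360s").mean()
    print(series)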

Here is what the query result for an anomaly looks like; in it you can see the internal time series that was built. You can also turn off this history and output only the anomaly result in the payload. The values at each timestamp are the temperature aggregated over windows of the granularity set in the query payload. This can be customized per anomaly query to accommodate seasonality and business requirements.
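Since the raw response is long, here is a hedged sketch of its shape with illustrative numbers (the real response carries many more history points and, optionally, a forecast series):

    {
      "name": "projects/your-project/locations/us-central1/datasets/sensor-data",
      "slices": [
        {
          "dimensions": [{"name": "measure", "stringVal": "LTTH"}],
          "detectionPointActual": 94.5,
          "detectionPointForecast": 61.8,
          "expectedDeviation": 7.3,
          "anomalyScore": 4.5,
          "history": {
            "point": [
              {"time": "2021-06-25T11:48:00Z", "value": 61.2},
              {"time": "2021-06-25T11:54:00Z", "value": 62.0}
            ]
          }
        }
      ]
    }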

The result output carries a lot of information. First, we see the timeseries built by the API based on your query payload. We also see the forecasted and actual values of the metric at the detection time. The two main pieces of information are expectedDeviation and anomalyScore, described below; you can use these to decide whether the given detection time is anomalous for your data. Since the anomaly score here is greater than 1.0 and the difference between the expected and actual temperature values is significant, we consider this an anomaly for the given detection time.

For each query, there are a few key fields in the result:

  • detectionPointActual: the value of the timeseries attribute you queried for (metric = temperature in this case). This is the actual (aggregated) value at the detectionTime.
  • detectionPointForecast: the forecasted value of the timeseries attribute at the detectionTime.
  • expectedDeviation: indicates the confidence in the forecasted value; it specifies the expected absolute deviation from the forecast.
  • anomalyScore: you can think of this as the actual deviation between the forecasted and actual values at detectionTime, measured against the expected deviation. In other words, this score indicates how far the actual deviation is from the expected deviation; if it is high, we consider the slice an anomaly.

In general, scores lower than 1.0 reflect variations that are common given the history of the slice, while scores higher than 1.0 warrant attention, with higher scores representing more severe anomalies. This result can be consumed in downstream applications for business purposes.

Again, interpreting the anomaly score and expected deviation is relative and depends on the use case; a particular value that signals an anomaly for one use case might not for another.
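As a minimal sketch of that downstream use, treating the 1.0 rule of thumb as a configurable threshold (resp is the query response from the earlier sketch; the threshold itself is a business assumption, not an API constant):

    def is_anomalous(slice_result: dict, score_threshold: float = 1.0) -> bool:
        """Flag a slice whose anomalyScore exceeds a use-case specific threshold."""
        return slice_result.get("anomalyScore", 0.0) > score_threshold

    # e.g. alert on every anomalous slice in the query response
    for s in resp.json().get("slices", []):
        if is_anomalous(s, score_threshold=1.0):
            print("Anomaly:", s["dimensions"], "score:", s["anomalyScore"])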

As you can see, it is quick and easy to do real-time anomaly detection without any training effort or custom modeling. Once the dataset is ready, users can immediately analyze it for anomaly detection and large-scale, low-latency forecasting.

In this tutorial, we covered how to create a dataset from historical data and how to query it for anomalies and time series forecasts. In the next part of this blog, I will cover how to add data to an existing dataset in a real-time streaming fashion and perform detection on it. Stay tuned for Part II.

Resources
