Anomaly Detection on Transactional Data Streams

Stefanos Doltsinis
Kaizen-Gaming
Apr 14, 2020
Vectors from Freepik by Freepik, brgfx, and macrovector

In the online gaming industry, we receive a vast amount of transactions of all sorts every day. The core business revolves around transactions that can reach tens of thousands per minute, generated by numerous sport events. Offering a crisp product without incidents (e.g. game malfunctions, wrong settlements) in such a large transaction flow is an everyday challenge for the industry, one that requires skilled and focused people constantly monitoring every operation. To support this process, we use machine learning as an alarming mechanism that identifies abnormal situations in transactional streams.

Why Machine Learning?

If ten experts are asked to describe the characteristics of a transactional incident, it is very probable that they will provide ten different answers. This shows how subjective the identification of an abnormal situation is, and how difficult it is to define an incident with specific rules. Even though everyone has a different perspective of the process, they all identify more or less the same incidents, and most importantly, they all share the same data: the same data we should also use to train a machine learning model.

In machine learning, the aim is to generate a mathematical model (training) based on previous experience (data) that can be used to characterise (inference) new cases. Machine learning models can automate repetitive tasks with high accuracy. However, their success depends on data quality, since a trained model reflects the dynamics of its data and can only perform as well as that data allows.

Let’s be more specific…

At Stoiximan, we have developed an Anomaly Detection System that monitors data streams and alerts on abnormalities. The mechanism is common across streams and combines an unsupervised model for multivariate anomaly detection, a data drift monitoring system to minimise false alarms, and a retraining process to adjust to variations in the incoming data.

  • Anomaly detection

Anomaly detection is an area of machine learning where the aim is to identify abnormalities in predominantly "normal" data. This is a challenging task, since the only information one can be sure of is what normality looks like. The abnormality is often unknown or only partially known. The sparsity of abnormal data makes classification approaches difficult to implement, since the class imbalance can become extreme. Even if the balancing problem were solved, abnormalities are by nature unexpected situations, and full knowledge of the incident class is difficult to obtain.

  • Learning the unknown

By definition, machine learning is based on historical data: data that characterise something already known. Can we, however, learn something that is not represented within our data set? Can we learn to identify the unknown? The answer is as simple as a straight yes, but as tricky as the question itself!

If we can learn something very well, we can distinguish with certainty anything dissimilar. In other words, anything different to normal is abnormal.

If we train a binary classifier while only having partial knowledge of the abnormal class (incidents), it might clearly distinguish the two classes in the training set (Figure 1a), but it will not generalise well and will generate several false alarms on incoming abnormalities not seen before (Figure 1b). A way around this is to use one-class algorithms: fit a model to normal transaction data and identify anything outside its boundaries as abnormal (Figure 1c).

Figure 1 2D feature space in a) binary classification set-up, b) with partial knowledge of the incidents (abnormalities) and c) one class fit.
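
To make the one-class idea concrete, here is a minimal sketch using scikit-learn's OneClassSVM; the feature values and the nu parameter are illustrative assumptions, not production settings. The model is fitted on normal data only, and anything falling outside the learned boundary is flagged.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)

# Fit on "normal" transactions only: two hypothetical features,
# e.g. transaction volume and average stake.
X_normal = rng.normal(loc=[100.0, 10.0], scale=[15.0, 2.0], size=(1000, 2))

# nu bounds the fraction of training points treated as outliers.
model = OneClassSVM(kernel="rbf", nu=0.01, gamma="scale").fit(X_normal)

# Score unseen points: +1 = inside the learned boundary, -1 = outside.
X_new = np.array([[105.0, 9.5],    # looks normal
                  [300.0, 50.0]])  # far from the training cloud
print(model.predict(X_new))        # e.g. [ 1 -1 ]
```
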
  • Delving into the structure — Isolation Forest

In our approach we use Isolation Forest (IF), a one-class algorithm based on decision trees. In IF, the generated trees aim to isolate every data point by partitioning the feature space. The assumption is that an anomalous data point differs strongly from normal data and is therefore more susceptible to isolation in the feature space: a shorter path is required to isolate it than to isolate a normal point.

Figure 2 2D feature space and the required partitions to isolate a normal data point (xi) and a data point (x0) representing an abnormality. (image source)

Several trees are fitted to the training data set until all points are isolated, and an anomaly score is generated based on the mean number of splits each tree requires to reach a point. Normal data form a dense region, so reaching a point inside it needs many splits, whereas anomalous cases are reached in far fewer. IF has the advantage of a global view of the data, which makes it robust to normality that consists of several subclasses.
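
A minimal sketch of this with scikit-learn's IsolationForest; the data and hyperparameters here are illustrative assumptions, not our production configuration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Normal operation data: a dense cluster in feature space.
X_train = rng.normal(loc=0.0, scale=1.0, size=(5000, 4))

# contamination sets the expected share of anomalies, which in turn
# fixes the decision threshold on the anomaly score.
forest = IsolationForest(n_estimators=100, contamination=0.01,
                         random_state=0).fit(X_train)

X_new = np.vstack([rng.normal(size=(3, 4)),  # normal-looking points
                   np.full((1, 4), 8.0)])    # far away, easy to isolate

# score_samples is derived from the mean path length across the trees:
# higher means harder to isolate, i.e. more "normal".
print(forest.score_samples(X_new))
print(forest.predict(X_new))  # +1 normal, -1 anomaly
```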

Time series and data streams

Time has a special meaning in our world, and the same applies in machine learning. A time series has significant properties that affect the way one should implement machine learning algorithms: trend, seasonality, stationarity, etc. are issues that need to be considered before any modelling process.

  • Chasing the stream

It is common to train and test models in a static set-up, assuming a well-known normal class. However, time series and streams evolve and exhibit time-varying characteristics that can affect predictions in different ways. Training over a long period aims at learning those characteristics and can address such issues if a large historical data set exists. The time dependence can then be transformed into time-based features and incorporated into the feature vector, so that instances of the stream can be statically processed as independent data points in the feature space.
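
As an example of such a transformation, here is a sketch over hypothetical one-minute aggregates of a transaction stream; the window sizes and feature choices are assumptions for illustration.

```python
import pandas as pd

# Hypothetical one-minute aggregates of a transaction stream.
df = pd.DataFrame({
    "ts": pd.date_range("2020-04-14", periods=6, freq="min"),
    "tx_count": [120, 132, 128, 540, 130, 125],
})

# Encode the time dependence as explicit features...
df["hour"] = df["ts"].dt.hour
df["dayofweek"] = df["ts"].dt.dayofweek

# ...and capture recent local behaviour with rolling statistics.
df["roll_mean_3"] = df["tx_count"].rolling(3, min_periods=1).mean()
df["roll_std_3"] = df["tx_count"].rolling(3, min_periods=1).std().fillna(0.0)

# Each row is now an independent point in feature space.
features = df[["tx_count", "hour", "dayofweek", "roll_mean_3", "roll_std_3"]]
print(features)
```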

  • Being outdated — Data drift

It is also important to cater for upcoming changes in the normal class, either because normality is defined differently as time goes by, or because the incoming stream itself changes. These two situations are known in machine learning as concept drift and data drift (covariate shift), respectively. A concept drift will lead to several false alarms unless the model is re-evaluated and retrained. Covariate shift is the situation where the features' distributions have changed. Statistical approaches can identify such cases through hypothesis testing on the features' distributions of the incoming stream, but this can be tricky if the distribution types are not known in advance.
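
For instance, a non-parametric two-sample Kolmogorov-Smirnov test avoids assuming a distribution type, although it checks one feature at a time. A sketch, where the significance level is an arbitrary choice:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
reference = rng.normal(0.0, 1.0, size=2000)  # a feature at training time
incoming = rng.normal(0.5, 1.0, size=2000)   # same feature, live window

# Null hypothesis: both samples come from the same distribution.
stat, p_value = ks_2samp(reference, incoming)
if p_value < 0.01:
    print(f"covariate shift suspected (KS={stat:.3f}, p={p_value:.1e})")
```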

To identify a potential covariate shift we use an empirical approach: a classifier is trained to distinguish between training and test data. First, we use a time series split to train our anomaly detection model and tune it on known incidents. When the model goes live, an incoming window of data is periodically checked for a potential shift: a binary classifier is trained to separate the data set used to set up the anomaly detection model from the incoming data. If the two distributions are similar, the classifier should show no separability between the two data sets and will produce poor classification performance (an AUC score close to 0.5). If a covariate shift is detected, a retraining process is triggered. This process can be automated and reduces false positives caused by data drift.
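
A sketch of this check, sometimes called a classifier two-sample test; the classifier choice and the AUC threshold are assumptions for illustration, not our tuned settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def covariate_shift_detected(X_train, X_incoming, auc_threshold=0.75):
    """Train a classifier to tell the two windows apart; if it can,
    their feature distributions differ and a retrain is warranted."""
    X = np.vstack([X_train, X_incoming])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_incoming))])
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    # AUC near 0.5 means the two windows are indistinguishable.
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    return auc > auc_threshold

rng = np.random.default_rng(2)
ref = rng.normal(0.0, 1.0, size=(1000, 4))
live = rng.normal(0.6, 1.0, size=(1000, 4))  # shifted stream
print(covariate_shift_detected(ref, live))   # True -> trigger retraining
```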

Figure 3 Data drift concept

Challenges / Takeaways

  • Automating a monitoring task is something machine learning can do well. However, it is not as straightforward when done in real time. A monitoring mechanism should always be implemented so that the model can be retrained and kept up to date.
  • Anomaly is a matter of perspective. Something that seems normal today might become abnormal tomorrow, and vice versa. Regular checks should be included to understand whether a case that ostensibly seems abnormal is due to a data drift or not.
  • Labelling is a major issue in such cases. Even when incidents are common, it is difficult to pinpoint the time of occurrence and the exact duration, so assigning a crisp label is rarely possible. Unsupervised approaches, such as Isolation Forest, train on normal operation data and can identify any diverging state.
  • Every stream has different properties (class imbalance, univariate vs. multivariate anomalies, etc.) and therefore forms a different problem to solve. One solution cannot suit all cases; nevertheless, we are building our anomaly detection suite to quickly identify unexpected situations.

References

F. T. Liu, K. M. Ting and Z. Zhou, “Isolation Forest,” 2008 Eighth IEEE International Conference on Data Mining, Pisa, 2008, pp. 413–422.
