Anomaly Detection in Finance #2

Mehrdad Mamaghani
Published in Swedbank AI
5 min read · Sep 25, 2019



In the previous post we discussed some of our thoughts on anomaly detection and how aspects such as business needs or compliance can influence the final choice of modeling technique.

In this post, we will have a closer look at one-class classification techniques and how one particular idea behind such models can help to detect anomalies in a set of observations. More specifically, we focus on penalization — in the literature also referred to as regularization — as a means to facilitate anomaly detection.

Despite their somewhat counter-intuitive name, one-class classification models can be of great use in anomaly detection. The idea is to train a model on a single class, the normal samples, and measure the goodness-of-fit when this model scores new observations. Clearly, since the model has been subjected to only one class during training, most goodness-of-fit metrics will fail to be particularly informative when facing deviating observations.

To this end, penalization techniques, often a remedy to the overfitting problem, can help us create models with more informative fitness measures. We should add that our preference for the term penalization over regularization stems from its more specific and less generic association with the task at hand.

There are two well-known types of penalization in statistical models: the L1 and the L2 penalization. Generally, the goal of penalization is to create an extra cost in the training process via the loss function. A straightforward training process is focused solely on estimating model parameters that best mimic the provided training data. Penalization, however, introduces a cost on the extent to which parameters should be fit to the training data. The main difference between the L1 and the L2 penalization types is their linear and quadratic nature, respectively.

To get more concrete, consider a loss function where a sum of squares measures the deviation between the empirical values (y) and the function/model output:

L(β) = Σ_i (y_i - f(x_i))²

Penalization adds an extra term to the loss function so that:

L_pen(β) = Σ_i (y_i - f(x_i))² + λ·P(β)

As seen above, the influence of the penalization term P(β) is governed by the penalization parameter lambda (λ). In its simplest form, where the beta (β) parameters represent coefficients in a regression model, each type of penalization adds the following penalty to the loss term, so that the larger the coefficients, the larger the total loss:

L1: P(β) = Σ_j |β_j|
L2: P(β) = Σ_j β_j²
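These definitions can be sketched in a few lines of plain Python; all names here are illustrative and no ML library is assumed:

```python
# Minimal sketch of a sum-of-squares loss with an optional L1 or L2 penalty.
# Illustrative only: function names and values are ours.

def sse_loss(y, y_hat):
    """Sum of squared deviations between observed and fitted values."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))

def penalized_loss(y, y_hat, betas, lam, kind="l2"):
    """SSE plus lambda times an L1 (sum |beta|) or L2 (sum beta^2) penalty."""
    if kind == "l1":
        penalty = sum(abs(b) for b in betas)
    else:
        penalty = sum(b ** 2 for b in betas)
    return sse_loss(y, y_hat) + lam * penalty

y, y_hat = [1.0, 2.0, 3.0], [1.1, 1.9, 3.2]
small, large = [0.5, -0.5], [5.0, -5.0]
# Larger coefficients mean a larger total loss at the same fit quality:
assert penalized_loss(y, y_hat, small, 0.1) < penalized_loss(y, y_hat, large, 0.1)
```

Note that the penalty depends only on the coefficients, not on the fit itself, which is exactly how it trades goodness-of-fit against model complexity.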

Well-known applications of penalization as demonstrated above are shrinkage and feature selection techniques. Lasso (L1), ridge (L2) and elastic net regression (the latter being a weighted mix of the former two) are widely used to prevent overfitting and to guide variable selection (especially in the presence of correlated variables).

Circling back to anomaly detection, it is now probably clear how a penalization regime can be of use in one-class classification models. By introducing penalization, one-class models are faced with a cost when fitting parameters to a trivial one-class problem, hence losing the ability to deliver a “perfect” fit. This means that we create a meaningful distribution of residuals/goodness-of-fit metrics that can be used to delineate whether an observation is an anomaly or not.
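As a toy illustration of this residual idea, consider fitting a trivial "model" (the mean) on normal observations only and flagging new points whose residual falls far outside the training residual distribution. The names and the 3-sigma threshold below are our own illustrative choices, not a method from this post:

```python
# Hypothetical sketch: score anomalies by how far a point's residual lies
# from the residual distribution observed on normal training data.

def fit_normal(train):
    """Fit the mean and the standard deviation of the training residuals."""
    mu = sum(train) / len(train)
    residuals = [x - mu for x in train]
    sd = (sum(r ** 2 for r in residuals) / len(train)) ** 0.5
    return mu, sd

def is_anomaly(x, mu, sd, k=3.0):
    """Flag x if its residual exceeds k standard deviations (k is a choice)."""
    return abs(x - mu) > k * sd

mu, sd = fit_normal([9.8, 10.1, 10.0, 9.9, 10.2])
assert not is_anomaly(10.05, mu, sd)
assert is_anomaly(14.0, mu, sd)
```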

Concretely, penalization for one-class models can be employed in logistic regression, decision trees or deep learning models. A few related regression methods have already been mentioned above. For tree-based models, penalization can take the form of a cost on tree depth. For the remainder of this post, however, we will have a look at how penalization can help in one-class deep learning models.

Penalization in one-class deep learning model APIs

Besides dropout as a tool to fight overfitting, the PyTorch, TensorFlow and Keras APIs provide a number of ways to incorporate penalization. For the latter two, these capabilities exist in both the Python and R libraries and will become increasingly homogenized in future TensorFlow releases. For example, the Keras API provides the l1, l2 and l1_l2 regularization functions, which in R can be called in the following way:

regularizer_l1(l = 0.01)
regularizer_l2(l = 0.01)
regularizer_l1_l2(l1 = 0.01, l2 = 0.01)

These functions can then be inserted as values for the regularizer arguments in a layer call:

layer_dense(units = …,
activation = …,
name = …,
activity_regularizer = …,
kernel_regularizer = …,
bias_regularizer = …)
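Under the hood, each regularizer simply contributes a scalar penalty to the training loss. Here is a minimal pure-Python mimic of what the l1, l2 and l1_l2 regularizers compute for a set of weights (illustrative only; the real API operates on tensors):

```python
# Sketch of the scalar penalties behind the Keras regularizer objects.
# Names mirror the API loosely; the implementations are our own mimic.

def l1_penalty(weights, l=0.01):
    """lambda * sum of absolute weights."""
    return l * sum(abs(w) for w in weights)

def l2_penalty(weights, l=0.01):
    """lambda * sum of squared weights."""
    return l * sum(w ** 2 for w in weights)

def l1_l2_penalty(weights, l1=0.01, l2=0.01):
    """Elastic-net style combination of the two penalties above."""
    return l1_penalty(weights, l1) + l2_penalty(weights, l2)

w = [0.5, -1.5, 2.0]
# e.g. l1_penalty(w) = 0.01 * (0.5 + 1.5 + 2.0) = 0.04
```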

The existence of three different types of regularizer arguments can be a little overwhelming. Let us take a step back and try to dissect what they stand for using a setup similar to the formulae above:

The kernel_regularizer (in previous versions called the weight_regularizer) influences the complexity of the layer weights, the beta parameters: the larger the penalization on the weights, the closer they will be to zero (or flat out zero depending on the type of regularization). This is likely to lead to exclusion of some features as well as larger biases.

The bias_regularizer intends to penalize the size of biases, the epsilon term. Seen through the bias-variance tradeoff, this is likely to lead to increased model variance at the cost of decreased bias.

Finally, the activity_regularizer penalizes the layer output itself, i.e. the entire right-hand side of the equation above, combining both the weight and bias terms. The likely outcome of this penalization is a more compact, homogeneous distribution of output values.
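To summarize where each argument acts, here is a hypothetical pure-Python sketch of a dense layer y = W·x + b with the three penalties computed side by side (names, values and the quadratic penalty form are our own illustrative choices):

```python
# Where each regularizer argument applies in a dense layer y = W.x + b:
# kernel_regularizer penalizes W, bias_regularizer penalizes b,
# activity_regularizer penalizes the layer output y itself.

lam = 0.01
W = [0.8, -0.3]   # layer weights (the "kernel")
b = 0.5           # layer bias
x = [1.0, 2.0]    # layer input

y = sum(wi * xi for wi, xi in zip(W, x)) + b   # layer output: 0.8 - 0.6 + 0.5

kernel_pen   = lam * sum(wi ** 2 for wi in W)  # acts on the weights W
bias_pen     = lam * b ** 2                    # acts on the bias b
activity_pen = lam * y ** 2                    # acts on the output y
```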

The question of which sort or which combination of penalizations one should use depends largely on the type of data, anomalies, pre-processing steps, as well as the modeling pipeline. As penalization impacts the bias-variance tradeoff, good performance in one-class models is about finding the right balance between different elements in the fitting process.

As always, good modeling practice guides us to start with a simple architecture and a parsimonious addition of hyperparameters: a limited amount of data, one or two layers, and one conservatively weighted type of regularizer at a time.

This was a quick take on penalization and how it can be used in one-class classification models designed to detect anomalies. For a comprehensive treatment of the bias-variance tradeoff, penalization and its usage in statistical models, we refer the reader to The Elements of Statistical Learning (Hastie et al.).

Be sure to tune in for the coming posts where we continue our discussion on this topic from other viewpoints!
