<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Christiaan Defaux on Medium]]></title>
        <description><![CDATA[Stories by Christiaan Defaux on Medium]]></description>
        <link>https://medium.com/@cdefaux?source=rss-4b3b1031f27------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>Stories by Christiaan Defaux on Medium</title>
            <link>https://medium.com/@cdefaux?source=rss-4b3b1031f27------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Fri, 15 May 2026 09:02:18 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@cdefaux/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[An overview of fraud/anomaly detection algorithms (machine learning and numerical methods)]]></title>
            <link>https://medium.com/@cdefaux/a-comprehensive-overview-of-fraud-anomaly-detection-algorithms-machine-learning-and-numerical-6e5f7a971ef6?source=rss-4b3b1031f27------2</link>
            <guid isPermaLink="false">https://medium.com/p/6e5f7a971ef6</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[finance]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[fraud-detection]]></category>
            <category><![CDATA[anomaly-detection]]></category>
            <dc:creator><![CDATA[Christiaan Defaux]]></dc:creator>
            <pubDate>Sun, 10 Oct 2021 19:18:41 GMT</pubDate>
            <atom:updated>2021-10-12T02:36:50.174Z</atom:updated>
            <content:encoded><![CDATA[<h3>Outlier/Anomaly Detection for Fraud</h3><p>Generally, an anomaly is defined as a data pattern that does not conform to normal behavior; anomaly detection is therefore the problem of finding patterns in data that do not conform to expected behavior. Anomaly detection can be used in a wide variety of applications such as fraud detection for credit cards, insurance, or health care, intrusion detection for cybersecurity, military surveillance, and many others. Anomalies can be classified into three categories:</p><ol><li><strong>Point anomaly</strong>: an individual data instance is anomalous with respect to the rest of the data.</li><li><strong>Contextual anomaly</strong>: a data instance is anomalous only within a specific context.</li><li><strong>Collective anomaly</strong>: a collection of data instances is anomalous with respect to the entire data set, even though the individual instances might not be considered anomalous on their own.</li></ol><p>Put simply, the goal of anomaly detection is to define a region representing normal behavior and declare any observation that does not belong to this normal region an anomaly. Although this may sound easy, anomaly detection has certain challenges that make these problems not so easy to solve. 
Some of the challenges are as follows:</p><ol><li>Defining a normal region that encompasses every possible normal behavior is very difficult since the boundary between normal and anomalous data is often not precise.</li><li>Malicious applications often make anomalous data appear normal, thereby making the task of defining normal data more difficult.</li><li>Availability of labeled data for training/validation purposes is often an issue.</li><li>Data often contains noise that tends to be similar to actual anomalies, making it difficult to distinguish and remove.</li></ol><p>There are other challenges in detecting money laundering patterns:</p><ul><li><strong>Massive class imbalance — </strong>transactions labeled as suspicious may be less than 0.0001% of total historical transactions.</li><li><strong>Non-stationarity — </strong>new money-laundering schemes are constantly being invented. To identify new patterns as they appear, techniques must be able to adapt themselves or be easily adapted.</li></ul><h3>Anomaly Detection Approaches</h3><h3>Rules-based</h3><p>Existing approaches to identifying fraud and money laundering rely on databases of human-engineered rules that attempt to match patterns indicative of fraud. As new fraud schemes are identified, new rules are added to the rule engines. 
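Such rule engines are, at their core, collections of threshold predicates evaluated against each transaction. A hypothetical, minimal sketch (the field names and thresholds are illustrative, not from any real system):

```python
# Hypothetical sketch of a rules-based screen: each rule is a predicate
# over a transaction dict; any match flags the transaction for review.
RULES = {
    "large_cash": lambda t: t["type"] == "cash" and t["amount"] >= 10_000,
    "just_under_threshold": lambda t: t["type"] == "cash" and 9_000 <= t["amount"] < 10_000,
    "rapid_fire": lambda t: t["tx_count_24h"] > 20,
}

def screen(transaction):
    """Return the names of all rules the transaction trips."""
    return [name for name, rule in RULES.items() if rule(transaction)]

tx = {"type": "cash", "amount": 9_500, "tx_count_24h": 3}
print(screen(tx))  # → ['just_under_threshold']
```

In a production engine each rule would also carry metadata (severity, owner, review queue), but the matching logic stays this simple.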
For example, in money laundering, there are well-known patterns of smurfing money across many accounts and aggregating that money at hubs using small, under-the-radar transactions for later spending.</p><p>Given enough historical financial transaction data, model-based approaches are better at pattern matching than rule-based approaches, as they can generalize to learn fraud schemes that resemble existing ones.</p><h3>Distance-based</h3><p>Points that are farther from others are considered more anomalous:</p><ul><li><strong>K-nearest neighbor</strong> <strong>(k-NN)</strong> is a simple, non-parametric lazy learning technique used to classify data based on similarities in distance metrics such as Euclidean, Manhattan, Minkowski, or Hamming distance.</li><li><strong>Local Outlier Factor (LOF)</strong>: This concept is based on a distance metric called reachability distance.</li></ul><h3>Density-based</h3><p>Points that are in relatively low-density regions are considered more anomalous:</p><ul><li><strong>K-means</strong>: a widely used clustering algorithm. It creates ‘k’ similar clusters of data points. Data instances that fall outside of these groups could potentially be marked as anomalies.</li><li><strong>DBSCAN and HAC</strong> — variations of clustering algorithms that could potentially be tested.</li></ul><p>Density-based algorithms assume that the neighbors of a data point have similar density. If some of a point’s neighbors lie in one cluster and others lie in another cluster with a different density, then comparing the point’s density against all of its neighbors may lead to the wrong conclusion, and real outliers may go unrecognized.</p><p>The notion of density does not work well for special datasets, for example, if all data points lie on a single straight line. 
Even if every instance in a dataset is equidistant from its closest neighbors, instances may still have different densities depending on their placement in the dataset.</p><h3>Rank-based</h3><p>The most anomalous points are those whose nearest neighbors have others as nearest neighbors:</p><ul><li><strong>RBDA</strong></li><li><strong>NC-Clustering</strong></li></ul><p>The idea of rank is borrowed from the research literature on social networks. Detecting outliers in a given data set resembles finding the most unpopular person in a given social network. To discover the relative popularity of a person, Bob, in a social network, we may ask him, “Who are your k best friends?”. Then we may ask all “friends” of Bob the same question. If Bob is not listed among the close friends of his friends, clearly he is not a popular person. A summary of the answers from all k persons whom Bob identifies as his friends allows us to draw a clear conclusion about his popularity. Just as we can use the concept of relative popularity in friend networks, we can use a similar notion of rank to capture the outlierness of a point.</p><p>The performance of an outlier detection algorithm based on rank alone is highly influenced by cluster density variations. Furthermore, by definition, ranks use the relative distances and ignore the ‘true’ distances between the observations. 
This motivates the development of new outlier detection algorithms that utilize rank as well as distance information.</p><h3>Types of Anomaly Detection</h3><h3>Unsupervised</h3><p>Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal, by looking for instances that seem to fit least well with the remainder of the data set.</p><p>Examples of unsupervised anomaly detection algorithms include:</p><ul><li>K-Means</li><li>Self-organizing maps (SOM)</li><li>C-means</li><li>Expectation-maximization meta-algorithm</li><li>One-class support vector machine</li><li>Isolation Forests</li></ul><h3>Supervised</h3><p>Supervised anomaly detection techniques require a data set whose instances have been labeled as “normal” or “abnormal” and involve training a classifier (the key difference from many other statistical classification problems is the inherently unbalanced nature of outlier detection).</p><p>Examples of supervised anomaly detection algorithms include:</p><ul><li>Supervised Neural Networks</li><li>Support Vector Machines</li><li>K-Nearest Neighbor</li><li>Bayesian Network</li><li>Decision Trees</li></ul><h3>Semi-Supervised</h3><p>Semi-supervised anomaly detection techniques construct a model representing normal behavior from a given normal training data set and then test the likelihood of a test instance being generated by the learned model.</p><ul><li>Generative Adversarial Networks (GANs)</li></ul><h3>Evaluation Metrics</h3><p>An important aspect of any anomaly detection technique is the manner in which the anomalies are reported. Typically, the outputs produced by anomaly detection techniques are one of the following two types:</p><ul><li><strong>Scores </strong>— Scoring techniques assign an anomaly score to each instance in the test data depending on the degree to which that instance is considered an anomaly. 
Thus, the output of such techniques is a ranked list of anomalies. An analyst may choose to either analyze the top few anomalies or use a cut-off threshold to select the anomalies.</li><li><strong>Labels </strong>— Techniques in this category assign a label (normal or anomalous) to each test instance. Scoring-based anomaly detection techniques allow the analyst to use a domain-specific threshold to select the most relevant anomalies. Techniques that provide binary labels to the test instances do not directly allow the analysts to make such a choice, though this can be controlled indirectly through parameter choices within each technique.</li></ul><p>When evaluating the performance of anomaly detection algorithms, deciding on metrics is often difficult, and the problem is compounded when comparing supervised and unsupervised methods. Where there is no labeled data, the classic supervised metrics (ROC curves, AUC, precision-recall, etc.) do not apply.</p><p>When evaluating unsupervised models, no labels are involved, so it is not possible to measure the outcome in terms of accuracy or other related metrics. Most unsupervised algorithms are clustering-based, which lend themselves to an entirely different raft of metrics.</p><p>Internal validation methods make it possible to establish the quality of the clustering structure without having access to external information (i.e., they are based on the information provided by the data used as input to the clustering algorithm). In general, two types of internal validation metrics can be combined: cohesion and separation measures. 
Cohesion evaluates how close the elements of the same cluster are to each other, while separation measures quantify the level of separation between clusters.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/649/0*unsQ-6kx7sabDKJX.png" /><figcaption>Figure 1: illustration of cohesion (left) and separation (right).</figcaption></figure><h3>Metrics and Measures of Outcome</h3><p>When an anomaly detection algorithm is applied, three possible cases need to be considered:</p><ol><li>Correct Detection: Detected abnormalities in the data correspond exactly to abnormalities in the process.</li><li>False Positives: The process continues to be normal, but unexpected data values are observed, e.g., due to intrinsic system noise.</li><li>False Negatives: The process becomes abnormal, but the consequences are not registered in the data, e.g., because the signal of the abnormality is insufficiently strong compared to the noise in the system.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/936/1*zL0BxT2YSzU7LfWHrRTf8w.png" /><figcaption>Figure 2: confusion matrix for fraud detection purposes.</figcaption></figure><p>Figure 2 shows the confusion matrix of a financial fraud binary classifier. For problems such as money laundering, false negatives should be weighted significantly more heavily. A weighted variant of the F1 score should be used to evaluate models: precision, recall, and fallout should not be weighted equally.</p><h3>Evaluation of Approaches</h3><p><strong>Supervised</strong> modeling has the drawback that it requires “absolute certainty” that each event can be accurately classified as fraud or non-fraud. In addition, models of fraud can be used to detect only types of fraud that have been identified previously. Therefore, supervised learning may not be the most efficacious approach to this problem.</p><p><strong>Unsupervised</strong> techniques do not require labeled training data, and thus are the most widely applicable. 
The techniques in this category make the implicit assumption that normal instances are far more frequent than anomalies in the test data. If this assumption is not true, such techniques suffer from a high false alarm rate. Statistical classification as fraud by unsupervised methods does not prove that certain events are fraudulent; it only suggests that these events are probably fraudulent and suitable for further investigation.</p><p><strong>Semi-Supervised</strong> anomaly detection techniques assume that the training data has labeled instances for only the normal class. Since they do not require labels for the anomaly class, they are more widely applicable than supervised techniques. For example, in spacecraft fault detection, an anomaly scenario would signify an accident, which is not easy to model. The typical approach used in such techniques is to build a model for the class corresponding to normal behavior, and use the model to identify anomalies in the test data. A limited set of anomaly detection techniques exists that assumes the availability of only the anomaly instances for training. Such techniques are not commonly used, primarily because it is difficult to obtain a training data set which covers every possible anomalous behavior that can occur in the data.</p><h3>Distance Metrics</h3><p>Distance measures depend greatly upon the data and the space in which the data exists. Certain coordinate spaces lend themselves to certain distance metrics, and the right choice can change a badly performing algorithm into one that accurately models the data. Some useful distance metrics are:</p><p><strong><em>Euclidean Distance </em></strong><em>— </em>The most common distance metric. 
It uses the Pythagorean theorem to find the distance between two points in Cartesian coordinate space.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/422/1*JrWmrM1uH2gFzglV9oHN4w.png" /><figcaption>Equation 1: formula for Euclidean distance between two points x and y</figcaption></figure><h3>Disadvantages</h3><ul><li>Euclidean distance is not scale-invariant, which means that computed distances might be skewed depending on the units of the features.</li><li>One typically needs to normalize the data before using this distance measure.</li><li>As the dimensionality of the data increases, Euclidean distance becomes less useful. This is due to the curse of dimensionality: higher-dimensional spaces do not behave as we would intuitively expect from 2- or 3-dimensional space.</li></ul><h3>Advantages</h3><ul><li>Euclidean distance works great when you have low-dimensional data and the magnitude of the vectors is important.</li><li>Methods like k-NN and HDBSCAN show great results out of the box if Euclidean distance is used on low-dimensional data.</li><li>It is incredibly intuitive to use, simple to implement, and shows great results in many use-cases.</li></ul><p><strong><em>Manhattan Distance</em></strong><em> — </em>The Manhattan distance, often called Taxicab distance or City Block distance, calculates the distance between real-valued vectors. Imagine vectors that describe objects on a uniform grid such as a chessboard. Manhattan distance then refers to the distance between two vectors if they could only move at right angles. 
There is no diagonal movement involved in calculating the distance.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/418/1*JKIcx7nntKCOcmW51I-vOg.png" /><figcaption>Equation 2: formula for Manhattan distance between two points x and y.</figcaption></figure><h3>Advantages</h3><ul><li>Useful when the dataset has discrete and/or binary attributes.</li><li>Manhattan distance seems to work quite well since it takes into account the paths that could realistically be taken within the values of those attributes.</li><li>Works on high-dimensional data.</li></ul><h3>Disadvantages</h3><ul><li>Although Manhattan distance seems to work okay for <a href="https://www.quora.com/What-is-the-difference-between-Manhattan-and-Euclidean-distance-measures">high-dimensional data</a>, it is somewhat less intuitive than Euclidean distance, especially when used on high-dimensional data.</li><li>Likely to give a higher distance value than Euclidean distance since it does not give the shortest path possible.</li></ul><p><strong><em>Minkowski Distance</em></strong><em> </em>— A more intricate metric than most. It is used in normed vector spaces (n-dimensional real spaces), that is, in any space where distances can be represented as a vector that has a length.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/506/1*9KQJI8uxZf2cqGFVcfywiw.png" /><figcaption>Equation 3: formula for the Minkowski metric between points x and y</figcaption></figure><p>The Minkowski metric can be seen as a generalization of some of the measures mentioned above. Depending on the parameter <em>p,</em> the equation decomposes into the more fundamental measures, so we can use this parameter to make the metric closely resemble others. 
Common values of p are:</p><ul><li>p= 1 — Manhattan distance</li><li>p= 2 — Euclidean distance</li><li>p=∞ — Chebyshev distance</li></ul><h3>Disadvantages</h3><ul><li>Minkowski distance has the same disadvantages as the distance measures it represents, so a good understanding of metrics like Manhattan, Euclidean, and Chebyshev distance is extremely important.</li><li>Moreover, the parameter p can be troublesome to work with, as finding the right value can be quite computationally inefficient depending on your use-case.</li></ul><h3>Advantages</h3><ul><li>The upside of <em>p</em> is the possibility of iterating over it to find the distance measure that works best.</li><li>It allows a huge amount of flexibility in your distance metric, which can be a huge benefit if you are closely familiar with p and many distance measures.</li></ul><p><strong><em>Hamming Distance</em></strong> — The number of positions at which two vectors differ. It is typically used to compare two binary strings of equal length. It can also be used to measure how similar two strings are by counting the number of characters that differ between them.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/468/1*HemY5WtTqqHXuD_nHZNqZw.png" /><figcaption>Figure 3: image illustrating the method of calculating the Hamming distance between two vectors</figcaption></figure><h3>Advantages</h3><ul><li>Typical use cases include error correction/detection when data is transmitted over computer networks. It can be used to determine the number of distorted bits in a binary word as a way to estimate error.</li><li>Moreover, you can also use Hamming distance to measure the distance between categorical variables.</li></ul><h3>Disadvantages</h3><ul><li>Difficult to use when two vectors are not of equal length.</li><li>Does not take the actual values into account; it only measures whether they are different or equal. 
Therefore, it is not advised to use this distance measure when magnitude is important.</li></ul><p><strong><em>Mahalanobis Distance</em></strong> — A measure of the distance between a point P and a distribution D. It is a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D. This distance is zero if P is at the mean of D, and grows as P moves away from the mean along each principal component axis. If each of these axes is re-scaled to have unit variance, then the Mahalanobis distance corresponds to the standard Euclidean distance in the transformed space. The Mahalanobis distance is thus <a href="https://en.wikipedia.org/wiki/Unitless">unitless</a>, scale-invariant, and takes into account the <a href="https://en.wikipedia.org/wiki/Correlations">correlations</a> of the data set.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FbRa3W-oXNwq4vbk8amU_A.png" /><figcaption>Equation 4: formula for the Mahalanobis measure of distance from observation x to a set of observations with mean µ and covariance matrix S.</figcaption></figure><h3>Advantages</h3><ul><li>Automatically accounts for the scaling of the coordinate axes inherent in Euclidean distance</li><li>Corrects for correlation between the different features</li><li>It can provide curved as well as linear decision boundaries</li></ul><h3>Disadvantages</h3><ul><li>Although Mahalanobis distance is included with many popular statistics packages, some authors question the reliability of results.</li><li>A major issue with the Mahalanobis distance is that the inverse of the covariance matrix is needed for the calculations. This inverse does not exist if the matrix is singular, e.g., when variables are perfectly correlated.</li></ul><h3>Neighbor/Distance-based Algorithms</h3><p>Distance-based models run into trouble when various scales are involved, wherein certain clusters may be more or less tightly packed. 
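The distance metrics surveyed above can be sketched in a few lines of plain Python (a minimal illustration; Mahalanobis is omitted since it additionally requires estimating a covariance matrix):

```python
import math

def euclidean(x, y):
    # p = 2 case of Minkowski: straight-line distance
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # p = 1: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):
    # generalizes both of the above via the parameter p
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def hamming(x, y):
    # number of positions at which the sequences differ
    return sum(a != b for a, b in zip(x, y))

x, y = (0.0, 0.0), (3.0, 4.0)
print(euclidean(x, y))                # → 5.0
print(manhattan(x, y))                # → 7.0
print(minkowski(x, y, 2))             # → 5.0 (matches Euclidean)
print(hamming("karolin", "kathrin"))  # → 3
```

Note how `minkowski(x, y, 2)` reproduces the Euclidean result, illustrating the generalization described above.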
This can cause problems when evaluating which points are the most anomalous. A visual example of this can be seen below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/458/1*O9gKONb0VgRpLz_DVR_GzA.png" /><figcaption>Figure 4: diagram illustrating the difficulty associated with distance-based approaches</figcaption></figure><p>In the distance-based paradigm, point b would be classed as more anomalous than point a, a judgment based entirely on absolute distances, of which a’s is smaller than b’s. On inspection of the clusters, one can easily see that a is much more of an outlier given the clusters. This nuance can be addressed by density-based algorithms, which analyze points based on measures of relative density.</p><h3>Advantages</h3><p>These methods are unsupervised in nature; they are purely data-driven. Also, these techniques can be used on different data types: one only needs to define an appropriate distance measure for the given data.</p><h3>Disadvantages</h3><p>If the data has normal instances that do not have enough close neighbors, the technique fails to label them correctly. Also, the computational complexity of the testing phase is a significant challenge, since it involves computing the distance of each test instance to all instances belonging to either the test data or the training data.</p><h3>Clustering/Density-based Algorithms</h3><p>Density-based approaches attempt to solve this problem by measuring the degree to which a datum is anomalous using differences in density as a measure of outlierness.</p><p>The underlying assumption of a distance-based anomaly detection algorithm is that the relative distances between other points in a neighborhood are irrelevant to anomaly detection; hence such an approach is assured to work well only when the different neighborhoods (populated by points) are characterized by roughly equal densities. 
But this assumption is often violated; in many practical situations, data points have different densities in different neighborhoods.</p><p>A density-based approach looks at the “local” density of a point and compares it with the density associated with its neighbors. In density-based approaches, the main idea is to consider the behavior of a point with respect to its neighbors’ density values. The neighborhood is conceptualized by considering the k nearest neighbors, where k is either iteratively estimated or is a preassigned integer. The underlying assumption is that if the density at a point p is ‘smaller’ than the densities at its neighbors, p is likely an anomaly. The main difference between the approaches described below is in how they define the ‘local’ behavior and related density. Density may be computed in many ways, some of which are computationally expensive.</p><ul><li>The local density of a point p is defined as the reciprocal of the average distance among the k points nearest to p.</li><li>The relative anomalousness or outlier ‘score’ of p varies inversely with the local density at p.</li></ul><p>Thus, one can calculate the outlier score of each point in the dataset and sort the scores in decreasing order of magnitude; points at the top of the sorted list with significantly ‘large’ scores are declared outliers.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*A3pmGiWcKeQA7jMLzw8c8Q.png" /></figure><h3>Advantages</h3><ul><li>They can operate in an unsupervised mode. 
Also, the testing phase of these techniques is fast because each data instance needs to be compared to only a small number of clusters.</li></ul><h3>Disadvantages</h3><ul><li>They are computationally very complex, and they are effective only if the anomalies do not form significant clusters among themselves.</li></ul><h3>Unsupervised Models</h3><h3>Local Outlier Factor</h3><p>The local outlier factor is based on a concept of local density, where locality is given by the k nearest neighbors, whose distances are used to estimate the density. By comparing the local density of an object to the local densities of its neighbors, one can identify regions of similar density, as well as points that have a substantially lower density than their neighbors. The latter are considered to be outliers.</p><p>The local density is estimated by the typical distance at which a point can be “reached” from its neighbors. The definition of “reachability distance” used in LOF is an additional measure that produces more stable results within clusters.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*T0GeQza4V_S5hkr9OmkdeA.png" /></figure><h3>Advantages</h3><ul><li>Due to the local approach, LOF is able to identify outliers in a data set that would not be outliers in another area of the data set. For example, a point at a “small” distance from a very dense cluster is an outlier, while a point within a sparse cluster might exhibit similar distances to its neighbors.</li><li>While the geometric intuition of LOF is only applicable to low-dimensional vector spaces, the algorithm can be applied in any context in which a dissimilarity function can be defined. 
It has experimentally been shown to work very well in numerous setups, often outperforming the competitors, for example in <a href="https://en.wikipedia.org/wiki/Network_intrusion_detection_system">network intrusion detection</a> and on processed classification benchmark data.</li><li>The LOF family of methods can be easily generalized and then applied to various other problems, such as detecting outliers in geographic data, video streams, or authorship networks.</li></ul><h3>Disadvantages</h3><ul><li>The resulting values are quotients and hard to interpret. A value of 1 or less indicates a clear inlier, but there is no clear rule for when a point is an outlier. In one data set a value of 1.1 may already be an outlier; in another dataset and parameterization (with strong local fluctuations) a value of 2 could still be an inlier. These differences can also occur within a dataset due to the locality of the method.</li></ul><h3>Isolation Forests (iForests)</h3><p>The Isolation Forest algorithm is based on the tendency of anomalous instances in a dataset to be easier to separate (isolate) from the rest of the sample than normal points. In order to isolate a data point, the algorithm recursively partitions the sample by randomly selecting an attribute and then randomly selecting a split value for that attribute, between its minimum and maximum allowed values. Mathematically, the recursive partitioning can be represented by a tree structure called an Isolation Tree, and the number of partitions required to isolate a point can be interpreted as the length of the path, within the tree, from the root to a terminating node.</p><p>This method is highly useful and fundamentally different from existing methods. It introduces the use of isolation as a more effective and efficient means to detect anomalies than the commonly used basic distance and density measures. 
Moreover, the algorithm has low, linear time complexity and a small memory requirement. It builds a well-performing model with a small number of trees using small sub-samples of fixed size, regardless of the size of the data set.</p><p>Most other outlier detection algorithms seek to build a profile of “normal” instances and then flag instances that don’t fit that profile of normality. iForests explicitly isolate anomalous records by taking advantage of an inherent property of anomalies: they have unusual values for the set of covariates.</p><p>As a tree ensemble that employs isolation trees, iForest (a) identifies anomalies as points having shorter path lengths, and (b) has multiple trees acting as ‘experts’ to target different anomalies. Since iForest does not need to isolate all normal instances (the majority of the training sample), it is able to work well with a partial model and builds models using a small sample size. Contrary to existing methods, where a large sampling size is more desirable, the isolation method works best when the sampling size is kept small. 
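The isolation mechanism can be illustrated with a stripped-down iTree in pure Python. This is a hypothetical teaching sketch, not the published algorithm: it omits subsampling and the average-path-length normalization used to turn depths into anomaly scores.

```python
import random

def itree(X, height=0, limit=8):
    """Recursively build an isolation tree over points X (lists of floats)."""
    if height >= limit or len(X) <= 1:
        return {"size": len(X)}                  # external (terminating) node
    q = random.randrange(len(X[0]))              # random attribute
    lo, hi = min(x[q] for x in X), max(x[q] for x in X)
    if lo == hi:                                 # all values identical: stop
        return {"size": len(X)}
    p = random.uniform(lo, hi)                   # random split value
    return {"q": q, "p": p,
            "l": itree([x for x in X if x[q] < p], height + 1, limit),
            "r": itree([x for x in X if x[q] >= p], height + 1, limit)}

def path_length(x, node, height=0):
    """Depth at which x lands in an external node."""
    if "size" in node:
        return height
    branch = node["l"] if x[node["q"]] < node["p"] else node["r"]
    return path_length(x, branch, height + 1)

random.seed(0)
# a tight cluster around (0, 0) plus one far-away point
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)] + [[10.0, 10.0]]
trees = [itree(X) for _ in range(50)]
avg = lambda x: sum(path_length(x, t) for t in trees) / len(trees)
# the isolated point should have a clearly shorter average path
print(avg([10.0, 10.0]) < avg([0.0, 0.0]))
```

The far-away point is typically separated within a couple of random splits, while points inside the cluster need many, which is exactly the shorter-path-length property the text describes.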
A large sampling size reduces iForest’s ability to isolate anomalies, as normal instances can interfere with the isolation process.</p><p>To build an iTree, we recursively divide X by randomly selecting an attribute q and a split value p until either:</p><ul><li>the tree reaches a height limit,</li><li>all observations are isolated at their own external node, or</li><li>all data have the same values for all attributes.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8JGTcNswz5Kqp_xMvibepg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TIDNLBSfBXcx3mA8O87ILA.png" /></figure><h3>Advantages</h3><ul><li>has a low, linear (O(n)) time complexity and a small memory requirement</li><li>is able to deal with high-dimensional data with irrelevant attributes</li><li>can be trained with or without anomalies in the training set</li><li>can provide detection results with different levels of granularity without re-training</li></ul><h3>Disadvantages</h3><ul><li>performance depends a lot on the window and model update policy choices</li><li>hard to adapt for categorical data</li></ul><h3>One-Class Support Vector Machines (OCSVM)</h3><p>The one-class support vector machine is a specific variant of the support vector machine geared toward anomaly detection. It differs from the generic SVM in that the resulting quadratic optimization problem includes an allowance for a specific, small, predefined percentage of outliers, making it suitable for anomaly detection. These outliers lie between the origin and the optimal separating hyperplane. All the remaining data fall on the opposite side of the optimal separating hyperplane, belonging to a single nominal class, hence the terminology “one-class” SVM.</p><p>The SVM outputs a score that represents the distance from the data point being tested to the optimal hyperplane. 
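A minimal sketch of this using scikit-learn's OneClassSVM (the RBF kernel and the nu value, an assumed upper bound on the outlier fraction, are illustrative choices):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(500, 2))   # "normal" behavior only
X_test = np.array([[0.0, 0.0],              # typical point
                   [6.0, 6.0]])             # obvious anomaly

# nu ~ assumed fraction of outliers tolerated on the training side
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)
scores = ocsvm.decision_function(X_test)    # signed distance to the hyperplane
print(scores[0] > 0, scores[1] < 0)         # → True True
```
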
Positive values for the one-class SVM output represent normal behavior (with higher values representing greater normality) and negative values represent abnormal behavior (with lower values representing greater abnormality).</p><p>The one-class SVM is known to be sensitive to outliers and thus does not perform very well for outlier detection. This estimator is best suited for novelty detection when the training set is not contaminated by outliers. That said, outlier detection in high dimensions, or without any assumptions on the distribution of the inlying data, is very challenging, and a One-class SVM might give useful results in these situations depending on the value of its hyperparameters.</p><h3>Advantages</h3><ul><li>Finds the optimal separating hyperplane.</li><li>Can deal with very high dimensional data.</li><li>Some kernels have infinite Vapnik-Chervonenkis dimension (a proxy for model complexity), which means that they can learn very elaborate concepts.</li><li>Usually work very well.</li></ul><h3>Disadvantages</h3><ul><li>Require both positive and negative examples.</li><li>Need to select a good kernel function.</li><li>Require lots of memory and CPU time.</li><li>There are some numerical stability problems in solving the constrained optimization problem.</li></ul><h3>K-Means</h3><p>The K-means algorithm is a traditional clustering algorithm. It divides the data into K clusters and guarantees that the data within the same cluster are similar, while the data in different clusters are dissimilar. The K-means algorithm first selects K points at random as the initial cluster centers; each remaining point is then added to the cluster with the highest similarity according to its distance to the cluster center, and the center of each cluster is recalculated. This process is repeated until the cluster centers no longer change. Thus, data are divided into K clusters. 
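</p><p>The procedure above maps directly onto scikit-learn’s KMeans (the two-blob data set below is an illustrative assumption):</p><pre>
```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(1)
# two well-separated blobs
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               rng.normal(5.0, 0.5, size=(100, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
labels = km.labels_            # cluster assignment for each point
centers = km.cluster_centers_  # recomputed centroids after convergence
```
</pre><p>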
Unfortunately, K-means clustering is sensitive to outliers, and the set of objects assigned to a centroid may be empty, in which case that centroid cannot be updated.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*H3Emr2TkpIFcEGopCdi-Lw.png" /></figure><h3>Advantages</h3><ul><li>Low complexity.</li></ul><h3>Disadvantages</h3><ul><li>The necessity of specifying k.</li><li>Sensitive to noise and outlier data points.</li><li>Clusters are sensitive to the initial assignment of centroids.</li></ul><h3>Fuzzy C-means</h3><p>Fuzzy C-means is a clustering method which allows one piece of data to belong to two or more clusters. It is used in applications for which hard classification of data is not meaningful or is difficult to achieve (e.g., pattern recognition). The C-means algorithm is similar to K-Means except that the membership of each point is defined based on a fuzzy function, and all the points contribute to the relocation of a cluster centroid based on their fuzzy membership to that cluster.</p><h3>Advantages</h3><ul><li>Allows a data point to be in multiple clusters.</li><li>A more natural representation of the behavior of genes.</li></ul><h3>Disadvantages</h3><ul><li>Need to define C, the number of clusters.</li><li>Need to determine a membership cutoff value.</li><li>Clusters are sensitive to the initial assignment of centroids.</li></ul><h3>Neural Networks</h3><p>Neural networks are a class of machine learning methods entirely of their own. They can be adapted to various uses catering to supervised, unsupervised, and even time-series data. NNs learn a feature representation of the distribution using various optimization algorithms in order to reduce the loss or error metric specified in training. 
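</p><p>As one concrete instance, a small feed-forward network can be fitted with scikit-learn’s MLPClassifier (the synthetic task and the network size are illustrative assumptions):</p><pre>
```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(2)
X = rng.normal(size=(400, 4))
# XOR-like target: positive when the first two features share a sign
y = np.greater(X[:, 0] * X[:, 1], 0).astype(int)

# one hidden layer of 32 units; lbfgs minimizes the log-loss on the training set
net = MLPClassifier(hidden_layer_sizes=(32,), solver="lbfgs", max_iter=5000,
                    random_state=2).fit(X, y)
train_acc = net.score(X, y)
```
</pre><p>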
They require large amounts of data for effective training, so they can be limited by that constraint.</p><h3>Advantages</h3><ul><li>A neural network can perform tasks that a linear program cannot.</li><li>When an element of the neural network fails, it can continue without any problem thanks to its parallel nature.</li><li>A neural network learns and does not need to be reprogrammed.</li><li>It can be implemented in any application.</li></ul><h3>Disadvantages</h3><ul><li>The neural network needs training to operate.</li><li>The architecture of a neural network is different from the architecture of microprocessors, and therefore needs to be emulated.</li><li>Requires high processing time for large neural networks.</li></ul><h3>Self-Organizing Maps (SOM)</h3><p>SOMs are the most popular neural networks to be trained for anomaly detection tasks. The SOM is trained by an unsupervised competitive learning algorithm. The aim of the SOM is to reduce the dimensionality of the data for visualization. That is, SOM outputs are clustered in a low-dimensional feature space. It usually consists of an input layer and the Kohonen layer, which is designed as a two-dimensional arrangement of neurons that maps n-dimensional input to two dimensions. The SOM associates each of the input vectors to a representative output. The network finds the node nearest to each training case and moves that winning node, the neuron with minimum distance, toward the training case. 
That is, the SOM maps similar input vectors onto the same or nearby output units on the two-dimensional map, so the output units self-organize into an ordered map and units with similar weights end up close together after training.</p><h3>Advantages</h3><ul><li>A simple and easy-to-understand algorithm that works.</li><li>A topological clustering unsupervised algorithm that works with nonlinear data sets.</li><li>The excellent capability to visualize high-dimensional data onto 1- or 2-dimensional space.</li></ul><h3>Disadvantages</h3><ul><li>Time-consuming algorithm to compute.</li></ul><h3>Unsupervised Performance Measures (Cluster Validity Indices)</h3><h3>Silhouette Coefficient</h3><p>This is a measure of how similar an object is to its own cluster compared to other clusters. The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.</p><p>The silhouette can be calculated with any <a href="https://en.wikipedia.org/wiki/Distance">distance</a> metric, such as the <a href="https://en.wikipedia.org/wiki/Euclidean_distance">Euclidean distance</a> or the Manhattan distance. We can interpret a(<em>i</em>) as a measure of how well assigned <em>i</em> is to its cluster (the smaller the value, the better the assignment). Mathematically, it is the mean of the distance between <em>i</em> and all other data points in the same cluster.</p><p>We then define the mean dissimilarity b(<em>i</em>) between point <em>i</em> and a cluster as the mean distance from <em>i</em> to all points of that cluster, taking the smallest such value over the other clusters. 
The cluster with the smallest mean dissimilarity is called the “neighbouring cluster”.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*uDS-o67CVC4wc5ZrtvP2LQ.png" /></figure><p>The silhouette score s(<em>i</em>) is defined above as the difference between the mean dissimilarity b(<em>i</em>) and the mean distance a(<em>i</em>), divided by the maximum of the two quantities for each point <em>i.</em></p><h3><em>Advantages</em></h3><ul><li>The score is bounded between -1 for incorrect clustering and +1 for highly dense clustering. Scores around zero indicate overlapping clusters.</li><li>The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.</li></ul><h3><em>Disadvantages</em></h3><ul><li>The Silhouette Coefficient is generally higher for convex clusters than other concepts of clusters, such as density-based clusters like those obtained through DBSCAN.</li><li>High computational complexity: O(n²).</li></ul><h3>Calinski-Harabasz Score</h3><p>The Calinski-Harabasz index, also known as the Variance Ratio Criterion, is the ratio of the between-cluster dispersion to the within-cluster dispersion over all clusters; the higher the score, the better the performance.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/602/1*TWMxQ8Ezp2rB9N9Sat0teQ.png" /></figure><p>Where N is the number of data points, K is the number of clusters, B is the inter-cluster (between-cluster) variance, and W is the intra-cluster (within-cluster) variance. The C-H index varies between 0 and +∞ (the best classification). 
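</p><p>These indices (including the Davies-Bouldin index discussed below) are all implemented in scikit-learn’s metrics module; a minimal sketch on made-up, well-separated blobs:</p><pre>
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

rng = np.random.RandomState(3)
X = np.vstack([rng.normal(0.0, 0.4, size=(80, 2)),
               rng.normal(4.0, 0.4, size=(80, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=3).fit_predict(X)

sil = silhouette_score(X, labels)        # in [-1, 1], higher is better
ch = calinski_harabasz_score(X, labels)  # unbounded, higher is better
db = davies_bouldin_score(X, labels)     # non-negative, lower is better
```
</pre><p>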
It depends strongly on the value N (the number of points in the sample).</p><h3><em>Advantages</em></h3><ul><li>The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.</li><li>The score is fast to compute.</li></ul><h3><em>Disadvantages</em></h3><ul><li>The Calinski-Harabasz index is generally higher for convex clusters than other concepts of clusters, such as density-based clusters like those obtained through DBSCAN.</li><li>Everything else being equal, the C-H index varies linearly with N. This can lead to vastly different scores between datasets.</li></ul><h3>Dunn Index</h3><p>An internal evaluation scheme, where the result is based on the clustered data itself. As do all other such indices, the aim is to identify sets of clusters that are compact, with a small variance between members of the cluster, and well separated, where the means of different clusters are sufficiently far apart, as compared to the within-cluster variance. For a given assignment of clusters, a higher Dunn index indicates better clustering.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*zerJDWTEHACW93d7CkHpaQ.png" /></figure><p>Various measurements of the cluster diameter can be used in the equation above, this value is denoted by δ, the intercluster distance metric between clusters Ci and Cj. 
It can be seen qualitatively as the minimum separation divided by the maximum diameter Δk.</p><h3><em>Advantages</em></h3><ul><li>Gives a ratio value of intracluster closeness to intercluster separation which can give a unique perspective of clustering performance.</li></ul><h3><em>Disadvantages</em></h3><ul><li>The computational cost grows as the number of clusters and dimensionality of the data increase.</li><li>This formulation has a peculiar problem, in that if one of the clusters is badly-behaved, where the others are tightly packed, since the denominator contains a ‘max’ term instead of an average term, the Dunn Index for that set of clusters will be uncharacteristically low.</li></ul><h3>Davies-Bouldin Index</h3><p>This index signifies the average ‘similarity’ between clusters, where the similarity is a measure that compares the distance between clusters with the size of the clusters themselves. A lower Davies-Bouldin index relates to a model with better separation between the clusters. 
In the context of this index, the similarity is defined as a measure Ri,j that trades off:</p><ul><li>S<em>i</em> is the average distance between each point in cluster <em>i</em> and the centroid of that cluster, also known as the cluster diameter.</li><li>Mi,j is the distance between cluster centroids i and j.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/528/1*kENkKDbLFHtzXmVXDgoLLg.png" /></figure><p>Ri,j is defined above and is used to calculate Di:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/506/1*R65GUqibw82wVzLNl_0Kvg.png" /></figure><p>When the number of clusters is N we have:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/556/1*GR1xSVPW3N6cIElSRVj3XQ.png" /></figure><h3><em>Advantages</em></h3><ul><li>The computation of Davies-Bouldin is simpler than that of Silhouette scores.</li><li>The index is computed using only quantities and features inherent to the dataset.</li></ul><h3><em>Disadvantages</em></h3><ul><li>The usage of centroid distance limits the distance metric to Euclidean space.</li><li>The Davies-Bouldin index is generally higher for convex clusters than other concepts of clusters, such as density-based clusters like those obtained from DBSCAN.</li><li>A good value reported by this method does not imply the best information retrieval.</li></ul><h3>Supervised Models</h3><h3>Decision Trees</h3><p>Decision Trees are powerful and common tools for classification and prediction. A decision tree is a tree that has three main components: nodes, arcs, and leaves. Each node is labeled with a feature attribute, which is most informative among the attributes not yet considered in the path from the root. Each arc out of a node is labeled with a feature value for the node’s feature, and each leaf is labeled with a category or class. A decision tree can then be used to classify a data point by starting at the root of the tree and moving through it until a leaf node is reached. 
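</p><p>A minimal scikit-learn sketch of this traversal (the synthetic data and the tree depth are illustrative assumptions):</p><pre>
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(4)
X = rng.normal(size=(200, 3))
y = np.greater(X[:, 0], 0).astype(int)  # label driven by one informative feature

# "gini" uses Gini impurity; criterion="entropy" would use information gain
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=4).fit(X, y)
pred = tree.predict(X)  # each prediction is the label of the leaf that is reached
```
</pre><p>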
The leaf node provides the classification of the data point.</p><p>Algorithms for constructing decision trees usually work top-down, by choosing a variable at each step that best splits the set of items. Different algorithms use different metrics for measuring “best”. These generally measure the homogeneity of the target variable within the subsets. These metrics are applied to each candidate subset, and the resulting values are combined (e.g., averaged) to provide a measure of the quality of the split. Two commonly used metrics for decisioning are:</p><ul><li>Information Gain</li><li>Gini Impurity</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/936/1*25fsNjuGSS20KJx5vE0mdA.png" /></figure><h3><em>Advantages</em></h3><ul><li>Simple to understand and interpret.</li><li>Requires little data preparation.</li><li>Able to handle both numerical and categorical data.</li><li>Uses a white-box model.</li><li>Possible to validate a model using statistical tests.</li><li>Robust.</li><li>Perform well with large data in a short time.</li></ul><h3><em>Disadvantages</em></h3><ul><li>The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts.</li><li>Decision-tree learners create over-complex trees that do not generalize the data well.</li><li>There are concepts that are hard to learn because decision trees do not express them easily.</li></ul><h3>K-Nearest Neighbor</h3><p>K-nearest neighbor (k-NN) is one of the more modest and conventional nonparametric techniques for classifying samples. It calculates the approximate distances between various points on the input vectors, and then assigns the unlabeled point to the class of its K-nearest neighbors. In the process of creating k-NN classifier, k is an important parameter and various k values can cause various performances. 
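</p><p>For instance, with scikit-learn’s KNeighborsClassifier (the two-blob data set and k=5 are illustrative assumptions):</p><pre>
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(5)
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               rng.normal(3.0, 0.5, size=(100, 2))])
y = np.repeat([0, 1], 100)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)     # k is the key hyperparameter
pred = knn.predict(np.array([[0.0, 0.0], [3.0, 3.0]]))  # one query near each blob
```
</pre><p>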
If k is very large, the many neighbors used for prediction increase classification time and can harm prediction accuracy.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wdP0x6q2e_Wy-9nZGO5PXg.png" /></figure><h3>Advantages</h3><ul><li>Very easy to understand when there are few predictor variables.</li><li>Useful for building models that involve non-standard data types, such as text.</li></ul><h3>Disadvantages</h3><ul><li>Have large storage requirements.</li><li>Sensitive to the choice of the similarity function that is used to compare instances.</li><li>Lack a principled way to choose k, except through cross-validation or similar.</li><li>Computationally-expensive technique.</li></ul><h3>Bayesian Networks</h3><p>A Bayesian Network (BN) is a model that encodes probabilistic relationships among variables of interest. This technique is generally used for intrusion detection in combination with statistical schemes. It has several advantages, including the capability of encoding interdependencies between variables and of predicting events, as well as the ability to incorporate both prior knowledge and data.</p><h3>Supervised Performance Metrics</h3><p>Supervised machine learning models are trained on labeled data, which allows a precise description of the performance of the model. In this case, different metrics are used than under the unsupervised paradigm. 
A variety of different measures of success are used during analysis as each measure provides a unique evaluation of performance.</p><h3>Evaluation Statistics</h3><ul><li><strong>True Positives (TP)</strong>: These are cases in which you predicted Yes, and were actually labeled Yes.</li><li><strong>True Negatives (TN)</strong>: You predicted No, and they were actually labeled No.</li><li><strong>False Positives (FP)</strong>: You predicted Yes, but they were labeled as No (also known as a Type I error)</li><li><strong>False Negatives (FN)</strong>: You predicted No, but they were actually labeled Yes (also known as a Type II error)</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/880/1*MArYroUnKiVec9U5zmhHyA.png" /></figure><ul><li><strong>Accuracy</strong>: Perhaps the most commonly used metric is accuracy. Mathematically defined as (TP+TN)/Total. It tells you how often the classifier is correct in making the predictions. Generally, it is not advised to judge your model on accuracy in the case of imbalanced class datasets as you can get high accuracy just by predicting all the observations as the dominant class.</li><li><strong>Precision</strong>: It answers the question: When the classifier predicts positive, how often is it correct? Mathematically calculated as TP/predicted positive.</li><li><strong>Recall</strong>: It answers the question: When it’s actually positive, how often does the classifier predict positive? Mathematically calculated as TP/actual positive.</li><li><strong>False Positive Rate (FPR):</strong> It answers the question: When it’s actually negative, how often does the classifier predict positive? 
Mathematically calculated as FP/actual negative.</li><li><strong>Specificity/True Negative Rate (TNR):</strong> measures the proportion of negatives that are correctly identified (i.e., the proportion of those who do not have the condition (unaffected) who are correctly identified as not having the condition).</li><li><strong>Sensitivity/True Positive Rate (TPR):</strong> measures the proportion of positives that are correctly identified (i.e., the proportion of those who have some condition (affected) who are correctly identified as having the condition).</li><li><strong>F1 Score:</strong> This is a harmonic mean of the Recall and Precision. Mathematically calculated as (2 x precision x recall)/(precision+recall). There is also a general form of F1 score called F-beta score wherein you can provide weights to precision and recall based on your requirement.</li><li><strong>Matthew’s Correlation Coefficient (MCC):</strong> Takes into account true and false positives and negatives and is a balanced measure that can be used even if the classes are of very different sizes. The MCC is in essence a correlation coefficient between the observed and predicted binary classifications. It returns a value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 no better than random prediction and −1 indicates total disagreement between prediction and observation.</li><li><strong>RankPower:</strong> Both the precision and AUC criteria do not consider characteristics of anomaly ranking. Intuitively, an anomaly ranking algorithm will be regarded as more effective if it ranks true anomalies in the top and normal observations in the bottom of the list of anomaly candidates. The rank power is such a metric and evaluates the comprehensive ranking of true anomalies. 
The formal definition is:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/848/1*E72mdV80N_PdxCuFDDauVw.png" /></figure><p>where n is the number of anomalies in the top potential objects and Ri is the rank of the i-th true anomaly. For a fixed value of t, a larger value indicates better performance. When the t anomaly candidates are true anomalies, the rank power equals one.</p><h3>Confusion Matrix</h3><p>A confusion matrix is a table that is used to describe the performance of a classification model, or a classifier, on a set of observations for which the true values are known (supervised). Each row of the matrix represents the instances in the actual class while each column represents the instances in the predicted class (or vice versa). For example, here is a dummy confusion matrix for a binary classification problem predicting positive or negative from a classifier:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8aTk5gw4shzWqk655VT5zg.png" /></figure><h3>Receiver Operator Characteristic (ROC)/Area Under the Curve (AUC)</h3><p>ROC or Receiver Operator Characteristic curve is a plot of the Recall (True Positive Rate) (on the y-axis) versus the False Positive Rate (on the x-axis) for every possible classification threshold.</p><p>The reason why you should check the ROC curve to evaluate a classifier instead of a simpler metric such as accuracy is that a ROC curve visualizes all possible classification thresholds, whereas accuracy only represents performance for a single threshold. Typical ROC curve looks like the image shown below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/936/1*32ih9xKCSk85tUmT_xVZzg.png" /></figure><p>The ROC displays an effective classification result when the AUC is closest to 1 (in the image above, the AUC is 0.99). The ROC is large if the curve closely follows the axes as we can see above. This result illustrates a good classifier. 
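</p><p>The statistics above are one-liners in scikit-learn; a small sketch on hand-made labels and scores (the arrays below are illustrative assumptions):</p><pre>
```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, matthews_corrcoef, roc_auc_score)

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])  # actual labels
y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 0])  # hard predictions
y_score = np.array([0.9, 0.8, 0.4, 0.3, 0.2, 0.1, 0.6, 0.2])  # model scores

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
prec = precision_score(y_true, y_pred)   # TP / predicted positive
rec = recall_score(y_true, y_pred)       # TP / actual positive
f1 = f1_score(y_true, y_pred)            # harmonic mean of precision and recall
mcc = matthews_corrcoef(y_true, y_pred)  # in [-1, 1]
auc = roc_auc_score(y_true, y_score)     # area under the ROC curve
```
</pre><p>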
A bad classifier’s ROC would follow the dotted diagonal line, and the AUC would be a value close to 0.5, no better than random guessing.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/936/1*phAUYm62FU7Z_1n-Qtg5Sg.png" /></figure><h3>Precision-Recall Curve</h3><p>Another curve that is used to evaluate the classifier’s performance as an alternative to a ROC curve is a precision-recall curve (PRC), particularly in the case of imbalanced class distribution problems. It is a curve of precision against recall, and a good classifier will produce a PR-curve that is close to the upper right corner. A diagram of a typical PR curve can be seen below in Figure 4:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hAHyFneP4r3G6SqdW-mpnA.png" /></figure><h3>Semi-Supervised Models and More</h3><h3>Generative Adversarial Networks (GANs)</h3><p>GANs are a natural choice for financial fraud prediction as they can learn the patterns of lawful transactions from historical data. For every new financial transaction, the model computes an anomaly score; financial transactions with high scores are labeled as suspicious transactions.</p><p>GANs are difficult to both train and deploy in production, needing lots of GPUs and parallel hyperparameter search as well as distributed training support. Great care must be exercised and extensive machine learning experience is certainly required. One of the GAN implementations describes an architecture for anomaly detection that tolerates small amounts of incorrectly labeled samples and supports parallel encoder training.</p><h3>Link Analysis</h3><p>Link analysis is the most common unsupervised method of fraud detection. The process of performing link analysis is known as link discovery (LD). This discipline has its origin in discrete mathematics, graph theory, social science, and pattern analysis. The object of LD is to find hidden links among patterns that appear to be unrelated. 
The approach is to relate groups and activities to some behavior, such as fraud. LD is related in a broader context to the recent emergence of <a href="https://www.sciencedirect.com/topics/computer-science/social-network-analysis">social network analysis</a>.</p><p>In traditional <a href="https://www.sciencedirect.com/topics/computer-science/data-mining">data mining</a>, entities modeled are variables, which may be correlated (linked) to other variables in their effect on a target variable. In LD, entities are not variables, but rather are relationships between entities. LD evaluates the likelihood that a given pattern in a data set (expressible in a specific graphic data structure) matches some target pattern. In this regard, LD is very “platonic” in its search for truth, compared with the more Aristotelian approach of supervised methods of fraud detection.</p><h3>Benford’s Law — Numerical Methods</h3><p>Another common unsupervised method is the application of Benford’s Law to the detection of fraudulent financial reports. Benford’s Law states that in numerical lists involving real-life processes and events, the leading digit is not distributed in a uniform manner. The digit 1 appears about a third of the time, and the digit with the lowest frequency is 9. This principle is attributed to Benford, but it was published earlier by Newcomb. As Director of the Nautical Almanac Office, Newcomb observed that pages of logarithm books were unevenly worn. Logarithms were used extensively in the calculation of nautical chart values. The earlier pages of the logarithm books were more worn than the later pages.</p><p>This observation led him to form the general principle that any list of numbers taken from any set of data will contain numbers beginning with the digit 1 more frequently than any other number. Benford’s Law can be applied to check the “normalcy” of street numbers, bill amounts, stock prices, or expense reports. 
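</p><p>A minimal numerical sketch of such a first-digit check (the distance function and the simulated amounts below are illustrative assumptions, not a standard test statistic):</p><pre>
```python
import numpy as np

# Benford's expected first-digit probabilities: P(d) = log10(1 + 1/d)
BENFORD = np.log10(1.0 + 1.0 / np.arange(1, 10))

def first_digit_freq(values):
    """Observed frequency of each leading digit 1..9 for positive numbers."""
    digits = np.array([int(("%e" % v)[0]) for v in values])
    return np.array([(digits == d).mean() for d in range(1, 10)])

def benford_distance(values):
    """Chi-square-style distance from the Benford distribution."""
    obs = first_digit_freq(values)
    return float(np.sum((obs - BENFORD) ** 2 / BENFORD))

rng = np.random.RandomState(6)
natural = np.exp(rng.uniform(0.0, 10.0, size=20000))  # log-uniform, Benford-like
uniform = rng.uniform(1.0, 9.999, size=20000)         # uniform, clearly non-Benford
```
</pre><p>A list whose distance is much larger than that of comparable natural data can then be flagged for closer inspection.</p><p>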
This principle was derived from observations in the real world, but it remained unproven mathematically until Hill offered formal proof. Checks against the relative frequencies of initial digits presented by Benford can be used to flag suspicious numerical lists. If the frequency of initial digits in a list is significantly different from the frequencies listed by Benford, then the list can be flagged as probable fraud.</p><h3>Rank-Based Outlier Detection</h3><h3>RBDA</h3><p>This is an approach for outlier detection based on a ranking measure that focuses on the question of whether a point is “important” for its nearest neighbors; in this notation, a low cumulative rank implies the point is central. For instance, a point centrally located in a cluster has a relatively low cumulative sum of ranks because it is among the nearest neighbors of its own nearest neighbors. But a point at the periphery of a cluster has a high cumulative sum of ranks because its nearest neighbors are closer to each other than to the point. The use of ranks eliminates the problem of density calculation in the neighborhood of the point, and this improves performance. RBDA performs better than several density-based methods, on some synthetic data sets as well as on some real data sets.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*U9UEEXbFz-FUAQlaz4jlsQ.png" /></figure><p>The key idea of the algorithms is to use rank instead of distance. These ranks are relative to a point p ∈ D; that is, given a point p ∈ D, the largest rank is assigned to the point q ∈ D that is farthest away from p, and the smallest rank, 1, to the point closest to it. To understand mutual proximity, consider p ∈ D and q ∈ Np(k), where Np(k) is the k-neighborhood of p. Here q is “close” to p because it belongs to Np(k). In return, we ask “how close is p to q?”. 
If p and q are ‘close’ to each other, then we argue that (with respect to each other) p and q are not anomalous data points.</p><p>However, if q is a neighbor of p but p is not a neighbor of q, then with respect to q, p is an anomalous point. If p is an anomalous point with respect to most of its neighbors, then p should be declared to be an anomaly. When measuring the outlierness of p, instead of distance, we use the ranks calculated based on neighborhood relationships between p and Np(k). A low cumulative rank assigned to p by its neighbors implies that p is a central point because it is among the nearest neighbors of its own closest neighbors. However, a point at the periphery of a cluster has a high cumulative sum of ranks because its nearest neighbors are closer to each other than to the point.</p><h3>NC-Clustering</h3><p>In NC-clustering, a cluster is required to have a minimum number of objects, denoted as <em>l</em>. The main difference between this approach and DBSCAN is in interpreting ‘reachability’. In NC-clustering, reachability is defined using the k-neighborhood, i.e., the main idea is to declare data objects p and q in D to be close if q is one of the k-nearest neighbors of p and vice-versa. One distinct advantage of this approach is that k is easier to guess. Another difference is that, in DBSCAN, a border point is considered to be in the cluster; this is not the case in NC-clustering. Formally, the following definitions are used in the proposed clustering algorithm; all definitions are parameterized with a positive integer parameter <em>l</em> intended to capture the notion of cluster tightness.</p><h3>Summary</h3><p>Machine learning techniques have received considerable attention in anomaly detection to address the weaknesses of knowledge-based detection techniques. Anomaly detection comprises supervised, semi-supervised, and unsupervised techniques. Many algorithms were used to achieve good results for these techniques. 
There exist many machine learning techniques for anomaly detection. Supervised learning methods significantly outperform unsupervised ones if the test data contains no unknown attacks. Among the supervised methods, the best performance is achieved by non-linear methods such as the SVM, the multi-layer perceptron, and rule-based methods. Among unsupervised techniques, K-Means, SOM, and the one-class SVM achieve better performance than the others, although they differ in their ability to detect all attack classes efficiently.</p><p>There are many approaches to detecting outliers and anomalies in data. The specific approach chosen depends greatly on the data and the inherent qualities therein, the computing power available, and the amount of time allowed. By managing these constraints, the algorithms can be optimized to learn any feature representation and consequently identify any anomalous data points.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6e5f7a971ef6" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Sparse Autoencoder (SAE) for Dummies]]></title>
            <link>https://medium.com/@cdefaux/the-sparse-autoencoder-sae-for-dummies-cb7f170bda86?source=rss-4b3b1031f27------2</link>
            <guid isPermaLink="false">https://medium.com/p/cb7f170bda86</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[neural-networks]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[autoencoder]]></category>
            <dc:creator><![CDATA[Christiaan Defaux]]></dc:creator>
            <pubDate>Wed, 05 Feb 2020 21:20:01 GMT</pubDate>
            <atom:updated>2020-02-05T21:20:01.485Z</atom:updated>
            <content:encoded><![CDATA[<p>If you’ve landed on this page, you’re probably familiar with a variety of deep neural network models. I will assume that you have a grounding in mathematics and machine learning but I will still provide some context and intro to the concept.</p><p>Autoencoders are a form of unsupervised learning which means that they are used in cases when data is unlabeled. This is in contrast to supervised learning which learns features in relation to labels and attempts to match the inputs <em>x </em>to the labels <em>y. </em>One problem with supervised learning is the labeling of the data which can be tedious and monotonous; often requiring PhD students to sit for hours to accurately do a task that will hopefully take a machine seconds to do (after training).</p><p>Suppose now that we have a set of this unlabeled data <em>x </em>where we set the target values <em>y </em>equal to the inputs <em>x. </em>Meaning that <em>y = x. </em>This means that the autoencoder is trying to learn the identity function that will allow the outputs <em>y </em>to approximate as closely as possible, the inputs <em>x. </em>By placing constraints on our network, the model will be forced to prioritize the most salient features in the data.</p><p>So, what is an autoencoder, you might ask. Well generally speaking it is a deep neural network (meaning with more than 2 layers) that contains some kind of bottleneck layer/layers between the input and output layers. For a deeper (qualitative) understanding please see one of my previous posts on the subject <a href="https://medium.com/@cdefaux/unsupervised-learning-the-autoencoder-in-laymans-terms-5dfdee22e692">here</a>. 
By adding the sparsity constraint, we are in effect inducing regularization in our autoencoder, which can aid in training and computational efficiency.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/442/1*5X3F-kz_8BAhChr3CcHtgA.png" /><figcaption>Figure 1: A simple single-layer sparse autoencoder with equal numbers of inputs (x), outputs (xhat) and hidden nodes (h). The darker yellow nodes are firing whereas the lighter nodes are constrained.</figcaption></figure><p>Usually, an autoencoder will have a bottleneck, meaning that the hidden layer/layers have fewer nodes than the input and output. However, this is not a requirement if there is an added constraint on the network. The particular constraint we are considering is <strong>sparsity, </strong>which we will impose on the hidden units. By constraining the activations of the hidden nodes, we are in effect reducing the number of firing neurons. Therefore, despite an equal or greater number of hidden nodes than input/output nodes, we will have a compression of the data, which is necessary for learning the latent features.</p><p>We would like to constrain the neurons to remain inactive most of the time, meaning with an output close to 0. Recalling our notation, the activation is</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/252/1*AAZPvhkC_UDc-RGgOvLflQ.png" /></figure><p>where the superscript 2 refers to the second layer and the subscript <em>j </em>refers to the <em>j</em>th hidden unit in that layer given a specific input x. We can then find the average activation of unit j in the hidden layer using,</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*T8pHL4CEBtJDcgU97-0T3g.png" /></figure><p>In order to create a constraint, we try to enforce that the average activation <em>ρj </em>is equal to a specified sparsity parameter close to zero (often 0.05). 
To satisfy this constraint, most of the hidden activations must therefore be close to 0.</p><p>To enforce the constraint, we add an extra term to the cost function, which induces a penalty when the average activation deviates far from the specified parameter. A popular choice for the penalty is shown below. This is also known as the Kullback-Leibler (KL) divergence,</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iYfIw7pbqwZOsG8kTultmQ.png" /></figure><p>where the summation runs over the hidden units in the hidden layer and <em>ρ</em> is the specified sparsity parameter (here 0.05). The KL divergence can also be written as follows,</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hq41Vrxax-x0h9VPIqEeMA.png" /></figure><p>which is equivalent to the expression above. For a deeper understanding of the KL divergence, refer to this <a href="https://medium.com/@cdefaux/kullback-leibler-divergence-for-dummies-c3613bc80ad3">article</a>. Briefly, the KL divergence here measures the divergence between two Bernoulli random variables with means <em>ρ </em>and <em>ρj</em>. It is a measure of the difference between two distributions and gives a curve similar to the one below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hI_R2QE8Aay3bfRJozds_A.png" /></figure><p>In the case above, <em>ρ </em>is specified as 0.2; therefore there is no penalty when <em>ρj </em>is equal to <em>ρ. </em>However, when <em>ρj </em>moves away from this value, the KL divergence increases monotonically. Minimizing the penalized cost function will therefore drive <em>ρj</em> close to <em>ρ. </em>The cost function can then be written as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sILzZdR8Tgt5nIiu_pMX1A.png" /></figure><p>where<em> J(W,b) </em>is the original cost function and<em> β</em> is a hyperparameter that controls the strength of the penalty. 
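As a rough sketch, the penalty and the penalized cost can be computed in NumPy as follows (the function names, and the default values ρ = 0.05 and β = 3, are illustrative rather than taken from any particular implementation):

```python
import numpy as np

def kl_sparsity_penalty(rho, rho_hat):
    """Summed KL divergence between Bernoulli(rho) and Bernoulli(rho_hat_j),
    one term per hidden unit, where rho_hat is the vector of average activations."""
    rho_hat = np.clip(rho_hat, 1e-8, 1 - 1e-8)  # avoid log(0)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def sparse_cost(j_wb, rho_hat, rho=0.05, beta=3.0):
    """Penalized cost: the original cost J(W,b) plus beta times the KL penalty."""
    return j_wb + beta * kl_sparsity_penalty(rho, rho_hat)
```

The penalty is zero when every average activation equals ρ and grows monotonically as the activations drift away from it.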
Overall, by adding this penalty to the cost function, it is possible to regularize a standard autoencoder or an autoencoder that contains relatively many hidden units. An implementation guide for the sparse autoencoder (SAE) will follow shortly.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Phi Coefficient A.K.A Matthews Correlation Coefficient (Binary Classification)]]></title>
            <link>https://medium.com/@cdefaux/phi-coefficient-a-k-a-matthews-correlation-coefficient-binary-classification-11e2c29db91e?source=rss-4b3b1031f27------2</link>
            <guid isPermaLink="false">https://medium.com/p/11e2c29db91e</guid>
            <category><![CDATA[metrics]]></category>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[classification]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Christiaan Defaux]]></dc:creator>
            <pubDate>Mon, 27 Jan 2020 18:50:21 GMT</pubDate>
            <atom:updated>2020-01-27T18:50:21.615Z</atom:updated>
            <content:encoded><![CDATA[<p>In machine learning/data science, we often run into problems where we’re trying to classify binary (two-class) data. In this case, you might’ve fit the data to a model that successfully (or so it appears) classifies your data. However, one common issue might be a class imbalance problem in the data; this can lead to a misuse or misunderstanding of the evaluation metrics.</p><p>Enter the phi coefficient (mean square contingency coefficient, φ), otherwise known as Matthews correlation coefficient (MCC). As a brief aside, the phi coefficient was first introduced by Karl Pearson but is equivalent to the MCC, which was introduced by Brian W. Matthews in 1975 and is often used in bioinformatics. They are measures of association between two binary variables where:</p><ul><li>at least one of the variables is <a href="https://www.statisticshowto.datasciencecentral.com/nominal-ordinal-interval-ratio/"><strong>nominal</strong></a> (named, e.g. black/white)</li><li>both variables are <a href="https://www.statisticshowto.datasciencecentral.com/dichotomous-variable/"><strong>dichotomous</strong></a> (two opposing <strong>categories</strong>, e.g. dead/alive, or <strong>levels</strong> of a continuous variable, e.g. pass/fail at above/below 60%)</li></ul><p>The measures are often used in contingency tables to assess the correlation of the variables. The benefits of these are that they are balanced measures, meaning they can provide insight even when there is a class imbalance. For example, the accuracy measure (proportion of correct predictions) is not effective when there is a large class imbalance.</p><h4>Calculating the Statistic</h4><p>Whether using a <a href="https://en.wikipedia.org/wiki/Confusion_matrix">confusion matrix</a> or a <a href="https://en.wikipedia.org/wiki/Contingency_table">contingency table</a>, the calculation is pretty much identical. 
When reading off a confusion matrix, the following calculation can be done:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2gE0DT1Oh1vzOWjP59yh5A.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*R6_BTaMSdCLdNBa0oauFQQ.png" /></figure><p>where <em>TP</em> is the number of <a href="https://en.wikipedia.org/wiki/True_positive">true positives</a>, <em>TN</em> the number of <a href="https://en.wikipedia.org/wiki/True_negative">true negatives</a>, <em>FP</em> the number of <a href="https://en.wikipedia.org/wiki/False_positive">false positives</a> and <em>FN</em> the number of <a href="https://en.wikipedia.org/wiki/False_negative">false negatives</a>. This is the mathematical formula for the MCC; however, for convenience, scikit-learn provides an MCC function in its metrics module. This can easily be accessed in Python by using:</p><pre>from sklearn.metrics import matthews_corrcoef

matthews_corrcoef(y_true, y_pred, sample_weight=None)</pre><p>where the parameters y_true and y_pred are the ground truth (correct) target values and the estimated targets, respectively. 
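As a sanity check, the formula above can also be computed directly from the four counts; this small helper (the name mcc is mine, not scikit-learn’s) is a minimal sketch:

```python
def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient computed from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    # Convention: return 0 when a marginal total is zero and the ratio is undefined.
    return numerator / denominator if denominator else 0.0
```

A perfect classifier (FP = FN = 0) yields 1, while one that always misclassifies (TP = TN = 0) yields -1.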
There is support in the function for both binary and multiclass labels; however, only in the binary case does the information relate to true and false positives and negatives.</p><h4>Interpreting the Statistic</h4><p>The range of these metrics is from -1 to 1, where:</p><ul><li>0 indicates no relationship</li><li>1 indicates a perfect positive relationship</li><li>-1 indicates a perfect negative relationship</li></ul><p>In the case of an MCC of 1, we can assume that <em>FP </em>and <em>FN </em>equal 0, whereas in the case of an MCC of -1 we get a classifier that always misclassifies, hence we get <em>TP </em>and <em>TN </em>equal to 0.</p><p>For the range of values between -1 and 1, there are some crude estimates or rules of thumb, which are as follows:</p><ul><li>.70 and higher — very strong positive relationship</li><li>.40 to .69 — strong positive relationship</li><li>.30 to .39 — moderate positive relationship</li><li>.20 to .29 — weak positive relationship</li><li>.01 to .19 — no or negligible relationship</li><li>0 — no relationship</li><li>-.01 to -.19 — no or negligible relationship</li><li>-.20 to -.29 — weak negative relationship</li><li>-.30 to -.39 — moderate negative relationship</li><li>-.40 to -.69 — strong negative relationship</li><li>-.70 and lower — very strong negative relationship</li></ul><p>*Note: these values are estimates and should be used with respect to the intrinsic qualities of the data being analyzed.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Unsupervised Learning: The autoencoder in laymans terms]]></title>
            <link>https://medium.com/@cdefaux/unsupervised-learning-the-autoencoder-in-laymans-terms-5dfdee22e692?source=rss-4b3b1031f27------2</link>
            <guid isPermaLink="false">https://medium.com/p/5dfdee22e692</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[variational-autoencoder]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[neural-networks]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[Christiaan Defaux]]></dc:creator>
            <pubDate>Wed, 22 Jan 2020 21:20:32 GMT</pubDate>
            <atom:updated>2020-01-22T21:20:32.830Z</atom:updated>
            <content:encoded><![CDATA[<p>Sometimes we might be presented with data that lack labels. This requires a special set of machine learning models under the umbrella of unsupervised learning. When we have such data, we want to learn latent (hidden) features within the data, without trying to classify each data point.</p><p>An autoencoder is a type of deep neural network that uses backpropagation and sets the target variables equal to the input variables. An autoencoder requires that the number of output nodes is equal to the number of input nodes. It will take the peculiar diabolo, or horizontal hourglass, shape shown below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*epaKR_2-uj_940Xy2xjaGg.png" /><figcaption>An autoencoder is built with an encoder-decoder pair</figcaption></figure><p>The encoder layer acts to compress the latent information, whereas the decoder layer acts to reconstruct this information as accurately as possible. This forces the autoencoder to learn features to encode useful information into the hidden layer nodes. Autoencoders may be used to automatically extract high-level features, which may also be called embeddings.</p><p>In more analytical terms, the autoencoder learns an approximation of the identity function so that the output is as similar as possible to the input. Once trained, in order to extract features from an input, one just needs to feedforward the input data and gather the activations of the hidden layer (the values of the latent variables).</p><p>There are some variations of the autoencoder architecture that allow for regularization and optimization. Examples of these are the sparse autoencoder, the denoising autoencoder and the contractive autoencoder (all of which are regularized variations of the simple autoencoder). There is also the variational autoencoder, which is an example of a generative model. 
Finally, an autoencoder can also be made deep by adding extra interstitial layers. These architectures offer many advantages depending on the use case.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Kullback-Leibler Divergence for Dummies]]></title>
            <link>https://medium.com/@cdefaux/kullback-leibler-divergence-for-dummies-c3613bc80ad3?source=rss-4b3b1031f27------2</link>
            <guid isPermaLink="false">https://medium.com/p/c3613bc80ad3</guid>
            <dc:creator><![CDATA[Christiaan Defaux]]></dc:creator>
            <pubDate>Thu, 16 Jan 2020 01:48:56 GMT</pubDate>
            <atom:updated>2020-01-20T18:56:14.365Z</atom:updated>
            <content:encoded><![CDATA[<h3>Kullback-Leibler Divergence for Machine Learning</h3><p>If you’re diving deep into deep learning or gaining a better grasp of machine learning, then an understanding of the Kullback-Leibler divergence can be invaluable. Another name for this measure is relative entropy, which might give those with a physics or statistics background a hint about what is actually being measured.</p><p>In layman’s terms, the K-L divergence is a measure of how different a specific probability distribution is from a reference distribution. However, it is not a true metric in the statistical sense, unlike variation of information, which measures the distance between two clusterings.</p><p>Please don’t be put off by the following mathematics; for those who understand the notation, the following is the definition of the K-L divergence for two discrete probability distributions <em>P</em> and <em>Q</em> within the space <em>χ.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/552/1*HAewCldqQIfxPncxYc4wBg.png" /><figcaption>formula for the K-L divergence</figcaption></figure><p>where D(K-L) is the divergence of <em>Q </em>from <em>P. </em>Again in layman’s terms, this can be seen as the expectation of the logarithmic difference between <em>P</em> and <em>Q.</em> If <em>P</em> and <em>Q </em>are absolutely continuous distributions over the entire space <em>χ, </em>then the above formula’s summation is replaced with an integral over the entire space. It must be noted that the log is set to base 2 when the information is measured in bits or to base <em>e </em>when measured in nats.</p><p>The K-L divergence is an important feature in a variety of machine learning models. One in particular is the Variational Autoencoder (VAE). This article assumes some background in machine learning. 
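The discrete definition can be sketched in a few lines of Python (the function name is illustrative; natural log is used, so the result is in nats):

```python
import math

def kl_divergence(p, q):
    """D(P || Q) for two discrete distributions given as sequences of probabilities.
    Terms with p_i = 0 contribute nothing, by the convention 0 * log 0 = 0."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)
```

Note that kl_divergence(p, q) and kl_divergence(q, p) generally differ, which is one reason the K-L divergence is not a true metric.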
Briefly, a VAE is a generative model that uses a similar architecture to a standard autoencoder but adds a sampled latent vector, which is decoded and then evaluated by the loss function.</p><p>In a VAE, there are two components to the loss function: the reconstruction term and the regularization term. The reconstruction term serves as a measure of the efficacy of the encoder-decoder with respect to the initial data and is computed at the output layer. With 0 reconstruction loss, an autoencoder would perfectly reconstruct the input data. This would indicate extreme overfitting and a lack of interpretable latent features.</p><p>VAEs encode their inputs as normal (Gaussian) distributions rather than points. This is where the K-L divergence comes in. It is optimal for the distributions of the VAE to be regularized to increase the amount of overlap within the latent space. K-L divergence measures this and is added into the loss function. There is a tradeoff between reconstruction and regularization. If we want to reduce our reconstruction error, this comes at the expense of K-L divergence or regularization.</p><p>Another example of the application of K-L divergence in machine learning is in t-distributed Stochastic Neighbor Embedding, which is a dimensionality reduction technique that uses probabilistic methods to model complex non-linearities in data whilst also reducing the number of dimensions.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Music Generation using Deep Learning — Design Decisions]]></title>
            <link>https://medium.com/@cdefaux/music-generation-using-deep-learning-design-decisions-cf251c241ac4?source=rss-4b3b1031f27------2</link>
            <guid isPermaLink="false">https://medium.com/p/cf251c241ac4</guid>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Christiaan Defaux]]></dc:creator>
            <pubDate>Wed, 08 Jan 2020 21:00:40 GMT</pubDate>
            <atom:updated>2020-01-08T21:00:40.054Z</atom:updated>
            <content:encoded><![CDATA[<h3>Music Generation using Deep Learning — Design Decisions</h3><p>This post is based mostly on information from Deep Learning Techniques for Music Generation — <em>A Survey </em>(Briot et al. 2019), which provides an overview of the various methods and architectures used in successful music generation. This assumes a prior background in deep learning and the various types of architectures.</p><h3>Global versus Time Step</h3><p>Temporal scope is often an issue when deciding which data representation to use.</p><p><em>Global</em> — in this case, the temporal scope of the representation is the whole musical piece. The neural network architecture will process the input and produce the output within a single global step.</p><p><em>Time Step — </em>in this case (the most frequently used case), the temporal scope of the representation is a local time slice of the musical piece, corresponding to a specific moment in time. The granularity of the processing time is the time step itself. The time step is usually set to the shortest note duration.</p><p><em>Note Step —</em> in this approach there is <em>no fixed time step</em>. The granularity of processing by the deep network architecture is a <em>note</em>. This strategy uses a distributed encoding of duration that allows the network to process a note of any duration in a single processing step. 
Note that, by treating a note rather than a time step as a single processing step, the number of processing steps to be bridged by the network is greatly reduced.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JlOmLCA-pC9rVR0Zf9hidA.png" /></figure><p>One important decision is the choice between the two main types of temporal scope: Global and Time step.</p><p>The pros and cons are as follows:</p><p>Global</p><ul><li>+ allows arbitrary output, e.g., for the objective of generating some accompaniment through the single-step feedforward strategy on a feedforward architecture, as, for example, in the MiniBach system.</li><li>+ supports incremental instantiation (via sampling), as, for example, by the DeepBach system.</li><li>– does not allow variable length generation;</li><li>– does not allow seed-based generation for a feedforward architecture;</li><li>+ allows seed-based generation through the decoder feedforward strategy on an autoencoder architecture, as, for example, in the DeepHear system.</li></ul><p>Time Step</p><ul><li>+/– supports incremental instantiation, but only forward in time, through the iterative feedforward strategy on a recurrent network architecture;</li><li>+ allows variable length generation, as, for example, in the CONCERT system.</li></ul><h3>Convolution versus Recurrent</h3><p>Convolutional architectures, while prevalent for image applications, are used less often than recurrent neural network architectures in music applications.</p><p>Let us try to list and analyze the relative pros and cons of using recurrent architectures or convolutional architectures to model time correlations:</p><ul><li>recurrent networks are popular and accurate, especially since the arrival of LSTM architectures.</li><li>convolution should be used <em>a priori </em>only on the time dimension because, as opposed to images where motives are invariant in all dimensions, in music the pitch dimension is <em>a priori </em>not metrically 
invariant.</li><li>convolutional networks are typically faster to train and easier to parallelize than recurrent networks.</li><li>using convolution on the time dimension in place of a recurrent network implies multiplying the number of input variables by the number of time steps considered, and thus significantly increases the volume of data to process and the number of parameters to adjust.</li><li>weight sharing in convolutions applies only to a small temporal neighborhood of the input, in contrast to a recurrent network, which shares parameters across all time steps.</li><li>the authors of WaveNet argue that the layers of dilated convolutions allow the receptive field to grow longer in a much cheaper way than using LSTM units.</li><li>the authors of MidiNet argue that using a conditioning strategy for a convolutional architecture allows the incorporation of information from previous measures into intermediate layers and therefore considers history as a recurrent network would.</li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using the Durbin-Watson (DW) test for time-series data]]></title>
            <link>https://medium.com/@cdefaux/using-the-durbin-watson-dw-test-for-time-series-data-b87bd884fbbd?source=rss-4b3b1031f27------2</link>
            <guid isPermaLink="false">https://medium.com/p/b87bd884fbbd</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Christiaan Defaux]]></dc:creator>
            <pubDate>Fri, 09 Aug 2019 15:44:51 GMT</pubDate>
            <atom:updated>2020-01-20T18:59:07.243Z</atom:updated>
            <content:encoded><![CDATA[<h3>Using the Durbin-Watson (DW) test for testing time-series autocorrelation.</h3><p>First of all, one must be aware of autocorrelation in order to appropriately apply the DW test. Autocorrelation, also known as serial correlation, is the correlation of a signal with a delayed copy of itself as a function of delay. In other words, it is the similarity between observations as a function of the time lag between them. This property makes the DW test useful for time-series data where the current state of the system depends heavily on prior data. Time-series refers to systems in which time is a factor in the progression of the system, for example: ocean tides, sunspot counts, stock-market prices, etc.</p><p>An autoregressive process is basically a linear regression of the data in the current series against one or more past values in the same series. The value of the outcome variable (Y) at some point <em>t</em> in time is — like “regular” linear regression — directly related to the predictor variable (X). Where simple linear regression and autoregression models differ is that Y is dependent on X and previous values for Y. This is an example of a stochastic process, which has degrees of uncertainty or randomness built in. The randomness means that you might be able to predict future trends pretty well with past data, but you’re never going to get 100 percent accuracy. Usually, the process gets “close enough” for it to be useful in most scenarios.</p><p>The DW test is named after James Durbin and Geoffrey Watson, who first used the technique in the 1950s. In statistics, the autocorrelation of a random process is the Pearson correlation between values of the process at different times, as a function of the two times or of the time lag. The Durbin-Watson Test is a measure of autocorrelation (also called serial correlation) in residuals from regression analysis. 
The DW test statistic is calculated using the following equation:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/706/1*uKTmqqPs85XQ4w1-vQ3ROg.png" /></figure><p>where T is the total number of observations and <em>e</em> is the residual error given by:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/416/1*uGbtka42I7HR0ckVx1_sAw.png" /></figure><p>where rho is the sample autocorrelation of the residuals. From the second equation, one can see how the current residuals are related to the past state of the system. If one has a lengthy sample, then this can be linearly mapped to the Pearson correlation of the time-series data with its lags. It must be noted that the DW test is used to test for a specific type of autoregression (AR1), meaning a first-order autoregression. The outcome variable in a first-order autoregression process at some point in time <em>t</em> is only related to time periods that are one period apart (i.e. the value of the variable at <em>t</em>-1). A second or third order autoregression process would be related to data two or three periods apart, respectively.</p><p>The hypotheses for the Durbin-Watson test are:<br>H0 = no first-order autocorrelation (rho = 0)<br>H1 = first-order autocorrelation exists (rho != 0)</p><p>One assumes:</p><ul><li>The errors are normally distributed with a mean of 0.</li><li>The errors are stationary.</li></ul><p>The Durbin-Watson test gives values that are between 0 and 4 with the following meaning:</p><ul><li>2 indicates no autocorrelation.</li><li>0 to &lt;2 indicates positive autocorrelation (common in time-series data).</li><li>2 to 4 indicates negative autocorrelation (less common in time-series data).</li></ul><p>If using Python, there are excellent functions within the StatsModels package. 
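Before reaching for StatsModels, the test statistic equation above can be sketched directly in NumPy (the function name durbin_watson_stat is illustrative):

```python
import numpy as np

def durbin_watson_stat(resids):
    """Durbin-Watson statistic: the sum of squared successive differences
    of the residuals, divided by the residual sum of squares."""
    resids = np.asarray(resids, dtype=float)
    return np.sum(np.diff(resids) ** 2) / np.sum(resids ** 2)
```

Residuals that keep the same sign from step to step push the statistic toward 0, residuals that flip sign at every step push it toward 4, and uncorrelated residuals land near 2.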
The Durbin-Watson test statistic can be found by running the following code on an array:</p><pre>from statsmodels.stats.stattools import durbin_watson

durbin_watson(resids, axis=0)</pre><p>Once again, the test statistic is approximately equal to 2*(1 - rho), where rho is the sample autocorrelation of the residuals. Thus, for rho = 0, indicating no serial correlation, the test statistic equals 2. This statistic will always be between 0 and 4. The closer the statistic is to 0, the more evidence for positive serial correlation. The closer to 4, the more evidence for negative serial correlation.</p><p>Upper and lower critical values, dU and dL, have been tabulated for different values of k (the number of explanatory variables) and n (the number of observations).</p><ul><li>If d&lt;dL — reject H0</li><li>If d&gt;dU — do not reject H0</li><li>If dL &lt; d &lt; dU — test is inconclusive</li></ul><p>Using the DW critical value table below, by inputting the sample size n, the number of regressors k and an acceptable alpha level, one can find dU and dL.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tyY1Cdss5kRDJUUdLHOEeA.png" /></figure>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The benefits of a growth mindset in data science]]></title>
            <link>https://medium.com/@cdefaux/the-benefits-of-a-growth-mindset-in-data-science-26b63c06bfca?source=rss-4b3b1031f27------2</link>
            <guid isPermaLink="false">https://medium.com/p/26b63c06bfca</guid>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[growth-mindset]]></category>
            <dc:creator><![CDATA[Christiaan Defaux]]></dc:creator>
            <pubDate>Fri, 09 Aug 2019 14:15:15 GMT</pubDate>
            <atom:updated>2019-08-09T14:15:15.615Z</atom:updated>
            <content:encoded><![CDATA[<p>I’m currently in a data science immersive (read: intensive). It’s a lot of information to download into the trusty old biological hard drive in a short amount of time. So far, so good, though; I attribute at least some of the ease with which I take on new information to my ‘growth mindset’. If I had a more limiting mindset, I feel certain that I’d be struggling immensely. Belief in oneself and one’s ability is fundamental for growth. In a field like data science, where new techniques and information are always available, it’s especially necessary to keep a growth mindset.</p><p>For one thing, the problems that data scientists will face are apt to be varied and intricate. Routine will not exist in this field. This means that being comfortable being uncomfortable is critical. The answers we seek will rarely just present themselves. We will always be faced with new data and new insights to be gleaned. Therefore, we must be willing to work through problems, learn new techniques and continue to grow.</p><p>Growth vs. fixed mindsets are the brainchild of Carol Dweck, a professor of psychology, now teaching at Stanford. In her work ‘Mindset: The New Psychology of Success’, she posits that “the view you adopt for yourself profoundly affects the way you lead your life. It can determine whether you become the person you want to be and whether you accomplish the things you value”. Furthermore, she sought to “understand why some students were so caught up in proving their ability, while others could just let go and learn. [They] realized that there were <em>two </em>meanings to ability, not one: a fixed ability that needs to be proven, and a changeable ability that can be developed through learning”.</p><p>In other words, when one holds a mindset of openness to difficulty and struggle in order to grow, one can grow far more easily than when holding a mindset of fixed and innate ability. 
Certainly, innate ability has some part to play, but even those with profound intelligence still must exercise themselves in order to grow. If the geniuses of the past believed their intelligence was fixed, where would we be today? Einstein certainly didn’t exit the womb spouting the theory of general relativity. Dweck explains that, by putting oneself into difficult learning situations and remembering that understanding will come eventually even if it doesn’t come immediately, one can achieve nearly anything.</p><p>Obviously, within data science we are expected to be knowledgeable and accurate, but we must be open to our mistakes so that we don’t perpetuate misinformation. Double-checking the veracity of our work is essential. We must also be aware of Goodhart’s Law, which states, ‘any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes’. Or in layman’s terms, ‘when a measure becomes a target, it ceases to be a good measure’. Sometimes we might find actionable insights that, when implemented, create unintended consequences. We are apt to err and mistakenly act on spurious correlations. Therefore, as data scientists we must maintain intellectual humility and be open to our own failures, but see them as opportunities for growth rather than setbacks that define our ability.</p><p>A particular excerpt I would like to share is as follows, “when you enter a mindset, you enter a new world. In one world — the world of fixed traits — success is about proving you’re smart or talented. Validating yourself. In the other — the world of changing qualities — it’s about stretching yourself to learn something new. Developing yourself.” As for me, I will keep this as a mantra and always try to cultivate a growth mindset. Even if a concept seems too difficult to understand, with time and effort I will surely grasp it. 
The most important message to take from this is to always have the perspective of a student, maintain a willingness to be wrong and never let anything set you back from your goals. I recommend watching this video where Dweck speaks on her work and findings. In my opinion it’s very inspirational: <a href="https://www.ted.com/talks/carol_dweck_the_power_of_believing_that_you_can_improve?language=en">https://www.ted.com/talks/carol_dweck_the_power_of_believing_that_you_can_improve?language=en</a></p>]]></content:encoded>
        </item>
    </channel>
</rss>