Anomaly detection on a categorical and continuous dataset

Shreyash Shrivastava
7 min read · Jul 26, 2019

There are many Machine Learning algorithms available today for regression/cluster analysis on different types of datasets. Like many naive students of AI, I started with the notion that the major part of my work would be finding a popular ML algorithm. After reading dozens of journal papers, countless hours of coding, and numerous roadblocks, I have a story to tell!

What is an anomaly in a dataset?

An outlier is defined as a data point which is very different from the rest of the data based on some measure. Such a point often contains useful information on abnormal behavior of the system described by the data.

Let’s start with anomaly detection and its techniques.

Supervised Anomaly Detection: This describes the setup where the data comprises fully labeled training and testing data sets. This scenario is similar to a standard supervised machine learning problem.

Semi-Supervised Anomaly Detection: This technique might be an ‘anomaly’ in the way traditional machine learning thinks about semi-supervised learning. In the anomaly detection scenario, the training data consists only of normal data, without any anomalies. The basic idea is that a model of the normal class is learned, and anomalies are detected afterwards as deviations from that model.

Unsupervised Anomaly Detection: This is the most flexible setup of the detection system. It does not require any labeling, and there is no distinction between training and test data.

Goldstein M, Uchida S (2016) A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data. PLoS ONE.

In my opinion, if you are remotely unsure of the data you are using and its properties, default to the unsupervised technique.

This article is about a data set with almost all categorical features (and maybe some continuous too). Dealing with a categorical feature space is a unique skill you will acquire, while you steadily struggle with your data set.

Some tips on Data Preprocessing

Data Preprocessing is the crucial step that decides the difference between a bad, mediocre and awesome output.

You cannot compensate for domain knowledge with ML expertise.

  1. Drop all the redundant columns.
  2. Clean the null values, while making sure your data set does not lose a majority of its rows.
  3. Treat the continuous variables with suspicion. You might encounter variables whose values look like (101, 102, 103, …); these are really codes and should be treated as categorical.
  4. You can also combine categories. For instance, if a feature has 20 categories and the top four cover 99.9 percent of the instances, you can combine the remaining 16 into one group (a sketch follows the plotting code below).
  5. Drawing a bar graph of each categorical feature will always help in determining the span of its categories. You can use the code below for reference; it will help you drop some more features.
import pandas as pd
import matplotlib.pyplot as plt

# Bar chart of normalized category frequencies for every categorical column
for x in df.columns[df.dtypes == object]:
    fig = plt.figure()
    df[x].value_counts(normalize=True).plot(kind='bar')
    fig.suptitle(x)
    plt.show()
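
As a rough sketch of tips 3 and 4 (the column names 'channel' and 'branch_code' are hypothetical), you can collapse rare categories into an 'other' bucket and recast integer codes as categorical:

# Tip 4: keep the four most frequent categories of 'channel' (hypothetical column)
# and collapse everything else into a single 'other' bucket
top4 = df['channel'].value_counts().nlargest(4).index
df['channel'] = df['channel'].where(df['channel'].isin(top4), other='other')

# Tip 3: integer codes such as 101, 102, 103 are categorical, not continuous
df['branch_code'] = df['branch_code'].astype(str)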

One Hot Encoding vs Label Encoding And Standardization vs Normalization

One hot encoding is the way to go if the data set you are using does not give any explicit indication that the features are ordinal. Most categories you will encounter will be nominal. If you blindly apply one-hot encoding to the whole dataset, there is a high chance that you will run into a memory error.
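
As a minimal sketch (the column names are hypothetical), pandas can one-hot encode the low-cardinality nominal features with sparse output to keep memory in check, while a very high-cardinality column can be label encoded instead:

import pandas as pd

# One-hot encode low-cardinality nominal columns; sparse output saves memory
encoded = pd.get_dummies(df, columns=['state', 'product_type'], sparse=True)

# Label-encode a very high-cardinality column instead of one-hot encoding it
df['merchant_id'] = df['merchant_id'].astype('category').cat.codes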

Normalization typically means rescaling the values into a range of [0, 1]. Standardization typically means rescaling the data to have a mean of 0 and a standard deviation of 1 (unit variance).

I would recommend not applying standardization to the entire data set. You can apply sklearn.preprocessing.StandardScaler to the continuous features if they are not in the same units. Otherwise your data set might lose its essence before you apply detection algorithms. Unless you are sure you are applying iForest, do not standardize the entire continuous feature set.
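
For example (a sketch, assuming hypothetical continuous columns 'amount' and 'duration'), you can standardize just those columns and leave the encoded categorical features untouched:

from sklearn.preprocessing import StandardScaler

cont_cols = ['amount', 'duration']  # hypothetical continuous columns in different units
scaler = StandardScaler()
df[cont_cols] = scaler.fit_transform(df[cont_cols])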

Some tips on Data Analysis — Univariate and Multivariate

After preprocessing the data thoroughly, it is time to analyze the dataset. However few continuous features you have, it is still a good idea to understand the aspects of those features.

In anomaly detection, you need an identification column in the data set so that you can trace flagged rows back to their source, but you also need to remove that column before running the detection algorithms. The easiest steps of univariate analysis are built into pandas.

df.describe() 
df.corr()

A correlation matrix is also a good analytical tool. Use the heatmap code for reference.

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Correlation heatmap of the continuous features
corr = df.corr()
plt.figure(figsize=(10, 10))
sns.heatmap(corr, annot=True, cmap='RdYlGn')
plt.show()

If you do not see a very high or a very low correlation among the continuous variables, you are good to try out the detection methods.

Unsupervised Anomaly Detection

Unsupervised anomaly detection methods can be further subclassified.

Distribution Based Detection

A standard distribution is fitted to the data set, and outliers are identified by their low probability under that distribution (e.g. a Gaussian Mixture Model). The problem with this approach is that the underlying data distribution is assumed to be known a priori. For many applications, this is an impractical assumption.
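
As a rough sketch of this idea (not the exact setup of this project; the number of components and the 2 percent cutoff are assumptions, and X is the preprocessed numeric feature matrix), a Gaussian Mixture Model can score each point by its log-likelihood and flag the least likely ones:

import numpy as np
from sklearn.mixture import GaussianMixture

# Fit a mixture of Gaussians to the preprocessed feature matrix X
gmm = GaussianMixture(n_components=3, random_state=42).fit(X)
log_likelihood = gmm.score_samples(X)

# Flag the 2 percent of points with the lowest likelihood as anomalies
threshold = np.percentile(log_likelihood, 2)
anomalies = X[log_likelihood < threshold]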

Kernel Density Estimation: You can draw a kernel density estimation graph if you have a final calculation column on the data.

On every data point xi, we place a kernel function K.

The kernel function K is typically non-negative, symmetric, and decreasing. The choice of the kernel function is not critical: as more data comes in, the estimates from different kernels start to look similar.

Choice of Bandwidth (KDE): We use b to control the bandwidth of the estimate, f(x) = (1/(n·b)) Σi K((x − xi)/b). If b is large, each kernel is spread out and the estimate becomes smoother.

If the KDE applied to your final calculation column is too wavy, it is probably overfitting the estimate: each point in your data set is contributing its own kernel function. If the KDE is too smooth, it is underfitting. To adjust the fit, you can try adjusting the bandwidth. If, even after adjusting the bandwidth by a factor of 100, the overfitting does not decrease, distribution-based approaches might not work in your favor.
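
A minimal sketch of this bandwidth experiment with scikit-learn (the column name 'final_calc' and the bandwidth values are assumptions):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity

values = df['final_calc'].to_numpy().reshape(-1, 1)  # hypothetical calculation column
grid = np.linspace(values.min(), values.max(), 500).reshape(-1, 1)

# Compare narrow (wavy, overfit) and wide (smooth, underfit) bandwidths
for b in [0.1, 1.0, 10.0]:
    kde = KernelDensity(kernel='gaussian', bandwidth=b).fit(values)
    density = np.exp(kde.score_samples(grid))
    plt.plot(grid.ravel(), density, label='b = {}'.format(b))
plt.legend()
plt.show()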

Depth Based Detection

Based on some definition of depth, data objects are organized into convex hull layers in the data space, according to peeling depth. Outliers are expected to be the data objects with shallow depth values; this relies on the computation of k-d convex hulls.

Distance-Based Detection

An object is flagged as an outlier if more than a given percentage of the objects in the data lie farther than some minimum distance d (both defined by the algorithm) away from it.

Recent studies show that in high-dimensional space, the concept of proximity may not be qualitatively meaningful. The direct application of distance-based approaches to high-dimensional data leads to poor performance and also suffers from the curse of dimensionality.
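
A small sketch of the distance-based idea, using the distance to the k-th nearest neighbour as an outlier score (the value of k and the 2 percent cutoff are assumptions):

import numpy as np
from sklearn.neighbors import NearestNeighbors

k = 10
# k + 1 neighbours because each point is returned as its own nearest neighbour
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)
outlier_score = distances[:, -1]  # distance to the k-th real neighbour

# Flag points whose k-NN distance is in the top 2 percent
outliers = outlier_score > np.percentile(outlier_score, 98)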

Anomaly Detection in High Dimension

The sparsity of high dimensional data implies that every data point is an almost equally good outlier from the perspective of proximity-based definitions.

Detecting outliers in a large set of data objects is a major data mining task aiming at finding different mechanisms responsible for different groups of objects in a data set. All existing approaches, however, are based on an assessment of distances (sometimes indirectly by assuming certain distributions) in the full-dimensional Euclidean data space.

PyOD, an outlier detection library implemented by Yue Zhao, a doctoral student at CMU, lays an excellent foundation of numerous machine learning algorithms you can apply to your data set. Be sure to read the documentation properly for hyper-parameter tuning. An arbitrary choice of hyper-parameters may make an otherwise very good algorithm perform very poorly on a data set.
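
A minimal sketch of the PyOD workflow (the contamination value is an assumption, and X is the preprocessed feature matrix from earlier):

from pyod.models.iforest import IForest
from pyod.models.knn import KNN

# contamination is the expected fraction of anomalies, a key hyper-parameter
clf = IForest(contamination=0.02, random_state=42)
clf.fit(X)

scores = clf.decision_scores_  # raw outlier scores on the training data
labels = clf.labels_           # 0 = inlier, 1 = outlier

# Swapping in another detector only changes the constructor
clf_knn = KNN(contamination=0.02, n_neighbors=10).fit(X)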

Insights

There are many ways in which you can check your work after getting the decision scores and labels from the algorithm you applied. I chose to intersect the contaminated labels (see the documentation of PyOD) with some of the exceptions I knew existed.
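
A sketch of that check (the identifier column and the known-exception IDs are hypothetical, and labels comes from the PyOD sketch above, row-aligned with df):

# IDs the detector flagged as anomalies (labels == 1)
flagged_ids = set(df.loc[labels == 1, 'record_id'])

# Exceptions already known from domain knowledge (hypothetical values)
known_exceptions = {'A-1042', 'A-2193'}

overlap = flagged_ids & known_exceptions
print('{} of {} known exceptions were flagged'.format(len(overlap), len(known_exceptions)))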

In my problem statement, randomly choosing ‘X’ percent of the data set as anomalies would catch very few of the real anomaly points, by simple probability. For instance, if 100 data points contain 2 anomalies (which will usually be the case for anomaly detection problems), a random 10 percent sample contains 0.2 anomalies on average, so you would need around five such samples to expect even one. That is a very low hit rate.

After taking the above steps, I was getting more than one anomaly when I chose 10 percent of the data in the above problem. This implies I am able to detect more than 50 percent of the total (known) anomalies (2) present in the data.
