DBSCAN Algorithm for Fraud Detection & Outlier Detection in a Dataset

Dilip Valeti
7 min read · Oct 18, 2021


Overview

  • DBSCAN clustering is an underrated yet super useful clustering algorithm for unsupervised learning problems
  • Learn how DBSCAN clustering works, why you should learn it, and how to implement DBSCAN clustering in Python

Fraud/anomaly/outlier detection has long been the subject of intense research in data science. In the ever-changing landscape of fraud detection, new tools and techniques are being tested and employed every day to screen out abnormalities.

Today I’m going to introduce a technique called DBSCAN — short for Density-Based Spatial Clustering of Applications with Noise.

What Exactly is DBSCAN Clustering?

As the name suggests, DBSCAN is a density-based and unsupervised machine learning algorithm. It takes multi-dimensional data as inputs and clusters them according to the model parameters — e.g. epsilon and minimum samples. Based on these parameters, the algorithm determines whether certain values in the dataset are outliers or not.

It groups ‘densely grouped’ data points into a single cluster. It can identify clusters in large spatial datasets by looking at the local density of the data points. The most exciting feature of DBSCAN clustering is that it is robust to outliers. It also does not require the number of clusters to be specified beforehand, unlike K-Means, where we have to choose the number of centroids.

DBSCAN requires only two parameters: epsilon and minPoints. Epsilon is the radius of the circle to be created around each data point to check the density and minPoints is the minimum number of data points required inside that circle for that data point to be classified as a Core point.

In higher dimensions the circle becomes a hypersphere: epsilon becomes the radius of that hypersphere, and minPoints remains the minimum number of data points required inside it.

Sounds confusing? Let’s understand it with the help of an example.

Here, we have some data points, shown in grey. Let’s see how DBSCAN clusters these data points.

DBSCAN creates a circle of epsilon radius around every data point and classifies each one as a Core point, a Border point, or Noise. A data point is a Core point if the circle around it contains at least ‘minPoints’ points (including itself). If it has fewer than minPoints points in its circle but lies within the epsilon radius of a Core point, it is classified as a Border point. And if a point is neither a Core point nor within epsilon of one, it is treated as Noise.

The above figure shows us a cluster created by DBSCAN with minPoints = 3. Here, we draw a circle of equal radius epsilon around every data point. These two parameters help in creating spatial clusters.

All the data points with at least 3 points inside their circle (including themselves) are considered Core points, shown in red. Data points with fewer than 3 points in their circle but falling inside the circle of a Core point are considered Border points, shown in yellow. Finally, data points with no other point inside their circle and no Core point nearby are considered Noise, shown in purple.
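To make this concrete, here is a minimal sketch (not scikit-learn’s implementation) that labels every point as Core, Border, or Noise from just epsilon and minPoints; the sample points and parameter values are illustrative:

import numpy as np

def classify_points(X, eps, min_points):
    # pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # neighbor counts include the point itself, as in the text above
    core = (dists <= eps).sum(axis=1) >= min_points
    labels = []
    for i in range(len(X)):
        if core[i]:
            labels.append("core")
        elif np.any(core & (dists[i] <= eps)):
            labels.append("border")  # not dense itself, but inside a core point's circle
        else:
            labels.append("noise")
    return labels

X = np.array([[0, 0], [0.5, 0], [1.0, 0], [1.9, 0], [5.0, 5.0]])
print(classify_points(X, eps=1.0, min_points=3))  # core, core, core, border, noise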

For locating data points in space, DBSCAN uses Euclidean distance by default, although other metrics can also be used (like great-circle distance for geographical data). It also needs only a single pass through the dataset, whereas iterative algorithms such as K-Means have to scan it multiple times.
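For instance, scikit-learn’s DBSCAN accepts a metric argument; a common pattern for geographic data is the haversine (great-circle) distance, with coordinates converted to radians and epsilon expressed as a fraction of the Earth’s radius. The coordinates below are made-up values for illustration:

import numpy as np
from sklearn.cluster import DBSCAN

# illustrative [latitude, longitude] pairs, converted from degrees to radians
coords = np.radians([[40.71, -74.00], [40.73, -73.99], [51.51, -0.13]])

# a 5 km neighborhood: eps is in radians, so divide by the Earth's radius (~6371 km)
db = DBSCAN(eps=5.0 / 6371.0, min_samples=2,
            metric="haversine", algorithm="ball_tree").fit(coords)
print(db.labels_)  # the two nearby points form a cluster; the far one is noise (-1)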

Parameter Selection in DBSCAN Clustering

DBSCAN is very sensitive to the values of epsilon and minPoints. Therefore, it is very important to understand how to select the values of epsilon and minPoints. A slight variation in these values can significantly change the results produced by the DBSCAN algorithm.

The value of minPoints should be at least one greater than the number of dimensions of the dataset, i.e.,

minPoints ≥ dimensions + 1

It does not make sense to take minPoints as 1 because it would result in each point being its own cluster, and with minPoints = 2 the result reduces to single-linkage hierarchical clustering. Therefore, it must be at least 3. Generally, it is set to twice the number of dimensions, but domain knowledge also decides its value.
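As a quick sketch of these heuristics (the feature matrix X below is randomly generated for illustration):

import numpy as np

X = np.random.rand(100, 2)     # illustrative 2-D feature matrix
dims = X.shape[1]              # number of feature dimensions
min_points = max(3, 2 * dims)  # at least 3; a common rule of thumb is 2 * dimensions
print(min_points)              # 4 for two-dimensional data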

The value of epsilon can be decided from the K-distance graph: the point of maximum curvature (the elbow) in this graph gives a good value for epsilon. If epsilon is chosen too small, a higher number of clusters will be created and more data points will be labeled as noise; if it is chosen too big, various small clusters will merge into one big cluster and we will lose detail.

DBSCAN implementation in Scikit-Learn

Scikit-learn has a DBSCAN module as part of its unsupervised machine learning algorithms. This algorithm can be used out of the box for fraud detection in only a few simple steps.

Step 1: Import libraries

For this demo, we need a handful of libraries for data wrangling, numerical work, visualization, and modeling.

# data wrangling
import pandas as pd

# numerical arrays (used later for the K-distance graph)
import numpy as np

# visualization
import matplotlib.pyplot as plt

# algorithm
from sklearn.cluster import DBSCAN

Step 2: Import & visualize data

I am using the famous Iris dataset from an online source, so you can practice along without worrying about where to get the data from or how to clean it up.

# import data
df = pd.read_csv("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv")
print(df.head())

Let’s choose a subset of the data for testing the algorithm and visualize it in a scatter plot.
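One way to do that, using the two sepal columns that the rest of this demo relies on:

# select two features for a 2-D demo
data = df[["sepal_length", "sepal_width"]]

# scatter plot of the raw points
plt.scatter(data["sepal_length"], data["sepal_width"])
plt.xlabel("sepal_length")
plt.ylabel("sepal_width")
plt.show()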

Scatterplot of two-dimensional data

Step 3: Modeling

Now, it’s time to implement DBSCAN and see its power. Import DBSCAN from sklearn.cluster. Let’s first run DBSCAN without any parameter optimization and see the results.

from sklearn.cluster import DBSCAN
dbscan = DBSCAN()
dbscan.fit(df[["sepal_length", "sepal_width"]])

DBSCAN(algorithm='auto', eps=0.5, leaf_size=30, metric='euclidean',
metric_params=None, min_samples=5, n_jobs=None, p=None)

Here, epsilon is 0.5, and min_samples or minPoints is 5. Let’s visualize the results from this model:

# visualize outputs
colors = dbscan.labels_
plt.scatter(data["sepal_length"], data["sepal_width"], c = colors)

Interesting! Most of the data points share the same color, which means they are treated as good (inlier) points. This is because we ran DBSCAN with its default parameters and did not optimize them. Therefore, we need to find suitable values of epsilon and minPoints and then train our model again.
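Before tuning anything, it can help to count the labels the default model produced; -1 marks noise and every other integer is a cluster id:

labels, counts = np.unique(dbscan.labels_, return_counts=True)
print(dict(zip(labels, counts)))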

For epsilon, I am using the K-distance graph. To plot it, we need the distance between each data point and its nearest neighbor, for all data points in the dataset. We obtain these distances using NearestNeighbors from sklearn.neighbors.

from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(df[["sepal_length", "sepal_width"]])
distances, indices = nbrs.kneighbors(df[["sepal_length", "sepal_width"]])

The distances variable contains, for every data point, its distance to itself (column 0, always zero) and to its nearest neighbor (column 1); the second column is what we plot.

Let’s plot our K-distance graph and find the value of epsilon. Use the following syntax:

# Plotting K-distance Graph
distances = np.sort(distances, axis=0)
distances = distances[:,1]
plt.figure(figsize=(20,10))
plt.plot(distances)
plt.title('K-distance Graph',fontsize=20)
plt.xlabel('Data Points sorted by distance',fontsize=14)
plt.ylabel('Epsilon',fontsize=14)
plt.show()

The optimum value of epsilon is at the point of maximum curvature in the K-Distance Graph, which is 0.4 in this case. Now, it’s time to find the value of minPoints. The value of minPoints also depends on domain knowledge. This time I am taking minPoints as 10.

The two most important parameters the model takes are (i) eps, the maximum distance between two points for them to be considered neighbors, i.e., how close the data points should be to one another to belong to the same cluster; and (ii) min_samples, the minimum number of neighbors a point needs in order to be treated as a core point of a cluster.

# input data
data = df[["sepal_length", "sepal_width"]]

# specify & fit model
model = DBSCAN(eps = 0.4, min_samples = 10).fit(data)

Step 4: Visualizing the results

# visualize outputs
colors = model.labels_
plt.scatter(data["sepal_length"], data["sepal_width"], c = colors)

The most amazing thing about DBSCAN is that it separates noise from the dataset pretty well. Here, the non-negative labels mark good clusters and -1 marks the noise, which shows up clearly in the plot.

Outliers detected, shown in purple.
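One common way to summarize the result numerically is to count the clusters and noise points from the labels:

labels = model.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # ignore the noise label
n_noise = list(labels).count(-1)
print("clusters:", n_clusters, "noise points:", n_noise)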

Step 5: Creating an outliers data frame

# outliers dataframe
outliers = data[model.labels_ == -1]
print(outliers)
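If you want to carry this flag back to the full dataset, one option (relying on the fact that the row order of df is unchanged) is:

# mark each row of the original dataframe as an outlier or not
df["is_outlier"] = model.labels_ == -1
print(df[df["is_outlier"]].head())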

End Notes

So in this article, I explained the DBSCAN clustering algorithm in depth and also showcased how it is useful in comparison with other clustering algorithms. Also, note that there is a more recent extension of this algorithm known as HDBSCAN, which combines hierarchical clustering with regular DBSCAN. It is often faster and more robust than plain DBSCAN.
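For reference, here is a minimal sketch using the separate hdbscan package (recent scikit-learn releases also ship a sklearn.cluster.HDBSCAN); the min_cluster_size value below is illustrative, not tuned:

import hdbscan

# -1 again marks noise; no eps parameter is required
clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
cluster_labels = clusterer.fit_predict(data)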
