How to Cluster and Detect Outliers at the Same Time

Muhammad Ryan
Dec 16, 2018 · 5 min read

In the previous story, we discussed how to cluster using k-means. The algorithm is very simple and converges quickly. The drawback is that we must set (or guess) how many clusters are in the dataset. In addition, the clusters formed can differ from run to run; in other words, the algorithm is not guaranteed to reach the global optimum.

In this story, we will explore a clustering algorithm that finds the number of clusters automatically and, depending on the dataset, can also detect outliers. The algorithm is called Density-Based Spatial Clustering of Applications with Noise (DBSCAN). What a long name for such a simple algorithm.

Let’s recall from the previous story how we determined whether a certain data point was part of a certain cluster: the data point is closer to that cluster than to any other cluster. The other characteristic is that there are a lot of other data points adjacent to it. In DBSCAN, we must define explicitly and clearly what “closeness” and “a lot of other data adjacent to that data point” mean. That is why in this algorithm we set a neighboring radius (closeness) and a minimum number of neighbors (how many other points must sit within that radius). A small sketch of this neighborhood test is shown below.
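To make the two parameters concrete, here is a minimal sketch of the neighborhood test on 2-D data; the names points, min_neighbors, and neighbors_of are illustrative and not taken from the full script later in this story:

import numpy as np

# illustrative parameter values (the full script below happens to use the same numbers)
radius = 0.015        # "closeness": maximum distance to count as a neighbor
min_neighbors = 4     # "a lot": minimum number of neighbors

def neighbors_of(points, idx, radius):
    """Return indices of all points within `radius` of points[idx] (Euclidean distance)."""
    dists = np.sqrt(np.sum((points - points[idx]) ** 2, axis=1))
    hits = np.where(dists < radius)[0]
    return hits[hits != idx]          # a point does not count as its own neighbor

points = np.random.rand(100, 2)       # toy 2-D data scaled to [0, 1]
dense_enough = len(neighbors_of(points, 0, radius)) >= min_neighbors
print(dense_enough)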

With those two definitions in place, here are the details of the algorithm (a compact sketch in code follows this list):

  1. At the start, there are no clusters.
  2. Randomly choose a data point in the dataset and measure its distance to all other data points. If the distance between the chosen point and some other point is below the radius we set, assign that other point as a neighbor of the chosen point. If the number of neighbors of the chosen point reaches the minimum number of neighbors, assign the point and its neighbors to one cluster.
  3. Repeat step 2 with each of the neighbors in place of the original point; neighbors of a neighbor join the same cluster as the previous data. Keep doing this until every detected neighbor has been visited.
  4. When all detected neighbors have been visited, construct a new cluster starting from a data point that has not been visited yet. The new cluster is formed as in steps 2 and 3.
  5. Any data point that ends up in no cluster is considered an outlier.
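Here is a minimal, self-contained sketch of those five steps, assuming 2-D data. The function and variable names are mine and differ from the full script further down, and for simplicity it scans points in index order rather than randomly; as in the full script, a label of 0 marks an outlier:

import numpy as np
from collections import deque

def dbscan(points, radius, min_neighbors):
    """Label each point with a cluster id (1, 2, ...); 0 means outlier/noise."""
    n = len(points)
    labels = np.zeros(n, dtype=int)    # step 1: no clusters yet
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0
    for seed in range(n):
        if visited[seed]:
            continue
        visited[seed] = True
        # step 2: find the seed's neighbors
        dists = np.sqrt(np.sum((points - points[seed]) ** 2, axis=1))
        hood = [j for j in np.where(dists < radius)[0] if j != seed]
        if len(hood) < min_neighbors:
            continue                   # stays labeled 0 -> outlier (step 5)
        cluster_id += 1                # step 4: start a new cluster
        labels[seed] = cluster_id
        queue = deque(hood)
        # step 3: expand the cluster through neighbors of neighbors
        while queue:
            j = queue.popleft()
            if labels[j] == 0:
                labels[j] = cluster_id
            if visited[j]:
                continue
            visited[j] = True
            dists = np.sqrt(np.sum((points - points[j]) ** 2, axis=1))
            jhood = [k for k in np.where(dists < radius)[0] if k != j]
            if len(jhood) >= min_neighbors:
                queue.extend(jhood)
    return labels, cluster_id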

To make it clearer, here is an animated illustration of the algorithm.

source: http://arogozhnikov.github.io/images/opera/post/clustering-dbscan-smiley.gif
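Incidentally, scikit-learn ships a ready-made implementation of this algorithm. A minimal sketch, assuming a plain n×2 NumPy array (the random data here is just a stand-in for a real dataset):

import numpy as np
from sklearn.cluster import DBSCAN

data = np.random.rand(300, 2)   # stand-in for a real dataset, scaled to [0, 1]
# eps is the neighbor radius, min_samples the minimum neighborhood size
# (scikit-learn counts the point itself toward min_samples, so the match with
#  a "neighbors only" threshold can be off by one)
labels = DBSCAN(eps=0.015, min_samples=4).fit_predict(data)
# labels[i] is the cluster id of point i; -1 marks an outlier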

For a better understanding of what happens under the hood, here is the code, built using NumPy instead of a fancy library like scikit-learn (where, as sketched above, you can run this algorithm in just a couple of lines).

#!/usr/bin/env python3
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm

#settings
datasrc = 'datacluster.dat'
neighbor = 4        #minimum number of neighbors
radius = 0.015      #neighborhood radius (on the normalized data)

#read data: two fixed-width columns of 8 characters each
data = np.genfromtxt(datasrc, delimiter=[8, 8])

###############
#normalization#
###############
maxcnt = np.zeros((len(data[0])))
mincnt = np.zeros((len(data[0])))
datanorm = np.zeros(np.shape(data))
for i in range(len(maxcnt)):
    maxcnt[i] = np.max(data[:, i])
    mincnt[i] = np.min(data[:, i])
    datanorm[:, i] = (data[:, i] - mincnt[i]) / (maxcnt[i] - mincnt[i])

#extra 2 columns per point: class (index 0) and current state (index 1)
#state-> 0 = not scanned, 1 = scanned, 2 = checked
#class-> 0 = class unknown (a point that keeps class 0 is an outlier)
unkdatastat = np.zeros((len(datanorm), 2))

################
#real algorithm#
################
running = True
clusternumbering = 0
while running:
    print('number of cluster', clusternumbering)
    #count how many points have already been checked (hitung = "count")
    hitung = 0
    for i in range(len(unkdatastat)):
        if unkdatastat[i, 1] == 2:
            hitung += 1
    print(hitung)
    #scan all data for a point that is already scanned but not yet checked
    #modescan = True -> keep growing the current cluster, otherwise construct a new cluster
    modescan = False
    for i in range(len(unkdatastat)):
        if unkdatastat[i, 1] == 1:
            datacheck = i
            modescan = True
            break
    if modescan:
        cntdist = []
        #assign data to cluster
        for i in range(len(datanorm)):
            #get all Euclidean distances to find neighbors
            if i == datacheck:
                continue
            dist = np.sum((datanorm[datacheck] - datanorm[i])**2.0)**0.5
            if dist < radius:
                cntdist.append(i)
        if len(cntdist) >= neighbor:
            for j in range(len(cntdist)):
                if unkdatastat[cntdist[j], 1] != 2:
                    unkdatastat[cntdist[j], 0] = clusternumbering
                    unkdatastat[cntdist[j], 1] = 1
            unkdatastat[datacheck, 0] = clusternumbering
            unkdatastat[datacheck, 1] = 2
        else:
            unkdatastat[datacheck, 1] = 2
    #construct new cluster
    else:
        cntdist = []
        running = False
        clusternumbering += 1
        #scan once again for a point that has not been visited at all
        for i in range(len(unkdatastat)):
            if unkdatastat[i, 1] == 0:
                datacheck = i
                running = True
                break
        if running:
            #assign data to cluster
            for i in range(len(datanorm)):
                #get all Euclidean distances to find neighbors
                if i == datacheck:
                    continue
                dist = np.sum((datanorm[datacheck] - datanorm[i])**2.0)**0.5
                if dist < radius:
                    cntdist.append(i)
            if len(cntdist) >= neighbor:
                for j in range(len(cntdist)):
                    if unkdatastat[cntdist[j], 1] != 2:
                        unkdatastat[cntdist[j], 0] = clusternumbering
                        unkdatastat[cntdist[j], 1] = 1
                unkdatastat[datacheck, 0] = clusternumbering
                unkdatastat[datacheck, 1] = 2
            else:
                #not enough neighbors: mark as checked, class stays 0 (outlier)
                unkdatastat[datacheck, 1] = 2
#the last increment never produced a cluster, so undo it
clusternumbering -= 1

##########
#plotting#
##########
#prepare one color per cluster (class 0, the outliers, gets the first color)
colors = cm.hsv(np.linspace(0, 1, clusternumbering + 1))
alllabelcol = []
for i in range(len(unkdatastat)):
    col = colors[int(unkdatastat[i, 0])]
    alllabelcol.append(col)
alllabelcol = np.asarray(alllabelcol)
fig = plt.figure()
ax = fig.add_subplot(111)
scatter = ax.scatter(data[:, 0], data[:, 1], color=alllabelcol, s=5)
fig.savefig('clustering/DBSCAN')  #the 'clustering' directory must exist

You can download the script above here. In the script, I do some pre-processing, namely normalization, so that every column of the data is scaled between 0 and 1. This is done to make it easier to choose the neighbor radius. For each run with the same configuration, the output of this algorithm remains the same.
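For reference, the per-column normalization loop in the script can be written as a single vectorized expression; a minimal equivalent, assuming data is the array returned by np.genfromtxt:

# vectorized min-max scaling: every column mapped to the range [0, 1]
datanorm = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))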

Here is the output of the script above with a minimum of 4 neighbors and a neighbor radius of 0.015. As we can see, there are many red dots around the clusters; these red dots are detected as outliers. Compared to k-means, the running time of DBSCAN is very long, because this implementation measures the distance from every data point to every other data point (O(n²) distance computations). That is the disadvantage of DBSCAN. See you in another story.
