# How to Cluster and Detect Outliers at the Same Time

In the previous story, we discussed how we clustered using k-means. The algorithm is very simple and converges quickly. The drawback is that we must set (or guess) the number of clusters in the dataset. In addition, the clusters formed will differ between runs, i.e., the algorithm is not guaranteed to reach the global optimum.

In this story, we will explore a clustering algorithm that finds the number of clusters automatically and can also detect outliers, depending on the dataset. This algorithm is called Density-Based Spatial Clustering of Applications with Noise (DBSCAN). What a long name for such a simple algorithm.

Let’s recall from the previous story how we determined whether a certain data point belonged to a certain cluster: that data point is closer to that cluster than to any other cluster. The other characteristic is that there are many other data points adjacent to it. In DBSCAN, we must define explicitly and clearly what “closeness” and “many adjacent data points” mean. That is why in this algorithm we set a neighborhood radius (closeness) and a minimum number of neighbors (how many other points must be adjacent). Here are the details of the algorithm:
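These two parameters can be illustrated with a tiny sketch (the points, radius, and threshold below are made-up values, not from the dataset used later):

```python
import numpy as np

# made-up example: four close points and one far-away point
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
radius = 0.2       # "closeness": the neighborhood radius
min_neighbor = 3   # "many adjacent data points": minimum neighbor count

# count the neighbors of point 0 (excluding itself)
dists = np.linalg.norm(points - points[0], axis=1)
neighbors = np.sum(dists[1:] < radius)
print(neighbors >= min_neighbor)  # prints True: point 0 is dense enough
```

Point 0 has three neighbors within the radius, so it satisfies the density condition; the far-away point would have none.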

1. At first, there are no clusters.
2. Randomly choose a data point in the dataset and measure its distance to all other points. If the distance between the chosen point and another point is below the radius we set, mark that other point as a neighbor of the chosen point. If the number of neighbors exceeds the minimum, assign the chosen point and its neighbors to one cluster.
3. Repeat step 2 with each of the neighbors in turn; neighbors of a neighbor join the same cluster as the previous point. Continue this step until every detected neighbor has been chosen.
4. When all detected neighbors have been chosen, construct a new cluster from a point that has not yet been chosen. The new cluster is formed as in steps 2 and 3.
5. Any data point that is not part of any cluster is considered an outlier.
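The steps above can be sketched as one compact function (a minimal sketch of my own, not the script used later in this story; the function and variable names are mine):

```python
import numpy as np

def dbscan(X, radius, min_neighbor):
    """Return one label per point: -1 = outlier, 0, 1, ... = cluster id."""
    n = len(X)
    labels = np.full(n, -1)            # step 1: no clusters at first
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for start in range(n):
        if visited[start]:
            continue
        visited[start] = True
        # step 2: collect the point's neighbors within the radius
        seeds = [j for j in range(n)
                 if j != start and np.linalg.norm(X[start] - X[j]) < radius]
        if len(seeds) < min_neighbor:
            continue                   # step 5: stays an outlier unless absorbed later
        labels[start] = cluster
        # step 3: expand the cluster through neighbors of neighbors
        queue = list(seeds)
        while queue:
            p = queue.pop()
            labels[p] = cluster
            if visited[p]:
                continue
            visited[p] = True
            nbrs = [j for j in range(n)
                    if j != p and np.linalg.norm(X[p] - X[j]) < radius]
            if len(nbrs) >= min_neighbor:
                queue.extend(j for j in nbrs if labels[j] == -1)
        cluster += 1                   # step 4: the next dense point seeds a new cluster
    return labels
```

Calling `dbscan` on two tight groups of points plus one isolated point yields two cluster ids and a `-1` for the isolated point.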

To make it clearer, here is an animated illustration of the algorithm.

For a better understanding, here is the code built with NumPy instead of a fancy library like scikit-learn (where you can run this algorithm in just two lines).

```python
#!/usr/bin/python
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm

# settings
datasrc = 'datacluster.dat'
neighbor = 4
radius = 0.015

# read data
data = np.genfromtxt(datasrc, delimiter=[8, 8])

###############
# normalization
###############
ncols = data.shape[1]
maxcnt = np.zeros(ncols)
mincnt = np.zeros(ncols)
datanorm = np.zeros(np.shape(data))
for i in range(ncols):
    maxcnt[i] = np.max(data[:, i])
    mincnt[i] = np.min(data[:, i])
    datanorm[:, i] = (data[:, i] - mincnt[i]) / (maxcnt[i] - mincnt[i])

# extra 2 columns for: class (index 0) and current state (index 1)
# state-> 0 = not scanned, 1 = scanned, 2 = checked
# class-> 0 = class unknown
unkdatastat = np.zeros((len(datanorm), 2))

################
# real algorithm
################
running = True
clusternumbering = 0
while running:
    print('number of cluster', clusternumbering)
    hitung = 0
    for i in range(len(unkdatastat)):
        if unkdatastat[i, 1] == 2:
            hitung += 1
    print(hitung)
    # scan all data, check whether each point is already scanned or not
    # modescan = True: keep clustering with already known data, otherwise construct a new cluster
    modescan = False
    for i in range(len(unkdatastat)):
        if unkdatastat[i, 1] == 1:
            datacheck = i
            modescan = True
            break
    if modescan:
        cntdist = []  # assign data to cluster
        for i in range(len(datanorm)):
            # get all euclidean distances to check for neighbors
            if i == datacheck:
                continue
            dist = np.sum((datanorm[datacheck] - datanorm[i]) ** 2.0) ** 0.5
            if dist < radius:
                cntdist.append(i)
        if len(cntdist) >= neighbor:
            for j in range(len(cntdist)):
                if unkdatastat[int(cntdist[j]), 1] != 2:
                    unkdatastat[int(cntdist[j]), 0] = clusternumbering
                    unkdatastat[int(cntdist[j]), 1] = 1
            unkdatastat[datacheck, 0] = clusternumbering
            unkdatastat[datacheck, 1] = 2
        else:
            unkdatastat[datacheck, 1] = 2
    # construct a new cluster
    else:
        cntdist = []
        running = False
        clusternumbering += 1
        # scan once again
        for i in range(len(unkdatastat)):
            if unkdatastat[i, 1] == 0:
                datacheck = i
                running = True
                break
        if running:
            # assign data to cluster
            for i in range(len(datanorm)):
                # get all euclidean distances to check for neighbors
                if i == datacheck:
                    continue
                dist = np.sum((datanorm[datacheck] - datanorm[i]) ** 2.0) ** 0.5
                if dist < radius:
                    cntdist.append(i)
            if len(cntdist) >= neighbor:
                for j in range(len(cntdist)):
                    if unkdatastat[int(cntdist[j]), 1] != 2:
                        unkdatastat[int(cntdist[j]), 0] = clusternumbering
                        unkdatastat[int(cntdist[j]), 1] = 1
                unkdatastat[datacheck, 0] = clusternumbering
                unkdatastat[datacheck, 1] = 2
            else:
                unkdatastat[datacheck, 1] = 2
                clusternumbering -= 1

##########
# plotting
##########
# prepare colors
colors = cm.hsv(np.linspace(0, 1, clusternumbering + 1))
# plot the result of the clustering
alllabelcol = []
for i in range(len(unkdatastat)):
    col = colors[int(unkdatastat[i, 0])]
    alllabelcol.append(col)
alllabelcol = np.asarray(alllabelcol)
fig = plt.figure()
ax = fig.add_subplot(111)
scatter = ax.scatter(data[:, 0], data[:, 1], color=alllabelcol, s=5)
fig.savefig('clustering/DBSCAN')
```

You can download the script above, here. In the script, I do some pre-processing, namely normalization, so that the data is scaled between 0 and 1. This makes it easier to choose the neighborhood radius. For each run with the same configuration, the output of this algorithm stays the same. Here is the output of the script above with a minimum neighbor count of 4 and a neighborhood radius of 0.015.

As we can see, there are many red dots around the clusters; these red dots are detected as outliers. Compared with k-means, the running time of DBSCAN is much longer because, for every point, this naive implementation computes the distance to every other point in the dataset (O(n²) overall). That is the disadvantage of DBSCAN. See you in another story.
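As mentioned earlier, scikit-learn offers the same algorithm in a couple of lines. A minimal sketch (the data here is made up; `eps` and `min_samples` correspond to the neighborhood radius and minimum neighbor count above):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# two tight, made-up blobs plus three hand-placed outliers
blobs = np.vstack([rng.normal(0.0, 0.05, (50, 2)),
                   rng.normal(1.0, 0.05, (50, 2))])
outliers = np.array([[-2.0, -2.0], [3.0, 3.0], [-2.0, 3.0]])
X = np.vstack([blobs, outliers])

# eps is the neighborhood radius, min_samples the minimum neighbor count
labels = DBSCAN(eps=0.2, min_samples=4).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
print('clusters:', n_clusters, 'outliers:', int(np.sum(labels == -1)))
```

Points that scikit-learn labels `-1` are the noise points, i.e., the outliers the manual script paints red.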

