# How to Cluster and Detect Outliers at the Same Time

In the previous story, we discussed how to cluster using k-means. The algorithm is very simple and converges quickly. The drawback is that we must set (or guess) how many clusters are in the dataset. In addition, the clusters formed will differ from run to run, i.e. the algorithm is **not guaranteed to reach the global optimum**.

In this story, we will explore a clustering algorithm that finds the number of clusters automatically and can also detect outliers, depending on the dataset. This algorithm is called **Density-Based Spatial Clustering of Applications with Noise** (**DBSCAN**). What a long name for such a simple algorithm.

Let’s recall from the previous story how we determined whether a certain data point belonged to a certain cluster: the point is **closer to that cluster** than to any other cluster. The other characteristic is that there are **a lot of other data points adjacent to it**. In DBSCAN, we must define explicitly and clearly what “closeness” and “a lot of other data points adjacent to it” mean. That is why in this algorithm we set a neighboring radius (closeness) and a minimum number of neighbors (how many other points must be adjacent). Here are the details of the algorithm:

- At first, there is no cluster.
- Randomly choose a data point in the dataset and measure its distance to all other points. If the distance between the chosen point and another point is below the radius we set, assign that other point as a neighbor of the chosen point. If the number of neighbors exceeds the minimum number of neighbors, assign the point and its neighbors to one cluster.
- Repeat step 2 with each of the neighbors in place of the original point. **Neighbors of a neighbor** belong to the same cluster as the **previous point**. Repeat this step until every detected neighbor has been processed.
- When all detected neighbors have been processed, construct a new cluster starting from a point that has not been chosen yet. The new cluster is grown as in steps 2 and 3.
- Data points that are not part of any cluster are considered outliers.
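The core of steps 2 and 3 can be sketched in a few lines of NumPy. The toy dataset and parameter values below are hypothetical, chosen only to illustrate how a cluster grows from a seed point while an isolated point stays unassigned:

```python
import numpy as np

# Toy 2-D dataset (hypothetical values): two dense groups plus one isolated point.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],   # dense group A
    [1.0, 1.0], [1.1, 1.0], [1.0, 1.1], [1.1, 1.1],   # dense group B
    [5.0, 5.0],                                        # isolated point
])

radius = 0.5        # "closeness"
min_neighbors = 3   # "a lot of other data adjacent"

def neighbors_of(i):
    """Indices of all points within `radius` of point i (excluding i itself)."""
    dist = np.sqrt(np.sum((points - points[i]) ** 2, axis=1))
    return [j for j in range(len(points)) if j != i and dist[j] < radius]

# Step 2: point 0 has enough neighbors, so it seeds a cluster.
seed_neighbors = neighbors_of(0)
print(len(seed_neighbors) >= min_neighbors)   # True

# Step 3: neighbors of a neighbor join the same cluster, until no neighbor is left.
cluster = {0, *seed_neighbors}
frontier = list(seed_neighbors)
while frontier:
    j = frontier.pop()
    nb = neighbors_of(j)
    if len(nb) >= min_neighbors:      # j has enough neighbors, expand through it
        for k in nb:
            if k not in cluster:
                cluster.add(k)
                frontier.append(k)

print(sorted(cluster))   # [0, 1, 2, 3] -- group A only
# The isolated point (index 8) never joins any cluster: it is an outlier.
```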

To make it clearer, here is an animated illustration of the algorithm.

For a better understanding, here is the code built using NumPy instead of a fancy library like sklearn (where you can run this algorithm with just two lines).

```python
#!/usr/bin/python
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm

# settings
datasrc = 'datacluster.dat'
neighbor = 4        # minimum number of neighbors
radius = 0.015      # neighboring radius

# read data (two fixed-width columns of 8 characters each)
data = np.genfromtxt(datasrc, delimiter=[8, 8])

#################
# normalization #
#################
maxcnt = np.zeros(len(data[0]))
mincnt = np.zeros(len(data[0]))
datanorm = np.zeros(np.shape(data))
for i in range(len(maxcnt)):
    maxcnt[i] = np.max(data[:, i])
    mincnt[i] = np.min(data[:, i])
    datanorm[:, i] = (data[:, i] - mincnt[i]) / (maxcnt[i] - mincnt[i])

# extra 2 columns per point: class (index 0) and current state (index 1)
# state -> 0 = not scanned, 1 = scanned, 2 = checked
# class -> 0 = class unknown (outlier)
unkdatastat = np.zeros((len(datanorm), 2))

##################
# real algorithm #
##################
running = True
clusternumbering = 0
while running:
    print('number of clusters', clusternumbering)
    hitung = 0  # count how many points are fully checked
    for i in range(len(unkdatastat)):
        if unkdatastat[i, 1] == 2:
            hitung += 1
    print(hitung)
    # scan all data, check whether any point is already scanned
    # modescan = True: grow the current cluster, otherwise construct a new one
    modescan = False
    for i in range(len(unkdatastat)):
        if unkdatastat[i, 1] == 1:
            datacheck = i
            modescan = True
            break
    if modescan:
        # assign data to the current cluster
        cntdist = []
        for i in range(len(datanorm)):
            # Euclidean distance to every other point to find neighbors
            if i == datacheck:
                continue
            dist = np.sum((datanorm[datacheck] - datanorm[i]) ** 2.0) ** 0.5
            if dist < radius:
                cntdist.append(i)
        if len(cntdist) >= neighbor:
            for j in range(len(cntdist)):
                if unkdatastat[cntdist[j], 1] != 2:
                    unkdatastat[cntdist[j], 0] = clusternumbering
                    unkdatastat[cntdist[j], 1] = 1
            unkdatastat[datacheck, 0] = clusternumbering
        unkdatastat[datacheck, 1] = 2
    else:
        # construct a new cluster
        cntdist = []
        running = False
        clusternumbering += 1
        # scan once again for a point that has not been chosen yet
        for i in range(len(unkdatastat)):
            if unkdatastat[i, 1] == 0:
                datacheck = i
                running = True
                break
        if running:
            # assign data to the new cluster
            for i in range(len(datanorm)):
                # Euclidean distance to every other point to find neighbors
                if i == datacheck:
                    continue
                dist = np.sum((datanorm[datacheck] - datanorm[i]) ** 2.0) ** 0.5
                if dist < radius:
                    cntdist.append(i)
            if len(cntdist) >= neighbor:
                for j in range(len(cntdist)):
                    if unkdatastat[cntdist[j], 1] != 2:
                        unkdatastat[cntdist[j], 0] = clusternumbering
                        unkdatastat[cntdist[j], 1] = 1
                unkdatastat[datacheck, 0] = clusternumbering
            unkdatastat[datacheck, 1] = 2

clusternumbering -= 1

############
# plotting #
############
# prepare one color per cluster (cluster 0 = outliers)
colors = cm.hsv(np.linspace(0, 1, clusternumbering + 1))
alllabelcol = []
for i in range(len(unkdatastat)):
    alllabelcol.append(colors[int(unkdatastat[i, 0])])
alllabelcol = np.asarray(alllabelcol)

fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(data[:, 0], data[:, 1], color=alllabelcol, s=5)
fig.savefig('clustering/DBSCAN')
```

You can download the script above here. In the script, I do some pre-processing, namely normalization, to bring the scale of the data between 0 and 1. This is done so that it is easier to determine the neighboring radius. For each run with the same configuration, the output of this algorithm stays the same. Here is the output of the script above with a minimum of 4 neighbors and a neighboring radius of 0.015.
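As an aside, the per-column normalization loop in the script can be written as a single broadcast expression. A minimal sketch, using a small hypothetical array with very different scales per column:

```python
import numpy as np

# Hypothetical data: two columns on very different scales.
data = np.array([[1.0, 100.0],
                 [2.0, 300.0],
                 [3.0, 500.0]])

# Min-max normalization, column by column, via NumPy broadcasting:
mincnt = data.min(axis=0)                        # per-column minimum
maxcnt = data.max(axis=0)                        # per-column maximum
datanorm = (data - mincnt) / (maxcnt - mincnt)   # each column now spans [0, 1]

print(datanorm)
```

Because each column now runs from 0 to 1, a single radius value works for every axis.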

As we can see, there are many red dots around the clusters. These red dots are detected as **outliers**. Compared to k-means, the **running time of DBSCAN is very long**, because for every point DBSCAN measures the distance to every other point in the dataset. That’s the disadvantage of DBSCAN. See you in another story.
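To see where that cost comes from, note that every point is compared against every other point, so the number of distance computations grows with n². One common mitigation (not used in the script above) is to precompute the full pairwise distance matrix once and turn each neighbor query into a row lookup. A sketch on hypothetical random data:

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((200, 2))   # hypothetical dataset of n = 200 points
radius = 0.05

# The distance matrix has n*n entries: this is where the quadratic cost lives.
diff = points[:, None, :] - points[None, :, :]   # shape (n, n, 2) via broadcasting
distmat = np.sqrt((diff ** 2).sum(axis=-1))      # shape (n, n)

# A neighbor query is now just a row lookup instead of a fresh distance loop:
idx = np.arange(len(points))
neighbors_of_0 = np.where((distmat[0] < radius) & (idx != 0))[0]
print(distmat.shape)   # (200, 200)
```

The memory cost is also quadratic, so for large datasets spatial indexes (e.g. k-d trees) are the usual remedy rather than a full matrix.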