Clustering with Javascript — Part 3: Clustering Algorithms in Practice

Prof. João Gabriel Lima
7 min read · Nov 26, 2017


This is the third part of a series of three articles, with commented code, on data clustering using Javascript.

Clustering Algorithms

Clustering algorithms can be classified as hierarchical or partitional (the two most traditional methods), or as model-based, grid-based, or density-based (the more modern methods). The choice of method depends on the type of variables and the purpose of the application.

Partition Algorithms

Partition algorithms search directly for a division of the n elements into k clusters.

The best-known method is k-means, which classifies the information according to the data itself. The algorithm is based on analyzing and comparing the numerical values of the data. As a result, it produces a classification automatically, without the need for an existing pre-classification.

The k-means algorithm is quite scalable and reliable, but it has some peculiarities. The two main ones are:

1. Variables must be numeric or binary. When we have categorical data, an alternative is to convert it to numerical values. Some variations of the algorithm have been adapted to work with non-numeric data, extending its application to a wider range of problems.

2. It is sensitive to outliers. A single record with a very extreme value can substantially change the data distribution, so it is advisable to treat the data beforehand to avoid degrading the clustering.
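Both peculiarities can be handled before clustering. The sketch below is only an illustration in plain Javascript: oneHot and removeOutliers are hypothetical helpers (they are not part of any library used in this article), and the 1.5 × IQR rule is just one common convention for flagging extreme values.

```javascript
// Hypothetical helper: turn a categorical column into binary (one-hot) columns.
function oneHot(values) {
  const categories = Array.from(new Set(values)).sort();
  return values.map(function (v) {
    return categories.map(function (c) { return v === c ? 1 : 0; });
  });
}

// Quantile of an already sorted array, using linear interpolation.
function quantileSorted(sorted, p) {
  const pos = (sorted.length - 1) * p;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Hypothetical helper: keep only the values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
function removeOutliers(values) {
  const sorted = values.slice().sort(function (a, b) { return a - b; });
  const q1 = quantileSorted(sorted, 0.25);
  const q3 = quantileSorted(sorted, 0.75);
  const iqr = q3 - q1;
  return values.filter(function (v) {
    return v >= q1 - 1.5 * iqr && v <= q3 + 1.5 * iqr;
  });
}

const encoded = oneHot(['red', 'green', 'red']);
const cleaned = removeOutliers([1, 2, 3, 4, 5, 100]);
console.log(encoded); // one binary column per category (green, red)
console.log(cleaned); // the extreme value 100 is dropped
```

Whether to drop, cap, or keep extreme records depends on the data set, so treat the filter above as a starting point rather than a rule.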

In summary, the k-means algorithm groups the data through the following steps:

  1. Randomly select the initial centroids;
  2. Assign each object (row of the data set) to the group whose centroid is most similar to the object, that is, the closest centroid;
  3. Recalculate the centroid of each group as the average of the objects in the group;

Repeat steps 2 and 3 until the groups stabilize or a stop condition set in the algorithm is reached.
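The steps above can be sketched in a few lines of plain Javascript. This is a minimal illustration, not the implementation used by the shaman library: kmeansSketch and its helpers are hypothetical names, similarity is assumed to be Euclidean distance, and the optional initialCentroids argument exists only so that the example is reproducible.

```javascript
// Squared Euclidean distance between two points of equal length.
function squaredDistance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += (a[i] - b[i]) * (a[i] - b[i]);
  }
  return sum;
}

// Index of the centroid closest to the given point.
function nearest(point, centroids) {
  let best = 0;
  for (let i = 1; i < centroids.length; i++) {
    if (squaredDistance(point, centroids[i]) < squaredDistance(point, centroids[best])) {
      best = i;
    }
  }
  return best;
}

// Component-wise mean of a non-empty list of points.
function meanPoint(points) {
  return points[0].map(function (_, d) {
    return points.reduce(function (s, p) { return s + p[d]; }, 0) / points.length;
  });
}

// Step 1: pick k distinct objects at random as initial centroids.
function pickRandom(data, k) {
  const pool = data.slice();
  const chosen = [];
  for (let i = 0; i < k; i++) {
    const j = Math.floor(Math.random() * pool.length);
    chosen.push(pool.splice(j, 1)[0]);
  }
  return chosen;
}

function kmeansSketch(data, k, maxIterations, initialCentroids) {
  let centroids = initialCentroids || pickRandom(data, k);
  let assignments = [];
  for (let iter = 0; iter < maxIterations; iter++) {
    // Step 2: assign each object to its closest centroid.
    const next = data.map(function (p) { return nearest(p, centroids); });
    const stable = next.every(function (a, i) { return a === assignments[i]; });
    assignments = next;
    if (stable) break; // the groups did not change, so we stop
    // Step 3: recompute each centroid as the mean of its members.
    centroids = centroids.map(function (old, c) {
      const members = data.filter(function (_, i) { return assignments[i] === c; });
      return members.length > 0 ? meanPoint(members) : old;
    });
  }
  return { assignments: assignments, centroids: centroids };
}

const points = [[1, 1], [1, 2], [9, 9], [10, 9]];
const result = kmeansSketch(points, 2, 100, [[1, 1], [9, 9]]);
console.log(result.assignments); // cluster index of each point
console.log(result.centroids);   // final centroid of each cluster
```

The maxIterations argument plays the role of the stop condition: even if the groups never stabilize, the loop ends after that many passes.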

In our examples we will use the classic Iris dataset (https://archive.ics.uci.edu), and the main libraries will be:

  1. simple-statistics: https://www.npmjs.com/package/simple-statistics
  2. shaman: https://www.npmjs.com/package/shaman
  3. data-forge: https://www.npmjs.com/package/data-forge

We will use the Plotly library to create the graphs. The access credentials are stored in a file called config.js, which contains two variables, USERNAME and API_KEY, holding the user credentials for the Plotly API.
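As an illustration, config.js could look like the sketch below. The exact layout of the file is an assumption; the only requirement that follows from the code in this article is that require('./config') exposes USERNAME and API_KEY. The values are placeholders.

```javascript
// config.js (hypothetical contents; replace the placeholders with your own credentials)
module.exports = {
  USERNAME: 'your-plotly-username',
  API_KEY: 'your-plotly-api-key'
};
```

Since this file holds credentials, it is usually a good idea to keep it out of version control.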

/**
 * Import Modules
 */
const stats = require('simple-statistics');
const KMeans = require('shaman').KMeans;
const dataForge = require('data-forge');
const config = require('./config');
const plotly = require('plotly')(config.USERNAME, config.API_KEY);
const opn = require('opn');

To load the data, simply call the readFileSync function of data-forge.

// read the base file and store the dataframe in the df variable
const df = dataForge.readFileSync('../datasets/iris.csv').parseCSV();

When the data is read from a .csv file, the values may not come in the expected format, so it is highly recommended to treat them by converting each one to float (for this example).

To do this, we take a subset containing only the columns that will be used in the clustering process. With the select function we iterate over each record and use the standard parseFloat() function to convert the value of each column, as shown below:

// create a subset and convert all values to float
const subset = df.subset(["sepallength", "sepalwidth", "petallength", "petalwidth"]).select(function (row) {
  return {
    sepallength: parseFloat(row.sepallength),
    sepalwidth: parseFloat(row.sepalwidth),
    petallength: parseFloat(row.petallength),
    petalwidth: parseFloat(row.petalwidth)
  };
});
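One caveat of parseFloat() is that it returns NaN for empty or malformed cells, and a NaN would silently distort the distance calculations. The sketch below shows one possible way to drop incomplete records, using plain objects instead of a data-forge dataframe. The sample rows and the helper names toNumericRow and isComplete are made up for illustration.

```javascript
// Hypothetical raw rows, as strings, the way a CSV parser would return them.
const rawRows = [
  { sepallength: '5.1', sepalwidth: '3.5', petallength: '1.4', petalwidth: '0.2' },
  { sepallength: '4.9', sepalwidth: '', petallength: '1.4', petalwidth: '0.2' },
  { sepallength: '6.3', sepalwidth: '3.3', petallength: '6.0', petalwidth: '2.5' }
];
const columns = ['sepallength', 'sepalwidth', 'petallength', 'petalwidth'];

// Convert every selected column of a row to float.
function toNumericRow(row) {
  const out = {};
  columns.forEach(function (name) { out[name] = parseFloat(row[name]); });
  return out;
}

// A row is complete when none of its converted values is NaN.
function isComplete(row) {
  return columns.every(function (name) { return !Number.isNaN(row[name]); });
}

const numericRows = rawRows.map(toNumericRow).filter(isComplete);
console.log(numericRows.length + ' of ' + rawRows.length + ' rows kept');
```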

It is essential to know the properties of our data. For this, we will implement a summary function that returns the primary statistics of a data series, as shown below:

/**
 * The summary function returns an overview of a data series
 *
 * @param column array of numbers
 */
function summary(column) {
  return {
    min: stats.min(column),
    max: stats.max(column),
    sum: stats.sum(column),
    median: stats.median(column),
    mode: stats.mode(column),
    mean: stats.mean(column),
    variance: stats.variance(column),
    stdDeviation: stats.standardDeviation(column),
    quantile: {
      q1: stats.quantile(column, 0.25),
      q3: stats.quantile(column, 0.75)
    }
  };
}
// invoke the summary function and show the result for each series
console.log('sepallength');
console.log(summary(subset.getSeries('sepallength').toArray()));
console.log('sepalwidth');
console.log(summary(subset.getSeries('sepalwidth').toArray()));
console.log('petallength');
console.log(summary(subset.getSeries('petallength').toArray()));
console.log('petalwidth');
console.log(summary(subset.getSeries('petalwidth').toArray()));
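To make some of these fields concrete, here is a dependency-free sketch of the mean, variance and standard deviation for a tiny made-up series. It assumes the population definitions (dividing by n), which, as far as I understand the simple-statistics documentation, is what its variance and standardDeviation functions compute; the function names below are my own.

```javascript
// Arithmetic mean of a non-empty array of numbers.
function mean(values) {
  return values.reduce(function (s, v) { return s + v; }, 0) / values.length;
}

// Population variance: mean of the squared deviations from the mean.
function populationVariance(values) {
  const m = mean(values);
  return mean(values.map(function (v) { return (v - m) * (v - m); }));
}

// Population standard deviation: square root of the variance.
function populationStdDeviation(values) {
  return Math.sqrt(populationVariance(values));
}

const series = [2, 4, 4, 4, 5, 5, 7, 9]; // made-up values
console.log(mean(series));                   // 5
console.log(populationVariance(series));     // 4
console.log(populationStdDeviation(series)); // 2
```

Comparing these numbers across columns is a quick way to spot variables on very different scales, which matters for a distance-based method such as k-means.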

At this point, with the data ready, we will implement our clustering by instantiating a KMeans object. Notice that we pass the numeric parameter 3: this is our K, meaning that we want the algorithm to group our data into three clusters. Next, we invoke the cluster function, passing our dataset and a callback.

If everything runs correctly, the callback receives the variables clusters and centroids, populated with the data resulting from the clustering process.

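// build clustering model
const kmeans = new KMeans(3);
// execute clustering using dataset
kmeans.cluster(subset.toRows(), function (err, clusters, centroids) {
  // show any errors (null when the clustering succeeds)
  console.log(err);
  // show the clusters found
  console.log(clusters);
  // show the centroids
  console.log(centroids);
  // the callback stays open here: the plotting code in the next sections
  // runs inside it, and it is closed after the call to plotly.plot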

The results stored in the clusters and centroids variables are:

  • clusters: Array<Array<Array<number>>>, where, from the innermost to the outermost level, the arrays represent the values of an object, the cluster to which the object belongs, and the set of clusters found.
  • centroids: Array<Array<number>>, where the outer array has one entry per cluster (k entries) and each inner array holds one value per variable used in the clustering.
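A small hand-written example may help to picture these two structures. The values below are made up (k = 2, two variables per object) and are not output from the library; the sketch also assumes that clusters[i] and centroids[i] refer to the same group.

```javascript
// Made-up values shaped like the callback arguments (k = 2, two variables).
const exampleClusters = [
  [[5.1, 1.4], [4.9, 1.4]], // cluster 0: two objects
  [[6.3, 6.0]]              // cluster 1: one object
];
const exampleCentroids = [
  [5.0, 1.4], // centroid of cluster 0
  [6.3, 6.0]  // centroid of cluster 1
];

// Number of objects in each cluster.
const sizes = exampleClusters.map(function (cluster) { return cluster.length; });

exampleClusters.forEach(function (cluster, i) {
  console.log('Cluster ' + i + ': ' + cluster.length + ' objects, centroid ' +
    JSON.stringify(exampleCentroids[i]));
});
```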

We can generate graphs showing how the groups were formed. For this, we will use the Plotly library.

The values of each object follow the column order of the dataset, so we will use a dictionary (key-value) to map each column name to its index.

// auxiliary dictionary mapping each column name to its index
const indexes = {
  sepallength: 0,
  sepalwidth: 1,
  petallength: 2,
  petalwidth: 3
};

First, let's map the data to plot the centroids in the graph. For the x-axis, we use the map function to go through every object in the centroids array and read the value at the index of the sepallength column. In the same way, the y-axis uses the values of the centroid objects at the index of the petallength column.

Note that if you need to create graphs with other data, you simply change the mapping and choose other pairs of variables.

For this graph we chose the scatter type, and we set the marker attribute only to improve the visualization: with this setting the centroids appear as black crosses on the graph.

// build centroids graph model
const centroidTrace = {
  x: centroids.map(function (c) {
    return c[indexes["sepallength"]]; // 0: sepallength
  }),
  y: centroids.map(function (c) {
    return c[indexes["petallength"]]; // 2: petallength
  }),
  mode: 'markers',
  type: 'scatter',
  name: 'Centroids',
  marker: {
    color: '#000000',
    symbol: 'cross',
    size: 10
  }
};
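Since changing the mapping is just a matter of choosing other column names, the trace construction can be wrapped in a small function. The sketch below is one possible way to do it; buildTrace, columnIndex and the sample centroid values are hypothetical and exist only for this illustration.

```javascript
// Same idea as the indexes dictionary: column name -> position in each object.
const columnIndex = { sepallength: 0, sepalwidth: 1, petallength: 2, petalwidth: 3 };

// Hypothetical helper: build a 2D scatter trace from any pair of columns.
function buildTrace(points, xName, yName, traceName) {
  return {
    x: points.map(function (p) { return p[columnIndex[xName]]; }),
    y: points.map(function (p) { return p[columnIndex[yName]]; }),
    mode: 'markers',
    type: 'scatter',
    name: traceName
  };
}

// Made-up centroids with four variables each.
const sampleCentroids = [[5.0, 3.4, 1.5, 0.2], [6.6, 3.0, 5.6, 2.0]];
const widthTrace = buildTrace(sampleCentroids, 'sepalwidth', 'petalwidth', 'Centroids');
console.log(widthTrace);
```

With a helper like this, plotting sepalwidth against petalwidth instead of sepallength against petallength requires changing only two arguments.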

The data used to plot the chart will be stored in the plotData array. Next, we will use the data from the clusters to create the data points in the chart.

We will iterate over the array of clusters and, for each cluster, select the attributes in the same way we did for the centroids. Finally, the trace of each cluster is also stored in the plotData array.

// add the centroids data to the plotData array
const plotData = [centroidTrace];
// build clusters graph model
clusters.forEach(function (cluster, index) {
  const trace = {
    x: cluster.map(function (c) {
      return c[indexes["sepallength"]];
    }),
    y: cluster.map(function (c) {
      return c[indexes["petallength"]];
    }),
    jitter: 0.3,
    mode: 'markers',
    type: 'scatter',
    name: 'Cluster ' + index
  };
  // add the cluster graph data to plotData
  plotData.push(trace);
});

Now that we have the graph data, let's define the layout settings: the title and the X and Y axis labels. The graph options then set the layout, the file name, and an option called fileopt with the value overwrite, so that each time you run the code the chart is overwritten if it already exists.

// set plotly graph layout
const layout = {
  title: 'Iris Clustering',
  xaxis: {
    title: 'sepallength'
  },
  yaxis: {
    title: 'petallength'
  }
};
// set graph options
var graphOptions = {
  layout: layout, // set layout
  filename: 'Iris-clustering', // set filename
  fileopt: 'overwrite' // if it exists, just overwrite
};

Finally, the plot function is invoked, passing the data, the options and the callback function. The callback receives two parameters: the first is an error object, which will be null if the graph is generated successfully, and the second contains basic information about the graph, such as its name and the link where it can be accessed. To open the graph automatically, we use the opn library, which opens the URL in the system default browser.

/**
 * Build Plotly graph
 */
plotly.plot(plotData, graphOptions, function (err, msg) {
  if (err) {
    // report the failure instead of finishing silently
    console.error(err);
    return;
  }
  // if built without errors, show the message and open the browser with the graph
  console.log(`Success! The plot ${msg.filename} can be found at ${msg.url}`);
  opn(msg.url);
  process.exit();
});
}); // closes the kmeans.cluster callback opened earlier

The result of this implementation is shown in the chart below. Each cluster appears in a different color, and the chart shows how the clustering process grouped objects with high similarity.

Optionally, we can visualize this clustering in a three-dimensional chart. The construction is similar: we change the type to scatter3d and add a third variable, Z.

In our implementation the distribution of the variables will be x = sepallength, y = sepalwidth and z = petallength, for the centroids as well as for the clusters, as shown below.

// build centroids graph model (3D variant, replaces the 2D definition)
const centroidTrace = {
  x: centroids.map(function (c) {
    return c[indexes["sepallength"]]; // 0: sepallength
  }),
  y: centroids.map(function (c) {
    return c[indexes["sepalwidth"]]; // 1: sepalwidth
  }),
  z: centroids.map(function (c) {
    return c[indexes["petallength"]]; // 2: petallength
  }),
  mode: 'markers',
  type: 'scatter3d',
  name: 'Centroids',
  marker: {
    color: '#000000',
    symbol: 'cross',
    size: 20
  }
};
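The cluster traces need the same change. A possible sketch is shown below: buildClusterTraces3d is a hypothetical helper, the sample clusters are made-up values, and the marker size is an arbitrary choice.

```javascript
// Hypothetical helper: one scatter3d trace per cluster,
// with x = sepallength (0), y = sepalwidth (1), z = petallength (2).
function buildClusterTraces3d(clusters) {
  return clusters.map(function (cluster, index) {
    return {
      x: cluster.map(function (p) { return p[0]; }),
      y: cluster.map(function (p) { return p[1]; }),
      z: cluster.map(function (p) { return p[2]; }),
      mode: 'markers',
      type: 'scatter3d',
      name: 'Cluster ' + index,
      marker: { size: 4 } // arbitrary size, smaller than the centroids
    };
  });
}

// Made-up clusters with four variables per object.
const sampleClusters = [
  [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]],
  [[6.3, 3.3, 6.0, 2.5]]
];
const traces3d = buildClusterTraces3d(sampleClusters);
console.log(traces3d.length + ' traces built');
```

These traces can then be stored in the plotData array together with the 3D centroid trace, exactly as in the two-dimensional version.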

The result of this process is a visualization of the clustering in three dimensions, which widens the possibilities for analyzing the groups generated and how they form according to an additional variable on the Z axis.

Clustering result shown in three dimensions

Note that clustering is not a complicated process. However, we have to be aware of the various configurations that can make the process much more efficient.

The primary objective of this series of articles was to show the possibilities of working with Javascript in an application domain different from the usual one.

If you liked it, let's exchange ideas: leave your comment and add me on your social networks, Twitter and LinkedIn.
