Data Clustering with Javascript — Part 2: Standardization and Distance Measurement

Prof. João Gabriel Lima
Nov 26, 2017

This is the second part of a three-article series, with commented code, on data clustering with JavaScript.

Standardizing Interval Variables

Before we apply any metrics to the data, it is important to ensure that they are standardized.

This step is fundamental and applies to interval variables, which are measured on an approximately linear scale, such as weight, height, temperature, and latitude and longitude coordinates.

The most significant problem with this type of variable is that each one has its own unit. For example, weight is expressed in kilograms, height in meters and temperature in degrees.

Variables with different units affect our mathematical models because they are on different scales: the magnitude of each variable changes how much it contributes to the degree of similarity.
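To make the effect of units concrete, here is a minimal sketch with two made-up people described as [height, weight]. It uses the Euclidean distance, which is covered later in this article. The names and values are assumptions chosen only for illustration.

```javascript
// Euclidean distance between two points with the same number of dimensions.
const euclidean = (p, q) =>
  Math.sqrt(p.reduce((sum, value, i) => sum + (value - q[i]) ** 2, 0));

// Hypothetical people: height in meters, weight in kilograms.
const anaMeters = [1.60, 70];
const bobMeters = [1.90, 71];

// The same people, with height converted to centimeters.
const anaCm = [160, 70];
const bobCm = [190, 71];

const distMeters = euclidean(anaMeters, bobMeters);
const distCm = euclidean(anaCm, bobCm);

console.log(distMeters.toFixed(3)); // 1.044 (the weight difference dominates)
console.log(distCm.toFixed(3));     // 30.017 (the height difference dominates)
```

Nothing about the two people changed, only the unit of one variable, yet the distance between them is very different.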

These variables need to be standardized and for this, we use the z-score, which has its formula given by:

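z-score formula: z = (x - m) / s, where x is a value of the variable, m is the mean of the variable and s is its mean absolute deviation.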

The mean is subtracted from each value of the studied variable, and the result is divided by the mean absolute deviation. It is important not to confuse this with the standard deviation: the mean absolute deviation is less sensitive to outliers than the standard deviation.
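The calculation described above can be sketched in a few lines of plain JavaScript. This is only an illustration of the formula as described in the text, not the code of any library, and the function names are hypothetical.

```javascript
// Arithmetic mean of an array of numbers.
const mean = (values) => values.reduce((sum, v) => sum + v, 0) / values.length;

// Mean absolute deviation: the average distance of each value from the mean.
const meanAbsoluteDeviation = (values) => {
  const m = mean(values);
  return mean(values.map((v) => Math.abs(v - m)));
};

// Standardize: subtract the mean, then divide by the mean absolute deviation.
// Note: if every value is identical the deviation is 0 and the division fails.
const standardize = (values) => {
  const m = mean(values);
  const s = meanAbsoluteDeviation(values);
  return values.map((v) => (v - m) / s);
};

const standardized = standardize([2, 4, 6, 8]);
console.log(standardized); // [ -1.5, -0.5, 0.5, 1.5 ]
```

Here the mean is 5 and the mean absolute deviation is 2, so the value 2 becomes (2 - 5) / 2 = -1.5.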

In JavaScript we have the zscore library (https://github.com/seracio/zscore). To use it, install it inside the project as shown below:

npm i zscore --save

To use it, pass the data array of the interval variable to the zscore(Array<number>) function, which returns an array with the new values.

// import modules
const zscore = require('zscore').default;
const dataForge = require('data-forge');

// read the base file and store the dataframe in the df variable
const df = dataForge.readFileSync('../datasets/iris.csv').parseCSV();

// apply zscore to the values of the "sepallength" column
const newArray = zscore(df.getSeries('sepallength').toArray());

// print the resulting array
console.log(newArray);

Distance Measures

Cluster analysis seeks to group data elements based on the similarity between them. Several factors are taken into account when clustering, and the analyst needs to know which distance measure to choose according to the problem domain and the data set.

Distance measures in general can be defined as measures of similarity or dissimilarity:

  • Similarity: defines how alike the instances are, so that they can be grouped according to their cohesion.
  • Dissimilarity: defines how different the instances are, according to the differences between their attributes.

The most common distance measures are the Euclidean Distance and the Manhattan Distance.

The Euclidean Distance is defined as the square root of the sum of the squared differences between x and y in their respective dimensions. The Manhattan Distance has a simpler definition: it is just the sum of the absolute differences between x and y in each dimension.

Below are the mathematical representations and their respective implementations in JavaScript:

Euclidean Distance: sqrt((x1 - x2)² + (y1 - y2)²).

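// square root of the sum of the squared differences in each dimension
const euclidean = (p, q) => Math.sqrt(
  p.reduce((sum, value, index) => sum + (value - q[index]) ** 2, 0)
);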

Manhattan Distance: |x1 - x2| + |y1 - y2|.

// sum of the absolute differences in each dimension
const manhattan = (a, b) => a
  .map((value, index) => Math.abs(value - b[index]))
  .reduce((v1, v2) => v1 + v2);

One of the problems with grouping techniques is the presence of nominal data among the attributes. Because nominal values have no implicit metric, it is difficult for the algorithms to assign the weights and values needed to form clusters.
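One common workaround, which is not covered in this series and is shown here only as a hedged sketch, is to turn each nominal value into a vector of zeros and ones (one-hot encoding) so that a distance function can be applied to it. The attribute and its categories below are made up for illustration.

```javascript
// Hypothetical nominal attribute; the categories are made up for illustration.
const categories = ['red', 'green', 'blue'];

// One-hot encoding: 1 in the position of the matching category, 0 elsewhere.
const oneHot = (value, allCategories) =>
  allCategories.map((category) => (category === value ? 1 : 0));

// Manhattan distance: sum of the absolute differences in each dimension.
const manhattan = (a, b) =>
  a.reduce((sum, value, i) => sum + Math.abs(value - b[i]), 0);

const red = oneHot('red', categories);   // [1, 0, 0]
const blue = oneHot('blue', categories); // [0, 0, 1]

console.log(manhattan(red, red));  // 0 (same category)
console.log(manhattan(red, blue)); // 2 (different categories)
```

With this encoding every pair of different categories is equally far apart, which may or may not suit a given problem domain.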

A disadvantage of working with distance measures appears when the dimensions are on different scales. Thinking in cartographic terms, for example, the X-axis may hold distances in kilometers while the Y-axis holds distances in centimeters.

Because the Euclidean Distance squares each difference before taking the root, the dimensions with the highest values have a very large influence on the result, so a change of scale (i.e. a conversion from centimeters to kilometers) changes the result considerably.

The Manhattan Distance does not square the differences in X and Y, so outliers and dimensions with large values weigh less heavily on it than on the Euclidean Distance. It is still affected by the scale of the data set, however, which is one more reason to standardize the variables first.
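The effect of squaring can be seen by looking at how much each dimension contributes to the total. The sketch below uses made-up differences between two points, one small and one large, and compares the share of each dimension in the Manhattan Distance and in the squared Euclidean Distance.

```javascript
// Hypothetical per-dimension differences between two points (made-up values).
const differences = [1, 10];

// Share of the Manhattan distance that comes from each dimension.
const absTotal = differences.reduce((sum, d) => sum + Math.abs(d), 0);
const manhattanShare = differences.map((d) => Math.abs(d) / absTotal);

// Share of the squared Euclidean distance that comes from each dimension.
const squaredTotal = differences.reduce((sum, d) => sum + d ** 2, 0);
const euclideanShare = differences.map((d) => d ** 2 / squaredTotal);

console.log(manhattanShare.map((s) => s.toFixed(3))); // [ '0.091', '0.909' ]
console.log(euclideanShare.map((s) => s.toFixed(3))); // [ '0.010', '0.990' ]
```

In this example the larger dimension accounts for about 91% of the Manhattan total and about 99% of the squared Euclidean total.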

The ml-distance package implements 43 distance measures, which gives more flexibility when working with clustering models. Its documentation can be found at: https://www.npmjs.com/package/ml-distance

For installation, just run the command below in the project folder.

$ npm i ml-distance --save

To use this package, we import it into the script, take its distance object and call the implementations, as shown below:

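// import the distance object from the package
const distance = require('ml-distance').distance;

// calculate both distances between the points [1,2] and [2,3]
const d_euclidean = distance.euclidean([1, 2], [2, 3]);
const d_manhattan = distance.manhattan([1, 2], [2, 3]);

console.log('Euclidean: ', d_euclidean);
console.log('Manhattan: ', d_manhattan);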

In this way we calculate the respective distances, which can then be used in the clustering process.
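As an illustration of how such distances might feed a clustering step, the sketch below builds a matrix with the distance between every pair of points. The helper name distanceMatrix and the sample points are assumptions made for this example; a plain Euclidean function is defined inline so the snippet runs without any package.

```javascript
// Euclidean distance between two points with the same number of dimensions.
const euclidean = (p, q) =>
  Math.sqrt(p.reduce((sum, value, i) => sum + (value - q[i]) ** 2, 0));

// Hypothetical helper: distance from every point to every other point.
const distanceMatrix = (points, distanceFn) =>
  points.map((p) => points.map((q) => distanceFn(p, q)));

// Made-up sample points.
const points = [[0, 0], [3, 4], [6, 8]];
const matrix = distanceMatrix(points, euclidean);

console.log(matrix);
// [ [ 0, 5, 10 ], [ 5, 0, 5 ], [ 10, 5, 0 ] ]
```

Any function that takes two arrays and returns a number can be passed as distanceFn, so the same helper works with other distance measures.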

You’re ready for the next step! In the next article we will see in practice how to create clusters from data using JavaScript: Part 3 - Clustering Algorithms - Case Study 1

Leave your comment and add me on your social networks: Twitter and LinkedIn
