Data Clustering with Javascript — Part 1: Fundamental Concepts

Prof. João Gabriel Lima
3 min read · Nov 26, 2017


The need to classify elements by their characteristics is present in several areas of knowledge. Given the difficulty of examining all combinations of possible groups in a large volume of data, some techniques have been developed to aid in the formation of clusters.

This is the first part of a three-article series, with commented code, on Data Clustering using Javascript.

You may need to understand how to manipulate data for analysis using Javascript; for that, I recommend reading the article Exploratory Data Analysis with Javascript — Part 1: Data Manipulation.

About Data Clustering

In summary, clustering is a set of techniques that aims to gather data objects into groups, such that objects in the same group are more similar to each other than to objects in other groups, according to a measure of similarity.

The primary objective is to identify objects that have characteristics in common and separate them into similar subsets, determining both the number and the characteristics of these groups.
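To make this concrete, here is a minimal sketch in Javascript: objects are represented as arrays of numeric features, and a greedy pass puts each object into the first group whose representative is close enough. The data, the 1.0 threshold, and the function names are made up for illustration, not taken from any library.

```javascript
// Euclidean distance between two feature vectors
function distance(a, b) {
  return Math.sqrt(a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0));
}

// Greedy grouping: put each object into the first group whose
// representative (its first member) is within the threshold
function group(objects, threshold) {
  const groups = [];
  for (const obj of objects) {
    const g = groups.find(g => distance(g[0], obj) <= threshold);
    if (g) g.push(obj);
    else groups.push([obj]);
  }
  return groups;
}

const data = [[1, 1], [1.2, 0.9], [5, 5], [5.1, 4.8]];
console.log(group(data, 1.0));
// → [ [ [1,1], [1.2,0.9] ], [ [5,5], [5.1,4.8] ] ]
```

Objects close to each other end up in the same group; objects far apart start new groups. Real algorithms are more careful than this greedy pass, but the goal is the same.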

There are numerous algorithms, based on measures of similarity or on probabilistic models, for forming clusters. Knowing the different algorithms, it is possible to determine which one is the best choice for the established objectives, thus reducing processing time and cost.

Clustering is an unsupervised classification of patterns in data sets: groups are formed according to characteristics the patterns have in common, unlike supervised classification, in which the patterns already carry known labels.
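As a hedged illustration of what "unsupervised" means in practice, below is a minimal sketch of k-means, one of the best-known clustering algorithms. Note that no labels are given: the algorithm discovers the groups itself. The naive initialization, the fixed iteration count, and the sample points are simplifications for the example, not a production implementation.

```javascript
function euclidean(a, b) {
  return Math.sqrt(a.reduce((s, ai, i) => s + (ai - b[i]) ** 2, 0));
}

function kmeans(points, k, iterations = 10) {
  // Naive initialization: use the first k points as centroids
  let centroids = points.slice(0, k);
  let labels = [];

  for (let it = 0; it < iterations; it++) {
    // Assignment step: each point goes to its nearest centroid
    labels = points.map(p => {
      let best = 0;
      centroids.forEach((c, j) => {
        if (euclidean(p, c) < euclidean(p, centroids[best])) best = j;
      });
      return best;
    });

    // Update step: each centroid becomes the mean of its points
    centroids = centroids.map((c, j) => {
      const members = points.filter((_, i) => labels[i] === j);
      if (members.length === 0) return c; // keep empty clusters in place
      return members[0].map((_, d) =>
        members.reduce((s, m) => s + m[d], 0) / members.length
      );
    });
  }
  return { labels, centroids };
}

const points = [[1, 2], [1.5, 1.8], [8, 8], [9, 9], [1, 0.6]];
console.log(kmeans(points, 2).labels); // e.g. [0, 0, 1, 1, 0]
```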

Key Features

All computational models have fundamental characteristics and constraints. The main requirements for the construction of a clustering model are:

Be independent of the order of presentation of the data — The same data set, when presented in different orderings, should lead to the same results;

Identify clusters of different sizes — In addition to shape, some methods tend to produce clusters of homogeneous size;

Be scalable to handle any amount of data — A large database can hold hundreds of billions of records. The methods must be fast and scalable in the number of dimensions and the number of records to be clustered;

Provide interpretable and usable results — Cluster descriptions should be easy to assimilate; results should be interpretable, understandable, and usable;

Be efficient in the presence of noise — Most real databases contain noise: unknown or erroneous data. Its presence should not affect the quality of the clusters obtained;

Accept constraints — Real-world applications may need to group objects according to various types of constraints. The methods must find data groups whose structures satisfy the specified constraints;

Find the right number of clusters — Determining the number of clusters in a set of objects is a difficult task; many methods need a reference value specified by the user;

Handle high dimensionality — Support databases that have thousands of records and/or a large number of columns (attributes);

Offer domain transparency — Require only minimal knowledge of the domain to determine the input parameters.

Assuming all these requirements are met, an important step in the use of cluster analysis is choosing a criterion that measures the proximity between two objects, that is, a criterion that says how similar or different two objects are.

The smaller the value of the distance between objects, the more similar they will be.

The choice of criterion depends essentially on the type of variable involved; for each type, there are one or more similarity measures that can be applied.
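For numeric variables, for example, two common measures are the Euclidean and the Manhattan distance. The sketch below compares them on a pair of made-up points; which one is appropriate depends on the type of data.

```javascript
// Euclidean distance: straight-line distance between two points
function euclidean(a, b) {
  return Math.sqrt(a.reduce((s, ai, i) => s + (ai - b[i]) ** 2, 0));
}

// Manhattan distance: sum of absolute coordinate differences
function manhattan(a, b) {
  return a.reduce((s, ai, i) => s + Math.abs(ai - b[i]), 0);
}

const p = [1, 2];
const q = [4, 6];
console.log(euclidean(p, q)); // 5 — straight-line distance
console.log(manhattan(p, q)); // 7 — sum of coordinate differences
```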

In the next post, these criteria and the algorithms for calculating the distance between data points will be covered: Part 2 — Standardization of variables and distance measures.
Leave a comment and connect with me on social networks: Twitter and Linkedin
