Learn about Variable Clustering.

Analyttica Datalab
Dec 20, 2018

Clustering is a way to understand how data is structured, and it also serves as a dimension reduction technique. In today's analytics work, an analyst often has access to massive volumes of data from multiple sources, and it is a challenge to process that data and draw insightful information from it.

When a project involves numerous variables, it becomes difficult to identify the relationships between them, and too many variables can also reduce model efficiency. It is also much harder to build an explainable model when there are many variables.

Variable reduction is a crucial step for accelerating model building without losing the potential predictive power of the data. Variable clustering is one technique that helps with variable reduction.

Input:

To run Variable Clustering in Analyttica TreasureHunt, select the variables you want to cluster and specify the number of clusters/groups you want to form. The variables should be continuous and contain no missing values.
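Before selecting variables, it can help to confirm that they meet these input requirements. The snippet below is a minimal sketch in plain Python, not part of Analyttica TreasureHunt; the function name `check_inputs` and the sample columns are hypothetical. It only checks that values are numeric and not missing, which is a rough proxy for "continuous".

```python
import math

def check_inputs(columns):
    """Return a list of problems; an empty list means the inputs look usable."""
    problems = []
    for name, values in columns.items():
        for v in values:
            # Treat None and NaN as missing values.
            if v is None or (isinstance(v, float) and math.isnan(v)):
                problems.append(f"{name}: missing value")
                break
            # Booleans and strings are not usable as continuous inputs.
            if isinstance(v, bool) or not isinstance(v, (int, float)):
                problems.append(f"{name}: non-numeric value {v!r}")
                break
    return problems

# Hypothetical sample data: "balance" contains a missing value.
columns = {
    "age": [23.0, 41.5, 37.2],
    "balance": [1200.0, float("nan"), 310.5],
}
problems = check_inputs(columns)
print(problems)
```

Any variable reported here would need to be imputed or dropped before clustering.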

Application & Interpretation:

Variable clustering divides a set of numeric variables into either disjoint or hierarchical clusters. Associated with each cluster is a linear combination of the variables in the cluster, which may be either the first principal component or the centroid component.
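To make the idea of a cluster component concrete, here is a minimal sketch of the centroid component, taken here to be the row-by-row average of the standardized variables in a cluster. It uses the centroid option rather than the first principal component to keep the code short. The variable names and values are hypothetical.

```python
from statistics import mean, pstdev

def standardize(values):
    """Scale a variable to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def centroid_component(variables):
    """Average the standardized variables of one cluster, row by row."""
    scaled = [standardize(v) for v in variables]
    return [mean(row) for row in zip(*scaled)]

# Hypothetical cluster made of two variables observed on four records.
income = [10.0, 20.0, 30.0, 40.0]
spend = [1.0, 2.0, 3.0, 4.0]
component = centroid_component([income, spend])
print(component)
```

Because the two sample variables are perfectly correlated, the component here is the same as either standardized variable; with real data it would be a blend of the cluster members.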

The usual rule is to select the variable with the lowest 1-R² ratio as the cluster representative.

The 1-R² ratio is defined as,

$$1 - R^2\ \text{ratio} = \frac{1 - R^2_{\text{own}}}{1 - R^2_{\text{nearest}}}$$

Intuitively, we want the cluster representative to be as closely correlated as possible with its own cluster and as uncorrelated as possible with the nearest cluster. Therefore, the optimal representative of a cluster is the variable whose 1-R² ratio is closest to zero.
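The ratio can be computed directly once the cluster components are known. The sketch below assumes that R² (own) is the squared Pearson correlation between a variable and its own cluster component, and that the nearest cluster is the other cluster whose component has the highest R² with that variable. The function names and sample values are hypothetical.

```python
from statistics import mean

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

def one_minus_r2_ratio(variable, own_component, other_components):
    """(1 - R² own) / (1 - R² nearest) for a single variable."""
    r2_own = r_squared(variable, own_component)
    # Assumption: the nearest cluster is the one most correlated with the variable.
    r2_nearest = max(r_squared(variable, c) for c in other_components)
    return (1.0 - r2_own) / (1.0 - r2_nearest)

# Hypothetical variable, its own cluster component, and one other cluster's component.
variable = [1.0, 2.0, 3.0, 4.0, 5.0]
own = [1.0, 2.0, 3.0, 4.0, 6.0]
others = [[5.0, 3.0, 4.0, 1.0, 2.0]]
ratio = one_minus_r2_ratio(variable, own, others)
print(round(ratio, 4))
```

A ratio near zero means the variable tracks its own cluster closely while sharing little with the nearest other cluster.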

The 1-R² ratio is the standard rule in the clustering literature for selecting the cluster representative. Business knowledge from subject matter experts should complement this rule to guide the selection of variables. For this reason, we may decide to keep more than one variable per cluster. An alternate variable may also be easier to justify to the business and give a more intuitive interpretation of the model than the cluster representative.

The example below illustrates how we can reduce the number of variables using variable clustering:

In the above example, we have divided 9 variables into 3 groups or clusters. From each cluster, we select the variable with the lowest 1-R² ratio.
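The selection step itself is simple to script. The sketch below picks the variable with the lowest 1-R² ratio in each cluster. The variable names, cluster assignments and ratios are hypothetical placeholders and are not the values from the example above.

```python
# Hypothetical (variable, cluster, 1-R² ratio) rows for 9 variables in 3 clusters.
rows = [
    ("var1", 1, 0.21), ("var2", 1, 0.08), ("var3", 1, 0.35),
    ("var4", 2, 0.15), ("var5", 2, 0.40), ("var6", 2, 0.27),
    ("var7", 3, 0.52), ("var8", 3, 0.11), ("var9", 3, 0.30),
]

def pick_representatives(rows):
    """Return {cluster: variable} keeping the lowest-ratio variable per cluster."""
    best = {}
    for name, cluster, ratio in rows:
        if cluster not in best or ratio < best[cluster][1]:
            best[cluster] = (name, ratio)
    return {cluster: name for cluster, (name, _) in best.items()}

representatives = pick_representatives(rows)
print(representatives)
```

In practice this list is a starting point: a subject matter expert may swap in or add variables that are easier to interpret.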

See Also:

Principal Component Analysis, Coefficient of Determination, Variance Inflation Factor.

Analyttica Datalab (www.analyttica.com) is a contextual Data Science (DS) & Machine Learning (ML) Platform Company.