Visualising Correlations using Graph

Javier Burroni
3 min read · Dec 4, 2014

When dealing with high-dimensional data, analysing correlations is a problem. I have run into this situation in various projects, and I found that using a graph to visualise correlations is very useful. It’s easy to do, and you gain a lot of information very quickly. With this technique, you get a powerful insight into how to model complex processes. This is not a novel idea, but I think it is worth sharing.

[Figure: graph for threshold t = 0.35]

Suppose you have a dataset with a large number of dimensions. For example, in the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.html) you can find the Optical Recognition of Handwritten Digits dataset (https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits), a dataset with 63 variables. Its correlation matrix is therefore 63 by 63, or 3969 elements. Going through each element and trying to infer relations from the matrix is a demanding activity.
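As a minimal sketch of this first step, the correlation matrix can be computed with NumPy. The data below is synthetic and purely illustrative (six made-up variables, not the digits dataset); with 63 columns the same call would return the 63 by 63 matrix mentioned above.

```python
import numpy as np

# Synthetic stand-in for a real dataset: 200 rows, 6 variables (all hypothetical).
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
data = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.3 * rng.normal(size=200),  # strongly tied to variable 0
    base[:, 0] + 0.6 * rng.normal(size=200),  # more loosely tied to variable 0
    base[:, 1],
    base[:, 1] + 0.3 * rng.normal(size=200),  # strongly tied to variable 3
    rng.normal(size=200),                     # independent noise
])

# rowvar=False: columns are variables, rows are observations.
corr = np.corrcoef(data, rowvar=False)

print(corr.shape)  # (6, 6)
print(corr.size)   # 36
```

Note that a column with zero variance yields NaN entries in `np.corrcoef`, so constant columns are best dropped before this step.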

We know that a labeled graph can be represented with an adjacency matrix. In the simple case, a 0 at row x and column y means that no edge goes from node x to node y, and a 1 means that an edge goes from node x to node y. The correlation matrix is a square matrix with values ranging from -1 to 1, and we want to transform it into an adjacency matrix. Moreover, the correlation matrix is symmetric, so the graph it represents must be undirected. The first approach is to turn every non-zero value of the correlation matrix into 1. In this case (ignoring the diagonal), we get a complete graph unless some pairs of variables have exactly zero correlation:
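A small sketch of this naive approach, using a made-up 4-variable correlation matrix (the values are assumptions chosen for illustration). Every off-diagonal entry is non-zero, so the number of edges equals n(n-1)/2, the edge count of a complete graph.

```python
import numpy as np

# Hypothetical correlation matrix: symmetric, ones on the diagonal.
corr = np.array([
    [1.0,  0.8, -0.1,  0.3],
    [0.8,  1.0,  0.2, -0.4],
    [-0.1, 0.2,  1.0,  0.05],
    [0.3, -0.4,  0.05, 1.0],
])

# Every non-zero correlation becomes an edge.
adjacency = (corr != 0).astype(int)
np.fill_diagonal(adjacency, 0)  # a variable's correlation with itself is not an edge

n = corr.shape[0]
# Count each undirected edge once by looking only above the diagonal.
n_edges = int(np.triu(adjacency, k=1).sum())

print(n_edges, n * (n - 1) // 2)  # 6 6
```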

[Figure: complete graph]

This is not very useful: every possible edge is displayed, and we are not using the actual correlation values. Our main goal is to find relations within a set of variables.

To improve on this, we create an edge between two nodes only if the absolute value of the correlation between them is greater than a threshold, that is:

igraph.Graph.Adjacency((np.abs(adj) > threshold).tolist(), mode="undirected")

[To build the graph I’m using the igraph library for Python, which has nice plots. Here adj is the correlation matrix as a NumPy array.]
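The thresholding step can also be sketched with NumPy alone, which makes it easy to inspect the result before handing it to a graph library. The helper name `threshold_adjacency` and the 4-variable matrix are hypothetical. The diagonal is cleared here because a variable always correlates perfectly with itself, which would otherwise show up as a self-loop on every node.

```python
import numpy as np

def threshold_adjacency(corr, threshold):
    """Return a symmetric 0/1 matrix with an edge wherever |corr| > threshold."""
    adjacency = (np.abs(corr) > threshold).astype(int)
    np.fill_diagonal(adjacency, 0)  # drop self-loops
    return adjacency

# Hypothetical correlation matrix, for illustration only.
corr = np.array([
    [1.0,  0.8, -0.1,  0.3],
    [0.8,  1.0,  0.2, -0.4],
    [-0.1, 0.2,  1.0,  0.05],
    [0.3, -0.4,  0.05, 1.0],
])

adjacency = threshold_adjacency(corr, 0.35)

# List each undirected edge once (upper triangle only).
rows, cols = np.nonzero(np.triu(adjacency, k=1))
edges = [(int(i), int(j)) for i, j in zip(rows, cols)]

print(edges)  # [(0, 1), (1, 3)]
```

The negative correlation of -0.4 between variables 1 and 3 produces an edge because the comparison uses the absolute value. The result of `adjacency.tolist()` should then be usable as input to a graph library's adjacency-matrix constructor.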

If the threshold is at least as large as the maximum absolute correlation between two distinct variables, we get a set of unconnected nodes:

[Figure: graph with a threshold of 1]

But we can vary the threshold and watch how the structure emerges.

[Figures: graphs using thresholds 0.55, 0.45, 0.35 and 0.25]
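To see numerically what the plots show, one option is to sweep the threshold and count connected components at each value. The sketch below uses a small union-find and a made-up 5-variable correlation matrix; `count_components` is a hypothetical helper, not an igraph function.

```python
import numpy as np

def count_components(adjacency):
    """Count connected components of an undirected graph using union-find."""
    n = len(adjacency)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if adjacency[i][j]:
                parent[find(i)] = find(j)  # merge the two components
    return len({find(i) for i in range(n)})

# Hypothetical correlation matrix, for illustration only.
corr = np.array([
    [1.0, 0.6, 0.5, 0.1, 0.0],
    [0.6, 1.0, 0.4, 0.3, 0.1],
    [0.5, 0.4, 1.0, 0.2, 0.0],
    [0.1, 0.3, 0.2, 1.0, 0.1],
    [0.0, 0.1, 0.0, 0.1, 1.0],
])

counts = {}
for t in (1.0, 0.55, 0.45, 0.35, 0.25):
    adjacency = np.abs(corr) > t
    np.fill_diagonal(adjacency, False)  # ignore self-correlation
    counts[t] = count_components(adjacency)

print(counts)  # {1.0: 5, 0.55: 4, 0.45: 3, 0.35: 3, 0.25: 2}
```

At a threshold of 1 every node is isolated, and the components merge as the threshold drops.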

From the images above it’s clear that some clusters emerge: the connected components. Moreover, in the image for threshold 0.35, we can take the giant component and apply community detection, which creates nested clusters. Using this information, we can analyse each cluster on its own. We can also use it to build a reduced model where each cluster is represented by a single number. How a cluster becomes a number is a different story.
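As a rough sketch of the first part of this idea, the connected components can be extracted with a breadth-first search, and the largest one (the giant component) gives the sub-matrix of correlations to study further. The matrix and the helper `connected_components` are hypothetical; community detection itself is left to a graph library and is not reimplemented here.

```python
from collections import deque

import numpy as np

def connected_components(adjacency):
    """Return the components of an undirected graph as sorted lists of node indices."""
    n = len(adjacency)
    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        members = []
        while queue:
            node = queue.popleft()
            members.append(node)
            # Visit every not-yet-seen neighbour of the current node.
            for neighbour in np.flatnonzero(adjacency[node]):
                neighbour = int(neighbour)
                if not seen[neighbour]:
                    seen[neighbour] = True
                    queue.append(neighbour)
        components.append(sorted(members))
    return components

# Hypothetical correlation matrix, for illustration only.
corr = np.array([
    [1.0, 0.6, 0.5, 0.1, 0.0],
    [0.6, 1.0, 0.4, 0.3, 0.1],
    [0.5, 0.4, 1.0, 0.2, 0.0],
    [0.1, 0.3, 0.2, 1.0, 0.1],
    [0.0, 0.1, 0.0, 0.1, 1.0],
])

adjacency = np.abs(corr) > 0.35
np.fill_diagonal(adjacency, False)

clusters = connected_components(adjacency)
giant = max(clusters, key=len)

# Correlations restricted to the variables in the giant component.
giant_corr = corr[np.ix_(giant, giant)]

print(clusters)          # [[0, 1, 2], [3], [4]]
print(giant_corr.shape)  # (3, 3)
```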
