Scatter Matrix, Covariance and Correlation Explained

Raghavan
3 min read · Aug 16, 2018


Understanding the relation between two variables is a common data science task. We mostly use correlation for this, but we also often hear about the scatter matrix (and scatter plots) and covariance. Let's look at what each of them is, how each is calculated, and what each signifies. We will also implement each one, building each on top of the previous.

Scatter matrix generated with seaborn.

The question all of these methods answer is: what is the relation between the variables in the data?

Scatter Matrix:

The scatter matrix is an estimate of the covariance matrix, useful when the covariance cannot be calculated or is costly to calculate. The scatter matrix is also used in a lot of dimensionality reduction exercises. If there are k variables, the scatter matrix has k rows and k columns, i.e. it is a k × k matrix.

How the scatter matrix is calculated: given n sample vectors x_1, …, x_n with mean vector m, the scatter matrix is S = Σ (x_i − m)(x_i − m)ᵀ, summed over i = 1, …, n.

In Python, the scatter matrix can be computed as follows:

import numpy as np

# Create a 3 x 20 matrix of samples: 20 draws from a
# 3-variable standard normal distribution.
mu_vec1 = np.array([0, 0, 0])
cov_mat1 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
samples = np.random.multivariate_normal(mu_vec1, cov_mat1, 20).T

# Compute the mean vector.
mean_x = np.mean(samples[0, :])
mean_y = np.mean(samples[1, :])
mean_z = np.mean(samples[2, :])
mean_vector = np.array([[mean_x], [mean_y], [mean_z]])

# Compute the scatter matrix.
scatter_matrix = np.zeros((3, 3))
for i in range(samples.shape[1]):
    diff = samples[:, i].reshape(3, 1) - mean_vector
    scatter_matrix += diff.dot(diff.T)
print('Scatter Matrix:\n', scatter_matrix)

The scatter matrix contains, for each pair of variables, a value describing the relation between them. Let's observe the scatter matrix for the following matrix:

arange = np.arange(0, 40)
samples = np.array([arange * 3, arange * -1])

mean_vector = np.mean(samples, axis=1).reshape(2, 1)
scatter_matrix = np.zeros((2, 2))
for i in range(samples.shape[1]):
    diff = samples[:, i].reshape(2, 1) - mean_vector
    scatter_matrix += diff.dot(diff.T)
print('Scatter Matrix:\n', scatter_matrix)

Output:
Scatter Matrix:
 [[ 47970. -15990.]
 [-15990.   5330.]]

If we tweak the matrix generation by changing -1 to 1:

arange = np.arange(0, 40)
samples = np.array([arange * 3, arange * 1])

The off-diagonal entries of the output change sign:

Output:
Scatter Matrix:
 [[47970. 15990.]
 [15990.  5330.]]

We can observe that the sign of the scatter matrix entry for a pair of variables denotes whether one increases when the other increases or decreases.
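As a side note (not in the original post), the per-sample loop above can be replaced by a single matrix product: center each variable and multiply the centered matrix by its own transpose. A minimal vectorized sketch:

```python
import numpy as np

arange = np.arange(0, 40)
samples = np.array([arange * 3, arange * -1])

# Center each variable, then the scatter matrix is X_c . X_c^T,
# which accumulates the same outer products as the explicit loop.
centered = samples - samples.mean(axis=1, keepdims=True)
scatter_matrix = centered.dot(centered.T)
print('Scatter Matrix:\n', scatter_matrix)
```

This produces the same [[47970, -15990], [-15990, 5330]] result as the loop, with no Python-level iteration.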

Covariance Matrix:

Covariance is defined as a measure of the joint variability of two random variables. Given that we have already calculated the scatter matrix, computing the covariance matrix is straightforward: we just have to divide each entry of the scatter matrix by n − 1 (here 39, since we have 40 samples).

This can be verified with:

print('Covariance Matrix:\n', np.cov(samples))
print('Scatter Matrix:\n', scatter_matrix)
print('Unscaled covariance matrix, same as the scatter matrix:\n', np.cov(samples) * 39)
Output:
Covariance Matrix:
 [[1230.          410.        ]
 [ 410.          136.66666667]]
Scatter Matrix:
 [[47970. 15990.]
 [15990.  5330.]]
Unscaled covariance matrix, same as the scatter matrix:
 [[47970. 15990.]
 [15990.  5330.]]

With both the scatter matrix and the covariance matrix, it is hard to interpret the magnitude of the values, as they are subject to the magnitude of the variables themselves. Hence only the sign of a value is really helpful. (Feature scaling does help. More on feature scaling here.) To really understand the strength of the relation between variables, we have to look at the correlation.
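To illustrate the feature-scaling point, here is a sketch (my addition, building on the example above): if each variable is standardized to zero mean and unit sample variance, the covariance of the scaled data equals the correlation of the original data, so its magnitudes become directly comparable.

```python
import numpy as np

arange = np.arange(0, 40)
samples = np.array([arange * 3, arange * 1])

# Standardize each variable: subtract its mean and divide by its
# standard deviation (ddof=1 to match the n-1 scaling of np.cov).
scaled = (samples - samples.mean(axis=1, keepdims=True)) \
         / samples.std(axis=1, ddof=1, keepdims=True)

# The covariance of the standardized data equals the correlation
# matrix of the original data.
print('Covariance of scaled data:\n', np.cov(scaled))
print('Correlation of original data:\n', np.corrcoef(samples))
```

Both prints show the same matrix of ones here, since the two variables are perfectly correlated.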

Correlation Matrix:

The correlation matrix tells us how two variables interact, in both direction and magnitude. The commonly used correlation coefficient is the Pearson correlation coefficient. We compute the correlation matrix by dividing the covariance of two variables by the product of their standard deviations.

It can be verified as follows:

std_dev_of_x1 = np.std(arange * 3)
std_dev_of_x2 = np.std(arange * -1)

std_dev_products = np.array(
    [[std_dev_of_x1 * std_dev_of_x1, std_dev_of_x1 * std_dev_of_x2],
     [std_dev_of_x1 * std_dev_of_x2, std_dev_of_x2 * std_dev_of_x2]]
)

print('Correlation Matrix:\n', np.corrcoef(samples))
print('Std deviation products:\n', std_dev_products)
print('Correlation matrix computed from covariance:\n', np.divide(np.cov(samples), std_dev_products))
Output:
Correlation Matrix:
 [[1. 1.]
 [1. 1.]]
Std deviation products:
 [[1199.25  399.75]
 [ 399.75  133.25]]
Correlation matrix computed from covariance:
 [[1.02564103 1.02564103]
 [1.02564103 1.02564103]]

Voila! The correlation matrix from NumPy is very close to what we computed from the covariance matrix.
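The small gap (about 1.0256 ≈ 40/39) has a simple explanation: np.std defaults to the population formula (ddof=0, dividing by n), while np.cov divides by n − 1. Using sample standard deviations (ddof=1) reproduces np.corrcoef exactly. A minimal sketch:

```python
import numpy as np

arange = np.arange(0, 40)
samples = np.array([arange * 3, arange * 1])

# np.cov divides by n-1, so the standard deviations must also use
# ddof=1 for the ratio to match np.corrcoef exactly.
stds = samples.std(axis=1, ddof=1)
corr = np.cov(samples) / np.outer(stds, stds)
print('Correlation from covariance:\n', corr)
```

With matching ddof the 40/39 factor cancels, and `corr` equals `np.corrcoef(samples)` exactly.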
