Understanding Marketing Analytics in Python. [Part 7] Working with Correlation Coefficients — with example and code

Kamna Sinha
Published in Data At The Core !
Sep 22, 2023 · 7 min read

This is part 7 of the series on Marketing Analytics; have a look at the entire series introduction, with details of each part, here.

Up till now we have been looking at finding relations between continuous variables using various visualization techniques [parts 4, 5, 6]. We now go further by looking at actual numerical values that can show these relationships more accurately, so that conclusions can be reached with more certainty for better business decisions.

For this, we shall broadly look at the following topics:

I. Pearson’s r — a standardized correlation coefficient for bivariate (ideally normally distributed) data, which is easier to interpret than raw covariance.

II. Correlation Tests for statistical significance

III. Correlation Matrices — to measure correlation between all pairs of variables in a dataset using the pandas corr() method.

IV. Visualizing correlations using the plt.imshow() and sns.heatmap() functions.

Pearson’s r

Covariance measures the degree to which two variables vary together. When we compare two variables, covariance measures the degree to which both variables are higher or lower than their respective means at the same time. A positive covariance indicates that their patterns match, and a negative covariance indicates that their patterns are offset, i.e. when one is higher than its mean, the other is lower than its mean.
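Formally, the sample covariance of x and y across n observations is:

\operatorname{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})

(the n − 1 denominator is also what numpy.cov() uses by default).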

Covariance can be computed for any two variables using the numpy.cov() function:
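As a minimal sketch (the cust_df below is a hypothetical stand-in for the simulated customer dataframe used in this series, so its numbers will differ from the 74.55 reported for the article's dataset):

import numpy as np
import pandas as pd

# hypothetical stand-in for the series' simulated customer data
cust_df = pd.DataFrame({
    "age": [25, 32, 41, 55, 38, 47],
    "credit_score": [610, 650, 700, 740, 660, 720],
})

# np.cov() returns the 2 x 2 variance-covariance matrix
print(np.cov(cust_df.age, cust_df.credit_score))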

This is a variance-covariance matrix, with the variance of each variable on the diagonal. The covariance between age and credit score is in the off-diagonal: 74.55.
1. If values xi and yi tend to go in the same direction — to be both higher or both lower than their respective means — across observations, then they have a positive covariance.
2. If cov(x, y) is zero, then there is no (linear) association between xi and yi .
3. A negative covariance means that the variables go in opposite directions relative to their means: when xi is lower, yi tends to be higher.

It is difficult to interpret the magnitude of covariance because the scale depends on the variables involved.

[Covariance will be different if the variables are measured in cents versus dollars, or in inches versus centimeters.]

So it is helpful to scale the covariance by the standard deviation of each variable, which results in a standardized, rescaled correlation coefficient known as the Pearson product-moment correlation coefficient, usually abbreviated with the symbol r.
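In the same notation, the Pearson coefficient is the covariance rescaled by both standard deviations:

r = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}

where s_x and s_y are the sample standard deviations of x and y.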

Pearson’s r is a continuous metric that falls in the range [−1, +1]. It is +1 in the case of a perfect positive linear association between the two variables, and −1 for perfect negative linear association. If there is little or no linear association, r will be near 0.

This makes r an easily interpreted metric to assess whether two variables have a close linear association or not.
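A minimal sketch, reusing the cust_df from the covariance example above:

import numpy as np

# np.corrcoef() returns the correlation matrix; r is the off-diagonal entry
print(np.corrcoef(cust_df.age, cust_df.credit_score))

# equivalently, r is the covariance scaled by both standard deviations
cov_xy = np.cov(cust_df.age, cust_df.credit_score)[0, 1]
r = cov_xy / (cust_df.age.std() * cust_df.credit_score.std())
print(r)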

Here r = 0.29.

Next: what value of r signifies an important correlation between two variables in marketing?
To determine whether a correlation is important, we often use Cohen’s rules of thumb. For correlations between variables describing people:
1. r = 0.1 should be considered a small or weak association,
2. r = 0.3 might be considered medium in strength, and
3. r = 0.5 or higher could be considered large or strong.

It is important to note that this depends on the assumption that the variables are normally distributed (also known as Gaussian), or approximately so.

Correlation Tests

Measuring the effect of one variable on another numerically may not be enough: we also need to understand the statistical significance of the correlation, to decide whether the effect should be considered in further analyses.

For this we use the pearsonr function, which reports both the coefficient and its significance, i.e. the p-value.

The “scipy.stats.pearsonr()” function returns two values:

  • Pearson correlation coefficient: it ranges between −1 and 1, where −1 specifies a perfect negative linear relationship, 1 indicates a perfect positive linear relationship, and 0 corresponds to no linear relationship.
  • p-value: the p-value associated with the hypothesis test for the correlation coefficient.

For example:
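A minimal sketch, again with the assumed cust_df:

from scipy import stats

# pearsonr returns the coefficient r and the two-tailed p-value
r, p_value = stats.pearsonr(cust_df.age, cust_df.credit_score)
print(r, p_value)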

This tells us that r = 0.29 and the two-tailed p-value is very close to zero. The p-value is the probability of observing a correlation at least as large in magnitude as the reported r under the null hypothesis that the true correlation is 0. In this case we can reject that null hypothesis with reasonable confidence. Such a correlation, showing a medium-sized effect and statistical significance, probably should not be ignored in subsequent analyses.

Correlation Matrices

For more than two variables, it is more convenient to use the pandas corr() method to compute the correlations between all pairs x, y at once as a correlation matrix. As with the numpy function, such a matrix shows r = 1.0 on the diagonal because cor(x, x) = 1. It is also symmetric; cor(x, y) = cor(y, x). But unlike numpy.corrcoef() it returns its output as a dataframe and ignores any non-numeric data:
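A sketch, assuming cust_df also carries the other columns discussed below (distance_to_store, store_trans, store_spend, the online_* columns, and the sat_* survey items):

corr_matrix = cust_df.corr()  # Pearson r for every pair of numeric columns
# (in recent pandas versions you may need cust_df.corr(numeric_only=True)
# to skip non-numeric columns explicitly)
print(corr_matrix)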

A correlation matrix between all variables, in the form of a dataframe

Observations :

  1. In the second column of the first row, we see that cor(age, credit_score) = 0.29, as above.
  2. We can easily scan to find other large correlations; for instance, cor(store_trans, distance_to_store) = −0.25, showing that people who live further from a store tend to have fewer in-store transactions.
  3. For sat_selection and sat_service, the corr() method ignores the NaN values (computing each correlation from pairwise-complete observations).

It would be even more interesting to visualize these correlations, using matplotlib’s pyplot and seaborn…

Visualizing correlations using plt.imshow() and sns.heatmap() functions

Continuing with our example, we shall look at three variations of this visualization, each one an improvement upon the last:

  1. imshow() — The basic function of Matplotlib’s imshow() is to display an image object. An image is simply an array of values, and a correlation matrix is exactly such an array, so we can display it with imshow().

format : plt.imshow(X, cmap=None, norm=None, aspect=None, interpolation=None, alpha=None, vmin=None, vmax=None, origin=None, extent=None)

parameters : X — the data we want to display with imshow. It can be a list or an array. For grayscale images it is a 2-D array, and for colored images a 3-D array. Every element in the array acts as a pixel.

cmap — used to map non-color (scalar) data to colors. We can pass any registered colormap name as the argument. If the image is already colored (an RGB array), the cmap parameter is ignored.

norm — used to normalize the scalar color values into the 0.0 to 1.0 range. This is ignored in the case of colored images.

aspect — used to adjust the aspect ratio of the image. The two common arguments are ‘auto’ and ‘equal’. We will understand it better when we look at an example.

alpha — used to change the transparency of the image. For an opaque image use 1, and for a completely transparent image use 0; the range is 0 to 1.

origin — to move the origin (0, 0) from the upper-left to the lower-left corner, set origin=’lower’.

There are many more parameters in imshow, but these are the most important ones.

Returns : a matplotlib.image.AxesImage object.
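A minimal sketch of these parameters on random data (all values here are hypothetical):

import numpy as np
import matplotlib.pyplot as plt

data = np.random.rand(10, 10)  # a 10 x 10 array; each element acts as a pixel
plt.imshow(data, cmap="viridis", aspect="equal", alpha=0.9,
           vmin=0.0, vmax=1.0, origin="lower")
plt.colorbar()
plt.show()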

Coming back to our example: this is the simplest use of imshow() to see the correlation matrix amongst all variables of our dataframe.

For clarity, we will first look at the output of the corr() function:

Now, passing the same matrix to imshow() gives the following plot:
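A sketch, reusing corr_matrix from above:

import matplotlib.pyplot as plt

plt.imshow(corr_matrix)  # each cell is colored by its r value
plt.colorbar()
plt.show()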

A nicer plot with proper axis labels can be generated with the seaborn heatmap() function.
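For instance (a sketch, again on corr_matrix):

import matplotlib.pyplot as plt
import seaborn as sns

# heatmap() labels rows and columns with the dataframe's column names
sns.heatmap(corr_matrix)
plt.show()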

We can also customize the output of heatmap() to make it even easier to interpret; see the sketch after the list below.

A number of optimizations make it easier to interpret:
1. We set vmin and vmax to improve the dynamic range, and add annotations rounded to two decimal places (annot=True, fmt=’.2f’).
2. Since the upper and lower triangles are identical and the diagonal is all 1.0, we add a mask built with the numpy.tri() function so that only the lower triangle is shown, which simplifies the visualization.
3. We also disable the color bar, as the annotations are written directly on the figure (cbar=False).
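A sketch of these customizations; the vmin/vmax bounds are assumptions chosen for this data, and since sns.heatmap() hides cells where the mask is True, we mask everything except the strict lower triangle:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# True on and above the diagonal, so only the lower triangle is drawn
mask = ~np.tri(corr_matrix.shape[0], k=-1, dtype=bool)

sns.heatmap(corr_matrix,
            vmin=-0.3, vmax=0.6,    # assumed bounds to improve the dynamic range
            annot=True, fmt=".2f",  # write r, rounded to two decimals, in each cell
            mask=mask,
            cbar=False)             # the annotations replace the color bar
plt.show()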

Observations :
The colored and numeric values of r are shown in the lower triangle of the matrix.
This makes it easy to find the larger correlations in the data:
1. age is positively correlated with credit_score;
2. distance_to_store is negatively correlated with store_trans and store_spend;
3. online_visits, online_trans, and online_spend are all strongly correlated with one another, as are store_trans and store_spend.
4. In the survey items, sat_service is positively correlated with sat_selection.

We at Sensewithai did a similar analysis for one of our customers; details of it are shared here.

The correlation coefficient r measures the linear association between two variables. If the relationship between two variables is not linear, it would be misleading to interpret r. In Part 8 — Transforming Variables before Computing Correlations — we will handle nonlinear relationships and see how to reveal correlations more clearly by visualizing transformed data.

ref :

https://www.pythonpool.com/matplotlib-imshow/
