On Data Science-1: Does the “Women Entrepreneurship Index” Depend Upon the “Entrepreneurship Index”?

How I avoided some common errors to find a robust correlation

Ritik P. Nayak
The Startup

Dear audience,

Having published a series of notebooks that together form a beginner’s guide to Exploratory Data Analysis (EDA), I am in a better position to answer that question from an elementary perspective. This story walks the reader through the basic, and perhaps rather specific, ways of finding the correlation between two variables.

1. The Data:

The data comes from the “Women Entrepreneurship Index and Global Entrepreneurship Index report” published in 2015, and was shared on Kaggle by a generous fellow. I used it to make a series of tutorial notebooks that turned out to be some of the best I have ever produced. Though the dataset has a mere 51 rows, that was sufficient to prove our case and produce a worthwhile correlation tutorial. Please download/review the data using the following link:

2. Correlation: A common mistake:

As a frequent visitor to Kaggle, I come across several notebooks that employ correlation. It surprises me little that almost all of them use one common method for estimating the correlation. That is:

Step 1: Import the “Corr” functionality (in practice, the pandas library, whose DataFrame.corr() method does the work).

Step 2: Pass the values to the function.

Step 3: Print the results.
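The three steps above can be sketched roughly as follows. This is only an illustration: the column names and the numbers are made up for the example and are not the Kaggle dataset, and it assumes pandas is installed.

```python
import pandas as pd

# Made-up illustrative values, NOT the dataset discussed in this story.
df = pd.DataFrame({
    "entrepreneurship_index": [28.0, 35.5, 42.1, 55.3, 61.8, 70.2],
    "women_entrepreneurship_index": [30.1, 33.9, 45.0, 52.7, 63.4, 68.8],
    "country": ["A", "B", "C", "D", "E", "F"],  # non-numeric column
})

# Step 2: keep only the numeric columns and call corr() on them.
# With no arguments, corr() computes Pearson's coefficient.
corr_matrix = df.select_dtypes(include="number").corr()

# Step 3: print the square matrix of pairwise coefficients.
print(corr_matrix)
```

The same method also accepts method="spearman", which switches the calculation from Pearson's coefficient to Spearman's rank correlation.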

The output is a matrix covering all the variables/columns that hold numerical values. Each cell of the matrix contains a number between -1 and 1 that expresses the strength of the correlation. People take that for the “accurate” and “final” strength, assuming it to be the most precise estimator of the correlation. At this very point, the supposedly precise estimate might differ from the actual value.

Why?

a. By default, “Corr” (the pandas DataFrame.corr() method) uses “Pearson’s Correlation” to compute the strength of the correlation.

b. It is a competent method, but only in a limited sense.

c. It gives accurate results when:

One, there is a linear relationship between the 2 variables,

Two, there are few or no outliers in the data, and

Three, the variables are normally distributed.

How do we find out whether the relationship between two variables is linear?

a. We can discover that by using the scatter plots.

b. Scatter plots do not reveal much detail, yet they are the simplest method to begin with.

How do we find out whether the data is skewed and whether it is normally distributed?

a. For skewness, we can check the distribution of the data; a histogram is the most common way to do that.

I have made a notebook explaining skewness and its estimation on Kaggle. To find out, please visit the following link:

b. To see whether the data is normally distributed, we can plot its probability density function (PDF) and its cumulative distribution function (CDF).

I have made a notebook briefly explaining Cumulative distribution functions on Kaggle. To find out, please visit the following link:
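As a rough numerical companion to the histogram and CDF checks described above, here is a minimal sketch using only the Python standard library. The function names and the sample values are hypothetical and chosen for illustration; this is not the code from the notebooks.

```python
def sample_skewness(values):
    """Moment-based skewness: third central moment over (variance ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def empirical_cdf(values):
    """Pair each sorted value with the cumulative fraction of points seen so far."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]


symmetric = [1, 2, 3, 4, 5]          # made-up, perfectly symmetric
right_skewed = [1, 1, 2, 2, 3, 20]   # made-up, long tail to the right

print(sample_skewness(symmetric))     # close to zero for symmetric data
print(sample_skewness(right_skewed))  # positive for a right tail
print(empirical_cdf(right_skewed))    # points one could plot as a step curve
```

A value near zero suggests a roughly symmetric distribution, while a clearly positive or negative value points to a right or left tail. Plotting the pairs returned by empirical_cdf against the CDF of a normal distribution is one way to eyeball normality.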

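What if the relationship is nonlinear?

a. One should employ “Spearman’s Rank correlation” in such cases, provided the relationship is still monotonic (consistently rising or consistently falling).

b. Because it works on the ranks of the values rather than on the values themselves, it mitigates the effect of outliers and skewed data.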
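To see why ranks blunt the effect of an outlier, consider the following minimal sketch. It uses the textbook rank-difference formula, which is only valid when there are no tied values; the function names and the numbers are made up for illustration.

```python
def ranks(values):
    """Rank 1 for the smallest value; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_no_ties(x, y):
    """Spearman's rho as 1 - 6 * sum(d^2) / (n * (n^2 - 1)); no ties allowed."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


x = [1, 2, 3, 4, 5, 6]
y_outlier = [2, 4, 6, 8, 10, 1000]  # still rising, with one extreme value
y_reversed = [6, 5, 4, 3, 2, 1]     # perfectly falling

print(spearman_no_ties(x, y_outlier))   # 1.0
print(spearman_no_ties(x, y_reversed))  # -1.0
```

The extreme value 1000 is simply the largest rank, so it counts for no more than any other point.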

What if the data is not normally distributed?

a. We can take the log-transformed values of the variable that is not normally distributed (this requires the values to be strictly positive).

b. If neither variable is normally distributed, the log-transformation is to be applied to both.
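The effect of a log-transformation can be sketched on a small made-up, right-skewed sample. The helper names below are hypothetical, and the natural logarithm is an assumption; any base would reshape the data in the same way.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over (variance ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def log_transform(values):
    """Natural log of each value; only defined for strictly positive data."""
    if any(v <= 0 for v in values):
        raise ValueError("log-transformation needs strictly positive values")
    return [math.log(v) for v in values]


raw = [1, 2, 4, 8, 16, 32, 64]  # made-up values that double each step
logged = log_transform(raw)

print(skewness(raw))     # clearly positive: long right tail
print(skewness(logged))  # approximately zero: evenly spaced after the log
```

Real data will rarely become perfectly symmetric, so it is worth re-checking the histogram after the transformation.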

That said, there is one more method that I love: binning one variable and plotting the percentiles of the other within each bin. I will not dwell on this method, for it is of little relevance to the data I worked with.
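For readers curious about the binning idea, here is a minimal sketch under a few assumptions: equal-count bins, the median (the 50th percentile) as the only percentile, and made-up numbers. The function name is hypothetical, and it expects at least as many points as bins.

```python
from statistics import median


def binned_medians(x, y, n_bins):
    """Sort the pairs by x, cut them into n_bins equal-count bins,
    and return (mean x, median y) for each bin."""
    pairs = sorted(zip(x, y))
    size = len(pairs) // n_bins
    summary = []
    for b in range(n_bins):
        start = b * size
        # The last bin absorbs any leftover pairs.
        end = (b + 1) * size if b < n_bins - 1 else len(pairs)
        chunk = pairs[start:end]
        xs = [p[0] for p in chunk]
        ys = [p[1] for p in chunk]
        summary.append((sum(xs) / len(xs), median(ys)))
    return summary


x = [10, 20, 30, 40, 50, 60, 70, 80, 90]  # made-up values
y = [12, 18, 33, 41, 47, 66, 69, 85, 88]  # made-up values

for bin_center, y_median in binned_medians(x, y, n_bins=3):
    print(bin_center, y_median)
```

If the medians rise steadily from one bin to the next, the two variables move together. With very few points per bin, the summary becomes noisy.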

3. My methodology:

I used all of the aforesaid methods. Though my data was linearly correlated (as the scatter plot showed), the variables were not normally distributed. Therefore, I employed “Spearman’s method” as well.

I did not use the “Corr” library in the first place. Instead, I wrote my own algorithms to estimate the strength of the correlation using both “Pearson’s Correlation method” and “Spearman’s method”. The strength in both cases came out to be more than 0.9, which showed that the two variables were strongly correlated.
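For illustration, hand-written versions of the two coefficients might look like the sketch below. This is not the code from the notebook: the function names are hypothetical, the sample values are made up, and Spearman's coefficient is computed here as Pearson's coefficient applied to average ranks, which also handles tied values.

```python
import math


def pearson(x, y):
    """Pearson's r: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks of x and y."""
    return pearson(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]          # made-up values
y = [1, 8, 27, 64, 125]      # rising, but not along a straight line

print(pearson(x, y))   # below 1 because the relationship is curved
print(spearman(x, y))  # 1 (up to rounding) because the order is preserved
```

Comparing the two numbers on the same data is a quick way to spot a relationship that is monotonic but not linear.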

As one can discover by going through my notebook, I used the other method, binning the variables, as well. That, however, should not be taken into account for this dataset, for the data one has in this case is very limited, and the graph might be misleading if taken at face value.

4. Conclusions:

a. Though the outcome was much as expected, one should never stop thinking beyond what the data suggests. That is to say, correlation does not necessarily imply causation.

b. The strong correlation that exists between the 2 variables implies, to my knowledge, that as the entrepreneurship index in general increases, the women entrepreneurship index increases, and that with a decline in the former, the latter follows suit.

c. The two variables are close to directly proportional; going by the results, the opposite (an inverse relationship) does not hold much ground.

One can review the series of notebooks I made on this dataset, namely “Autumn of matriarch: The complete guide to EDA”, in my account. The link follows:

Yours, in beginners’ earnestness

Ritik Prakash Nayak


A student of B. Tech, CSE at Punjabi University Patiala, I write primarily on Data Science and Philosophy.