Statistical Analysis using Python

Gaurav Sharma
Published in Analytics Vidhya · 17 min read · Sep 16, 2020

If you have already visited Part 1 (EDA), you can jump straight to the Statistical Analysis section below.

This is a 3-part series in which I walk through a data-set, analyze it, and finish with predictive modelling. I highly recommend following the series in the order given below, but you can also jump to any part.

Part 1, Exploratory Data Analysis(EDA):
This part consists of summary statistics of the data, but the major focus is on EDA, where we extract information from the data using plots and report the important insights. This part is more about data analysis and business intelligence (BI).

Part 2, Statistical Analysis:
In this part we will run many statistical hypothesis tests, apply estimation statistics and interpret the results we get. We will also validate these results against the findings from Part 1. We will apply both parametric and non-parametric tests and report all the important insights. This part is all about data science and requires some statistical background.

Part 3, Predictive Modelling:
In this part we will predict a response using given predictors. This part is all about machine learning.

Meta-Data, Data about Data

I am using the auto mpg data for EDA taken from the UCI repository.

Title: Auto-Mpg Data
Number of Instances: 398
Number of Attributes: 9 including the class attribute
Attribute Information:

1. mpg — continuous
2. cylinders — multi-valued discrete
3. displacement — continuous
4. horsepower — continuous
5. weight — continuous
6. acceleration — continuous
7. model year — multi-valued discrete
8. origin — multi-valued discrete
9. car name — string (unique for each instance)

This data is not complex and is good for analysis as it has a nice blend of both categorical and numerical attributes.

This is Part 2, i.e., Statistical Analysis. I won't stretch this part too long; we will do the following things in sequence.

  1. Some pre-processing of the data, exactly the same as in Part 1 (EDA)
  2. Tests for independence between two categorical attributes
  3. Normality Test for numeric attributes
  4. Correlation between numeric attributes
  5. Parametric and Non-Parametric test for samples

I will make heavy use of hypothesis testing throughout the notebook, so it is also a handy reference for anyone looking to apply hypothesis testing in data science and machine learning.

Firstly, import all necessary libraries.
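A minimal import cell along these lines should cover what follows (the original notebook may import more; plotting and sklearn utilities are brought in later where they are used):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats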

The coming few cells involve data cleaning: dealing with missing values, duplicate data (if any) and aligning the data. I already covered this in Part 1, so you can skip ahead if you have already visited it.

We will first import the data into a pandas data-frame and inspect its properties.
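A sketch of the loading step, assuming a local copy of the UCI Auto-MPG data saved as auto-mpg.csv with '?' marking the missing horsepower values (the file name and format are assumptions, not taken from the original notebook):

# '?' is how the raw UCI file encodes missing horsepower values
df = pd.read_csv('auto-mpg.csv', na_values='?')
df.head()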


The data is in rectangular (tabular) form, with 398 entries, each having 9 distinct attributes.

To inspect meta-data (i.e., data about data), we can use an inbuilt pandas function.

df.info() describes many things about the data, like the data type of each column, memory usage, etc.

Now, I will make two distinct lists of categorical and numerical column names, as the analysis differs for the two types. For that I will inspect the datatype of each column: if it is of type object then it's categorical, else numerical.

I will use these two lists heavily throughout the analysis.
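A simple way to build the two lists from the dtypes, as described above:

# object columns are categorical, everything else is numeric
cats = [col for col in df.columns if df[col].dtype == 'object']
nums = [col for col in df.columns if df[col].dtype != 'object']
print('categorical:', cats)
print('numerical  :', nums)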

As there are very few unique values for cylinders and model_year, it's safe to treat them as categorical instead of numeric. This conversion will be helpful during analysis, as I will split some attributes on the basis of others.

The two lists should then be updated accordingly, for example:
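One way to do the conversion and refresh the lists (the original notebook may use pandas' category dtype instead of plain strings):

# treat the low-cardinality numeric columns as categorical
df['cylinders'] = df['cylinders'].astype(str)
df['model_year'] = df['model_year'].astype(str)

# rebuild the lists after the conversion
cats = [col for col in df.columns if df[col].dtype == 'object']
nums = [col for col in df.columns if df[col].dtype != 'object']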

Now, inspect the data for NaNs. I will check for NaNs column-wise.
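A column-wise check looks like this:

df.isna().sum()   # count of missing values per column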


The NaN-row proportion in the data is 6 / len(df) = 0.01507. So horsepower contains all 6 NaN rows, comprising around 1.5% of the data. As this fraction is very low, it's safe to drop the NaN rows for now.

Note: if the NaN proportion were large (more than 5%) then we wouldn't drop the rows but would instead impute the missing values, or even treat 'missing' as another category.

For now, remove all NaN rows as they are just 1.5% of the data.
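Dropping them is a one-liner:

df = df.dropna().reset_index(drop=True)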

Let's see how many duplicate entries there are and drop them if any exist.
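A quick check along these lines:

print(df.duplicated().sum())   # number of fully duplicated rows
df = df.drop_duplicates()      # a no-op here, since there are none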

So, there are no duplicate rows.

The coming code cell is already explained in depth in Part 1 (EDA), so please refer to it if anything is unclear.

Before we move ahead, it's good practice to group together all variables of the same type.


Statistical Analysis

Before moving on, we should have a good understanding of the various terms used in statistics; otherwise we will surely get lost while interpreting the results.

  • Population: The entire data, or all possible observations.
  • Sample: A subset of observations taken from the population. As the sample size increases, the sample represents the population more closely (Law of Large Numbers).
  • Parameters: Properties of the population which we are interested in; we never know their exact values unless we analyze the entire population (which is never the case), e.g., the mean (mu).
  • Estimates: The sample's values for the population parameters. A central goal of statistics is to make these sample estimates as close as possible to the population parameters, e.g., the sample average (x bar) is the best possible estimate of the mean.
  • Descriptive Statistics: Methods for summarizing data.
  • Inferential Statistics: Methods for drawing conclusions about the population from samples, e.g., estimating the population mean using the sample average.
  • Parametric Statistics: Statistical methods where we assume a distribution for the data, such as Gaussian.
  • Non-Parametric Statistics: Statistical methods where we do not assume any distribution for the data, i.e., distribution-free.
  • Statistical Hypothesis Tests: Methods that quantify the likelihood of observing the result given an assumption or expectation about the result. We will talk about this more later.
  • Estimation Statistics: Methods that quantify the uncertainty of a result using confidence intervals.

Statistical Hypothesis Tests

The idea of statistical hypothesis tests is simple and straightforward. We first assume something about the data, e.g., that two samples have the same mean. Then we find the likelihood of observing the given data assuming this assumption is true. If the likelihood is close to zero we reject the assumption; if the likelihood is greater than some threshold (set by us) we fail to reject the assumption.

In statistics lingo, the assumption is called a hypothesis, the likelihood we get is called the p-value, the threshold we set comes in two forms (a level of significance or a critical value), and the procedure we use is called a statistical hypothesis test.

So if the likelihood we get is very close to zero, that means that, assuming the hypothesis is true, the data we observed is very unlikely, which suggests there is something wrong with our assumption. In the two-sample means example, if the resulting p-value is very close to zero, then, assuming the two samples have the same mean, the data in hand was very unlikely to be generated; hence there is something wrong with our assumption and we reject it.

Note: all of this is probabilistic, so we will sometimes make mistakes; these mistakes have well-known names, namely False Positive (Type I error) and False Negative (Type II error).

Hypothesis

There are two types of hypothesis, namely:

Null Hypothesis, H_0 — A null hypothesis proposes that no significant difference exists in a set of given observations.
Alternate Hypothesis, H_1 — An alternate hypothesis proposes that a significant difference exists in a set of given observations.

For the purpose of these tests in general,

H_0: Variable A and Variable B are independent
H_1: Variable A and Variable B are not independent.

Note: H_0 and H_1 are complements of each other.

p-value

It's the probability of observing data at least as extreme as ours, assuming the null hypothesis of the statistical test is true.

The statistical significance of any finding is assessed by interpreting the p-value. The p-value tells us whether our findings reflect some real effect or are just random fluctuations.

  • p-value ≤ alpha: significant result, reject null hypothesis.
  • p-value > alpha: not significant result, fail to reject the null hypothesis.

A p-value can be calculated from a test statistic by retrieving the probability from the test statistic's cumulative distribution function (CDF).

Some tests return a test statistic value from a specific data distribution that can be interpreted in the context of critical values. A critical value is a value from the distribution of the test statistic after which point the result is significant and the null hypothesis can be rejected.

  • Test Statistic < Critical Value: not significant result, fail to reject null hypothesis.
  • Test Statistic ≥ Critical Value: significant result, reject null hypothesis.

Note: the most common significance level used throughout data science and ML is 5%, i.e., alpha=0.05, and we will use this same value throughout this notebook.

Refer to this decision tree for help choosing the right test.

Throughout the blog I assume alpha = 0.05:

ALPHA = 0.05

Tests for independence between two categorical variables

Pearson’s Chi-square test

The Chi-square statistic is a non-parametric statistical tool designed to analyze group differences when the dependent variable is measured at a nominal level (ordinal data can also be used). It is commonly used to compare observed data with the data we would expect to obtain under a specific hypothesis.

The test statistic is

chi^2 = sum over all cells of (O - E)^2 / E

where,
O : Observed count (the actual count of cases in each cell of the table)
E : Expected count under independence (row total × column total / grand total)

Assumptions
1. The test becomes invalid if any of the expected values are below 5
2. The p-value calculated is not exact but approximate, and converges to the exact value as the data grows (so it is not good for small sample sizes)
3. The number of observations must be 20+

So, if the expected cell counts are small, it is better to use an exact test, as the chi-squared test is no longer a good approximation in such cases. To overcome this we will use Fisher's exact test.

Fisher’s exact test

Fisher's exact test is used to determine whether there is a significant association between two categorical variables in a contingency table. It is an alternative to Pearson's chi-squared test for independence. While actually valid for all sample sizes, Fisher's exact test is practically applied when sample sizes are small. A general recommendation is to use Fisher's exact test, instead of the chi-squared test, whenever more than 20% of cells in a contingency table have expected frequencies < 5.
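As a quick illustration, scipy exposes the test as stats.fisher_exact, but only for 2x2 tables; the toy table below is made up for demonstration, and larger tables need other tools (e.g., R's fisher.test or Monte Carlo variants):

# a hypothetical 2x2 contingency table
table = [[8, 2],
         [1, 5]]
odds_ratio, p = stats.fisher_exact(table)
print(f'odds ratio = {odds_ratio:.3f}, p-value = {p:.4f}')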


So the chi2 assumption failed for every pair. It's not that we can't apply the test, we can, but the results would not be reliable. However, the contingency table of origin and model_year is still worth trying, as most of its expected counts are >= 5.

H_0: origin and model_year are independent, alpha=0.05

We will use the chi2_contingency function of scipy.
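A sketch of the test on the origin × model_year contingency table:

table = pd.crosstab(df['origin'], df['model_year'])
chi2_stat, p, dof, expected = stats.chi2_contingency(table)
print(f'chi2 = {chi2_stat:.2f}, p-value = {p:.4f}, dof = {dof}')
if p <= ALPHA:
    print('Reject H_0: origin and model_year are not independent')
else:
    print('Fail to reject H_0: no evidence against independence')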

For more information, execute help(stats.chi2_contingency).

scikit-learn also has a chi2 test available; let's use it to test the dependency of all categorical attributes with mpg_level.
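A rough sketch of how that can look; sklearn's chi2 needs a non-negative numeric matrix, so the categorical columns are ordinal-encoded first (the exact column names, including mpg_level created in Part 1, are assumptions):

from sklearn.feature_selection import chi2
from sklearn.preprocessing import OrdinalEncoder

X_cols = [c for c in cats if c not in ('mpg_level', 'car_name')]
X = OrdinalEncoder().fit_transform(df[X_cols])   # non-negative integer codes
y = df['mpg_level']

chi2_stats, p_values = chi2(X, y)
pd.DataFrame({'attribute': X_cols, 'chi2': chi2_stats, 'p_value': p_values})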


Statistical Tests for Numerical Attributes

Following are the numerical/continuous attributes in our data-set.

nums = ['mpg', 'displacement', 'horsepower', 'weight', 'acceleration']

Normality Test

I divided the normality test into two parts:

1. Visual Normality Checks

We will visually check for normality using:

  1. Histogram
  2. Quantile-Quantile plot

2. Statistical Normality Tests

There are three statistical tests for checking the normality of data:

  1. Shapiro-Wilk Test (only for gaussian distribution)
  2. D’Agostino’s K 2 Test (only for gaussian distribution)
  3. Anderson-Darling Test (for many other distributions as well)

I will use the Shapiro-Wilk test, but you can try the others as well.

Visual Normality Checks

From the distributions generated in Part 1 (EDA) we can clearly see that acceleration is gaussian, while mpg and weight are right-skewed, or maybe log-normal.


A log-normal distribution is a distribution of a random variable whose logarithm is normally distributed. Thus, if the random variable X is log-normally distributed, then Y = ln(X) has a normal distribution.

We will check whether mpg and weight are log-normal or not.


So after applying the log transformation we find that weight is not log-normal, but mpg visually looks log-normal.

Let's check for normality using quantile-quantile plots. Below are the quantile-quantile plots on the original data.
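One way to draw them with scipy (the original notebook may use statsmodels' qqplot instead):

fig, axes = plt.subplots(1, len(nums), figsize=(4 * len(nums), 4))
for ax, col in zip(axes, nums):
    stats.probplot(df[col], dist='norm', plot=ax)   # Q-Q plot against a normal
    ax.set_title(col)
plt.tight_layout()
plt.show()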


So, both the histogram and the Q-Q plot of acceleration indicate that it is indeed close to gaussian.

Statistical Normality Tests

We will do hypothesis testing for the normality of the numerical attributes using the Shapiro-Wilk test.

H_0: Data is drawn from normal distribution, alpha=0.05

Let's define a function which tests the null hypothesis for the columns of the data-frame at significance level alpha.
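A minimal version of such a function might look like this:

def normality_report(data, columns, alpha=ALPHA):
    """Shapiro-Wilk test per column; H_0: the column is drawn from a normal distribution."""
    for col in columns:
        stat, p = stats.shapiro(data[col])
        verdict = 'fail to reject H_0 (looks gaussian)' if p > alpha else 'reject H_0 (not gaussian)'
        print(f'{col:15s} W = {stat:.3f}, p = {p:.4f} -> {verdict}')

normality_report(df, nums)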

Oops, we expected acceleration to be normally distributed, but our test rejected it. The p-value is 0.03, so we rejected H_0 at the 5% significance level, but at 2.5% we would have failed to reject the null hypothesis. We won't change the threshold now, otherwise it would be p-hacking. One possible reason for the rejection of H_0 may be that our data is not scaled; I think scaling it will help.

We will now apply a power transform to make the data more Gaussian-like, and after that check for normality on the transformed data.

Power Transform: it transforms the data feature-wise to make it more Gaussian-like. Power transforms are a family of parametric, monotonic transformations applied to make data more Gaussian-like. This is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired, just like here.
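A sketch with sklearn's PowerTransformer (Yeo-Johnson by default, with standardization turned on):

from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer()                       # method='yeo-johnson', standardize=True
df_trans = pd.DataFrame(pt.fit_transform(df[nums]), columns=nums)
df_trans.describe()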


Note: after removing NaNs from the data-frame, the power transformer turns the entire weight column into 0. I am unable to find the reason for this; I have asked the sklearn community and will update the notebook once I figure it out. If you spot the reason, please comment.

For now we will leave weight aside.


The power transform does two things: it scales the data, so that it is now centered at 0, and it makes the distribution more Gaussian-like while preserving the original structure. It does so by applying transformations like square-root, log, etc.

acceleration is still gaussian, and the skewness has been removed from mpg & weight, making mpg gaussian-like. The distribution of displacement has also improved; it is now bimodal, which matches our earlier observation.

One thing you may have noticed is that after applying the power transform, the distributions of mpg & weight are quite similar to what we got by applying the log transform. In fact, sklearn's power transform can indeed reduce to a log transform, refer this.

Let's now check for normality using quantile-quantile plots. Below are the quantile-quantile plots on the transformed data.


Indeed, after normalizing the data, the likelihood of observing acceleration under the normality assumption is much higher than before.

So, acceleration is normally distributed both visually and statistically.

Un-normalized data can also sometimes lead to wrong insights. For example,

In Part 1 (EDA) we plotted the relationships, and from those plots it seemed that although all the attributes have a monotonic relation with mpg, the relations did not look exactly linear.

[Scatter plots of mpg against the other numerical attributes, original data]

But in the transformed data all the relations look essentially linear.

[Scatter plots of mpg against the other numerical attributes, power-transformed data]

Note: almost every relation is homoscedastic, but acceleration looks a bit more heteroscedastic.

Tests for correlation between two continuous variables

Covariance

Covariance measures the joint variability of two variables: cov(X, Y) = E[(X - mu_X)(Y - mu_Y)]. The use of the means in the calculation suggests the need for each data sample to have a Gaussian or Gaussian-like distribution, hence it is a parametric statistic. It is also hard to interpret because it can take any value.

Linear Association (Pearson’s Correlation)

The Pearson correlation coefficient is just the covariance between the two variables normalized by their standard deviations, r = cov(X, Y) / (sigma_X * sigma_Y), giving an interpretable score such that -1 <= r <= +1.

It can be used to summarize the strength of the linear relationship between two data samples. The use of mean and standard deviation in the calculation suggests the need for the two data samples to have a Gaussian or Gaussian-like distribution hence it’s a parametric statistic.

As a statistical hypothesis test, the method assumes that the samples are uncorrelated (fail to reject H_0).

Assumptions of pearson correlation:
1. Both variables should have a Gaussian or Gaussian-like distribution.
2. Relationship between the variables should be linear.
3. Homoscedasticity, i.e., the spread of one variable is roughly constant across the range of the other (all observations share the same finite variance).

Also Pearson is quite sensitive to outliers.

Two variables may be related by a nonlinear relationship, such that the relationship is stronger or weaker across the distribution of the variables. In these cases, even when variables have a strong association, Pearson’s correlation would be low. Further, the two variables being considered may have a non-Gaussian distribution. To properly identify association between variables with non-linear relationships, we can use rank-based correlation approaches.

Ordinal Association (Rank correlation)

Rank correlation refers to methods that quantify the association between variables using the ordinal relationship between the values rather than the specific values. We first sort the data in ascending order, assign integer ranks to the values, and then use the ranks to find the correlation between the variables. Because no distribution is assumed for the values, rank correlation methods are referred to as distribution-free or non-parametric correlation.

Four types of rank correlation methods are as follows:

1. Spearman’s Rank Correlation

Spearman's correlation is a non-parametric rank correlation and is also easy to interpret because, like Pearson's r, it is bounded between -1 and +1.

Here, instead of calculating the coefficient from the covariance and standard deviations of the samples themselves, these statistics are calculated after converting the raw data into ranks, hence it is non-parametric. This is a common approach in non-parametric statistics.

As a statistical hypothesis test, the method assumes that the samples are uncorrelated (fail to reject H_0).

2. Kendall’s Rank Correlation

The intuition for the test is that it calculates a normalized score for the number of matching or concordant rankings between the two samples. As such, the test is also referred to as Kendall’s concordance test.

As a statistical hypothesis test, the method assumes that the samples are uncorrelated (fail to reject H_0).

3. Goodman and Kruskal’s Rank Correlation

4. Somers’ Rank Correlation

Types of Correlation:
Positive: both variables change in the same direction.
Neutral: no relationship in the change of the variables.
Negative: variables change in opposite directions.

A statistical test only tells us the likelihood of an effect; it doesn't tell us the size of the effect. The results of an experiment could be significant but the effect so small that it has little consequence, or the result could be insignificant while the effect is large.

Effect size: it is the size or magnitude of an effect or result as it would be expected to occur in a population. Unlike significance tests, which only tell us how likely the effect is, the effect size tells us the actual magnitude of the effect, so it gives us more information.

We will find the effect size for the relation of mpg with the other numerical features, i.e., we will get an actual value, rather than just a likelihood, quantifying how much correlation there is between mpg and each of the other numerical features.

All the above association tests give not only the effect size but also a p-value, so we can look at both. We will use Spearman, but you can use any of the others except Pearson, because not all of the numerical variables satisfy Pearson's assumptions.

So for every correlation test between mpg and another attribute, our null hypothesis will be:

H_0: mpg and other attribute are not correlated, alpha=0.05
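A sketch of the pair-wise Spearman tests:

for col in ['displacement', 'horsepower', 'weight', 'acceleration']:
    rho, p = stats.spearmanr(df['mpg'], df[col])
    verdict = 'reject H_0' if p <= ALPHA else 'fail to reject H_0'
    print(f'mpg vs {col:12s} rho = {rho:+.3f}, p = {p:.2e} -> {verdict}')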

So all the H_0 are rejected at the 5% significance level. Except for acceleration, all the other correlations are very high, which is also evident from our previous plots.

We now create a data-frame of the correlations between every pair of numerical attributes.
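pandas makes this a one-liner (note that .corr reports only the coefficients, not the p-values):

corr = df[nums].corr(method='spearman')
corr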


The correlation for the pairs (mpg, acceleration), (displacement, acceleration) and (weight, acceleration) is moderate, whereas all the remaining pairs have very high correlation.

We will now test whether two samples have the same mean or not. For this we have two families of significance tests, for two different conditions.

Parametric Statistical Significance Tests

  1. Student's t-test — tests whether two independent, normally distributed samples have the same mean.
  2. Analysis of Variance Test (ANOVA) — tests whether two or more independent, normally distributed samples have the same mean.

ANOVA is the same as the t-test but for more than 2 samples, so we can either apply the t-test pair-wise or apply ANOVA once. Also, ANOVA only tells us whether all the samples share the same mean or not; it doesn't tell us which samples differ or by how much.

Non-Parametric Statistical Significance Tests

  1. Mann-Whitney U Test — Non-parametric equivalent of Student’s t-test.
  2. Kruskal-Wallis H — Non-parametric equivalent of ANOVA (it’s for median).

We will apply the appropriate test depending on the samples, i.e., parametric tests if the samples are normally distributed, otherwise non-parametric tests.

Let's test whether acceleration in japan and usa has the same mean. First we check whether acceleration for both japan and usa is normally distributed, and then apply the applicable test.

Both are normally distributed, so we can apply a parametric test.

H_0: acceleration of japan and acceleration of usa have the same mean, alpha=0.05

Because the variance is not the same for the two distributions, we set equal_var=False (Welch's t-test).
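A sketch of the test, assuming origin has already been mapped to region names in the Part 1 cleaning step:

acc_japan = df.loc[df['origin'] == 'japan', 'acceleration']
acc_usa = df.loc[df['origin'] == 'usa', 'acceleration']

# Welch's t-test, since the two groups have unequal variances
t_stat, p = stats.ttest_ind(acc_japan, acc_usa, equal_var=False)
print(f't = {t_stat:.3f}, p-value = {p:.4f}')
print('Reject H_0' if p <= ALPHA else 'Fail to reject H_0')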

Let's test whether horsepower has the same distribution across all the regions.

The samples are not all normally distributed, so we will apply a non-parametric test.

H_0: Sample distributions are equal for horsepower across region, alpha=0.05
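A sketch using the Kruskal-Wallis H test across the origin groups:

groups = [grp['horsepower'].values for _, grp in df.groupby('origin')]
h_stat, p = stats.kruskal(*groups)
print(f'H = {h_stat:.3f}, p-value = {p:.4f}')
print('Reject H_0' if p <= ALPHA else 'Fail to reject H_0')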

Test whether acceleration has the same distribution for the samples with mpg_level high and medium.
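If these two groups fail the normality check, the Mann-Whitney U test is the natural choice; a sketch (the mpg_level column and its labels come from Part 1 and are assumptions here):

acc_high = df.loc[df['mpg_level'] == 'high', 'acceleration']
acc_medium = df.loc[df['mpg_level'] == 'medium', 'acceleration']

u_stat, p = stats.mannwhitneyu(acc_high, acc_medium, alternative='two-sided')
print(f'U = {u_stat:.3f}, p-value = {p:.4f}')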

Test for mpg distribution across the years.

Relation between Categorical and Continuous attributes

We will use the feature_selection API of sklearn for this.
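A rough sketch using the ANOVA F-test and mutual information between the numerical features and the categorical mpg_level target (the exact feature_selection functions used in the original notebook may differ):

from sklearn.feature_selection import f_classif, mutual_info_classif

X = df[nums]
y = df['mpg_level']          # categorical target created in Part 1 (assumed name)

f_stats, p_values = f_classif(X, y)
mi = mutual_info_classif(X, y, random_state=0)
pd.DataFrame({'feature': nums, 'F': f_stats, 'p_value': p_values, 'mutual_info': mi})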


So we are done for now. We did some good statistical analysis and explored various statistical tests provided by scipy and scikit-learn. We did a lot, and it could be extended further, but you get the idea.

The next part is predictive modelling, which is mostly about machine learning. As machine learning is itself a big topic, it's tough to cover in a single blog. You can refer to this repository of mine; it's the continuation of this series and includes feature engineering, cross-validation, losses, metrics, testing different pipelines, hyper-parameter tuning and AutoML as well. There is a lot to learn in that repository.

You can get the entire documented Jupyter notebook for this blog from here; you just need to fork it. Also, if you like the notebook then up-vote it, it motivates me to create further quality content.

If you like this story then do clap for it and share it with others.

Also, have a read of my other stories, which cover a variety of topics.

Thank you once again for reading my stories, my friends :)
