Intro to Exploratory Data Analysis (EDA) with Python

Odeh Sebastine
Published in The Startup
6 min read · May 5, 2020

Prerequisites

  • Basic knowledge of Python
  • Basic knowledge of the NumPy and pandas libraries

Table of contents

  • Importing the data
  • Descriptive Statistical Analysis
  • Correlation and Causation
  • ANOVA

Importing the data

In Python, the libraries used for importing the data we are going to analyse are NumPy (Python's numerical library) and pandas (a library built on NumPy that makes working with datasets easier and visually more appealing). The libraries are imported as shown below.

importing the necessary libraries for working with data
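
A minimal equivalent of that import cell (np and pd are just the conventional aliases) would be:

```python
# NumPy for numerical operations, pandas for working with tabular data
import numpy as np
import pandas as pd
```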

Jupyter notebooks are our favorite tool for data analysis since they allow us to document our work while performing the analysis. The dataset used in this article can be downloaded from

https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/automobileEDA.csv

First, we store the path to the file in a variable before applying the pandas function for reading CSV files, that is, pd.read_csv().

The dataframe.head() method shows us the first 5 rows of the imported data set; this is used to check that what we imported is correct.
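
A minimal sketch of those two steps, assuming we call the resulting DataFrame df:

```python
# store the location of the CSV file in a variable, then read it into a DataFrame
path = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/automobileEDA.csv"
df = pd.read_csv(path)

# preview the first 5 rows to confirm the data was imported correctly
df.head()
```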

Descriptive Statistical Analysis

First, we look at the describe function, which computes basic statistics for all the continuous variables in the data set.

The describe function shows you:

  • the count of that variable
  • the mean
  • the standard deviation
  • the maximum and minimum values
  • the quartiles (the 25%, 50%, and 75% percentiles), from which the IQR (interquartile range) is derived

We apply the describe method as follows:
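
Assuming the DataFrame is named df as above:

```python
# basic statistics for all numerical (continuous) columns
df.describe()
```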

the result you get should look like this

  • count: the number of values for the variable; in the table it’s 201. The count tells us how many entries we have in the data set.
  • mean: the average value of the variable.
  • std: standard deviation, a quantity expressing by how much the members of a group differ from the mean value for the group. Simply put, it is a measure of how spread out the numbers are.
  • min: minimum value of the variable
  • 25%, 50%, 75%: the quartiles, that is, the value of the variable at the 25th, 50th, and 75th percentiles.
  • max: maximum value of the variable.

Without an argument, the describe method only gives basic statistics for numerical variables. To get statistics for categorical variables, the include=['object'] argument is used.

the line of code to get statistics on categorical variables
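
In code, that looks like this (still assuming the DataFrame is called df):

```python
# include the categorical (object-typed) columns in the summary
df.describe(include=['object'])
```
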
notice the new information on the table?

We notice new information in our statistics table: unique, top, and freq.

  • unique: this refers to the number of unique entries we have in a particular series (field); for example, the unique count for make is 22.
  • top: the highest occurring value in the series (the mode).
  • freq: the number of times the highest occurring value appears, i.e. the frequency of the mode.

A useful method to know is the value_counts method, which is a good way of understanding how many units of each characteristic/variable we have; that is, it shows us the number of times each value appears. Note that value_counts works on a pandas Series and not a pandas DataFrame, so when using the method we select the column with a single square bracket (which returns a pandas Series) and not two.

the value-counts method
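
For example, counting the drive-wheels categories discussed below could look like this:

```python
# value_counts works on a Series, so we select the column with single brackets
df['drive-wheels'].value_counts()
```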

The output shows the number of units we have for front-wheel drive (fwd), rear-wheel drive (rwd), and four-wheel drive (4wd).

Correlation and Causation

  • Correlation: a measure of the extent of interdependence between variables.
  • Causation: the relationship of cause and effect between variables.

It is important to note that correlation doesn’t imply causation. Determining correlation is much simpler because it doesn’t require independent experimentation.

In descriptive analysis, the Pearson correlation can be used to determine the correlation between two variables X and Y. The result varies from -1 to 1, where:

  • 1: total positive linear correlation
  • 0: no correlation
  • -1: total negative linear correlation

The pandas method .corr() can be used to find the correlation.

the pandas .corr() method
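
Applied to our DataFrame, a minimal example would be:

```python
# pairwise Pearson correlation of the numeric columns
# (on recent pandas versions you may need df.corr(numeric_only=True))
df.corr()
```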

P-value: the p-value is the probability of observing the measured correlation by chance if the variables were actually uncorrelated, so a small p-value indicates that the correlation is statistically significant. As a standard we use a significance level of 0.05, which means we are 95% confident that the correlation between the variables is significant. The significance levels are ranked below.

  • p-value < 0.001 : strong evidence that the correlation is significant
  • p-value < 0.05: moderate significance
  • p-value < 0.1 : weak significance
  • p-value > 0.1 : there is no evidence that the correlation is significant

We find the p-value between variables by using the Pearson correlation function from SciPy (a Python library for scientific computing). The function returns the Pearson coefficient and the p-value for the pair of variables.

The Pearson correlation function is used after importing stats from the SciPy library. First we calculate the Pearson correlation between wheel-base and price from the data set we imported earlier.
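
A sketch of that calculation with the same df as before:

```python
from scipy import stats

# Pearson correlation coefficient and p-value between wheel-base and price
pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])
print("Pearson coefficient:", pearson_coef, "p-value:", p_value)
```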

the output

Conclusion: since the Pearson coefficient is 0.5846, there is a positive linear correlation, though not a particularly strong one, and since the p-value is < 0.001, the correlation is statistically significant.

ANOVA

ANOVA stands for Analysis of Variance, a collection of statistical models and their associated estimation procedures (such as the “variation” among and between groups) used to analyse the differences among group means in a data set. In simpler terms, it is a method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two results:

  • F-test score: the ratio of two variances. ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from that assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.
  • p-value: the p-value tells us how statistically significant our score is

With our data set, if price is strongly correlated with the variable we are analyzing, ANOVA returns a sizeable (large) F-test score and a small p-value.

Drive wheels test

Since ANOVA analyzes the difference between different groups of the same variable, the groupby function will come in handy. Because the ANOVA algorithm averages the data automatically, we do not need to take the average beforehand.

Let’s see if different types of ‘drive-wheels’ impact ‘price’. We group the data:
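
A minimal sketch of the grouping step (the name grouped is just a choice for this example):

```python
# keep only the two columns of interest and group the prices by drive-wheels type
grouped = df[['drive-wheels', 'price']].groupby('drive-wheels')

# peek at the first couple of rows in each group
grouped.head(2)
```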

which results in

We can obtain the values for each group using the “get_group” method.
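
For example, pulling out the prices of the 4wd group from the grouped object above:

```python
# prices for the 4-wheel-drive cars only
grouped.get_group('4wd')['price']
```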

We use the f_oneway function from the stats module of the SciPy library to get the F-test score and p-value.
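
A minimal sketch of the test across the three drive-wheels groups, reusing the grouped object from above:

```python
from scipy import stats

# one-way ANOVA: do fwd, rwd and 4wd have the same mean price?
f_val, p_val = stats.f_oneway(
    grouped.get_group('fwd')['price'],
    grouped.get_group('rwd')['price'],
    grouped.get_group('4wd')['price'],
)
print("ANOVA results: F =", f_val, ", p =", p_val)
```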

The ANOVA results show a large F-test score, meaning drive-wheels and price are strongly correlated, and a p-value of < 0.001, meaning the result is statistically significant. However, even though the overall results are good, that doesn’t mean all three tested groups are equally highly correlated with price.

In conclusion:

  • We’ve learnt how to interpret the statistics table when using the describe function
  • We understand how to interpret correlation, what it means when analyzing data, and why it doesn’t imply causation.
  • We understand the basics of ANOVA and what the parameters it returns mean.
