Exploratory Data Analysis (EDA)
This is a three-part series in which I will walk through a data-set, analyzing it and then, at the end, doing predictive modelling. I recommend following the parts in sequential order, but you can jump to any part.
Part 1, Exploratory Data Analysis (EDA):
This part consists of summary statistics of the data, but the major focus will be on EDA, where we extract meaning/information from the data using plots and report important insights. This part is more about data analysis and business intelligence (BI).
Part 2, Statistical Analysis:
In this part we will run several statistical hypothesis tests, apply estimation statistics, and interpret the results we get. We will also validate these against the findings from part one. We will apply both parametric and non-parametric tests, and report all the important insights we find. This part is all about data science and benefits from some statistical background.
Part 3, Predictive Modelling:
In this part we will predict a response using given predictors. This part is all about machine learning.
Meta-Data, Data about Data
I am using the auto mpg data for EDA taken from the UCI repository.
Title: Auto-Mpg Data
Number of Instances: 398
Number of Attributes: 9 including the class attribute
Attribute Information:
1. mpg — continuous
2. cylinders — multi-valued discrete
3. displacement — continuous
4. horsepower — continuous
5. weight — continuous
6. acceleration — continuous
7. model year — multi-valued discrete
8. origin — multi-valued discrete
9. car name — string (unique for each instance)
This data is not complex and is good for analysis, as it has a nice blend of both categorical and numerical attributes.
This is part 1, i.e., EDA. I won’t stretch this part too long and will do the following things in sequential order.
- Some pre-processing of the data; this includes dealing with missing values and duplicate data (if any), and then aligning the data.
- EDA on categorical attributes; this includes analyzing their distributions and relations with other cat. (categorical) attributes.
- EDA on numerical attributes; this includes analyzing their distributions and relations with other num. (continuous/numerical) attributes.
- Then we will analyze the relation between numerical & categorical attributes.
I will use seaborn heavily throughout the notebook, so it is also a good go-to notebook for those looking for EDA using seaborn.
Firstly, import all necessary libraries.
We will first import the data into a pandas data-frame and inspect its properties.
The data is in rectangular (tabular) form, with 398 entries, each having 9 distinct attributes.
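As a minimal, self-contained sketch of this step (the notebook reads the actual UCI file; the handful of rows below are made up purely to mirror the nine attributes):

```python
import numpy as np
import pandas as pd

# A tiny made-up sample mirroring the dataset's nine attributes; the real
# notebook loads the full UCI file instead, e.g. with pd.read_csv(...).
df = pd.DataFrame({
    "mpg": [18.0, 27.0, 31.0, 14.0],
    "cylinders": [8, 4, 4, 8],
    "displacement": [307.0, 97.0, 79.0, 454.0],
    "horsepower": [130.0, 88.0, np.nan, 220.0],
    "weight": [3504, 2130, 1825, 4354],
    "acceleration": [12.0, 14.5, 12.2, 9.5],
    "model_year": [70, 71, 74, 70],
    "origin": ["usa", "japan", "europe", "usa"],
    "name": ["chevrolet chevelle malibu", "datsun pl510",
             "fiat 128", "chevrolet impala"],
})

df.info()  # prints dtypes, non-null counts and memory usage
```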
To inspect meta-data (i.e., data about data), we can use the inbuilt pandas function df.info(), which describes many things about the data, like the data type of each column, memory usage, etc.
Now, I will make two distinct lists for categorical and numerical column names, as the analysis differs between the two types. For that I will inspect the datatype of each column: if it is of type object, then it's categorical, else numerical.
I will use these two lists heavily throughout the analysis.
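The split can be sketched as below (the two-row frame is a made-up stand-in for the full data):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the full dataset.
df = pd.DataFrame({
    "mpg": [18.0, 27.0],
    "cylinders": [8, 4],
    "origin": ["usa", "japan"],
    "name": ["chevrolet chevelle", "datsun pl510"],
})

# object-dtype columns -> categorical, everything else -> numerical
cat_columns = [c for c in df.columns if df[c].dtype == object]
num_columns = [c for c in df.columns if c not in cat_columns]

print(cat_columns)  # ['origin', 'name']
print(num_columns)  # ['mpg', 'cylinders']
```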
Let’s see how many unique values there are in each column.
As there are very few unique values for cylinders and model_year, it’s safe to make them categorical instead of numeric. This conversion will be helpful during the analysis, as I will be bifurcating some attributes on the basis of others.
So, the lists need to be updated.
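One way to do both the conversion and the list update in a single pass (a sketch on a toy frame):

```python
import pandas as pd

# Toy stand-in for the full data-frame.
df = pd.DataFrame({
    "mpg": [18.0, 27.0, 31.0],
    "cylinders": [8, 4, 4],
    "model_year": [70, 71, 74],
})
num_columns = ["mpg", "cylinders", "model_year"]
cat_columns = []

# Treat the few-valued discrete columns as categorical and move them
# from the numerical list to the categorical list.
for col in ["cylinders", "model_year"]:
    df[col] = df[col].astype("category")
    num_columns.remove(col)
    cat_columns.append(col)

print(num_columns)  # ['mpg']
print(cat_columns)  # ['cylinders', 'model_year']
```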
Now, inspect for NaNs in the data. I will check for NaNs column-wise.
The NaN-row proportion in the data is 6 / len(df) = 0.01507. horsepower contains all 6 NaN rows, comprising around 1.5% of the data. As this fraction is very low, it’s safe to drop the NaN rows for now.
Note: If the NaN proportion were large (more than 5%), we wouldn’t drop the rows but would instead impute the missing values, or could even treat missingness as another category.
For now, remove all NaN rows, as they are just 1.5% of the data.
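The column-wise check and the drop can be sketched as follows (toy data again, so the fraction here is 0.25 rather than the real data's ~0.015):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mpg": [18.0, 27.0, 31.0, 14.0],
    "horsepower": [130.0, np.nan, 88.0, 220.0],
})

print(df.isna().sum())            # column-wise NaN counts
nan_fraction = df.isna().any(axis=1).mean()
print(nan_fraction)               # 0.25 in this toy frame

# The fraction is small, so drop the offending rows and re-index.
df = df.dropna().reset_index(drop=True)
print(len(df))  # 3
```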
Let’s see how many duplicate entries there are, and drop them if there are any.
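A sketch of the duplicate check (the toy frame below deliberately contains one duplicate; the real data has none):

```python
import pandas as pd

df = pd.DataFrame({
    "mpg": [18.0, 27.0, 27.0],
    "origin": ["usa", "japan", "japan"],
})

dup_count = df.duplicated().sum()
print(dup_count)  # 1 duplicate in this toy frame
df = df.drop_duplicates().reset_index(drop=True)
```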
So, there are no duplicate rows.
Before we move ahead, it’s good practice to group together all variables having the same type.
Now we are all good to go for some in-depth analysis.
Analysis on Categorical Attributes
The analysis includes both descriptive stats and EDA.
I will first slice out the categorical columns from the original data-frame and then do the analysis on that slice, keeping the original data untouched. At the end I will incorporate the needed changes into the original data-frame.
As origin and name consist of text data, they need some pre-processing. We will remove all extra spaces from each string; otherwise the same string with different spacing would be treated as different categories, which should not be the case.
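The space clean-up can be sketched like this (toy strings with deliberately inconsistent spacing):

```python
import pandas as pd

df_cat = pd.DataFrame({
    "origin": [" usa", "usa ", "japan"],
    "name": ["ford  torino", "chevrolet impala", " fiat 128"],
})

# Strip leading/trailing spaces and collapse internal runs of spaces,
# so " usa", "usa " and "usa" all become the same category.
for col in ["origin", "name"]:
    df_cat[col] = df_cat[col].str.strip().str.replace(r"\s+", " ", regex=True)

print(df_cat["origin"].nunique())  # 2 (usa, japan)
```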
I will create an artificial categorical attribute named mpg_level which categorizes mpg into low, medium and high. This is done for two reasons: first, it will help a lot in EDA, i.e., I can bifurcate plots on the basis of mpg; and second, it is easier to understand than raw numbers.
I am dividing mpg into three regions:
[min, 17) -> low
[17, 29) -> medium
[29, max) -> high
Also, the choice of ranges is an analytical one and could be anything, as long as it seems reasonable.
Note: This is feature-engineering and mostly done in predictive modelling but it makes sense to introduce it here.
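The binning described above maps naturally onto pd.cut with left-closed intervals (a sketch; the five mpg values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"mpg": [14.0, 17.0, 28.9, 29.0, 41.0]})

# [min, 17) -> low, [17, 29) -> medium, [29, max) -> high
df["mpg_level"] = pd.cut(df["mpg"],
                         bins=[-np.inf, 17, 29, np.inf],
                         labels=["low", "medium", "high"],
                         right=False)  # left-closed intervals

print(df["mpg_level"].tolist())
# ['low', 'medium', 'medium', 'high', 'high']
```

Note that right=False makes each bin include its left edge, matching the [a, b) ranges above, so mpg = 17 falls in medium and mpg = 29 in high.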
Let’s inspect the unique values in origin, cylinders and model_year. I am leaving out name because it is almost unique for each entry, hence there is nothing interesting to inspect.
Although descriptive stats for categorical attributes are not very informative, they are still worth a look. Also, the pandas describe function covers only numeric data by default, and in df_cat, cylinders and model_year are the only numeric columns.
df_cat.describe()
It seems that most of the values in cylinders are 4, and (min, max) is (3, 8).
Analysis of Distribution
Now we analyze the distribution for each categorical attribute and make some insights from the plots.
In the case of categorical variables, an ideal (or at least loved) distribution is uniform or uniform-like; below is a uniform distribution.
Let’s plot the distribution for different categorical attributes in our data.
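One way to produce such count plots with seaborn is sketched below (the two-column toy frame and the output file name are assumptions; the real notebook plots every categorical column):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df_cat = pd.DataFrame({
    "origin": ["usa", "usa", "japan", "europe"],
    "mpg_level": ["low", "medium", "medium", "high"],
})

# One countplot per categorical column, side by side.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, col in zip(axes, ["origin", "mpg_level"]):
    sns.countplot(data=df_cat, x=col, ax=ax)
    ax.set_title(f"Distribution of {col}")
fig.tight_layout()
fig.savefig("cat_distributions.png")
```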
Let’s calculate the proportion of dominant classes in each category.
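The class proportions come straight from value_counts with normalization (a sketch on a made-up origin column):

```python
import pandas as pd

df_cat = pd.DataFrame({
    "origin": ["usa", "usa", "usa", "japan", "europe"],
})

# Proportion of each class, largest first.
props = df_cat["origin"].value_counts(normalize=True)
print(props)
print(props.iloc[0])  # 0.6 -> the dominant class covers 60% of this toy data
```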
Insights
- origin is highly imbalanced; usa alone comprises 62.5% of the data, whereas japan & europe have similar proportions. We will see this dominance in future analysis and will try to find the reason for it.
- cylinders is highly imbalanced; 4 alone comprises 50.77% of the data, whereas 8 & 6 are in nearly the same proportion, but 3 & 5 collectively account for only 7 entries, i.e., 1.8% of the entire data. We will see this huge proportional imbalance in cylinders in future analysis.
- mpg_level is highly imbalanced; medium alone comprises 52.3% of the data, while low & high are in the same proportion. This dominance is due to our thresholding while engineering this feature: the medium range is broader and hence contains more data points. It won't be present in the original mpg feature, as that is continuous.
- model_year is considerably balanced, which is good.
Now we analyze the car name attribute.
Firstly, even though name is categorical, it has a lot of categories, which makes sense because product names generally vary a lot in any domain. So it’s not fruitful to analyze car names; they are just labels, like a product ID, and seem to hold no important insights.
But one thing to notice here is that each car name starts with a company name, so it may be the case that there are very few companies in the data-set and it would be fruitful to extract the company as a separate feature and analyze that. So let’s do it.
I will create a new attribute named car_company by extracting the first word from each name. I will also remove the company from each car name, because it is not needed there any more, and rename the column name to car_name.
Now, check the total number of unique values in car_company.
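The extraction step can be sketched as below (three made-up names stand in for the full column):

```python
import pandas as pd

df_cat = pd.DataFrame({
    "name": ["chevrolet chevelle malibu", "ford torino", "datsun pl510"],
})

# The first word of each name is the manufacturer.
df_cat["car_company"] = df_cat["name"].str.split().str[0]
# Drop the company prefix from the name and rename the column.
df_cat["name"] = df_cat["name"].str.split(n=1).str[1]
df_cat = df_cat.rename(columns={"name": "car_name"})

print(df_cat["car_company"].tolist())  # ['chevrolet', 'ford', 'datsun']
print(df_cat["car_name"].tolist())     # ['chevelle malibu', 'torino', 'pl510']
```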
Great, this is what we wished for and indeed what we get: our idea that few car companies are involved in this data is correct. Because the number of categories is small, we can now do analysis on it. So we took a step in the right direction just by intuition.
Now, we analyze the distribution of car_company.
Insights
- We found that car_name has a lot of categories, close to the total number of data points. So it's not fruitful to analyze it, as it is unique for most of the points; also, in most cases names are safe to leave out, as they don't correlate with other attributes.
- We then created an artificial attribute named car_company by extracting the company names from the car names. We find that there are far fewer car companies than car names (around 8 times fewer).
- We then found that the distribution of car_company is not uniform and most of the proportion is covered by the top 15 car companies, with ford and chevrolet alone comprising around 23% (almost a quarter).
Conclusion
- Every categorical attribute except model_year is highly imbalanced and far from a uniform distribution. In all cases most of the data is comprised of the top few categories.
- Although model_year is not perfectly uniform, we can think of it as a uniform-like distribution. This is a digestible assumption for two reasons: first, we can clearly see in the plot that the distribution is indeed uniform-like; second, this is not the entire population but a sample of it, so in the long run it may converge to uniform, which may be the true population distribution (Law of Large Numbers).
Now we will analyze how different features behave when bifurcated by other features.
Insights
We can clearly see the impact of imbalanced categories in our bifurcated plots.
cylinders bifurcated by origin
- Japan is the only origin with vehicles having 3 cylinders.
- Europe is the only origin with vehicles having 5 cylinders.
- USA is the only origin with vehicles having 8 cylinders.
- All origins have 4-cylinder vehicles, in almost equal proportion, partly because 4 dominates in cylinders.
- All origins have 6-cylinder vehicles, but the USA dominates here too, due to the fact that it dominates in origin.
mpg_level bifurcated by origin
- Japan doesn’t have any vehicle with low mpg_level, europe has a negligible number, and almost all vehicles with low mpg_level are from the usa.
- Japan has the most vehicles with high mpg_level.
- The USA has the most vehicles with medium mpg_level (again because most vehicles belong to the USA).
mpg_level bifurcated by cylinders
- Vehicles with low mpg_level have either 6 or 8 cylinders, and most of them have 8.
- Almost all vehicles with high mpg_level have 4 cylinders; very few (fewer than 5) have 5–6 cylinders.
- Most vehicles with medium mpg_level have 4, 6 or 8 cylinders. This is because most vehicles have these cylinder counts, as we saw in the cylinders distribution earlier.
Let’s analyze mpg_level with cylinders bifurcated by origin.
Insights
- Japan doesn’t have any vehicle with low mpg_level, and most of its vehicles have high mpg_level with mostly 4 cylinders (this we already inspected from the earlier plots; nothing new).
- Almost every vehicle in europe has 4 cylinders, and most of them are medium or high.
- The USA has few vehicles with high mpg_level compared to the others, despite the fact that most of the vehicles belong to the USA.
Note: Although these insights could be detected from the earlier plots, it was a bit hard; bifurcation helped us extract even more meaning from the data.
Conclusion
- Japan is leading in mpg level, with most of its vehicles having high mpg_level. It has more than twice as many high-mpg vehicles as the other origins.
- It seems that, in general, as the number of cylinders increases, mpg decreases.
Let’s analyze model_year.
Insights
- As the years progress, manufacturing of low-mpg cars decreases and eventually stops after ’79.
- As the years progress, manufacturing of high-mpg cars increases; in fact, after ’79 their manufacturing was 1.5–2 times that of vehicles with medium mpg.
- Throughout the years there is no significant change in the manufacturing of vehicles with medium mpg; in fact, from ’74–’79 their manufacturing exceeded the sum of the other two.
- From the scatter plot we can clearly see an upward linear trend, i.e., as the years progress, mpg increases.
Insights
- As the years progress, vehicles with more cylinders (8 & 6) decrease significantly.
- As the years progress, vehicles with fewer cylinders increase.
- One important thing to notice is that throughout the years, vehicles with 4 cylinders hold a significant proportion; in fact, in the 80’s most vehicles had 4 cylinders.
- These results make sense: as the years progress and technology advances, vehicles with low mpg and more cylinders lose focus, and vehicles with high mpg and fewer cylinders are the new stars.
Insights
- In the starting years, manufacturing was completely dominated by the USA.
- As the years progress, japan and europe started manufacturing more vehicles. In fact, in the year ’80 both japan and europe manufactured more than the USA. Throughout the years the USA dominates, and then there is a sudden, considerable decrease in its manufacturing; something may have happened in the USA around ’80.
- Initially europe manufactured more vehicles than japan, but japan exceeded it after ’76.
Let’s analyze car_company.
As car_company contains a lot of categories and most of them have a very low proportion, we will analyze only the top 15 car companies.
We can see that the top 15 car companies alone manufacture 83% of the vehicles.
Insights
- Top manufacturing companies have vehicles at all mpg levels, but companies with fewer vehicles focus more on high- or medium-mpg vehicles.
- All top manufacturing companies are from the usa, and that is the reason the usa has most of the vehicles in the data-set (this is one of our key findings). We have now answered the question we asked earlier.
- All top manufacturing companies use 4, 6 & 8 cylinders fairly equally, but the companies with less manufacturing generally use fewer cylinders in their vehicles.
We are done with the analysis of the categorical attributes; we found lots of interesting things and answered many open questions. Now we will incorporate the required changes from df_cat into df.
Every attribute except car_name is of interest and participated in our analysis. So we will not add car_name to our data-frame, as it is of no interest. This is feature reduction and is an integral part of feature engineering.
Save these changes to a new file.
df.to_csv("mpg_cated.csv", index=False)
Analysis on Numerical Attributes
The analysis includes both descriptive stats and EDA.
df = pd.read_csv("mpg_cated.csv")
df.head()
I will first slice out the numerical columns from the original data-frame and then do the analysis on that slice, keeping the original data untouched; at the end I will incorporate the needed changes into the original data-frame.
Analysis of Distribution
Now we analyze the distribution for each numerical attribute and make some insights from the plots.
In the case of numerical variables, an ideal (or at least loved) distribution is gaussian or gaussian-like; for a gaussian, the various distribution plots look like the ones below.
Let’s plot the distribution for different numerical attributes in our data.
Insights
- acceleration is the only distribution which is gaussian. There are a few values in acceleration which lie outside the whiskers (the bars extending outwards from the box); these are fliers/outliers.
- The distributions of mpg & weight seem to be right-skewed gaussian.
- The distributions of displacement & horsepower seem to be far from gaussian.
Currently we are judging the distributions just from plots; in the next part (statistical analysis) we will do hypothesis testing for the normality of these distributions.
Let’s analyze the outliers using Tukey’s formula.
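The tukey_outliers helper used below is not shown in this excerpt; a plausible sketch, assuming the standard Tukey fences of 1.5 × IQR beyond the quartiles, is:

```python
import pandas as pd

def tukey_outliers(series: pd.Series) -> pd.Series:
    """Return the values lying outside Tukey's fences (Q1/Q3 -/+ 1.5*IQR)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return series[(series < lower) | (series > upper)]

s = pd.Series([10, 11, 12, 12, 13, 14, 40])  # 40 is an obvious flier
print(tukey_outliers(s).tolist())  # [40]
```

Because the returned Series keeps the original index, expressions like df.iloc[list(tukey_outliers(df_num.acceleration).index)] can recover the full outlier rows.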
acceleration and horsepower are the only attributes with Tukey outliers, and we can also notice this from the boxplots above.
df.iloc[list(tukey_outliers(df_num.acceleration).index)]
df.iloc[list(tukey_outliers(df_num.horsepower).index)]
Insights
- The outliers in acceleration seem random; nothing conclusive. One thing we can notice is that none of them are from japan.
- The outliers in horsepower do not seem random; they have a lot in common:
- All of them are from the usa (maybe because vehicles from the usa are in the majority).
- All of them have 8 cylinders.
- All of them have low mpg_level.
- All of them have weight around 4000.
- Most of them have displacement around 400.
- Most of them were manufactured in the early years (before ’74).
Note that the data is not scaled; we would need to scale it for modelling, but it works fine for analysis.
Now we analyze the relationships between different numerical attributes.
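These pairwise relationships can be quantified with a correlation matrix (a sketch on a made-up three-column frame; sns.heatmap(corr, annot=True) or sns.pairplot(df_num) would visualize the same information):

```python
import pandas as pd

# Made-up numbers that mimic the dataset's trends:
# mpg falls as weight and horsepower rise.
df_num = pd.DataFrame({
    "mpg": [33.0, 27.0, 18.0, 14.0],
    "weight": [1800, 2200, 3500, 4400],
    "horsepower": [65.0, 88.0, 130.0, 220.0],
})

corr = df_num.corr()  # Pearson correlation between every pair of columns
print(corr.round(2))
```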
Insights
- As mpg increases, displacement, horsepower & weight decrease, but acceleration increases.
- As horsepower increases, displacement & weight increase, but acceleration decreases.
- As weight increases, displacement increases, but acceleration decreases.
- As acceleration increases, displacement decreases.
So all the numerical attributes are related to one another.
Now we bifurcate these relationships with different categories. In this plot we analyze the relationship of horsepower & acceleration bifurcated by origin, mpg_level & cylinders in a single plot.
Insights
- In every region there is a negative relation between horsepower & acceleration.
- Vehicles with low mpg have low acceleration and high horsepower.
- Vehicles with more cylinders have low acceleration and high horsepower.
In this plot we analyze the relationship of weight & horsepower bifurcated by origin, mpg_level & cylinders in a single plot.
Insights
- In every region there is a positive relation between weight & horsepower.
- Vehicles with low mpg have high weight & horsepower.
- Vehicles with more cylinders have high weight & horsepower.
- On bifurcating, we didn’t find anything new.
Analysis of relationship between numerical and categorical attributes
Note: I am using boxen and violin plots for this, but we could also use stripplot.
Variation of numerical features with origin
Insights
- Vehicles from the usa have less mpg on average compared to japan & europe.
- Vehicles from the usa have more displacement, horsepower and weight compared to japan & europe.
- All vehicles have relatively similar acceleration irrespective of origin, but the distribution of acceleration for the usa is more spread out, due to the fact that it comprises many more vehicles than the others.
Variation of numerical features with mpg_level
Insights
- As mpg_level increases, displacement decreases on average.
- As mpg_level increases, horsepower decreases on average.
- As mpg_level increases, weight decreases on average.
- Vehicles with low mpg_level usually have less acceleration than the others, whereas vehicles with medium and high mpg_level have similar acceleration.
Variation of numerical features with cylinders
Insights
- As cylinders increase from 3 to 4, mpg also increases, but on further increasing the cylinders, mpg starts decreasing.
- Displacement increases in polynomial order as cylinders increase.
- As cylinders increase from 3 to 5, horsepower decreases, but on further increasing the cylinders, it starts increasing.
- On increasing cylinders, a vehicle’s weight increases on average (very obvious).
- As cylinders increase from 3 to 5, a vehicle’s acceleration also increases, but on further increasing the cylinders, it starts decreasing (maybe because vehicles with more cylinders have more weight and hence less acceleration).
Variation of numerical features with model_year
Variation of numerical features with model_year bifurcated by origin.
Insights
- As the years progress, there is an increase in mpg across all origins (we already observed this in the analysis of the categorical data).
- As the years progress, there is a slight decrease in the displacement, horsepower & weight of the vehicles belonging to the usa, but no significant change for japan & europe. One thing we can observe is that in the 80’s all vehicles have similar displacement, because unlike the 70’s the distribution is not spread out (i.e., the distribution is short and fat instead of tall and skinny).
- Throughout the years, acceleration remains relatively the same across all regions.
So we are done for now. We did a good amount of EDA and also explored various plotting features provided by seaborn. I highly recommend using seaborn because it’s easy and simple. You could also use plotly to construct all the graphs in this notebook. Plotly plots are not only visually awesome but also interactive, and because of this they take much more memory, especially scatter plots (as they need to store every data point’s information to enable interactivity). That’s why I didn’t include any plotly plots.
You can get the entire documented jupyter notebook for this blog from here; you just need to fork it. Also, if you like the notebook then up-vote it; it motivates me to create further quality content.
If you like this story, do clap for it and share it with others.
In the next part we will do some Statistical Analysis.
Thank you for reading!