Exploratory Data Analysis (EDA)

Gaurav Sharma · Published in Analytics Vidhya · Jun 30, 2020 · 16 min read

This is a 3-part series in which I will walk through a data-set, analyze it, and at the end do predictive modelling. I recommend following the parts in sequential order, but you can jump to any part.

Part 1, Exploratory Data Analysis (EDA):
This part covers summary statistics of the data, but the major focus will be on EDA, where we extract meaning/information from the data using plots and report important insights. This part is more about data analysis and business intelligence (BI).

Part 2, Statistical Analysis:
In this part we will run several statistical hypothesis tests, apply estimation statistics, and interpret the results we get. We will also validate these results against the findings from part one. We will apply both parametric and non-parametric tests and report all the important insights from this part. This part is all about data science and benefits from some statistical background.

Part 3, Predictive Modelling:
In this part we will predict a response using given predictors. This part is all about machine learning.

Meta-Data, Data about Data

I am using the Auto MPG data-set from the UCI Machine Learning Repository for this EDA.

Title: Auto-Mpg Data
Number of Instances: 398
Number of Attributes: 9 including the class attribute
Attribute Information:

1. mpg — continuous
2. cylinders — multi-valued discrete
3. displacement — continuous
4. horsepower — continuous
5. weight — continuous
6. acceleration — continuous
7. model year — multi-valued discrete
8. origin — multi-valued discrete
9. car name — string (unique for each instance)

This data is not complex and is good for analysis, as it has a nice blend of both categorical and numerical attributes.

This is part 1, i.e., EDA. I won't stretch this part out too long and will do the following things in sequential order.

  1. Some pre-processing of the data, which includes dealing with missing values and duplicate data (if any) and then aligning the data.
  2. EDA on categorical attributes, which includes analyzing their distributions and relations with other categorical attributes.
  3. EDA on numerical attributes, which includes analyzing their distributions and relations with other numerical (continuous) attributes.
  4. Analysis of the relations between numerical and categorical attributes.

I will use seaborn heavily throughout the notebook, so it is also a good go-to notebook for those who are looking for EDA using seaborn.

Firstly, import all necessary libraries.

We will first import the data into a pandas data-frame and inspect its properties.
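A minimal sketch of the setup, assuming the UCI file has been saved locally as auto-mpg.csv (the file name and import list are my assumptions, not necessarily the original notebook's):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed local file name; the raw UCI file may also come as whitespace-separated text.
df = pd.read_csv("auto-mpg.csv")

df.head()   # first few rows
df.shape    # (398, 9)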


The data is in rectangular (tabular) form, with 398 entries, each having 9 distinct attributes.

To inspect meta-data (i.e., data about data), we can use an inbuilt pandas function.

df.info() describes many things about the data, like the data type of each column, memory usage, etc.

Now I will make two distinct lists, one for categorical and one for numerical column names, as the analysis differs between the two types. For that I will inspect the datatype of each column: if it is of type object then it is categorical, else numerical.
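One possible way to build the two lists from the column dtypes (a sketch; the notebook's exact code may differ):

# Columns stored as 'object' are treated as categorical, the rest as numerical.
cat_cols = [col for col in df.columns if df[col].dtype == "object"]
num_cols = [col for col in df.columns if col not in cat_cols]

print(cat_cols)
print(num_cols)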

I will use these two lists heavily throughout the analysis.

Let’s see how many unique values are there in each column.

As there are very few unique values for cylinders and model_year, it is safe to make them categorical instead of numeric. This conversion will be helpful during the analysis, as I will be bifurcating some attributes on the basis of others.

So, the lists need to be updated.
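A sketch of inspecting the unique counts and then moving cylinders and model_year into the categorical list (snake_case column names are an assumption):

# Unique values per column.
print(df.nunique())

# Treat the low-cardinality numeric columns as categorical.
for col in ["cylinders", "model_year"]:
    num_cols.remove(col)
    cat_cols.append(col)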

Now, inspect for NaNs in the data. I will check for NaNs column-wise.


The NaN-row proportion in the data is 6 / len(df) = 0.01507. horsepower contains all 6 NaN rows, comprising around 1.5% of the data. As this fraction is very low, it is safe to drop the NaN rows for now.

Note: If the NaN proportion were large (more than 5%) then we would not drop the rows but would instead impute the missing values, or even treat "missing" as another category.

For now, remove all NaN rows as they are just 1.5% of the data.
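A minimal sketch of the NaN check and drop:

# NaNs per column; horsepower is the only column with missing values (6 rows).
print(df.isna().sum())

# The missing rows are ~1.5% of the data, so drop them.
df = df.dropna().reset_index(drop=True)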

Let's see how many duplicate entries there are and drop them if any exist.
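A sketch of the duplicate check:

# Count exact duplicate rows and drop them if any exist.
print(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)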

So, there are no duplicate rows.

Before we move ahead, it is good practice to group together all variables having the same type.


Now we are all good to go for some in-depth analysis.

Analysis on Categorical Attributes

The analysis includes both descriptive stats and EDA.

I will first slice out the categorical columns from the original data-frame and then do the analysis on them, keeping the original data untouched. At the end I will incorporate the needed changes into the original data-frame.


As origin and name consist of text data, they need some pre-processing. We will remove all extra spaces from each string; otherwise the same string with different spacing would be treated as a different category, which should not be the case.

I will create an artificial categorical attribute named mpg_level which categorizes mpg into low, medium and high. This is done for two reasons: first, it will help a lot in EDA, i.e., I can bifurcate plots on the basis of mpg; and second, it is easier to understand than raw numbers.

I am dividing mpg into three regions:

[min, 17) -> low
[17, 29) -> medium
[29, max) -> high

The choice of these ranges is an analytical judgment call and could be anything, as long as it seems reasonable.

Note: This is feature engineering and is mostly done during predictive modelling, but it makes sense to introduce it here.
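A sketch of how mpg_level could be built with pd.cut (the open-ended bins are equivalent to [min, 17), [17, 29), [29, max)):

# Bin mpg into three ordered levels.
df_cat["mpg_level"] = pd.cut(
    df["mpg"],
    bins=[-float("inf"), 17, 29, float("inf")],
    labels=["low", "medium", "high"],
    right=False,
)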

Let's inspect the unique values in origin, cylinders and model_year. I am leaving out name because it is almost unique for each entry in this case, hence there is nothing interesting to inspect.

Although descriptive stats for categorical attributes are not very informative, they are still worth a look. Also, the pandas describe function covers only numeric data by default, and in df_cat, cylinders and model_year are the only numeric columns.

df_cat.describe()

It seems that most of the values in cylinders are 4 and (min, max) is (3, 8).

Analysis of Distribution

Now we analyze the distribution for each categorical attribute and make some insights from the plots.

In the case of categorical variables, an ideal (or at least well-loved) distribution is uniform or uniform-like; below is a uniform distribution.

[Image: an example of a uniform distribution]

Let’s plot the distribution for different categorical attributes in our data.

[Plots: count distributions of the categorical attributes]

Let’s calculate the proportion of dominant classes in each category.
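One way to draw the count plots and compute the class proportions (a sketch):

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, col in zip(axes.ravel(), ["origin", "cylinders", "mpg_level", "model_year"]):
    sns.countplot(data=df_cat, x=col, ax=ax)
plt.tight_layout()

# Proportion of each class, largest first.
for col in ["origin", "cylinders", "mpg_level"]:
    print(df_cat[col].value_counts(normalize=True).round(3), "\n")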

Insights

  • origin is highly imbalanced; usa alone accounts for 62.5% of the data, whereas japan & europe have similar proportions. We will see this dominance in the later analysis and will try to find the reason for it.
  • cylinders is highly imbalanced; 4 alone accounts for 50.77% of the data. 8 & 6 are in nearly the same proportion, but 3 & 5 collectively account for only 7 entries, i.e., 1.8% of the entire data. We will see this huge proportional imbalance in cylinders in the later analysis.
  • mpg_level is highly imbalanced; medium alone accounts for 52.3% of the data, while low & high are in the same proportion. This dominance is due to our thresholding when manufacturing this feature: the medium range is broader, hence it contains more data points. This won't be the case for the original mpg feature, as it is continuous.
  • model_year is considerably balanced, which is good.

Now we analyze the car names.

Firstly, even though name is categorical, it has a lot of categories, and this makes sense because product names generally vary a lot in any domain. So it is not fruitful to do analysis on car names: they are names just like product IDs and seem to hold no important insights.

But one thing to notice here is that each car name starts with a company name, so it may be the case that there are very few companies in the data-set, and it would be fruitful to extract them as a separate feature and analyze that. So let's do it.

I will create a new attribute named car_company by extracting the first word from all names. I will also remove the car company from each car name because it is not needed anymore, and rename the column name to car_name.
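A sketch of the extraction (the exact string cleanup in the notebook may differ):

# The first word of each name is the company; the rest becomes car_name.
# Names consisting of a single word will get a NaN car_name; handle as needed.
name = df_cat["name"].str.strip()
df_cat["car_company"] = name.str.split().str[0]
df_cat["car_name"] = name.str.split(n=1).str[1]
df_cat = df_cat.drop(columns=["name"])

df_cat["car_company"].nunique()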


Now, check for total unique values in car_company.

Great, this is exactly what we hoped for. Our idea that only a few car companies are involved in this data is indeed correct. Because the number of categories is small, we can now do analysis on it. So we took a step in the right direction just by intuition.

Now, we analyze the distribution of car_company.

[Plot: count distribution of car_company]

Insights

  • We found that car_name has a lot of categories, close to the total number of data points. So it is not fruitful to do analysis on it, as it is unique for most of the points; also, in most cases names are safe to leave out as they don't correlate with other attributes.
  • We then created an artificial attribute named car_company by extracting company names from car names. We find that there are far fewer car companies than car names (around 8 times fewer).
  • We then found that the distribution of car_company is not uniform and most of the proportion is covered by the top 15 car companies, while ford and chevrolet alone comprise around 23% (almost a quarter).

Conclusion

  • Every categorical attribute except model_year is highly imbalanced and far from a uniform distribution. In all cases most of the data is comprised of the top few categories.
  • Although model_year is not perfectly uniform, we can think of it as a uniform-like distribution. This is a digestible assumption for two reasons: first, we can clearly see in the plot that the distribution is indeed uniform-like; and second, this is not the entire population but a sample of it, so in the long run it may converge to a uniform distribution, which may be the true population distribution (Law of Large Numbers).

Now we will analyze how the different features behave when bifurcated by other features.
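A sketch of the bifurcated count plots used below:

fig, axes = plt.subplots(1, 3, figsize=(16, 4))
sns.countplot(data=df_cat, x="cylinders", hue="origin", ax=axes[0])
sns.countplot(data=df_cat, x="mpg_level", hue="origin", ax=axes[1])
sns.countplot(data=df_cat, x="mpg_level", hue="cylinders", ax=axes[2])
plt.tight_layout()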

[Plots: cylinders by origin, mpg_level by origin, mpg_level by cylinders]

Insights

We can clearly see the impact of imbalanced categories in our bifurcated plots.

cylinders bifurcated by origin

  • Japan is the only origin with vehicles having 3 cylinders.
  • Europe is the only origin with vehicles having 5 cylinders.
  • USA is the only origin with vehicles having 8 cylinders.
  • All origins have 4-cylinder vehicles, in almost equal proportion, partly because 4 dominates cylinders.
  • All origins have 6-cylinder vehicles, but the USA dominates due to the fact that it dominates origin.

mpg_level bifurcated by origin

  • Japan doesn't have any vehicle with low mpg_level, europe has a negligible number of vehicles with low mpg_level, and almost all vehicles with low mpg_level are from the usa.
  • Japan has the most vehicles with high mpg_level.
  • USA has the most vehicles with medium mpg_level (again due to the fact that most vehicles belong to the USA).

mpg_level bifurcated by cylinders

  • Vehicles with low mpg_level have either 6 or 8 cylinders, and most of them have 8 cylinders.
  • Almost all vehicles with high mpg_level have 4 cylinders; very few (less than 5) have 5–6 cylinders.
  • Most vehicles with medium mpg_level have 4, 6 & 8 cylinders. This is due to the fact that most of the vehicles have these numbers of cylinders, as we saw in the cylinders distribution earlier.

Let's analyze mpg_level with cylinders, bifurcated by origin.
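One way to draw this with seaborn's catplot (a sketch):

sns.catplot(
    data=df_cat, x="cylinders", hue="mpg_level",
    col="origin", kind="count", height=4,
)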

[Plot: mpg_level vs cylinders, bifurcated by origin]

Insights

  • Japan doesn't have any vehicle with low mpg_level, and most of its vehicles have high mpg_level with mostly 4 cylinders (we already saw this in earlier plots; nothing new).
  • Almost every vehicle in europe has 4 cylinders, and most of them have medium or high mpg_level.
  • USA has few vehicles with high mpg_level compared to the others, despite the fact that most of the vehicles belong to the USA.

Note: Although these insights could be detected from the earlier plots, it was a bit hard; bifurcation helped us extract even more meaning from the data.

Conclusion

  • Japan is leading in terms of mpg level, with most of its vehicles having high mpg_level. It has more than twice as many vehicles with high mpg as the other origins.
  • It seems that, in general, as the number of cylinders increases, mpg decreases.

Let's analyze model_year.
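A sketch of the kind of plots used here (count plots per year plus a raw mpg scatter; the exact plot types in the notebook may differ):

# How mpg_level is distributed across model years.
plt.figure(figsize=(12, 4))
sns.countplot(data=df_cat, x="model_year", hue="mpg_level")

# Trend of raw mpg over the years.
plt.figure(figsize=(12, 4))
sns.scatterplot(data=df, x="model_year", y="mpg")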

[Plots: mpg_level counts across model_year; mpg vs model_year scatter]

Insights

  • As the years progress, the manufacturing of low-mpg cars decreases and eventually stops after 79.
  • As the years progress, the manufacturing of high-mpg cars increases; in fact, after 79 their manufacturing was 1.5–2 times that of vehicles with medium mpg.
  • Throughout the years there is no significant change in the manufacturing of vehicles with medium mpg; in fact, from 74–79 their manufacturing was more than the sum of the remaining two.
  • From the scatter plot we can clearly see an upward linear trend, i.e., as the years progress, mpg increases.
[Plot: cylinders counts across model_year]

Insights

  • As the years progress, vehicles with more cylinders (8 & 6) decrease significantly.
  • As the years progress, vehicles with fewer cylinders increase.
  • One important thing to notice is that throughout the years vehicles with 4 cylinders hold a significant proportion; in fact, in the 80's most of the vehicles have 4 cylinders.
  • These results make sense: as the years progress and technology advances, vehicles with low mpg and more cylinders lose focus, and vehicles with high mpg and fewer cylinders are the new stars.
[Plot: origin counts across model_year]

Insights

  • In the starting years, manufacturing is completely dominated by the USA.
  • As the years progress, japan and europe start manufacturing more vehicles. In fact, in the year 80 both japan and europe manufactured more than the USA. This is striking because the USA dominates throughout the years and then suddenly shows a considerable decrease in manufacturing; something may have happened in the USA around the year 80.
  • Initially europe manufactures more vehicles than japan, but japan exceeds it after 76.

Let's analyze car_company.

As car_company contains a lot of categories and most of them have a very low proportion, we will analyze only the top 15 car companies.
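A sketch of selecting the top 15 companies and plotting them:

top15 = df_cat["car_company"].value_counts().head(15)
print(round(top15.sum() / len(df_cat), 3))   # share of the data covered by the top 15

top15_df = df_cat[df_cat["car_company"].isin(top15.index)]
plt.figure(figsize=(14, 5))
sns.countplot(data=top15_df, x="car_company", hue="mpg_level", order=top15.index)
plt.xticks(rotation=45)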

We can see that the top 15 car companies alone manufacture 83% of the vehicles.

[Plots: mpg_level and cylinders across the top 15 car companies]

Insights

  • Top manufacturing companies have vehicles at every mpg level, but companies with fewer vehicles focus more on high- or medium-mpg vehicles.
  • All top manufacturing companies are from the usa, and that is the reason why the usa has most of the vehicles in the data-set (this is one of our key findings). We have now answered the question we asked earlier.
  • All top manufacturing companies focus on vehicles with 4, 6 & 8 cylinders fairly equally, but the companies with less manufacturing generally use fewer cylinders in their vehicles.

We are done with the analysis of categorical attributes; we found lots of interesting things and answered many open questions. Now we will incorporate the required changes from df_cat into df.

Every attribute except car_name is of interest and participated in our analysis. So we will not add car_name back to our data-frame, as it is of no interest. This is feature reduction and is an integral part of feature engineering.
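A sketch of merging the engineered columns back (this assumes df and df_cat share the same index, which holds if df_cat was sliced from df after the NaN rows were dropped):

# Keep the engineered attributes, drop the raw name column.
df["mpg_level"] = df_cat["mpg_level"]
df["car_company"] = df_cat["car_company"]
df = df.drop(columns=["name"])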


Save these changes to a new file.

df.to_csv("mpg_cated.csv", index=False)

Analysis on Numerical Attributes

The analysis includes both descriptive stats and EDA.

df = pd.read_csv("mpg_cated.csv")
df.head()

I will first slice out the numerical columns from the original data-frame and then do the analysis on them, keeping the original data untouched; at the end I will incorporate the needed changes into the original data-frame.
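A sketch of slicing out the numerical part:

num_cols = ["mpg", "displacement", "horsepower", "weight", "acceleration"]
df_num = df[num_cols].copy()
df_num.describe()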


Analysis of Distribution

Now we analyze the distribution for each numerical attribute and make some insights from the plots.

In the case of numerical variables, an ideal (or at least well-loved) distribution is Gaussian or Gaussian-like; for a Gaussian, the various distribution plots look like the ones below.

[Image: distribution plots for a Gaussian]

Let’s plot the distribution for different numerical attributes in our data.
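A sketch of drawing a histogram (with density) and a box plot for each numerical column:

fig, axes = plt.subplots(2, len(df_num.columns), figsize=(18, 6))
for i, col in enumerate(df_num.columns):
    sns.histplot(df_num[col], kde=True, ax=axes[0, i])   # histogram + KDE
    sns.boxplot(x=df_num[col], ax=axes[1, i])            # box plot shows fliers
plt.tight_layout()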

[Plots: distributions of the numerical attributes]

Insights

  • acceleration is the only distribution which is Gaussian. There are a few values in acceleration which lie outside the whiskers (the bars extending outwards from the box); these are fliers/outliers.
  • The distributions of mpg & weight seem to be right-skewed Gaussians.
  • The distributions of displacement & horsepower seem to be far from Gaussian.

Currently we are analyzing the distributions just from the plots; in the next phase (statistical analysis) we will do hypothesis testing for the normality of these distributions.

Let's analyze the outliers using the Tukey fences formula.
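The tukey_outliers helper used below isn't included in this post; a possible implementation based on the standard Tukey fences (values outside Q1 - 1.5*IQR or Q3 + 1.5*IQR):

def tukey_outliers(series, k=1.5):
    # Return the values of `series` lying outside Tukey's fences.
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]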

acceleration and horsepower are the only attributes with Tukey outliers, which we can also notice from the boxplots above.

df.iloc[list(tukey_outliers(df_num.acceleration).index)]
df.iloc[list(tukey_outliers(df_num.horsepower).index)]

Insights

  • Outliers in acceleration seem to be random; nothing conclusive. One thing we can notice is that none of them are from japan.
  • Outliers in horsepower do not seem random; they have a lot in common:
  • All of them are from the usa (maybe because vehicles from the usa are in the majority).
  • All of them have 8 cylinders.
  • All of them have a low mpg level.
  • All of them have a weight around 4000.
  • Most of them have a displacement around 400.
  • Most of them were manufactured in the early years (before 74).

The data is not scaled uniformly; we will need to scale it for modelling, but it works fine for analysis.

Now we analyze the relationships between the different numerical attributes.
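A sketch of two standard views, a pair plot and a correlation heatmap (an assumption about which plots were used; the insights below can be read off either):

sns.pairplot(df_num)   # pairwise scatter plots

plt.figure(figsize=(6, 5))
sns.heatmap(df_num.corr(), annot=True, cmap="coolwarm")   # correlation matrix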

[Plots: relationships between the numerical attributes]

Insights

  • as mpg increases, displacement, horsepower & weight decrease but acceleration increases.
  • as horsepower increases, displacement & weight increase but acceleration decreases.
  • as weight increases, displacement increases but acceleration decreases.
  • as acceleration increases, displacement decreases.

So all the numerical attributes are related to each other.

Now we bifurcate these relationships by the different categories. In this plot we analyze the relationship between horsepower & acceleration, bifurcated by origin, mpg_level & cylinders in a single plot.
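A sketch using seaborn's relplot, which can encode all three categoricals at once:

sns.relplot(
    data=df, x="horsepower", y="acceleration",
    hue="mpg_level", size="cylinders", col="origin", height=4,
)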

[Plot: horsepower vs acceleration, bifurcated by origin, mpg_level & cylinders]

Insights

  • In every region there is a negative relation between horsepower & acceleration.
  • vehicles with low mpg have low acceleration and high horsepower.
  • vehicles with more cylinders have low acceleration and high horsepower.

In this plot we analyze the relationship between weight & horsepower, bifurcated by origin, mpg_level & cylinders in a single plot.

[Plot: weight vs horsepower, bifurcated by origin, mpg_level & cylinders]

Insights

  • In every region there is a positive relation between weight & horsepower.
  • vehicles with low mpg have high weight & horsepower.
  • vehicles with more cylinders have high weight & horsepower.
  • on bifurcating, we didn't find anything new.

Analysis of relationship between numerical and categorical attributes

Note: I am using boxen and violin plots for this, but we could also use a strip plot.
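A sketch of the pattern used in the sub-sections below, shown here for mpg vs origin:

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.boxenplot(data=df, x="origin", y="mpg", ax=axes[0])
sns.violinplot(data=df, x="origin", y="mpg", ax=axes[1])
plt.tight_layout()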

Variation of numerical features with origin

[Plots: numerical features vs origin]

Insights

  • vehicles of the usa have lower mpg on average compared to japan & europe.
  • vehicles of the usa have more displacement, horsepower and weight compared to japan & europe.
  • all vehicles have relatively similar acceleration irrespective of origin, but the distribution of acceleration for the usa is more spread out due to the fact that it comprises many more vehicles than the others.

Variation of numerical features with mpg_level

[Plots: numerical features vs mpg_level]

Insights

  • as mpg_level increases displacement decreases on average.
  • as mpg_level increases horsepower decreases on average.
  • as mpg_level increases weight decreases on average.
  • vehicles with low mpg_level usually have less acceleration compared to the others, whereas vehicles with medium and high mpg_level have similar acceleration.

Variation of numerical features with cylinders

[Plots: numerical features vs cylinders]

Insights

  • as cylinders increase from 3 to 4, mpg also increases, but on further increasing the cylinders, mpg starts decreasing.
  • displacement increases in polynomial order as cylinders increase.
  • as cylinders increase from 3 to 5, horsepower decreases, but on further increasing the cylinders it starts increasing.
  • on increasing cylinders, a vehicle's weight increases on average (very obvious).
  • as cylinders increase from 3 to 5, a vehicle's acceleration also increases, but on further increasing the cylinders it starts decreasing (maybe due to the fact that vehicles with more cylinders have more weight and hence less acceleration).

Variation of numerical features with model_year

[Plots: numerical features vs model_year]

Variation of numerical features with model_year bifurcated by origin.
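A sketch of one way to draw these trends, with a line per origin (the notebook may use a different plot type):

# Average trend of each numerical feature over model_year, split by origin.
for col in ["mpg", "displacement", "horsepower", "weight", "acceleration"]:
    plt.figure(figsize=(10, 3))
    sns.lineplot(data=df, x="model_year", y=col, hue="origin")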

[Plots: numerical features vs model_year, bifurcated by origin]

Insights

  • as the years progress there is an increase in mpg across all origins (we already observed this in the analysis of the categorical data).
  • as the years progress there is a slight decrease in the displacement, horsepower & weight of the vehicles belonging to the usa, but no significant change for japan & europe. One thing we can observe is that in the 80's all vehicles have similar displacement: unlike the 70's, the distribution is not spread out (i.e., the distribution is short and fat instead of tall and skinny).
  • throughout the years, acceleration remains relatively the same across all regions.

So we are done for now. We did a good amount of EDA and also explored various plotting features provided by seaborn. I highly recommend using seaborn because it is easy and simple. You can also use plotly to construct all the graphs in this notebook. Plotly plots are not only visually awesome but also interactive, and because of that they take much more memory, especially scatter plots (as they need to store every data point's information to enable interactivity). That's why I didn't include any plotly plots.

You can get the entire documented Jupyter notebook for this blog from here; you just need to fork it. Also, if you like the notebook then up-vote it; it motivates me to create further quality content.

If you like this story then do clap for it and share it with others.

In the next part we will do some Statistical Analysis.

Thank you for reading.
