10-Step EDA: Getting Started with Your Data Science Project for Beginners

SUBHODH K S
Published in Nerd For Tech
6 min read · Apr 27, 2021

Exploratory data analysis (EDA) is the first step in any data science project. It gives us an overview of the data and can surface meaningful insights with just a few lines of code.

EDA is crucial for getting a sense of which features are likely to matter and for building a practical, intuitive understanding of your data set.

Getting started with your data science project can be daunting, especially as a beginner. Here is a 10-step checklist you can refer to when starting your EDA process.

Photo by Franki Chamaki on Unsplash

The data set I’m using can be found here:

This is a classification problem. The task is to identify which of nine types of surface a robot is travelling on, using data collected from Inertial Measurement Unit (IMU) sensors. The data contains attributes related to acceleration and angular velocity.

You can refer to my full starter notebook on this dataset here:

After importing the data set into a pandas DataFrame, follow the steps below to perform EDA.

1. Shape of your data

You can use the dataframe.shape property to return a tuple representing the dimensionality of the DataFrame. This tells us the number of rows and columns in the DataFrame.

Example:
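A minimal sketch, using a small made-up frame in place of the real training data (the column names and values below are illustrative assumptions, not the actual data set):

```python
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
df = pd.DataFrame({
    "row_id": ["0_0", "0_1", "0_2"],
    "orientation_X": [-0.758, -0.759, -0.760],
    "angular_velocity_X": [0.107, 0.068, 0.007],
})

# .shape is an attribute (no parentheses): (number of rows, number of columns)
n_rows, n_cols = df.shape
print(df.shape)  # prints (3, 3)
```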

2. Display few rows of dataframe

A pandas DataFrame has methods that let us display the first few rows (head), the last few rows (tail) or a random sample of rows (sample). This gives us a better view of the data.

Example:
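A short sketch of the three calls on a made-up frame (the column names and values are assumptions for illustration only):

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up).
df = pd.DataFrame({
    "series_id": range(10),
    "surface": ["concrete", "wood", "tiled", "carpet", "concrete",
                "soft_pvc", "wood", "concrete", "tiled", "wood"],
})

first_rows = df.head(3)                       # first 3 rows (default is 5)
last_rows = df.tail(3)                        # last 3 rows (default is 5)
random_rows = df.sample(n=3, random_state=0)  # 3 random rows, reproducible

print(first_rows)
print(last_rows)
print(random_rows)
```

Fixing random_state is optional; it only makes the random sample the same on every run.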

Insights:

  • column names
  • significant missing values (NaN)
  • types of classes in categorical features

3. Data types of all columns

The pandas .dtypes attribute displays the data types of all columns as a Series.
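As a rough sketch on a made-up frame (the columns are illustrative assumptions), .dtypes can be combined with select_dtypes to separate numeric features from the rest:

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up).
df = pd.DataFrame({
    "row_id": ["0_0", "0_1"],
    "series_id": [0, 0],
    "orientation_X": [-0.758, -0.759],
})

# .dtypes is an attribute that returns a Series indexed by column name
col_types = df.dtypes
print(col_types)

# select_dtypes filters columns by type; "number" keeps ints and floats
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```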

Observations:

  • x_train has only numeric (int and float) data types, except for the row_id attribute
  • y_train has a combination of numeric (int) and categorical (object) features

4. Statistics summary of the features

After we get a high-level intuitive feel of the features, we will look at summary statistics of the data.

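dataframe.describe() generates descriptive statistics. These summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

For numeric columns, the count, mean, standard deviation, min/max values and quartiles are displayed. For categorical (object) columns, it reports the count, the number of unique values, the most frequent value (top) and its frequency (freq).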
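A small sketch of both kinds of summary, on made-up data (the column names and values are assumptions for illustration):

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (values are made up).
df = pd.DataFrame({
    "linear_acceleration_X": [0.1, 0.4, -0.2, 0.3, 0.9],
    "surface": ["concrete", "wood", "concrete", "tiled", "concrete"],
})

# On a mixed frame, describe() summarizes the numeric columns by default:
# count, mean, std, min, 25%, 50%, 75%, max
numeric_summary = df.describe()
print(numeric_summary)

# On a categorical column it reports count, unique, top and freq instead
categorical_summary = df["surface"].describe()
print(categorical_summary)
```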

  • Here we see that the most frequent category is concrete, with a frequency of 779
  • We have 9 unique surface values
  • We have a total of 3810 rows
  • For the numeric features, the mean, max, min and 50% (50th percentile, or median) rows are the most useful

5. Check missing values for each feature

We can chain the pandas .isnull() and .sum() methods to display the count of missing values for each attribute.
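A minimal sketch on a made-up frame with one deliberately missing value (the columns and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with a single NaN in the first column (values are made up).
df = pd.DataFrame({
    "orientation_X": [0.1, np.nan, 0.3, 0.4],
    "angular_velocity_X": [0.5, 0.6, 0.7, 0.8],
})

# isnull() gives a True/False frame; sum() counts the True values per column
missing_counts = df.isnull().sum()
print(missing_counts)

# mean() of the same True/False frame gives the share missing, here as a percentage
missing_pct = df.isnull().mean() * 100
print(missing_pct)
```

The percentage view is handy on larger data sets, where a raw count is hard to judge on its own.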

Here we see that none of the features in x_train contain missing values.

6. Distribution of target variable

In the next step we can plot a seaborn countplot displaying the frequency of each category of the target variable.

Code:

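import matplotlib.pyplot as plt
import seaborn as sns

f, ax = plt.subplots(1, 1, figsize=(15, 5))
# Pass the column as x= so each class gets its own bar
graph = sns.countplot(x=y_train['surface'], ax=ax)
graph.set_title("Number of labels for each class")
plt.show()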

Here we see that concrete, soft_pvc and wood are the three most frequent classes.
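The same information can also be read off numerically. A small sketch using value_counts on a made-up target column (the labels below are illustrative, not the real class counts):

```python
import pandas as pd

# Toy target frame standing in for y_train (values are made up).
y = pd.DataFrame({
    "surface": ["concrete", "wood", "concrete", "soft_pvc", "concrete", "wood"],
})

# Absolute counts per class, sorted from most to least frequent
class_counts = y["surface"].value_counts()
print(class_counts)

# Relative frequencies, useful for judging class imbalance
class_share = y["surface"].value_counts(normalize=True)
print(class_share)
```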

7. Distribution of all features

Next we can plot histograms showing the distribution of each feature. This gives us more insight into the data. The matplotlib.pyplot library is used.

Code:

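import matplotlib.pyplot as plt

plt.figure(figsize=(26, 16))
# Skip the first three non-feature columns and plot one histogram per feature
for i, col in enumerate(x_train.columns[3:]):
    plt.subplot(3, 4, i + 1)
    # alpha keeps the train histogram visible underneath the test one
    plt.hist(x_train[col], color='red', bins=100, alpha=0.5)
    plt.hist(x_test[col], color='green', bins=100, alpha=0.5)
    plt.title(col)
plt.show()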

Here we see the distributions of the test (green) and train (red) datasets. We can observe that:

  • Feature distributions in train and test are quite similar
  • Velocity and acceleration appear to be roughly normally distributed

8. Boxplots

A box plot, also known as a box-and-whisker plot, displays values such as the first quartile, median and third quartile.

A box plot can also be used to spot outliers in the data. The seaborn library is used here.
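A rough sketch of what the plot encodes, on made-up values (the series below is an assumption for illustration). It uses the common 1.5 × IQR rule, which is also seaborn's default for where the whiskers end; points beyond that range are drawn individually as outliers:

```python
import pandas as pd
import seaborn as sns

# Toy feature with one obviously extreme value (values are made up).
values = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 12.0],
                   name="linear_acceleration_X")

# The box spans the first to third quartile, with a line at the median
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Whiskers reach at most 1.5 * IQR beyond the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, median, q3)
print(outliers.tolist())

# Draw the box plot; in a script, follow this with matplotlib's plt.show()
ax = sns.boxplot(x=values)
```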

  • Here the values for linear acceleration and angular velocities seem to contain outliers (points plotted beyond the whiskers). This is likely because acceleration and velocity can take on a wide range of values at any given point in time.
  • Box plots give us information about the 25th, 50th and 75th percentiles of the data.
  • In the above graph we see that the median value (the horizontal line dividing the box) for orientation_x is around 0.12

9. Correlation Heatmaps

A correlation is a value between -1 and 1 that indicates how two features are related to each other. A positive correlation means that as one feature increases the other also increases, while a negative correlation means that one feature increases as the other decreases. Correlations close to 0 indicate a weak relationship, while values closer to -1 or 1 signify a strong relationship.

Correlations can be obtained with the dataframe.corr() method.
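To make the sign and strength concrete, here is a small sketch with made-up columns (the names a to d are purely illustrative). By default, corr() computes the Pearson correlation:

```python
import pandas as pd

# Toy frame built so the correlations with column "a" are easy to predict.
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # rises in step with a: correlation of 1
    "c": [10, 8, 6, 4, 2],   # falls as a rises: correlation of -1
    "d": [2, 1, 3, 1, 2],    # no linear link with a: correlation of 0
})

# Square matrix of pairwise correlations between all numeric columns
corr_matrix = df.corr()
print(corr_matrix.round(2))
```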

A corresponding heatmap can then be drawn with the sns.heatmap() function to visualize the correlations in the data set.

Example:

f, ax = plt.subplots(figsize=(8, 8))
# Correlate the feature columns only; annot=True prints each value on its tile
sns.heatmap(x_train.iloc[:, 3:].corr(), annot=True, linewidths=.5, ax=ax)
plt.show()

  • Here we see that the features angular_velocity_y and angular_velocity_z are strongly negatively correlated (black tiles), while linear_acceleration_Z and linear_acceleration_Y have a weak positive correlation (orange tiles).

10. Crisp Conclusions

The most important part of your EDA process is its interpretability. Short summaries describing each plot or finding, along with brief, to-the-point observations, are essential to complete your EDA process.

That’s all folks! These were the 10 basic steps to work through during your EDA process. I wrote this blog because I found it hard to get started with my data science projects and could not find a useful reference for performing EDA to get a high-level understanding of the data. Do let me know if you find this useful in your projects.

You can connect with me on LinkedIn here:
