Body Performance Project-2.3

EDA: Exploratory data analysis

Daniel Chiebuka Ihenacho
5 min read · Oct 5, 2023
Photo by Carlos Muza on Unsplash

In the previous post, we covered data cleaning and feature engineering. Now we continue with the next step in our routine: exploratory data analysis (EDA).

What is Exploratory Data Analysis (EDA)?

Exploratory Data Analysis involves conducting initial examinations of data to uncover patterns, identify outliers, assess hypotheses, and validate assumptions through the use of summary statistics and visual presentations.

It is advisable to begin by comprehensively understanding the data and extracting as many insights as possible. Exploratory Data Analysis (EDA) revolves around the concept of extracting meaningful information from the data at hand before delving into detailed analysis.

Here’s the EDA process for our dataset:

  1. Check for data types
  2. Creation of tables for detailed observations
  3. Creation of eye-appealing visuals for deeper insights

# Step 1
eda = df_feat.copy()
eda.dtypes

# Collect every non-categorical column; these are the numerical features
num_list = eda.select_dtypes(exclude="category").columns.to_list()
print(len(num_list))
# 11 numerical columns
Data types

We excluded the categorical data types from num_list because we need to compute summary statistics on the numerical columns when grouping the data by category.
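
For readers who do not have the dataset at hand, the split between numerical and categorical columns can be sketched on a toy frame. The column names and values below are made up for illustration and are not taken from the project data.

```python
import pandas as pd

# Toy frame standing in for the real data; names and values are illustrative only.
toy = pd.DataFrame(
    {
        "age": [27, 35, 41],
        "height_cm": [172.3, 165.0, 180.1],
        "gender": pd.Categorical(["M", "F", "M"]),
    }
)

# Everything that is not categorical is treated as numerical here.
numeric_cols = toy.select_dtypes(exclude="category").columns.to_list()
categorical_cols = toy.select_dtypes(include="category").columns.to_list()

print(numeric_cols)
print(categorical_cols)
```

The numerical list is what we aggregate, while the categorical list gives us the columns we can group by.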

# Step 2
# Range function
def range_cal(arr):
    return arr.max() - arr.min()

# Summary statistics function
def summary_stats(group: str, column: str):
    try:
        # Numeric columns get the full set of summary statistics
        if pd.api.types.is_numeric_dtype(eda[column]):
            group_data = (
                eda.groupby(group)[column].agg(
                    [
                        (f"total_{column}", "sum"),
                        (f"average_{column}", "mean"),
                        (f"deviation_{column}", "std"),
                        (f"range_{column}", range_cal),
                        (f"skewness_level_{column}", "skew"),
                    ]
                ).reset_index()
            )
            return group_data
        else:
            # Categorical columns only get a count per group
            group_data = (
                eda.groupby(group)[column].agg(
                    [
                        (f"count_total_{column}", "count")
                    ]
                ).reset_index()
            )
            return group_data
    except KeyError:
        # Raised when the group or column name is not in the DataFrame
        print(f"This is the list of keys: {eda.columns}")

The code above builds summary-statistics tables for any column, grouped by a category. Here is how to apply it:

# Change the index and the category group name as needed
data = summary_stats(column=num_list[0], group='gender')
data
# Based on the previous blog post:
# 1 --> Male
# 0 --> Female
Gender summary statistics table

With the code above, we created this table, which summarises the data by gender.

What can we observe from this? Females have a higher mean age than males.

Here’s another table, this time for height:

Height summary statistics table

Again, one observation stands out: females have a lower mean height than males, meaning the women in this dataset are shorter on average than the men.
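
To try this kind of grouped comparison without the dataset, here is a minimal sketch on toy data. It uses pandas named aggregation as an alternative to the list of tuples in summary_stats; the heights are invented for illustration and are not the project's values.

```python
import pandas as pd

# Invented heights; 0 = female, 1 = male, following the encoding used in this series.
toy = pd.DataFrame(
    {
        "gender": [0, 0, 0, 1, 1, 1],
        "height_cm": [158.0, 162.0, 166.0, 170.0, 175.0, 183.0],
    }
)

# Named aggregation: each keyword becomes a column in the summary table.
summary = (
    toy.groupby("gender")["height_cm"]
    .agg(
        average_height="mean",
        deviation_height="std",
        range_height=lambda s: s.max() - s.min(),
        skewness_height="skew",
    )
    .reset_index()
)
print(summary)

# Pull out the two group means to compare them directly.
female_mean = summary.loc[summary["gender"] == 0, "average_height"].iloc[0]
male_mean = summary.loc[summary["gender"] == 1, "average_height"].iloc[0]
print(female_mean < male_mean)
```

Reading the means off a table like this before plotting makes it easier to spot a chart that has been built on the wrong column or group.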

I highly recommend creating tables for your data first when doing EDA to avoid mistakes in your visualisations; this is something I learned from using Microsoft Power BI. A more comprehensive dive can be found in the full Jupyter notebook, which is linked at the end of the blog post.

Finally, we have arrived at the visualisation stage. EDA is never complete without visuals that are appealing and informative, so that stakeholders and clients can properly understand what is hidden in the data and see your findings. A few of these plots are:

  • Bar plots
  • Count plots
  • Histograms
  • Box plots
  • Heatmaps
  • Line plots
  • Scatter plots

I will show the visuals for each of these plots; the code snippets can be found in the linked Jupyter notebook.

Bar plot

From the chart above, we can see that the age deviation of females is greater than that of males, which suggests that female ages are more variable in this dataset.
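
A bar plot of this kind could be sketched as follows. This is not the notebook's code: it uses plain matplotlib on invented ages, and the output file name is hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Invented ages, chosen so that the female group is more spread out.
toy = pd.DataFrame(
    {
        "gender": ["female"] * 4 + ["male"] * 4,
        "age": [21, 35, 48, 62, 30, 33, 36, 39],
    }
)

# One bar per group, with the height equal to the standard deviation of age.
age_std = toy.groupby("gender")["age"].std()

fig, ax = plt.subplots()
ax.bar(age_std.index.tolist(), age_std.values)
ax.set_xlabel("gender")
ax.set_ylabel("standard deviation of age")
ax.set_title("Age deviation by gender (toy data)")
fig.savefig("age_deviation_by_gender.png")  # hypothetical file name
```

A taller bar simply means the ages in that group are spread further from their mean.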

Count plot

From the chart above, the following can be observed:

  • There are more males than females in this dataset.
  • Among males, class C (2) is the most common and class A (0) is the least common.
  • Among females, class A (0) is the most common and class C (2) is the least common.
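
The numbers behind a count plot are just a cross-tabulation, so the observations above can be checked in table form first. Here is a small sketch on invented rows; the column name fitness_class is a placeholder, not necessarily the name used in the dataset.

```python
import pandas as pd

# Invented rows; 1 = male, 0 = female, and classes A..C are encoded as 0..2.
toy = pd.DataFrame(
    {
        "gender": [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        "fitness_class": [2, 2, 2, 2, 1, 1, 0, 0, 0, 0, 1, 1, 2],
    }
)

# Rows are genders, columns are classes, cells are counts.
counts = pd.crosstab(toy["gender"], toy["fitness_class"])
print(counts)

# Most and least frequent class within each gender.
most_common = counts.idxmax(axis=1)
least_common = counts.idxmin(axis=1)
print(most_common)
print(least_common)
```
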
Histogram Plot
Skewness table

From the histogram plot and the table above, we can observe the following:

  • bmi is slightly skewed
  • weight_kg is symmetrical
  • height_m is symmetrical
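
The skewness values behind such a table can be computed with pandas directly. The sketch below uses invented columns, and the 0.5 cut-off for calling a column "approximately symmetrical" is a common rule of thumb that I am assuming here, not a threshold taken from the notebook.

```python
import pandas as pd

# Invented columns: one balanced around its mean, one with a long right tail.
toy = pd.DataFrame(
    {
        "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
        "right_skewed": [1.0, 1.0, 2.0, 2.0, 10.0],
    }
)

# Sample skewness per column.
skewness = toy.skew()


def describe_skew(value, threshold=0.5):
    """Label a skewness value; the threshold is an assumed rule of thumb."""
    if abs(value) < threshold:
        return "approximately symmetrical"
    return "right-skewed" if value > 0 else "left-skewed"


labels = skewness.apply(describe_skew)
print(skewness)
print(labels)
```
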
Box plot

Based on the box plot shown above, we can see that:

  • Males have a higher tendency to weigh above 100 kg, while females do not.
  • Females are more likely to weigh below 40 kg.
  • Males have a far greater median weight than females.
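
What a box plot draws can also be computed by hand: the quartiles give the box, and points beyond 1.5 times the interquartile range from the box are usually drawn as outliers. A minimal sketch on invented weights, with box_stats as a helper name I made up:

```python
import pandas as pd

# Invented weights, with one unusually heavy value in the male group.
toy = pd.DataFrame(
    {
        "gender": ["female"] * 5 + ["male"] * 5,
        "weight_kg": [45.0, 52.0, 55.0, 58.0, 61.0, 65.0, 72.0, 75.0, 78.0, 110.0],
    }
)


def box_stats(s):
    """Quartiles plus the number of points outside the 1.5 * IQR fences."""
    q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = ((s < lower) | (s > upper)).sum()
    return pd.Series({"q1": q1, "median": median, "q3": q3, "n_outliers": n_outliers})


# One row of box statistics per gender.
stats = pd.DataFrame(
    {name: box_stats(group) for name, group in toy.groupby("gender")["weight_kg"]}
).T
print(stats)
```
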
Heatmap

The heatmap shows the pairwise correlation between variables. Here are some observations we can make from the heatmap above:

  • age is negatively correlated with sit_ups_count.
  • weight is positively correlated with height, bmi and grip_force.
  • height is positively correlated with sit_ups_count and broad_jump, but negatively correlated with body_fat_percentage.
  • body_fat_percentage is negatively correlated with grip_force, sit_ups_count and broad_jump.
  • grip_force is positively correlated with sit_ups_count and broad_jump.
  • sit_ups_count is positively correlated with broad_jump.
  • diastolic and systolic are positively correlated with each other.
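
The matrix that a heatmap colours is simply the output of a correlation call. Here is a sketch on invented values that reuses three of the column names above; the result could then be handed to a plotting function such as seaborn's heatmap, which I leave out to keep the example small.

```python
import pandas as pd

# Invented values: sit-ups and jump distance fall as age rises.
toy = pd.DataFrame(
    {
        "age": [25, 35, 45, 55, 65],
        "sit_ups_count": [50, 44, 37, 30, 21],
        "broad_jump": [230, 215, 200, 180, 160],
    }
)

# Pearson correlation between every pair of columns.
corr = toy.corr()
print(corr.round(2))

# The sign of each cell is what the bullet points above describe.
age_vs_sit_ups = corr.loc["age", "sit_ups_count"]
sit_ups_vs_jump = corr.loc["sit_ups_count", "broad_jump"]
```
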
Line Plot
Scatter plot

The line plot and the scatter plot are essentially one and the same, and the choice between them comes down to preference.

Do you want a smooth line showing the trend of the data, or do you want to visualise each data point and see its correlated values? I personally find using both intuitive. One thing I like about the scatter plot is that it also reveals the extreme values shown in the box plot.

In the scatter plot, we can see extreme values approaching 140 kg.
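
The difference between the two views can be sketched side by side. The example below is not the notebook's code: it assumes height and weight as the two variables, uses invented values, and writes to a hypothetical file name.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Invented measurements with one extreme weight of 140 kg.
toy = pd.DataFrame(
    {
        "height_cm": [160, 160, 170, 170, 180, 180, 190, 190],
        "weight_kg": [50.0, 54.0, 62.0, 66.0, 74.0, 78.0, 86.0, 140.0],
    }
)

# The line shows the average weight at each height, which smooths out extremes.
trend = toy.groupby("height_cm")["weight_kg"].mean()

fig, (ax_line, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
ax_line.plot(trend.index, trend.values)
ax_line.set_title("Line plot: mean weight per height")
ax_line.set_xlabel("height_cm")
ax_line.set_ylabel("weight_kg")

# The scatter keeps every point, so the 140 kg value stays visible.
ax_scatter.scatter(toy["height_cm"], toy["weight_kg"])
ax_scatter.set_title("Scatter plot: every data point")
ax_scatter.set_xlabel("height_cm")
fig.savefig("line_vs_scatter.png")  # hypothetical file name
```

In this toy data the line peaks at 113 kg because it averages the two tallest people, while the scatter still shows the single 140 kg point.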

Conclusion

In conclusion, Exploratory Data Analysis represents both a philosophical and artistic approach aimed at comprehensively capturing every subtlety from the given dataset.

I highly recommend you go through the Jupyter notebook and write out the code on your own rather than just copying and pasting it. That way you gain an understanding of not only the coding aspect but also the entire process before moving on to the next section, which will cover the use of machine learning techniques.

Feel free to contact me if you have any questions. Power to data 🚀!
