Body Performance Project-2.3

EDA: Exploratory data analysis

5 min read · Oct 5, 2023

In the previous post, we covered data cleaning and feature engineering. Now we continue with the next step in our routine: exploratory data analysis (EDA).

What is EDA: Exploratory data analysis?

Exploratory Data Analysis involves conducting initial examinations of data to uncover patterns, identify outliers, assess hypotheses, and validate assumptions through the use of summary statistics and visual presentations.

It is advisable to begin by comprehensively understanding the data and extracting as many insights as possible. Exploratory Data Analysis (EDA) revolves around the concept of extracting meaningful information from the data at hand before delving into detailed analysis.

Here’s the EDA process for our dataset:

Check for data types
Creation of tables for detailed observations
Creation of eye-appealing visuals for deeper insights

# Step 1
import pandas as pd

eda = df_feat.copy()  # df_feat comes from the previous post
eda.dtypes

# Every column that is not categorical is treated as numerical
num_list = eda.select_dtypes(exclude="category").columns.to_list()
print(len(num_list))
# 11 numerical columns

We excluded the categorical columns from num_list because they serve as the grouping keys: the summary statistics are computed on the numerical columns after grouping the data by a category.
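To make the split concrete, here is a minimal sketch on a made-up frame. The name `toy` and its values are assumptions for illustration only, not the project's data; the `"category"` string is the alias pandas documents for selecting categorical columns.

```python
import pandas as pd

# Toy frame standing in for df_feat; names and values are made up.
toy = pd.DataFrame(
    {
        "age": [27, 25, 31, 32],
        "height_m": [1.72, 1.65, 1.80, 1.74],
        "gender": pd.Categorical([1, 1, 0, 1]),
        "class": pd.Categorical([2, 0, 2, 1]),
    }
)

# Columns to summarise (everything that is not categorical)
num_cols = toy.select_dtypes(exclude="category").columns.to_list()
# Columns to group by (the categorical ones)
cat_cols = toy.select_dtypes(include="category").columns.to_list()

print(num_cols)  # ['age', 'height_m']
print(cat_cols)  # ['gender', 'class']
```

The numerical list feeds the summary statistics, while the categorical list gives the candidate grouping keys.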

# Step 2
import pandas as pd

# Range function: spread between the largest and smallest value
def range_cal(arr):
    return arr.max() - arr.min()

# Summary statistics function
def summary_stats(group: str, column: str):
    """Return a table of summary statistics for `column`, grouped by `group`."""
    try:
        # Numeric columns get the full set of statistics
        if pd.api.types.is_numeric_dtype(eda[column]):
            aggregations = [
                (f"total_{column}", "sum"),
                (f"average_{column}", "mean"),
                (f"deviation_{column}", "std"),
                (f"range_{column}", range_cal),
                (f"skewness_level_{column}", "skew"),
            ]
        else:
            # Categorical columns only get a count
            aggregations = [(f"count_total_{column}", "count")]
        return eda.groupby(group)[column].agg(aggregations).reset_index()
    except KeyError:
        # Unknown group or column name: show the valid names (returns None)
        print(f"This is the list of keys: {eda.columns}")

The code above builds the summary statistics tables. Here is how to apply it:

# Change the index number and the category group name as needed
data = summary_stats(column=num_list[0], group='gender')
data
# Based on the previous blog post:
# 1 --> Male
# 0 --> Female

With the code above, we created a table that summarises the data by gender.

What can we observe from this? Females have a higher mean age than males.

Here’s another, this time with height.

Again, one observation stands out: females have a lower mean height than males, that is, the women in this dataset are shorter on average than the men.

I highly recommend creating tables for your data first when doing EDA to avoid mistakes in visualisations; this is something I learned from using Microsoft Power BI. A more comprehensive dive can be found in the full Jupyter notebook, which will be linked at the end of the blog post.
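As a rough sketch of the "tables first" habit, the loop below builds one summary table per numerical column. Everything here is an assumption for illustration: the `toy` data, the helper `group_summary`, and the keyword form of `agg`, which is an alternative spelling of named aggregation rather than the function used in this post.

```python
import pandas as pd

# Made-up data; 1 = male, 0 = female, as in the post's encoding.
toy = pd.DataFrame(
    {
        "gender": pd.Categorical([1, 1, 0, 0]),
        "age": [20, 30, 40, 50],
        "height_m": [1.80, 1.70, 1.60, 1.64],
    }
)

def group_summary(frame, group, column):
    """Mean, standard deviation and range of `column` per `group` level."""
    grouped = frame.groupby(group, observed=True)[column]
    return grouped.agg(
        average="mean",
        deviation="std",
        value_range=lambda s: s.max() - s.min(),
    ).reset_index()

# One table per numerical column, keyed by column name
tables = {col: group_summary(toy, "gender", col) for col in ["age", "height_m"]}
print(tables["age"])
```

Reading the numbers in `tables` before plotting gives a reference point: if a chart later disagrees with its table, the chart is the first suspect.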

Finally, we arrive at the visualisation stage. EDA is never complete without visuals that are appealing and informative, so that stakeholders and clients can understand what is hidden in the data and see your findings. A few of these plots are:

Bar plots
Count plots
Histograms
Box plots
Heatmaps
Line plots
Scatter plots

I will show a visual for each of these plots; the code snippets can be found in the linked Jupyter notebook.

From the chart above, we can see that the standard deviation of age is greater for females than for males, which suggests that female ages are more variable in this dataset.

From the chart above, the following can be observed:

There are more males than females in this dataset.
Among males, class C (2) is the most common and class A (0) is the least common.
Among females, class A (0) is the most common and class C (2) is the least common.
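The counts behind a count plot like this can be checked with a cross-tabulation. This is only a sketch on made-up rows (the `toy` frame is an assumption, not the project data), using `pd.crosstab` to count each gender and class combination.

```python
import pandas as pd

# Made-up rows; gender 1 = male, 0 = female; class 0 = A, 1 = B, 2 = C.
toy = pd.DataFrame(
    {
        "gender": [1, 1, 1, 0, 0, 1],
        "class": [2, 2, 0, 0, 0, 1],
    }
)

# Rows are genders, columns are classes, cells are counts
counts = pd.crosstab(toy["gender"], toy["class"])
print(counts)

# Size of each gender group and its most frequent class
group_sizes = counts.sum(axis=1)
most_common = counts.idxmax(axis=1)
print(most_common)
```

In this toy data, class 2 is the most frequent class for gender 1 and class 0 for gender 0, which is the kind of statement the count plot is meant to support.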

From the histogram plot and the table above, we can observe the following:

bmi is slightly skewed
weight_kg is symmetrical
height_m is symmetrical
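A histogram reading like this can be cross-checked numerically with the sample skewness. The sketch below uses two made-up columns (hypothetical names and values): a value near zero points to a symmetrical distribution, while a clearly positive value points to a longer right tail.

```python
import pandas as pd

# Made-up columns: one symmetrical, one with a long right tail.
toy = pd.DataFrame(
    {
        "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
        "right_skewed": [1.0, 1.0, 2.0, 2.0, 10.0],
    }
)

# Sample skewness per column
skew = toy.skew()
print(skew)
```

This is the same statistic that the "skew" aggregation in summary_stats reports per group.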

Based on the box plot shown above, we can see that:

Males are more likely to have a weight above 100 kg, while females are not.
Females are more likely to have a weight below 40 kg.
Males have a far greater median weight than females.
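The numbers a box plot draws can also be computed directly. This sketch assumes the common 1.5 × IQR rule for the whisker limits (the plotting library's settings may differ) and uses made-up weights; `toy` and `box_stats` are hypothetical names.

```python
import pandas as pd

# Made-up weights; gender 1 = male, 0 = female.
toy = pd.DataFrame(
    {
        "gender": [1] * 5 + [0] * 5,
        "weight_kg": [60.0, 70.0, 75.0, 80.0, 120.0,
                      38.0, 50.0, 55.0, 60.0, 65.0],
    }
)

def box_stats(s):
    """Median, 1.5 * IQR fences and outlier count for one group."""
    q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = ((s < lower) | (s > upper)).sum()
    return pd.Series(
        {
            "median": median,
            "lower_fence": lower,
            "upper_fence": upper,
            "n_outliers": n_outliers,
        }
    )

# One row of box-plot statistics per gender
stats = pd.DataFrame(
    {g: box_stats(s) for g, s in toy.groupby("gender")["weight_kg"]}
).T
print(stats)
```

Here the 120 kg value falls above the upper fence for gender 1, so it would be drawn as an individual point beyond the whisker.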

The heatmap shows the correlation between each pair of variables. Here are some observations we can make from the heatmap above:

age is negatively correlated with sit_ups_counts.
weight is positively correlated with height, bmi and grip_force.
height is positively correlated with sit_ups_counts and broad_jump, but negatively correlated with body_fat_percentage.
body_fat_percentage is negatively correlated with grip_force, sit_ups_count and broad_jump.
grip_force is positively correlated with sit_ups_count and broad_jump.
sit_ups_count is positively correlated with broad_jump.
diastolic and systolic are positively correlated with each other.
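The values a heatmap colours come from a correlation matrix. Here is a minimal sketch with made-up numbers; the column names echo the post but the data is invented, and `DataFrame.corr()` computes the Pearson correlation by default.

```python
import pandas as pd

# Made-up values: sit-ups and jump distance fall as age rises.
toy = pd.DataFrame(
    {
        "age": [20, 30, 40, 50, 60],
        "sit_ups_count": [55, 48, 40, 33, 20],
        "broad_jump": [230, 215, 200, 180, 150],
    }
)

# Pairwise Pearson correlation between all numerical columns
corr = toy.corr()
print(corr.round(2))
```

Negative entries correspond to pairs that move in opposite directions, positive entries to pairs that move together, and the diagonal is always 1.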

The line plot and the scatter plot are essentially one and the same, and I would say the choice depends on preference:

Do you want a smooth line showing the trend of the data, or do you want to visualise each data point and see its correlated values? I personally find using both intuitive. One thing I do like about the scatter plot is the ability to also see the extreme values, as shown in the box plot.

As seen in the scatter plot, there are extreme values tending towards 140 kg.

Conclusion

In conclusion, Exploratory Data Analysis represents both a philosophical and artistic approach aimed at comprehensively capturing every subtlety from the given dataset.

I highly recommend you go through the Jupyter notebook and write out the code on your own rather than just copying and pasting it, so as to gain an understanding of not only the coding aspect but also the entire process before entering the next section, which will cover the use of Machine Learning techniques.

Kindly contact me if you have any questions. Power to data 🚀!

Written by Daniel Chiebuka Ihenacho