Body Performance Project-2.1

Knowing my data

Sep 14, 2023

Welcome to the second part of the project. I assume you’ve completed the first part of this journey. If you haven’t and are just starting out in data science, kindly refer to this link to set up your environment. If you have a Colab account, feel free to use that instead.

In Pandas 1.x Cookbook, Second Edition, by Matt Harrison and Theodore Petrou, Chapter 4 introduces the need to develop a data analysis routine. Such a routine helps you get acquainted with your dataset quickly and uncover the details hidden within it. It mainly serves as a checklist and precedes model creation (machine learning and/or deep learning), if any is needed.

So let’s begin 😁!

About the dataset

The dataset can be downloaded from Kaggle; it can be found here.

The data includes the following features:

age : 20 ~64
gender : F,M
height_cm : (If you want to convert to feet, divide by 30.48)
weight_kg
body fat_%
diastolic : diastolic blood pressure (min)
systolic : systolic blood pressure (max)
gripForce
sit and bend forward_cm
sit-ups counts
broad jump_cm
class : A,B,C,D ( A: best) / stratified

Import needed libraries

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

Loading and returning five random samples from the dataset

df = pd.read_csv("data/bodyPerformance.csv")
df.sample(random_state=42,n=5)

Returned frame from the above code (columns shortened)
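
To see why random_state matters in the call above, here is a minimal sketch on a made-up frame. The name toy and its values are assumptions for illustration only; the real data comes from read_csv. With the same seed, sample returns the same rows every time, which keeps the notebook reproducible.

```python
import pandas as pd

# Stand-in frame with made-up values (not the Kaggle data).
toy = pd.DataFrame(
    {
        "age": list(range(20, 30)),
        "gripForce": [30.0 + i for i in range(10)],
    }
)

# Two draws with the same seed pick exactly the same rows.
first = toy.sample(n=5, random_state=42)
second = toy.sample(n=5, random_state=42)
print(first.equals(second))

# frac draws a share of the rows instead of a fixed count.
subset = toy.sample(frac=0.3, random_state=0)
print(subset.shape)
```

Dropping random_state gives a different sample on each run, which is fine for a quick look but makes the output harder to discuss with others.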

The code below helps you get to know your data:

df.shape → gives the number of rows and columns in the data
df.ndim → gives the number of dimensions of the data
df.dtypes → lists the data type of each column
df.describe(include=np.number) → gives summary statistics for the numeric columns only
df.describe(include=object) → gives summary statistics for the object (string) columns only; np.object was removed in NumPy 1.24, so use the built-in object instead

You could even turn this procedure into a function if you want.

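print(f"Shape: {df.shape}\n")
print(f"Dimension: {df.ndim}\n")
print(f"Data types: \n{df.dtypes}\n")
print("Description (numeric columns):")
print(df.describe(include=np.number))
print("Description (object columns):")
print(df.describe(include=object))  # np.object was removed in NumPy 1.24
df.info(memory_usage="deep")  # "deep" also counts the memory held by string values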

Below are images of the code’s output for each line

df.shape & df.ndim
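
If you do want the routine as a function, one possible shape is sketched below. The helper name summarize, the dictionary keys, and the demo frame are all assumptions made for illustration; none of them come from pandas or from the dataset.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Gather the basic 'know your data' checks into one dictionary.

    `summarize` is a hypothetical helper name, not a pandas function.
    """
    summary = {"shape": df.shape, "ndim": df.ndim, "dtypes": df.dtypes}

    # describe() raises on a frame with no columns, so guard each subset.
    numeric = df.select_dtypes(include="number")
    if numeric.shape[1] > 0:
        summary["numeric"] = numeric.describe()

    non_numeric = df.select_dtypes(exclude="number")
    if non_numeric.shape[1] > 0:
        summary["non_numeric"] = non_numeric.describe()

    return summary


# Tiny made-up frame for illustration; swap in the real df from read_csv.
demo = pd.DataFrame(
    {
        "age": [27, 25, 31, 32],
        "gender": ["M", "M", "F", "M"],
        "class": ["C", "A", "C", "B"],
    }
)
report = summarize(demo)
for name, value in report.items():
    print(f"{name}:\n{value}\n")
```

Returning a dictionary rather than printing inside the function keeps the pieces reusable: you can print them, compare them between datasets, or log them.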

From this procedure, some observations can be made:

Some data type conversions are needed.
The gender column and the target column, class, will be converted to a categorical format using get_dummies.
Convert the height to meters and drop the cm column.
Convert all other centimeter columns to meters.
Rename all columns.
Calculate BMI.
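
As a rough preview of how these observations might translate into pandas, here is a minimal sketch on a few made-up rows. The frame raw, the renaming scheme, and the new column names (height_m, broad_jump_m, bmi) are assumptions for illustration; the actual cleaning in the next part may be done differently. BMI is computed with the standard formula, weight in kilograms divided by height in meters squared.

```python
import pandas as pd

# Made-up rows that mimic a few of the dataset's columns (illustration only).
raw = pd.DataFrame(
    {
        "gender": ["M", "F", "M"],
        "height_cm": [180.0, 160.0, 175.0],
        "weight_kg": [81.0, 51.2, 70.0],
        "broad jump_cm": [217.0, 160.0, 229.0],
        "class": ["C", "A", "B"],
    }
)

# 1. Rename: lower-case, with underscores instead of spaces and hyphens.
tidy = raw.rename(
    columns=lambda c: c.strip().lower().replace(" ", "_").replace("-", "_")
)

# 2. Convert every *_cm column to meters and drop the centimeter version.
cm_cols = [c for c in tidy.columns if c.endswith("_cm")]
for col in cm_cols:
    tidy[col[: -len("_cm")] + "_m"] = tidy[col] / 100
tidy = tidy.drop(columns=cm_cols)

# 3. BMI = weight (kg) / height (m) squared.
tidy["bmi"] = tidy["weight_kg"] / tidy["height_m"] ** 2

# 4. One-hot encode gender and class with get_dummies, as planned above.
encoded = pd.get_dummies(tidy, columns=["gender", "class"])
print(encoded.columns.tolist())
print(tidy[["height_m", "weight_kg", "bmi"]])
```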

Conclusion

With just this simple routine, you’ve made some key observations about the data. But that’s not all: in the next part of this series, the routine continues with data cleaning and feature engineering to get the best out of the data and make it more consistent for further use.

References

Body performance Data: multi class classification. www.kaggle.com

pandas.DataFrame.sample - pandas 2.1.0 documentation. pandas.pydata.org