Body Performance Project-2.1

Knowing my data

Daniel Chiebuka Ihenacho
3 min read · Sep 14, 2023
Photo by Maxim Hopman on Unsplash

Welcome to the second part of the project. I believe you’ve completed the first part of your journey. If you haven’t and are just starting out in data science, kindly refer to this link for setting up your environment. If you have a Colab account, feel free to use it instead.

In Pandas 1.x Cookbook, Second Edition by Matt Harrison & Theodore Petrou, Chapter 4 introduces the need to develop a data analysis routine. Such a routine helps you get acquainted with your dataset as quickly as possible and uncover details hidden within it. It serves mainly as a checklist and precedes model creation (Machine Learning and/or Deep Learning), if needed.

So let’s begin 😁!

About the dataset

The dataset can be downloaded from the Kaggle website, which can be found here.

The data includes the following features:

  • age : 20 ~ 64
  • gender : F,M
  • height_cm : (If you want to convert to feet, divide by 30.48)
  • weight_kg
  • body fat_%
  • diastolic : diastolic blood pressure (min)
  • systolic : systolic blood pressure (min)
  • gripForce
  • sit and bend forward_cm
  • sit-ups counts
  • broad jump_cm
  • class : A,B,C,D ( A: best) / stratified
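
As a quick check of the height conversion mentioned in the list above, here is a minimal sketch. The helper name cm_to_feet and the sample height are made up for illustration and are not part of the dataset or any library.

```python
CM_PER_FOOT = 30.48  # 1 foot is defined as exactly 30.48 cm


def cm_to_feet(height_cm: float) -> float:
    """Convert a height in centimeters to feet (hypothetical helper)."""
    return height_cm / CM_PER_FOOT


# Illustrative value only, not taken from the dataset
height_ft = cm_to_feet(152.4)
print(f"152.4 cm is {height_ft:.2f} ft")
```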

Import needed libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

Loading and returning five random samples from the dataset

df = pd.read_csv("data/bodyPerformance.csv")
df.sample(n=5, random_state=42)
Returned frame from the above code (columns shortened)
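
Passing random_state is what makes the sample repeatable: the same seed returns the same rows on every run. The sketch below illustrates this on a small made-up frame (the values are invented for the example and are not rows from the Kaggle file).

```python
import pandas as pd

# Made-up stand-in for the real dataset, used only to illustrate sampling
toy = pd.DataFrame({
    "age": [27, 25, 31, 32, 28, 36, 42, 33],
    "gender": ["M", "M", "M", "M", "M", "F", "F", "M"],
})

# Two draws with the same seed select exactly the same rows
first = toy.sample(n=5, random_state=42)
second = toy.sample(n=5, random_state=42)

print(first.equals(second))
```

Without random_state, each call would generally return a different set of rows.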

The code below helps you get to know your data:

  • df.shape → gives the number of rows and columns in the data
  • df.ndim → gives the number of dimensions (axes) of the data, which is 2 for a DataFrame
  • df.dtypes → lists the data type of each column
  • df.describe(include=np.number) → gives summary statistics for numeric columns only
  • df.describe(include=object) → gives summary statistics for object (string) columns only (the np.object alias was removed in NumPy 1.24, so use the built-in object)

You could even turn this procedure into a function if you want.

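print(f"Shape: {df.shape}\n")
print(f"Dimension: {df.ndim}\n")
print(f"Data types: \n{df.dtypes}\n")
# Only the last expression in a notebook cell is displayed, so print each summary explicitly
print("Description (numeric columns):")
print(df.describe(include=np.number))
print("Description (object columns):")
print(df.describe(include=object))  # built-in object; np.object was removed in NumPy 1.24
# Column dtypes, non-null counts, and memory usage including string contents
df.info(memory_usage='deep')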
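
As suggested above, the routine can be wrapped in a function. This is only a sketch: the name know_your_data, the returned dictionary, and the toy frame are my own assumptions, and exclude=[np.number] is used here in place of include=object so that every non-numeric column is summarized.

```python
import numpy as np
import pandas as pd


def know_your_data(df: pd.DataFrame) -> dict:
    """Print a quick overview of df and return the pieces (hypothetical helper)."""
    numeric_summary = df.describe(include=[np.number])
    # Raises ValueError if the frame has no non-numeric columns
    other_summary = df.describe(exclude=[np.number])

    print(f"Shape: {df.shape}\n")
    print(f"Dimension: {df.ndim}\n")
    print(f"Data types:\n{df.dtypes}\n")
    print("Description (numeric columns):")
    print(numeric_summary)
    print("Description (non-numeric columns):")
    print(other_summary)
    df.info(memory_usage="deep")

    return {
        "shape": df.shape,
        "ndim": df.ndim,
        "numeric": numeric_summary,
        "non_numeric": other_summary,
    }


# Made-up rows for illustration; with the real data you would pass df instead
toy = pd.DataFrame({
    "age": [27, 25, 31, 32],
    "weight_kg": [75.2, 55.8, 78.0, 71.1],
    "gender": ["M", "M", "F", "M"],
})
summary = know_your_data(toy)
```

Returning the summaries as well as printing them makes it easy to reuse them later in the notebook.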

Below are images of the code’s output for each line

df.shape & df.ndim
df.dtypes
df.describe(include=np.number)
df.describe(include=object)

From this procedure, some observations can be made:

  • Some data type conversions are needed.
  • The gender column and the target column class should be encoded as indicator (dummy) variables using get_dummies.
  • Convert the height to meters and drop the cm column.
  • Convert all other centimeter columns to meters as well.
  • Rename the columns for consistency (several contain spaces or symbols, such as body fat_% and sit-ups counts).
  • Calculate BMI (Body Mass Index): weight in kilograms divided by the square of height in meters.
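
To make the planned steps concrete, here is a rough sketch on a small made-up frame. The column names follow the feature list above, but the values, the new column names (height_m, broad_jump_m, bmi), and the order of steps are my own assumptions and not necessarily what the next part of the series does.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the Kaggle file
toy = pd.DataFrame({
    "gender": ["M", "F"],
    "height_cm": [172.3, 165.0],
    "weight_kg": [75.24, 55.8],
    "broad jump_cm": [217.0, 229.0],
    "class": ["C", "A"],
})

# Rename a column that contains a space
toy = toy.rename(columns={"broad jump_cm": "broad_jump_cm"})

# Convert centimeters to meters, then drop the cm columns
toy["height_m"] = toy["height_cm"] / 100
toy["broad_jump_m"] = toy["broad_jump_cm"] / 100
toy = toy.drop(columns=["height_cm", "broad_jump_cm"])

# BMI = weight (kg) / height (m) squared
toy["bmi"] = toy["weight_kg"] / toy["height_m"] ** 2

# Encode gender and class as indicator (dummy) columns
encoded = pd.get_dummies(toy, columns=["gender", "class"])
print(encoded.columns.tolist())
```

get_dummies replaces each listed column with one indicator column per category, named with the original column name as a prefix (for example gender_F and gender_M).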

Conclusion

With just this simple routine, you’ve made some key observations about the data. But that’s not all: in the next part of this series, the routine continues with data cleaning and feature engineering to get the best out of the data and make it more consistent for further use.
