Body Performance Project-2.1
Knowing my data
Welcome to the second part of the project. I believe you’ve completed the first part of your journey towards this project. If you haven’t and are just starting out in data science, kindly refer to this link for setting up your environment. But if you have a Colab account, feel free to make use of it.
In the book Pandas 1.x Cookbook, Second Edition by Matt Harrison and Theodore Petrou, Chapter 4 introduces the need for a data analysis routine. It helps you get acquainted with your dataset as quickly as possible and uncover the details hidden within it. It predominantly serves as a checklist and precedes model creation (machine learning and/or deep learning) if needed.
So let’s begin 😁!
About the dataset
The dataset can be downloaded from the Kaggle website here.
The data includes the following features:
- age : 20–64
- gender : F,M
- height_cm : (If you want to convert to feet, divide by 30.48)
- weight_kg
- body fat_%
- diastolic : diastolic blood pressure (min)
- systolic : systolic blood pressure (min)
- gripForce
- sit and bend forward_cm
- sit-ups counts
- broad jump_cm
- class : A,B,C,D ( A: best) / stratified
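The conversion hint above (divide a centimeter value by 30.48 to get feet) can be sanity-checked in a couple of lines; the sample heights below are made up:

```python
# 30.48 cm per foot, so dividing a cm value by 30.48 yields feet
heights_cm = [152.4, 175.0, 182.88]
heights_ft = [round(h / 30.48, 2) for h in heights_cm]
print(heights_ft)  # 152.4 cm is exactly 5 ft, 182.88 cm exactly 6 ft
```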
Import needed libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
Loading and returning five random samples from the dataset
df = pd.read_csv("data/bodyPerformance.csv")
df.sample(n=5, random_state=42)
The code below helps you get to know your data:
- df.shape → gives you the number of rows and columns in the data
- df.ndim → gives you the dimension of the data
- df.dtypes → lists out the data types present in the data
- df.describe(include=np.number) → gives the summary statistics for only numeric data types
- df.describe(include=object) → gives the summary statistics for only object (string) data types (np.object is deprecated in recent NumPy versions, so use the built-in object instead)
You could even turn this procedure into a function if you want.
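One way to package the checklist as a function — the `first_look` name is my own, and the tiny demo frame stands in for bodyPerformance.csv so the sketch runs on its own:

```python
import numpy as np
import pandas as pd

def first_look(df: pd.DataFrame) -> None:
    """Run the getting-to-know-your-data checklist in one call."""
    print(f"Shape: {df.shape}\n")
    print(f"Dimension: {df.ndim}\n")
    print(f"Data types:\n{df.dtypes}\n")
    print("Numeric summary:")
    print(df.describe(include=np.number), "\n")
    print("Object summary:")
    print(df.describe(include=object))

# A small stand-in frame so the sketch runs without the Kaggle file
demo = pd.DataFrame({"age": [25, 40, 33], "gender": ["F", "M", "M"]})
first_look(demo)
```

In a notebook you would simply call `first_look(df)` right after loading the CSV.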
print(f"Shape: {df.shape}\n")
print(f"Dimension: {df.ndim}\n")
print(f"Data types: \n{df.dtypes}\n")
print("Description (numeric):")
df.describe(include=np.number)
print("Description (object):")
df.describe(include=object)
df.info(memory_usage='deep')
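The `memory_usage='deep'` flag matters for object (string) columns: the default count only sees the pointer-sized slots, while `'deep'` also counts the Python string objects themselves. A quick check on a made-up frame:

```python
import pandas as pd

demo = pd.DataFrame({"gender": ["F", "M", "M"] * 1000})
shallow = demo.memory_usage().sum()        # object column counted as 8-byte references
deep = demo.memory_usage(deep=True).sum()  # also counts the string objects themselves
print(shallow, deep)                       # deep is noticeably larger
```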
Below are images of the code’s output for each line.
From this procedure, some observations can be made:
- Some columns will need data type conversions.
- The gender column and the target column class will be turned into dummy (one-hot) columns using get_dummies.
- Convert all centimeter columns (height, sit and bend forward, broad jump) to meters and drop the original cm columns.
- Rename all columns.
- Calculate BMI.
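The observations above might translate into something like the sketch below. The demo rows are illustrative stand-ins for bodyPerformance.csv, and the exact cleaning steps belong to the next part of the series:

```python
import pandas as pd

# Illustrative rows standing in for bodyPerformance.csv
demo = pd.DataFrame({
    "height_cm": [172.3, 165.0],
    "weight_kg": [75.24, 55.8],
    "gender": ["M", "F"],
    "class": ["A", "C"],
})

# One-hot encode gender and the target column class with get_dummies
demo = pd.get_dummies(demo, columns=["gender", "class"])

# Convert centimeters to meters, then drop the cm column
demo["height_m"] = demo["height_cm"] / 100
demo = demo.drop(columns="height_cm")

# BMI = weight (kg) / height (m) squared
demo["bmi"] = demo["weight_kg"] / demo["height_m"] ** 2
```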
Conclusion
With just this simple routine you’ve made some key observations about the data. But that’s not all: in the next part of this series, the routine continues with data cleaning and feature engineering, getting the best out of the data and making it more consistent for further use.