Body Performance Project-2.2

Data Cleaning & Feature Engineering

Daniel Chiebuka Ihenacho
Sep 15, 2023
Photo by JESHOOTS.COM on Unsplash

In the previous post, we got to know our data. Now we continue with the routine: data cleaning and feature engineering.

What is Data Cleaning?

Data cleaning is a crucial step in data science. It identifies and removes unwanted values, which could be missing values, duplicates, or irrelevant data points. Clean data makes the analysis and ML models applied later in the pipeline more reliable.

Here’s the data cleaning process for our dataset:

  1. Check for missing values
  2. Check for duplicates
  3. Rename columns
  4. Check that object-dtype (string) columns hold consistent, correctly entered values. (Trust me, I've seen poorly created datasets.)
# Steps 1-2
# Make a copy of the original dataset
df_clean = df.copy()
# Check for missing values; in our case there were none; 0
df_clean.isna().sum()
# Check for duplicated values; 1 duplicate value found
df_clean.duplicated().sum()
# Drop all duplicated data points
df_clean.drop_duplicates(inplace=True)
# Step 3
# Creating a dictionary of the columns
# in the dataset that need to be renamed
columns_to_rename = {
    'body fat_%': 'body_fat_percent',
    'gripForce': 'grip_force',
    'sit and bend forward_cm': 'sit_and_bend_forward_cm',
    'sit-ups counts': 'sit_ups_counts',
    'broad jump_cm': 'broad_jump_cm',
}
# Renaming the columns using the created dictionary
df_clean.rename(columns=columns_to_rename, inplace=True)
# Checking out the applied steps
df_clean.sample(random_state=42,n=5)
# Step 4
# This code checks the object data types to see if they were
# properly named/created.
# np.object was removed in NumPy 1.24; the string "object" selects the same columns
for column in df_clean.select_dtypes(include="object"):
    print(f"These are the values in the {column} column:\n{df_clean[column].unique()}")
    print(f"This is the number of unique values in the {column} column:\n{df_clean[column].nunique()}\n")

With the code blocks above, steps 1–4 are complete. You’re one step closer to EDA (Exploratory Data Analysis). Next, we move on to feature engineering.
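
If you don't have the dataset at hand, here is a minimal sketch of the same four steps on a small made-up DataFrame. The `toy` data and its values are assumptions for illustration only, not rows from the body performance dataset; the stray lowercase "b" is planted to show what step 4 is meant to catch.

```python
import pandas as pd

# Hypothetical toy data, not the real body performance dataset
toy = pd.DataFrame(
    {
        "gender": ["M", "F", "F", "M", "F"],
        "sit-ups counts": [40.0, 25.0, 25.0, None, 30.0],
        "class": ["A", "C", "C", "B", "b"],  # note the stray lowercase "b"
    }
)

toy_clean = toy.copy()

# Step 1: count missing values per column
missing_per_column = toy_clean.isna().sum()

# Step 2: count fully duplicated rows, then drop them
n_duplicates = int(toy_clean.duplicated().sum())
toy_clean = toy_clean.drop_duplicates()

# Step 3: rename columns to snake_case
toy_clean = toy_clean.rename(columns={"sit-ups counts": "sit_ups_counts"})

# Step 4: list the distinct labels to spot inconsistent entries
unique_classes = sorted(toy_clean["class"].unique())

print(missing_per_column)
print(n_duplicates)
print(unique_classes)
```

Seeing both "B" and "b" in the list of unique labels is the kind of inconsistency step 4 is designed to surface before it leaks into the analysis.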

What is Feature Engineering?

Feature engineering is a technique that derives new variables from existing data. These variables support further analysis and can improve the accuracy of machine learning predictions. A poorly feature-engineered dataset hurts both the analysis and the model's accuracy.

Here are the steps to follow for our feature engineering process:

  1. Convert centimeters to meters
  2. Calculate BMI
  3. Encode the gender column (One-Hot Encoding/Binary Encoding)
  4. Encode the class column (Label Encoding)
  5. Rename columns and convert types
  6. Drop irrelevant columns
# Steps 1-2
# Making a copy of the cleaned data
df_feat = df_clean.copy()

# Converting each centimeter column to meters and storing it under a new "_m" name
for column in df_feat.columns:
    if column.endswith("_cm"):
        df_feat[column[:-2] + "m"] = df_feat[column] / 100

# Calculating BMI using the weight_kg and height_m columns
df_feat['bmi'] = df_feat.weight_kg/np.power(df_feat.height_m,2)
# Step 3
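df_feat = pd.get_dummies(
    data=df_feat,
    columns=['gender'],
    drop_first=True,
    dtype=int,  # keep 0/1 integers; pandas 2.0+ returns True/False by default
)
# mappings (drop_first=True keeps only the gender_M column)
# 1 --> Male
# 0 --> Female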
# Step 4
from sklearn.preprocessing import LabelEncoder
encode = LabelEncoder()
df_feat['encoded_class'] = encode.fit_transform(df_feat['class'])
# mappings
# A - 0,
# B - 1,
# C - 2,
# D - 3
# Step 5
# rename gender_M column
df_feat.rename(columns={"gender_M":'gender'},inplace=True)

# Type conversion
df_feat['encoded_class'] = df_feat['encoded_class'].astype('category')
df_feat['gender'] = df_feat['gender'].astype('category')
# int8 covers -128 to 127, which is enough for ages
df_feat['age'] = df_feat['age'].astype('int8')
# Step 6
# Drop irrelevant columns
# Dropping all centimeter columns
cm_columns = [column for column in df_feat.columns if column.endswith("_cm")]
df_feat.drop(columns=cm_columns, inplace=True)

# Dropping the class column
df_feat.drop(columns=['class'],inplace=True)

# Checking data memory usage by the column
df_feat.info(memory_usage="deep")
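
To sanity-check the unit conversion and the BMI formula from steps 1 and 2, here is a small worked sketch. The two rows are made-up values chosen so the arithmetic is easy to verify by hand: 80 kg at 2.00 m gives 80 / 4 = 20, and 64 kg at 1.60 m gives 64 / 2.56 = 25.

```python
import pandas as pd

# Hypothetical toy values, not rows from the real dataset
toy = pd.DataFrame({"height_cm": [200.0, 160.0], "weight_kg": [80.0, 64.0]})

# Centimeters to meters: divide by 100
toy["height_m"] = toy["height_cm"] / 100

# BMI = weight in kilograms divided by the square of height in meters
toy["bmi"] = toy["weight_kg"] / toy["height_m"] ** 2

print(toy.round(2))
```

Skipping the conversion and dividing by height in centimeters would shrink every BMI by a factor of 10,000, which is why the conversion comes first.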

Let’s look a bit closer at steps 3 and 4: the difference between One-Hot Encoding/Binary Encoding and Label Encoding. The book Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber & Jian Pei (pages 41–42) defines binary and ordinal attributes as follows:

A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0 typically means that the attribute is absent, and 1 means that it is present. Binary attributes are referred to as Boolean if the two states correspond to true and false.

In our case, the gender column is a binary attribute, where 1 → male and 0 → female.
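
A minimal sketch of how that mapping comes about, using a made-up four-row column. `pd.get_dummies` sorts the categories ("F" before "M"), and `drop_first=True` drops the first one, so only the `gender_M` indicator remains. Passing `dtype=int` is my assumption here to get 0/1 values rather than booleans.

```python
import pandas as pd

# Hypothetical toy column, not the real dataset
toy = pd.DataFrame({"gender": ["M", "F", "F", "M"]})

# One indicator column per category, minus the first (gender_F)
encoded = pd.get_dummies(toy, columns=["gender"], drop_first=True, dtype=int)

# Rename the remaining indicator back to "gender"
encoded = encoded.rename(columns={"gender_M": "gender"})

print(encoded["gender"].tolist())  # [1, 0, 0, 1]
```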

An ordinal attribute is an attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known.

In our case, the encoded_class column is an ordinal attribute, where 0–3 represent classes A–D in order of increasing likelihood of bad performance.

Binary Encoding involves converting binary attributes into numerical binary values, while Label Encoding is the conversion of ordinal attributes into numerical ordinal values.
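
One caveat worth knowing: scikit-learn's LabelEncoder assigns codes by sorting the labels, so A, B, C, D map to 0–3 only because alphabetical order happens to match the ranking. When the two don't line up, the order can be stated explicitly. The sketch below swaps in a different technique, an ordered pandas Categorical, on made-up labels; it is an alternative for illustration, not what the code in this post uses.

```python
import pandas as pd

# Hypothetical toy labels, not the real dataset
grades = pd.Series(["C", "A", "D", "B", "A"])

# State the ranking explicitly instead of relying on alphabetical order
ordered = pd.Categorical(grades, categories=["A", "B", "C", "D"], ordered=True)

# Integer codes follow the position in the categories list
codes = ordered.codes.tolist()

print(codes)  # [2, 0, 3, 1, 0]
```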

Conclusion

With data cleaning and feature engineering complete, you’re all set for EDA (Exploratory Data Analysis), which is covered in the next part of this series. Congrats 🎉🎉!

References

https://www.geeksforgeeks.org/data-cleansing-introduction/

https://towardsdatascience.com/what-is-feature-engineering-importance-tools-and-techniques-for-machine-learning-2080b0269f10

Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber & Jian Pei
