5 Steps to Transform Messy Functions into Production-Ready Code
The Data Scientist’s Guide to Scalable and Maintainable Functions
Motivation
Functions are essential in a data science project because they make the code more modular, reusable, readable, and testable. However, writing a messy function that tries to do too much can introduce maintenance hurdles and diminish the code’s readability.
In the following code, the function impute_missing_values is long, messy, and tries to do many things. Because the column names and fill values are hard-coded, someone else could not reuse this function on a DataFrame with different column names without rewriting it.
def impute_missing_values(df):
    # Fill missing values with group statistics
    df["MSZoning"] = df.groupby("MSSubClass")["MSZoning"].transform(
        lambda x: x.fillna(x.mode()[0])
    )
    df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
        lambda x: x.fillna(x.median())
    )
    # Fill missing values with constant
    df["Functional"] = df["Functional"].fillna("Typ")
    df["Alley"] = df["Alley"].fillna("Missing")
    for col in ["GarageType", "GarageFinish", "GarageQual", "GarageCond"]:
        df[col] = df[col].fillna("Missing")
    for col in ("BsmtQual", "BsmtCond", "BsmtExposure"…