ProTip: Stay away from Coding Sporks
Are you familiar with the spork? Hikers, at least, have likely come across this handy item once or twice. It’s a single tool that can function as any of the three main cutlery utensils: a spoon, a fork, or a knife.
While cool, it has its drawbacks. For one thing, you can only use one of the three utensils at a time, so the common knife-and-fork combo is not an option. Also, the constraints of combining everything into a single utensil warp the individual tools: the knife is quite small, and when you use the fork you risk pricking your mouth on the serrated side. These properties make the spork OK as a Plan B but not a great Plan A. I tend to pack one just in case, but my backpack also carries the usual utensil trio, which is what I use nine times out of ten.
So, that’s the analogy — now let’s move on to what it’s an analogy for, shall we?
In my career in R&D, I’ve spent time on both the R and the D sides of the aisle. One thing I’ve noticed is how researchers tend to write “spork” code: long sequential scripts dedicated and optimized to the specific task at hand. Give that code to a software engineer, however, and they will suggest breaking it into a set of small “lego blocks” that can be put together and used in multiple ways, where the current use is just one example. In my experience, researchers will usually shrug, unconvinced. They will say that the code is there to analyze data, and if the data analysis works out fine, it’s a waste of time to obsess over the structure of the code.
In what follows, I’ll try to convince the researchers in the audience that the engineers are on to something here. To do this, I’ll first walk you through an example of how to take spork-ish code and modify it to be more lego-like. Then I’ll demonstrate why the latter is better: better for the code, and better for the researcher.
Presenting: A Snippet of Spork Coding
Let’s take a look at the following snippet of spork-ish code. For the purpose of this exercise, I took a segment of code from a Kaggle notebook found online and modified much of it while keeping the basic structure in place.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
import shap
train = pd.read_csv("/data/train.csv")
def feat_analysis(target, plottype):
    train_c = train.copy()
    train_c[f"{target}_binary"] = train_c["label"].apply(lambda x: 1 if x == target else 0)
    features = [col for col in train.columns if col not in ["label", f"{target}_binary"]]
    model = CatBoostClassifier(random_state=101, silent=True)
    model.fit(train_c[features], train_c[f"{target}_binary"])
    skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=50)
    cv_score = cross_val_score(CatBoostClassifier(random_state=1, silent=True),
                               X=train_c[features], y=train_c[f"{target}_binary"],
                               scoring="roc_auc", cv=skf).mean()
    if plottype == "importance":
        importances = model.feature_importances_
        indices = importances.argsort()[::-1]
        selected_features = [features[i] for i in indices][::-1]
        importances = importances[indices][::-1]
        plt.figure(figsize=(16, 9))
        plt.barh(range(len(importances)), importances,
                 color=sns.light_palette("seagreen",
                                         n_colors=len(importances)))
        plt.yticks(range(len(importances)), selected_features)
        plt.xlabel("Feat. Importance")
        plt.title(f"Feature Importance of {target}, score = {cv_score}")
        plt.show()
    elif plottype == "shap":
        shap.initjs()
        explainer = shap.TreeExplainer(model)
        shap_values = explainer.shap_values(train[features])
        shap.summary_plot(shap_values, features=train[features],
                          feature_names=train[features].columns,
                          max_display=len(train[features].columns))

sicknesses = list(train["label"].unique())
feat_analysis(sicknesses[0], "importance")
feat_analysis(sicknesses[0], "shap")
The code above follows a standard data-analysis flow, so we will add some comments indicating the different steps. While we’re at it, let’s add the variable train to the function signature of feat_analysis. This will be helpful since (a) we can rid ourselves of the train_c variable inside the function; (b) we can move reading the data down to the end of the file, with the rest of the execution flow; and, most importantly, (c) we can rid ourselves of the evil global variable.
After these changes we get:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
import shap
def feat_analysis(train, target, plottype):
    train = train.copy()  # don't change the original
    # data munging / wrangling
    train[f"{target}_binary"] = train["label"].apply(lambda x: 1 if x == target else 0)
    # feature selection
    features = [col for col in train.columns if col not in ["label", f"{target}_binary"]]
    # model building and evaluation
    model = CatBoostClassifier(random_state=101, silent=True)
    model.fit(train[features], train[f"{target}_binary"])
    skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=50)
    cv_score = cross_val_score(CatBoostClassifier(random_state=1, silent=True),
                               X=train[features], y=train[f"{target}_binary"],
                               scoring="roc_auc", cv=skf).mean()
    # plotting
    if plottype == "importance":
        importances = model.feature_importances_
        indices = importances.argsort()[::-1]
        selected_features = [features[i] for i in indices][::-1]
        importances = importances[indices][::-1]
        plt.figure(figsize=(16, 9))
        plt.barh(range(len(importances)), importances,
                 color=sns.light_palette("seagreen",
                                         n_colors=len(importances)))
        plt.yticks(range(len(importances)), selected_features)
        plt.xlabel("Feat. Importance")
        plt.title(f"Feature Importance of {target}, score = {cv_score}")
        plt.show()
    elif plottype == "shap":
        shap.initjs()
        explainer = shap.TreeExplainer(model)
        shap_values = explainer.shap_values(train[features])
        shap.summary_plot(shap_values, features=train[features],
                          feature_names=train[features].columns,
                          max_display=len(train[features].columns))

# get dataset
train = pd.read_csv("/data/train.csv")
# let's run it!
sicknesses = list(train["label"].unique())
feat_analysis(train=train, target=sicknesses[0], plottype="importance")
feat_analysis(train=train, target=sicknesses[0], plottype="shap")
Next, to the main event: let’s extract some sub-functions out of the main code, one function for each of the sections we just commented — munging, feature selection, and so on. These are going to be our lego pieces:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
import shap
def feat_analysis(train, target, plottype):
    train = train.copy()  # don't change the original
    munge_data(target, train)
    features = select_features(target, train)
    model, cv_score = fit_and_evaluate(features, target, train)
    if plottype == "importance":
        plot_importance(model, features, cv_score, target)
    elif plottype == "shap":
        plot_shap(model, features, train, target)

def munge_data(target, train):
    train[f"{target}_binary"] = train["label"].apply(lambda x: 1 if x == target else 0)

def select_features(target, train):
    features = [col for col in train.columns if col not in ["label", f"{target}_binary"]]
    return features

def fit_and_evaluate(features, target, train):
    model = CatBoostClassifier(random_state=101, silent=True)
    model.fit(train[features], train[f"{target}_binary"])
    skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=50)
    cv_score = cross_val_score(CatBoostClassifier(random_state=1, silent=True),
                               X=train[features], y=train[f"{target}_binary"],
                               scoring="roc_auc", cv=skf).mean()
    return model, cv_score

def plot_importance(model, features, cv_score, target):
    importances = model.feature_importances_
    indices = importances.argsort()[::-1]
    selected_features = [features[i] for i in indices][::-1]
    importances = importances[indices][::-1]
    plt.figure(figsize=(16, 9))
    plt.barh(range(len(importances)), importances,
             color=sns.light_palette("seagreen",
                                     n_colors=len(importances)))
    plt.yticks(range(len(importances)), selected_features)
    plt.xlabel("Feat. Importance")
    plt.title(f"Feature Importance of {target}, score = {cv_score}")
    plt.show()

def plot_shap(model, features, train, target):
    shap.initjs()
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(train[features])
    shap.summary_plot(shap_values, features=train[features],
                      feature_names=train[features].columns,
                      max_display=len(train[features].columns))

# get dataset
train = pd.read_csv("/data/train.csv")
# let's run it!
sicknesses = list(train["label"].unique())
feat_analysis(train=train, target=sicknesses[0], plottype="importance")
feat_analysis(train=train, target=sicknesses[0], plottype="shap")
Let’s take a moment to consider this version, in which all we did was move the existing code into dedicated functions. Our new version is much improved in multiple ways!
Impact #1: Easy to follow & orient
In the original version of the code, the solitary function was over 30 lines long, and you had to read it carefully to figure out what it was doing. In the new version, reading just nine lines gives you a clear and concise sense of what is going on:
def feat_analysis(train, target, plottype):
    train = train.copy()  # don't change the original
    munge_data(target, train)
    features = select_features(target, train)
    model, cv_score = fit_and_evaluate(features, target, train)
    if plottype == "importance":
        plot_importance(model, features, cv_score, target)
    elif plottype == "shap":
        plot_shap(model, features, train, target)
The importance of this improved readability alone cannot be overstated.
- It’s very easy to understand what the code does, even if you have never seen it before. We even removed the comments we added earlier; the function names are so clear that the reader needs no additional cues to understand what’s happening.
- It’s easy to navigate to the portion of the code you are interested in (e.g., “where can I find the place where features are selected?”).
- If the code crashes due to a bug, the enclosing function makes it easy to tell in which phase of the pipeline the bug lives (e.g., “building the model is causing a problem”). It also helps in deciding where to place debug prints and breakpoints.
Impact #2: Easy to update & modify
Improving the readability of the code also enables you to quickly update and modify it when needed. Since each section of the code lives in its own function, you know exactly what each variable is used for and what it can affect. There is less room for concern that one change will ripple far and wide, unnoticed until it’s too late.
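To make this concrete, suppose we later want a third kind of plot. Here is a minimal sketch of what that change could look like (the plot_correlations function and the "correlations" plot type are my hypothetical additions, not part of the original notebook; it reuses the imports already at the top of the file). We add one new lego block and one new branch, without touching the munging or modeling code at all:
def plot_correlations(features, train):
    # hypothetical new lego block: correlation heatmap of the selected features
    plt.figure(figsize=(16, 9))
    sns.heatmap(train[features].corr(), cmap="coolwarm", center=0)
    plt.title("Feature Correlations")
    plt.show()

def feat_analysis(train, target, plottype):
    train = train.copy()  # don't change the original
    munge_data(target, train)
    features = select_features(target, train)
    model, cv_score = fit_and_evaluate(features, target, train)
    if plottype == "importance":
        plot_importance(model, features, cv_score, target)
    elif plottype == "shap":
        plot_shap(model, features, train, target)
    elif plottype == "correlations":  # the only existing function we touch
        plot_correlations(features, train)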
Impact #3: Easy to track information flow
Each function you extract has a signature, which lets us see what information each phase uses. For example, it’s easy to see just from the signatures that target is also used when selecting features, which makes sense: you don’t want your dependent variable used as a feature. Seeing that in the function’s signature can help reassure you that the code makes sense.
In other words, externalizing the information each phase uses gives us a high-level view of the logic we’re applying, and helps us zoom in on suspicious components as well.
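If you want this information flow to be even more explicit, type hints are a natural next step. A quick sketch of what the signatures could look like (the hints are my addition, not in the original code; the generic syntax assumes Python 3.9+):
def munge_data(target: str, train: pd.DataFrame) -> None:
    ...

def select_features(target: str, train: pd.DataFrame) -> list[str]:
    ...

def fit_and_evaluate(features: list[str], target: str,
                     train: pd.DataFrame) -> tuple[CatBoostClassifier, float]:
    ...
Now the signatures alone tell you that munging mutates the frame in place and returns nothing, while the modeling phase consumes the feature list and hands back a fitted model together with a score.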
Impact #4: Code Reuse and Extendibility
The final major benefit of this restructuring is that each of these blocks of code is now reusable anywhere in your code base. For example, let us consider the last function we created:
def plot_shap(model, features, train, target):
    shap.initjs()
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(train[features])
    shap.summary_plot(shap_values, features=train[features],
                      feature_names=train[features].columns,
                      max_display=len(train[features].columns))
When looking at this function in isolation, it’s clear that there is very little (if anything) in this code that is specialized for this specific dataset or analysis. Rather, it would be quite nice if this plotting logic were available to you and others when using SHAP values in other projects.
By extracting the logic into its own block and wrapping it in a function, you also turn it into a standalone tool. Thanks to its narrower focus, it can actually be used much more broadly (just as a fork is useful in more cases than a spork precisely because it has fewer capabilities).
To state this more generally: by breaking the code into multiple sub-functions, you trade a single combo-tool that can only be used in this project (= spork) for an entire toolbox of dedicated tools that can be used broadly (= standard utensils).
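For instance, here is a sketch of how the very same plot_shap could be dropped into an unrelated project (the housing dataset, its columns, and the file path are all made up for illustration; the imports are the same ones from the top of the file):
# a different project, a different dataset -- same plotting tool
houses = pd.read_csv("/data/houses.csv")  # hypothetical dataset
houses["expensive"] = (houses["price"] > houses["price"].median()).astype(int)
house_features = [c for c in houses.columns if c not in ["price", "expensive"]]
house_model = CatBoostClassifier(random_state=7, silent=True)
house_model.fit(houses[house_features], houses["expensive"])
plot_shap(house_model, house_features, houses, target="expensive")
Nothing about plot_shap had to change; the dedicated tool travels with you from project to project.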
Summary, and: how important is this really?
In this post we focused on a single change to the code: breaking it into self-contained logical chunks. We saw how even this small change enables us to work better and smarter. The change requires very little effort, yet its ROI is quite high.
Well — usually. In some cases you might find that breaking the code into such chunks is complicated. If that’s the case and you’re in a time crunch, then sure, don’t spend time on it right away. However, such “spaghetti code” that does not break apart easily is usually a sign that your code is very hard to follow and manage, and that will come back to bite you one day.
In other words, with such code you’re likely to find yourself banging your head against the wall in the future, wishing you had cleaned it up back when you still knew what it was all about. So, if you find that even this small change isn’t easy to make, consider it a big warning sign. Even the mental exercise of checking whether you could make such a change easily will help you evaluate where you are at risk.
It’s a well-known adage that “code is read much more than it is written”, and in many cases that future reader of your code is you, six months from now, when you no longer remember exactly what you were doing. So: take care of your future self!
Finally: Buy a Spork. It’s not good for much, but it’s still cool to have one around, just in case… :)