SWE + Data Science: Refactoring Old Jupyter Notebook Projects

Why we need to take a page out of the software engineering industry’s (note)book…

Shane Austrie
The Urban Nerd

Something that’s not commonly talked about in Data Science, unless you come from a Software Engineering background, is the need to refactor old Jupyter Notebooks.

In the early stages of prototyping (whether it be implementing a new ML model or simply doing data exploration), we all tend to code up quick and dirty solutions. This is perfectly understandable, especially considering the Silicon Valley mindset of “Just get it done.” However, at some point, you’ll need to share these notebooks with other humans — e.g. coworkers and managers — so your code needs to be presentable for both non-engineers and engineers who are not yourself.

As your coding and architecture intuition improve, your hacky solutions will look more and more like industry-level code. Now, some people will say this comes with decades of experience, but I say you can speed up this process by knowing what to look out for.

Descriptive Naming

The simplest thing you can do to increase code readability and understandability is to improve your naming conventions.

Here’s a partial/beginning example of using Jaccard Similarity as a confidence weighting.

Worse

def calculate_similarity(anime_1, anime_2, df):
    df_1 = df[df['anime_id'] == anime_1]
    len_1 = len(df_1)
    df_2 = df[df['anime_id'] == anime_2]
    len_2 = len(df_2)
    df = pd.merge(df_1, df_2, how='inner', on=['username'])
    len_1_2 = len(df)
    if(len_1_2 == 0):
        return (False,)

This is an example of code that’s hard to understand.

  • anime_1 and anime_2 are ambiguous. You can’t tell whether each is a string, a number, or even a DataFrame.
  • df_1 and df_2 are horrible names. Looking at them, I don’t know if they’re copies of each other or if one is a mutation of the other.
  • len_1 and len_2 are vague unless you look back at what the coder took the length of.
  • df = pd.merge... is bad for multiple reasons: later on, I will forget that df is the intersection of the user ratings of anime_1 and anime_2; additionally, it overwrites the original DataFrame, which is usually a problem by itself.
  • A common naming convention in the Data Science community is to call every DataFrame df, but what happens when you start making changes to that one DataFrame, or when you have multiple DataFrame sources in one notebook?

Better

import pandas as pd

def calculate_similarity(anime_1_id, anime_2_id, input_df):
    # Drop incomplete rows first, without touching the caller's DataFrame.
    cleaned_input_df = input_df.dropna()
    anime_1_df = cleaned_input_df[
        cleaned_input_df['anime_id'] == anime_1_id
    ]
    anime_1_df_len = len(anime_1_df)
    anime_2_df = cleaned_input_df[
        cleaned_input_df['anime_id'] == anime_2_id
    ]
    anime_2_df_len = len(anime_2_df)
    # Rows for users who rated both animes.
    anime_1_2_user_intersection_df = pd.merge(
        anime_1_df, anime_2_df, how='inner', on=['username']
    )
    anime_1_2_user_intersection_df_len = len(anime_1_2_user_intersection_df)
    if anime_1_2_user_intersection_df_len == 0:
        # No shared users, so there is nothing to compare.
        return (False,)

With this refactored version of the former code, at any step, you will understand the current context of the code. You may lose a bit of speed as far as implementation time, but gain immensely in understandability — your future self will thank you.
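
The function above stops before the Jaccard Similarity itself is computed. As a rough sketch of where those three lengths could lead, the snippet below computes the Jaccard Similarity of the users who rated each anime. It is an illustration under assumptions, not the original implementation: plain Python sets stand in for the username columns, and the function names and sample usernames are hypothetical.

```python
def jaccard_similarity(anime_1_usernames, anime_2_usernames):
    """Share of users who rated both animes, out of users who rated either."""
    anime_1_users = set(anime_1_usernames)
    anime_2_users = set(anime_2_usernames)
    shared_users = anime_1_users & anime_2_users
    all_users = anime_1_users | anime_2_users
    if not all_users:
        # Neither anime has ratings, so there is no overlap to measure.
        return 0.0
    return len(shared_users) / len(all_users)


def jaccard_from_lengths(anime_1_len, anime_2_len, intersection_len):
    """Same value, computed from the three lengths used in the article's code.

    Assumes each user rated each anime at most once.
    """
    union_len = anime_1_len + anime_2_len - intersection_len
    if union_len == 0:
        return 0.0
    return intersection_len / union_len


# Hypothetical sample data: three raters per anime, two of them shared.
anime_1_usernames = ["ann", "bob", "cat"]
anime_2_usernames = ["ann", "bob", "dan"]

overlap_from_sets = jaccard_similarity(anime_1_usernames, anime_2_usernames)
overlap_from_lengths = jaccard_from_lengths(3, 3, 2)
print(overlap_from_sets, overlap_from_lengths)
```

Because every intermediate value has a descriptive name, the formula (shared users divided by all users) can be read straight off the code.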

Remove unnecessary code

A common concept in the software industry is DRY’ing your code (also phrased as “keeping your code DRY”), with DRY standing for Don’t Repeat Yourself. If you write the same algorithm multiple times throughout your code base, that algorithm should become a function that you simply call.

The following is an example of an algorithm used to calculate cosine similarity.

Worse

anime_id = 10
num_related = 100
numerator = self.factors.dot(self.factors[anime_id])
denominator = (
    np.linalg.norm(self.factors, axis=1) * np.linalg.norm(self.factors[anime_id])
)
scores = numerator / denominator
best = np.argpartition(scores, -num_related)[-num_related:]
result = sorted(zip(best, scores[best]), key=lambda x: -x[1])

...

anime_id = 22
num_related = 200
denominator = (
    np.linalg.norm(self.factors, axis=1) * np.linalg.norm(self.factors[anime_id])
)
numerator = self.factors.dot(self.factors[anime_id])
scores = numerator / denominator
best = np.argpartition(scores, -num_related)[-num_related:]
result = sorted(zip(best, scores[best]), key=lambda x: -x[1])

...

anime_id = 1
num_related = 150
denominator = (
    np.linalg.norm(self.factors, axis=1) * np.linalg.norm(self.factors[anime_id])
)
scores = numerator / denominator
numerator = self.factors.dot(self.factors[anime_id])
best = np.argpartition(scores, -num_related)[-num_related:]
result = sorted(zip(best, scores[best]), key=lambda x: -x[1])

  1. If you wanted to change the recommendation algorithm from cosine similarity to a simple dot product or Euclidean distance, you would have to edit the algorithm in numerous locations. Imagine when your code base is the size of a mini-startup: it’ll be a nuisance to work with.
  2. You probably didn’t notice that the third copy of the algorithm has an error in it: scores is computed before numerator is reassigned. (Hint: if that copy ran in its own function, Python would raise an UnboundLocalError. In a notebook it is worse, because the line silently reuses the numerator left over from the previous copy.)
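
The second point can be reproduced in isolation. The sketch below is a stripped-down illustration with hypothetical names and made-up numbers, not the recommender code itself: it shows how the same out-of-order lines fail loudly inside a function but fail silently at notebook level.

```python
def third_copy_in_its_own_function(factor):
    # numerator is assigned later in this function, so Python treats it
    # as a local name; reading it here happens before it is bound.
    scores = numerator / 2.0
    numerator = factor * 3.0
    return scores


try:
    third_copy_in_its_own_function(1.0)
    error_name = None
except UnboundLocalError as error:
    error_name = type(error).__name__

# At notebook (module) level the same ordering raises nothing at all.
numerator = 10.0               # left over from the previous copy
stale_scores = numerator / 2.0  # quietly uses the old numerator
numerator = 30.0               # the value that should have been used
correct_scores = numerator / 2.0

print(error_name, stale_scores, correct_scores)
```

The silent case is the dangerous one: the cell runs, a result comes back, and nothing signals that it belongs to a different anime.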

Better

def get_related_cosine(self, anime_id, num_related):
    # Score every anime against anime_id using cosine similarity.
    numerator = self.factors.dot(self.factors[anime_id])
    denominator = (
        np.linalg.norm(self.factors, axis=1) *
        np.linalg.norm(self.factors[anime_id])
    )
    scores = numerator / denominator
    # Pick the num_related highest scores, then sort them best-first.
    best = np.argpartition(scores, -num_related)[-num_related:]
    return sorted(zip(best, scores[best]), key=lambda x: -x[1])

result = self.get_related_cosine(
    anime_id=10,
    num_related=100
)

...

result = self.get_related_cosine(
    anime_id=22,
    num_related=200
)

...

result = self.get_related_cosine(
    anime_id=1,
    num_related=150
)
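
Once the logic lives in a single function, swapping the similarity measure becomes a one-line change at the call site. The following is a minimal sketch of that idea, assuming a small list of factor vectors instead of the article's self.factors matrix. It uses only the standard library so it runs anywhere, and every name in it (get_related, toy_factors, and so on) is hypothetical.

```python
import math


def dot_product(vector_a, vector_b):
    return sum(a * b for a, b in zip(vector_a, vector_b))


def cosine_similarity(vector_a, vector_b):
    # Assumes neither vector is all zeros.
    norm_a = math.sqrt(dot_product(vector_a, vector_a))
    norm_b = math.sqrt(dot_product(vector_b, vector_b))
    return dot_product(vector_a, vector_b) / (norm_a * norm_b)


def get_related(factors, anime_id, num_related, similarity=cosine_similarity):
    """Return the num_related best (anime_id, score) pairs, best first."""
    target = factors[anime_id]
    scored = [
        (other_id, similarity(vector, target))
        for other_id, vector in enumerate(factors)
    ]
    scored.sort(key=lambda pair: -pair[1])
    return scored[:num_related]


# Made-up factor vectors for four animes.
toy_factors = [
    [1.0, 0.0],
    [2.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]

by_cosine = get_related(toy_factors, anime_id=0, num_related=2)
by_dot = get_related(toy_factors, anime_id=0, num_related=2, similarity=dot_product)
print(by_cosine)
print(by_dot)
```

As in the article's version, the queried anime is scored against itself and can appear in its own results; filtering it out would be one more change made in exactly one place.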

Add comments and an executive summary

To make your code even more understandable, beyond having a good naming convention, simply adding comments helps!

Average

for anime_2 in ns.animes:
    if anime_1 == anime_2:
        break
    else:
        similarity_distance = calculate_similarity(anime_1, anime_2, df)
        if similarity_distance[0]:
            current_weights_matrix = current_weights_matrix.append(
                [
                    pd.Series(
                        [anime_1, anime_2, similarity_distance[1]],
                        index=['anime_1', 'anime_2', 'weight']
                    ),
                    pd.Series(
                        [anime_2, anime_1, similarity_distance[2]],
                        index=['anime_1', 'anime_2', 'weight']
                    )
                ],
                ignore_index=True
            )

The code is understandable for the most part, thanks to having descriptive naming, but comments for the non-intuitive parts can increase readability even for non-engineers. For stylistic and logical reasons, it’s good to mix up one-line comments, medium-sized comments, and large comments.

Better

# Instead of going from 0 to N (number of animes),
# we decide to be more efficient in this for-loop
# by going from 0 to A(i) (the current anime's index).
# Over the whole dataset (not just this for-loop),
# this cuts the work from N^2 comparisons
# to roughly (N^2)/2.
# We also skip over doing computation
# of an anime in comparison to itself,
# so the total for the overall dataset is
# (N^2 - N)/2 comparisons.
for anime_2 in ns.animes:
    if anime_1 == anime_2:
        # Skip over comparing an anime to itself. Because if you're
        # already watching anime_1, then why should we
        # continue recommending it to you.
        # Additionally, breaking is what gets us down to (N^2 - N)/2.
        break
    else:
        similarity_distance = calculate_similarity(anime_1, anime_2, df)
        if similarity_distance[0]:
            current_weights_matrix = current_weights_matrix.append(
                [
                    pd.Series(
                        [anime_1, anime_2, similarity_distance[1]],
                        index=['anime_1', 'anime_2', 'weight']
                    ),
                    pd.Series(
                        [anime_2, anime_1, similarity_distance[2]],
                        index=['anime_1', 'anime_2', 'weight']
                    )
                ],
                ignore_index=True
            )
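
A claim about runtime is exactly the kind of comment worth verifying. The sketch below is a toy check, assuming the loop above sits inside an outer loop over the same list of unique animes. It counts how many comparisons the break-based loop performs and compares that with N(N-1)/2; the function name and sample list are hypothetical.

```python
from itertools import combinations


def count_pair_comparisons(animes):
    """Count how often the inner loop body runs before hitting the break."""
    comparisons = 0
    for anime_1 in animes:
        for anime_2 in animes:
            if anime_1 == anime_2:
                # Same stopping rule as the article's loop.
                break
            comparisons += 1
    return comparisons


animes = ["a", "b", "c", "d", "e", "f"]
n = len(animes)

with_break = count_pair_comparisons(animes)
all_pairs = n * n
unique_pairs = len(list(combinations(animes, 2)))
print(with_break, unique_pairs, all_pairs)
```

For six animes the break-based loop does 15 comparisons instead of 36, which matches N(N-1)/2. In strict Big O terms both are still quadratic, but halving the work matters when each comparison is a DataFrame merge.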

Maybe you’ve been working on a new feature that increases the Click-Through-Rate by 10%. You’ll definitely be presenting this at the next team meeting and maybe even sending the notebook to a few higher-ups 🙏.

Even though they’ll only be skimming the notebook’s code, an executive summary and a few visualizations will definitely help them understand your notebook to its fullest.

Here’s an executive summary example from the last iteration of my anime recommender system (don’t worry, I now use latent features, cosine similarity, and clustering, and have gotten execution time from hours to minutes ⌚️):

Executive Summary

Implementing this algorithm for our specific dataset (5 GB) proved difficult. Days were spent on improving training execution speed: optimizing the algorithm (roughly (N² - N)/2 comparisons instead of N²), overwriting variables (a way of forcing garbage collection early so we don’t have to wait for the item to fall out of scope), and trying out several big data tools, specifically Dask and Modin. (They actually had an error in their code at the time, so I had to go into the source code and limit max RAM usage to 40%, which is obviously another issue by itself, since we need a lot of RAM for this dataset.) However, we saw that those tools added their own performance overhead, and their main benefit really only comes in handy when the dataset can’t be read into memory at all or when a company has a cluster of computers/instances. Neither case was true for us: we upgraded our instance hardware specs on Amazon’s EC2 so the DataFrame could be read into memory comfortably, and we were only using a single machine/instance.

NOTE: You may notice that a lot of the data analysis steps in preprocessing have been left out. I chose to remove those steps while cleaning up this notebook, because they were hindering my iteration time on this algorithm.

Also, you may wonder why my preprocess function is so large. I did this because, again, I was optimizing for memory. All the variables/objects it took to arrive at the final training and testing sets get garbage collected a lot earlier if you put them in a function, because they fall out of the current scope faster. I'll be breaking up that function into smaller functions in future iterations.

More information on the algorithm:

  • Overall, this recommender system uses item-to-item similarity as its goal and matrix factorization as its methodology.
  • Uses Euclidean Distance as its basis for comparison instead of cosine similarity, because we want to know how much better or worse another anime is, not just how different it is.
  • Uses Euclidean Distance normalized by Jaccard Similarity in order to prevent the small number of ratings that two animes have in common from being too deterministic. If the animes have a large percentage of their reviews in common, Jaccard Similarity will be closer to 1, which basically indicates that we should trust the Euclidean Distance.

Creator: Shane Austrie, ML Engineer specializing in Personalization and Recommender Systems [www.shaneaustrie.com](www.shaneaustrie.com)
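
The memory argument in the summary (intermediate objects get released sooner when they live inside a function) can be checked on a small scale. The sketch below is only an illustration under assumptions: IntermediateTable is a hypothetical stand-in for a large intermediate DataFrame, and the immediate release relies on CPython's reference counting, so other interpreters may free the object later.

```python
import gc
import weakref


class IntermediateTable:
    """Hypothetical stand-in for a large intermediate DataFrame."""

    def __init__(self, rows):
        self.rows = rows


def preprocess(raw_rows):
    # The intermediate object exists only inside this function.
    intermediate = IntermediateTable([row * 2 for row in raw_rows])
    # A weak reference lets us observe the object without keeping it alive.
    tracker = weakref.ref(intermediate)
    training_rows = [row for row in intermediate.rows if row % 4 == 0]
    return training_rows, tracker


training_rows, tracker = preprocess(range(10))

# CPython frees the object as soon as preprocess returns; the explicit
# collection is there for interpreters that clean up lazily.
gc.collect()
intermediate_was_freed = tracker() is None
print(training_rows, intermediate_was_freed)
```

Only the returned training rows survive the call, which is the effect the summary describes for the large preprocess function.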

Here are some additional tips that weren’t included in this post:

  • Breaking long methods into smaller ones
  • Utilizing native and built-in functions
  • Implementing modules and classes (yes, OOP is even important in Python).

For more articles covering coding, music, dating, and the overall urban nerd lifestyle follow me or The Urban Nerd publication!


Gen Z AdTech Expert | ML/AI Consultant | SiliconValleyConsulting.io | Casual writer about techy & non-techy things | Connect with me on LinkedIn!