ML Refactoring Pt. 1

jedi · Published in Analytics Vidhya · Aug 3, 2020 · 4 min read

I was enticed by the hype of GPT-3 and decided to learn about it.

The first step was to contact a good friend of mine, Haryo Akbarianto Wibowo, who knows a lot about NLP (go check his Medium, LinkedIn, and GitHub pages, I’m not joking), to learn about GPT-3. He immediately said:

‘Don’t rush it. Understand how to solve simpler problems first and work your way up. Just try to finish the PyTorch NLP tutorials.’

To be honest, the entire conversation was much longer than that, since he tends to get overly enthusiastic every time something about NLP comes up. But anyway, I agreed with him, because the PyTorch tutorials were undeniably the most impactful tutorials throughout my journey in computer vision.

And the first tutorial I chose was NLP From Scratch: Classifying Names with a Character-Level RNN, because it was the easiest among them.

But after a quick look, my engineer senses were tingling: I found things that could be improved. Yes, I was perfectly aware that the original tutorial serves a specific purpose, which is to show how to create a basic RNN for classification. But this article serves another purpose: sharing my knowledge about refactoring ML code. To be frank, I haven’t found many people talking about this, which is kinda funny since it’s what I mostly do at work.

So why should we refactor?

  1. To get better results. This is the heart of refactoring. In ML code, this means we want better evaluation / benchmark performance or resource efficiency.
  2. It’s painful to do experiments on the code. The rule of thumb is: how painful would it be to change the dataset, model, and/or parameters? It would be cool if changing the dataset or model were as easy as Thanos’ snap.
  3. It’s painful to reproduce. This is a big problem. Go search it on your favourite search engine.
  4. Efficiency.
  5. The code is ugly.

There are lots and lots of other reasons, but I think in general we refactor because we want to get better results.

So here are the reasons why I wrote this article:

  1. I want to learn about NLP
  2. I want to get a better result than the original implementation
  3. I haven’t written in ages

One more disclaimer: you are highly encouraged to understand the original tutorial first, because my goal is to discuss refactoring, not the RNN model or the preprocessing steps, since the original author has already explained those perfectly.

Without further ado, let’s take a look at the code and form a strategy. You can either fork or clone from here, which is the repo of the tutorial.

This is my go-to strategy:

  1. Fork / clone the original repo
  2. Explore the code
  3. Try to reproduce the result
  4. Add new branch(es)
  5. Improve the code, train, then evaluate again

Explore the code

So after a quick look at the code, I found several major things that could be improved:

  1. There is no evaluation function, which means we don’t know how well the model performs. There is also no validation dataset, so we have to split the full dataset into train and val sets.
  2. The data preprocessing is not wrapped into a Python iterator. This is problematic since it introduces a lot of pain when inspecting and using new datasets of different formats.
  3. For each epoch, the trainer code takes a random sample to train on. To my knowledge, one epoch means using all training samples in a training loop (see the sketch after this list).
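
To illustrate point 3, here’s a minimal sketch of what I mean by one epoch. train_pairs and train_step() are hypothetical stand-ins for the tutorial’s data and training step, not the tutorial’s actual names:

```python
import random

def run_epochs(train_pairs, train_step, n_epochs=10, seed=42):
    # train_pairs: list of (name, language) tuples
    # train_step: a function that runs one gradient update
    rng = random.Random(seed)
    for epoch in range(n_epochs):
        rng.shuffle(train_pairs)            # reshuffle each epoch...
        for name, language in train_pairs:  # ...but visit every sample exactly once
            train_step(name, language)
```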

To be honest, at this point problems 2 and 3 are not that important. What we want now is to evaluate how good this model is, so we should solve the first problem first.

Before that, let’s make a new branch called devel and start working from there.

#splitter.py

These are the highlights of the dataset:

  1. There are 18 languages, each stored in a txt file
  2. Each row in a txt file contains a person’s name
  3. Different languages have different numbers of samples

So the basic strategy is to randomly sample 80% for training and 20% for validation. The exact ratio is totally up to you, but to keep things simple, this is what we’ll do. In a real-world scenario, I’d add one more set, called the test set, and I usually handpick it myself; this ensures that we’re incorporating cases that are relevant to us. Last, we want the split to be consistent, so let’s use a random seed.

Notice at line 31 we have some kind of assertion. Ideally we could turn this into a function and create test cases to assure, at least to some extent, that what we’re doing is right, but that’s too much for now. So let’s use a simple trick: check that the last value in the training set is different from the 0th value in the val set. A minimal sketch of the whole script follows.
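
Here’s a minimal sketch of what such a splitter might look like. The flag names (--data, --output, --ratio, --seed) and the split_file helper are my own illustration, not necessarily what’s in the repo:

```python
# splitter.py: a minimal sketch of the 80/20 splitter described above
# usage (illustrative): python splitter.py --data data/names --output output
import argparse
import random
from pathlib import Path

def split_file(txt_path, out_dir, ratio, rng):
    names = txt_path.read_text(encoding="utf-8").splitlines()
    rng.shuffle(names)
    cut = int(len(names) * ratio)
    train, val = names[:cut], names[cut:]
    # the simple trick from above: last training value != 0th val value
    assert train[-1] != val[0], f"suspicious split for {txt_path.name}"
    for subset, rows in (("train", train), ("val", val)):
        dest = out_dir / subset
        dest.mkdir(parents=True, exist_ok=True)
        (dest / txt_path.name).write_text("\n".join(rows) + "\n", encoding="utf-8")

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", required=True, help="folder containing the 18 *.txt files")
    parser.add_argument("--output", required=True, help="where train/ and val/ will be created")
    parser.add_argument("--ratio", type=float, default=0.8)
    parser.add_argument("--seed", type=int, default=42)
    args = parser.parse_args()

    rng = random.Random(args.seed)  # fixed seed keeps the split consistent
    for txt_path in sorted(Path(args.data).glob("*.txt")):
        split_file(txt_path, Path(args.output), args.ratio, rng)

if __name__ == "__main__":
    main()
```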

So let’s run the script with the correct parameters, and you’ll get train/ and val/ folders under the output directory.

That’s all for now. I hope you guys enjoyed it. Don’t hesitate to give me any kind of feedback!

PS: You can see the current progress on this repository

Next

We’ll try to evaluate the performance of the original model. If you understand metrics for classifiers such as precision, recall, accuracy, and the confusion matrix, you’re good to go. If not, brush up on them; I highly recommend the scikit-learn docs, since the scikit team has explained them wonderfully.
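
If you want a one-minute refresher, here’s a tiny sketch using scikit-learn; the labels below are made up, not taken from the names dataset:

```python
# a made-up toy example, just to recall what each metric reports
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_true = ["English", "French", "English", "German", "French"]
y_pred = ["English", "English", "English", "German", "French"]

print(accuracy_score(y_true, y_pred))         # overall accuracy
print(confusion_matrix(y_true, y_pred))       # rows: true labels, cols: predictions
print(classification_report(y_true, y_pred))  # per-class precision / recall / F1
```

Until next time!!!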
