What did I learn from my first Kaggle competition?

Steve Tan
8 min read · Mar 23, 2019

Ever since I started learning data analytics and machine learning, competing on Kaggle has seemed like the gold standard. So when the opportunity came up recently with the Shopee National Data Science Competition 2019, I signed up as a beginner and dived head first into the deep end. Below are some of my notes.

25 minutes before my first competition is officially over

The competition involved using product titles to predict the category each item belongs to, and we were given three main categories of items, namely mobile, fashion and beauty products. Additionally, each item's image file was also given, probably as a hint to us that image learning would be required, all 50GB of it!

I didn't have a team, so I was randomly assigned one. From day one, the brains behind my team, M, came up with our first solution, using an SGD classifier and a count vectorizer to handle the prediction, and before all the big guns started jumping into the competition, we found ourselves in the unlikely #1 spot on the leaderboard for a little while.
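For reference, here's a rough sketch of what that kind of baseline looks like. The column names, parameters and train/validation split are my own illustrative assumptions, not M's actual code:

# Baseline sketch: CountVectorizer + SGDClassifier on the product titles.
# Column names ("title", "Category") and parameters are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

df = pd.read_csv("train.csv")  # hypothetical training file with title + category columns
X_train, X_val, y_train, y_val = train_test_split(
    df["title"], df["Category"], test_size=0.2, random_state=42)

baseline = Pipeline([
    ("vect", CountVectorizer()),                             # bag-of-words on the raw titles
    ("clf", SGDClassifier(max_iter=1000, random_state=42)),  # linear model trained with SGD
])
baseline.fit(X_train, y_train)
print("Validation accuracy:", baseline.score(X_val, y_val))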

Could it be this easy to win a Kaggle competition? — it turns out not really :(

After that initial elation, I googled more text classification methods and went through a lot of trial and error with TF-IDF, logistic regression, XGBoost (briefly), SVM and Naive Bayes, but most of these models were a quick hit and miss: I didn't spend much time optimizing once the initial run returned a CV result far below what I was getting with SGD. In hindsight, I should have had a better strategy for deciding which models to spend time on, because building and validating a model can take a lot of trial and error just to get it running correctly, and if you're randomly trying a large model, the training time can be a real killer. I can't describe how sucky it feels to have wasted 20+ hours training a model just to see the result get squashed to pieces because I didn't set the hyperparameters right or there was a bug in the code.

So here comes lesson #1: start small until you have completed and verified the whole run, and only if the CV score is better than your best model's should you invest more time in training a full-fledged model. For example, instead of spending 10 hours training on the whole 160K rows of the sample and doing a five-fold CV run, test on a 0.1% slice of it, as long as it completes without too much lag time, and compare the accuracy against your best model trained on the same sample size.
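To make that concrete, here's a rough sketch of the kind of quick sanity check I mean: sample a small slice of the data, run a fast five-fold CV over a couple of candidate pipelines (the ones I tried were along these lines), and only scale up whatever beats your current best. The column names, sample fraction and exact models are illustrative assumptions:

# Quick sanity check on a small sample before committing hours to a full run.
# Column names, sample fraction and candidate models are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv("train.csv")                    # hypothetical training file
small = df.sample(frac=0.01, random_state=42)    # a small slice of the ~160K rows

candidates = {
    "count + sgd": Pipeline([("vect", CountVectorizer()),
                             ("clf", SGDClassifier(random_state=42))]),
    "tfidf + logreg": Pipeline([("vect", TfidfVectorizer()),
                                ("clf", LogisticRegression(max_iter=1000))]),
    "tfidf + naive bayes": Pipeline([("vect", TfidfVectorizer()),
                                     ("clf", MultinomialNB())]),
}

# compare five-fold CV accuracy on the same small sample for each candidate
for name, pipe in candidates.items():
    scores = cross_val_score(pipe, small["title"], small["Category"], cv=5)
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")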

For collaboration tools, GitHub and Google Colab are a godsend. With Google Colab, you get a full-fledged Jupyter notebook with access to a GPU and TPU, which can be linked to your Google Drive for file access. Plus, the notebook can be shared with your teammates, so everyone can work on or check your code in real time. Initially, I ran the code on my local notebook with my puny GPU, but I soon realized that Colab was much more flexible, since you can spawn a working notebook with nothing more than your browser. If you're new to Google Colab and would like to learn how to use it with code you found in someone's GitHub account, feel free to follow the simple tutorial here.
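If you're wondering about the Drive link, mounting it inside a Colab notebook is just a couple of lines; after the authorization prompt, your Drive shows up as a normal folder:

# Mount Google Drive inside Colab so notebooks, data and saved models persist.
from google.colab import drive
drive.mount('/content/drive')

# After authorizing, Drive is available as a regular path, e.g.
# /content/drive/My Drive/GitHub/nsdc_beginner/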

Healthy diet, good rest and some tips to avoid becoming like this guy while waiting for your model to converge

With regards to working with heavy processing code, here comes lesson #2: make sure you include progress-check code, or use something like tqdm, so that you know the code is actually running. I wasted another half a day running code to stem the data frame, which never completed, or almost completed? Either way, I wouldn't know, because there was no progress being reported in the background… Besides that, it's worthwhile to break the code down into different sections so that you can run them in parallel using multiple copies of Google Colab :)
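For example, here's roughly how I would wire tqdm into that stemming step today; the column name and the choice of stemmer are assumptions for illustration:

# Stemming with a visible progress bar via tqdm's pandas integration.
# The "title" column and the Snowball stemmer are illustrative assumptions.
import pandas as pd
from tqdm import tqdm
from nltk.stem.snowball import SnowballStemmer

tqdm.pandas()                          # adds DataFrame/Series .progress_apply
stemmer = SnowballStemmer("english")

def stem_title(text):
    return " ".join(stemmer.stem(tok) for tok in text.split())

df = pd.read_csv("train.csv")          # hypothetical training file
df["title_stemmed"] = df["title"].progress_apply(stem_title)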

And one more thing: if it's going to take a while to run the code, you want to make sure all that time doesn't go to waste because of something like Google Colab crashing (usually due to memory depletion after you exhaust the 12GB given to you) or timing out when Colab notices you haven't been active for a while. It's good to dump the model and its weights from the training into a physical file, for example with pickle. For a quick reference, see the notebook that I used for the competition here. And here's the relevant snippet:

import os
import pickle
import datetime as dt

# timestamped folder on Google Drive so each training run keeps its own artifacts
model_path = "/content/drive/My Drive/GitHub/nsdc_beginner/ModelLogR_F5_{}".format(
    dt.datetime.today().strftime('%Y%m%d%H%M'))
if not os.path.exists(model_path):
    os.mkdir(model_path)

# save the model to disk
filename = 'finalized_model.sav'
pickle.dump(Fo_logr, open(os.path.join(model_path, filename), 'wb'))

# save the vectorizer as well
v_filename = "vector.pkl"
pickle.dump(F5_vect, open(os.path.join(model_path, v_filename), "wb"))

print("Saved the model and the vectorizer to disk")

And later, load the model and the vectorizer back from disk like this:

# load the model from disk
F_Logr = pickle.load(open(os.path.join(model_path, filename), 'rb'))
F_Logr

# load the vectorizer from disk
F_vec = pickle.load(open(os.path.join(model_path, v_filename), "rb"))
F_vec

I started using Google Cloud once I had to use a CNN to train on the images, moving from the freemium Colab to the now-expensive Google Compute, which I wouldn't dare leave on and running throughout the night as every cent keeps ticking away. If you're new to Google Cloud, here are some tutorials that I did for a previous project.

P.S.: from reading the notes of the top performers in the contest, it appears that an even better method would be to run the heavy processing directly in a Kaggle kernel.

I didn't manage to get any breakthrough from the text classification models, so I thought of using the images to get better accuracy. From the category split, Fashion had the lowest accuracy (around 65%, versus 75% for Beauty and 82% for Mobile), so it seemed like the best place to put our effort. The image data was huge, but once you set up a proper Google Cloud VM instance with a massive GPU and plenty of RAM (a P100 GPU and 90GB of RAM in my case), the rest is just typing as fast as you can to complete the code and letting it train while you do some background praying for better accuracy. As it was expensive to keep the Google Cloud instance running, a better way is to test everything in Google Colab as mentioned above and leave the heavy-lifting training to Google Cloud only when necessary. Otherwise, you end up burning a few hundred dollars (luckily trial credit: Google Cloud gives you $300 when you sign up) to get subpar results after a few nights of bad sleep.
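For context, here's a rough sketch of the kind of transfer-learning setup I was fumbling towards, not the actual competition code: a pretrained backbone from Keras with a small classification head on top. The directory layout, image size and number of Fashion sub-categories are all illustrative assumptions:

# Rough transfer-learning sketch (not the actual competition code):
# frozen pretrained ResNet50 backbone plus a small dense head for the Fashion sub-categories.
# Directory layout, image size and class count are illustrative assumptions.
import tensorflow as tf

NUM_CLASSES = 14                     # hypothetical number of Fashion sub-categories
IMG_SIZE = (224, 224)

train_ds = tf.keras.utils.image_dataset_from_directory(
    "fashion_images/train", image_size=IMG_SIZE, batch_size=64)

base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False               # freeze the backbone, train only the new head first

inputs = tf.keras.Input(shape=IMG_SIZE + (3,))
x = tf.keras.applications.resnet50.preprocess_input(inputs)
x = base(x, training=False)
x = tf.keras.layers.Dropout(0.3)(x)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)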

I didn't get anywhere with the image training either: the accuracy was so low that I figured even an ensemble model built on top of it probably wouldn't work. So I went on digging up more models that I could try quickly, without properly understanding how they worked, which in hindsight makes this lesson #3: I didn't go deep enough with model optimization or look up relevant research papers to make a model work for this particular business case. I was just randomly moving from one generic model to the next, hoping that with enough brute force and GPU computation power I would be able to crack the accuracy game.

wow, didn't know there are so many types of tops…

Lesson #4: it's good to hang out in the discussion forum and public kernels for the competition, because kind and smart people are more than willing to share nuggets of wisdom here and there. Although I signed up to be notified of any new discussion thread, there's no such functionality for public kernels, so you have to manually check in once in a while. Another good thing is that many people share fully working kernels for you to get started with, as well as to test new techniques.

Lesson #5: I'm not sure how the other teams did it, but we probably needed to work better as a team, bouncing ideas around and dividing up the methods, so we didn't all end up wasting time and effort. This is one of the areas we could have done better on, because the competition really sucks up a lot of time, and not having a proper process for sharing the workload just lowers your chances of getting good accuracy. Plus, bouncing ideas around would probably cut short the time wasted exploring some random genius idea that sounds dumb the moment you voice it out loud. There are tons of relevant research papers and ML tutorials out there, so many that it can easily get confusing, and there was a lot of trial and error that led nowhere.

So the final lesson, #6, would probably be to spend as much time as possible understanding the features and what they mean from the perspective of the model you're building, rather than just relying on the brute force of the model. A lot of the titles were really messy and confusing, a mix of English and Bahasa Indonesia. I tried manually translating a big chunk of them to English, but having learned my lesson earlier, I built a quick model to test it out first. It turned out the translation wasn't that accurate, so the result was not as good as with the original titles, and I didn't have to continue with that menial work.

All in all, it was really a lot of fun. I wish I had done better, but I did learn a lot, and I hope this article inspires you to participate in your first Kaggle competition soon too! Happy Kaggling :)

For the code, feel free to visit it here.


Steve Tan

Data enthusiast, Grad student in AI with background in Payments, Banking, Fin Services.