GSoC19 with Tensorflow Hub

Tahsin Mayeesha
Learning Machine Learning
Jul 5, 2019 · 9 min read

I was selected as a Google Summer of Code (2019) participant at TensorFlow to work with the TensorFlow Hub team. My GitHub profile is here. In this article I’ll document the application, the community bonding period and an overview of my time so far. My mentors are Vojtech Bardiovsky and Sachin Joglekar, and my teammate is Adrish Day. Paige Bailey is also supervising the student-mentor pairs. This article reads a bit more like a biography than a fully technical post.

Application

During the application period I initially wanted to apply with a project proposal on tensorflow-ranking, and went through the papers listed in the repository quickly. TensorFlow Ranking is an underrated library for learning-to-rank problems, with implementations of common loss functions and metrics. Learning-to-rank refers to applying machine learning techniques to rank the results of information retrieval systems: if you have 50,000 relevant results for a given query, you want the most relevant result at the top, a slightly less relevant result in second place, and so on. Each permutation of the 50,000 documents has a certain ranking utility to you, and you want the “best” permutation of those documents to be presented to you. I thought the product was cool. However, after studying it for a while I understood that the library had already fixed most of the problems mentioned in the GitHub issues and that I should look into other projects. The proposal was not coming out well.
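To make the “ranking utility of a permutation” idea concrete, here is a rough, library-agnostic sketch (not tensorflow-ranking code) that scores two orderings of the same documents with discounted cumulative gain (DCG), a standard ranking metric; the relevance labels are made up.

```python
import numpy as np

def dcg(relevances):
    """Discounted cumulative gain: rewards putting relevant items near the top."""
    relevances = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, len(relevances) + 2))
    return np.sum(relevances / discounts)

# Hypothetical graded relevance labels for five documents returned for one query.
labels = [3, 0, 2, 1, 0]

# The same documents in two different orderings: one good, one bad.
good_order = sorted(labels, reverse=True)   # most relevant first
bad_order = sorted(labels)                  # most relevant last

print(dcg(good_order))  # higher "utility" for this permutation
print(dcg(bad_order))   # lower utility
```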

After that I started exploring the TensorFlow Hub library and immediately liked it. With TensorFlow Hub, people can get reusable components of other machine learning models, like pretrained embeddings or the weights of convolutional networks trained on ImageNet, use them immediately in their own projects, assemble them together, fine-tune them on their own datasets, and so on. I ended up making a proposal for adding more modules to hub, and applied.
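As a minimal sketch of what that reuse looks like in the TF 2.0 style, assuming the publicly listed nnlm-en-dim128 text embedding on tfhub.dev and a made-up binary classification head:

```python
import tensorflow as tf
import tensorflow_hub as hub

# Pull a pretrained text embedding from tfhub.dev and use it as a Keras layer.
# Setting trainable=True fine-tunes the embedding weights on your own data.
embedding = hub.KerasLayer("https://tfhub.dev/google/nnlm-en-dim128/2",
                           input_shape=[], dtype=tf.string, trainable=True)

model = tf.keras.Sequential([
    embedding,
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),  # e.g. a binary sentiment classification head
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=["accuracy"])
```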

Getting In

I didn’t expect to be accepted, actually. March 2019 (the application period) was a very hard time for me. Two of my cats (Moturam and Mini) died back to back within two weeks from Feline Panleukopenia. My aunt had brought two new cats from a pet shop; they carried the virus and spread it to my cats. The worst part is that my cats were unvaccinated, and one of the kittens my aunt brought also died (the one who actually had the virus). Panleukopenia is the sort of devastating virus that can literally destroy a cat colony in a week. From the end of February to March 20th I didn’t work on anything related to GSoC. It was literally a life-or-death war for us inside the home. On one side I had to try to stop the virus from spreading; on the other, I wanted to spend as much time as possible with my dying cats. I dropped my programming language course since it was impossible to handle the situation while working full time on all my courses. I also had a project going for my machine learning course where we were trying to predict stock prices from news data, and that was consuming a lot of time too.

I started working on the application in the last 10 days of the proposal period and finished it on April 7th. I was numb with grief and didn’t really feel anything aside from the facts that A) there were more experienced people than me applying to TensorFlow and my chances of being accepted were slim, and B) after facing death among our family members (our pets are our family members), I didn’t want to think about anything for a while. I was also afraid I’d apply and get rejected, since 1500+ people were chatting in the Gitter channel about their applications while many other orgs received less attention. But as my mother eloquently put it, “there’s no logical reason why you should be accepted into things every time you apply to something,” subtly pointing out that it’s okay to take some risks.

Past GSoC: 2018

Last GSoC I worked with the Berkman Klein Center for Internet & Society, a research center at Harvard, on a network visualization project focused on media data, and made Mediaviz, a library for automatically scaling graphs with force-based layouts. I had experience with machine learning projects from Kaggle and my class projects (here’s my website with more projects), but making the jump from network visualization to a deep-learning-style project is actually pretty hard.

First of all, the mindset is different. When we are doing a dataviz project, the focus is on what information I am representing and whether I am using the right abstractions (color, shape, layout and so on) to represent it. What will the consumer of the dataviz take away? But when we are doing a machine learning project we have to think about other things, like model performance, training time, the utility of the problem being solved, deployment and so on. When it comes to deep learning, instead of the usual parameters we also have to think about the model architecture.

Community Bonding

During the community bonding period I first got to meet Adrish, who turns out to be from Calcutta. He’s making TPU samples for TF-Hub right now and training ESRGAN as a module for his project. I soon sat down with my mentor Vojtech and we started exploring projects. We settled on training ULMFiT (Universal Language Model Fine-tuning) as a module for TF-Hub, along with fixing other GitHub issues and implementing new features. I’m quite new to using deep learning for NLP, but I’m picking things up pretty fast. We also settled on a plan to get the tasks done, identified the relevant steps for the project, and I started tinkering with the codebase to get familiar with it.

To be really honest (again), jumping directly to TensorFlow 2.0 is easier, but when I’m working with a moving codebase like the current 2.0 beta release, things are constantly changing. To understand or upgrade a colab or feature to TF 2.0, it’s important to also understand the relevant aspects of the previous API, along with a working knowledge of which parts were deprecated or added.

Work Time

Overview

In the first week I fixed a colab on how to do text classification with Kaggle datasets and TensorFlow Hub. Then I made an exporter tool for exporting pretrained embeddings like FastText and GloVe to TensorFlow Hub modules. Right now I’m using the hub module I made to build a demonstration colab showing how to train a model for any local language with TF-Hub, and digging deeper into ULMFiT.

Text, text and more text

My project is very much NLP-oriented this time. While making the pretrained embedding exporter I had to get used to TensorFlow’s file I/O and embedding layers, and make my concepts clearer on how the embedding layer works internally. An embedding layer essentially works as a lookup table, where we learn the weights corresponding to tokens (words or characters) based on a supervised task such as training a language model. The resulting word vectors maintain the semantic relationships between words and work as dense representations of words for input to models. Multiple pretrained embeddings, like FastText and GloVe, are available online; my embedding exporter makes a TF 2.0 module out of those embeddings which can be used for downstream tasks. For example, here we take the FastText embeddings for Bangla, my mother tongue, and turn them into a hub module.
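Here is a toy sketch of the lookup-table idea, with a made-up “pretrained” matrix loaded into a tf.keras Embedding layer:

```python
import numpy as np
import tensorflow as tf

# Toy "pretrained" embedding matrix: one 4-dimensional vector per token id.
vocab_size, embedding_dim = 5, 4
pretrained_vectors = np.random.rand(vocab_size, embedding_dim).astype("float32")

# The Embedding layer is just a (trainable) lookup table over these rows.
embedding_layer = tf.keras.layers.Embedding(
    vocab_size, embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(pretrained_vectors),
    trainable=False)  # freeze if you only want to reuse the pretrained vectors

token_ids = tf.constant([[1, 3, 4]])      # a sentence as token ids
print(embedding_layer(token_ids).shape)   # (1, 3, 4): one vector per token
```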

For the local language demo I’m doing right now, we use a larger version of the module with the 100,000 most frequent tokens and try to classify a Bangla news article dataset. Of course, instead of Bangla, the same process can be used for any other language that has pretrained embeddings available.
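Roughly, the demo wires the exported embedding module into a small classifier like the sketch below; the module path and the number of news categories are hypothetical placeholders.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Both the module path and the number of categories below are hypothetical.
module_path = "gs://my-bucket/fasttext_bangla_module"  # output of the exporter
num_classes = 5                                        # e.g. news categories

model = tf.keras.Sequential([
    # The exported module maps raw text to a fixed-size embedding vector.
    hub.KerasLayer(module_path, input_shape=[], dtype=tf.string, trainable=False),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```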

Balancing With University

This term, fortunately, I’m doing an NLP course which is going extremely well. The teacher is sadistic, so he assigned us to implement the paper “Unsupervised Machine Translation Using Monolingual Corpora Only.” We’ve had pretty good discussions on RNNs, word vectors, language model training and the encoder-decoder architecture for seq-to-seq models. My project for that course is to build a machine translation model from Bangla to English. We’ve already begged other professors for a dataset and ultimately got one of our juniors to download a Bangla-English machine translation dataset from IEEE. I’ll probably implement the project using tf.keras.

To be clear, I don’t really hold a fascination with Bangla data, but the whole NLP field is kind of drawing me in, and given that the last time I was at an NLP conference in Bangladesh everyone complained about the state of NLP in Bangla, I’m really just going with the flow. Plus, the knowledge gained should be helpful for future projects when I want to work on more general problems. Other than that I have a microprocessor course which I find unfathomably boring, a history course for GED requirements which I’ve procrastinated on so far, and a bunch of labs. Balancing GSoC work with university isn’t that heavy of a workload yet. On class days I stay at my university from morning to evening, and I mostly work on weekends. Other participating students in GSoC have their own schedules. This is my last GSoC, so I’m quite happy with the experience.

Unit Testing and Code Cleanup

TensorFlow has a testing module, tf.test, for running tests, which I picked up while implementing the embedding exporter. My biggest issue was actually code cleanup, and I had to get used to auto-formatting tools.
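A minimal sketch of what a tf.test-style test looks like (the test case and shapes here are made up, not the exporter’s actual tests):

```python
import tensorflow as tf


class EmbeddingLookupTest(tf.test.TestCase):
  """A toy tf.test example; checks the output shape of an embedding lookup."""

  def test_embedding_lookup_shape(self):
    layer = tf.keras.layers.Embedding(input_dim=10, output_dim=4)
    output = layer(tf.constant([[1, 2, 3]]))
    self.assertEqual(output.shape, (1, 3, 4))


if __name__ == "__main__":
  tf.test.main()
```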

Bazel

Bazel is the build tool used by TensorFlow and many other projects. I didn’t dig deep into Bazel aside from learning how to build and run tests and, pretty much, how to modify the BUILD files. But it was a new tool that I had to learn, and keeping track of workspace and package directories can be somewhat problematic.
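For flavor, here is a hypothetical BUILD file entry (the target names and dependency labels are illustrative, not the actual ones in the hub repository) declaring a Python binary and its test, which could then be run with something like `bazel test //path/to:export_test`:

```python
# Hypothetical BUILD targets for the exporter script and its test.
py_binary(
    name = "export",
    srcs = ["export.py"],
    deps = ["//third_party/py/tensorflow"],  # illustrative dependency label
)

py_test(
    name = "export_test",
    srcs = ["export_test.py"],
    deps = [":export"],
)
```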

Google Cloud Platform

I didn’t have much of an idea of how to operate a cloud console before GSoC. I’ve experimented with running Gradient notebooks on Paperspace before, but when it came to AWS I found the work quite hard. Also, since I’m from Bangladesh, we have this huge issue with adding payment methods. My family didn’t have an international credit card, and the one we have needed to go through multiple processes, like endorsing money against a passport at the bank and then requesting that the international payment gateway be opened. After learning how to add GCP credits and taking care of the payment issue, I needed to learn how to start a deep learning VM in the cloud with a GPU. Then there was the issue of increasing quotas.

I really liked AI Hub, a new project introduced at Google I/O ’19, which is a directory of all the AI-related resources Google provides, including the premade VM instances and how to run them. The real issue with the cloud is not only the cost (which seemed quite cheap as long as I’m not running deep learning models daily) but also how to use it: uploading and downloading data, working with Cloud Storage buckets and getting used to the command line tool gcloud took me some time. But ultimately I was able to use the whole FastText Bangla embedding and export it as a 3.5GB module to a bucket for further use.
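One convenience is that TensorFlow’s file I/O treats local paths and gs:// bucket paths the same way, so moving data around can stay inside Python; here is a tiny sketch with a made-up bucket and file name, assuming the environment is authenticated for GCP:

```python
import tensorflow as tf

# Bucket and file names here are hypothetical placeholders.
bucket_path = "gs://my-gsoc-bucket/embeddings/bn_vectors.txt"

# tf.io.gfile works identically for local files and Cloud Storage objects.
with tf.io.gfile.GFile(bucket_path, "w") as f:
    f.write("the 0.418 0.24968 -0.41242\n")  # a toy embedding line

print(tf.io.gfile.exists(bucket_path))  # True once the object is written
```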

First Evaluation

I felt like I was making a lot of rookie mistakes in the first period and also had to pick up lots of new stuff. But the way I see it, my strength is that I learn pretty fast, while my weakness is that I lack experience. So this sort of force-feeding of knowledge actually works for me. Not to mention my mentor has been exceptionally helpful in providing me feedback on how to improve and where I’m going wrong. He’s very communicative and has a lot of patience in giving feedback, too. Right now the dynamic has become me trying to finish stuff fast, because I kind of work that way, while my mentor encourages me to take my time and do things well. I passed the first evaluation, so yay. We have two more months to go, so I hope to actually finish the ULMFiT training and hopefully emerge as an expert in NLP.
