Yann Dubois, from Biomedical Engineering to Machine Learning

Nurture.AI
Published Feb 22, 2018 · 5 min read

Yann implemented the NIPS 2017 paper “Hash Embeddings for Efficient Word Representations” and is a winner of the Global NIPS Paper Implementation Challenge. See his code implementation here.

Tell us a little about yourself?

I graduated in Biomedical Engineering from École Polytechnique Fédérale de Lausanne (EPFL) in Switzerland, and had the opportunity to take up machine learning classes during a full year of exchange at the University of British Columbia in Canada.

I was amazed when I first discovered word embeddings and their linear substructure property, which essentially lets you do maths with language! Current Natural Language Processing (NLP) methods tend not to work well on non-European languages. To confront this issue I decided to join the largest ride-hailing service in Southeast Asia, Grab, where I am in charge of building a classifier to understand 40,000 daily reviews in languages such as Thai, Vietnamese, English, Bahasa, and Burmese. Besides developing the pipeline, I am actively investigating how to train word embeddings in a multi-task, semi-supervised, and language-agnostic manner.
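The linear substructure Yann mentions is the property that arithmetic on word vectors can capture semantic relationships, as in the classic "king - man + woman ≈ queen" analogy. Here is a toy sketch with hand-picked 2-d vectors (real embeddings are learned from large corpora and have hundreds of dimensions; the numbers below are purely illustrative):

```python
import numpy as np

# Hand-picked toy vectors: dimension 0 ~ "royalty", dimension 1 ~ "male".
# Real embeddings are learned, not designed, but the analogy works the same way.
emb = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.2]),
}

# "king" minus "man" plus "woman" should land near "queen".
query = emb["king"] - emb["man"] + emb["woman"]

def nearest(v, exclude):
    # Return the vocabulary word most cosine-similar to v,
    # excluding the words that formed the query.
    return max(
        (w for w in emb if w not in exclude),
        key=lambda w: v @ emb[w] / (np.linalg.norm(v) * np.linalg.norm(emb[w])),
    )

print(nearest(query, {"king", "man", "woman"}))  # queen
```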

I find most of the machine learning domains fascinating; my research interests are mostly in NLP and sample-efficient methods, e.g. multi-task learning, transfer learning, model-based reinforcement learning, etc.

How did you get started in AI research?

During my second year of Bachelor’s studies in Biomedical Engineering, I was working in Prof. Jacques Fellay’s lab, focusing on computational genomics and infectious diseases. As I didn’t have the biological background to find mutations associated with a condition, I decided to make a ranking algorithm based on features including predictions of the deleteriousness of a mutation, proximity of the mutated genes in the cohort, and gene function based on text mining of NCBI’s website.

One major hurdle was the difficulty of obtaining labelled data, so the number of learnt parameters had to be kept to a minimum. I tackled this by defining new complex feature-transformation functions based on mathematical models of biology and proving their properties. The algorithm turned out to be effective, and the project thus shifted from investigating a single condition to improving the objectivity and efficiency of whole-exome sequencing pipelines by prioritising variants.

This was the first time I experienced “machine learning”, and it got me really excited about the field. At the end of the project I embraced the freedom I had as an exchange student to switch to machine learning. Although biomedical engineering and machine learning are very different, the transition went relatively smoothly. Of course I had to brush up on my computer science skills, but at least I had the required mathematical and statistical background. People from all domains are starting to work in machine learning, and I believe it is a relatively easy field to transition into if you have a mathematical way of thinking and are willing to get your hands dirty with implementations, as in this challenge.

What are you most passionate about in AI?

Like many people in the field, I am excited about the impact that you can have on society with your research. It’s surprising to see that your work can be in production in a few weeks. But what I like the most about machine learning is that it perfectly balances mathematical/statistical theory with computer science practice.

Can you give us an overview of your implementation in this Challenge?

I implemented “Hash Embeddings for Efficient Word Representations” in PyTorch. Hash embeddings approximate the hashing trick with fewer parameters by using multiple hash functions. I was able to replicate the authors’ results without any hyperparameter tuning or seed cherry-picking, which probably indicates that the authors didn’t do so either. I then tried improving the word embeddings by making them more memory-efficient and decreasing the number of hash collisions. Finally, to draw more robust conclusions, I ran a subset of the experiments multiple times with varying seeds and compared the distributions of the results.

The goal for me was not only to replicate the results but also to make an out-of-the-box implementation of hash embeddings that people could use. It was also a nice way to set up a pipeline for comparing word embeddings in future work.

Were there any challenges while implementing your selected paper?

This was the first time I re-implemented a paper. Although the authors did a nice job explaining their experiments, they did not give the values of all their hyperparameters or discuss the initialisations they used. The number of choices was overwhelming, and I didn’t have the computational power to test them all. I ended up choosing hyperparameters that I thought made sense and sticking with them. I’m quite surprised that it worked well without any tuning!

I also had to learn PyTorch, as I had only worked with TensorFlow before. The learning went smoothly, and I think PyTorch is great for research-focused work. Its dynamic computational graph gives impressive flexibility and simplicity for defining novel layers or networks. The learning curve is also gentler, as the API is very Pythonic and NumPy-like. In general, I think the two frameworks serve somewhat disjoint communities, and I’m glad to have taken the time to learn PyTorch.

What’s next for you in your work?

I will start a Master of Philosophy in Machine Learning at the University of Cambridge this coming October. Until then I will continue building a pipeline for general NLP at Grab, and I am looking forward to pushing forward my research on semi-supervised word embeddings. Multi-task learning and multilingual word embeddings are already giving encouraging results!

Currently based in Singapore, Yann is a Data Science Trainee at Grab. To keep in touch with Yann, check out his blog and GitHub.

This is a feature on a winner of the Global NIPS Paper Implementation Challenge. You can read the other winners’ features here. If you enjoyed this series and would like to see more content like this, let us know: drop us a comment or an email at info@nurture.ai
