The goal was to classify satellite images of roofs, to identify their orientation (this could be used to estimate the solar energy potential of a country).
We had the opportunity to meet the two best teams at our office in Paris:
- four students from UPMC, ranked first at the end of the challenge: Remi Cadene, Mickael Chen, Edouard Delasalles, and Thomas Gérald
- four students from Ecole Polytechnique, ranked second at the end of the challenge: Guillaume Lample, Pierre-Victor Chaumier, Ismael Lemhadri, and Marc Szafraniec
Here is what we learned.
What is your background? How did you meet?
We met in the UPMC Data Science master's program, and we currently work together in the same lab and the same office. It was natural to build a team, and it was easy for us to meet in person every day during the competition to organize our strategy.
We were all students at Ecole Polytechnique, at different levels. We received a message through Facebook inviting us to participate. That’s how we initially built the team. And we also used Facebook to communicate during the challenge!
Another team from Polytechnique also finished among the 20 best teams selected for the final challenge. Unfortunately, only one team per university is allowed to be a finalist. It’s a pity, because they were very serious candidates for the challenge title!
How did you split the work?
When we discovered the data and the challenge, Remi naturally took the lead on the machine learning part: he had already worked on similar image classification problems, and had the tools and a proper code framework to start from. The rest of the team worked on data analysis and visualization, feature extraction, data augmentation and stacking.
The work breakdown was quite natural: Guillaume was the most experienced and had suitable hardware, so he was responsible for running the neural networks themselves. The rest of the team was mainly in charge of data preparation, data augmentation, and ensembling (see below).
What technology did you use?
We spent the first three weeks brainstorming and reading papers. Our conclusion was that Inception V3 (a successor to GoogLeNet) seemed to have better transfer learning properties than VGG, which is why we decided to focus on it for the rest of the competition.
We used a Torch implementation optimized by Remi, and Nvidia cuDNN on a cluster of 4 Titan X GPUs. We started from a model pre-trained on ImageNet, and fine-tuned it on the challenge’s image set.
We augmented the training dataset with image transformations (flips, rotations, …), but didn’t use the unlabeled images. Our reasoning was that labelling these images with the models’ own predictions would not help: it would add either no extra information (for images the models already label trivially) or noise (for images the models label incorrectly).
We had a lot of variance in our cross-validation score. To reduce it, we used bagging of bootstrapped models. We also generated models with different hyperparameters, to create diversity. Our final submission was a blend of almost 100 models.
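To make the bagging idea concrete, here is a minimal sketch of training several models on bootstrap resamples of the training set. It is not the team's Torch code: `train_centroid` is a hypothetical toy classifier standing in for a fine-tuned network, and the data, the function names and the number of models are assumptions chosen for illustration.

```python
import random
from collections import Counter

def train_centroid(samples):
    """Toy stand-in for a network: nearest class centroid on 1-D features."""
    sums, counts = Counter(), Counter()
    for x, label in samples:
        sums[label] += x
        counts[label] += 1
    centroids = {label: sums[label] / counts[label] for label in counts}
    return lambda x: min(centroids, key=lambda label: abs(x - centroids[label]))

def bagged_models(samples, train_fn, n_models, seed=0):
    """Train n_models models, each on a bootstrap resample of the data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Draw len(samples) items with replacement: some repeat, some are left out.
        bootstrap = [rng.choice(samples) for _ in samples]
        models.append(train_fn(bootstrap))
    return models

# Hypothetical 1-D data with two classes.
data = [(0.0, "a"), (0.1, "a"), (0.2, "a"), (0.9, "b"), (1.0, "b"), (1.1, "b")]
models = bagged_models(data, train_centroid, n_models=10)
predictions = [model(0.15) for model in models]
print(predictions)
```

Each model sees a slightly different dataset, so their errors are only partly correlated, and combining their predictions tends to reduce variance.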
Our cornerstone was the VGG very deep ConvNet from the University of Oxford (the 19-layer flavour). We started with a network pre-trained on ImageNet, and then fine-tuned the weights by continuing backpropagation on the data provided for the challenge (transfer learning).
To get more training data, we used data augmentation techniques to extend the available images, such as flips or 90° / 180° / 270° rotations.
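As an illustration of this kind of augmentation, the sketch below generates the eight flip/rotation variants of a small image. Plain nested lists stand in for image arrays, and the function names are hypothetical; a real pipeline would use an array or image library.

```python
def rotate90(img):
    """Rotate a 2-D grid of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def hflip(img):
    """Mirror a 2-D grid of pixels left to right."""
    return [row[::-1] for row in img]

def augment(img):
    """Return (quarter_turns, flipped, variant) for the 8 flip/rotation variants."""
    current = [list(row) for row in img]
    variants = []
    for quarter_turns in range(4):
        variants.append((quarter_turns, False, current))
        variants.append((quarter_turns, True, hflip(current)))
        current = rotate90(current)
    return variants

image = [[1, 2],
         [3, 4]]
variants = augment(image)
for quarter_turns, flipped, variant in variants:
    print(quarter_turns, flipped, variant)
```

The transformation parameters are returned alongside each variant because, for an orientation task, a rotation may change the correct label; whether and how labels were remapped is not stated in the interview.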
What’s more, as there were unlabeled images, we used a semi-supervised technique: we labelled the unlabeled images with our classifiers, and used this newly labeled data to augment our training set. Surprisingly, this didn’t improve results as much as we would have expected.
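A minimal sketch of this kind of pseudo-labelling is shown below. The confidence threshold is an assumption added for illustration (the interview does not say whether one was used), and the probabilities are made up.

```python
def pseudo_label(probabilities, threshold=0.9):
    """Return (image_index, predicted_class) for confidently predicted images.

    probabilities: one list of class probabilities per unlabeled image.
    threshold: hypothetical minimum confidence for keeping a pseudo-label.
    """
    selected = []
    for index, probs in enumerate(probabilities):
        best_class = max(range(len(probs)), key=probs.__getitem__)
        if probs[best_class] >= threshold:
            selected.append((index, best_class))
    return selected

# Hypothetical classifier outputs for three unlabeled images, two classes.
unlabeled_probs = [[0.95, 0.05], [0.60, 0.40], [0.08, 0.92]]
new_labels = pseudo_label(unlabeled_probs)
print(new_labels)
```

The selected pairs would then be appended to the labeled training set before retraining.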
What made the difference?
Using GPUs helped us to iterate quicker. For a given set of hyperparameters it took around 30 min to train a model.
To blend our models, we didn’t use an average of the soft decisions. Instead, we used a majority vote of the hard decisions — which worked better for us.
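The difference between the two blending rules can be sketched as follows. The probabilities are invented to show a case where the rules disagree, and the function names are hypothetical.

```python
from collections import Counter

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

def soft_vote(model_probs):
    """Average the class probabilities over models, then pick the best class."""
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    mean = [sum(p[c] for p in model_probs) / n_models for c in range(n_classes)]
    return argmax(mean)

def hard_vote(model_probs):
    """Let each model pick its best class, then take the most common pick."""
    picks = [argmax(p) for p in model_probs]
    return Counter(picks).most_common(1)[0][0]

# Three models, two classes: one very confident model against two mild ones.
probs = [[0.9, 0.1], [0.4, 0.6], [0.4, 0.6]]
print(soft_vote(probs), hard_vote(probs))
```

Here the soft average follows the single confident model, while the majority vote follows the two models that agree; hard voting is less sensitive to one overconfident model.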
We also adjusted the predictions of our final submission to counteract the fact that the training set had unbalanced classes, while the test set was balanced. This improved our score a bit, although we would still have ranked first without this trick.
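The interview does not say how the predictions were adjusted. One common way to correct for a change in class balance, shown here purely as an assumption, is to reweight each predicted probability by the ratio of test prior to training prior and renormalize. The prior values below are made up.

```python
def rebalance(probs, train_prior, test_prior):
    """Reweight class probabilities for a different class balance at test time."""
    weighted = [p * t / s for p, s, t in zip(probs, train_prior, test_prior)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Hypothetical two-class example: class 0 dominates the training set,
# but both classes are equally frequent in the test set.
train_prior = [0.8, 0.2]
test_prior = [0.5, 0.5]
adjusted = rebalance([0.6, 0.4], train_prior, test_prior)
print(adjusted)
```

In this example the adjustment moves probability mass toward the class that was rare during training, which is enough to change the predicted class.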
Tuning deep neural networks is largely a matter of experience and know-how. Even if you know the theory, you need to have tuned dozens of networks before to get the best out of them. Our previous experiments with this technology helped us a lot. But more than that, without GPUs we wouldn’t have been able to reach this performance.
Stacking different models also helped us improve accuracy (in the end we had 84 models: 12 root models, each trained on 7 different versions of the images). Our best submission was a classifier trained on the predictions of these 84 models.
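A minimal sketch of stacking is given below, with several loud assumptions: the interview does not name the meta-classifier, so a small logistic regression is used here; the problem is reduced to two classes and three base models; and all numbers are invented. In practice the meta-classifier should be trained on predictions the base models made for images they were not trained on, to avoid leakage.

```python
import math

def train_meta(features, labels, epochs=2000, lr=0.5):
    """Fit a binary logistic regression on base-model predictions."""
    n_samples = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            error = 1.0 / (1.0 + math.exp(-z)) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        # Batch gradient descent step on the mean log-loss.
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias

def predict_meta(model, x):
    """Predict class 1 if the weighted base-model outputs are positive."""
    weights, bias = model
    return 1 if bias + sum(w * xi for w, xi in zip(weights, x)) > 0 else 0

# Each row: probability of class 1 from three hypothetical base models
# for one held-out image. The third model is close to uninformative.
base_predictions = [[0.9, 0.8, 0.5], [0.8, 0.7, 0.4], [0.2, 0.3, 0.6],
                    [0.1, 0.2, 0.5], [0.7, 0.9, 0.3], [0.3, 0.1, 0.7]]
true_labels = [1, 1, 0, 0, 1, 0]
meta = train_meta(base_predictions, true_labels)
print(predict_meta(meta, [0.85, 0.9, 0.5]), predict_meta(meta, [0.15, 0.1, 0.5]))
```

Unlike a plain vote, the meta-classifier can learn how much weight each base model deserves.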
Thank you to all of you for these insights. Here are our main takeaways. And good luck in the final stage!