Amazon Basin from Space: A Competitive Approach to Training Neural Networks

Dbrain
7 min read · Jul 11, 2018


Hello! My name is Artur Kuzin, and I’m a Lead Data Scientist at Dbrain.
In the past, I was a researcher in the field of micro- and nanotechnology. I have been doing Data Science for the last 3+ years. I'm a Kaggle Grandmaster (top-100 rank) and a prize-winner on several competition platforms (DrivenData, TopCoder, Kaggle). I strongly believe that solutions and hacks from competitions are applicable to a large number of real-world cases. Today, I will walk through a list of useful approaches, thereby starting a new series of machine learning articles on the Dbrain Medium.

A year ago, our team came up with an approach that placed 7th out of 900+ teams in a Kaggle competition. It would be fair to say that every technique we used is still applicable, not only for competitions but also for training production-oriented neural network solutions. Our task was to classify satellite imagery of forests in the Amazon Basin. Here is how it went.

Setup

Planet provided a dataset in two formats: 16-bit TIFF (RGB + Near Infrared) and 8-bit JPG (RGB). As mentioned above, the dataset was compiled from satellite imagery of the Amazon Basin.

The task was to predict labels for every 256×256 tile: one of four mutually exclusive weather classes (Cloudy, Partly Cloudy, Haze, Clear) plus one or more non-weather-related labels (Agriculture, Primary, Selective Logging, Habitation, Water, Roads, Shifting Cultivation, Blooming, Conventional Mining). The "Cloudy" label excluded all others.

Model performance was measured by the F2 score, which weighs recall higher than precision. Since the dataset is relatively small (~40k images) but consists of large images, it can be considered an "MNIST on steroids".
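For reference, here is a minimal sketch (my own, not code from the original solution) of how a multi-label F2 score can be computed with scikit-learn; the `y_true` and `y_pred` arrays are hypothetical placeholders for binary label matrices.

```python
import numpy as np
from sklearn.metrics import fbeta_score

# Hypothetical binary label matrices of shape (n_samples, n_labels).
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 1, 0]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 1, 1]])

# beta=2 gives more weight to recall than to precision; average="samples"
# averages the score per image, roughly matching the competition metric.
score = fbeta_score(y_true, y_pred, beta=2, average="samples")
print(score)
```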

Context

From the task overview, it was clear that one doesn't really need rocket science to solve this problem. All we had to do was fine-tune a pre-trained neural network and, due to Kaggle specifics, stack a bunch of other models on top of it. But of course, that alone was not enough to score #1: the model ensemble had to be diverse, and every element had to show outstanding performance.

Teamwork

In order to win a gold medal, we gathered a team of seven competitors with similar results on the leaderboard:

Aleksander Buslaev
Aleksey Noskov
Konstantin Lopuhin
Artur Kuzin
Evgeny Nizhibitsky
Ruslan Baikulov
Vladimir Iglovikov

Training Process

Each of us developed a completely independent pipeline; the only shared elements were a common repository, the prediction format, and the cross-validation parameters.

A Common Approach

Image from https://github.com/tornadomeet/ResNet

This graph represents a typical learning process: stochastic gradient descent from randomly initialized weights (LR 0.1, Nesterov momentum 0.9, weight decay 0.0001), with the learning rate decreased tenfold every 30 epochs. We used basically the same approach; however, in order to accelerate training, we lowered the learning rate whenever the validation loss did not decrease for 3–5 epochs. As an alternative, some members of our team reduced the number of epochs per LR step and performed a scheduled learning rate decrease.
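As an illustration (my own sketch, not the team's actual training code), such a plateau-based learning rate drop can be wired up in PyTorch like this; the backbone choice, the patience value, and the `train_one_epoch` / `validate` helpers are assumptions.

```python
from torch import optim
from torchvision import models

model = models.resnet50(pretrained=True)  # any ImageNet backbone would do
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                      nesterov=True, weight_decay=1e-4)

# Drop the LR tenfold when the validation loss has not improved for 3 epochs.
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=3)

for epoch in range(100):
    train_one_epoch(model, optimizer)   # hypothetical training helper
    val_loss = validate(model)          # hypothetical validation helper
    scheduler.step(val_loss)
```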

Data Augmentation

Augmentation strategies have to reflect the diversity of the data. Conventionally, they are divided into two groups: those that add bias to the data and those that do not. Bias here means a change in low-level statistics, e.g., the color histogram or the distribution of object sizes. For example, HSV augmentation and scaling tend to add bias, while random cropping does not.

At the early stages of training, many different augmentation techniques can be used, sometimes producing a quite massive effective dataset. But as you approach the final stages, you should scale the augmentations back, especially those that add bias to the data. This lets the network slightly overfit to the original data distribution, allowing it to show a better result on validation. (To learn more about data augmentation, click here.)
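To make the two groups concrete, here is an illustrative pipeline (my own sketch, not the team's actual one) using the albumentations library; the crop size and the HSV shift limits are assumptions.

```python
import albumentations as A

train_augs = A.Compose([
    # Bias-adding augmentations: they shift low-level statistics such as
    # the color histogram or the object-size distribution.
    A.HueSaturationValue(hue_shift_limit=10, sat_shift_limit=15,
                         val_shift_limit=10, p=0.5),
    A.RandomScale(scale_limit=0.1, p=0.5),
    # Bias-free augmentations: the data statistics are preserved.
    A.RandomCrop(height=224, width=224),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomRotate90(p=0.5),
])

# augmented = train_augs(image=image)["image"]
```

Towards the end of training, the first (bias-adding) group would be switched off or toned down.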

Freezing Layers

Usually, there is no point in training a neural network from scratch. A much more effective way is to fine-tune a model pre-trained on ImageNet. However, you can go even further: not only replace the last layer with a fully-connected layer with the proper number of classes, but also train this new layer first while freezing the weights of all the other convolutional layers. If the randomly initialized last layer is trained without freezing the earlier layers, the weights of the convolutions get corrupted and the overall performance drops. This is especially prominent in this problem because of the particularly small training set. In other Kaggle competitions with larger datasets (e.g., Cdiscount) it was unnecessary to freeze the weights of every layer; freezing groups of layers instead resulted in a significant acceleration of the training process, since frozen layers do not require gradient computation.
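A minimal sketch of this warm-up in PyTorch (my own illustration; the ResNet-50 backbone and the class count taken from the label list above are assumptions):

```python
import torch.nn as nn
from torch import optim
from torchvision import models

num_classes = 13  # 4 weather + 9 other labels listed above

model = models.resnet50(pretrained=True)

# Freeze all pre-trained convolutional weights.
for param in model.parameters():
    param.requires_grad = False

# Replace the head with a freshly initialized fully-connected layer;
# it is now the only trainable part of the network.
model.fc = nn.Linear(model.fc.in_features, num_classes)

optimizer = optim.SGD((p for p in model.parameters() if p.requires_grad),
                      lr=0.01, momentum=0.9)

# ... train the head for a few epochs, then unfreeze the rest:
for param in model.parameters():
    param.requires_grad = True
```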

Cyclic Annealing

The idea of cyclic annealing is to retain the best set of parameters and to repeat the training procedure from it with a smaller learning rate and tight time constraints (3–5 epochs). This helps find better local minima and, therefore, improves model performance. Such an approach consistently improved results in quite a lot of competitions. (See more here or here).
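In pseudo-PyTorch, the loop described above might look like this (my own sketch; `model`, `train_one_epoch`, `validate` and the initial checkpoint are hypothetical, and the number of cycles and learning rates are assumptions):

```python
import copy
import torch

best_state = copy.deepcopy(model.state_dict())  # best weights found so far
best_score = validate(model)                    # hypothetical validation helper

lr = 0.01
for cycle in range(3):
    # Restart from the best checkpoint with a smaller learning rate.
    model.load_state_dict(best_state)
    lr /= 10
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)

    for epoch in range(5):                      # 3-5 epochs per cycle
        train_one_epoch(model, optimizer)       # hypothetical training helper
        score = validate(model)
        if score > best_score:
            best_score = score
            best_state = copy.deepcopy(model.state_dict())
```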

Test Time Augmentations

Since there are basically no limits on inference time in Kaggle competitions, we can improve performance a bit by augmenting the data at test time too. In practice, that means that images from the test set get distorted the same way as during training: they get flipped, rotated, cropped, scaled, etc. We predict classes for the variously distorted images and average the predictions. That gives some profit in practice. (Click-click).
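For example, a flip-based TTA could look like this (my own sketch; `model` and the input `batch` tensor are hypothetical):

```python
import torch

def predict_tta(model, batch):
    """Average sigmoid predictions over flipped versions of a (N, C, H, W) batch."""
    model.eval()
    with torch.no_grad():
        variants = [
            batch,
            torch.flip(batch, dims=[3]),     # horizontal flip
            torch.flip(batch, dims=[2]),     # vertical flip
            torch.flip(batch, dims=[2, 3]),  # both flips
        ]
        preds = [torch.sigmoid(model(v)) for v in variants]
    return torch.stack(preds).mean(dim=0)
```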

While taking part in other contests, I have also experimented with random test-time augmentations. For example, I doubled the amplitude of the augmentations for the whole dataset, fixed the random seed, and generated several randomly distorted versions of each image. That also had a positive effect on the result.

Snapshot Ensembling (Multicheckpoint TTA)

Let's explore the annealing idea a bit further. At every annealing step, the neural network converges to a slightly different local minimum. That means we can average these slightly different models to get a slightly better result. So, when predicting class labels for the test data, we can take the three best checkpoints and average their predictions. (See more). I also tried using not the best three but the three most diverse checkpoints out of the top 10, but the overall result declined. In most production-oriented cases this idea is not really applicable; however, the model's weights themselves can be averaged across several checkpoints, which gave a small but stable gain.
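Averaging the weights themselves takes a few lines (my own sketch; the checkpoint paths are hypothetical, the files are assumed to hold plain state dicts, and `model` is an already constructed network of the same architecture):

```python
import torch

paths = ["ckpt_a.pth", "ckpt_b.pth", "ckpt_c.pth"]  # hypothetical checkpoints
states = [torch.load(p, map_location="cpu") for p in paths]

# Element-wise mean of every parameter/buffer, cast back to its original dtype
# (integer buffers such as BatchNorm's num_batches_tracked need the cast).
avg_state = {
    key: torch.stack([s[key].float() for s in states])
              .mean(dim=0)
              .to(states[0][key].dtype)
    for key in states[0]
}

model.load_state_dict(avg_state)
```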

Human Grid Search Results

To some extent, each member of our team used various combinations of the above techniques.

Stacking and Other Hacks

We trained every model with each set of parameters using 10 folds; after that, we trained second-level models on the out-of-fold (OOF) predictions: Extra Trees, linear regression, a neural network, and plain model averaging. The blending weights for the models were also selected based on the OOF predictions. (Read more about stacking here and here).
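A simplified sketch of this second level (my own illustration, not the team's exact models): `X_meta` is a hypothetical matrix of out-of-fold probabilities produced by the first-level networks, and `y` holds the binary targets for a single label.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import ExtraTreesClassifier

# X_meta: (n_samples, n_first_level_models * n_labels) OOF probabilities.
# y: (n_samples,) binary targets for one label. Both are placeholders.
kf = KFold(n_splits=10, shuffle=True, random_state=42)
oof_second = np.zeros(len(y))

for train_idx, val_idx in kf.split(X_meta):
    clf = ExtraTreesClassifier(n_estimators=300, n_jobs=-1)
    clf.fit(X_meta[train_idx], y[train_idx])
    oof_second[val_idx] = clf.predict_proba(X_meta[val_idx])[:, 1]

# oof_second can in turn be used to tune blending weights and thresholds.
```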

Surprisingly enough, this approach also shows up in a number of production-oriented cases: for instance, when the data comes from different domains (images, text, categorical features, etc.) and you need to combine the predictions of the corresponding models. Probability averaging can also be used in such cases, but second-level models give better results.

Conclusion

Could we possibly have gotten the same (or an even better) result without training so many models? I think so. Two competitors (stasg7 and ZFTurbo, #3 on the leaderboard) ended up using fewer first-level models but trained 250+ second-level models instead. (A report on this solution is available (ru); plus, you may check this overview).

So, who won the competition? It was the mysterious bestfitting, and he is a monster. By the way, right now he is #1 in the overall Kaggle rating. For a long time he remained incognito, until Nvidia shed some light on the subject by interviewing him; in the interview he admitted that he has ~200 subordinates.

Anyway, that was a great experience, and I'm pleased I could take part in that contest. Plus, I'm really glad to start a new column on Dbrain Medium! Stay tuned for more news about data science and its tricks.
