Recently I competed in a ‘for-fun’ Kaggle competition, and I figured people might ask: ‘what does a decent but non-winning result look like?’

Hopefully I’ll be able to explain what that looks like in this post. It’s aimed at readers with some knowledge of the Keras library. I didn’t want to bog you down with the full-blown notebook code, but rather give an idea of the process I used that started to produce results. The goal of the competition was to predict whether or not a plant is an invasive species of hydrangea. I scored in 8th place; you can see my spot here on the competition leaderboard. My solution was an ensemble of models pretrained on ImageNet, which I retrained on the hydrangea dataset.

The general steps are:

1 — Fit a model to the training data and get the loss as low as possible

2 — Regularize that model

3 — Repeat for all the models you want in your ensemble

4 — Use all of the models to make predictions, then average those predictions

Step 1 — fitting a single model to the data:

The logloss is an error metric that tells you how well your model is doing; the lower, the better. In Keras you compile it into the model by setting loss='binary_crossentropy'. The first step is to get this loss value as low as possible on the training data with a single model. If you can manage to get your loss below 0.01 (or close to that), it means your model is capable of representing the data. If your loss is higher than that, your model is not able to learn the training data well enough.
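
To make that 0.01 target concrete, here is a minimal numpy sketch of what the logloss computes. This is illustration only, not part of the competition code:

import numpy as np

# binary cross-entropy / logloss, assuming y_true holds 0/1 labels and
# y_pred holds predicted probabilities
def binary_logloss(y_true, y_pred, eps=1e-15):
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(binary_logloss(np.array([1, 0]), np.array([0.99, 0.01])))  # confident and correct: ~0.01
print(binary_logloss(np.array([1, 0]), np.array([0.01, 0.99])))  # confident and wrong: ~4.6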

Abridged code:

from keras.applications import InceptionV3
from keras.layers import Flatten, BatchNormalization, Dense
from keras.models import Model

# define a function for making InceptionV3 models
# with a trainable dense layer at the end and a
# frozen set of weights pretrained on ImageNet;
# start with no dropout and a low number of dense units
def make_incepv3_conv(input_shape):
    base_model = InceptionV3(input_shape=input_shape, weights='imagenet', include_top=False, pooling=None)
    for layer in base_model.layers:  # freeze the pretrained convolutional weights
        layer.trainable = False
    m = Flatten()(base_model.layers[-1].output)
    m = BatchNormalization()(m)
    m = Dense(128, activation='relu')(m)   # no dropout yet
    m = BatchNormalization()(m)
    outputs = Dense(1, activation='sigmoid')(m)
    return Model(inputs=base_model.input, outputs=outputs)

from keras import optimizers
from keras.layers import Conv2D
from keras.preprocessing.image import ImageDataGenerator

# create a model and compile it with the right loss and optimizer
# start with a low image resolution as this will make your testing faster
# so x_train[0].shape is something like (224, 224, 3) and not (500, 500, 3)
batch_size = 32   # batch size isn't specified in the abridged code; 32 is a placeholder
model = make_incepv3_conv(x_train[0].shape)
opt = optimizers.Adam(lr=0.00025)
model.compile(loss='binary_crossentropy', optimizer=opt)

# create a data generator that starts empty (no augmentation yet)
train_datagen = ImageDataGenerator()
train_datagen.fit(x_train)

# fit the dense layer for a few epochs to get it
# to work nicely with the pretrained convolutional layers
_ = model.fit_generator(train_datagen.flow(x_train, y_train, batch_size=batch_size, shuffle=True),
                        steps_per_epoch=(len(x_train) // batch_size) + 1,
                        epochs=10)

# set the weights in the convolutional layers to trainable,
# starting with the last 8 conv layers
conv_layers = [l for l in model.layers if isinstance(l, Conv2D)]
for l in conv_layers[-8:]:
    l.trainable = True

# recompile the model so Adam can see the ImageNet weights
# that we're about to retrain
model.compile(loss='binary_crossentropy', optimizer=optimizers.Adam(lr=0.00025))

# fit the convolutional layers
_ = model.fit_generator(train_datagen.flow(x_train, y_train, batch_size=batch_size, shuffle=True),
                        steps_per_epoch=(len(x_train) // batch_size) + 1,
                        epochs=30)

My conclusions (a sketch of how one of these sweeps can be run follows the list):

1 - settled on a single dense layer (tried 1 vs. 2 vs. 3)

2 - 512 units in that single dense layer (tried 128, 256, 512, 1024)

3 - no pooling instead of average or max pooling for the network (tried None, 'avg', 'max')

4 - settled on a learning rate of 0.00025 for Adam (tried 0.0025, 0.00025, 0.000025)

5 - bigger image resolutions for more epochs yield a lower loss (tried 224x224, 300x300, 500x500)
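
For example, the learning-rate comparison can be run as a simple loop like the one below. This is only a sketch of the idea, not my exact harness, and the batch size and epoch count are placeholders; the unit-count, pooling, and resolution sweeps work the same way.

from keras import optimizers

# refit the same architecture at each learning rate and compare the final training loss
for lr in (0.0025, 0.00025, 0.000025):
    model = make_incepv3_conv(x_train[0].shape)
    model.compile(loss='binary_crossentropy', optimizer=optimizers.Adam(lr=lr))
    hist = model.fit(x_train, y_train, batch_size=32, epochs=10, verbose=0)
    print(lr, hist.history['loss'][-1])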

Step 2 — adding regularization:

Once you have a model below or close to 0.01 loss on the training data, start adding regularization to make your model generalize better and bring the validation loss down. Before this point I didn’t even print the validation loss; you’ll want to update your code to include it:

# make a data generator for the validation data
valid_datagen = ImageDataGenerator()

_ = model.fit_generator(train_datagen.flow(x_train, y_train, batch_size=batch_size, shuffle=True),
                        steps_per_epoch=(len(x_train) // batch_size) + 1,
                        validation_data=valid_datagen.flow(x_valid, y_valid, batch_size=batch_size, shuffle=False),
                        validation_steps=(len(x_valid) // batch_size) + 1,
                        epochs=10)

In the end I tried a bunch of different values and ended up with:

1 - a dropout value of 0.25 on my single dense layer
2 - simple data augmentation on train_datagen (sketched below): rotation_range=30, shear_range=0.2, horizontal_flip=True
3 - retraining the entire network instead of leaving some convolutional layers locked. The more layers I unlocked, the lower my training loss went, but it may be that if I had left a few layers locked in the beginning (the first 8) my validation loss would have been lower.
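
Concretely, the augmented generator from point 2 looks like the snippet below; the 0.25 dropout from point 1 goes right after the dense layer in the model factory (a generalized version of that factory is sketched a little further down).

from keras.preprocessing.image import ImageDataGenerator

# augmented training generator replacing the empty one from step 1
train_datagen = ImageDataGenerator(rotation_range=30,
                                   shear_range=0.2,
                                   horizontal_flip=True)
train_datagen.fit(x_train)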

You should repeat this process for every model you want in your final ensemble. If you are time-constrained, it is OK to reuse the same hyperparameter settings across models (it usually works), but your results may not be quite as good as they could have been.

In the end I had:

InceptionV3 at 224x224, 299x299, and 450x450
Xception at 224x224, 299x299, and 450x450
ResNet50 at 224x224 and 299x299
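
For reference, here is a rough sketch of how the same regularized head can be attached to each of those backbones. The make_pretrained_conv name and the exact layout are illustrative rather than a copy of my notebook.

from keras.applications import InceptionV3, Xception, ResNet50
from keras.layers import Flatten, BatchNormalization, Dense, Dropout
from keras.models import Model

# attach the same head (512 units, 0.25 dropout) to any pretrained backbone;
# the image resolution is set through input_shape
def make_pretrained_conv(base_cls, input_shape):
    base_model = base_cls(input_shape=input_shape, weights='imagenet',
                          include_top=False, pooling=None)
    m = Flatten()(base_model.layers[-1].output)
    m = BatchNormalization()(m)
    m = Dense(512, activation='relu')(m)
    m = Dropout(0.25)(m)
    m = BatchNormalization()(m)
    outputs = Dense(1, activation='sigmoid')(m)
    return Model(inputs=base_model.input, outputs=outputs)

xception_299 = make_pretrained_conv(Xception, (299, 299, 3))
resnet50_224 = make_pretrained_conv(ResNet50, (224, 224, 3))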

Step 3 — picking a way to combine the predictions:

I decided to simply average the predictions in my ensemble. I experimented with stacking by making predictions on the training set and using those predictions as the input features to a new model (I tried XGBoost and random forests), but I found the stacked models tended to overfit.
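
The averaging itself is the simplest part. A minimal sketch, assuming fitted is a list of (model, test images resized to that model's resolution) pairs:

import numpy as np

# simple unweighted average of each model's predicted probabilities
preds = [model.predict(x, batch_size=32).ravel() for model, x in fitted]
ensemble_pred = np.mean(preds, axis=0)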

Using cross validation

I ended up using my GPUs to train 5 models (5-fold cross-validation) instead of 1 model for each image resolution and architecture. The idea was that getting more training exposure across all of the data would help. I’m convinced now that I could have gotten to 8th place without doing this, so it was probably a waste of time.
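
For completeness, here is a sketch of that 5-fold setup. Using scikit-learn's KFold and a fresh model per fold is an assumption here; the per-fold test predictions get averaged at the end.

import numpy as np
from sklearn.model_selection import KFold
from keras import optimizers

fold_preds = []
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True).split(x_train):
    fold_model = make_incepv3_conv(x_train[0].shape)
    fold_model.compile(loss='binary_crossentropy', optimizer=optimizers.Adam(lr=0.00025))
    fold_model.fit(x_train[train_idx], y_train[train_idx],
                   validation_data=(x_train[valid_idx], y_train[valid_idx]),
                   batch_size=32, epochs=30)
    fold_preds.append(fold_model.predict(x_test, batch_size=32).ravel())
test_pred = np.mean(fold_preds, axis=0)   # average the 5 folds' test predictions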

Pseudo labeling

I also ended up using my own predictions on the test set as labels to expand the amount of data available for training. The way this works is that you train on the test set, and the labels for that training pass are your best current predictions on the test set. It’s OK to do this because you never touch the real test labels; the model only sees its own predictions. This only improved performance a little bit, but once you are in the top 15 any improvement helps a lot.
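
In code, the pseudo-labeling pass boils down to something like the sketch below. Whether to use hard 0/1 labels or the raw probabilities, and how long to fine-tune, are details left as assumptions here, and x_test is assumed to be resized to the model's input resolution.

import numpy as np

# label the test set with the ensemble's current predictions (hard labels here),
# append it to the training data, and fine-tune on the combined set
pseudo_labels = (ensemble_pred > 0.5).astype(np.float32)
x_combined = np.concatenate([x_train, x_test])
y_combined = np.concatenate([y_train, pseudo_labels])
model.fit(x_combined, y_combined, batch_size=32, epochs=5)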

Conclusion

Starting with a single model, getting the loss as low as possible, and then regularizing (and repeating for more models) is an effective way to climb a Kaggle leaderboard. I hope that after reading this you have a better idea of how to approach deep learning problems and how to build a model in Keras that can represent the problem effectively.

I want to thank everyone who offered advice and pointers. They were very helpful, even if they re-affirmed something I was already trying.

If you’d like to get the next blog post you can sign up here.


Originally published at cueblog.com.