Identifying Berlin birds. Part II.

May 13, 2019

Data cleaning

In the first part, I collected a large number of images of different species and sorted them into “class folders”. Even though the process included defining stopwords and deleting bad files, it couldn’t distinguish images based on their content.

Reviewing the downloaded images showed that, along with good photos, I had picked up many images that definitely couldn’t be used for network training. Each of the 84 folders had to be reviewed and cleaned of non-bird content.

Magazine cover, feathers, and eye structure. Such images have to be removed.

Quotes and statistics like the following didn’t improve my optimism.

Data scientists spend 60% of their time on cleaning and organizing data. Collecting data sets comes second at 19% of their time, meaning data scientists spend around 80% of their time on preparing and managing data for analysis.

Quote and diagram from Forbes article: “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says”

Manual attempt

Using macOS can make your life much easier if you know some tricks. As all the bird folders contained photos, it was quite easy to review them using the “Gallery View” mode introduced in macOS Mojave.

My steps were the following:

Open Finder window in image folders.
Enable “Gallery View”
Enter the folder with hotkey “Cmd+⬇”
Navigate through images with ⬅ and ➡ keys
Delete wrong images using “Cmd+⌫” (Backspace)
Exit folder using “Cmd+⬆”

That worked well but was veeeryyy time-consuming and not very exciting.

Me, cleaning the dataset

Alternative approach

I stopped after the 4th folder and asked myself, “Why can’t a machine do that? Let it do all the boring stuff.” OK, so I decided to train a neural network to help me with training a neural network.

Building an “artificial assistant”

Prerequisites

I decided to go easy and defined three main goals for myself:

Validate an approach to be used for “berlin-birds” network
Validate that downloaded images can be used for building the network
Don’t try to be perfect: simplify the dataset cleaning process rather than fully automate it

Preparing the data. Again.

To clean my dataset I needed a simple two-class network with “bird” and “non-bird” classes. I used the same approach for collecting images as before and downloaded different kinds of “non-bird” and “bird” images via Google Images Download.

My steps were the following:

Download general non-bird examples of the kinds of images I wanted to get rid of, like feathers, eyes, toys, corpses, nests and different kinds of drawings. For that, I created a separate list called non-birds-extended.csv
As the script downloaded every word/class into a dedicated folder, I merged all downloaded non-bird images into one folder
Then I downloaded images for birds (100 images for each species) and merged them the same way
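
The merge step above can be sketched in a few lines of Python. This is only an illustration, not the script I actually used: the function name merge_folders and the folder layout (one sub-folder per search keyword) are assumptions. Files are prefixed with their source folder name so that two downloads with the same file name do not overwrite each other.

```python
import shutil
from pathlib import Path


def merge_folders(source_root, target_dir):
    """Copy every file from the sub-folders of source_root into target_dir.

    Hypothetical helper: each copied file is prefixed with the name of the
    folder it came from, so identical file names do not collide.
    Returns the number of files copied.
    """
    source_root = Path(source_root)
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    copied = 0
    for folder in sorted(p for p in source_root.iterdir() if p.is_dir()):
        # Skip the target itself in case it lives inside source_root.
        if folder.resolve() == target_dir.resolve():
            continue
        for image in sorted(p for p in folder.iterdir() if p.is_file()):
            new_name = f"{folder.name.replace(' ', '_')}__{image.name}"
            shutil.copy2(image, target_dir / new_name)
            copied += 1
    return copied
```

Calling merge_folders("downloads/non-birds", "dataset/non-bird") would then produce a single class folder; both paths are placeholders.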

Of course, both classes needed review and cleanup, but there were far fewer images and it didn’t take much time.

A nice trick discovered during the cleanup process: going through the bird species, I deleted a bunch of irrelevant images, and they were exactly the kind of images I wanted to get rid of in further steps. So I moved them from the Trash to the non-bird class folder.

Now, having all the data in place, I could train the network.

Training the network. Almost

I’m not going to cover any kind of general neural network details. That kind of information is available already all over the internet.

My approach was:

MobileNetV2 as the architecture: I needed my network to run on low-power devices like Raspberry Pi or Nvidia Jetson Nano.
Keras and TensorFlow as frameworks: I knew them quite well and didn’t want to switch to something new (of course I’d like to know PyTorch better, but not now).
Home PC and Google Colab as the training environment: I prefer training networks on the local machine, but as a backup plan I used the free Google Colab service.
Transfer learning as the training approach: it saves time and energy.

Splitting the dataset

As usual, the data has to be split into “Train” and “Validation” sets. Today this can be done without even touching the files directly: some time ago Keras extended its API with a parameter called validation_split.

I used validation_split from Keras, and it was one of my mistakes.

In my project I’m working with images and, as usual, data is very limited. To overcome this limitation, there are various “data augmentation” techniques. Of course, Keras also provides a very convenient API for data augmentation.

The problem, as stated on StackOverflow: “When data augmentation is used, it also applies to validation_split”.

So it is safer to split the files into different folders. I used a small script based on sample code I found on GitHub a long time ago.

Running the code resulted in two folders named valid and train located in the dataset folder. That is the actual dataset used for training the network.
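
For illustration, here is a minimal sketch of such a split script. It is not the original code: the function name split_dataset, the 20% validation share and the fixed random seed are my assumptions. It expects one sub-folder per class and copies the files into train/<class> and valid/<class>.

```python
import random
import shutil
from pathlib import Path


def split_dataset(source_dir, dataset_dir, valid_fraction=0.2, seed=42):
    """Copy images from source_dir/<class> into dataset_dir/train and /valid.

    Hypothetical helper; valid_fraction and seed are assumed defaults.
    Returns the number of files per class and subset.
    """
    source_dir = Path(source_dir)
    dataset_dir = Path(dataset_dir)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible

    counts = {}
    for class_dir in sorted(p for p in source_dir.iterdir() if p.is_dir()):
        images = sorted(p for p in class_dir.iterdir() if p.is_file())
        rng.shuffle(images)
        n_valid = int(len(images) * valid_fraction)
        subsets = {"valid": images[:n_valid], "train": images[n_valid:]}

        for subset, files in subsets.items():
            out_dir = dataset_dir / subset / class_dir.name
            out_dir.mkdir(parents=True, exist_ok=True)
            for image in files:
                shutil.copy2(image, out_dir / image.name)

        counts[class_dir.name] = {name: len(files) for name, files in subsets.items()}
    return counts
```

Because the validation files live in their own folder, the validation generator can be created without any augmentation settings.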

Running the network

Now the collected and processed images can be fed to the network.

That is what the network was dealing with.

Below are a few steps covering building and setting up the network training. I’m highlighting some parts of the setup.

Main parameters

As I’ve been using Comet.ML for logging the network progress, it is very convenient to keep all parameters in a dictionary and pass it as a parameter to the experiment.

params = {}


params["batch_size"] = 64
params["num_classes"] = 2
params["epochs"] = 10
params["optimizer"] = "adam"
params["activation"] = "relu"
params["base_lr"] = 0.0001
params["fine_tune_lr"] = params["base_lr"]/10
params["fine_tuning"] = False
params["fine_tune_at"] = 20
params["initial_epochs"] = 15
params["fine_tune_epochs"] = 15

Model setup

Here I’ve defined two functions for model adaptation:

get_model: provides the initial model without the classic 1000-class classifier and lets you train your own one. All layers except your own are frozen here and don’t learn anything new.

fine_tune_model: unfreezes some layers for fine-tuning the classifier.

The approach I was taking is very well covered in the Google TensorFlow documentation.

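# Imports assumed from the tf.keras API; the original snippet omitted them.
from tensorflow.keras.applications import MobileNet
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model


def get_model(params):
    # Load MobileNet pre-trained on ImageNet without its 1000-class top layer.
    base_model = MobileNet(weights="imagenet", include_top=False)

    # Freeze the pre-trained layers unless fine-tuning is enabled.
    for layer in base_model.layers:
        layer.trainable = params["fine_tuning"]

    # New classifier head on top of the pre-trained base.
    x = base_model.output
    x = GlobalAveragePooling2D()(x)
    x = Dense(512, activation=params["activation"])(x)

    # Final layer with softmax activation, one output per class.
    preds = Dense(params["num_classes"], activation="softmax")(x)

    model = Model(inputs=base_model.input, outputs=preds)
    return model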

def fine_tune_model(model, params):
    # Keep the first `fine_tune_at` layers of the network frozen...
    for layer in model.layers[: params["fine_tune_at"]]:
        layer.trainable = False
    # ...and unfreeze everything after them for fine-tuning.
    for layer in model.layers[params["fine_tune_at"] :]:
        layer.trainable = True
    # Note: the model has to be compiled again for this change to take effect.
    return model

Keras image generators setup

Nothing special here, except that I tried applying simple image augmentation, which didn’t impact the metrics much.

# preprocess_input already scales pixels to the [-1, 1] range MobileNet
# expects, so no additional rescale=1/255 is applied on top of it.
train_datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
)

# Validation images get the same preprocessing but no augmentation.
valid_datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input
)

train_generator = train_datagen.flow_from_directory(
    DATA_DIR + "/train",
    target_size=(224, 224),
    color_mode="rgb",
    batch_size=params["batch_size"],
    class_mode="categorical",
    shuffle=True,
)

validation_generator = valid_datagen.flow_from_directory(
    DATA_DIR + "/valid",
    target_size=(224, 224),
    color_mode="rgb",
    batch_size=params["batch_size"],
    class_mode="categorical",
    shuffle=True,
)

Callbacks setup

For me, setting up callbacks is an essential part of network training. I used the classic combination of logging, checkpoints, and early stopping.

log = callbacks.CSVLogger(os.path.join(MODELS_DIR, "bird-vs-not-bird-log.csv"))

checkpoint = callbacks.ModelCheckpoint(
    os.path.join(MODELS_DIR, "bird-vs-not-bird-weights-{epoch:02d}.h5"),
    save_best_only=True,
    save_weights_only=True,
    verbose=1,
)

lr_decay = callbacks.LearningRateScheduler(
    schedule=lambda epoch: params["base_lr"] * (0.9 ** epoch)
)

early_stopping = callbacks.EarlyStopping(
    monitor="val_loss", min_delta=1e-4, patience=5, verbose=1, mode="auto"
)
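
To get a feeling for what the lr_decay schedule does, the same formula can be evaluated outside of Keras. This is just a sketch: decayed_lr is a hypothetical helper name, and base_lr repeats the 0.0001 from the parameters above.

```python
base_lr = 0.0001  # same value as params["base_lr"]


def decayed_lr(epoch, base_lr=base_lr, decay=0.9):
    """Learning rate for a given epoch: 10% smaller with every epoch."""
    return base_lr * (decay ** epoch)


for epoch in range(0, 10, 3):
    print(f"epoch {epoch}: lr = {decayed_lr(epoch):.2e}")
# epoch 0: lr = 1.00e-04
# epoch 3: lr = 7.29e-05
# epoch 6: lr = 5.31e-05
# epoch 9: lr = 3.87e-05
```

So by the last of 10 epochs the learning rate has dropped to roughly 39% of its initial value.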

Training part

Two things worth mentioning:

warnings.filterwarnings: suppresses noisy warnings about corrupt image EXIF metadata

use_multiprocessing=True and workers=7: gives a significant boost on multi-core machines

# Silence PIL warnings about corrupt EXIF metadata in downloaded images.
warnings.filterwarnings("ignore", "(Possibly )?corrupt EXIF data", UserWarning)

with experiment.train():
    step_size_train = train_generator.samples // params["batch_size"]
    history = model.fit_generator(
        generator=train_generator,
        validation_data=validation_generator,
        validation_steps=validation_generator.samples // params["batch_size"],
        steps_per_epoch=step_size_train,
        epochs=params["epochs"],
        verbose=1,
        # early_stopping was defined above but missing from this list.
        callbacks=[log, checkpoint, lr_decay, early_stopping],
        use_multiprocessing=True,
        workers=7,
        max_queue_size=2000,
    )

model.save_weights(os.path.join(MODELS_DIR, "mobilenet.bird-vs-not-bird.generic.h5"))

Training took around 270 seconds per epoch, which is quite acceptable.

Pitfalls part

Here is the progress of network training with accuracy and loss metrics. It looked really nice until I added the validation accuracy and validation loss metrics.

“Validation Accuracy” and “Validation Loss” metrics from the latest experiment

Reviewing them, a knowledgeable person could easily say that the network is overfitting. That was the kind of problem with the dataset I was dealing with: it seemed that the network wasn’t able to generalize enough to distinguish drawings from photos. This is where early_stopping became very useful.

And again, I didn’t expect the result to be perfect.

Evaluating the results

After the network was trained, I fed it all the initially downloaded images and moved every image classified as non-bird to a separate folder for manual review.

Of course, with such an over-fitted network there were some false positives, but even with them the initial goal was achieved.

It became much easier to remove birds from the non-bird results than to look for a few non-bird images in a list of hundreds of birds.
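
The sorting step can be sketched roughly like this. It is a simplified illustration rather than the notebook code: move_for_review is a hypothetical helper, and classify stands for any function that wraps the trained network and returns a label for an image path. The review folder is assumed to live outside the image folders.

```python
import shutil
from pathlib import Path


def move_for_review(image_root, review_dir, classify, reject_label="non-bird"):
    """Move every image that classify() labels as reject_label to review_dir.

    classify is a hypothetical callable (Path -> label string) wrapping the
    trained model. Returns the list of new file locations.
    """
    image_root = Path(image_root)
    review_dir = Path(review_dir)

    moved = []
    # Collect the file list first, so moving files does not disturb iteration.
    for image in sorted(p for p in image_root.rglob("*") if p.is_file()):
        if classify(image) != reject_label:
            continue
        # Keep the species folder name, so a wrongly rejected bird
        # can easily be put back after the manual review.
        target = review_dir / image.parent.name / image.name
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(image), str(target))
        moved.append(target)
    return moved
```

Keeping the species sub-folder in the review directory is a design choice of this sketch: false positives can be dragged straight back to where they came from.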

Summary

Notebook [WIP]

https://nbviewer.jupyter.org/github/gaiar/birds-of-berlin/blob/master/nn/bird-vs-not-bird/bird-vs-not-bird-transfer-learning.ipynb

Tracking the progress

For tracking my network experiments I’ve been using a service called Comet.ML. It allows keeping track of all epochs and trials when training neural networks and sharing the results. My experiments are available here: https://www.comet.ml/gaiar

What didn’t work

Trying to put more data into the non-bird class: the network wasn’t able to generalize and was over-fitting badly. Initially, I used the CIFAR-100 class names as download topics, but that didn’t bring any value.

Adding additional layers to the classifier: the same problem again, a more complex network led to over-fitting.

What did work

Dropout to fight over-fitting: I added one dropout layer before the final classifier layer. It helped the network to generalize better.

Cleaning the dataset: I removed all photo-like images from the non-bird class. That improved the validation_loss metric.