PersonAttributes Classifier — training a multi-task neural network to detect attributes like age, gender, …, emotion

Piyush Daga · Published in Analytics Vidhya · 7 min read · Jan 21, 2020

The goal of this challenge is to detect a person’s gender, age, weight, carryingbag, footwear, emotion, and bodypose, as well as the imagequality, all from a single image. Any neural network architecture is allowed, but the training has to be done from scratch: no pre-trained weights and no transfer learning.
To complete this successfully we will need a convolutional neural network with multiple outputs, one output per attribute.
For this challenge I ended up using a DenseNet121 architecture, but any modern network with skip connections should provide good results.

The training is done using Keras. I also tried the one-cycle policy and cyclic learning rates in TensorFlow, but found the Keras solution simple and effective.
Before we take a look at the proposed network architecture let's take a look at the dataset we are dealing with.

You can download the dataset from person-data-gdrive for experimentation.
Also find the person-classifier-github-link which contains the notebook that can directly be run on google-colab.
If you are running the notebook on colab, just make sure to download the data and put it in your google drive’s My Drive as hvc_data.zip.
All the required files visualized below are present in the hvc_data.zip file.

Visualizing the data

Each image is labelled according to the dataframe given below:

Some examples of the images we have.

Let’s see all possible predictions for each category

Viewing the network’s head

As promised, let’s have a look at the head of the network (where the outputs emerge), so that we can understand the code better.
Some connections are hidden to keep the image readable. Find the full network image here.

Now that we have an idea about the problem at hand, let’s start with the actual steps required to train an awesome classifier from scratch.

  • Data Preprocessing -
    ◦ Converting labels to one-hot encoding.
    ◦ Normalizing using the mean and std-dev.
    ◦ Data augmentation (cutout).
    ◦ Building a Keras Sequence to feed the training loop.
  • Designing the architecture -
    ◦ Choosing a backbone.
    ◦ Constructing the tower.
    ◦ Constructing heads (final predictions) for each attribute.
  • Defining the training callbacks.
  • Actual training.

Data PreProcessing

Neural networks expect the prediction labels to be one-hot encoded, so that the predicted probabilities (one per class) can be matched directly against the correct labels.

Hence, we convert all our prediction labels to one-hot encoding, prefixing each new column with its attribute name so that the columns are easy to identify.

The shape of the labels dataframe: (13573, 9)

The one-hot encoded data
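The prefixed one-hot encoding can be sketched with `pd.get_dummies`. The attribute names and values below are illustrative stand-ins; the real ones come from the labels file in hvc_data.zip:

```python
import pandas as pd

# A tiny stand-in for the real labels dataframe; the actual attribute
# names and values come from the dataset's labels file.
df = pd.DataFrame({
    "gender": ["male", "female", "male"],
    "pose":   ["front", "back", "side"],
})

# get_dummies one-hot encodes each column and prefixes the new columns
# with the attribute name, e.g. gender_male, pose_front.
one_hot = pd.get_dummies(df, columns=["gender", "pose"],
                         prefix=["gender", "pose"])
print(sorted(one_hot.columns))
```

With two gender values and three pose values, the three-row frame expands to five one-hot columns.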

Normalizing using the mean and std-dev

The original size of the images was (200x200), but our dataset contains resized versions of them at (224x224).

Since we will be using a DenseNet as our architecture without its head, we don’t need the extra computation, and hence we resize the images back to (200x200).

This resizing is done in the PersonDataGenerator class.

Let’s create the Keras Sequence, which yields batches of data when called by fit_generator.

Store attribute column names in variables

We prefixed columns with attribute names while creating the one-hot encoded versions, so that we could collate the columns for each attribute together; this tells us which columns belong to a specific target.
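The prefix-based grouping can be sketched as follows (the attribute names and column list are illustrative assumptions):

```python
# Collect the one-hot columns belonging to each attribute by prefix.
# Attribute names and columns here are illustrative stand-ins for the
# dataset's real label columns.
attributes = ["gender", "pose"]
columns = ["gender_male", "gender_female",
           "pose_front", "pose_back", "pose_side"]

attr_cols = {a: [c for c in columns if c.startswith(a + "_")]
             for a in attributes}
print(attr_cols["pose"])  # ['pose_front', 'pose_back', 'pose_side']
```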

Let’s build the sequence class which will give batches of data as a generator, shuffle it and apply data augmentations if specified.

Let’s quickly instantiate the PersonDataGenerator class to create the train and test generators. To implement cutout we use the get_random_generator function.
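Since the notebook cell itself is not shown here, the following is a minimal sketch of such a Sequence with a cutout augmentation. The constructor arguments and the `cutout` helper are assumptions, not the article's exact code:

```python
import numpy as np
from tensorflow.keras.utils import Sequence

def cutout(img, size=32):
    """Cutout augmentation: zero out a random square patch of the image."""
    h, w = img.shape[:2]
    y, x = np.random.randint(h), np.random.randint(w)
    y1, y2 = np.clip([y - size // 2, y + size // 2], 0, h)
    x1, x2 = np.clip([x - size // 2, x + size // 2], 0, w)
    img = img.copy()
    img[y1:y2, x1:x2] = 0
    return img

class PersonDataGenerator(Sequence):
    """Yields (image batch, dict of one-hot targets) for multi-output
    training; shuffles each epoch and optionally applies cutout."""
    def __init__(self, images, targets, batch_size=32,
                 shuffle=True, augment=False):
        self.images, self.targets = images, targets
        self.batch_size, self.shuffle, self.augment = batch_size, shuffle, augment
        self.indices = np.arange(len(images))
        self.on_epoch_end()

    def __len__(self):
        return len(self.images) // self.batch_size

    def __getitem__(self, idx):
        ids = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch = self.images[ids].astype("float32") / 255.0
        if self.augment:
            batch = np.stack([cutout(im) for im in batch])
        # One target array per attribute, keyed by the head's name.
        return batch, {k: v[ids] for k, v in self.targets.items()}

    def on_epoch_end(self):
        if self.shuffle:
            np.random.shuffle(self.indices)
```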

Designing the architecture

Choosing a backbone

Training from scratch was one of the requirements of this exercise, so we do not use pre-trained weights. We take a DenseNet121 backbone without the head. It was chosen over ResNet50, but that should also provide comparable results.

Since this is a multi-task classification problem, the final model we are going for looks like this:

  1. Choose an architecture as a backbone (here we choose DenseNet121, as it worked better than ResNet50 in initial tests), without the head, as we will build our own.
  2. Build a tower for each attribute; the architecture of the towers largely stays the same.
  3. Build the respective heads, which are the outputs for each attribute.
  4. Construct the overall model, specifying the inputs and the outputs.

Code for the backbone.

The final shape after the last ReLU is (6, 6, 1024).
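The backbone cell is not reproduced in this version of the article, but it amounts to something like this (`weights=None` gives random initialization, i.e. genuinely training from scratch):

```python
from tensorflow.keras.applications import DenseNet121

# DenseNet121 without its classification head, randomly initialized
# (weights=None) so no pre-trained weights are used.
backbone = DenseNet121(include_top=False, weights=None,
                       input_shape=(200, 200, 3))
print(backbone.output_shape)  # (None, 6, 6, 1024)
```

The (200, 200, 3) input passes through five stride-2 stages (a factor of 32), giving the 6x6 spatial size noted above.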

This is passed through GlobalAveragePooling (referred to as GAP from here on) to average each feature map over its spatial dimensions. This brings the tensor shape to a flat (None, 1024).

I strongly believe this can be improved if, instead of only using GAP, we also use GlobalMaxPooling and concatenate the outputs. That results in a rank-1 tensor of length 2048, which carries the advantages of both the mean and the max.

Here is a representation of what I mean:
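In code, the idea would look roughly like this sketch over the backbone's (6, 6, 1024) feature map:

```python
from tensorflow.keras.layers import (Input, GlobalAveragePooling2D,
                                     GlobalMaxPooling2D, Concatenate)
from tensorflow.keras.models import Model

# Concatenating GAP and GMP over a (6, 6, 1024) feature map yields a
# 2048-dim vector combining mean and max statistics of each channel.
feat = Input(shape=(6, 6, 1024))
pooled = Concatenate()([GlobalAveragePooling2D()(feat),
                        GlobalMaxPooling2D()(feat)])
m = Model(feat, pooled)
print(m.output_shape)  # (None, 2048)
```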

Constructing the tower

The tower adds a batchnorm after GAP, to normalize the pooled features, along with a small amount of dropout to improve resilience.
It is followed by a densely connected layer that reduces the tensor to 128 nodes, on which each head builds.
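A minimal sketch of one such tower (the dropout rate is an assumption; the article only says "a small amount"):

```python
from tensorflow.keras.layers import Input, BatchNormalization, Dropout, Dense
from tensorflow.keras.models import Model

# Tower: batchnorm to re-normalize the pooled features, light dropout
# for resilience, then a Dense layer down to 128 units for the heads.
pooled = Input(shape=(1024,))
x = BatchNormalization()(pooled)
x = Dropout(0.2)(x)          # rate assumed, not from the article
tower = Dense(128, activation="relu")(x)
m = Model(pooled, tower)
print(m.output_shape)  # (None, 128)
```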

Constructing the heads

The head is the final layer that produces the outputs.
Take pose (one of the attributes we need to predict): it has 3 possible values, namely (front-frontish, back and side), so each possible value needs a final node representing the probability of its occurrence.
num_units contains the mapping of each attribute name to its number of categories. build_head builds the head for each attribute, assigning the appropriate number of final nodes as per num_units.
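A sketch of that mapping and builder; the pose count (3) is from the article, while the other counts and the `_output` naming convention are assumptions:

```python
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# Attribute -> number of categories. pose=3 is from the article;
# the other counts are illustrative assumptions.
num_units = {"gender": 2, "pose": 3, "emotion": 4}

def build_head(name, tower):
    """Softmax output layer with one node per category of the attribute."""
    return Dense(num_units[name], activation="softmax",
                 name=f"{name}_output")(tower)

tower = Input(shape=(128,))          # output of the tower
heads = [build_head(n, tower) for n in num_units]
m = Model(tower, heads)
print(m.output_shape)  # [(None, 2), (None, 3), (None, 4)]
```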

Code to build the complete Model

You can check out how the complete network looks here.
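Putting the pieces together, an end-to-end sketch under the same assumptions (one tower per attribute, illustrative attribute counts and dropout rate) would be:

```python
from tensorflow.keras.applications import DenseNet121
from tensorflow.keras.layers import (GlobalAveragePooling2D,
                                     BatchNormalization, Dropout, Dense)
from tensorflow.keras.models import Model

num_units = {"gender": 2, "pose": 3}   # illustrative subset of attributes

def build_tower(x, drop=0.2):
    """Batchnorm + dropout + Dense(128), one tower per attribute."""
    x = BatchNormalization()(x)
    x = Dropout(drop)(x)
    return Dense(128, activation="relu")(x)

# Backbone from scratch, then GAP, then a tower and head per attribute.
backbone = DenseNet121(include_top=False, weights=None,
                       input_shape=(200, 200, 3))
pooled = GlobalAveragePooling2D()(backbone.output)
outputs = [Dense(n, activation="softmax",
                 name=f"{name}_output")(build_tower(pooled))
           for name, n in num_units.items()]
model = Model(inputs=backbone.input, outputs=outputs)
```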

Defining the training callbacks

  1. ReduceLROnPlateau -> Reduces the learning rate when val_loss does not improve by min_delta for patience epochs.
  2. ModelCheckpoint -> Saves the model weights in the specified directory.
  3. EarlyStopping -> Stops training if val_loss does not improve by min_delta for patience epochs, restoring the best weights seen during training after stopping.
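The three callbacks can be sketched as follows; the min_delta, factor, patience, and filepath values below are illustrative assumptions, not the article's exact settings:

```python
from tensorflow.keras.callbacks import (ReduceLROnPlateau, ModelCheckpoint,
                                        EarlyStopping)

# Hyperparameter values are illustrative, not the article's exact settings.
callbacks = [
    ReduceLROnPlateau(monitor="val_loss", factor=0.2,
                      patience=3, min_delta=1e-3),
    ModelCheckpoint("models/weights.{epoch:02d}.h5",
                    monitor="val_loss", save_best_only=True),
    EarlyStopping(monitor="val_loss", min_delta=1e-3,
                  patience=5, restore_best_weights=True),
]
```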

Let’s compile the model before we start the actual training.

The model compilation step is implemented as a function, so that it can be called from any cell and on any model (the backbone or partial models).
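A minimal sketch of such a function; the optimizer, learning rate, and metrics are assumptions (Keras broadcasts a single loss string across all output heads):

```python
from tensorflow.keras.optimizers import Adam

def compile_model(model, lr=1e-3):
    """Compile any (partial or full) model with a categorical
    cross-entropy loss applied to every output head."""
    model.compile(
        optimizer=Adam(learning_rate=lr),  # optimizer/lr assumed
        loss="categorical_crossentropy",   # applied to each output
        metrics=["accuracy"],
    )
    return model
```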

Let the training begin

The training was carried on in the following sequence:

  1. (100x100) images with all layers trainable.
  2. (100x100) images with backbone frozen, so that the final layers could be fine-tuned.
  3. (100x100) images with all trainable layers but aggressive data augmentation with a lr = 1e-4.
  4. (200x200) images with all trainable layers.
  5. (200x200) images with backbone frozen.

All the training loops had EarlyStopping enabled, so the number of epochs ranged from 8-15 in each case, after which the training stopped and the next step was run.

For the sake of brevity, I will not include all training steps here, but you can always check the actual colab file, which has all the steps with logs.

If you wish to continue training, and wish to use my pre-trained weights, you can find them in the github repo models folder.

Results

Further experiments / Ways to improve the model

There are many ways to improve on the results achieved here. I will keep updating the actual github repo with some of these recommendations down the line. If you liked the article or would like further updates, please `watch` or `star` the github repo. It would motivate me to work on other similar articles.

So without further ado, I’m listing some recommendations in no particular order.

1. Better normalization and weight initialization.

2. Loss weights for each individual class.

3. Better image augmentation, especially using imgaug.

4. Try different architectures, for example InceptionV4, EfficientNet, ResNeXt.

5. Loosen the regularization, such as dropout.

6. Try the One Cycle Policy and Cyclic Learning Rate, especially the implementation by fast.ai.

7. Instead of only GAP use `concatenate(GAP, GMP)`, and build tower after that.

I hope to implement some of these, see the loss decrease, and get better results. If any of you implement them and see interesting results, I would love to hear about it.
