Neural network course project: Adoption Prediction (from Kaggle)

Motivation

Greeser
U Tartu Projects
May 28, 2019

There are millions of stray animals (mostly dogs and cats) living in shelters. Finding a new home can save their lives. The goal of this project is to predict how fast a new animal will be adopted.

Dataset description

The data come from a database of stray animals on Petfinder.my.
The training dataset consists of 14993 profiles and gives us important information about each stray pet, such as its age, gender, colour, size, and whether or not it is vaccinated and sterilised. Besides that, some pets have photos and a text description (like “Good guard dog, very alert, active..’’).
The task is to predict the adoption speed. There are five possible values of the adoption speed:

  1. Pet was adopted on the same day as it was listed.
  2. Pet was adopted between 1 and 7 days after being listed.
  3. Pet was adopted between 8 and 30 days after being listed.
  4. Pet was adopted between 31 and 90 days after being listed.
  5. No adoption after 100 days of being listed.

The official test dataset consists of 3972 profiles. However, we decided to split the training set into training/validation (together 80% of the data) and test (20% of the data) sets, so that we would not need to submit predictions to Kaggle (the competition is already closed anyway).

Data visualization

Some of our visualizations are presented in the following figures. The plots are normalized per group: within each group the values are divided by the group count, so the heights of all bars in a group sum to one.

In figure 1, one can observe general trends in adoption speed. Overall, cats tend to be adopted faster than dogs, and the proportion of dogs that are never adopted is higher than the proportion of such cats.

Figure 1

According to some Kaggle kernels that listed feature importances for their models, age appears to be one of the most important features. In figure 2, we show a decision tree that uses age as the only feature. Obviously, it is impossible to reach good accuracy with such a simple decision tree, but we can try to understand what the ‘key values’ of age are and look for dependencies.

Figure 2

We can use the age values chosen for the splits to turn age into a categorical variable and then plot it. In figure 3 one can observe that more than 2/3 of young (under 2 months) animals are adopted within a month, while almost half of the animals older than 5 months are not adopted at all.

Figure 3

In figure 4 we can see that all sizes have approximately the same chance of adoption, for both dogs and cats. Extra Large animals form a very small group, which is why this group differs a lot from the others.

Figure 4

In figure 5 we can observe that the Fur Length feature is also not very meaningful (at least on its own), because its distribution over adoption speeds is very similar to the overall adoption-speed distribution for each animal type.

Figure 5

Preprocessing and Baseline models

Preprocessing

Before running experiments it is good practice to inspect your data. For example, some features are numerical, others are categorical, and some may carry useless or irrelevant information.

We used one-hot encoding to handle categorical variables and min-max scaling for numerical features.

Also, we found out that almost every animal has a unique name, so the name does not provide any useful information and we decided to drop it. Besides names, we dropped the rescuer IDs and pet IDs.

In the end, we selected three types of features:

  • Main features: all features given in the main dataset except RescuerIDs, PetIDs, Names and Descriptions.
  • Text features: features obtained from the Description field.
  • Image features: features obtained from the images.

Baseline models

Baseline models are models that take only the main features as input.
As we have a small dataset, we decided to use one non-neural method (which we expect to give the best result) and two neural networks (because we are doing this project for a Neural Networks course :)).

For the non-neural method, we used Random Forest from the sklearn library, and for the neural methods we took feed-forward and recurrent neural networks (keras library).

Random Forest
Random Forest (RF) is a widespread machine learning algorithm. It is an ensemble learning method for classification and regression tasks. It combines the outputs of many decision trees, which helps to prevent overfitting.

We applied randomized search over hyper-parameters (RandomizedSearchCV from the sklearn library) to find better parameters.

We also looked at the feature importances reported by the Random Forest. The most important feature is the age of the animal, and the second one is the presence of photos.

Feed-forward neural network
We built a feed-forward neural network (FF). We played with different network parameters, such as the number of layers, the number of neurons, the loss, the activation functions and others. The best results were obtained with a two-layer network trained with the Adam optimizer and a learning rate of 0.0001. Both hidden layers contain 25 neurons with ReLU activations, the output layer uses softmax, a Dropout layer that drops 10% of the neurons follows the first hidden layer, and the loss was set to mean squared error (MSE).

Recurrent neural network
We also decided to build a recurrent neural network (RNN).
We tried the same parameters we used to train the FF network and different types of RNN: simple RNN and LSTM.

Unfortunately, all our attempts showed very low accuracy. We assume our dataset is not large enough to use complicated neural networks.

Results for baseline models: Random Forest, feed-forward and recurrent neural networks with the main features.

Models with text and image features

As we mentioned above, besides the main features we also have Descriptions and Images of the animals. We extracted features from the text and images and added them as additional inputs to our neural networks.

Extract embeddings from text

There is a column “Description’’ that contains various information about the animals. For example,

“This kitten was given birth by a mother cat that we own. I already have many cats and cannot afford to keep anymore. It is cute and loveable. I hope that a cat-lover will give it a home.’’

or

“I just found it alone yesterday near my apartment. It was shaking so I had to bring it home to provide temporary care.’’

or

“Please feel free to contact us : Mr Tan’’.

As we can see, descriptions provide a wide variety of information, and some of them do not give any relevant information about the animal. For this reason we do not expect to get much better results compared to the baseline models.

As we wrote above, we do not have a lot of data, so training embeddings from scratch is not a good idea. We took two different pre-trained embeddings: GLOVE and LASER. GLOVE provides non-contextual word embeddings, whereas LASER provides sentence embeddings.

We also wanted to try BERT and ELMo embeddings because they are contextual word embeddings, but after an analysis of the given descriptions we decided not to, since the texts do not provide much meaningful information.

Models with the main features and LASER embeddings

We extracted LASER embeddings (dimension 1024) and added them as input to the baseline model. We used two models:

Model 1: one input feed-forward network:

  • Concatenate all features into one vector

Model 2: two inputs feed-forward network:

  • Main features are the first input and embeddings are the second input

Models with the main features and GLOVE embeddings

We took 100-dimensional GLOVE embeddings (glove.6B, trained on Wikipedia 2014 and the Gigaword corpus) and added them to the neural model.

For all models we also varied parameters: number of layers, number of neurons, batch size, learning rate and so on.

As we expected, adding text features did not improve accuracy significantly. All models showed more or less the same result.

Results for models with main and text features. LASER denotes LASER sentence embeddings, GLOVE denotes GLOVE word embeddings.

Models with the main features and image features

Input variation

Along with the main features, we extracted image features using the DenseNet121 model. More precisely, since DenseNet121 provides 1024 features per image, this gave us an additional (14652, 1024) matrix. For the experiment, we varied how the inputs are fed to our baseline feed-forward network. There are three different types of input:

  1. Joint input: image features are simply concatenated with the main features.
  2. Double input: image features and main features are fed as two separate inputs.
  3. Triple input: in addition, the main features are split into numerical and categorical parts; together with the image features this gives three inputs.

Since each model has a different number of inputs, the overall architecture changes slightly.

Each model was trained using the Adam optimizer, the inner activation is ReLU and the output activation is softmax. The loss is cross-entropy.

The corresponding results are shown in the table below.

Results for models with main and image (DenseNet121) features.

Models with only text and image features

To investigate how the text and image features affect our models, we decided to train a model with only text and image features.

We extracted image features with DenseNet121 and text embeddings with LASER. These features are fed as two inputs to the network; they are then concatenated and followed by one hidden layer with 16 units.

Results for the model with text (LASER) and image (DenseNet121) features.

Models with all features: main, text and image features

Feed-forward network on main, text and image features

For this method, we decided to use all types of features available.
We took the one-hot-encoded main features, LASER text embeddings and image features from a pre-trained MobileNetV2.
We tried models with different architectures; the main difference was in the input shapes. The first model took all features concatenated into one vector, the second had two inputs (image features as the first input and all other features as the second), and the third had three separate inputs.

Model 1: one input feed-forward network:

  • Concatenate all features into one vector
  • 3-layer feed-forward network

Model 2: two inputs feed-forward network:

  • 1st: image features.
  • 2nd: text and main features concatenated.

Model 3: three inputs feed-forward network:

  • 1st: image features.
  • 2nd: text features.
  • 3rd: main features.

Results for models with main, text (LASER) and image (MobileNetV2) features.

All results together

Below we present the results of all the models together. As we expected, the best model is Random Forest with the main features. Among the neural network models, the best is the feed-forward network with the main and text (LASER) features as two separate inputs. It is worth noting that most of the neural network models showed more or less similar results.

Results for all trained models.

The project was made by Alina Vorontseva, Artem Domnich, Kateryna Peikova and Lisa Yankovskaya.
