Hotel Image Categorization with Deep Learning

Trideep Rath
Published in Life at KAYAK
Jul 19, 2018 · 8 min read

Our mission at KAYAK is to help people experience the world by giving them the information they need to make the right decision on hotels, flights, and rental cars. Images are a key element. They act as storytellers. They are one of the first things that capture our attention — well maybe after the price!

In this post, I will discuss how we use deep learning to categorize our hotel images.

What’s The Issue?

Users seek out hotel photos before deciding to book, so we present these images differently across our pages to ease the decision-making process.

Fig 1. Hotel Search Page

On our hotel search result page (Fig 1), we display an optimal number of images per hotel to avoid clutter while still providing a comprehensive impression through varied imagery. It would not be a great user experience to show five bathroom images and leave out the bedroom, pool, or the cool lobby.

Fig 2. Hotel Details Page

For our hotel details page (Fig 2), we display a complete catalogue of images, organized by category, allowing users to drill down into the categories that interest them most.

To power these features, image categorization assigns each image to one of several predefined classes.

What We Did in The Past
Our visual analytics and content team spends a lot of time manually curating and classifying hotel images. Currently, we receive thousands of images every day from our various providers. At this scale, manual classification is too slow, too expensive, and ultimately impractical.

In the past, we tried conventional machine learning techniques to train multiple models: one to detect categories like bedrooms, bathrooms, and food, and another to detect pools and beaches. This approach involves a lot of feature engineering and careful balancing of confidence thresholds for every category. It works for some categories, but overall it is quite inaccurate and tricky to extend to new categories.

The Deep Learning Approach
The past decade has seen a resurgence in deep learning, from a niche field of research to a major part of many industrial applications. One of the most significant impacts has been in the field of computer vision with Convolutional Neural Networks (CNNs). CNNs have become the de facto standard for most image recognition, classification, and detection tasks, outperforming not only conventional machine learning techniques but also humans on many of these tasks. In this section, I will discuss the deep learning model currently in use to classify images into the 30 most important categories for our users.

The Dataset

One of the essential ingredients for training any deep learning model is a good amount of labeled training data. For years, we have been manually labeling our images with the help of an external agency. For training, we had a labeled dataset of 800,000 images, a substantial amount of data to have at our disposal.

There were two challenges with this dataset. First, the dataset was highly imbalanced, with “bedroom” making up 25% of the total images, compared to “floorplan” with only 0.2%. We experimented with two data sampling techniques: one that only under-sampled the overrepresented categories, and another that both under-sampled the overrepresented categories and over-sampled the underrepresented ones. We based this approach on ideas from this paper [1].
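To make the sampling concrete, here is a minimal sketch (not our production code) of computing per-category sampling targets, capping the overrepresented categories and repeating images from the underrepresented ones; the counts, cap, and floor are invented for illustration:

```python
import numpy as np

# Hypothetical per-category image counts (illustrative only)
counts = {"bedroom": 200_000, "bathroom": 90_000, "floorplan": 1_600}

def sampling_targets(counts, cap=50_000, floor=10_000):
    """Under-sample categories above `cap`; over-sample (by repeating
    images) categories below `floor`."""
    return {category: int(np.clip(n, floor, cap))
            for category, n in counts.items()}

print(sampling_targets(counts))
# {'bedroom': 50000, 'bathroom': 50000, 'floorplan': 10000}
```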

Second, the manual labeling was not accurate, with up to 20% inaccuracy for certain categories. Some categories are easily confused, like “lobby” and “living room” or “patio” and “balcony”, and some images contain multiple categories, like a dining room plus a kitchen.

We randomly selected 70% of the images for training, 10% for validation, and 20% for testing. Sampling was performed only on the training set. The validation set was used for early stopping and hyperparameter optimization, while the testing set was used for model evaluation.
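A minimal sketch of such a split with scikit-learn (our actual pipeline differs; the stratification here is just one reasonable way to keep category proportions similar across the sets):

```python
from sklearn.model_selection import train_test_split

# Placeholder data; in practice these come from the labeled catalogue
image_paths = [f"img_{i}.jpg" for i in range(1000)]
labels = ["bedroom" if i % 2 else "bathroom" for i in range(1000)]

# Hold out 30% of the data, then split it into 10% validation / 20% test
train_x, rest_x, train_y, rest_y = train_test_split(
    image_paths, labels, test_size=0.30, stratify=labels, random_state=42)
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=2 / 3, stratify=rest_y, random_state=42)
```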

Model Architecture and Training Details

There have been many successful neural network architectures for image classification in recent years. The factors we considered when choosing a model were the number of parameters, its performance across different datasets, its size, and the availability of pre-trained weights. We used InceptionV3 [2], which placed 2nd in the 2015 ImageNet Challenge and was a significant effort to reduce the number of parameters. This network uses a series of inception modules, which are basically mini models comprising different-sized convolution filters operating on the same input feature map. The last layer of the network is a softmax layer giving a probability for each category, and these probabilities add up to 1.

Fig 3. InceptionV3 neural net architecture (source)

We initialized our model with pre-trained weights from the ImageNet dataset. This technique is called transfer learning, where a model developed for one task is reused as the starting point for a model on a second task. We use batch normalization [4] during training, wherein the activations are normalized at every layer of the network for each batch. These techniques speed up the training process [3] significantly.
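Our implementation is based on TF-Slim (more on that below), but the setup can be sketched with today's tf.keras API; the ImageNet initialization and the 30-way softmax head mirror the description above:

```python
import tensorflow as tf

NUM_CLASSES = 30  # the 30 categories most important to our users

# InceptionV3 backbone initialized with ImageNet weights (transfer learning)
base = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", pooling="avg",
    input_shape=(299, 299, 3))

# New softmax head producing one probability per category (sums to 1)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(base.output)
model = tf.keras.Model(inputs=base.input, outputs=outputs)
```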

Fig 4. Data Augmentation

During training, we perform data augmentation, meaning we slightly distort our images with random cropping, contrast changes, and color manipulation. This helps the model generalize better [5], especially in our case, where we have performed over-sampling.
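A sketch of such an augmentation function using tf.image; the crop size and jitter ranges are illustrative, not our tuned values:

```python
import tensorflow as tf

def augment(image):
    """Randomly distort one image tensor with values in [0, 1]."""
    image = tf.image.random_crop(image, size=[280, 280, 3])  # random crop
    image = tf.image.resize(image, [299, 299])               # back to input size
    image = tf.image.random_contrast(image, 0.8, 1.2)        # contrast change
    image = tf.image.random_saturation(image, 0.8, 1.2)      # color manipulation
    return tf.clip_by_value(image, 0.0, 1.0)
```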

We perform optimization using Adam [6], a variant of stochastic gradient descent, and apply a step decay to the initial learning rate.
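Continuing the tf.keras sketch, Adam with a step decay might look like this; the initial learning rate of 0.0007 and 0.5 decay factor come from our results below, while the decay interval is an assumption:

```python
import tensorflow as tf

# Halve the learning rate at fixed intervals (staircase=True makes the
# decay step-wise rather than smooth); decay_steps is illustrative
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=7e-4, decay_steps=10_000,
    decay_rate=0.5, staircase=True)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=schedule),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```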

Finally, we trained 8 models with different hyperparameter settings (learning rate, decay rate, batch size) and with two different data sampling techniques.
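As an aside, one way such a grid of eight configurations could be enumerated (the candidate values here are illustrative, not the actual grid we searched):

```python
from itertools import product

learning_rates = [7e-4, 1e-3]  # illustrative candidates
decay_rates = [0.5, 0.9]
batch_sizes = [60, 120]

configs = list(product(learning_rates, decay_rates, batch_sizes))
print(len(configs))  # 8 hyperparameter configurations
```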

Implementation and Infrastructure

The InceptionV3 model implementation is based on TensorFlow's Slim framework. We developed our own deep learning tool, which we call “Deep KAYAK”: a high-level API for vision and NLP tasks that performs hyperparameter optimization, cross-validation, and ensemble learning [7] with models running in parallel.

Training was done on a p2.8xlarge AWS instance containing 8 NVIDIA Tesla K80 GPUs with 12 GB of GPU memory each. Image batch fetching and preprocessing are performed with multi-threaded queues running on the CPUs so that the GPUs always have image batches ready for them. Training took about 3 days.
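The original implementation used TensorFlow's multi-threaded queues; a modern tf.data sketch of the same producer-consumer idea, decoding on CPU threads and prefetching batches so the GPUs never wait, looks roughly like this (the paths and labels are placeholders):

```python
import tensorflow as tf

def load_image(path, label):
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    image = tf.image.resize(image, [299, 299]) / 255.0
    return image, label

paths = ["img_0.jpg", "img_1.jpg"]  # placeholder file names
labels = [0, 1]                     # placeholder integer category ids

# Training would also .map() the augment() sketch from earlier
dataset = (tf.data.Dataset.from_tensor_slices((paths, labels))
           .shuffle(10_000)
           .map(load_image, num_parallel_calls=tf.data.AUTOTUNE)  # CPU threads
           .batch(60)
           .prefetch(tf.data.AUTOTUNE))  # keep batches ready for the GPUs
```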

Fig 5. Multiple GPUs (left) and CPUs (right) running in parallel
Fig 6. TensorBoard training and validation loss

TensorBoard (Fig 6) was used to visualize the loss, accuracy, mean accuracy per class, and learning rate while the model was training.
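With the tf.keras sketch above, wiring in TensorBoard is a one-liner (the original TF-Slim code used summary ops instead; epochs and log directory are illustrative):

```python
import tensorflow as tf

val_dataset = dataset  # placeholder; build from the validation split in practice

# Writes loss, accuracy, and other scalars for TensorBoard to display
tensorboard = tf.keras.callbacks.TensorBoard(log_dir="logs/hotel_classifier")
model.fit(dataset, validation_data=val_dataset, epochs=20,
          callbacks=[tensorboard])
```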

Results

Some categories perform really well, such as “bathroom”, “bedroom”, and “floorplan”, with above 90% accuracy. It's interesting that the model performs so well for “floorplan”, which contained the fewest images. Categories like “gym”, “laundry”, “food”, “restaurant”, and “pool” perform decently, with above 80% accuracy. Categories like “banquet hall”, “beach”, “conference”, “lobby”, and “spa” do not perform well due to noise in their labels, a lack of training images, and the multi-label nature of the images.

Fig 7. Images classified by the model

The best results were achieved with the data sampling technique involving both under-sampling and over-sampling, with a batch size of 60, a learning rate of 0.0007, and a 0.5 decay rate. The sampling technique with both over-sampling and under-sampling consistently outperformed the other by ~1.5% accuracy. All the results mentioned here are from the model trained with these settings.

The confusion matrix in Fig 8 shows the performance of the model for all categories.

Fig 8. Confusion matrix with the testing set

The quantitative results shown are not a true representation of error rates, as our test set has a lot of noise. We manually inspected some of the incorrectly classified images and found the model performing much better than the numbers suggest. Fig 9 shows images correctly predicted by the model that were incorrectly labeled by humans.

Fig 9. Examples of Model performing better than humans

A good model would have high precision (i.e., how often the model's “bedroom” predictions are correct) and also high recall (i.e., how many of the “bedroom” images it finds). We can increase the precision of the system by raising the threshold (the probability of the prediction), but this reduces the recall. This is called the precision-recall tradeoff.
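For a single category, this tradeoff can be traced with scikit-learn's precision_recall_curve, treating the category one-vs-rest; the labels and scores below are random placeholders:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                          # 1 if labeled "bedroom"
y_score = np.clip(0.6 * y_true + 0.5 * rng.random(1000), 0, 1)  # model probability

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(f"PR AUC: {auc(recall, precision):.3f}")  # area under the curve
```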

In our case, we would like high precision of about 95% for important categories like “bedroom” or “bathroom”, but we can compromise on categories like “front desk”, “hall”, “stairs”, etc. This is why we set the threshold per category. All predictions below the threshold are tagged unknown and sent for manual review, which helps us improve the model iteratively. Fig 10 shows the average precision-recall curve (left) as well as the precision-recall curves of a few individual categories (right); every point on a curve corresponds to a certain threshold value. The area under the curve (AUC) is one metric to evaluate the performance of the model.

Fig 10. Average Precision-Recall curve (left), Classwise Precision-Recall curve (right)
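A sketch of the per-category thresholding described above; the threshold values here are hypothetical, each picked from that category's precision-recall curve to hit the desired precision:

```python
import numpy as np

CATEGORIES = ["bedroom", "bathroom", "front desk"]  # abbreviated list

# Hypothetical thresholds: stricter for the important categories
THRESHOLDS = {"bedroom": 0.92, "bathroom": 0.90, "front desk": 0.55}

def categorize(probs):
    """probs: the model's softmax output, one probability per category."""
    best = int(np.argmax(probs))
    category = CATEGORIES[best]
    if probs[best] < THRESHOLDS[category]:
        return "unknown"  # queued for manual review
    return category

print(categorize(np.array([0.97, 0.02, 0.01])))  # bedroom
print(categorize(np.array([0.50, 0.30, 0.20])))  # unknown
```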

Conclusion

This project explored deep learning for automatically categorizing our hotel images into broad categories. It was a successful attempt to develop a system with higher accuracy and speed, and it has replaced our previous generation machine learning model.

In the next article, I will discuss our next generation model, which performs multi-label classification over 140 categories using an image caption dataset.

Acknowledgments

This project was possible due to a collective effort by various teams. John Graham and Mark Beznos from our Hotel Data team led the data engineering efforts to push the model to production. Sidra Michon from the Visual Content Analysis team continuously provided input about the dataset, and Denis Brunelle gave feedback on the quality of the classification. Rafael Baptista from the SEO team offered critical advice on everything from infrastructure and model validation to further improvements. Thanks to Andrew Huang, Atul Joshi, and Jaime Walke for their kind help proofreading this post.

References

[1] Buda, Mateusz, Atsuto Maki, and Maciej A. Mazurowski. “A systematic study of the class imbalance problem in convolutional neural networks.” arXiv preprint arXiv:1710.05381 (2017).

[2] Szegedy, Christian, et al. “Rethinking the inception architecture for computer vision.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[3] Yosinski, Jason, et al. “How transferable are features in deep neural networks?.” Advances in neural information processing systems. 2014.

[4] Ioffe, Sergey, and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift.” arXiv preprint arXiv:1502.03167 (2015).

[5] Perez, Luis, and Jason Wang. “The effectiveness of data augmentation in image classification using deep learning.” arXiv preprint arXiv:1712.04621 (2017).

[6] Kingma, Diederik P., and Jimmy Ba. “Adam: A method for stochastic optimization.” arXiv preprint arXiv:1412.6980 (2014).

[7] Ju, Cheng, Aurélien Bibaut, and Mark van der Laan. “The relative performance of ensemble methods with deep convolutional neural networks for image classification.” Journal of Applied Statistics (2018): 1–19.
