Utilizing Deep Learning for Detecting Hotel Scenes

Dhata Karuni Mutia Masyita
Published in tiket.com
9 min read · Apr 25, 2022

Don’t judge a book by its cover

You must have heard this line often, but it is not entirely true. We often judge things by the first thing we see, including when choosing a place to stay. At tiket.com, we want to give users the best experience by letting them easily see what a hotel looks like. That means the main image should be a building or bedroom photo rather than, say, a bathroom photo.

Images displayed on the hotel detail page. The hotel building is highlighted as the main image.

Given the example above, the hotel building is set as the main image for that hotel's detail page. These images are placed at the top of the hotel detail page on tiket.com. Clicking one of them opens a gallery view with a caption relevant to the focused image.

Therefore, we need to classify our images into several relevant categories: exterior, bedroom, bathroom, etc. This task was previously handled manually by the content team. As the number of hotels (and the number of images to be curated per hotel) grows rapidly, this laborious task needs to be performed quickly and automatically. In total, there are 23 categories to classify.

In this case, each image is tagged with exactly one label, so we can naturally formulate this data science problem as a multiclass classification problem.

Data Source

The images are sorted into folders named after their labels. A folder can contain images with different file extensions, e.g. .jpg, .jpeg, .png. In addition, there is a CSV file listing additional image URLs with their labels. We provide a preprocessing script to simplify the training import pipeline (a sketch follows below). For this exploratory experiment, the dataset is highly imbalanced. In total, there are 86K images with the following distribution. In the next section, we will cover how these 86K images are split into shards for further data processing.

Example of label distribution
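A minimal sketch of such a preprocessing step, assuming a data/&lt;label&gt;/&lt;image&gt; folder layout and a CSV with url and label columns (both assumptions; the post does not describe the exact structure):

```python
import glob
import os

import pandas as pd

# Hypothetical layout: data/<label>/<image>; the folder name serves as the label.
records = []
for path in glob.glob("data/*/*"):
    if path.lower().endswith((".jpg", ".jpeg", ".png")):
        label = os.path.basename(os.path.dirname(path))
        records.append((path, label))

# The CSV filename and its columns ("url", "label") are assumed for illustration.
extra = pd.read_csv("data/extra_image_urls.csv")
records.extend(zip(extra["url"], extra["label"]))
```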

TFRecord

TFRecord is a custom data format designed by TensorFlow for storing a sequence of binary records. Converting your data into a binary format can significantly improve the import pipeline's performance and thereby reduce model training time. It also requires less space on disk and takes less time to read from disk. TFRecord stores data sequentially, which enables fast streaming thanks to low access times. Another benefit is that TFRecord is natively integrated into TensorFlow's tf.data API, which makes batching, shuffling, and caching easy to perform.

In the image processing context, not only the raw image's byte string but also its metadata can be stored in a TFRecord. We need to specify the structure of the data before writing it to file. In this case, we store the features, i.e. the raw image bytes, the image label, and the image size (height, width, and depth), in a tf.train.Example structure, then serialize it and use a tf.io.TFRecordWriter to write it to disk.
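A minimal sketch of this serialization, assuming feature names such as image_raw and label (the actual schema is not given in the post):

```python
import tensorflow as tf

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def serialize_image(image_bytes, label, height, width, depth):
    # Pack the raw image bytes and metadata into a tf.train.Example.
    features = tf.train.Features(feature={
        "image_raw": _bytes_feature(image_bytes),
        "label": _int64_feature(label),
        "height": _int64_feature(height),
        "width": _int64_feature(width),
        "depth": _int64_feature(depth),
    })
    return tf.train.Example(features=features).SerializeToString()

# Write a couple of (hypothetical) labeled images into one shard.
with tf.io.TFRecordWriter("shard-0000.tfrecord") as writer:
    for path, label in [("exterior_01.jpg", 0), ("bedroom_07.jpg", 1)]:
        image_bytes = tf.io.read_file(path).numpy()
        h, w, d = tf.io.decode_image(image_bytes, expand_animations=False).shape
        writer.write(serialize_image(image_bytes, label, h, w, d))
```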

While training an image classification model, we repeatedly run our training samples through the training process, and the order should be completely random in each epoch. If we stored all images in one large TFRecord file, the shuffling process would not fit into memory. Hence, we split the dataset into multiple TFRecord files (called shards). During each epoch, we shuffle the shard filenames to obtain global shuffling and use a shuffle buffer to obtain local shuffling.
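With tf.data, the two levels of shuffling could look like this; the file pattern, buffer size, and batch size are illustrative:

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

feature_spec = {
    "image_raw": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    # Decode one tf.train.Example back into an (image, label) pair.
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    image = tf.io.decode_jpeg(parsed["image_raw"], channels=3)  # assumes JPEG bytes
    image = tf.image.resize(image, [224, 224])
    return image, parsed["label"]

# Global shuffling: reshuffle the shard filenames every epoch,
# then interleave reads across several shards at once.
files = tf.data.Dataset.list_files("shards/train-*.tfrecord", shuffle=True)
dataset = files.interleave(
    tf.data.TFRecordDataset, cycle_length=4, num_parallel_calls=AUTOTUNE
)

# Local shuffling: a buffer far smaller than the dataset still mixes
# records well because the shard order is already random.
dataset = (
    dataset.shuffle(buffer_size=2048)
    .map(parse_example, num_parallel_calls=AUTOTUNE)
    .batch(32)
    .prefetch(AUTOTUNE)
)
```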

Ideally, one TFRecord shard should be about 100 MB in size; in our case, that is around 250 images per shard. For training purposes, the TFRecord files are split into training and validation datasets. It is therefore extremely important to perform stratified sampling over the image classes when writing images into shards, so that each shard has a similar class distribution.

From the 86K records in total, 344 shards are generated. To ensure each shard has a similar distribution, we divide the records of each class equally among the shards. The remaining images are then distributed across some of the shards, so those shards contain exactly one extra record of that class.
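A sketch of this stratified assignment; the record representation is assumed to be (image, label) pairs:

```python
import collections

NUM_SHARDS = 344

def assign_to_shards(records):
    """Distribute the (image, label) records of each class round-robin
    across shards, so every shard gets an almost identical class
    distribution (at most one extra record per class)."""
    by_class = collections.defaultdict(list)
    for image, label in records:
        by_class[label].append((image, label))

    shards = [[] for _ in range(NUM_SHARDS)]
    for items in by_class.values():
        for i, record in enumerate(items):
            shards[i % NUM_SHARDS].append(record)
    return shards
```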

In our case, converting the raw image data into TFRecord files reduced the training time per epoch by 59%.

EfficientNet

Deep learning has emerged as the dominant approach to image problems because it has proven to reduce the feature engineering effort on image data. In computer vision, Convolutional Neural Networks (ConvNets) are commonly used. ConvNets are often developed at a fixed resource budget and then scaled up for better accuracy.

Previously, this was done by scaling up one of the network dimensions (i.e. depth, width, or resolution). These dimensions are described as follows.

  • Depth: the number of layers in a network
  • Width: the number of filters in a convolutional layer
  • Resolution: the height and width of the input image
Model Scaling. (a) is a baseline network example; (b)-(d) are conventional scaling methods that increase only one dimension of network width, depth, or resolution; (e) is the compound scaling method of [1] that uniformly scales all three dimensions with a fixed ratio.

In 2019, Google introduced EfficientNet, a family of models consisting of several architectures scaled up from the baseline architecture, EfficientNet-B0. It grows the model size with the compound scaling method, using a single compound coefficient to uniformly scale network width, depth, and resolution for better accuracy.

The intuition behind the compound scaling method is that if the size of the input image is bigger, the network needs more layers to increase the receptive field and more channels to capture more fine-grained patterns.
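Concretely, [1] ties all three dimensions to a single compound coefficient φ:

```latex
\text{depth: } d = \alpha^{\phi}, \qquad
\text{width: } w = \beta^{\phi}, \qquad
\text{resolution: } r = \gamma^{\phi},
\qquad \text{s.t. } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,
\quad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1
```

A small grid search on the baseline network fixes α = 1.2, β = 1.1, and γ = 1.15, so scaling up with a larger φ multiplies the FLOPS by roughly 2^φ.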

The effectiveness of model scaling heavily depends on the baseline network. Instead of scaling up an existing network (e.g. MobileNets or ResNet), a new architecture named EfficientNet-B0 was developed using Neural Architecture Search.

EfficientNet models produce better accuracy with fewer parameters [1]

There are 8 model architectures in the EfficientNet family (B0 to B7). The higher the variant number, the better the accuracy we can achieve, at the cost of more resources. For this experiment, we start with the baseline model and increase the variant as long as it still fits our resource capacity.

We should note that each EfficientNet architecture requires a different input image size, as listed below.

EfficientNet input image sizes [3]: B0 (224), B1 (240), B2 (260), B3 (300), B4 (380), B5 (456), B6 (528), B7 (600).

Transfer Learning

Transfer learning for EfficientNet can significantly save both the time and the computational power required for model training, and it helps achieve higher accuracy as well. Instead of random initialization, we initialize the network with pre-trained weights. In this experiment, we use pre-trained weights from ImageNet and from Noisy Student.
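A rough sketch of this setup in Keras; the head layers and dropout rate are illustrative, not the exact architecture used in the experiments. Note that Keras only ships ImageNet weights for EfficientNet, so Noisy Student checkpoints would need to be loaded from an external source.

```python
import tensorflow as tf

NUM_CLASSES = 23  # hotel scene categories

# Backbone initialized with ImageNet pre-trained weights, classifier head dropped.
base = tf.keras.applications.EfficientNetB0(
    include_top=False,
    weights="imagenet",
    input_shape=(224, 224, 3),  # B0 expects 224x224 inputs
)

# New classification head for our 23 scene categories.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
```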

Noisy Student pre-trained weights are produced by self-training a model on the EfficientNet architecture [2]. The model is first trained on the labeled ImageNet dataset and then used as a teacher model to generate pseudo labels for unlabeled images. Subsequently, a larger student model is trained on the combination of labeled and pseudo-labeled images. This process is repeated iteratively, with the student model becoming the teacher in the next round. During student training, noise is added (e.g. dropout, stochastic depth, and data augmentation) so that the student generalizes better than the teacher.

Fine-Tuning

The fine-tuning process is conducted in two steps. In the first step, only the newly added top (classification) layer is trained while the backbone stays frozen; in the second step, some of the top layers of the backbone are unfrozen and trained further, as sketched below.
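Continuing the sketch from the Transfer Learning section (base and model are defined there), the two steps could look as follows. The unfrozen-layer count and learning rates are illustrative, and keeping BatchNormalization layers frozen is a common Keras practice rather than something stated in the post.

```python
# Step 1: train only the new classification head on a frozen backbone.
base.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=...)

# Step 2: unfreeze the top N backbone layers and retrain with a much
# lower learning rate; N is one of the tuned hyperparameters.
N_UNFROZEN = 20  # illustrative value
base.trainable = True
for layer in base.layers[:-N_UNFROZEN]:
    layer.trainable = False
for layer in base.layers:
    # Keep BatchNorm layers frozen so their statistics stay stable.
    if isinstance(layer, tf.keras.layers.BatchNormalization):
        layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=...)
```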

K-Fold Cross-Validation

We use the k-fold cross-validation technique to find the best hyperparameters. First, the data are split into train-validation (90%) and test (10%) sets. During the fine-tuning process, the train-validation set is randomly partitioned into k equal folds. In each iteration, one fold is left out of the training process and becomes the validation set used to evaluate model performance. The iterations are repeated k times, until each fold has been used exactly once as the validation set.
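A minimal sketch with scikit-learn's StratifiedKFold, assuming the (image, label) pairs are available as Python lists; the fold count k is not stated in the post, so 5 is used for illustration:

```python
from sklearn.model_selection import StratifiedKFold

# Illustrative data; in practice the pairs come from the TFRecord shards.
image_paths = [f"img_{i:04d}.jpg" for i in range(230)]
labels = [i % 23 for i in range(230)]  # 10 samples per scene class

K = 5  # illustrative; the post does not state k
skf = StratifiedKFold(n_splits=K, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(image_paths, labels)):
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} validation")
    # train on train_idx, evaluate on val_idx, then average the fold scores
```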

Hyperparameters

  • Base model: EfficientNet-B0, .., EfficientNet-B7
  • Augmentation layers (randomly crop, rotate, and translate the image at the start of the network; see the sketch after this list)
  • Batch size
  • Learning rate
  • Number of unfrozen layers
  • Epochs
  • Pre-trained weights: ImageNet, Noisy Student
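As referenced in the list above, the augmentation layers can be built with Keras preprocessing layers (available under tf.keras.layers in recent TensorFlow versions); the factors below are illustrative, not the values used in the experiments:

```python
import tensorflow as tf

# Hypothetical augmentation block with the operations listed above.
augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomCrop(224, 224),         # randomly crop
    tf.keras.layers.RandomRotation(0.1),          # rotate by up to ±10% of a full turn
    tf.keras.layers.RandomTranslation(0.1, 0.1),  # shift by up to 10% on each axis
], name="augmentation")

# Placed at the beginning of the model, these layers are active only during training.
```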

Experimentation

Our best model achieved 80.2% top-1 accuracy. Due to resource limitations, we only experimented up to EfficientNet-B2.

Best model

Here are the key takeaways from the experiment:

  • Models with a higher variant number produce better accuracy.
  • Turning off the augmentation layers improves accuracy.
  • Changing the batch size does not yield a significant improvement.
  • If the learning rate is too small, the model might not converge even after increasing the number of epochs, and it can get stuck at a suboptimal solution.
  • Noisy Student pre-trained weights perform better than ImageNet pre-trained weights.
  • Unfreezing too few layers can lead to worse accuracy, while unfreezing too many does not significantly improve it.

Score Threshold

The EfficientNet output is the probability of an image belonging to each class. An image might clearly belong to one class, with a probability score above 0.9 for that class. Occasionally, however, the probability mass is spread almost evenly across several classes, producing low scores. A low probability score suggests that the image resembles nothing seen in the training set, or that it contains more than one label.

Usually, we predict the class with the highest probability as the label. To avoid mislabeling, we set a certain threshold. If the probability score of the top predicted class passes the threshold, the label is set to that class; otherwise, the image is labeled as other. We choose the threshold that maximizes the F1 score of each class, so each class may have a different threshold. Using this approach, we predict about 5% of the data as other. When these images are discarded, the accuracy on the remaining data reaches 83.2%.
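A minimal sketch of this decision rule; the threshold values are placeholders, since in practice each one is tuned to maximize its class's F1 score on validation data:

```python
import numpy as np

NUM_CLASSES = 23
OTHER = -1  # sentinel for the "other" label

# Hypothetical per-class thresholds; each would be tuned per class.
thresholds = np.full(NUM_CLASSES, 0.5)

def predict_label(probs: np.ndarray) -> int:
    """probs: softmax output of shape (NUM_CLASSES,) for one image."""
    top = int(np.argmax(probs))
    return top if probs[top] >= thresholds[top] else OTHER
```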

Challenges

Noisy dataset

The persistent challenge we faced in solving this problem was a very noisy dataset. In other words, our dataset contains many mislabeled images! This problem is quite prevalent in data science. We need dedicated data labelers and clear labeling instructions to obtain a clean dataset. Nevertheless, the EfficientNet model we built is still able to generalize well.

Multiple labels in an image

Some images contain two or more scenes. Since we accept only one label, the predicted label can differ from the actual label. However, the predicted label is often still acceptable under quality checking (i.e. manual assessment by human eyes).

Multiple interpretations in an image

Some images are also confusing due to similar properties. For example, both the living room and the common area have sofas; as humans, we can easily differentiate the two if we know whether the location is inside or outside a bedroom. In addition, several hotels do not have separate rooms for the restaurant and the common area, so these scenes are sometimes misinterpreted.

Potential Improvements

  • Apply augmentation only to the minority classes to increase their samples.
  • Revisit the training data by building a teacher model on a small golden dataset and using it to provide pseudo labels for the rest of the data.

References

[1] Tan, Mingxing, and Quoc V. Le. "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks." International Conference on Machine Learning. PMLR, 2019.

[2] Xie, Qizhe, et al. "Self-Training with Noisy Student Improves ImageNet Classification." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

[3] https://keras.io/examples/vision/image_classification_efficientnet_fine_tuning/
