CriticNet: Scaling the business with Deep Learning

Manos Loukadakis
Published in Plum Guide
9 min read · Jul 12, 2021


Introduction

Plum Guide is not just a booking platform. Plum Guide is the benchmark for the best homes in the world. Our Mission is to become the new global quality standard for vacation homes. Just 3% of the homes in each destination win the Plum Award.

In the last 18 months, we have scaled our operations by 10x, from originally having homes in 12 large cities to now having homes in over 100 destinations.

The story I am going to tell you here is about how a Deep Learning model helped us scale our acquisition efforts in only one year, despite having a smaller acquisition team.

Figure 1: Desert Ace in Palm Springs

Curating Homes — The old way

Plum Guide was founded back in 2015 and since then we have been continually adding the best vacation homes to our platform for guests to book. Before adding a home to our platform, there are two types of tests that the home needs to pass.

Firstly, the Acquisition team evaluates each home by reviewing all of its photos. The Acquisition team is strict, and I mean very strict :) meaning there are numerous criteria that the home must pass.

If the team sees potential, we then arrange for a Home Critic to visit the home, in order to ensure the home meets the Plum Guide standards. Sounds pretty demanding, huh?

At Plum Guide, we collect data from various short-term rental platforms to find hidden gems among the millions of homes that are out there. Previously, the Acquisition team would go through each home’s photos manually to decide whether the home was up to the Plum Guide standards or not.

For four years this wasn’t a problem, because large-scale expansion wasn’t a major part of our business. During the pandemic, we re-evaluated our strategy and took the opportunity to significantly grow our acquisition. Based on prior metrics, this would have translated to reviewing ~5.5m images in order to reach the goals we desired. That would have been an overwhelming task, to say the least.

Figure 2: Comparison between a Plum Guide home and a low-quality home

Deep Learning for the win

The model we have built focuses on the first step of the testing, filtering out homes that do not meet the Plum standards in order to reduce the manual workload.

From a business point of view, we wanted to accelerate the acquisition process and, at the same time, reduce the cost of finding these homes. When the task came up, I saw two possible solutions: either a person with superpowers grades the homes (this never really crossed my mind), or a machine does the job for us. We followed the second approach :)

CriticNet

Let’s start talking about the nerdy stuff now. Teaching a machine to be a home critic is a very hard problem; it’s tough enough for humans as it is. Before digging into the technicalities, I always like to think about the final product.

We want to build a system that takes four indoor images per home (living room, kitchen, bedroom and bathroom) as input and “decides” whether the home meets the Plum Guide standards or not (see Figure 3 below). As you can see, the ML task we need to solve here is a binary classification problem.

Figure 3: CriticNet as a black-box

Dataset

The dataset has been compiled by the Acquisition team and represents five years of continuous work. There are two classes in the dataset: a home either meets the Plum Guide standards or it does not. The distribution of the classes is shown in the graph below. The dataset is extremely imbalanced: only ~6% of the data are in class 1 (Plum Homes), while the remaining ~94% don’t meet the Plum standards, and this is an issue for our classification task.

Figure 4: Distribution of the data

The techniques we applied to tackle the imbalanced dataset were data augmentation and undersampling of the majority class. For data augmentation, we created new images of homes in the minority class by transforming/processing them: we simply flipped, zoomed into, rotated, and changed the brightness of the images.

Data augmentation also helped us avoid overfitting and reduce the bias towards homes with high-quality photography. There might be homes whose photographs are low quality but which still meet our standards.
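To make this concrete, here is a minimal sketch of this kind of augmentation pipeline, written with Keras preprocessing layers. The specific layers, parameter values and oversampling factor are illustrative assumptions, not the exact configuration we used.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative augmentation pipeline: random flips, small rotations,
# zooms and brightness changes, applied to minority-class (Plum) images.
augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.05),   # rotate by up to ~18 degrees
    layers.RandomZoom(0.1),
    layers.RandomBrightness(0.2),  # requires TF >= 2.9
])

def augment_minority(minority_ds: tf.data.Dataset, copies: int = 3) -> tf.data.Dataset:
    """Create extra augmented copies of minority-class images.

    `minority_ds` is assumed to yield (image, label) pairs; the majority
    class would be undersampled separately.
    """
    return minority_ds.repeat(copies).map(
        lambda image, label: (augment(image, training=True), label),
        num_parallel_calls=tf.data.AUTOTUNE,
    )
```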

Modeling

After preparing the dataset, it is time to work on the actual modeling. First of all, we need to come up with a Deep Learning architecture that receives 4 images as input and outputs the classification of the home given those images.

There are lots of models out there used for computer vision. As you can see in the image below, EfficientNet-B7 achieves the highest accuracy, and it does so with almost half the parameters of the next most accurate model, AmoebaNet-C. Based on these criteria, we picked EfficientNet-B7 as our base model.

Figure 5: Comparison of CNN models

The multi-input CriticNet model consists of four EfficientNet branches (each branch processes one image), and at the end we concatenate the fully connected layers of each branch. We then apply a logistic regression head, which produces a score between 0 and 1 expressing the probability that a given home is Plum.

The overall architecture is shown in the diagram below. Now that we have come up with the model it is time to train it.

Figure 6: CriticNet architecture
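For readers who prefer code to diagrams, a minimal Keras sketch of a multi-branch model along these lines is shown below. The input resolution, the width of the per-branch dense layer, and the use of a single shared backbone (rather than four independent copies) are assumptions made for brevity; Figure 6 above remains the reference.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import EfficientNetB7

IMG_SIZE = (600, 600)  # EfficientNet-B7's native resolution (assumed here)

# EfficientNet-B7 backbone pre-trained on ImageNet, shared across the four
# branches in this sketch; four separate backbones would also fit Figure 6.
backbone = EfficientNetB7(include_top=False, weights="imagenet", pooling="avg")

inputs, branch_features = [], []
for room in ["living_room", "kitchen", "bedroom", "bathroom"]:
    image = layers.Input(shape=IMG_SIZE + (3,), name=room)
    # Per-branch fully connected layer (width assumed).
    features = layers.Dense(256, activation="relu")(backbone(image))
    inputs.append(image)
    branch_features.append(features)

# Concatenate the per-room features and apply a logistic-regression head that
# outputs the probability that the home meets the Plum standards.
x = layers.Concatenate()(branch_features)
plum_probability = layers.Dense(1, activation="sigmoid", name="plum_probability")(x)

critic_net = Model(inputs=inputs, outputs=plum_probability)
critic_net.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
)
```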

Training the model

The main challenge with deep learning models is that they need a lot of data (ideally millions of examples) and lots of computing resources. However, to save money and time, we applied transfer learning and fine-tuned the network’s weights on our dataset. The EfficientNet model we used was pre-trained on the ImageNet dataset.

In transfer learning, we freeze the weights of the pre-trained network’s convolutional layers and train from scratch only the weights of the logistic regression head at the very end. Then we fine-tune the network by training all of its weights for a few epochs with a low learning rate.

Fine-tuning stages

We trained the model in two stages:

  1. We start by training only the last dense layer (the logistic regression head) with a higher learning rate. This ensures that the newly added, randomly initialised weights adjust to the ImageNet convolutional weights. This burn-in period mitigates the risk of disrupting the convolutional weights at the start of training and consequently slowing the training down.
  2. Then, we train all the weights in the CNN with a low learning rate, as figure 7 shows.
Figure 7: Training and validation cross-entropy loss
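In code, the two stages could look roughly like the snippet below, reusing the hypothetical critic_net and backbone from the architecture sketch above. The learning rates and epoch counts are illustrative, and train_ds / val_ds stand in for our actual tf.data pipelines of (four images, label) pairs.

```python
from tensorflow.keras.optimizers import Adam

# Stage 1 (burn-in): freeze the pre-trained convolutional backbone and train
# only the newly added dense/logistic-regression head with a higher learning rate.
backbone.trainable = False
critic_net.compile(optimizer=Adam(learning_rate=1e-3), loss="binary_crossentropy")
critic_net.fit(train_ds, validation_data=val_ds, epochs=5)

# Stage 2 (fine-tuning): unfreeze all weights and train the whole network
# for a few epochs with a much lower learning rate.
backbone.trainable = True
critic_net.compile(optimizer=Adam(learning_rate=1e-5), loss="binary_crossentropy")
critic_net.fit(train_ds, validation_data=val_ds, epochs=3)
```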

Evaluation

Offline Metrics

Looking at the validation loss alone is not enough to evaluate a classification task; we should evaluate the model on more metrics to get a clearer picture. The model achieved an F1 score of 78.43%, a precision of 66.67%, and a recall of 95.24%. Let us see what these results mean from a business point of view.

Precision is simply how accurate the model is when it predicts positive labels. So let us say that the model predicts 100 Plum homes but only 66 homes are truly Plum. This gives us a 66% precision score.

Recall indicates the accuracy of the model in the truly positive labels. Imagine in our dataset we have 100 Plum Homes and the model predicts 95 homes correctly and misses 5 homes which are predicted as low-quality homes (false negatives). This gives us a recall score of 95%.
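If you want to sanity-check these definitions yourself, here is a toy computation with scikit-learn; the labels are made up for illustration and are not our data.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy example: 10 homes, 1 = Plum, 0 = not Plum (illustrative labels only).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 0, 1, 0, 1, 0]

# precision = TP / (TP + FP): of the homes predicted Plum, how many really are.
# recall    = TP / (TP + FN): of the truly Plum homes, how many we caught.
# F1 is the harmonic mean of precision and recall.
print(precision_score(y_true, y_pred))  # 4/6 ≈ 0.67
print(recall_score(y_true, y_pred))     # 4/4 = 1.00
print(f1_score(y_true, y_pred))         # 0.80
```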

The low precision score is mainly because the Acquisition team is very strict, and even a good home may not meet the Plum standards. As a result, the team still has to manually check the homes classified as Plum; however, the number of homes that need to be graded manually is roughly 60% smaller.

Figure 8, below, helps you realize how difficult the task can be. For simplicity, we present only the living room images but the score is derived from all the main room images of each home. Home 1 clearly doesn’t meet the Plum standards (you don’t have to be a home critic to know that) and the model predicts correctly by giving a score of 0.13 which is quite low. Easy!

There are cases where the distinction is not that obvious, as with Home 3. Even though it is marked as low quality, it is not clear why this home is not Plum. This happens a lot, and that is why the precision is low. However, the model achieves extremely good recall, which means we don’t miss many high-quality homes (only ~5%).

Figure 8: CriticNet results

Business Metrics

From a business point of view, precision affects the cost of grading. The lower the precision, the more false positives we get so the more homes for the team to grade manually.

On the other hand, recall helps us evaluate how well the model predicts Plum Homes, which is very important for the business. Recall gives us an estimate of how many homes we might be missing during classification.

Apart from the evaluation during the training process, the most important metrics are the business metrics. Stakeholders don’t care about accuracies and F1 scores; they care about how the model helps achieve business goals and how its impact can be quantified with business metrics.

Since rolling out the model, the acquisition process has accelerated by 30%, the cost of manual grading has been reduced by 60%, and the number of high-quality homes per batch has increased by ~50%-70%.

The Acquisition team no longer checks homes that CriticNet classifies as low quality. However, due to CriticNet’s low precision, the team still has to manually check the homes that are classified as Plum.

Conclusion

In this article, I briefly explained how we optimized our home acquisition process by building a Deep Neural Network, gave you an idea of the model’s architecture, and described how we trained and evaluated it.

However, during this project we faced lots of challenges related to the data, which is far from clean. Unfortunately, for every home there are not just four images, one per room, but on average around 40 in total. Some of them show outdoor spaces and, even worse, most of them don’t depict the room clearly, so they have no grading power.

In the next article, I will explain how we managed to overcome these challenges and built a system that gets raw images from homes, classifies them into rooms, and picks the best images to feed to CriticNet.

If you like our work and are interested in solving challenging problems, we are hiring engineers to work with the data science team on rebuilding the search experience. You can apply here!
