How Casavo employs Computer Vision to provide instant feedback to Sellers

Alberto Bellini · Published in Casavo · Jul 5, 2022
Instant feedback on Casavo Visite mobile App

Introduction

In the last few years, with the advent of Deep Learning and the exponential growth of computing power, we have come up with brilliant solutions to a wide variety of problems that affect our daily lives, either directly or indirectly.

At Casavo, we want to make it as efficient as possible to assess the condition of a property by looking at its interiors: window fixtures, doors, and, more generally, all the rooms that make up an apartment.

To carry out the task, our team built an App that allows our sellers to upload images of their property while we gather all the information required to craft a buying offer. For months we had a whole team of Acquisition Specialists looking at the images to understand whether or not they were sufficient to continue the valuation process, and from time to time they had to ask sellers to upload further images because some key details were missing (e.g. no pictures of the bathrooms).

This process was tedious, both for internal teams and for the sellers themselves, who had to repeat the same operations until all the required pictures were eventually uploaded. We knew we could do better.

As of today, we employ Deep Learning paired with Computer Vision techniques to provide instant feedback to our sellers while they upload their pictures, telling them what is missing (e.g. the kitchen) as they carry out the process. In this article we take a deep dive into how we made this possible.

The idea

Eventually we came up with the idea of developing a micro-service to classify images into various categories, such as bathroom, kitchen, bedroom, living room, window, door, and so on.
In turn, we wanted the mobile app to call this service and provide instant feedback to our users as they uploaded pictures.

We started by developing a classification model to distinguish various kinds of rooms and objects by distributing probabilities over a set of labels, some of which are mentioned above.
Following best practices, we decided to take a pre-trained model (ResNet) and fine-tune it on our downstream task of classifying real estate data. Pre-trained architectures are a really good choice when it comes to developing image classification models, since most of the time they already know how to recognise high-level features that would otherwise require a lot of time and data to learn from scratch. We decided to begin our experiments with the smallest architecture available for this type of network, picking the one with 18 layers (ResNet-18), to make sure we had a good compromise between speed and performance.
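As a rough illustration, this kind of fine-tuning can be sketched in PyTorch as below; the label set, data loader and hyperparameters are placeholders, not our production setup.

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical label set: the real taxonomy and training data are internal.
LABELS = ["bathroom", "kitchen", "bedroom", "living_room", "window", "door"]

# Load an ImageNet pre-trained ResNet-18 and replace its classification head
# with a new one sized for our labels (torchvision >= 0.13 weights API).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, len(LABELS))

# Freeze the backbone at first so only the new head is trained.
for name, param in model.named_parameters():
    if not name.startswith("fc."):
        param.requires_grad = False

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

def train_one_epoch(loader):
    """One pass over a standard (images, targets) DataLoader."""
    model.train()
    for images, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
```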

We were not convinced by the initial results

Soon after training the model we understood that it had some limitations: the architecture was not really able to tell the difference between objects and rooms in those cases where both were present in the image, probably because we had decided to combine two different tasks into a single one (Image Classification and Object Detection). For this very reason, given an image of a living room containing an entrance door, the model was uncertain about the actual output (i.e. whether it should be a door or a living room).

After further experiments with other network topologies (VGGNet, InceptionNetV3) we decided to adopt a different strategy.

YOLO is all you need

We decided to develop two different models to carry out different tasks, and later combine their outputs. The goal of the first one was to perform room classification only, while the second one was in charge of object detection. With this approach we would be able to build two specialised models with greater accuracy that could scale better (adding an object class to the detector wouldn’t require training the room classifier again). Hence, we removed the object labels from the previously trained model and set up the environment to create our object detector for windows and doors.

The output of our object detection model on a living room

As of the writing of this article, YOLO is one of the best performing models for object detection, and it is employed in multiple real-time applications. Mesmerised by its performance, we decided to give it a try on our downstream task. This model comes pre-trained on the 80 labels of COCO (Common Objects in Context), which unfortunately include neither window nor door.
We initially thought of carrying out a fine-tuning step to integrate these two labels into the model, because we also wanted to keep all 80 classes for possible later improvements. However, we were worried that a fine-tuning step would damage the detection capabilities of the other labels, and for this reason we followed a different strategy:

  1. We trained an instance of YOLOv5 from scratch, just on these two labels.
  2. We downloaded a pre-trained instance of YOLOv5 to detect the other 80 objects (e.g. sink, oven, fridge, toilet, etc.).
  3. We combined the two models together to detect 82 objects, calling them in sequence with some logic to avoid overlapping bounding boxes and detection ambiguity (see the sketch after this list).
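Below is a simplified sketch of how two such detectors can be chained using the ultralytics/yolov5 hub API; the weights path for the door/window model is hypothetical, and the merging logic is reduced to a plain overlap filter rather than our actual rules.

```python
import torch

# Pre-trained COCO detector plus our hypothetical door/window weights.
coco_model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
custom_model = torch.hub.load("ultralytics/yolov5", "custom", path="door_window_best.pt")

def iou(a, b):
    """Intersection over union of two [xmin, ymin, xmax, ymax] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def box(det):
    return [det["xmin"], det["ymin"], det["xmax"], det["ymax"]]

def detect(image, iou_threshold=0.6):
    """Run both detectors on one image and merge their outputs."""
    detections = custom_model(image).pandas().xyxy[0].to_dict("records")  # doors, windows
    for coco_det in coco_model(image).pandas().xyxy[0].to_dict("records"):
        # Drop COCO boxes that heavily overlap a door/window box, so the
        # same region is not reported twice with conflicting labels.
        if all(iou(box(coco_det), box(d)) <= iou_threshold for d in detections):
            detections.append(coco_det)
    return detections
```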

While we had plenty of annotated data for windows, we weren’t so lucky with doors, and for this reason we had to annotate roughly 1000 samples by hand using tools such as CVAT combined with solutions like FiftyOne. It doesn’t take a ton of annotated data for YOLO to achieve good performance, and we were surprised to learn that with only a couple thousand observations we were able to reach really high accuracy.
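For illustration, an annotation loop of this kind can be wired up with FiftyOne’s CVAT integration roughly as below; the directories, annotation key and class list are placeholders, and the details of our own pipeline differ.

```python
import fiftyone as fo

# Load the raw, unlabelled door samples (placeholder directory).
dataset = fo.Dataset.from_dir(
    dataset_dir="/data/door_samples",
    dataset_type=fo.types.ImageDirectory,
    name="doors-to-annotate",
)

# Push the samples to CVAT for manual bounding-box annotation.
dataset.annotate(
    "doors_run_1",
    backend="cvat",
    label_field="ground_truth",
    label_type="detections",
    classes=["door", "window"],
)

# ...draw boxes in the CVAT UI, then pull the results back into FiftyOne...
dataset.load_annotations("doors_run_1")

# Export in YOLOv5 format, ready for training from scratch on these labels.
dataset.export(
    export_dir="/data/yolo_doors_windows",
    dataset_type=fo.types.YOLOv5Dataset,
    label_field="ground_truth",
)
```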

Let’s put it all together

Eventually we ended up with the two models deployed on our Kubernetes cluster as separate micro-services with their own dedicated HTTP APIs. Kubernetes gave us a lot of flexibility in configuring both horizontal and vertical scaling, and the load balancers ensured production-grade service uptime and reliability. Furthermore, thanks to Grafana we were actively monitoring performance, usage and drift.
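A minimal sketch of what one such HTTP endpoint could look like is shown below; FastAPI is used here purely as an example framework, and the route, response shape and weights path are hypothetical rather than our actual API contract.

```python
import io
import json

import torch
from fastapi import FastAPI, File, UploadFile
from PIL import Image

app = FastAPI()

# Load the detector once at startup and reuse it for every request.
model = torch.hub.load("ultralytics/yolov5", "custom", path="door_window_best.pt")

@app.post("/detections")
async def detect_objects(file: UploadFile = File(...)):
    """Accept an uploaded image and return the detected objects as JSON."""
    image = Image.open(io.BytesIO(await file.read())).convert("RGB")
    results = model(image)
    # to_json avoids numpy types that the default JSON encoder cannot handle.
    records = json.loads(results.pandas().xyxy[0].to_json(orient="records"))
    return {"detections": records}
```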

The goal remained the same: being able to combine the outputs of these two models to provide instant feedback to our users. So, we defined a set of rules to tell, given an image, what it actually represented. For instance, these were some of the rules we employed (a simplified sketch of the logic follows the list):

  • If the object detector finds a “door” which covers a “big-enough” portion of the image, the output will be “door”. Otherwise, if the coverage is not big enough, classify the image using the room classifier.
  • If the room classifier is not very confident that the image is a “bathroom”, look for the “toilet” label in the object detector output. If we find such a label, return “bathroom” as the output.
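In code, this rule logic can be sketched roughly as follows; the thresholds and the exact set of rules here are illustrative, not the ones we run in production.

```python
# Illustrative thresholds: the real values were tuned on our own data.
DOOR_COVERAGE_THRESHOLD = 0.4
BATHROOM_CONFIDENCE_THRESHOLD = 0.6

def classify_image(detections, room_probs, image_area):
    """Combine object detections and room probabilities into one label.

    detections: list of dicts with "name", "xmin", "ymin", "xmax", "ymax"
    room_probs: dict mapping room label -> probability from the classifier
    image_area: width * height of the input image, in pixels
    """
    # Rule 1: a door covering a big-enough portion of the image wins.
    for det in detections:
        if det["name"] == "door":
            box_area = (det["xmax"] - det["xmin"]) * (det["ymax"] - det["ymin"])
            if box_area / image_area >= DOOR_COVERAGE_THRESHOLD:
                return "door"

    room, confidence = max(room_probs.items(), key=lambda kv: kv[1])

    # Rule 2: an uncertain "bathroom" is confirmed by a detected toilet.
    if room == "bathroom" and confidence < BATHROOM_CONFIDENCE_THRESHOLD:
        if any(det["name"] == "toilet" for det in detections):
            return "bathroom"

    return room
```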

The rules were implemented on the backend of the application, to avoid baking this logic into the models themselves, which could eventually be used for other purposes, both internally and externally at Casavo.

At this point, the mobile App was finally able to give instant feedback on the fly as the user uploaded pictures during our data gathering step, resulting in a big leap forward both in terms of user experience and product development.

Conclusions

Machine Learning can really be a game changer if applied correctly to a given problem, and as of today we don’t always have to reinvent the wheel to achieve the desired outcome; there are plenty of tools and models that are really plug and play. Of course some tasks require tuning and expertise, but companies shouldn’t be scared to integrate AI into their workflows and should definitely embrace the challenge and start using this technology, today.

However, keep in mind that the key to success is not just hiring an expert team of Data Scientists and Machine Learning Engineers; there also needs to be someone, like a product manager, who has vision and can drive the development of this technology to improve the product itself. Furthermore, it is of key importance to have people working on the user experience, the design, the frontend, the backend, BI and, of course, infrastructure. For us it has always been a synergy of people working together to achieve the same goal.

We are really proud of what we have achieved so far, and look forward to the upcoming challenges!

Join us 🚀

Our Product & Tech team is continuously growing!
If you like our business and are interested in knowing more about us, feel free to check our open positions →. We’re always looking for awesome talent!
