How Casavo Uses Deep Learning to Anonymise Images on Its Real Estate Platform
Introduction
As a real estate company, Casavo faced a significant challenge when it came to sharing images of properties with potential buyers. While sharing images is an essential part of the home-buying process, we needed to ensure that sensitive information, such as the current homeowner’s identity and living address, remained anonymous.
To address this challenge, we decided to leverage the power of deep learning to create a solution that would allow us to share images while keeping sensitive information private. In this article, we will detail the different deep learning technologies we utilised to achieve this goal. By sharing our experience, we hope to inspire other companies facing similar challenges to leverage the power of deep learning to find innovative solutions to complex problems.
Sharing Images with Potential Buyers
With Casavo’s mobile app, sellers can complete a remote visit of their property autonomously by uploading pictures of their home and floor plan. As part of this process, we ask sellers for their permission to share the images and details of their property with potential buyers before listing it on our platform. This allows us to increase the chances of finding the right match between seller and buyer earlier. If the seller agrees, their house pictures, along with other relevant information and the floor-plan, are shared with all interested potential buyers as a preview.
However, sellers often upload images that contain personal information, such as their address, people's faces, or photos displayed in picture frames, without realising the potential consequences of sharing such sensitive data. This can lead to privacy concerns for both the seller and anyone appearing in the images, as well as potentially harming the seller's chances of a successful sale.
Despite these issues, sharing images with potential buyers is an essential part of selling a property, as it allows them to get a better sense of the property’s condition and layout. Therefore, it is crucial to find a way to anonymise sensitive information in these images while still providing a clear and accurate representation of the property. This is where deep learning technologies come in, offering an efficient and effective solution to the problem.
To address this challenge, we utilised a combination of deep learning models to identify and anonymise sensitive regions of property images. In the following sections, we will explore the different models used and how they were combined to achieve the desired results.
Anonymising Images using Deep Learning
In order to anonymise the images uploaded by sellers, we utilised three pre-trained deep learning models that process images to accomplish different tasks. The decision to use pre-trained deep learning models for anonymising images uploaded by sellers was based on two main factors.
Firstly, training deep learning models from scratch can be a time-consuming and resource-intensive process. By utilising pre-trained models, we were able to save significant amounts of time and computing resources that would have been required for training our own models.
Secondly, pre-trained models have already been trained on large datasets, which provides a significant advantage in terms of accuracy and performance. The models we use have learned to identify patterns and features in images that are useful for the specific task they were trained for. This means that we could rely on the pre-trained models to effectively anonymise images without needing to spend additional resources on data collection and annotation.
Detecting Text — One of the models used was CRAFT (Character Region Awareness for Text detection), which identifies text within an image and creates a heat-map that highlights the location of the text as a probability distribution over all the pixels. This model was particularly useful in identifying personal information, such as names and addresses, that may have been present in the floor plans uploaded by sellers.
The heat-map created by CRAFT was passed through a binary threshold to compute a text segmentation mask, which was ultimately used to determine which areas of the image contained text to be anonymised. We decided to use heat-maps with a binary threshold instead of bounding boxes to avoid creating masks that were too wide and risked removing other portions of the image. Each pixel was assigned a score between 0 and 1, representing the probability that it contained text. All pixels whose values were above a specific threshold were then set to 1, while the others were set to 0. Following this approach, we obtained segmentation masks that neatly separate the text from the background.
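The thresholding step described above is simple enough to sketch in a few lines of NumPy. This is an illustrative sketch, not our production code, and the 0.5 threshold is an assumed value (the article only mentions "a specific threshold"):

```python
import numpy as np

def heatmap_to_mask(heatmap: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Binarise a per-pixel text-probability heat-map into a segmentation mask.

    heatmap   : float array with values in [0, 1], one score per pixel
                (the kind of output CRAFT produces).
    threshold : pixels scoring above this value are treated as text.
    Returns a uint8 mask with 1 for text pixels and 0 for background.
    """
    return (heatmap > threshold).astype(np.uint8)

# Example: a 2x3 heat-map where only two pixels score above 0.5.
scores = np.array([[0.1, 0.8, 0.3],
                   [0.6, 0.2, 0.4]])
mask = heatmap_to_mask(scores)
# mask is [[0, 1, 0], [1, 0, 0]]
```

Because the comparison is element-wise, the same function works unchanged on a full-resolution heat-map.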
CRAFT is based on a fully convolutional neural network with a VGG16 backbone; it detects individual character regions and the affinity between characters, which are then grouped into word-level bounding boxes. It is known for its effectiveness in identifying text of various sizes, from large to small, and for its ability to generalise.
Detecting People — The second model that we employed is called YOLO (You Only Look Once). Detecting people and faces in images relies on computer vision models trained on large datasets of annotated images to identify specific objects and features. The latest version of YOLO (v8) has been trained on massive amounts of image data and can accurately detect and classify objects within an image in near real time.
When applied to the task of detecting people within images, YOLO v8 uses a combination of deep neural networks and computer vision techniques to analyse the image and identify the presence of people. We used this information to create a segmentation mask that isolates the person from the image background. Once again, the segmentation mask is essentially a binary image that assigns a value of either 0 or 1 to each pixel in the image, indicating whether it is part of the person or the background.
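A minimal way to picture this step is rasterising the detector's person detections into a binary mask. The sketch below uses hypothetical bounding boxes for clarity; in practice a YOLOv8 segmentation model can return per-instance masks directly, which gives a tighter outline than boxes:

```python
import numpy as np

def boxes_to_mask(shape, boxes):
    """Rasterise person detections into a binary segmentation mask.

    shape : (height, width) of the source image.
    boxes : iterable of (x1, y1, x2, y2) pixel coordinates, e.g. the
            person-class boxes returned by an object detector.
    Returns a uint8 mask: 1 inside any detection, 0 for background.
    """
    mask = np.zeros(shape, dtype=np.uint8)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = 1
    return mask

# Two hypothetical person detections on a 100x100 image.
person_mask = boxes_to_mask((100, 100), [(10, 10, 30, 60), (50, 40, 80, 90)])
```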
Once the masks created by the CRAFT and YOLO models were obtained, we combined them with a logical OR operator to obtain a mask containing all pixels to be anonymised. These sensitive regions could include personal information, faces, or any other elements that may have compromised the seller’s privacy. We then used computer vision techniques, such as convolutions (to dilate the masks) and blurring, to anonymise these regions and ensure that the images were safe to be shared with potential buyers.
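The combine-dilate-blur sequence above can be sketched as follows. This is an illustrative version using SciPy on a single-channel image; the dilation amount and blur strength are assumed values, not our production parameters:

```python
import numpy as np
from scipy import ndimage

def anonymise(image, text_mask, person_mask, dilate_px=5, blur_sigma=8):
    """Blur every pixel flagged by either mask.

    The two masks are combined with a logical OR, dilated so the blur
    extends slightly past the detected region, then used to splice a
    heavily blurred copy of the image back into the original.
    """
    combined = np.logical_or(text_mask, person_mask)
    combined = ndimage.binary_dilation(combined, iterations=dilate_px)
    blurred = ndimage.gaussian_filter(image.astype(float), sigma=blur_sigma)
    return np.where(combined, blurred, image.astype(float))

# Toy example: blur a 5x5 "text" region in a 20x20 grayscale image.
img = np.zeros((20, 20))
img[7, 7] = 255.0  # a single bright "sensitive" pixel inside the region
text_mask = np.zeros((20, 20), dtype=np.uint8)
text_mask[5:10, 5:10] = 1
no_people = np.zeros((20, 20), dtype=np.uint8)
out = anonymise(img, text_mask, no_people)
```

Pixels outside the dilated mask are copied through untouched, so the rest of the property photo stays sharp.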
NSFW Detector — To ensure that only appropriate images were shared with potential buyers, we recognised the need for an additional step in our pipeline: an NSFW (Not Safe For Work) detector. We incorporated this by utilising the detector from the Stable Diffusion V2 model created by Stability AI, which is capable of detecting potentially harmful or sensitive concepts within an image (such as young kids, explicit content and so on). The model is based on CLIP (Contrastive Language-Image Pre-Training) with a projection layer applied to these concepts and associated thresholds to determine whether an image is suitable for sharing with potential buyers.
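Conceptually, this kind of safety check reduces to comparing an image embedding against a set of concept embeddings with per-concept thresholds. The sketch below is a simplified illustration of that idea with made-up two-dimensional embeddings, not the actual Stable Diffusion safety-checker code:

```python
import numpy as np

def is_unsafe(image_emb, concept_embs, thresholds):
    """Flag an image whose embedding is too close to any sensitive concept.

    image_emb    : (d,) image embedding from a CLIP-style encoder.
    concept_embs : (n, d) array, one embedding per sensitive concept.
    thresholds   : (n,) per-concept cosine-similarity cut-offs.
    Returns True if any cosine similarity exceeds its threshold.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    concept_embs = concept_embs / np.linalg.norm(concept_embs, axis=1, keepdims=True)
    sims = concept_embs @ image_emb  # cosine similarity per concept
    return bool(np.any(sims > thresholds))

# Toy 2-D embeddings for two hypothetical sensitive concepts.
concepts = np.array([[1.0, 0.0], [0.0, 1.0]])
cutoffs = np.array([0.9, 0.9])
flagged = is_unsafe(np.array([0.99, 0.1]), concepts, cutoffs)  # close to concept 0
safe = is_unsafe(np.array([0.6, 0.6]), concepts, cutoffs)      # far from both
```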
By incorporating the NSFW detector into our pipeline, we were able to prevent the sharing of inappropriate images with potential buyers. The detector provided an additional layer of protection to ensure that only images that were safe and appropriate for public viewing were shared on our platform, after anonymisation.
Orchestration
Deploying such a service in production posed significant challenges. We chose to use CPU inference, as the volumes and throughput of images did not justify reserving a GPU. However, this resulted in inference times exceeding one minute per image, which ruled out a normal REST interface. Maintaining a TCP connection for that long was undesirable. To address this issue, we leveraged the RabbitMQ message broker, a popular tool in message queuing. By using RabbitMQ, we created consumer and producer queues of images (to be processed and anonymised), enabling us to manage the flow of images efficiently between micro-services in our distributed architecture. As a result, we were able to maintain optimal efficiency without compromising the system’s performance.
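Publishing a processing job to a RabbitMQ queue might look like the sketch below, using the widely used `pika` client. The queue name and message schema here are illustrative assumptions, not our actual internal contract:

```python
import json

def build_job(image_id: str, image_url: str) -> bytes:
    """Serialise an anonymisation job for the queue (schema is illustrative)."""
    return json.dumps({"image_id": image_id, "image_url": image_url}).encode()

def publish_job(channel, payload: bytes, queue: str = "images.to_anonymise") -> None:
    """Publish a job; `channel` is an open channel from a pika connection,
    e.g. pika.BlockingConnection(...).channel()."""
    channel.queue_declare(queue=queue, durable=True)
    channel.basic_publish(exchange="", routing_key=queue, body=payload)

job = build_job("42", "https://example.com/kitchen.jpg")
```

The consumer side mirrors this: the anonymisation service reads jobs from one queue and publishes results to another, so neither side ever holds a long-lived HTTP connection open while inference runs.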
To further optimise our service, we leveraged KEDA, a Kubernetes Event-driven Autoscaling tool. With KEDA, we were able to turn the service on and off as needed. Specifically, KEDA would turn on the service container only when new images were present in the producer queue. It would then turn off the container after a specific amount of time had passed without any images in the queue. This allowed us to conserve resources and reduce costs, as the machine was only active when needed. Using RabbitMQ and KEDA together proved to be an effective solution for managing our image processing service in a scalable and cost-effective manner.
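A KEDA setup along these lines can be expressed as a `ScaledObject` with a RabbitMQ trigger. The resource names and queue name below are illustrative placeholders, not our actual configuration:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: image-anonymiser
spec:
  scaleTargetRef:
    name: image-anonymiser      # Deployment running the inference pipeline
  minReplicaCount: 0            # scale to zero when the queue is empty
  cooldownPeriod: 300           # seconds of inactivity before scaling down
  triggers:
    - type: rabbitmq
      metadata:
        queueName: images.to_anonymise
        mode: QueueLength
        value: "1"              # target of one queued image per replica
      authenticationRef:
        name: rabbitmq-conn     # TriggerAuthentication holding the broker URL
```

With `minReplicaCount: 0`, KEDA deactivates the deployment entirely between batches of uploads, which is what keeps the CPU-only inference affordable.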
Conclusion
In the future, we could enhance this micro-service by integrating generative AI as a form of post-processing. This would allow us to seamlessly replace any content that was identified in the image, as if it had never been there. If you're interested in exploring the possibilities of generative AI in the real estate industry, you might find my previous article on the topic informative and engaging.
In conclusion, Casavo’s use of computer vision and machine learning technologies has enabled us to improve our real estate services and offer a more seamless experience for our clients, enhancing the privacy of individuals in uploaded images and floor plans.
These results reflect our commitment to innovation and our dedication to creating a more transparent and efficient real estate marketplace. If you are passionate about transforming the industry and want to be part of a dynamic team that is leading the way, we invite you to explore our open positions and join us in revolutionising real estate. Together, we can use cutting-edge technology to make buying and selling homes simpler, faster, and more accessible for everyone. 🏠 🚀