The Tech Story behind Magic Eraser

Alice Lucas
Meero Product & Engineering
10 min read · Apr 5, 2024

At Meero, we leverage AI to create beautiful pictures that increase engagement and conversion rates for businesses. One of our products, ProperShot, is specifically designed for Real Estate photography. I was recently given the opportunity to present one of ProperShot’s most popular features, Magic Eraser (accessible on propershot.com), at the Computer Vision Meetup of Paris.

Meero presented at the Computer Vision Meetup of Paris on March 18th, 2024.

In this article, I will transcribe my presentation and delve into the technical details of Magic Eraser. Whether you’re a tech enthusiast or simply curious about the inner workings of complex image-based features, I hope you’ll find this article informative and engaging.

Meero and AI-based enhancement

Before diving into the technical part, let’s pause and define who we are today. Meero operates across three distinct verticals: Fashion with our AutoRetouch product, Real Estate with ProperShot, and Cars with CarCutter. While the product offerings within each vertical differ to meet unique business needs, they all share a common objective: enhancing the visual appeal of images through a set of AI-based tools, making them more marketable for our diverse clientele.

With ProperShot, we offer a suite of tools designed for Real Estate agents who take property photos with their phone cameras. At the core of ProperShot lies a set of Computer-Vision-based enhancement algorithms that are automatically applied to all incoming photos in our pipeline. These features, which include HDR, color enhancement, vertical corrections, and sky replacement, are designed to improve the overall attractiveness of the images.

All incoming photos in ProperShot go through an automatic enhancement pipeline.

While our enhancement tools can significantly improve the visual appeal of photos, they cannot solve all issues that may arise in our real estate users’ images. For instance, cluttered or messy rooms, as well as rooms with unique but unpopular styles, such as an outdated grandmother’s living room, may still appear unattractive despite best enhancement efforts.

Making editing tools available to Real Estate agents

To complement our enhancement tools, we have recently shifted our focus towards developing editing tools that are accessible via the ProperShot web platform. These tools enable Real Estate agents to modify the content of their images, making them more marketable to potential buyers. One of our features is Home Staging, which allows agents to transform the style of a room into a more popular aesthetic, such as a clean, modern look, without altering the inherent structure of the room. However, in this article, we will be discussing a different but equally powerful editing feature: Magic Eraser.

Left: Home Staging. Right: Magic Eraser. Both features are accessible on ProperShot’s web platform.

Accessible via the web app, Magic Eraser empowers our users to remove any unwanted objects from their enhanced images. This feature is particularly useful for de-personalizing spaces, such as removing soaps and towels from bathrooms or dirty clothes from laundry rooms. By providing Real Estate agents with greater control over the content of their images, we aim to help them create convincing visuals and effective listings that help potential buyers project themselves into the space.

Leveraging a state-of-the-art GAN model for fast deployment

Despite its mysterious-sounding name, “Magic Eraser” actually corresponds to a fairly classic inpainting task in Computer Vision. When we first began exploring how to implement Magic Eraser, we considered two families of models appropriate for the inpainting task: GAN-based models and diffusion-based models. GANs offered the advantage of a single forward pass at inference and the availability of many state-of-the-art off-the-shelf models. However, we knew that GANs could sometimes struggle with realism in more complex cases. On the other hand, diffusion-based models, when trained properly, could deliver very impressive realism.

When comparing GAN-based models with diffusion-based models, we noticed a problematic behavior of diffusion models that made them unsuitable for our application. When we tried Stable Diffusion XL, for instance, we observed that it generated new objects instead of simply removing the original ones. This was an outcome we did not want to expose our Real Estate agents to: it would have been frustrating for users to attempt to remove objects, only to have new objects appear in their place. Given this level of risk, and since our objective was to deploy a working solution as quickly as possible, we decided to set aside diffusion models and focus on GAN models for Magic Eraser.

The objects highlighted in yellow are those that should be removed. GAN-based results remove objects but may produce unrealistic results. Diffusion results are realistic, but may generate new objects.

LaMa: state-of-the-art GAN for inpainting

During our research phase, we found that the “Resolution-robust Large Mask Inpainting with Fourier Convolutions” (LaMa) model was the state of the art for the inpainting task. LaMa offered several advantages for our specific use case. Firstly, it was trained on the Places dataset, a large-scale scene-recognition dataset rich in indoor imagery, which aligned well with our Real Estate needs. Secondly, it was commercially usable, which was important for our business requirements. Lastly, the LaMa repository was well maintained and contained many helpful contributions that we could leverage, including a “refinement” feature which we will discuss in more detail below.

Our initial experiments revealed that LaMa performed well when dealing with small mask sizes and random objects throughout the image. In such cases, the reconstructed texture was often realistic and of acceptable quality. However, when we attempted to remove larger objects, LaMa struggled and produced blurry, unrealistic textures. This was a significant problem, as it negatively impacted the quality of our images. Given the importance of image quality to our clients, we knew that we couldn’t settle for subpar results.
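For readers curious what a single-pass GAN inpainting call looks like in practice, here is a minimal sketch, assuming a TorchScript export of the generator (the file name and the wrapper function are illustrative, not our production code). LaMa-style models take the masked image concatenated with the binary mask as a 4-channel input:

```python
import torch

# Illustrative: a TorchScript export of a LaMa-style generator.
model = torch.jit.load("lama_generator.pt").eval()

@torch.no_grad()
def inpaint(image: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """image: (1, 3, H, W) in [0, 1]; mask: (1, 1, H, W), 1 = remove."""
    masked = image * (1 - mask)                    # zero out the region to erase
    out = model(torch.cat([masked, mask], dim=1))  # 4-channel input
    # Keep original pixels outside the mask, generated pixels inside.
    return image * (1 - mask) + out * mask
```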

“Partial refinement” for artifact reduction

Fortunately, researchers from GeoMagical Labs had made similar observations and proposed a solution. They noted that the LaMa model had been trained on image sizes much smaller than the mask/image sizes we were using in our tests, and that inpainting a smaller-scale version of the image produces fewer artifacts than inpainting the original high-resolution version directly. To exploit this, they proposed a refinement procedure: the inpainting result at a smaller scale of the same image serves as the ground-truth signal guiding a fine-tuning process that “overfits” the parameters on that one image. Repeating this process iteratively across increasing scales fine-tunes the model up to the original high-resolution image, resulting in fewer artifacts in the final output.
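To make the procedure concrete, here is a hedged sketch of the multiscale refinement loop, assuming a generic `model(image, mask)` callable; the losses, schedules, and scales in the actual LaMa repository contribution differ in the details:

```python
import torch
import torch.nn.functional as F

def refined_inpaint(model, image, mask, scales=(0.25, 0.5, 1.0),
                    steps=15, lr=1e-3):
    """The result at each coarser scale is the ground-truth signal for a
    few 'overfitting' steps at the next, finer scale. In practice the
    original weights are restored after each image."""
    guidance = None
    for s in scales:
        size = (int(image.shape[-2] * s), int(image.shape[-1] * s))
        img_s = F.interpolate(image, size=size, mode="bilinear")
        mask_s = (F.interpolate(mask, size=size, mode="nearest") > 0.5).float()
        if guidance is not None:
            params = [p for p in model.parameters() if p.requires_grad]
            opt = torch.optim.Adam(params, lr=lr)
            for _ in range(steps):
                out = model(img_s, mask_s)
                down = F.interpolate(out, size=guidance.shape[-2:], mode="bilinear")
                loss = F.l1_loss(down, guidance)  # match the coarser result
                opt.zero_grad()
                loss.backward()
                opt.step()
        with torch.no_grad():
            guidance = model(img_s, mask_s)  # guidance for the next scale
    return guidance  # full-resolution result
```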

Illustrating our partial refinement approach. The first part of the ResNet encoder is frozen; only the second half is fine-tuned at the various scales.

While the refinement procedure significantly reduced artifacts, it introduced a new problem: increased inference time. Instead of a single forward pass, the procedure now required multiple forward and backward passes across several iterations and scales for a single image. Tuning the learning rate, the number of iterations, and the number of scales helped, but these parameters had only a minor impact on the total time. To balance quality against inference time, we designed a “partial” refinement procedure: we freeze the first half of the ResNet layers in the LaMa module and fine-tune only the second half. By doing so, we achieved a healthy compromise between the quality of the inpainting results returned to our clients and the time required for inference.
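The freezing step itself is simple. A sketch, assuming the generator exposes its ResNet blocks as a list called `resnet_blocks` (an illustrative name, not LaMa’s actual attribute):

```python
# "Partial refinement": freeze the first half of the ResNet blocks,
# fine-tune only the second half during the refinement iterations.
blocks = list(model.resnet_blocks)  # hypothetical attribute name
half = len(blocks) // 2
for block in blocks[:half]:
    for p in block.parameters():
        p.requires_grad_(False)  # frozen: never updated
for block in blocks[half:]:
    for p in block.parameters():
        p.requires_grad_(True)   # updated at each refinement scale
```

Since the refinement optimizer only collects parameters with `requires_grad` set, this cuts down the backward-pass work without changing the forward pass.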

Deploying Magic Eraser v1

After implementing the changes to the LaMa model described above, we were confident in the results and ready to deliver a solution to our clients. Since LaMa is conveniently written in PyTorch, we created a TorchServe script to archive the model, wrapped it in a Docker image, and deployed it in our existing GCP environment. This allowed us to seamlessly integrate the new model into our services and provide our clients with a high-quality object removal solution.
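As an illustration of what such a deployment can look like, here is a minimal custom TorchServe handler sketch; the request fields and the pre/post-processing are assumptions, not our production code:

```python
# Minimal TorchServe handler sketch (request fields are assumptions).
# The archive is built with torch-model-archiver and the resulting
# .mar file is served from the Docker image.
import base64
import io

import torch
from PIL import Image
from torchvision.transforms.functional import to_pil_image, to_tensor
from ts.torch_handler.base_handler import BaseHandler

class InpaintingHandler(BaseHandler):
    def preprocess(self, data):
        # Assume each request carries base64-encoded "image" and "mask" fields.
        row = data[0]
        image = Image.open(io.BytesIO(base64.b64decode(row["image"]))).convert("RGB")
        mask = Image.open(io.BytesIO(base64.b64decode(row["mask"]))).convert("L")
        return to_tensor(image).unsqueeze(0), to_tensor(mask).unsqueeze(0)

    def inference(self, inputs):
        image, mask = inputs
        with torch.no_grad():
            # LaMa-style 4-channel input: masked image + mask.
            return self.model(torch.cat([image * (1 - mask), mask], dim=1))

    def postprocess(self, output):
        buf = io.BytesIO()
        to_pil_image(output.squeeze(0).clamp(0, 1)).save(buf, format="PNG")
        return [base64.b64encode(buf.getvalue()).decode()]
```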

After delivering the Magic Eraser feature to our clients, we received extremely positive feedback. It was an instant success, and our clients loved the new tool they never knew they needed. However, we soon discovered that LaMa’s failure on large masks occurred more frequently than anticipated, even with the help of the “partial refinement” procedure. The model performed especially poorly when real estate agents attempted to remove multiple objects at once from a surface, resulting in failed reconstructions. At this point, we were in a comfortable position: Magic Eraser was deployed and receiving positive feedback, so we had the opportunity to revisit diffusion models and see whether they could provide a better solution.

An example failure case of our LaMa model. LaMa fails at reconstructing the texture when a large number of objects are removed at once. This motivated us to revisit diffusion-based approaches.

Revisiting diffusion models for Magic Eraser v2

To begin our research on diffusion models for Magic Eraser, we benchmarked the available diffusion models for the inpainting task, using the FID score to evaluate realism and the LPIPS score to assess fidelity to the ground truth (i.e., a reconstruction score). Our quantitative evaluation showed that Stable Diffusion XL had the best inpainting metrics, making it the most promising candidate for our object removal task. We applied Stable Diffusion XL to some of our test images and observed that, consistent with our earlier findings, the model created new objects instead of completely removing them at the mask location. We knew we needed to address this behavior to make the model suitable for object removal.
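For reference, both metrics are straightforward to compute with public packages. A sketch using torchmetrics for FID and the lpips package for LPIPS; the random tensors are placeholders for batches of real, inpainted, and ground-truth images:

```python
import torch
import lpips
from torchmetrics.image.fid import FrechetInceptionDistance

# Placeholder batches; in practice these are real / inpainted test images.
real = torch.randint(0, 256, (8, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (8, 3, 299, 299), dtype=torch.uint8)

# FID measures realism: the distance between feature distributions.
fid = FrechetInceptionDistance(feature=2048)
fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())

# LPIPS measures fidelity to the ground truth (inputs scaled to [-1, 1]).
lpips_fn = lpips.LPIPS(net="alex")
gt = torch.rand(8, 3, 256, 256) * 2 - 1
pred = torch.rand(8, 3, 256, 256) * 2 - 1
print("LPIPS:", lpips_fn(gt, pred).mean().item())
```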

Stable Diffusion XL produces realistic texture; however, it has a tendency to generate new objects instead of removing all of them.

Fine-tuning Stable Diffusion XL on our professional Real Estate dataset

Stable Diffusion models are trained on vast web-crawled datasets, such as the LAION-5B dataset, which may explain their bias towards object generation. Moreover, diffusion models have never been taught the concept of not generating something, making this behavior challenging to prevent with off-the-shelf models. Engineering tricks, such as prompt engineering with positive and negative prompts for Stable Diffusion XL, did not significantly reduce object generation. To address this, we decided to fine-tune Stable Diffusion XL on our own dataset.

At Meero, we have a large collection of images from professional photo shoots, which correspond to vacant, clean, and tidy spaces with minimal clutter on surfaces. These images provide an ideal large dataset for fine-tuning the model for our Magic Eraser application, shifting its output domain toward that of professional-looking Real Estate images. For this training, we generate random masks for the inpainting fine-tuning of Stable Diffusion XL, and we fine-tune using LoRA, as sketched below.
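As an illustration of the mask side of this setup, a simple random-mask generator could look like the following; the shapes our actual generator draws (brush strokes, object-like regions) are an implementation detail this sketch does not reproduce:

```python
import numpy as np

def random_mask(h: int, w: int, max_rects: int = 4, rng=None) -> np.ndarray:
    """Illustrative random rectangular masks for inpainting fine-tuning."""
    rng = rng or np.random.default_rng()
    mask = np.zeros((h, w), dtype=np.float32)
    for _ in range(rng.integers(1, max_rects + 1)):
        rh = rng.integers(h // 8, h // 2)
        rw = rng.integers(w // 8, w // 2)
        y = rng.integers(0, h - rh)
        x = rng.integers(0, w - rw)
        mask[y : y + rh, x : x + rw] = 1.0
    return mask  # 1 = region to inpaint during training
```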

We discovered fairly quickly that fine-tuning Stable Diffusion on professional photo shoots alone was insufficient to reduce the object generation behavior. Essentially, what we lacked was a way to tie this behavior to a text prompt. Specifically, if we could associate the unwanted behavior (generating a new object) with a specific sentence, we could take a significant step towards breaking this habit.

Example image + masks obtained from Magic Eraser v1 usage. The diffusion model can be trained to generate these original objects at the mask location, making sure to bind this generative behavior to a specific fixed prompt.

Fortunately, we obtained data from our clients’ usage of Magic Eraser v1, which provided exactly what we needed: masks corresponding to plausible objects to generate in the original image. With these mask-object pairs, we could teach Stable Diffusion XL to generate an object at the mask location and associate this type of generation with a specific, fixed caption. The caption we chose is “low-quality photo with scattered objects and trinkets”. We refer to this as the “wrong generation” caption, which we always associate during training with image-mask pairs where an object is generated at the mask location.
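In dataset terms, the binding is a simple rule; a sketch with illustrative names of how a training sample could be assembled:

```python
WRONG_GENERATION = "low-quality photo with scattered objects and trinkets"

def make_training_sample(image, mask, mask_covers_object, clean_caption=""):
    """Masks from Magic Eraser v1 usage cover a real object in the image;
    those samples always receive the fixed 'wrong generation' caption, so
    the model learns to tie object generation to that exact sentence."""
    caption = WRONG_GENERATION if mask_covers_object else clean_caption
    return {"image": image, "mask": mask, "caption": caption}
```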

Here’s the final trick that ties everything together: by associating this “wrong-generation” caption with this particular behavior, we can insert it as the negative prompt during inference. This allows us to take control over the generation behavior of the diffusion model and significantly reduce the likelihood of generating an object instead of erasing it.
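At inference time, this is just the negative_prompt argument of the inpainting pipeline. A sketch with the diffusers library, using the public SDXL inpainting checkpoint as a stand-in for our fine-tuned model; the LoRA path, input paths, and positive prompt are illustrative:

```python
import torch
from diffusers import StableDiffusionXLInpaintPipeline
from diffusers.utils import load_image

pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    torch_dtype=torch.float16,
).to("cuda")
# pipe.load_lora_weights("path/to/our_lora")  # hypothetical LoRA weights

init_image = load_image("room.jpg")  # placeholder input image
mask_image = load_image("mask.png")  # white where objects should be erased

result = pipe(
    prompt="an empty, clean, tidy room",  # illustrative positive prompt
    negative_prompt="low-quality photo with scattered objects and trinkets",
    image=init_image,
    mask_image=mask_image,
).images[0]
result.save("erased.png")
```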

Left: Stable Diffusion XL. Middle: LoRA-finetuned SD XL without the ‘wrong generation’ negative prompt during inference. Right: results when including ‘wrong generation’ as the negative prompt during inference.

Comparing Magic Eraser v1 and v2

Just as we did with our LaMa model, we were able to quickly deploy our fine-tuned diffusion model and make it available to our clients as Magic Eraser v2. To highlight the differences between the two models, we provide some example pictures below. As can be seen, the diffusion-based model, Magic Eraser v2, consistently produces high-quality, realistic results with accurate semantic and texture reconstruction. In contrast, the GAN-based deployment of Magic Eraser is more prone to failure.

Left: input, Middle: Magic Eraser v1 (GAN), Right: Magic Eraser v2 (diffusion)

Final words

Magic Eraser is one of our most popular tools at ProperShot, and we are always looking for ways to improve it. One potential avenue is to incorporate instance detection to automatically suggest objects that our clients can remove to enhance the appearance of the room. Additionally, we are exploring the possibility of extending the feature to include an “empty room” option, which would remove all furniture from the room, allowing potential clients to better visualize themselves in the space.

That’s it for the story behind Magic Eraser, one of our clients’ favorite features within ProperShot. Thank you for taking the time to read about our work, and please don’t hesitate to reach out if you have any questions!
