Do pictures in real estate listings actually help us estimate the list price?

Anthony Galtier
ILB Labs publications
10 min read · Sep 14, 2022

Estimating residential real estate prices is a common topic, with a substantial literature on predicting prices from a set of numeric and categorical features describing the characteristics of a property, such as the location, surface area, land size, number of bedrooms, or age of the building. These hedonic approaches are usually sufficient to estimate the price range but lack precision.

Few, however, have examined whether other types of data carry complementary information that would enable a more precise price estimation. In our previous posts, we showed that the textual descriptions of a property achieve list price estimation performance similar to that of tabular data. In this article, we will explore the photos of a property and see whether they contribute to improving the performance of the usual feature-based models.

To do so, we test two approaches: the first extracts an explicit set of features with traditional computer vision techniques, and the second relies on a Convolutional Neural Network (CNN) to embed the information carried by the images. We then compare each approach's performance to a simple benchmark model based exclusively on the tabular set of features.

For more details, you can check out the code developed for this article here.

The data set

The data set used in this article is the same as the one used in our previous posts. It consists of 46K listings for which we collected 27 numeric and categorical features (location, surface area, number of rooms, exposure…), a textual description and 1 to 6 photos. This data was scraped from French real-estate websites.

The photos collected are the photos displayed on the listing’s web page. In most cases, they are photos showing different views of the property, mostly indoor views for apartments and a mix of wide outdoor views and indoor views of specific rooms for houses and larger property types. These photos usually highlight the most notable characteristics of the property (swimming pools, verandas, large windows…). That said, angles, lighting and subject can differ significantly from one listing to another, some photos even containing watermarks, text, or logos.

Screenshot of a typical French real estate listing

Extracting valuable information from these images is not a straightforward task. To simplify things a bit, we chose to restrict our study to apartments and houses listed between 20K and 2M euros with fewer than 10 bedrooms. This leaves us with 69% (32K) of the original 46K listings.

This data set was then randomly split into a training set with 80% of listings (26K) and a test set with the remaining 20% (6K) of the listings kept aside to evaluate the performance of the different models.
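As a rough illustration, the filtering and splitting steps above could be sketched as follows (the field names such as `property_type` are hypothetical, for illustration only):

```python
import random

def filter_listings(listings):
    """Keep apartments and houses listed between 20K and 2M euros
    with fewer than 10 bedrooms, as described above."""
    return [
        l for l in listings
        if l["property_type"] in {"apartment", "house"}
        and 20_000 <= l["price"] <= 2_000_000
        and l["bedrooms"] < 10
    ]

def train_test_split(listings, test_ratio=0.2, seed=42):
    """Random 80/20 split at the listing level."""
    shuffled = listings[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]
```

Splitting at the listing level (rather than the photo level) matters later on: it keeps all photos of a given property on the same side of the split.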

Benchmark

Before diving into the photos, we create a model based exclusively on the numeric and categorical features describing the property. The price prediction performance of this model will be our benchmark. We will use it as a point of reference to evaluate how the performance is affected when information extracted from the listing’s photos is fed as well into the model.

Model

The model consists of a pipeline of pre-processing transformations followed by a final regression estimator. The pre-processing steps deal with outliers and reduce the columns specifying the number of rooms, bathrooms, and bedrooms into a single feature by principal component analysis (PCA). For the final estimator, we use a CatBoost regressor, a supervised boosted tree-based model specifically adapted to data sets with categorical variables. The model was trained on a selection of 17 numeric and categorical variables from the train set, with a 5-fold grid-search cross-validation to optimize its hyperparameters.
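The room-count reduction step can be illustrated with a stdlib-only sketch of PCA via power iteration (the actual pipeline presumably relies on a library implementation such as scikit-learn's; this is only meant to show the idea of collapsing three correlated counts into one score):

```python
import math

def first_principal_component(rows, n_iter=200):
    """Top eigenvector of the covariance matrix, found by power iteration.
    `rows` is a list of [rooms, bedrooms, bathrooms] triplets."""
    d = len(rows[0])
    means = [sum(r[i] for r in rows) / len(rows) for i in range(d)]
    centered = [[r[i] - means[i] for i in range(d)] for r in rows]
    # Sample covariance matrix of the centered data
    cov = [[sum(x[i] * x[j] for x in centered) / (len(rows) - 1)
            for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v, means

def project(row, component, means):
    """Reduce one listing's room counts to a single PCA score."""
    return sum((row[i] - means[i]) * component[i] for i in range(len(row)))
```

Because the three counts are strongly correlated, the first component captures most of their joint variance, which is why a single feature suffices.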

Variable distributions as observed in the train set

Results

The performance of the benchmark model was then evaluated on the test set. Predictions are off on average by 73.2K euros, or 25.8% in relative terms. The median error of 41.5K euros (16.9% in relative terms) is lower than the mean absolute error, suggesting the model makes unusually large errors on a few properties. The R squared score reaches 0.79, indicating that 79% of the variability in the price is explained by our model, which is quite satisfactory.
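For reference, the error metrics quoted throughout this article can be computed as follows (a minimal sketch, not the exact evaluation code used):

```python
import statistics

def error_metrics(y_true, y_pred):
    """MAE, MAPE, and their median counterparts (median absolute
    error and median absolute percentage error)."""
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    pct_err = [abs(t - p) / t for t, p in zip(y_true, y_pred)]
    return {
        "mae": statistics.mean(abs_err),
        "mape": statistics.mean(pct_err),
        "median_ae": statistics.median(abs_err),
        "median_ape": statistics.median(pct_err),
    }
```

Reporting the median alongside the mean is what reveals the handful of unusually large errors mentioned above: a few extreme misses inflate the mean but barely move the median.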

Benchmark model performance scores on the test set (left) and top 5 variables by feature importance (right)

Like most decision tree-based ensemble learning models (Random Forests, XGBoost…), CatBoost provides a “feature importance” measure that quantifies how much each feature contributes on average to the model’s predictions. When interpreting these values, one must keep in mind that feature importance is shared proportionally between correlated features.

The most discriminating features for our benchmark model are the location and size of the property, which makes sense. However, we should not conclude from the graph above that size matters less than location: the feature importances of the size, number of rooms and land size variables are somewhat understated because these features are correlated.

With simple image feature extraction techniques

For each listing, we have 1 to 6 photos, with most listings (70%) having 6. Since over 99% of the photos are JPEG color images, they can easily be represented in RGB format, where each pixel’s color is a triplet of values corresponding to the quantities of red, green, and blue. The photos can also be represented in the HSV color space, an alternative to RGB where the triplet describes each pixel’s hue, saturation and brightness (“value”).
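The RGB-to-HSV conversion is available in Python's standard library; a minimal sketch (note that `colorsys` works on floats in [0, 1], so 8-bit channels must be rescaled first):

```python
import colorsys

def pixel_to_hsv(r, g, b):
    """Convert one 8-bit RGB pixel to (hue, saturation, value),
    each scaled to [0, 1]."""
    return colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)

def mean_saturation_value(pixels):
    """Average saturation and brightness over a list of RGB pixels,
    the kind of image-level summary used in the next section."""
    hsv = [pixel_to_hsv(*p) for p in pixels]
    n = len(hsv)
    return (sum(s for _, s, _ in hsv) / n,
            sum(v for _, _, v in hsv) / n)
```

In practice one would run this over the pixel array of a decoded JPEG (e.g. via Pillow or OpenCV) rather than a Python list, but the conversion itself is the same.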

The idea is thus to extract a set of simple features from these representations, append them to the previously used numeric and categorical features, and see if they enable the model to better predict the list price.

Extracted features

For each photo, we chose to extract shape-related features such as the orientation (portrait, square or landscape) and the aspect ratio (4:3, 16:9…), as well as color-related features (main color, RGB skewness, saturation, brightness…). These image-level features were then aggregated to the listing level by mean, frequency or other statistical measures, depending on the nature of the feature.
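A minimal sketch of the shape features and their listing-level aggregation might look like this (the function names and the choice of aggregates are illustrative, not the exact ones used):

```python
from collections import Counter
from statistics import mean

def shape_features(width, height):
    """Per-photo shape features: orientation class and aspect ratio."""
    if width == height:
        orientation = "square"
    elif width > height:
        orientation = "landscape"
    else:
        orientation = "portrait"
    return {"orientation": orientation, "aspect_ratio": width / height}

def aggregate_shapes(photo_sizes):
    """Aggregate photo-level features to the listing level:
    modal orientation and mean aspect ratio."""
    feats = [shape_features(w, h) for w, h in photo_sizes]
    orientations = Counter(f["orientation"] for f in feats)
    return {
        "main_orientation": orientations.most_common(1)[0][0],
        "mean_aspect_ratio": mean(f["aspect_ratio"] for f in feats),
    }
```

The categorical feature (modal orientation) is aggregated by frequency while the numeric one (aspect ratio) is averaged, mirroring the "mean, frequency or other statistical measures" rule described above.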

Extracted image feature distributions aggregated at listing level as observed on the train set

Although these features do not explicitly describe the subject of the photos, they do carry some information. We noticed, for example, that photos of apartments tend to be less bright and saturated than photos of houses, with less yellow and more red. There are also more portrait-oriented photos of houses than of apartments. However, is this enough to improve the performance of our model?

Model

We used the same CatBoost regressor and the same training and test observations as for the benchmark model, adding the 18 features extracted from the images to the original data set. The parameters of the model were optimized with a 5-fold randomized grid-search cross-validation, just like for the benchmark model.

Results

This new model’s predictions are off by an average of 75.8K euros, or 26.3% in relative terms. The median absolute error, at 43.4K euros (17.6% in relative terms), again remains lower than the mean. The R squared stays essentially unchanged at 0.78. Thus, the features extracted from the images did not improve the performance of the model; they may even have degraded it slightly. That said, the difference in performance is small and could simply be due to the inherent variability of the model.

Looking at the feature importance graph, we see that the most influential features remain the same ones as for the benchmark. None of the features extracted from the photos appear to be of any importance in the model’s decision-making process.

Image model performance scores on the test set (left) and top 5 variables by feature importance (right)

With deep learning

Embedding images with a CNN

Our first deep learning approach is just like the previous one, except that we use a fine-tuned CNN to extract features from the photos instead. The 138K photos associated with the 26K training listings are used to optimize the parameters of a CNN composed of a ResNet34 followed by a few regression layers. The weights of the CNN are fine-tuned on the task of predicting, from each image, the price of the corresponding listing.

After the fine-tuning step, we realized that predictions can vary significantly from one image to another. To mitigate this, we decided to retain only the average of the three median predictions among the photos of a same listing. It is this average prediction that we append as a new variable to the benchmark data set. That data is then used to train a CatBoost model following the same procedure as before.
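This aggregation rule could be sketched as follows (the handling of listings with fewer than three photos, and the exact choice of the three central values for an even photo count, are assumptions on our part):

```python
def aggregate_predictions(photo_preds):
    """Average of the three predictions closest to the middle of the
    sorted per-photo predictions; listings with three photos or fewer
    simply get the mean of all their predictions."""
    ordered = sorted(photo_preds)
    if len(ordered) <= 3:
        return sum(ordered) / len(ordered)
    mid = len(ordered) // 2
    core = ordered[mid - 1 : mid + 2]  # three central values
    return sum(core) / 3
```

Averaging only the central predictions discards the per-photo outliers that the CNN produces, which is exactly the variability this step is meant to dampen.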

Results

Already during the fine-tuning step, we noticed that the CNN struggled to make good list price predictions from individual photos. Despite our attempt to mitigate the variability of the CNN’s predictions by removing outliers and aggregating only the median predictions, the CNN outputs failed to improve the overall price prediction performance.

CNN based image model performance scores on the test set (left) and top 5 variables by feature importance (right)

The MAE at 73K euros and MAPE at 25.8% are not significantly different from the benchmark. The median absolute error at 41.8K euros and median percentage error at 16.8% are also no better than the benchmark. Furthermore, the feature importance of the average price prediction extracted from the photos is low, suggesting that it does not contribute much to the model’s predictions.

End to end neural network

We tried a second deep learning approach consisting of a two-branch neural network: one branch embeds the photos, the other embeds the numeric and categorical variables, and a common head performs the regression from the concatenation of the two embeddings.

Schematic representation of the end to end deep learning model

In the hope of making the entire network converge, the architectures and parameters of the image and tabular branches are each optimized separately first. For the image branch, we reuse the previously optimized parameters of the ResNet34 CNN. The common regression part of the network is optimized last, with the parameters of the two pre-trained branches frozen.
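A minimal PyTorch sketch of this two-branch architecture, with a tiny convolutional stack standing in for the ResNet34 branch and purely illustrative layer sizes:

```python
import torch
import torch.nn as nn

class TwoBranchNet(nn.Module):
    """Two-branch regressor: one branch embeds a photo, the other embeds
    the tabular features; a shared head regresses the price from the
    concatenated embeddings. A small conv stack stands in here for the
    ResNet34 used in the article."""

    def __init__(self, n_tabular=17, emb_dim=32):
        super().__init__()
        self.image_branch = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(8, emb_dim),
        )
        self.tabular_branch = nn.Sequential(
            nn.Linear(n_tabular, 64), nn.ReLU(), nn.Linear(64, emb_dim)
        )
        self.head = nn.Sequential(
            nn.Linear(2 * emb_dim, 32), nn.ReLU(), nn.Linear(32, 1)
        )

    def forward(self, image, tabular):
        z = torch.cat(
            [self.image_branch(image), self.tabular_branch(tabular)], dim=1
        )
        return self.head(z).squeeze(1)

def freeze(module):
    """Freeze a pre-trained branch before training the shared head."""
    for p in module.parameters():
        p.requires_grad = False
```

Freezing both branches before training the head corresponds to the last optimization step described above: only the parameters of `head` receive gradient updates.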

Results

We noticed immediately during the training of the embedding branches that the network could learn to predict prices well from the tabular data but struggled much more with the photos, just as in the previous section. When we put everything together in the two-branch network, the embeddings learned from the images seemed only to add noise to the training procedure. Indeed, the two-branch model learns more slowly and does not outperform the simple feed-forward network of the tabular branch.

Conclusions, limitations, and next steps

The main conclusion we draw from our different attempts is that extracting insightful information from the photos published with a real estate listing is a challenging task. None of our proposed implementations were capable of leveraging the images to improve on the benchmark model that relies only on numerical and categorical features.

Furthermore, unlike in our previous work with textual data, we observed poor performance even when optimizing a CNN to directly predict the list price from the photos. This shows that it is not just difficult to improve on the benchmark; it is inherently difficult to extract useful information from the photos.

The task of predicting a list price from photos of a property is thus a difficult one, even for a human when you think about it. Images, unlike textual descriptions or tabular data sets, do not explicitly highlight the key facts and value drivers of a listing, most commonly, the size and location of the property. Images also seem to carry more noise and superfluous information than other data types. The diversity of subjects and angles probably make it hard for algorithms to generalize and predict the price accurately.

That said, there is surely some value to extract from certain photos that highlight specific value-driving characteristics of the property. Alone, such information may not be enough, but we can expect it to be complementary to the usual housing variables.

An improvement we should consider next time is to be more specific in the selection of photos used to train our models. Semi-supervised learning could come in handy here to help us automatically select images that a human would deem useful to predict the price. Another idea would be to use domain expertise to identify certain value driving characteristics of properties that do not exist in the tabular data set (presence of a pool, a lawn, brightness of the rooms…) and then build classifiers to extract these features from the photos.

Acknowledgements

I would like to give credit and special thanks to Chahnez Chouba of the Institut Louis Bachelier Datalab for her significant contribution to the content of this article.
