Predicting rental income with machine learning

Ali Kokaz
Published in Bricklane Tech
6 min read · Jan 6, 2021

Hello, I’m Ali from Bricklane’s Data team.

Previously, I wrote about how Bricklane uses technology to predict rental income, a value that is a critical part of how we choose which residential properties to invest in.

That article explained how we automated a comparison valuation method to quickly predict rental incomes and scale up our Property team’s efficiency.

With that development, our analysts went from being able to review 5–8 candidates a day to hundreds. The team could now review many property adverts quickly. However, over time they found a human was still needed to verify and tune the final result, because the model rarely had exactly matched comparables to work with (usually these either do not exist, or the data is not readily available). Sometimes the predicted value was optimistic; on other occasions it was completely wrong.

Improvement through machine learning

Bricklane has been collecting data on properties since 2017, including thousands of highly accurate manual rental valuations, which all properties selected for investment go through. Comparing these values with our automated predictions confirmed what the Property team were experiencing: the predictions were great at guiding them towards the best candidates in a city, but there was still work to do to produce a reliably precise result.

Given our comparative valuation approach, this made sense. Accurate comparative results rely on finding exact or very similar properties to compare against. While a human is very good at picking up details about a property, such as furnishing condition and the amount of light in a room, coding this to be performed systematically from a listing is difficult.

To account for all the intricate details that a human picks up when creating a valuation, and to create robust generalisations for when no comparable listings are available, I decided to use our data to train a machine learning model.

Moreover, using diagnostic methods, such as SHAP value analysis, allowed us to also identify and understand patterns that human valuations do not pick up on. More on that later.

Model & data selection

Bricklane’s unique datasets meant I had access to many different data points I could use to define features for each property. These are quite varied, for example — how many bathrooms a property has, where the nearest tube stations are, and how large the living space is.

Combining these with our manual valuation rental data, I could now train a regression model to predict a rental income value for incoming properties.
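To make that setup concrete, here is a minimal sketch of the kind of feature table and train/test split involved. The column names and values are hypothetical illustrations, not our actual schema.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical feature table: names and values are illustrative only.
properties = pd.DataFrame({
    "bathrooms":          [1, 2, 1, 3, 1, 2, 2, 1],
    "bedrooms":           [2, 3, 1, 4, 1, 2, 3, 2],
    "floor_area_sqm":     [55.0, 78.0, 40.0, 110.0, 38.0, 62.0, 85.0, 50.0],
    "dist_to_station_km": [0.4, 1.2, 0.2, 2.5, 0.3, 0.8, 1.5, 0.6],
    "monthly_rent_gbp":   [1450, 1700, 1200, 2300, 1150, 1500, 1850, 1350],
})

X = properties.drop(columns="monthly_rent_gbp")   # features
y = properties["monthly_rent_gbp"]                # target: the manual rental valuation

# Hold back some manually valued properties to check how well the model generalises.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```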

Since properties and their rental incomes vary so widely, we needed a model that was robust to this wide range of target values. Ensemble models are a great choice for such a problem.

Ensemble learning is a technique that combines the predictions of multiple machine learning models to produce a more accurate prediction than any single model, for example by using bagging (training each model on a random sub-sample of the data) to reduce overfitting.

Overfitting happens when you fit your model to the training data so tightly that it performs poorly on new, unseen data. This is because overfit models tend to mistake noise and outliers in the data for meaningful patterns.

The diagram below shows the structure of an example ensemble model, in this case a Random Forest model. Notice how the final prediction is made by averaging the results of multiple parallel decision trees.

A visual representation of a Random Forest Ensemble model. The predictions from each sub-model are aggregated to give the final prediction result.

This parallelism means ensemble models are more robust to overfitting than single models, because each sub-model uses a sub-sample of the available training data, so the sub-models are less likely to converge on the same set of features.
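As a rough illustration, continuing the hypothetical dataset from the earlier sketch, this is how such a bagged ensemble looks in scikit-learn: each tree is trained on a bootstrap sub-sample of the data and only considers a random subset of features at each split.

```python
from sklearn.ensemble import RandomForestRegressor

# Each tree fits a bootstrap sub-sample of X_train; the forest averages the
# predictions of all the trees to produce the final result.
forest = RandomForestRegressor(
    n_estimators=200,   # number of parallel decision trees
    max_features=0.5,   # fraction of features considered at each split
    bootstrap=True,     # each tree sees a different sub-sample of the data
    random_state=42,
)
forest.fit(X_train, y_train)
print(forest.predict(X_test))   # averaged prediction from all 200 trees
```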

You can either build all the sub-models at the same time, as is the case with Random Forest models, or you can try to improve performance further by building them additively (one after the other), actively trying to reduce the error with each new sub-model you build. This is known as boosting; gradient boosting does it by performing gradient descent on a differentiable loss function.
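Here is a toy sketch of the additive approach, again reusing the hypothetical X_train and y_train from above. Each new tree is fitted to the residual error left by the ensemble so far, which is essentially what gradient boosting does for a squared-error loss, minus the shrinkage, regularisation and clever engineering of real libraries.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

learning_rate = 0.1
prediction = np.full(len(y_train), float(y_train.mean()))  # start from the mean rent
trees = []

for _ in range(100):
    residual = y_train - prediction                      # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3).fit(X_train, residual)
    prediction += learning_rate * tree.predict(X_train)  # each new tree nudges the error down
    trees.append(tree)
```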

Gradient descent: the process of gradually decreasing a cost function (RMSE, for example) by calculating its rate of change (first derivative) at each iteration.
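For intuition, here is a minimal numerical version of the same idea: repeatedly step a parameter against the first derivative of a simple squared-error cost until the cost stops shrinking.

```python
def cost(w):
    return (w - 3.0) ** 2          # a simple cost, minimised at w = 3

def gradient(w):
    return 2.0 * (w - 3.0)         # first derivative (rate of change) of the cost

w, step_size = 0.0, 0.1
for _ in range(50):
    w -= step_size * gradient(w)   # move downhill along the slope

print(round(w, 4))                 # converges towards 3.0
```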

The dataset for this exercise contained both categorical and numerical features (which is something to consider when assessing what type of ML model to use). Tree-based models (such as AdaBoost, Random Forest or XGBoost) are well suited to mixed datasets since they branch on discriminative features. In my case I used an XGBoost model due to its speed in Python, its gradient boosting optimisation and its regularisation support. The last of these in particular is one of the reasons why XGBoost performs so much better than other similar libraries.

The effect of regularisation: both the green and blue functions achieve the same loss on the training data, but the blue function is much more likely to mistake noise for a meaningful signal, due to its higher variance.
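As a hedged sketch (not our production configuration), the corresponding XGBoost setup on the hypothetical data from earlier might look like the following. The reg_alpha and reg_lambda parameters are the L1 and L2 regularisation terms mentioned above; categorical columns would first be one-hot encoded (e.g. with pandas.get_dummies) or, in recent XGBoost versions, passed natively via enable_categorical.

```python
import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    reg_alpha=0.1,    # L1 regularisation on leaf weights
    reg_lambda=1.0,   # L2 regularisation on leaf weights
)
model.fit(X_train, y_train)              # X_train/y_train from the earlier sketch
rent_predictions = model.predict(X_test)
```

The choice of loss function is covered next.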

The loss function in this case was Root Mean Squared Logarithmic Error (RMSLE) rather than the more commonly used Root Mean Squared Error (RMSE). RMSLE takes the logarithm of both the predicted and actual values, so it does not heavily penalise large absolute differences when both the predicted and actual values are themselves large. This makes the resulting model more robust when predicting values for expensive properties as well as typical ones.
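The difference is easy to see numerically. In the sketch below, two hypothetical properties are both mispredicted by £200 a month: RMSE weighs the two errors equally, while RMSLE barely notices the miss on the expensive property.

```python
import numpy as np

def rmse(actual, predicted):
    return np.sqrt(np.mean((actual - predicted) ** 2))

def rmsle(actual, predicted):
    return np.sqrt(np.mean((np.log1p(actual) - np.log1p(predicted)) ** 2))

actual    = np.array([600.0, 5000.0])   # monthly rents
predicted = np.array([800.0, 5200.0])   # both off by £200

print(rmse(actual, predicted))    # 200.0  — both errors weigh the same
print(rmsle(actual, predicted))   # ~0.21 — dominated by the cheaper property's error
```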

After several iterations, I found the model predictions were nearly as accurate as those produced by our Property team, in a much shorter time!

Additionally, breaking down the final models using SHAP values allowed us to understand patterns in rental values in much more detail, turning anecdotal suggestions into statistical evidence.

SHAP value breakdown for the effects of Longitude on London Rental Price from an early model, notice how the model accurately describes the ‘Central London’ premium. Graphs like these allowed the Property team to further their understanding of areas, and validate the models produced.
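For readers curious how such a breakdown is produced, here is a minimal sketch with the shap library, using the XGBoost regressor from the earlier sketch; the feature names are hypothetical.

```python
import shap

explainer = shap.TreeExplainer(model)        # works with tree ensembles such as XGBoost
shap_values = explainer.shap_values(X_test)

# One row per property, one column per feature: how much each feature pushed
# that property's predicted rent above or below the average prediction.
shap.summary_plot(shap_values, X_test)

# A dependence plot for a single feature, e.g. a hypothetical "longitude" column:
# shap.dependence_plot("longitude", shap_values, X_test)
```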

Our Engineering team moved this model into our production system, and the impact was instant! Our platform was now directing the Property team from thousands of potential candidates to the handful worth spending time on, meaning the team can fly through areas in search of the best investment cases.

The cold start problem

While the machine learning model returns superior results to the comparative valuation, there are intricacies to using it. Firstly, it requires some tuning to cater for different areas of the UK — for example, the decisions it makes for the Birmingham rental market don’t work as well for Leeds.

Secondly, the model requires a number of manually valued properties in order to initially train it. Without a sufficient amount, the model cannot reach a conclusion on which decision produces a more accurate result. This means that if we don’t have enough valuation data for a particular area of the UK, it can be difficult to build a model for it. This is an issue known as the cold start problem.

Fortunately, the breadth of Bricklane’s private datasets has allowed me to mitigate this. Using a technique based on collaborative filtering, we are able to build models for these situations without making big sacrifices on precision.

I will talk about other methods you can use to combat cold-starting in a later article.

So, what’s next?

Integrating machine learning into our day-to-day operations has made a huge impact. Having come from much larger organisations, it’s exciting to see your work making a difference so quickly. We’ve found machine learning can be applied to many parts of our decision making, can begin to inform our investment strategy, and can even debunk some industry myths!

We hope to share more on these in future, be sure to subscribe to the Bricklane Tech Blog for updates.


Ali Kokaz
Bricklane Tech

Data Scientist, Algo-trading enthusiast & full-time Arsenal sufferer.