Introducing ASTOR: Suggesting real estate management policies with Machine Learning

Enabling asset managers of housing associations to create their own machine learning models to assist them with their decision making

R&D Labs
Aug 14, 2018 · 11 min read

Artificial Intelligence (AI) is increasingly used in many domains nowadays. We see the data used as input for such models becoming a precious commodity for our clients. With the right tools, data can be transformed into valuable insights that support decision-making processes. At Tech Labs we put significant effort into improving the accuracy of traditional econometric models while retaining their economic interpretability, and into exploring innovative ways to make use of the proprietary and public data available. In this article we show by example how Dutch housing associations can monetize their data with valuable insights at a strategic level. We built a prototype called ASTOR which assists asset managers in creating policies for their real estate portfolio (e.g. should a building be renovated, demolished or sold, and if so, what is the right timing?).

The data stored in housing associations' systems contains a lot of information about the real estate they own. Since this data is also labeled, it can be used with machine learning classification algorithms to create predictive models. However, machine learning is a field that requires extensive knowledge. Making machine learning approachable to novice users enables domain experts to apply their knowledge to improve a model's performance.

Background

Ortec Finance develops applications that allow housing associations to perform asset management in order to determine policies for the complexes in their real estate portfolio. An asset manager analyses the characteristics of the different complexes of a housing association and produces a policy for each particular complex, in such a way that it contributes to the goals for the entire portfolio. Such goals can for instance be achieving a target return for the organization, achieving a sustainability target for the owned complexes, or a social target that aims to provide sufficient social housing. Optimizing the asset management process is essential for Ortec Finance in order to stay relevant in the market and to keep providing clients with the best solutions.

Problem statement

It is possible to develop a system that improves the asset management process with machine learning methods. However, machine learning requires extensive knowledge in order to create reliable and valid models, as well as broad knowledge of the specific domain. Developing a system with predefined data pre-processing methods, machine learning algorithms, and prediction of new instances enables the creation of models that can improve the asset management process by suggesting policies to be applied. In addition, an advising system that provides information and assistance for the various tasks involved in creating a machine learning model enables the involvement of asset managers and their knowledge of the data.

Methodology

Process workflow

First, a machine learning process is defined that is able to suggest policies for complexes; its purpose is to identify the functionality the system needs to provide. Subsequently, a system architecture is designed in order to produce a high-quality system. This is followed by the implementation of data pre-processing methods, the implementation or integration of machine learning algorithms, the implementation of prediction functionality, and the implementation of methods to incorporate asset managers' knowledge. The system is then evaluated, as is the performance of the machine learning models. The advising system is also evaluated to find out whether novice machine learning users can create models with satisfactory performance. Finally, it is evaluated whether the system actually improves asset management, by analysing the predicted policies for complexes.

Machine Learning

Data Scraping

Data pre-processing and Data analysis

The main goals of the data analysis were to gain more knowledge about the data and to identify the relationships between the different variables. First to be inspected was what needs to be predicted by the machine learning model under construction: policies. Based on the available realistic data set, it was decided to try to predict the sell policy of a complex, as that data set contained enough complexes assigned to that policy (132) for machine learning, whereas the other policies had very few (below 20). It can safely be assumed that if the sell policy can be predicted, other policies can be predicted as well once such data is available. Another significant conclusion from the data analysis was that the relationships between the different variables are linear, so linear machine learning algorithms should produce better performance than non-linear ones. One such relation can be observed in the following violin plot, which gives insight into the rent distribution:

Violin plot: complexes' rent distribution for the sell policy

The plot shows that the medians of the two classes are visibly separated, which means these can be good features for classification. The following scatter plot also gives an indication of a linear relationship:

Scatter plot of taxation value versus tax value

Inspecting these plots leads to the conclusion that the data is linearly separable, although there is a cluster of observations where the sell policy and the not-sell policy overlap. This means that, taken separately, neither feature is ideal for predicting new instances, but a combination of both may produce good results.
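For reference, such diagnostic plots can be produced in a few lines with seaborn; the column names below are hypothetical placeholders, since the actual feature names are not listed in this article:

```python
# Sketch of the two diagnostic plots; column and file names are hypothetical.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("complexes.csv")  # hypothetical export of the complexes data

# Violin plot: rent distribution per policy class.
sns.violinplot(data=df, x="policy", y="rent")
plt.show()

# Scatter plot: taxation value against tax value, colored by policy.
sns.scatterplot(data=df, x="taxation_value", y="tax_value", hue="policy")
plt.show()
```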

The variables identified as having a linear relationship in the plots were also recognized by an algorithm that selects the K most powerful predictors using the mutual information metric, which measures the mutual dependence between two variables. Since the algorithm selects the same variables that were identified during the manual inspection of the data, the conclusion that linear machine learning algorithms will produce better results is most likely correct.
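A minimal sketch of such a selection step with scikit-learn, reusing the hypothetical data frame from the previous sketch and assuming numeric features and a 0/1-encoded label:

```python
# Sketch of selecting the K best predictors by mutual information.
# Feature and label names are hypothetical; assumes at least 10 numeric features.
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X = df.drop(columns=["policy"])  # candidate predictors
y = df["policy"]                 # label: sell (1) / not-sell (0)

selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_best = selector.fit_transform(X, y)

# Which features survived the selection?
print(X.columns[selector.get_support()])
```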

Machine learning model

Several classification algorithms were trained, each in five pre-processing configurations (a pipeline sketch follows the list):

  • Without any further preprocessing (1).
  • With optimal features (2) — selecting the optimal features using recursive feature elimination.
  • With optimal features, principal component analysis and normalized data (3).
  • With optimal features, principal component analysis and robust scaled data (4).
  • With optimal features, principal component analysis, normalized data and robust scaled data (5).
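As a hedged illustration of how such a configuration can be assembled, here is a scikit-learn pipeline for variant (4); the estimator and the step parameters are assumptions, not ASTOR's exact setup:

```python
# Sketch of configuration (4): optimal features via RFE, robust scaling, PCA.
# Estimator choice and parameters are illustrative assumptions.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipeline = Pipeline([
    ("rfe", RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)),
    ("scale", RobustScaler()),
    ("pca", PCA(n_components=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 10-fold cross-validation, matching the metrics reported below;
# X, y as in the earlier sketches, with y encoded as 0/1.
scores = cross_val_score(pipeline, X, y, cv=10, scoring="f1")
print(scores.mean())
```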

The following table contains, for each algorithm, its best-performing configuration.

Models' metrics with 10-fold cross-validation

All the classifiers have similar performance. In this case, although the differences are small, the model with the most acceptable trade-off between false negatives and false positives, indicated by the highest F1 score and the highest accuracy, is XGBoost. To further improve the model's predictions, the ROC curve is used to determine a decision threshold that produces an even better trade-off between false negatives and false positives.

XGBoost: ROC curve and AUC score with 10-fold cross-validation
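As a sketch of this threshold selection, one common criterion is Youden's J statistic; the criterion is an assumption here, since the article only reports that a threshold of 46% was ultimately chosen:

```python
# Sketch of picking a decision threshold from the ROC curve.
# Youden's J (maximize TPR - FPR) is an assumed criterion; the article
# only states that a threshold of 46% was ultimately used.
import numpy as np
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# X, y as in the earlier sketches, with y encoded as 0/1.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
model = XGBClassifier().fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]    # probability of "sell"

fpr, tpr, thresholds = roc_curve(y_test, y_proba)
threshold = thresholds[np.argmax(tpr - fpr)]   # Youden's J statistic

y_pred = (y_proba >= threshold).astype(int)    # replace the default 0.5 cut-off
```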

Model creation is further improved by using a greedy grid search to optimize the parameters of the XGBoost machine learning algorithm. The final model metrics are as follows:

XGBoost metrics with optimal features, tuned parameters and a threshold of 46% (10-fold cross-validation)

It can be concluded that the algorithm that most accurately makes predictions for the sell policy of a complex is XGBoost.
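For illustration, a greedy grid search can be approximated by tuning one parameter group at a time and freezing the winners before moving on; the parameter grids below are assumptions, not the grids actually used:

```python
# Sketch of a greedy grid search for XGBoost: tune one parameter group
# at a time, keeping the best values found so far. Grids are assumptions.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

best_params = {}
for grid in [
    {"max_depth": [3, 5, 7], "min_child_weight": [1, 3, 5]},
    {"learning_rate": [0.01, 0.1, 0.3], "n_estimators": [100, 300]},
]:
    search = GridSearchCV(XGBClassifier(**best_params), grid, cv=10, scoring="f1")
    search.fit(X, y)                        # X, y as in the earlier sketches
    best_params.update(search.best_params_) # freeze winners for the next group

print(best_params)
```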

System Development

Tools and Techniques

The server is written in Python using the Flask framework. The main reason for picking Python for this project is the scikit-learn library, which provides machine learning algorithms, data pre-processing and data analysis tools out of the box. Using Python for the server makes that library directly available, so many techniques do not need to be implemented; they only need to be integrated from the library. In addition, Python's NumPy library provides powerful tools for scientific computing on multi-dimensional array data, which makes various tasks easier to perform. Finally, the Flask framework provides an easy, out-of-the-box way to create endpoints for the server API.
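To give a flavour of this setup, here is a minimal, hypothetical sketch of a Flask prediction endpoint; the route name, payload shape, and model path are assumptions, not ASTOR's actual API:

```python
# Minimal sketch of a Flask prediction endpoint; the route name,
# payload shape, and model path are hypothetical, not ASTOR's actual API.
import joblib
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")  # hypothetical path to a trained model

@app.route("/predict", methods=["POST"])
def predict():
    features = np.array(request.json["features"])  # one row per complex
    policies = model.predict(features)
    return jsonify({"policies": policies.tolist()})

if __name__ == "__main__":
    app.run()
```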

System

Initial screen of the application

The main screen of the application looks as follows:

Main screen of the application

It includes various components, which are described below:

  • Data component — A table was implemented that shows all the data that was uploaded to the server. It can be observed in the top left rectangle. All the different columns and rows of the data can be seen in the table. Also, the table provides sorting capabilities in case this is needed to analyse the data.
  • Data information description component — In the top right rectangle, statistics for the different features can be found: the type of each feature, how many times it occurs in the data, how many unique values there are (only for categorical), the top occurring value (only for categorical), the most common value's frequency, the mean (only for numerical), the standard deviation (only for numerical), and the minimum, 25th-, 50th-, 75th-percentile, and maximum value in the column (only for numerical).
  • Recommendations — In the bottom right rectangle, various recommendations appear that are generated based on the data analysis. The recommendations are tailored to the machine learning algorithm used (XGBoost) and point the user to the right features to use, depending on the analysis of the data.
  • Machine learning model — In the bottom left rectangle, the performance metrics of the machine learning model can be observed. Before training, the data is split into training and testing sets. The machine learning model is trained on the training data and its performance metrics are computed on the testing data. The metrics include accuracy, F1 score, precision and recall. Furthermore, the user can see the testing data along with the predictions the model made for it.

At the top, the user can find a menu with the functionality of the system, which includes data source features, data pre-processing features, machine learning algorithms, as well as predict functionality.

Discussion

To evaluate the system, a feedback session was conducted with four participants: two domain experts (asset managers) and two people with domain expertise as well as machine learning and software development knowledge. The goal of the session was to create models with at least 86% accuracy. All of the participants were able to pre-process the data and create a model. Interestingly, the people without machine learning knowledge started to understand the concept of machine learning, and by the end of the session they were trying to improve the model's performance by applying different pre-processing steps. It was concluded that such systems are a powerful communication tool between domain experts and machine learning experts, and that they can enable novice machine learning users to apply machine learning techniques.

Code available @ https://github.com/OFTechLabs/astor

Ivaylo


Written by

R&D Labs

We work and experiment with both new modelling approaches and IT techniques and concepts in order to research their applicability to investment decision making

