Introducing ASTOR: Suggesting real estate management policies with Machine Learning

Enabling asset managers of housing associations to create their own machine learning models to assist them with their decision making

Artificial Intelligence (AI) is increasingly used in many domains. We see the data used as input for such models becoming a precious commodity for our clients. With the right tools, data can be transformed into valuable insights that support decision-making processes. At Tech Labs we put significant effort into enhancing the accuracy of traditional econometric models while retaining their economic interpretability, and into exploring innovative ways to make use of the proprietary and public data available. In this article we show by example how Dutch housing associations can monetize their data with valuable insights on a strategic level. We built a prototype called ASTOR, which assists asset managers in creating policies for their real estate portfolio (e.g. should built objects be renovated, demolished or sold, and if so, what is the right timing?).

The data that is stored in housing associations’ systems contains a lot of information with regard to the real estate they own. Since this data is also labeled, it can be used with machine learning classification algorithms to create predictive models. However, machine learning is a field that requires extensive knowledge. Making machine learning approachable for novice users will enable domain experts to use their knowledge to improve models’ performance.

Background

Housing associations in the Netherlands are mainly focused on providing affordable housing. In order to keep providing this service, strategic investments need to be made with regard to housing complexes (i.e. multi-family apartment complex) that are owned by a housing association. The investment in a housing complex is represented by a policy. Policies can be assigned to complexes in order for them to create value. Types of policies are rent, improve, sell and demolish policies. Determining the policy for a particular complex determines how the complex should be treated.

Ortec Finance develops applications that allow housing associations to perform asset management in order to determine policies for complexes in their real estate portfolio. An asset manager analyses the characteristics of the different complexes of a housing association and produces a policy that is applied to the particular complex, in such a way that it contributes to the goals for the entire portfolio. Such goals can for instance be to achieve a target return for the organization, to achieve a sustainability target for the owned complexes, or a social target that aims to provide sufficient social housing, etc. Optimizing the asset management process is mandatory for Ortec Finance in order to be relevant in the market and to always provide clients with the best solutions.

Problem statement

In the current process of asset management, data for complexes is manually processed and analysed by asset managers. The data that has to be analysed is extensive and it is challenging to take into consideration all the relevant information while determining a policy for a complex.

It is possible to develop a system that can improve the asset management process with machine learning methods. However, machine learning requires extensive knowledge in order to create reliable and valid models, as well as a broad knowledge of the specific domain. Developing a system that has predefined methods for data pre-processing, machine learning and predicting new instances will enable the creation of models that can improve the asset management process by suggesting policies to be applied. In addition, an advising system that provides information and assistance regarding the various tasks involved in creating a machine learning model will enable the involvement of the asset managers and their knowledge of the data.

Methodology

In order to address the problem statement, tasks are differentiated and specified. A visualization of the process described below looks as follows:

Process workflow

A machine learning process will be defined that is able to suggest policies for complexes. The purpose of defining this process is to identify the functionality that the system needs to have. Subsequently, a system architecture is designed in order to produce a high-quality system. This will be followed by the implementation of data pre-processing methods, the implementation or integration of machine learning algorithms, the implementation of prediction functionality, and the implementation of methods to incorporate asset managers’ knowledge. The system will be evaluated, as will the performance of the machine learning models. The advising system will also be evaluated in order to find out whether novice machine learning users can create machine learning models with satisfactory performance. Finally, the predicted policies for complexes will be analysed to evaluate whether the system actually improves asset management.

Machine Learning

The following sections contain information about the data scraping, data pre-processing and machine learning techniques that were used with housing associations’ data.

Data Scraping

Several solutions that Ortec Finance provides to housing associations are web-based. These web applications have a client/server architecture in which the client and the server communicate via a REST API. Using the API, clients load and store information from and to the server, which itself communicates with a database. Since the client and the server exchange data in JSON format, which is easy to handle in many programming languages, it was decided to extract the data from the server API rather than directly from the database. In addition, the data from the server API is complete and ready to use, whereas extracting it directly from the database would require complex SQL (Structured Query Language) statements to join the data from the different tables. To scrape the data from the servers, several scripts were created in the Python programming language that simulate a login to the application and then make several API calls to download the data. This resulted in downloading 632 complexes and 16486 rental units.
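A minimal sketch of what such a scraping script could look like is given below. It assumes a session object with `post` and `get` methods, such as a `requests.Session`. The function name, the `/login`, `/complexes` and `/rental-units` endpoints and the credential field names are hypothetical placeholders, not the actual Ortec Finance API.

```python
def download_portfolio(session, base_url, username, password):
    """Log in once, then download complexes and rental units as JSON.

    `session` is any object with `post` and `get` methods whose responses
    expose `raise_for_status()` and `json()`, for example a
    `requests.Session`. All endpoint paths below are hypothetical.
    """
    # Simulate the login of the web application; the session keeps the cookies.
    login = session.post(
        f"{base_url}/login",
        json={"username": username, "password": password},
    )
    login.raise_for_status()

    data = {}
    for name in ("complexes", "rental-units"):
        # One API call per resource; the server answers with JSON.
        response = session.get(f"{base_url}/{name}")
        response.raise_for_status()
        data[name] = response.json()
    return data
```

With the `requests` library installed, the function could be called with `requests.Session()` as its first argument. Passing the session in keeps the sketch easy to try out with a stub object.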

Data pre-processing and Data analysis

Since the data used for machine learning is a realistic data set from a housing association, some specific data pre-processing and data analysis steps are omitted in this article. The main pre-processing steps were: identifying and removing corrupt data; changing the types of variables in the data set; transforming categorical variables into numerical ones, using a different type of transformation depending on the variable; merging the complexes data with the rental units data; removing outliers; and removing one variable from each pair of very highly correlated variables. This decreased the data set to 580 complexes. Some of the variables that complexes contain are rent, maximum demandable rent, market rent, tax value, rent preferred by the housing association, taxation value, rental class, social classification, energy class and the number of rental units in the complex.
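The last pre-processing step, removing one variable from each highly correlated pair, can be sketched with pandas as follows. The helper name, the 0.95 threshold and the toy values are assumptions made for illustration; the article does not state which threshold was used.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df, threshold=0.95):
    """Drop one column of every pair whose absolute correlation exceeds `threshold`."""
    corr = df.select_dtypes("number").corr().abs()
    # Keep only the upper triangle so that each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Toy data: tax_value and taxation_value move almost in lockstep.
complexes = pd.DataFrame(
    {
        "rent": [500, 620, 480, 700, 560],
        "tax_value": [100, 150, 120, 90, 200],
        "taxation_value": [101, 149, 121, 92, 198],
    }
)
reduced, dropped = drop_highly_correlated(complexes)
```

In this toy frame only the second column of the correlated pair is removed, so the information it carries is still available through the column that is kept.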

The main goals of the data analysis were to gain more knowledge about the data and to identify the relationships between the different variables. The first thing to be inspected was what needs to be predicted by the machine learning model under construction: policies. Based on the available realistic data set, it was decided to try to predict the sell policy of a complex, because the number of complexes assigned to that policy (132) was sufficient for machine learning, whereas the others had very few (below 20). It is reasonable to assume that if the sell policy can be predicted, other policies can be predicted as well once such data is available. Another significant conclusion from the data analysis was that the relationships between the different variables appear to be linear, which suggested that linear machine learning algorithms would perform better than non-linear ones. A relation can be observed in the following violin plot, which gives insight into the rent distribution:

Violin plot complexes’ distribution of the sell policy

The plot shows that the medians of the features are clearly separated between the two classes, which suggests that they are good features for classification. The following scatter plot also gives an indication of a linear relationship:

Scatter plot taxation value and tax value

Inspecting these plots leads to the conclusion that the data is largely linearly separable, but that there is a cluster of observations in which the sell policy and the not-sell policy overlap. This means that, taken separately, these features are not the best predictors for new instances, but a combination of both may produce good results.

The variables that were identified in the plots as having a linear relationship were also recognized by an algorithm. The algorithm selects the K variables that are the most powerful predictors according to the mutual information metric, which measures the mutual dependence between two variables. Since the algorithm selects the same variables that were identified during the visual inspection of the data, the conclusion that linear machine learning algorithms will produce better results is most likely correct.
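A selection of this kind can be sketched with scikit-learn's `SelectKBest` and `mutual_info_classif`. The synthetic data set and the value of K below are assumptions for illustration only, since the housing association data is not public.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for the complexes data set (580 rows, 12 variables).
X, y = make_classification(
    n_samples=580, n_features=12, n_informative=4, random_state=0
)

K = 5  # number of variables to keep; an arbitrary choice for this sketch
selector = SelectKBest(partial(mutual_info_classif, random_state=0), k=K)
X_best = selector.fit_transform(X, y)

# Column indices of the K variables with the highest mutual information with y.
best_columns = selector.get_support(indices=True)
```

The selected column indices can then be compared with the variables that stood out in the plots.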

Machine learning model

The available data was split into training data and testing data — 70% and 30% respectively. The algorithms that were used for creating a machine learning model are Logistic Regression, Support Vector Machine, Random Forest and XGBoost. The performance of all the algorithms was tested with further pre-processing algorithms in the following way:

  • Without any further preprocessing (1).
  • With optimal features (2) — selecting the optimal features using recursive feature elimination.
  • With optimal features, principal component analysis and normalized data (3).
  • With optimal features, principal component analysis and robust scaled data (4).
  • With optimal features, principal component analysis, normalized data and robust scaled data (5).
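As an illustration, option (4) could be assembled as a scikit-learn pipeline and scored with 10-fold cross-validation, roughly as sketched below. The synthetic data, the order of the steps, the stratified split, the number of selected features and the number of principal components are all assumptions. Logistic Regression is used here only because it ships with scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic, imbalanced stand-in for the 580 complexes (sell vs. not-sell).
X, y = make_classification(
    n_samples=580, n_features=12, n_informative=5,
    weights=[0.77, 0.23], random_state=0,
)

# 70% training data and 30% testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

pipeline = Pipeline([
    ("scale", RobustScaler()),  # robust scaling
    ("select", RFE(LogisticRegression(max_iter=1000), n_features_to_select=6)),
    ("pca", PCA(n_components=4)),  # principal component analysis
    ("model", LogisticRegression(max_iter=1000)),
])

# 10-fold cross-validation on the training data, scored with F1.
cv_f1 = cross_val_score(pipeline, X_train, y_train, cv=10, scoring="f1")

pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
```

Keeping the pre-processing steps inside the pipeline means they are refitted on every training fold, so no information from the validation fold leaks into feature selection or scaling.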

The following table contains the best-performing option for each algorithm.

Models’ metrics with 10 KFold CV

All the classifiers have similar performance. Although the difference is small, the model that shows the most acceptable trade-off between the False Negatives and the False Positives, indicated by the highest F1 score and the highest accuracy, is XGBoost. To further improve the predictions of the model, the ROC curve is used to determine a threshold that produces an even better trade-off between the False Negatives and the False Positives.

XGBoost — ROC curve, AUC score with 10 KFold CV
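One common way to derive a threshold from a ROC curve is to pick the point that maximizes the difference between the true positive rate and the false positive rate (Youden's J statistic). The sketch below uses that criterion on toy scores. The article does not state which criterion was used, so the criterion, the helper name and the data are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_curve


def pick_threshold(y_true, y_score):
    """Return the ROC threshold that maximizes TPR - FPR (Youden's J)."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    best = np.argmax(tpr - fpr)
    return float(thresholds[best])


# Toy labels (1 = sell) and predicted probabilities of the sell class.
y_true = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0.10, 0.30, 0.40, 0.55, 0.70, 0.90])

threshold = pick_threshold(y_true, y_score)
# A complex is predicted as "sell" when its probability reaches the threshold.
y_pred = (y_score >= threshold).astype(int)
```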

Model creation is further improved using greedy grid search to optimize the parameters of the XGBoost machine learning algorithm. The final model metrics are as follows:

XGBoost metrics with optimal features, tuned parameters and threshold of 46% (10 KFold CV)
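Parameter tuning of this kind can be sketched with scikit-learn's `GridSearchCV`. The sketch below runs a plain exhaustive grid search, with `GradientBoostingClassifier` standing in for XGBoost so that it only depends on scikit-learn. The parameter grid and the synthetic data are assumptions, not the values used for ASTOR.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data set.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical grid; every combination is evaluated with cross-validation.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="f1",
    cv=5,
)
search.fit(X, y)

best_params = search.best_params_  # best combination found in the grid
best_f1 = search.best_score_       # its mean cross-validated F1 score
```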

It can be concluded that the algorithm that most accurately makes predictions for the sell policy of a complex is XGBoost.

System Development

The following section contains information about the tools and techniques used in developing the system as well as a description of the system’s functionality.

Tools and Techniques

The client is developed as a Web Application. In order to make development easier, the Angular front end framework is used. It was developed by Google and provides a strong architecture that is component-based and makes developing dynamic applications for the Web easier. In order to further facilitate the development, a library that provides state management for Web applications is used. It is called NGXS. The library provides a way to create states for an application and access any information in the application from all the different components. It reduces the complexity of data management in the application and makes controlling the data easier.

The server is written in Python using the Flask framework. The main reason for picking Python for this project is the scikit-learn library. It is available for Python and provides machine learning algorithms, data pre-processing and data analysis tools out of the box. Using Python to develop the server makes that library available in the system, so some techniques do not need to be implemented; they only need to be integrated from the library. In addition, there is a Python library called NumPy, a powerful library for scientific computing that provides multi-dimensional data containers (arrays). Various tasks are easier to perform using that library. The Flask framework provides an easy way to create endpoints for the server API that can also be used out of the box.
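To give an impression of how little code a Flask endpoint needs, here is a minimal sketch of a prediction endpoint. The `/api/predict` route, the payload shape and the `predict_sell` placeholder are hypothetical and are not taken from the ASTOR code base.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_sell(complexes):
    """Placeholder for a trained model: flags every complex as 'not sell'."""
    return [False for _ in complexes]


@app.route("/api/predict", methods=["POST"])
def predict():
    # The client is expected to send a JSON list with one object per complex.
    complexes = request.get_json(silent=True)
    if not isinstance(complexes, list):
        return jsonify({"error": "expected a JSON list of complexes"}), 400
    return jsonify({"sell": predict_sell(complexes)})
```

In a real server the placeholder would be replaced by a call to a fitted scikit-learn model.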

System

The initial screen of the application is shown in the following figure. This part of the application handles uploading data. A user can upload data in CSV or JSON format and send it to the server. After a successful upload, the user is redirected to the main screen of the application.

Initial screen of the application
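On the server side, handling both upload formats could look roughly like the sketch below, which uses only the Python standard library. The function name and the choice to select the parser from the file extension are assumptions.

```python
import csv
import io
import json


def parse_upload(filename, content):
    """Parse uploaded text into a list of row dictionaries.

    The parser is chosen from the file extension: `.json` or `.csv`.
    """
    lowered = filename.lower()
    if lowered.endswith(".json"):
        rows = json.loads(content)
        if not isinstance(rows, list):
            raise ValueError("JSON upload must be a list of records")
        return rows
    if lowered.endswith(".csv"):
        # The first CSV line is used as the header; all values stay strings.
        return list(csv.DictReader(io.StringIO(content)))
    raise ValueError(f"unsupported file type: {filename}")
```

Note that CSV values arrive as strings, so numeric columns still need a type conversion, which matches the pre-processing step of changing variable types mentioned earlier.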

The main screen of the application looks as follows:

Main screen of the application

It includes various components, which are described below:

  • Data component: A table shows all the data that was uploaded to the server. It can be observed in the top left rectangle. All the columns and rows of the data can be seen in the table. The table can also be sorted, which helps when analysing the data.
  • Data information description component: In the top right rectangle, statistics for the different features can be found. These are the type of each feature, the number of values it has in the data, the number of unique values (only for categorical), the most frequently occurring value (only for categorical), the most common value’s frequency, the mean (only for numerical), the standard deviation (only for numerical), and the minimum, 25%, 50%, 75% and maximum value in the column (only for numerical).
  • Recommendations: In the bottom right rectangle, various recommendations appear that are generated based on data analysis. The recommendations are based on the machine learning algorithm used, XGBoost. They point the user to the right feature to use in the system, depending on the analysis of the data.
  • Machine learning model: In the bottom left rectangle, the performance metrics of the machine learning model can be observed. Before training, the data is split into training and testing data. The machine learning model is trained with the training data and it outputs performance metrics based on the testing data. The metrics include accuracy score, F1 score, precision score and recall score. Furthermore, the user can see the testing data and also the predictions the model made for it.
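The four metrics shown in the machine learning model component follow directly from the counts of true and false positives and negatives. The sketch below computes them in plain Python for binary labels, with 1 standing for "sell". The function name and the toy labels are made up for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("no predictions to evaluate")

    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy testing data: 3 complexes with the sell policy, 5 without.
metrics = classification_metrics(
    y_true=[1, 1, 1, 0, 0, 0, 0, 0],
    y_pred=[1, 1, 0, 0, 0, 0, 0, 1],
)
```

For these toy labels there are 2 true positives, 4 true negatives, 1 false positive and 1 false negative, which gives an accuracy of 0.75 and a precision, recall and F1 score of 2/3 each.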

At the top, the user can find a menu with the functionality of the system, which includes data source features, data pre-processing features, machine learning algorithms, as well as predict functionality.

Discussion

Based on the performance of the created models, it can be concluded that machine learning can be used in real estate asset management. Although the performance of the models is not high enough to completely substitute the work of asset managers, their predictions can be used as recommendations for further analysis of complexes.

To evaluate the system a feedback session was conducted. There were four participants in the session — two domain knowledge experts (asset managers), and two people with domain knowledge expertise, as well as machine learning and software development knowledge. The goal of the session was to create models with at least 86% accuracy. All of the participants were able to pre-process the data and also create a model. What was interesting to see was that the people without machine learning knowledge started to understand the concept of machine learning, and by the end of the feedback session they were trying to improve the performance of the model by applying different pre-processing features. It was concluded that such systems are a powerful communication tool between domain knowledge experts and machine learning experts, and that such systems can enable novice machine learning users to apply machine learning techniques.

Code available @ https://github.com/OFTechLabs/astor

Ivaylo