Predict House Price in King County with Azure Machine Learning and Power BI (Part 1)

Dec 14, 2021

King County is the most populous county in the state of Washington, United States. Its large population attracts property sellers to the area. In this Microsoft X Studi Independen Kampus Merdeka capstone project, I take the case of MariBisnis, a company that wants to predict house prices in King County.

To solve this problem, Microsoft provides low-code/no-code resources for making predictions, along with applications for visualizing the results.

Purpose

Train a machine learning model on existing data to predict house prices.
Use the existing data to show how home sales trend over time, as input for home sales business strategy.
Visualize the existing data so that it can be understood at the managerial level.

Benefits

Visual analysis summarizes the current state of the property market at a glance.
The analyzed data can then be used for future decision-making.

Methodology

This is the process, from understanding the data to visualizing it in Power BI.

Understand the Data

Data description

Resources

1. Azure Machine Learning

Azure Machine Learning is a Microsoft platform for doing machine learning with little or no code. Users can use the available modules or customize their own. In this project, Azure Machine Learning is used for regression analysis, prediction, and unsupervised learning. The settings used are as follows:

Compute Instances

• Virtual Machine type: CPU

• Virtual Machine size: Standard_DS11_v2

Compute Clusters

• Location: West US

• Virtual Machine priority: Dedicated

• Virtual Machine type: CPU

• Virtual Machine size: Standard_DS11_v2

• Minimum number of nodes: 0

• Maximum number of nodes: 2

• Idle seconds before scale down: 120

• Enable SSH access: Unselected
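The compute cluster settings above can also be written down as a small, validated configuration object. This is only a sketch: the class name ClusterSettings is hypothetical, the region short name "westus" is an assumption for "West US", and the field names are chosen to resemble the keyword arguments of the v1 Python SDK's AmlCompute.provisioning_configuration, which should be checked against the SDK version in use.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ClusterSettings:
    """Compute cluster settings copied from the list above (hypothetical class)."""
    location: str = "westus"              # "West US"; short name is an assumption
    vm_priority: str = "dedicated"        # Virtual Machine priority
    vm_size: str = "Standard_DS11_v2"     # Virtual Machine size
    min_nodes: int = 0                    # cluster may scale down to zero
    max_nodes: int = 2
    idle_seconds_before_scaledown: int = 120

    def __post_init__(self):
        # Reject node counts that cannot describe a valid autoscale range.
        if not 0 <= self.min_nodes <= self.max_nodes:
            raise ValueError("need 0 <= min_nodes <= max_nodes")


settings = ClusterSettings()
provisioning_kwargs = asdict(settings)    # plain dict of the settings
print(provisioning_kwargs)
```

Keeping the settings in one object makes it easy to review them or reuse them when recreating the cluster.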

2. Power BI

Power BI is a platform for analyzing and visualizing data. For the data to be easily understood by others, such as managers, it needs to be processed and tidied up first. The version used in this project is Power BI Desktop 2.96.1061.0 64-bit.

Predict House Price (Regression Analysis)

You can watch a video demo through this link.

Flowchart Predict House Price

1. Preprocessing

Preprocessing starts with loading the dataset into Azure Machine Learning as a CSV file. The dataset used is MariBisnis.csv, which consists of 21613 rows and 21 columns. Next, select the columns to carry into the following steps: all columns except id, because id is only a unique identifier and has no correlation with the price of a house. Remove missing data by deleting every row that contains a missing value, so a row with no price, for example, is dropped entirely. Then keep only the best features. Of the 19 candidate features, 18 are correlated with price. The one uncorrelated feature is date, so it is omitted and 18 feature columns remain. Finally, select the columns to be normalized. Normalization uses the Z-score method, also called standard scaling.
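The column selection, missing-row removal, and correlation-based feature selection described here are done with designer modules in the project. As a rough illustration of the same logic, here is a minimal pure-Python sketch. The function names, the toy rows, and the 0.1 correlation threshold are all assumptions made for the example, not values from the project.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def preprocess(rows, target="price", drop=("id",), min_abs_corr=0.1):
    # 1. Drop identifier columns.
    rows = [{k: v for k, v in r.items() if k not in drop} for r in rows]
    # 2. Drop every row that has any missing value.
    rows = [r for r in rows if all(v is not None for v in r.values())]
    # 3. Keep features whose |correlation with the target| meets the threshold.
    y = [r[target] for r in rows]
    features = [k for k in rows[0] if k != target]
    kept = [f for f in features
            if abs(pearson([r[f] for r in rows], y)) >= min_abs_corr]
    return rows, kept


# Made-up toy rows; "noise" stands in for an uncorrelated column.
toy = [
    {"id": 1, "sqft_living": 1000, "noise": 2, "price": 200000},
    {"id": 2, "sqft_living": 2000, "noise": 7, "price": 410000},
    {"id": 3, "sqft_living": 1500, "noise": 1, "price": None},
    {"id": 4, "sqft_living": 3000, "noise": 7, "price": 590000},
    {"id": 5, "sqft_living": 4000, "noise": 2, "price": 800000},
]
clean_rows, kept_features = preprocess(toy)
print(len(clean_rows), kept_features)
```

In this toy data the row with the missing price is removed and only the column that moves with price survives the correlation filter.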

The purpose of standard scaling is to put the feature values on a common scale. For example, sqft living takes values in the thousands, while bathroom is a single digit. Without scaling, the model tends to treat sqft living as more important than bathroom, even though both are features of equal standing. Standardized values can range from −∞ to ∞. Finally, split the data into training and validation sets at a ratio of 7:3.
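To make the scaling and the 7:3 split concrete, here is a small sketch in plain Python. The feature values are made up for illustration, the function names are hypothetical, and the use of the population standard deviation is an assumption; the Azure module may compute it differently.

```python
import random
import statistics


def z_score(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mean) / std for v in values]


def train_validation_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and split it by train_fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Illustrative values only: one feature in the thousands, one in single digits.
sqft_living = [1180, 2570, 770, 1960, 1680]
bathrooms = [1.0, 2.25, 1.0, 3.0, 2.0]

scaled_sqft = z_score(sqft_living)
scaled_bath = z_score(bathrooms)
print([round(v, 2) for v in scaled_sqft])
print([round(v, 2) for v in scaled_bath])

train, validation = train_validation_split(range(10))
print(len(train), len(validation))
```

After scaling, both features are centered on zero with unit spread, so neither dominates merely because of its units.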

2. Modeling

Add the models to be trained. Here I used all the regression models available in Microsoft Azure Machine Learning and trained each with “price” as the target. To get the best results, a model's parameters need to be tuned to find the values that perform best in training. There are two approaches to modeling: training without hyperparameter tuning or with hyperparameter tuning. I do both, because a model without hyperparameter tuning could turn out to have better results.

Azure Machine Learning has a Tune Model Hyperparameters module that can be used to find the best parameters for a model. After tuning, the module ranks the parameter combinations according to the chosen metric. Here I use the Mean Absolute Error (MAE) as the metric for determining which parameters are best.
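The idea behind tuning against MAE can be sketched in a few lines: try each candidate parameter value, score it on validation data, and keep the one with the lowest MAE. The example below is not the Tune Model Hyperparameters module itself. It swaps in a tiny k-nearest-neighbors regressor as a stand-in model, and the data points and candidate values are made up.

```python
def mae(actual, predicted):
    """Mean Absolute Error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def tune(train, validation, candidates):
    """Return (best_k, scores); scores maps each k to its validation MAE."""
    actual = [y for _, y in validation]
    scores = {}
    for k in candidates:
        predicted = [knn_predict(train, x, k) for x, _ in validation]
        scores[k] = mae(actual, predicted)
    best_k = min(scores, key=scores.get)  # lowest MAE wins
    return best_k, scores


# Made-up (feature, price) pairs for illustration.
train = [(1, 110), (2, 190), (3, 320), (4, 390), (5, 510), (6, 600)]
validation = [(2.5, 250), (4.5, 450)]

best_k, scores = tune(train, validation, candidates=[1, 3, 5])
print(best_k, scores)
```

The same loop applies to any model and any parameter grid; only the scoring metric decides the ranking.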

3. Evaluation

The model with the best score is the boosted decision tree regression with hyperparameter tuning. The metric I focus on is the Mean Absolute Error (MAE). The error metrics most commonly used for regression are MAE and the Root Mean Squared Error (RMSE). MAE averages the absolute errors, so every error counts in proportion to its size, which suits cases where an occasional large prediction error has no outsized consequences. RMSE squares the errors before averaging, so it weights large errors heavily, which suits cases where large errors matter a great deal. The drawback of RMSE is that it is difficult to find out the biggest cause of the error, so I use MAE.
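The difference between the two metrics is easiest to see on a small worked example. In the sketch below, with made-up prices, two sets of predictions have the same MAE, but the set with one large miss has a higher RMSE.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error: mean of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Squared Error: sqrt of the mean of (actual - predicted)^2."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))


actual = [300_000, 450_000, 520_000, 610_000]
small_errors = [310_000, 440_000, 530_000, 600_000]    # each off by 10,000
one_big_error = [300_000, 450_000, 520_000, 650_000]   # one off by 40,000

print(mae(actual, small_errors), rmse(actual, small_errors))
print(mae(actual, one_big_error), rmse(actual, one_big_error))
```

Both prediction sets have an MAE of 10,000, while the RMSE rises from 10,000 to 20,000 for the set with the single large miss.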

Training Pipeline

Real-Time Inference Pipeline

The real-time inference pipeline needs a few adjustments. Instead of the MariBisnis dataset, the input is now test data entered manually. The Evaluate Model module is also removed. Select the columns to be used from the input, leaving out the unused id column. Add the Python script used to generate the prediction result. Finally, deploy the model by creating a real-time endpoint.
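Once a real-time endpoint exists, it can be called over HTTP. The sketch below only builds a request and does not send it. The URL, the key, the input name "WebServiceInput0", the payload layout, and the feature names are all assumptions for illustration; the actual values should be copied from the endpoint's consumption details in Azure Machine Learning studio.

```python
import json
import urllib.request


def build_scoring_request(url, api_key, records):
    """Build (but do not send) a POST request for a real-time endpoint."""
    # Assumed payload layout; verify it against the deployed endpoint.
    payload = {"Inputs": {"WebServiceInput0": records}, "GlobalParameters": {}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Hypothetical placeholders, not real values.
request = build_scoring_request(
    url="https://example.invalid/score",
    api_key="<your-endpoint-key>",
    records=[{"bedrooms": 3, "bathrooms": 2.0, "sqft_living": 1800}],
)
body = json.loads(request.data)
print(request.get_method(), body["Inputs"]["WebServiceInput0"][0])
# To actually send it: urllib.request.urlopen(request).read()
```

Note that the record leaves out id, matching the column selection in the inference pipeline.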

Test to Power BI

Test the model using Power BI. Connect to Azure Machine Learning, then select one of the models to use. Here I choose the model for price predictions. The result appears as in the following image.

Looking at the averages, the prediction results are not very good. Looking at the totals, however, the predictions are fairly close. For example, in April actual house prices were quite high, but the predicted prices were quite low. This is an example of prediction error.