Predict House Price in King County with Azure Machine Learning and Power BI (Part 2)

Dec 14, 2021

In Part 1, I explained regression analysis. In Part 2, I will explain clustering.

You can read Part 1 through this link:

Predict House Price in King County with Azure Machine Learning and Power BI (Part 1)

King County is the most populous county in the state of Washington, United States. Its large population attracts property sellers to the area. In this Microsoft X Studi Independen Kampus Merdeka capstone project, I take the case of MariBisnis, a company that wants to predict house prices in King County.

To solve this problem, Microsoft provides low-code/no-code resources for making predictions, along with applications for visualizing the results.

Purpose

Train a machine learning model on the existing data to predict house prices.
Use the existing data to show the trend of home sales over time, as input for home sales business strategies.
Visualize the existing data so that it can be understood at the managerial level.

Benefits

A visual presentation of the analysis summarizes the current state of the property market.
The analyzed data can then be used for future decision-making.

Create House Cluster

You can watch the video demo through this link.

Flowchart Create House Cluster

1. Preprocessing

The purpose of grouping houses into types is to make the data easier to visualize, make markets easier to target, and make insights easier to convey. Preprocessing for clustering takes the following steps. First, select the columns to use: all columns except id, date, longitude, and latitude. Next, clean up missing data: if a row has a null value in any column, the entire row is deleted. Finally, normalize the data using Z-score normalization, also called standard scaling.

Divide the data into training data and validation data with a ratio of 7:3.
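In the project these steps are done with drag-and-drop modules, so no code is needed. For readers who want to see what the modules do, here is a minimal sketch using only the Python standard library. The sample rows, the column names `price` and `sqft_living`, and all function names are hypothetical, and the use of the population standard deviation for the Z-score is an assumption rather than a statement about how Azure Machine Learning computes it.

```python
import random
import statistics

# Columns left out of clustering, as described above.
EXCLUDED = {"id", "date", "longitude", "latitude"}

# Hypothetical sample rows; the real dataset has more columns and rows.
rows = [
    {"id": i, "date": "2014-10-13", "price": 200000.0 + 25000.0 * i,
     "sqft_living": 1000.0 + 100.0 * i, "latitude": 47.5, "longitude": -122.2}
    for i in range(11)
]
rows[4]["sqft_living"] = None  # simulate a missing value


def select_columns(rows, excluded):
    """Keep every column except the excluded ones."""
    return [{k: v for k, v in row.items() if k not in excluded} for row in rows]


def drop_missing(rows):
    """Delete the whole row if any of its columns is null."""
    return [row for row in rows if all(v is not None for v in row.values())]


def zscore(rows):
    """Scale each column to mean 0 and standard deviation 1."""
    columns = list(rows[0])
    stats = {}
    for col in columns:
        values = [row[col] for row in rows]
        # Assumption: population standard deviation.
        stats[col] = (statistics.fmean(values), statistics.pstdev(values))
    scaled = []
    for row in rows:
        scaled.append({
            # A constant column has zero deviation; map it to 0.0.
            col: (row[col] - stats[col][0]) / stats[col][1] if stats[col][1] else 0.0
            for col in columns
        })
    return scaled


def split(rows, train_fraction=0.7, seed=0):
    """Shuffle, then split into training and validation sets (7:3 by default)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


selected = select_columns(rows, EXCLUDED)
cleaned = drop_missing(selected)
scaled = zscore(cleaned)
train, validation = split(scaled)
```

With the 11 sample rows, one row is dropped for its missing value and the remaining 10 are split into 7 training rows and 3 validation rows.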

2. Modeling

The data is modeled using K-Means clustering with 5 centroids. In general, clustering uses iterative techniques to group the cases in a dataset into clusters that have similar characteristics. This grouping is useful for exploring the data, identifying anomalies in the data, and eventually making predictions. Clustering models can also help identify relationships in a dataset that might not be apparent from simple browsing or observation. For this reason, clustering is often used in the early phases of machine learning tasks to explore data and find unexpected correlations. When configuring a clustering model using the k-means method, you must specify a target number k that represents the desired number of centroids in the model. A centroid is the point that represents each cluster. The K-means algorithm assigns each data point to one of the clusters by minimizing the within-cluster sum of squares.
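To make the iteration concrete, here is a minimal sketch of the standard k-means loop (often called Lloyd's algorithm) in plain Python. It is an illustration only, not the code that Azure Machine Learning runs: the function names are hypothetical, it uses plain random initialization, and the toy data uses k = 2 instead of the 5 centroids used in this project.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(point, centroids[j]))


def kmeans(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # plain random initialization
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # nothing moved, so the loop has converged
            break
        centroids = new_centroids
    labels = [nearest(p, centroids) for p in points]
    # Within-cluster sum of squares: the quantity k-means tries to minimize.
    wcss = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
    return centroids, labels, wcss


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels, wcss = kmeans(points, k=2)
```

On this toy data the first three points end up in one cluster and the last three in the other. Each pass can only lower or keep the within-cluster sum of squares, which is why the loop settles.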

Parameters:

Create trainer mode: Single Parameter

Number of centroids: 5

Initialization: K-Means++

Random number seed: -

Metric: Euclidean

Iterations: 100

Assign label mode: Ignore label column
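The Initialization parameter above selects K-Means++. As a rough sketch of the general k-means++ seeding idea (not Azure Machine Learning's actual implementation), the first centroid is picked at random, and each following centroid is picked with probability proportional to its squared distance from the nearest centroid already chosen. The function names and sample points are hypothetical, and `seed` plays the role of the random number seed parameter.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_plus_plus(points, k, seed=None):
    """Pick k starting centroids that tend to be spread far apart."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]  # first centroid: uniform random pick
    while len(centroids) < k:
        # Weight of a point = squared distance to its nearest chosen centroid.
        # Points already chosen have weight 0 and are not picked again.
        # Assumes at least k distinct points; otherwise all weights become 0.
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


sample_points = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
initial = kmeans_plus_plus(sample_points, k=3, seed=42)
```

Starting from spread-out centroids usually means fewer iterations and a lower chance of a poor local minimum than purely random starting points.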

Train the model using all columns except the id, date, latitude, and longitude columns. Each row is assigned to a cluster.

3. Evaluation

Assign the data to the clusters of the trained model. The cluster with the most members is cluster 3, with 2,308 members.
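Assigning data to clusters and counting the members can be sketched as follows. This is a hedged illustration with hypothetical centroids and points, not the project's data: each point gets the index of its nearest centroid, and a counter tallies the cluster sizes.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_to_clusters(points, centroids):
    """Label every point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)),
            key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


# Hypothetical trained centroids and new points.
trained_centroids = [(0.0, 0.0), (10.0, 10.0)]
new_points = [(1.0, 1.0), (9.0, 9.0), (11.0, 10.0), (0.0, -1.0), (10.0, 12.0)]

assignments = assign_to_clusters(new_points, trained_centroids)
sizes = Counter(assignments)                        # members per cluster
largest_cluster, member_count = sizes.most_common(1)[0]
```

Here cluster 1 is the largest with 3 of the 5 points, in the same way that cluster 3 is the largest in the project's results.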

Training Pipeline

Real-Time Inference Pipeline

Enter the input data in the Enter Data Manually module. Delete the modules that are no longer needed, namely MariBisnis-preprocess, Select Columns in Dataset, and Evaluate Model. This endpoint will later be used to bring data into Power BI.

Test to Power BI

To use the model, connect Power BI to Azure Machine Learning, then select the clustering model. The cluster results will appear in Power BI. The results are shown as follows, with most of the data in cluster 3.