DATA STORIES | PREDICTIVE ANALYTICS | KNIME ANALYTICS PLATFORM

House Price Prediction using KNIME

A low-code solution with just a bunch of nodes

Wisnu Purnomo
Low Code for Data Science

House price prediction is useful to anyone who works with the property business. It also helps home buyers and sellers, real estate investors, property developers, banks, and the government. Given these benefits, I will carry out a house price prediction analysis using the free and open-source KNIME Analytics Platform.

Business Understanding

The output of this analysis is a predicted house price based on several independent variables, such as house area, number of bedrooms, and number of bathrooms, along with other supporting variables. Different parties can use it for their own purposes, for example to estimate house prices in region X, to inform housing policy, or as a basis for market analysis.

Dataset

This analysis uses a dataset from Kaggle, available at the following link: https://www.kaggle.com/datasets/muhammadbinimran/housing-price-prediction-data

Data Understanding

The first step is to import the downloaded dataset into a KNIME workflow using the CSV Reader node.

After importing, I view the dataset using the Table View node, which looks roughly like the following:

It can be seen from the table above that the data consists of 50,000 rows and 6 columns. In the table view, I can also see the data type of each variable. The explanation of each variable is as follows:

  1. SquareFeet: House area
  2. Bedrooms: Number of bedrooms
  3. Bathrooms: Number of bathrooms
  4. Neighborhood: House location
  5. YearBuilt: The year the house was built
  6. Price: House price

Next, I compute descriptive statistics for the data using the Statistics node:

Several figures stand out in the descriptive statistics above. The smallest house is 1,000 square feet and the largest is 2,999 square feet, with an average of 2,006 square feet. The average price is $224,827. The averages for bedrooms, bathrooms, and year of construction are 3.4, 1.9, and 1985, respectively.
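For readers who want to see what the Statistics node reports in code form, here is a minimal sketch using only the Python standard library. The sample rows and the `describe` helper are hypothetical and exist only for illustration; they are not the real dataset or a KNIME API.

```python
import statistics

# Hypothetical sample rows for illustration; not taken from the real dataset.
sample = [
    {"SquareFeet": 1000, "Price": 120000.0},
    {"SquareFeet": 2000, "Price": 225000.0},
    {"SquareFeet": 2999, "Price": 330000.0},
]


def describe(rows, column):
    """Return min, max, mean, and the number of missing values for a column."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "missing": len(rows) - len(values),
    }


summary = describe(sample, "SquareFeet")
print(summary)
```

The `missing` count plays the same role as the missing-value column in the Statistics node output: a value of 0 means no cleanup is needed for that column.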

The statistics also show that there are no missing values in the data, so I don't need to handle missing values. Additionally, I can see the 20 most and least frequent values of each variable by clicking the Top/bottom tab in the top left corner. The top/bottom output is as follows:

I can use the top/bottom output to see which values occur frequently and which occur rarely. Knowing how often each value occurs makes any frequency-based analysis easier.
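The same kind of frequency listing can be sketched in Python with `collections.Counter`. The neighborhood labels below are hypothetical values chosen for illustration, and the cutoff of 20 simply mirrors the description above.

```python
from collections import Counter

# Hypothetical Neighborhood values; the labels are illustrative only.
neighborhoods = ["Urban", "Suburban", "Rural", "Urban", "Urban", "Suburban"]

counts = Counter(neighborhoods)
ranked = counts.most_common()  # sorted from most to least frequent

top = ranked[:20]      # the most frequent values (at most 20)
bottom = ranked[-20:]  # the least frequent values (at most 20)

print("Most frequent:", top[0])
print("Least frequent:", bottom[-1])
```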

Data Preparation

The first step I will take in this section is to see if there are any outliers in my dataset. To identify outliers, I will first create a box plot for each variable using the Box Plot node.

The box plot above shows that the Price variable contains outliers, that is, values that lie far from the bulk of the data. I remove them using the Numeric Outliers node.

This node removes outlier values from my dataset. In the workflow, it looks roughly like this:

After the outliers have been successfully removed, I will check again whether there are still outliers or not in the Price variable by using a box plot.

The visualization above shows that the outliers in the Price variable have been removed. A total of 59 rows were dropped, so the dataset now has 49,941 rows.
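To make the idea behind outlier removal concrete, here is a minimal sketch of the common interquartile range (IQR) rule, the same rule a box plot uses to draw its whiskers. The multiplier `k = 1.5` and the quartile method are assumptions on my part, the `remove_outliers_iqr` helper is hypothetical, and the prices are made-up values; the exact settings used in the Numeric Outliers node may differ.

```python
import statistics


def remove_outliers_iqr(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


# Made-up prices with one obviously extreme value.
prices = [150000, 180000, 200000, 220000, 240000, 260000, 900000]

kept = remove_outliers_iqr(prices)
removed = len(prices) - len(kept)
print(f"Removed {removed} outlier(s), {len(kept)} rows remain")
```

Counting the removed rows this way is the same check as comparing the row count before and after the node.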

Then the data is ready to be taken to the next step, namely modeling.

Data Modeling

Data modeling is the step where the data is analyzed further using models and algorithms to gain insights. In this analysis I will use a multiple linear regression model with Price as the dependent variable and SquareFeet, Bedrooms, Bathrooms, and Neighborhood as the independent variables.

Before entering the model, I will first divide the dataset into two, namely training data and test data using the Partitioning node.

I will split the dataset in the proportion of 80:20. This means that 80% of the data will be training data and 20% of the data will be test data.
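As a rough illustration of an 80:20 split, the sketch below shuffles row indices and cuts them at 80%. It assumes rows are drawn at random with a fixed seed, which is only one of the sampling modes a partitioning step can use; the `partition` helper and the seed value are hypothetical.

```python
import random


def partition(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and split it into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the 49,941 rows left after outlier removal.
rows = list(range(49941))
train, test = partition(rows)
print(len(train), len(test))
```

Every row ends up in exactly one of the two partitions, so the two sizes always add up to the original row count.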

I can see the amount of training data and test data in the node monitor section. Examples are as follows:

In the node monitor I see two outputs from the Partitioning node: the first partition and the second partition. The first partition is the training data, and the second partition is the test data. After the data is divided, the next step is to model the data using the Linear Regression Learner and Regression Predictor nodes.

Using these two nodes, I can perform linear regression analysis in KNIME. The method is to connect the first partition output from the Partitioning node to the Linear Regression Learner node as shown in the image below:

Once the two nodes above are connected, I configure the Linear Regression Learner node as follows:

At the top I select Price as the dependent variable. For the independent variables I use SquareFeet, Bedrooms, Bathrooms, and Neighborhood. I then execute the Linear Regression Learner node to fit the model. When it finishes, the linear regression results look like this:

From the output above, I write the regression equation as follows:

Price = 2,602 + 99.1 SquareFeet + 4,967 Bedrooms + 2,877 Bathrooms - 745.7 Suburban + 1,395 Urban

The interpretation of the above equation is:

  1. Value 2,602 = This is the intercept, the predicted price when every independent variable equals 0 and the house is in the reference neighborhood category (the one without its own coefficient). In other words, a house with none of the features in the model would be priced at $2,602.
  2. 99.1 SquareFeet = When the house area increases by 1 square foot, the house price increases by $99.1, with other variables held constant.
  3. 4,967 Bedrooms = When the number of bedrooms increases by 1 room, the house price increases by $4,967, with other variables held constant.
  4. 2,877 Bathrooms = When the number of bathrooms increases by 1 room, the house price increases by $2,877, with other variables held constant.
  5. -745.7 Suburban = If the house is in a suburban area, the house price is $745.7 lower than in the reference neighborhood category, with other variables held constant.
  6. 1,395 Urban = If the house is in an urban area, the house price is $1,395 higher than in the reference neighborhood category, with other variables held constant.
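To see how the coefficients combine, the sketch below plugs a hypothetical house into the regression equation from this section. The coefficients are copied from the equation; the `predict_price` helper, the example house, and the "Rural" label used for the reference category are assumptions made for illustration.

```python
# Coefficients copied from the regression equation in this section.
INTERCEPT = 2602.0
COEFFICIENTS = {"SquareFeet": 99.1, "Bedrooms": 4967.0, "Bathrooms": 2877.0}
# Dummy-variable effects; any other neighborhood is the reference category (effect 0).
NEIGHBORHOOD_EFFECT = {"Suburban": -745.7, "Urban": 1395.0}


def predict_price(square_feet, bedrooms, bathrooms, neighborhood):
    """Evaluate the fitted linear equation for a single house."""
    price = (
        INTERCEPT
        + COEFFICIENTS["SquareFeet"] * square_feet
        + COEFFICIENTS["Bedrooms"] * bedrooms
        + COEFFICIENTS["Bathrooms"] * bathrooms
    )
    return price + NEIGHBORHOOD_EFFECT.get(neighborhood, 0.0)


# Hypothetical house: 2,000 square feet, 3 bedrooms, 2 bathrooms, urban area.
urban_price = predict_price(2000, 3, 2, "Urban")
# "Rural" is an assumed label standing in for the reference category.
reference_price = predict_price(2000, 3, 2, "Rural")
print(f"Urban: ${urban_price:,.0f}, reference category: ${reference_price:,.0f}")
```

Worked by hand, the urban house comes to 2,602 + 198,200 + 14,901 + 5,754 + 1,395 = $222,852, which is exactly $1,395 more than the same house in the reference category.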

The regression output table above also shows whether each variable is statistically significant, as well as the R² of the model.

Now that I have the linear regression output, the next step is to pass the model to the Regression Predictor node to make predictions on the test data. The workflow looks roughly as follows:

After the data is entered into the Regression Predictor node, the next step is to see how the predictions are made. To see the predictions that have been made, I can use the Table View node as follows:

The table contains a column called Prediction (Price), which stores the predicted values. Here I can see how much a house would cost for different values of the independent variables. With that, the house price prediction itself is complete, and the next step is Model Evaluation.

Model Evaluation

I have to evaluate the prediction results above to find out whether the model is good or not. To carry out the evaluation, I will use the Numeric Scorer node. Here are the results of my model evaluation:

From the evaluation results table above, I will use two metrics: R² and Mean Absolute Error (MAE). The R² value is 0.56, which means that 56% of the variance in the dependent variable (Price) is explained by the independent variables. In my opinion, an R² of 56% is good enough to predict house prices. The MAE of 39.4 is the average absolute difference between the predicted and actual prices. Because Price is on a large scale, in the hundreds of thousands of dollars, I consider an MAE of 39.4 quite good.
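For reference, both metrics can be computed by hand from the actual and predicted prices. The sketch below uses the standard definitions of R² and MAE; the two helper functions and the four price pairs are made up for illustration and are not results from this analysis.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 minus the ratio of residual to total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up actual and predicted prices, for illustration only.
actual = [200000.0, 250000.0, 300000.0, 150000.0]
predicted = [210000.0, 240000.0, 280000.0, 170000.0]

mae = mean_absolute_error(actual, predicted)
r2 = r_squared(actual, predicted)
print(f"MAE = {mae:,.0f}, R2 = {r2:.2f}")
```

MAE is expressed in the same units as Price, so it is best judged against the typical price level, while R² is unitless.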

Closing

That’s the end of my article about House Price Prediction Using KNIME. I hope this analysis can help readers a lot in carrying out the same analysis, or can be modified as needed. I would like to thank the readers who have taken the time to read my article.

I’ll provide an overview of the workflow of this analysis in case readers want to see the results of my work:

Hello, I'm Wisnu Purnomo. I like analyzing data, especially data about the environment and regional economy. I also like creating data visualizations.