Implementing A Univariate Linear Regression Model In Java To Predict House Pricing

Andres Tse · Published in CodeX · Jun 29, 2022

This week’s post is going to be a little different from usual. Although we usually talk about cybersecurity here, today we will be delving into machine learning algorithms: specifically, how to implement a linear regression model that predicts the price of a house based on a single variable (its size, in units of 1,000 sqft). Shall we get started?

Introduction

There are two main methods that we can utilize to train models in machine learning: supervised learning and unsupervised learning.

In supervised learning, we feed properly labeled data to the algorithm. That is, we explicitly tell the algorithm what the “correct” answer is. For instance, if we are doing classification (a type of supervised learning), we inform the algorithm whether a picture (the input value, or feature) contains a car, a cat, or a lizard (the target). Based on that training set, the model is then able to predict whether a new picture contains a car, a cat, or a lizard.

Or, in a regression model (another type of supervised learning), we provide the input (say, the size of a house in units of 1,000 sqft) and its corresponding target value (the price of the house). Using that information, the algorithm can predict the price of a house outside the training set just by taking in an input (the size of the house).

The main difference between a regression model and a classification model is that a regression model predicts a continuous value, so its output can fall anywhere in an infinite range, while a classification model predicts one of a small, fixed set of categories.

I won’t delve too much into unsupervised learning, as it is outside the scope of this post, but with this type of machine learning the algorithm looks at the data provided and draws its conclusions (as opposed to us telling it what exactly we are looking for) based on the common patterns observed. Some examples of unsupervised learning include clustering, anomaly detection, and dimensionality reduction.

Getting Started

We will be implementing an algorithm that takes the size of a house (in units of 1,000 sqft) and generates its corresponding price, given the training set.

More specifically, we will be using a linear regression model. A linear regression model usually “draws” a line right in the middle of the data set and attempts to make predictions based on that line.

We are going to need to know a few equations to implement it:

Linear regression model equation: f(x) = w*x + b (also written f_w,b(x) = w*x + b)

This equation is what will draw the line in the middle of the data.

Cost function equation: J(w,b) = 1/(2m) * Σ_{i=1}^{m} (f(x^(i)) − y^(i))²

This equation calculates the cost of our linear regression model. That is, it tells us the disparity between our predicted values (ŷ, the values the model predicts) and our target values (the values we explicitly fed into the algorithm).
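To make this concrete: with the two-example training set we will define shortly (sizes 1.0 and 2.0, prices 300.0 and 500.0) and an initial guess of w = 0 and b = 0, every prediction is 0, so the cost is J = 1/(2*2) * ((0 − 300)² + (0 − 500)²) = 85,000. Gradient descent's job is to drive this number down.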

Gradient Descent Algorithm

This will allow us to minimize the cost J(w,b) by iteratively updating the values of w and b until the cost is at its minimum.

repeat until convergence (meaning until each further iteration has minimal impact on the values of w and b) {

w = w − α * 1/m * Σ_{i=1}^{m} (f(x^(i)) − y^(i)) * x^(i)

b = b − α * 1/m * Σ_{i=1}^{m} (f(x^(i)) − y^(i))

}

where α is the learning rate. Note that both updates use the current values of w and b on each iteration.

Implementation

I will be using Java as my programming language and BlueJ as my IDE. You can use Python or whatever language you are comfortable with. The syntax utilized will vary depending on the language used.

First, I will be creating a new class called PricePredictor.

Next, I will assign a few fields to the class PricePredictor as well as create a constructor.

x_train is an array that contains the input values: the size of each house, in units of 1,000 sqft.

y_train is an array that contains the target values for the price of a house.

A size of 1.0 (i.e., 1,000 sqft) will map to $300.0, while 2.0 will map to $500.0 (theoretically).

w and b are the parameters from f(x) = w*x + b.

m is a variable representing the number of rows in a training set. I assign the value of m to be x_train.length which will return the number of elements in that array (2).

Since we do not know the optimal values for what w and b should be in order to have a linear regression model that fits the data perfectly, we will initialize w and b to 0 as a starting point.

Note that this implementation is mostly demonstrative: in a real-world scenario, you would seldom hard-code your training set.
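The original post shows this step as an IDE screenshot; here is a minimal sketch of what the class and its constructor might look like based on the description above (the field names follow the article, but the exact layout is my assumption):

```java
public class PricePredictor {
    private double[] x_train; // input values: house sizes, in units of 1,000 sqft
    private double[] y_train; // target values: house prices
    private double w;         // weight parameter from f(x) = w*x + b
    private double b;         // bias parameter from f(x) = w*x + b
    private int m;            // number of rows (examples) in the training set

    public PricePredictor() {
        x_train = new double[] { 1.0, 2.0 };
        y_train = new double[] { 300.0, 500.0 };
        w = 0; // optimal value unknown, so start at 0
        b = 0; // optimal value unknown, so start at 0
        m = x_train.length; // 2
    }
}
```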

Next, I will be implementing a public method called CostFunction that returns a double and represents our cost function J(w,b) = 1/(2m) * Σ_{i=1}^{m} (f(x^(i)) − y^(i))².

First, I create a variable called cost and initialize it to 0.

Next, I run a for loop where i starts at 0 and runs while i < m (that is, m times).

I assign variable f_wb to w*x_train[i] + b, where i represents the iteration number.

f_wb is our prediction value for the feature x^(i) given that w and b are currently 0.

I then assign cost_i to (f_wb − y_train[i])².

cost_i represents the squared difference between the predicted value (f_wb, given x^(i) with w = 0 and b = 0) and the actual target value (y_train[i]).

Finally, for each iteration, I add cost_i to the variable cost.

After all iterations are complete and I exit the loop, I create a variable called total_cost and assign to it cost multiplied by 1/(2m), just like in the equation.

I then finalize this method by making it return the variable total_cost.
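Putting those steps together, the method might look like this (a sketch following the walkthrough; note the 1.0 literal, which avoids Java's integer division when computing 1/(2m)):

```java
// J(w,b) = 1/(2m) * Σ (f(x^(i)) - y^(i))²
public double CostFunction() {
    double cost = 0;
    for (int i = 0; i < m; i++) {
        double f_wb = w * x_train[i] + b;               // prediction ŷ for example i
        double cost_i = Math.pow(f_wb - y_train[i], 2); // squared error for example i
        cost += cost_i;                                 // accumulate over all examples
    }
    double total_cost = (1.0 / (2 * m)) * cost;         // scale by 1/(2m)
    return total_cost;
}
```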

Next, we will calculate the partial derivatives of cost function J with respect to both w and b. More specifically, we will be computing the values of the expressions in parentheses below. The method GradientValues will take the parameters w and b and will return an array of type double.

w = w − α * (1/m * Σ_{i=1}^{m} (f(x^(i)) − y^(i)) * x^(i))

b = b − α * (1/m * Σ_{i=1}^{m} (f(x^(i)) − y^(i)))

Initialize dj_dw and dj_db to 0.

Create a for loop just like in the previous method that will run m times.

f_wb = w*x_train[i] + b gets the predicted value, ŷ.

cost_i = f_wb − y_train[i] gets the disparity between ŷ and the target value y.

dj_dw_i gives us the derivative term for cost function J with respect to parameter w specifically (as the disparity is multiplied by x_train[i]), while cost_i itself serves as the derivative term for cost function J with respect to parameter b.

We can then add these values to dj_dw and dj_db respectively.

Once we are out of the loop, we multiply each sum by (1/m), just like in the equation.

I will then create an array gv of type double, store dj_dw and dj_db in it, and return gv so that I can use these values later on.
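Here is one way GradientValues might look, following the steps above:

```java
// Returns {dj_dw, dj_db}: the partial derivatives of J with respect to w and b
public double[] GradientValues(double w, double b) {
    double dj_dw = 0;
    double dj_db = 0;
    for (int i = 0; i < m; i++) {
        double f_wb = w * x_train[i] + b;     // prediction ŷ for example i
        double cost_i = f_wb - y_train[i];    // disparity between ŷ and y
        double dj_dw_i = cost_i * x_train[i]; // contribution to ∂J/∂w
        dj_dw += dj_dw_i;
        dj_db += cost_i;                      // contribution to ∂J/∂b
    }
    dj_dw *= 1.0 / m;                         // scale both sums by 1/m
    dj_db *= 1.0 / m;
    double[] gv = { dj_dw, dj_db };
    return gv;
}
```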

Next, the method GradientFinal takes the parameters int inters (the number of iterations) and double alpha (the learning rate) and returns an array of type double.

We will start with a for loop that will iterate inters times. Inside the loop, we will call our previously constructed method GradientValues to get the derivatives of cost function J with respect to w and b, and assign the returned array to gv.

We will now update the values of w and b, just like in the equation.

w = w − (alpha * gv[0]), where gv[0] is the derivative of cost function J with respect to w

b = b − (alpha * gv[1]), where gv[1] is the derivative of cost function J with respect to b

After the loop has executed inters times, we store these newly updated values of w and b in a new array called gf and return it.
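A sketch of GradientFinal under the same assumptions:

```java
// Runs gradient descent for `inters` iterations with learning rate `alpha`
public double[] GradientFinal(int inters, double alpha) {
    for (int i = 0; i < inters; i++) {
        double[] gv = GradientValues(w, b); // current gradients {∂J/∂w, ∂J/∂b}
        w = w - alpha * gv[0];              // update w
        b = b - alpha * gv[1];              // update b
    }
    double[] gf = { w, b };                 // optimized parameters
    return gf;
}
```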

So, we have done all this work, but we aren't yet sure whether everything is correct. Let's test it out by making a new class called tester.

In this class, I create a void method called testgf:

I will create a new instance of PricePredictor and call the method GradientFinal with 10000 iterations and a learning rate (alpha) of 1.0e-2, i.e., 0.01.

Then, I will iterate through each of the values in the array gf to get the optimized values of w and b for our linear regression model and print them out.
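The tester class might look like this:

```java
public class tester {
    public void testgf() {
        PricePredictor pp = new PricePredictor();
        double[] gf = pp.GradientFinal(10000, 1.0e-2); // 10000 iterations, alpha = 0.01
        for (double value : gf) {
            System.out.println(value); // prints optimized w, then optimized b
        }
    }
}
```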

As you can see, the optimized value for w is roughly 199.992, and roughly 100.0011 for b, after 10000 iterations. Let's plug these values into our equation to see if we get our target outputs.

f(x) = w*x + b
f(1.0) = (199.992 * 1.0) + 100.0011 ≈ 300
f(2.0) = (199.992 * 2.0) + 100.0011 ≈ 500 (basically 500 if we round it)

Congratulations! We have now found the values of w and b that make our linear regression model fit the data very well.

Finally, let’s create a method that will run as a mini-program and put all these concepts together:

In this program, we will ask the user for the size of the house (in units of 1,000 sqft), call GradientFinal to calculate the optimal values for w and b, plug those values in, predict the price of the house, and print it out. If the user inputs something other than a number, we will show an error message and let the user try again.
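One possible shape for this mini-program, written as a standalone main method (the class name, prompt text, and output formatting are my own placeholders; the EXIT keyword, the retry-on-bad-input behavior, and the farewell message follow the description in the post):

```java
import java.util.Scanner;

public class HousePriceProgram {
    public static void main(String[] args) {
        PricePredictor pp = new PricePredictor();
        double[] gf = pp.GradientFinal(10000, 1.0e-2); // optimized {w, b}
        Scanner in = new Scanner(System.in);
        while (true) {
            System.out.print("Enter the size of the house (per 1,000 sqft), or EXIT to quit: ");
            String input = in.nextLine().trim();
            if (input.equalsIgnoreCase("EXIT")) {
                System.out.println("See you next time");
                break;
            }
            try {
                double size = Double.parseDouble(input);
                double price = gf[0] * size + gf[1]; // f(x) = w*x + b
                System.out.printf("Predicted price: $%.2f%n", price);
            } catch (NumberFormatException e) {
                System.out.println("That doesn't look like a number. Please try again.");
            }
        }
        in.close();
    }
}
```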
Let’s run it and see what it looks like:

I will input 1.2 as my size, followed by 1 and 2 (which were the input values in our training set).

As you can see, we get a predicted price of $340. We can trust this because when we input 1 or 2, the predicted values align with our target values of $300 and $500.

Let’s try using some non-digit characters:

As you can see, the prompt just restarts as planned. Now let’s exit the program by typing in ‘EXIT’.

As you can see, the program has stopped running and we get a “See you next time” message.

Congratulations! You have just now implemented a univariate linear regression model using Java to predict house prices based on size.

If you want a model that is more predictive and applicable, you'd have to modify the code to train on a set of real data and to add more variables (number of bathrooms, flooring, etc.) that contribute to the final price beyond the size of the house. This tutorial only showcases a rough outline of how you'd go about implementing it.

A big shout-out to DeepLearning.AI and Stanford for creating the Machine Learning Specialization. This tutorial is heavily inspired by it (although the original course utilizes Python instead of Java).

Want to learn more? Visit:

https://www.coursera.org/specializations/machine-learning-introduction
