Linear Regression modelling and prediction using GraphLab

karthic Rao
Jan 19, 2016


I started out with the objective of kick-starting my machine learning journey with a simple problem: predicting house prices with a linear regression model, trained on a data set of house sales with various features.

GraphLab Create is a framework for machine learning and data science applications by Dato, and it made it easy to kick-start my machine learning journey. The Dato Launcher (install it after obtaining a non-commercial key from their website) starts an IPython notebook in a browser window on Mac, so I could begin without any hassle. So, let’s begin!

First, start the IPython notebook from the Dato Launcher and import GraphLab.

Then load the sales data.

loading the sales data

Here I’m using GraphLab Canvas to visualize the data as a scatter plot. Below is the scatter plot of the square foot size of the house against its price.

scatter plot - sqft_of_house vs price

Then split the data set into a training set and a test set. I had earlier published a post on the importance of this split.
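To make the split concrete, here is a minimal plain-Python sketch that uses only the standard library. In GraphLab the SFrame takes care of this for you; the helper name `random_split`, the 80/20 fraction, the seed and the toy rows are all my assumptions for illustration, not values from the notebook. Note that this version shuffles and cuts, which is one simple way to split and not necessarily how GraphLab does it internally.

```python
import random

def random_split(rows, fraction=0.8, seed=0):
    """Shuffle rows reproducibly and cut them into (train, test)."""
    shuffled = list(rows)                   # copy so the input stays untouched
    random.Random(seed).shuffle(shuffled)   # seeded shuffle for repeatability
    cut = int(len(shuffled) * fraction)     # number of rows kept for training
    return shuffled[:cut], shuffled[cut:]

# Toy rows (made up): (sqft_living, price)
sales = [(1000 + 100 * i, 200000 + 25000 * i) for i in range(10)]
train_data, test_data = random_split(sales, fraction=0.8, seed=0)
print(len(train_data), len(test_data))
```

Fixing the seed means the same rows land in the test set on every run, so evaluation numbers stay comparable between experiments.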

Then we have to come up with a model that predicts the house price from the training data. In the first shot I’ll use only the square foot size of the house as the feature influencing the price.

linear regression model to predict house price with square feet size of the house as the feature

The method picks the algorithms best suited to derive the model. The implementation of those machine learning algorithms is abstracted away from you; to learn more about them, you have to dig into the theory of linear regression.
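To peek behind that abstraction, here is a small sketch of the textbook closed-form solution for regression with a single feature. I’m not claiming this is what GraphLab runs internally; the function name `fit_simple_regression` and the toy numbers are made up for illustration.

```python
def fit_simple_regression(xs, ys):
    """Return (intercept, slope) minimising the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y, and variation of x, around their means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up training data lying exactly on price = 50000 + 200 * sqft
sqft = [1000.0, 1500.0, 2000.0, 2500.0]
price = [250000.0, 350000.0, 450000.0, 550000.0]
intercept, slope = fit_simple_regression(sqft, price)
print(intercept, slope)
```

Because the toy points sit exactly on a line, the fit recovers that line; real sales data is noisy, so the fitted line only approximates it.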

Then using the test data we’ll evaluate the model for its accuracy in prediction.

evaluating model accuracy on the test data

A high max_error suggests an outlier, a data point that stands out from the rest of the data set, and the rmse (root mean squared error) seems fairly high too.
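For reference, this is roughly how the two numbers are computed. It is a sketch with made-up prices and predictions, and the helper name `evaluate` is mine; it only mirrors the two metric names mentioned above.

```python
import math

def evaluate(actual, predicted):
    """Return the largest absolute error and the root mean squared error."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return {"max_error": max(errors), "rmse": rmse}

# Made-up values: the third house is an outlier the model badly underestimates
actual = [300000.0, 450000.0, 2000000.0]
predicted = [310000.0, 440000.0, 900000.0]
metrics = evaluate(actual, predicted)
print(metrics)
```

A single badly predicted house sets max_error on its own and, because errors are squared before averaging, also pulls the rmse up far more than the well-predicted houses pull it down.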

Let’s use matplotlib to plot the actual house prices from the test data together with the predictions from the regression model we just came up with.

plot of the actual value from test data and the prediction values from the regression model

The blue points are the actual house prices from the test data set, and the green line passes through the values predicted by the simple regression model.

Let’s look at the coefficients of the model.

coefficients of the regression model
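With a single feature the model has just two coefficients, an intercept and a slope, and a prediction is the intercept plus the slope times the square footage. The sketch below uses hypothetical coefficient values and an assumed column name `sqft_living`; they are not the numbers from the notebook.

```python
# Hypothetical coefficients, NOT the values from the notebook
coefficients = {"(intercept)": -45000.0, "sqft_living": 280.0}

def predict_price(sqft_living, coefs):
    """Apply price = intercept + slope * sqft_living."""
    return coefs["(intercept)"] + coefs["sqft_living"] * sqft_living

print(predict_price(2000.0, coefficients))
```

Read this way, the slope is the change in predicted price for each additional square foot.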

But here we used just one feature. Now let’s come up with a model using other features that could influence the price of the house, such as the number of bedrooms, bathrooms and floors, the zipcode, etc.

Let’s first create the list of features.

feature list

Let’s now visualize these features from the sales data

stats of the features from the sales data
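If you want the same kind of per-feature summary without the canvas, a few lines with the standard library’s statistics module go a long way. The rows, the feature names in `my_features` and the helper `summarize` are made up for this sketch.

```python
import statistics

# Made-up rows standing in for the sales data
sales = [
    {"bedrooms": 3, "bathrooms": 1.0, "floors": 1.0},
    {"bedrooms": 4, "bathrooms": 2.5, "floors": 2.0},
    {"bedrooms": 2, "bathrooms": 1.0, "floors": 1.0},
    {"bedrooms": 3, "bathrooms": 2.0, "floors": 2.0},
]
my_features = ["bedrooms", "bathrooms", "floors"]

def summarize(rows, features):
    """Collect min, max, mean and median for each numeric feature."""
    summary = {}
    for name in features:
        values = [row[name] for row in rows]
        summary[name] = {
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
        }
    return summary

for name, stats in summarize(sales, my_features).items():
    print(name, stats)
```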

Let’s build a model using all the features in the feature list.

using graphlab.linear_regression.create it’s easy to create models
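To get a feel for what fitting several features at once involves, here is a plain-Python sketch that solves the least-squares normal equations. It is one textbook approach, not necessarily the solver GraphLab picks, and it handles numeric features only; a categorical feature such as zipcode would need to be encoded first. The helper names and the toy rows are my own assumptions.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so row operations update both sides
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def fit_linear_regression(rows, target, features):
    """Least-squares weights [intercept, w1, ...] via the normal equations."""
    xs = [[1.0] + [float(row[f]) for f in features] for row in rows]
    ys = [float(row[target]) for row in rows]
    k = len(features) + 1
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Made-up rows generated from price = 10000 + 150*sqft_living + 20000*bedrooms
train_data = [
    {"sqft_living": 1000, "bedrooms": 2, "price": 200000},
    {"sqft_living": 1500, "bedrooms": 3, "price": 295000},
    {"sqft_living": 2000, "bedrooms": 3, "price": 370000},
    {"sqft_living": 1200, "bedrooms": 4, "price": 270000},
    {"sqft_living": 2500, "bedrooms": 5, "price": 485000},
]
weights = fit_linear_regression(train_data, "price", ["sqft_living", "bedrooms"])
print(weights)
```

The first weight is the intercept and the rest line up with the feature list, so each one can be read as the price change per unit of that feature while the others stay fixed.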

Let’s evaluate the new model for prediction accuracy on our test data.

The error is much lower once more features are considered in the model.

These code samples are excerpts from the introductory course on Machine Learning by the University of Washington on Coursera. I’ll follow up with more posts on machine learning concepts and on using GraphLab Create to develop machine learning applications. Until then, Happy Coding :)

