Linear Regression in Python with Scikit-Learn

Predicting house prices with linear regression

Kinder Sham
Analytics Vidhya
7 min read · Jun 15, 2020


Photo by Xavi Cabrera on Unsplash

Introduction

In supervised machine learning, there are two types of task: regression and classification. For example, predicting a house's price is a regression problem, while predicting whether a house will be sold is a classification problem.

The term “linearity” in algebra refers to a linear relationship between two or more variables.

In the simple linear regression discussed in this article, if you draw this relationship in a two-dimensional space, you will get a straight line.

Let’s consider a scenario where we want to determine the linear relationship between a house’s square footage and its selling price. Given the square footage of a house, can we estimate how much it will sell for?

We know that the formula of a regression line is basically: y = mx + b

where y is the predicted target label, m is the slope of the line, and b is the y-intercept.

If we plot the independent variable (Square Feet) on the x-axis and dependent variable (Sale Price) on the y-axis, linear regression gives us a straight line that best fits the data points, as shown in the figure below.

Linear Regression in Python with Scikit-Learn

In this section, we will learn how to use the Python Scikit-Learn library for machine learning to implement regression functions. We will start with a simple linear regression involving just two variables: in this regression task we will predict the Sales Price based upon the Square Feet of the house.

Code

1. Importing Libraries

To import necessary libraries for this task, execute the following import statements:
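The original import cell is not reproduced in this copy of the article; a minimal sketch of the imports this walkthrough needs (pandas for data handling, matplotlib for plotting, scikit-learn for the model and metrics):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
```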

2. Import the dataset

The following command imports the CSV dataset via pandas:
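The article's CSV filename is not shown here, so this sketch reads an in-memory CSV with the same two columns; in practice the call would be `pd.read_csv("<your file>.csv")` with the dataset's actual path:

```python
import io
import pandas as pd

# Stand-in for the article's CSV file (same two columns).
csv_text = """SquareFeet,SalesPrice
1710,208500
1262,181500
1786,223500
"""
df = pd.read_csv(io.StringIO(csv_text))
```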

3. Understand the data

Now let’s explore the dataset. Execute the following script: df.shape.

After that, you should see the following output. The dataset contains 1,460 rows and 2 columns. Let’s take a look at what our dataset actually looks like: enter df.head(), which retrieves the first 5 records from our dataset.

To see statistical details of the dataset, we can use df.describe():
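The exploration commands above can be sketched as follows; note that the DataFrame here is synthetic stand-in data with the article's shape (1,460 rows, 2 columns), not the article's actual dataset:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in built around the slope/intercept the article reports.
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, size=1460)
price = 13330.26 + 110.26 * sqft + rng.normal(0, 20000, size=1460)
df = pd.DataFrame({"SquareFeet": sqft, "SalesPrice": price})

print(df.shape)       # (1460, 2): 1,460 rows, 2 columns
print(df.head())      # first 5 records
print(df.describe())  # count, mean, std, min, quartiles, max
```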

Finally, we can draw the data points on a two-dimensional graph to visualize the dataset and see if we can spot any relationship in the data by eye. We use the df.plot() function of the pandas DataFrame and pass it the column names for the x and y coordinates, which are “SquareFeet” and “SalesPrice” respectively. We can use the below script to create the graph:
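A sketch of that plotting call, again using synthetic stand-in data in place of the article's CSV:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import numpy as np
import pandas as pd

# Synthetic stand-in data; the article plots its own CSV columns.
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, size=200)
price = 13330.26 + 110.26 * sqft + rng.normal(0, 20000, size=200)
df = pd.DataFrame({"SquareFeet": sqft, "SalesPrice": price})

# Scatter the two columns against each other.
ax = df.plot(x="SquareFeet", y="SalesPrice", style="o",
             title="Square Feet vs Sales Price")
```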

From the graph above, we can clearly see that there is a positive linear relation between the house square feet and the house sales price.

4. Preparing the Data

Now that we have an idea of the statistical details of our data, the next step is to divide the data into “attributes” and “target labels”. Attributes are the independent variables, and target labels are the dependent variables whose values are to be predicted. In our dataset we only have two columns. We want to predict the Sales Price based upon the Square Feet of the house. Therefore our attribute set will consist of the “SquareFeet” column, and the label will be the “SalesPrice” column. To extract the attributes and labels, execute the following script:

The attributes are stored in the X variable. We specified “-1” as the range for columns since we wanted our attribute set to contain all the columns except the last one, which is “SalesPrice”. Similarly the y variable contains the labels. We specified 1 for the label column since the index for “SalesPrice” column is 1. Remember, the column indexes start with 0, with 1 being the second column.
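Putting the indexing described above into code (the DataFrame is again a synthetic stand-in with the same two columns):

```python
import numpy as np
import pandas as pd

# Two-column frame standing in for the article's dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({"SquareFeet": rng.uniform(500, 4000, size=100)})
df["SalesPrice"] = 13330.26 + 110.26 * df["SquareFeet"]

X = df.iloc[:, :-1].values  # all columns except the last -> attributes
y = df.iloc[:, 1].values    # column index 1 ("SalesPrice") -> labels
```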

Now that we have our attributes and labels, the next step is to split this data into training and test sets. We’ll do this by using Scikit-Learn’s built-in train_test_split() method:

The above script assigns 80% of the data to the training set and 20% to the test set. The test_size parameter is where we specify the proportion of the test set.
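A sketch of that split, with stand-in attribute and label arrays (random_state pins the shuffle for reproducibility; the article may or may not set it):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in attribute/label arrays (100 samples).
rng = np.random.default_rng(0)
X = rng.uniform(500, 4000, size=(100, 1))
y = 13330.26 + 110.26 * X.ravel()

# 80% training, 20% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```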

5. Modelling

We have split our data into training and testing sets, and now it is finally time to train our model. Execute the following script:

With Scikit-Learn it is extremely straightforward to implement a linear regression model: all you really need to do is import the LinearRegression class, instantiate it, and call the fit() method with our training data. This is about as simple as it gets when using a machine learning library to train on your data.
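The training step described above, sketched end to end on stand-in data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Stand-in data and split, as in the earlier steps.
rng = np.random.default_rng(0)
X = rng.uniform(500, 4000, size=(100, 1))
y = 13330.26 + 110.26 * X.ravel()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Instantiate the model and fit it on the training data.
regressor = LinearRegression()
regressor.fit(X_train, y_train)
```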

In the theory section we said that a linear regression model basically finds the best values for the intercept and slope, which results in the line that best fits the data. To see the values of the intercept and slope calculated by the linear regression algorithm for our dataset, execute the following script to retrieve the intercept:

The resulting value you see should be approximately 13330.26. Then execute the following script to retrieve the slope (the coefficient of x):

The result should be approximately 110.26.
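Both retrievals in one sketch; here the stand-in data is built from the very values the article reports, so the fit recovers them almost exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noiseless data generated from the article's reported line.
rng = np.random.default_rng(0)
X = rng.uniform(500, 4000, size=(100, 1))
y = 13330.26 + 110.26 * X.ravel()

regressor = LinearRegression().fit(X, y)
print(regressor.intercept_)  # b, approximately 13330.26
print(regressor.coef_[0])    # m, approximately 110.26
```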

This means that for every additional square foot, the SalesPrice increases by about 110.26.

6. Predictions

Now that we have trained our algorithm, it’s time to make some predictions. To do so, we will use our test data and see how accurately our algorithm predicts the sale price. To make predictions on the test data, execute the following script:
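The prediction call, sketched on the same stand-in pipeline (noise is added so the test-set predictions are imperfect, as in the article):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Stand-in data, split, and fit, as in the earlier steps.
rng = np.random.default_rng(0)
X = rng.uniform(500, 4000, size=(100, 1))
y = 13330.26 + 110.26 * X.ravel() + rng.normal(0, 20000, size=100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
regressor = LinearRegression().fit(X_train, y_train)

# Predict a sale price for every row of the test attributes.
y_pred = regressor.predict(X_test)
```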

The y_pred is a NumPy array that contains all the predicted values for the input values in the X_test array.

To compare the actual output values for X_test with the predicted values, execute the following script and the output looks like this:
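A common way to do that comparison is to put the two arrays side by side in a DataFrame; a self-contained sketch on the stand-in pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Stand-in data, split, fit, and predict, as in the earlier steps.
rng = np.random.default_rng(0)
X = rng.uniform(500, 4000, size=(100, 1))
y = 13330.26 + 110.26 * X.ravel() + rng.normal(0, 20000, size=100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
y_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

# Side-by-side view of actual vs predicted test-set prices.
comparison = pd.DataFrame({"Actual": y_test, "Predicted": y_pred})
print(comparison.head())
```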

Although our model is not very accurate, the predicted value is close to the actual value.

7. Evaluation

The final step is to evaluate the performance of the algorithm. This step is particularly important for comparing the performance of different algorithms on a specific dataset. For regression algorithms, three evaluation metrics are commonly used:

  1. Mean Absolute Error (MAE) is the mean of the absolute value of the errors.
  2. Mean Squared Error (MSE) is the mean of the squared errors.
  3. Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors.

The Scikit-Learn library comes with pre-built functions that can be used to compute these values for us. Let’s find the values of these metrics using our test data. Execute the following script:
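The three metrics, computed with Scikit-Learn's functions on the stand-in pipeline (RMSE is taken as the square root of MSE, as the article defines it):

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Stand-in data, split, fit, and predict, as in the earlier steps.
rng = np.random.default_rng(0)
X = rng.uniform(500, 4000, size=(100, 1))
y = 13330.26 + 110.26 * X.ravel() + rng.normal(0, 20000, size=100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
y_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print("Mean Absolute Error:", mae)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
```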

You can see that the value of the root mean squared error is 62560.28, which is greater than 10% of the mean value of the Sales Price, i.e. 180921.2. This means that our algorithm is only moderately accurate.

There are many factors that contribute to this inaccuracy, some of which are listed here:

  1. The features we used may not have had a high enough correlation to the values we were trying to predict.
  2. We assumed that the data has a linear relationship, but that may not be the case; visualizing the data can help you determine this.

Next

To improve the model, we could use more independent variables and try different models, such as SVR, Lasso and ElasticNet.

Thanks for reading! If you enjoyed the post, please show your support by applauding via the clap (👏🏼) button below or by sharing this article so others can find it.

In the end, I hope you have learned how to use simple linear regression techniques. You can also find the full project in the GitHub repository.


Kinder Sham
Analytics Vidhya

Data scientist, cycling and game player enthusiast. Focus on how to use data science to answer questions.