Linear Regression

Güldeniz Bektaş
Published in Analytics Vidhya
6 min read · Nov 7, 2020


Our goal in machine learning is to find relationships between variables, and we have many algorithms for different use cases. Linear regression is one of the most popular, and the first one you will learn. In this article, I will give you a basic introduction to linear regression.

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data.

One of the observed variables is the independent variable, which we call ‘x’, and the other is the dependent variable, which we call ‘y’. That means y’s value changes according to changes in x.

Before we use the linear regression algorithm, we should first make sure our data has a relationship between the variables. Correlation and a scatter plot can help us discover such a relationship.

If the analysis is done with a single variable (x1), it is called simple linear regression. If it is done with more than one variable (x1, x2, x3…), it is called multiple linear regression.

To put it in the most unique (!) example, imagine that you are buying a house and the important thing for you is the area of the house. The price of the house depends on the area of the house. Here, the area of the house is x, our independent variable. The thing that depends on x is the price of the house, y. x is our feature number one. If the area of the house is the only independent variable we have, we can use simple linear regression. But apart from the area, the location, the number of rooms, and the age of the house affect the price too. Now we have four features that influence the price of the house, so we will use multiple linear regression.
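The difference between the two cases is just the number of feature columns. Here is a small sketch with made-up house numbers (all values are hypothetical, purely for illustration):

```python
import numpy as np

# Hypothetical toy data: simple linear regression uses one feature,
# the area of the house (values are made up for illustration).
X_simple = np.array([[50.0], [75.0], [100.0]])   # area

# Multiple linear regression adds more features per house:
# area, location score, number of rooms, age (all made-up numbers).
X_multiple = np.array([
    [50.0, 3.0, 2.0, 10.0],
    [75.0, 4.0, 3.0, 5.0],
    [100.0, 5.0, 4.0, 1.0],
])

print(X_simple.shape)    # (3, 1) -> one feature: simple linear regression
print(X_multiple.shape)  # (3, 4) -> four features: multiple linear regression
```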

The data we have can have a graph like this with its regression line:

From one of my multiple linear regression projects

As you can see, our line doesn’t pass through every dot. If it did, it would be perfect, but it is impossible to have that kind of perfect relationship.

This line doesn’t happen by magic. There is some mathematical magic going on under the hood, and this equation lies under the regression line:

y’ = w0 + w1x1

Hypothesis Function

This function is called the ‘hypothesis function’.

  • y’ is the value we are trying to predict, the house’s price.
  • b (w0) is the y-axis intercept, which we can also call the ‘bias’. It is the value that balances everything we do.
  • x1 is our independent variable number one. If we move on with simple linear regression, which this equation belongs to, x1 is the area of the house.
  • w1 is the weight of feature number one.
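The hypothesis function above can be sketched in a few lines of code. The weights here are made up purely for illustration:

```python
def hypothesis(x1, w0, w1):
    # y' = w0 + w1 * x1: the intercept (bias) plus the weighted feature
    return w0 + w1 * x1

# Hypothetical weights: a base price of 50 plus 2 per unit of area
print(hypothesis(100, w0=50, w1=2))  # 250
```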

The equation for multiple linear regression:

y’ = w0 + w1x1 + w2x2 + w3x3 + …

x1 is the area of the house, x2 the location of the house, x3 the number of rooms of the house, and so on… w1, w2, and w3 are the weights of the features.
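With more than one feature, the same equation is naturally written as a dot product between the weight vector and the feature vector. The numbers below are hypothetical, just to show the mechanics:

```python
import numpy as np

def hypothesis(x, w, b):
    # y' = b + w1*x1 + w2*x2 + ... expressed as a dot product
    return b + np.dot(w, x)

# Made-up feature values (area, location, rooms, age) and weights
x = np.array([100.0, 4.0, 3.0, 10.0])
w = np.array([2.0, 10.0, 5.0, -1.0])
print(hypothesis(x, w, b=50.0))  # 50 + 200 + 40 + 15 - 10 = 295.0
```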

You can see math is an important part of machine learning, and this is just the beginning.

And what if our line is wrong?

I mean, how can we update our w0 (b) and w1? How can we choose which pair of values forms the best line for our data?

With Cost Function!

What is that ‘Cost Function’?

Our goal is to minimize the difference between the estimated y value and the actual y value. So, we need to update the w0 and w1 values to reduce the difference.

Cost Function (J)

The cost function of linear regression is the Root Mean Squared Error (RMSE) between the predicted y value and the real y value.
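The RMSE cost can be sketched directly from its definition (the arrays below are made-up toy values):

```python
import numpy as np

def rmse_cost(y_true, y_pred):
    # Root Mean Squared Error between predicted and actual y values
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])
print(rmse_cost(y_true, y_pred))
```

The smaller this value, the better our w0 and w1 fit the data.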

What is Gradient Descent?

At a theoretical level, gradient descent is an algorithm that minimizes functions. Given a function defined by a set of parameters, gradient descent starts with an initial set of parameter values and iteratively moves toward a set of parameter values that minimize the function. This iterative minimization is achieved using calculus, taking steps in the negative direction of the function gradient.

Gradient descent steps down the cost function. The size of each step is known as the learning rate. We should be careful about choosing the learning rate:

  1. You shouldn’t choose a very large learning rate, or you can overshoot and miss the local minimum.
  2. If you choose a learning rate that is too small, it will take more time to reach the local minimum.

https://www.geeksforgeeks.org/gradient-descent-in-linear-regression/
https://machinelearningmastery.com/understand-the-dynamics-of-learning-rate-on-deep-learning-neural-networks/

In the figure from the second link, the blue line is a good choice for a learning rate.
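The loop described above can be sketched as a minimal gradient descent for simple linear regression. This is a sketch using made-up toy data and a mean-squared-error cost, not a production implementation:

```python
import numpy as np

def gradient_descent(x, y, lr=0.01, n_steps=1000):
    # Fit y' = w0 + w1*x by repeatedly stepping down the MSE cost.
    w0, w1 = 0.0, 0.0
    for _ in range(n_steps):
        y_pred = w0 + w1 * x
        error = y_pred - y
        # Gradients of the mean squared error with respect to w0 and w1
        grad_w0 = 2.0 * error.mean()
        grad_w1 = 2.0 * (error * x).mean()
        # Step in the negative gradient direction, scaled by the learning rate
        w0 -= lr * grad_w0
        w1 -= lr * grad_w1
    return w0, w1

# Toy data generated from y = 1 + 2x, so we expect w0 ~ 1 and w1 ~ 2
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
w0, w1 = gradient_descent(x, y, lr=0.05, n_steps=5000)
print(round(w0, 3), round(w1, 3))
```

Try raising `lr` toward 1.0 and the loop diverges instead of converging, which is exactly the “too large learning rate” problem above.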

Let’s use the scikit-learn library to code what we have learned so far.

First things first, we need our libraries:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

And we need our dataset (you can find data here):

# We are using the read_csv method of pandas to read our data
data = pd.read_csv("satislar.csv")

We have two columns, ‘aylar’ and ‘satislar’ (months and sales). We should separate them as x, the independent variable ‘aylar’, and y, the dependent variable ‘satislar’.

X = data.iloc[:, 0:1].values
# ':' means take all the rows, '0:1' means take only the first column,
# and .values turns it into an array
y = data.iloc[:, -1:].values
# '-1:' means take only the last column, the 'satislar' column

We have successfully split our data into X and y. Now we can split them again into train and test data.

# we'll use the scikit-learn library
from sklearn.model_selection import train_test_split

After you import the library, we can write one single line of code to do the split:

# You always have to follow this order to split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.22, random_state=0)

  • X is the data that contains the values of X_train and X_test; y is the data that contains the values of y_train and y_test.
  • test_size=.22 represents the proportion of the dataset to include in the test split. By default, this value is 0.25, but we have a small dataset, so I gave a smaller value.
  • random_state=0 controls the shuffling applied to the data before the split.
Next, we import the model:

from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()

We created an object called ‘lin_reg’. We will fit our model, then make it predict:

lin_reg.fit(X_train, y_train)
y_pred = lin_reg.predict(X_test)

Done! You can compare y_pred and y_test to see how well you did. But there is an easier way to do that.

# you can plot the results
plt.scatter(X, y, color='red')
plt.plot(X_test, y_pred, color='yellow')
plt.show()

Looks fine!

# and you can print the R squared value
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
>> 0.9774483391303704

The R squared score is very close to 1, which means our model works well.
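If you want to see what `r2_score` computes under the hood, it can be sketched from the definition R² = 1 − SS_res / SS_tot (the arrays below are made-up toy values, not the ‘satislar’ data):

```python
import numpy as np

def r_squared(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot: the fraction of the variance in y
    # that is explained by the model
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical actual and predicted values
y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([11.0, 19.0, 31.0])
print(r_squared(y_true, y_pred))
```

A score of 1 means the line passes through every point; a score near 0 means the model is no better than predicting the mean.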
