Published in Analytics Vidhya

Machine Learning: Simple Linear Regression Using Python

We humans get better with experience and age. Ever wondered how machines get better in the Data Science field? Data modeling uses machine learning algorithms, in which the machine learns from historical data to build a model that makes predictions on new data.

Machine Learning models are classified into two categories:

  1. Supervised learning method: This method has historical data with labels. Regression and Classification algorithms fall under this category.
  2. Unsupervised learning method: No pre-defined labels are assigned to historical data. Clustering algorithms fall under this category.

When a function f maps the input variable X to the output variable Y:

Classification is the task of predicting a discrete class label.
For example, deciding whether an email or text belongs to one of two classes, ‘spam’ or ‘not spam’, is a classification problem.

Regression is the task of predicting a continuous quantity.
For example, predicting the performance of a company in terms of revenue based on historical data is a regression problem.

To learn more about how Machine Learning models are classified, you can click on my article here.

What is Regression Analysis and when can we use it?

Regression analysis is a predictive modeling method that explores the relationship between a dependent (target) variable and one or more predictor (independent) variables. It is used for forecasting, time series modeling, and finding cause-and-effect relationships between variables. In other words, regression is about connecting the dots among variables.

For instance, when a company hires an employee and negotiates the salary, it considers features such as experience, level of education, role, the city they work in, and so on. In a regression problem, we treat the data for each employee of the company as one observation.

To make it even simpler to understand, we can put it this way:

‘In regression analysis, we usually consider some phenomenon of interest and have a number of observations. Each observation has two or more features. Following the assumption that (at least) one of the features depends on the others, we try to establish a relation among them.’

People who are just getting their feet wet with Data Science algorithms, myself included, often think that Linear and Logistic regression are the only forms of regression. It is important to be aware that the field of predictive modeling offers several regression techniques:

  1. Simple and multiple linear regression
  2. Polynomial regression
  3. Ridge regression and Lasso regression (regularized extensions of linear regression)
  4. Decision tree regression
  5. Support Vector Regression (SVR), the regression form of Support Vector Machines (SVM)

In this post, let's confine our learning to a thorough understanding of Simple Linear Regression, which is one of the most important and commonly used regression techniques.

To get our basics right, a linear relationship between variables can be put into words as ‘exhibiting a directly proportional change in two related quantities’.

NOTE: The shape of a linear regression model depends on the number of dimensions:
in two dimensions, it is a straight line;
in three dimensions, it is a plane;
in more than three dimensions, it is a hyperplane.


A linear function has one independent variable and one dependent variable. The independent variable is x and the dependent variable is y. It can be written as y = a + bx, where:

  • a is the constant term or the y-intercept. It is the value of the dependent variable when x = 0.
  • b is the coefficient of the independent variable. It is also known as the slope and gives the rate of change of the dependent variable.
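To see the roles of a and b in action, here is a minimal sketch of the linear function in Python. The function name `linear` and the sample numbers are made up purely for illustration.

```python
def linear(x, a, b):
    """Return y = a + b*x, where a is the intercept and b is the slope."""
    return a + b * x


# Illustrative values only: intercept a = 2.0, slope b = 3.0
a, b = 2.0, 3.0

print(linear(0, a, b))                    # at x = 0 the result is the intercept a
print(linear(5, a, b) - linear(4, a, b))  # a one-unit step in x changes y by the slope b
```

Whatever values you pick, the result at x = 0 is always a, and every extra unit of x adds b to y.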

Let’s take the salary prediction dataset to build the linear regression model.
For reference, here is a link to the dataset.

#importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
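The rest of the code works on a DataFrame called dataset, so the data has to be loaded first. Here is a minimal sketch of that step. The file name `Salary_Data.csv` and the helper `load_salary_data` are assumptions for illustration, so adjust the path to wherever you saved the file. The column names `YearsExperience` and `Salary` are the ones used in the code that follows.

```python
import pandas as pd

# Hypothetical file name: replace it with the location of your copy of the dataset.
DATA_PATH = "Salary_Data.csv"


def load_salary_data(source=DATA_PATH):
    """Read the salary data and check that the two columns used in this post exist."""
    df = pd.read_csv(source)
    required = ["YearsExperience", "Salary"]
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected column(s): {missing}")
    return df


# dataset = load_salary_data()   # uncomment once the CSV file is in place
# print(dataset.shape)           # number of rows and columns
# print(dataset.describe())      # summary statistics for both columns
```

Checking the column names up front is optional, but it gives a clearer error than a KeyError later in the plotting code.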

Visualizing the facts is always better than blindly keeping the equation in our heads, right? A scatter plot shows how two variables are distributed. Let’s plot our data points on a 2-D graph to view our dataset and see if we can spot any relationship between the values.

dataset.plot(x='YearsExperience', y='Salary', style='o')
plt.title('Work Experience vs Salary')
plt.show()

It is easy to say that ‘salary increases as the number of years of work experience increases’. But from the above graph, this is not strictly the case: we can notice that someone with 3 years of experience is earning more than someone with 5 years! So here is our disappointment: the observations do not all lie on a single line, meaning we cannot find an equation that calculates the (y) value exactly. :(

Oh wait, it’s not as bad as we thought, so don't worry. Now, carefully observe the scatter plot again. Did you see any pattern?
The points are not all on a line BUT they form a line shape! It’s linear!
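If you want a number to go with that visual impression, one common choice is the Pearson correlation coefficient: values close to +1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one. The sketch below uses NumPy's `np.corrcoef`. The helper name `pearson_r` is made up for illustration, and this check is an addition, not part of the original walkthrough.

```python
import numpy as np


def pearson_r(x, y):
    """Return the Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    # np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r
    return float(np.corrcoef(x, y)[0, 1])


# With the salary data loaded as `dataset`, the call would look like this:
# r = pearson_r(dataset['YearsExperience'], dataset['Salary'])
# print(r)
```

A value of r close to 1 would support treating the relationship between experience and salary as linear.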

We can also check how salary values have been distributed in the given dataset.

sns.histplot(dataset['Salary'], kde=True)  # assumed plotting call: histogram of the Salary column
plt.title('Salary Distribution Plot')

From the above plot, we can infer that the salaries range from about 40,000 to 125,000.

Python Code:

X = dataset['YearsExperience'].values.reshape(-1,1)
y = dataset['Salary'].values.reshape(-1,1)

  • X: the first column, which contains the Years Experience array
  • y: the last column, which contains the Salary array

Next, we assign 80% of the data to the training set and 20% of the data to the test set using the code below.
The test_size parameter is where we actually specify the proportion of the test set.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = LinearRegression()
regressor.fit(X_train, y_train) #training the algorithm

  • regressor = LinearRegression(): our training model, which will implement the Linear Regression.
  • regressor.fit(X_train, y_train): in this line, we pass X_train, which contains the values of Years Experience, and y_train, which contains the corresponding Salary values, to build the model. This is the training process.
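To get a feel for what the split does, here is a rough NumPy-only sketch of an 80/20 shuffle split. It illustrates the idea and is not scikit-learn's actual implementation: the helper name `simple_train_test_split` is made up, and it will not pick the same rows as `train_test_split` with `random_state=0`.

```python
import numpy as np


def simple_train_test_split(X, y, test_size=0.2, seed=0):
    """Shuffle the row indices, then hold out a fraction of the rows for testing."""
    X = np.asarray(X)
    y = np.asarray(y)
    n = len(X)
    n_test = int(np.ceil(n * test_size))  # number of rows kept aside for testing
    rng = np.random.default_rng(seed)     # a fixed seed makes the shuffle repeatable
    idx = rng.permutation(n)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Toy data purely for illustration: 10 rows with y = 2 * x
X_demo = np.arange(10).reshape(-1, 1)
y_demo = 2 * np.arange(10)
X_tr, X_te, y_tr, y_te = simple_train_test_split(X_demo, y_demo)
print(len(X_tr), len(X_te))
```

The key point is that each X row stays paired with its y value, and that the test rows are never shown to the model during training.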

Yay, we have built our model! Now we can use it to predict the value of y (salary) for any value of X (years of experience). This is how we do it:

y_pred = regressor.predict(X_test)

We can also compare the actual values with the values predicted by our model, to see how well the model is working.

df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})
df1 = df.head(25)
df1.plot(kind='bar')  # bar chart of actual vs predicted values
plt.grid(which='major', linestyle='-', linewidth=0.5, color='green')
plt.grid(which='minor', linestyle=':', linewidth=0.5, color='black')
plt.show()

Now, the task is to find the line that best fits the above scatter plot so that we can predict the response for any new feature value (i.e., a value of x not present in the dataset). This line is called the regression line.

plt.scatter(X_test, y_test, color='gray')
plt.plot(X_test, y_pred, color='red', linewidth=2)
plt.show()

From the above plot, the red regression line follows the test points closely, which suggests that our model fits this dataset well and is good to use.
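One way to back up a judgement like this with numbers is to compute a few standard error metrics on the test set. The sketch below is an addition to the walkthrough, written with plain NumPy. The helper name `regression_metrics` is made up for illustration.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for actual vs predicted values."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    errors = y_true - y_pred
    mae = float(np.mean(np.abs(errors)))         # mean absolute error
    rmse = float(np.sqrt(np.mean(errors ** 2)))  # root mean squared error
    ss_res = float(np.sum(errors ** 2))          # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                   # coefficient of determination
    return {"mae": mae, "rmse": rmse, "r2": r2}


# With the model from this post, the call would look like this:
# print(regression_metrics(y_test, y_pred))
```

MAE and RMSE are in the same units as the salary, so they are easy to interpret, while an R² close to 1 means the line explains most of the variation in the test data.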

The values that we can control are the intercept and slope. There can be multiple straight lines depending upon the values of intercept and slope. Basically, what the linear regression algorithm does is find, among all these candidate lines, the one that results in the least error, measured as the sum of squared differences between the actual and predicted values (ordinary least squares).
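For simple linear regression, the least-error line does not have to be found by trial and error: the slope and intercept that minimize the sum of squared errors have a closed-form solution. Here is a minimal NumPy sketch of that formula. The helper name `fit_simple_ols` and the toy numbers are made up for illustration, and this is not how scikit-learn is implemented internally.

```python
import numpy as np


def fit_simple_ols(x, y):
    """Return (intercept a, slope b) of the least-squares line y = a + b*x."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    x_mean, y_mean = x.mean(), y.mean()
    # slope = sum of co-deviations of x and y / sum of squared deviations of x
    b = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # the least-squares line always passes through the point (x_mean, y_mean)
    a = y_mean - b * x_mean
    return float(a), float(b)


# Toy points that lie exactly on y = 1 + 2x
a, b = fit_simple_ols([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

Calling `fit_simple_ols(X_train, y_train)` on the training data should give values that agree with the fitted model's intercept and slope up to floating-point rounding.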

Let’s find the values of slope and intercept, to form a regression line.

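#To retrieve the intercept:
print('Intercept of the model:', regressor.intercept_)
#For retrieving the slope:
print('Coefficient of the line:', regressor.coef_)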
Intercept of the model: 25202.887786154883
Coefficient of the line: [9731.20383825]

We can now say that the function is y = 9731.2x + 25202.88.
If we want to check the salary for 5 years' experience then, from the above function, we get y ≈ 73,858.88. It predicts well!
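The same calculation can be written as a small function. This sketch simply plugs in the intercept and slope printed above. The function name `predict_salary` is made up for illustration.

```python
# Values copied from the model output shown in this post
INTERCEPT = 25202.887786154883
SLOPE = 9731.20383825


def predict_salary(years_experience):
    """Apply y = intercept + slope * x with the fitted values."""
    return INTERCEPT + SLOPE * years_experience


print(round(predict_salary(5), 2))
```

With the full-precision values this gives about 73,858.91, while the rounded coefficients in the formula give 73,858.88. Using the fitted model directly, the equivalent call would be regressor.predict([[5]]).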


