Linear Regression

Pınar Yazgan
Published in Data Science Earth
Jan 12, 2021 · 3 min read

Regression analysis is one of the most important areas in statistics and machine learning. There are many regression methods. Linear regression is one of them.

What Is Regression?

Regression is a statistical method that models the relationships between variables and produces estimates based on those relationships.

For example, by observing employees in a company, we can understand how their salaries change depending on features such as their experience, education level, role, and the city they work in.

This is a regression problem in which each employee’s data represents one observation. Experience, education level, role and city are the independent variables in this problem, while salary is the dependent variable because it changes depending on them.

Similarly, there is a mathematical link between house prices and variables such as location, number of rooms, and distance to the city center.

Regression problems generally have a continuous and unbounded dependent variable. The inputs can be continuous, discrete, or even categorical data such as gender, nationality, or brand.
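Categorical inputs usually have to be converted to numbers before a regression model can use them. The following is a minimal sketch of one common approach, one-hot encoding with pandas `get_dummies`. The employee table, its column names, and its values are invented for illustration and are not part of this article's data.

```python
import pandas as pd

# Hypothetical employee data, invented for illustration only.
employees = pd.DataFrame({
    "experience_years": [1, 5, 10],
    "city": ["Ankara", "Istanbul", "Ankara"],
})

# One-hot encode the categorical column; numeric columns pass through unchanged.
encoded = pd.get_dummies(employees, columns=["city"])
print(encoded.columns.tolist())
```

Each distinct city becomes its own indicator column, so the model receives only numeric inputs.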

It is common practice to denote the output of the problem by 𝑦 and its input by 𝑥. If there are two or more independent variables, they can be represented as the vector 𝐱 = (𝑥₁,…, 𝑥ᵣ), where 𝑟 is the number of inputs.

When Do We Need Regression?

We need regression to answer whether some events affect others, how they affect them, or what kind of relationship exists between several variables.

For example, we can use regression to determine whether and to what extent experience or gender affects salaries.

Regression is also useful when you want to predict a response using a new set of predictors. For example, we can try to predict a home’s electricity consumption for the next hour, considering the outdoor temperature, time of day, and the number of residents in that household.

Regression is used in many different fields, such as economics, computer science, and the social sciences. Its importance grows day by day as the amount of data increases and awareness of the value of data spreads.

Simple Linear Regression

First, let’s talk about the basic concepts of linear regression:

Simple linear regression consists of generating a regression model (equation of a line) that explains the linear relationship between two variables.

The dependent variable is denoted by y and the independent variable by x.

Linear regression equation: y = ax + b

  • a is the slope of the line,
  • y is the dependent variable,
  • x is the independent variable,
  • b is the intercept, which indicates how far the line is shifted along the y-axis.
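To make the roles of a and b concrete, here is a minimal sketch that computes them with the standard least-squares formulas for a single input. The helper name `fit_line` and the sample points are assumptions made for illustration; they do not come from this article's dataset.

```python
import numpy as np

def fit_line(x, y):
    """Return slope a and intercept b of the least-squares line y = a*x + b."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: covariance of x and y divided by the variance of x.
    a = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means.
    b = y_mean - a * x_mean
    return a, b

# Made-up points that lie exactly on y = 2x + 1.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 2.0 1.0
```

Because the sample points lie exactly on a line, the recovered slope and intercept match it; with noisy data the same formulas give the line that minimizes the sum of squared errors.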

In this article we will show how to apply linear regression in Python.

Linear Regression Application

First, let’s talk about which libraries we will use.

  • Matplotlib → A Python library used for data visualization.
  • Pandas → A Python library used for data analysis.
  • Scikit-Learn (sklearn) → A widely used open source machine learning library.

First, let’s import the necessary libraries.

import matplotlib.pyplot as plt
import pandas as pd

Let’s read our data with the read_csv function, which is widely used in the pandas library.

data = pd.read_csv('sales.csv')

Months,Sales
8,19671.5
10,23102.5
11,18865.5
12,2662.5
14,19945.5
19,28321
19,60074
20,27222.5
20,32222.5
23,27594.5
25,31609
25,27897
25,28478.5
27,28790.5
29,30555.5
31,33969
32,33014.5
34,41544
37,40681.5
38,46970
42,45869
43,48136.5
47,50651
50,56906
53,58715.5
55,52791
59,58484.5
59,56317.5
64,61195.5
65,60936
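Before going further, it can help to confirm that a file in this layout loads with the expected column names and types. The sketch below does not read sales.csv itself: it assumes an in-memory copy of the header and the first three rows, wrapped in `io.StringIO`, so that it runs on its own.

```python
import io
import pandas as pd

# In-memory stand-in for the first rows of the file, used only for this check.
csv_text = "Months,Sales\n8,19671.5\n10,23102.5\n11,18865.5\n"
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.dtypes)  # data type inferred for each column
print(sample.shape)   # (rows, columns)
```

Note that the header must not contain a space after the comma; otherwise the second column would be named " Sales" and selecting 'Sales' would fail.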

Let’s separate the months and sales columns.

“Sales” is the dependent variable, which changes according to the months.

“Months” is the independent variable.

months = data[['Months']]
sales = data[['Sales']]

Let’s split the data for training and testing.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(months, sales, test_size=0.33, random_state=0)
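To see what the split produces, here is a minimal sketch on a small made-up table (the ten rows below are invented and are not the article's data). It uses the same `test_size` and `random_state` arguments as above and checks that the inputs and targets stay aligned.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, invented for illustration.
df = pd.DataFrame({"Months": range(1, 11), "Sales": range(100, 1100, 100)})

x_tr, x_te, y_tr, y_te = train_test_split(
    df[["Months"]], df[["Sales"]], test_size=0.33, random_state=0
)

# Sizes of the two parts; together they cover every row exactly once.
print(len(x_tr), len(x_te))
# The same rows are selected for inputs and targets.
print(x_tr.index.equals(y_tr.index))
```

Fixing `random_state` makes the shuffle reproducible, so the same rows end up in the test set on every run.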

Let’s build the linear regression model.

from sklearn.linear_model import LinearRegression
lr = LinearRegression()

Let’s fit the model so that it learns the relationship between x_train and y_train.

lr.fit(x_train,y_train)
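After fitting, the slope a and the intercept b of the equation y = ax + b can be read from the model's `coef_` and `intercept_` attributes. The sketch below illustrates this on made-up points that lie on the line y = 3x + 2; the variable names and data are assumptions for illustration, not the article's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points on the line y = 3x + 2 (not the article's data).
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([5.0, 8.0, 11.0, 14.0])

model = LinearRegression().fit(x, y)
slope = model.coef_[0]        # a in y = ax + b
intercept = model.intercept_  # b in y = ax + b
print(slope, intercept)
```

When the target is two-dimensional, as with the single-column DataFrame used in this article, `coef_` has shape (1, 1) and `intercept_` has shape (1,), so the values are read with `lr.coef_[0][0]` and `lr.intercept_[0]`.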

Let’s use our model to make predictions on the test data.

prediction = lr.predict(x_test)
print(prediction)

[[27093.02545779]
 [30493.96560965]
 [24258.90866458]
 [30493.96560965]
 [22558.43858866]]

Let’s visualize our data now.

# Sort by index so the training points are drawn in order
x_train = x_train.sort_index()
y_train = y_train.sort_index()
# Plot the training data and the model's predictions for the test set
plt.plot(x_train, y_train)
plt.plot(x_test, lr.predict(x_test))
# Add the chart title and the x and y axis labels
plt.title("Sales By Months")
plt.xlabel("Months")
plt.ylabel("Sales")
# Display the chart
plt.show()

Thus, we have plotted the predictions for the test values alongside the training data. Because the predictions come from the fitted model, they lie on the regression line, the straight line that best fits the training points.
Now let’s look at the R² value:

from sklearn.metrics import r2_score
print("Linear R2 value:")
print(r2_score(y_test,lr.predict(x_test)))

Linear R2 value:
0.923833189633769

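R-squared (R²) is a statistical measure of how close the data are to the fitted regression line: it is the proportion of the variance in the dependent variable that the model explains, and a value of 1 indicates a perfect fit. Here the value is about 0.924, so we can say that the model fits the test data very well.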
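For readers who want to see where this number comes from, R² can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up true and predicted values (not the article's results) and compares the manual result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

Both values agree, which shows that `r2_score` measures how much smaller the model's squared errors are than those of always predicting the mean.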

Pınar Yazgan

Business Intelligence Specialist
