Introduction to Simple Linear Regression with Examples

Pushkar
Published in Codersarts Read
6 min read · Jun 19, 2023

Regression analysis is a statistical tool for establishing relationships between two or more variables, and it is commonly used to predict future outcomes from past data. This article focuses on simple linear regression, which involves a single independent variable and a single dependent variable. It covers the objective of simple linear regression, the correlation coefficient, the two kinds of variables, the regression equation, and a worked example of a regression line.

Simple linear regression is a statistical method that is used to study the relationship between two variables, where one variable is the independent variable and the other variable is the dependent variable. The aim of simple linear regression is to establish the linear relationship between these two variables.

Objective of Simple Linear Regression

The objective of simple linear regression is to find a line of best fit that represents the linear relationship between the independent variable and the dependent variable. This line of best fit can then be used to make predictions about the dependent variable for a given value of the independent variable.
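One way to make "best fit" concrete, assuming the usual ordinary least squares sense (which is what the slope and intercept formulas later in this article compute), is that the fitted line has the smallest sum of squared vertical distances to the data points. The sketch below uses a small made-up dataset and a hypothetical helper named sse; neither comes from the article. It compares the fitted line with a few nearby lines.

```python
import numpy as np

# Hypothetical data, chosen only for illustration (not from the article)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def sse(m, c):
    """Sum of squared vertical distances between the points and the line y = m*x + c."""
    return float(np.sum((y - (m * x + c)) ** 2))

# Least-squares slope and intercept; np.polyfit with degree 1 returns [slope, intercept]
m_best, c_best = np.polyfit(x, y, 1)

# Nudge the slope and intercept a little in each direction and compare the errors
candidates = [(m_best + dm, c_best + dc)
              for dm in (-0.2, 0.0, 0.2)
              for dc in (-0.5, 0.0, 0.5)]
best_candidate = min(candidates, key=lambda mc: sse(*mc))

print("fitted line error:", sse(m_best, c_best))
print("lowest-error candidate is the fitted line:", best_candidate == (m_best, c_best))
```

Every nudged line should report a larger error than the fitted one, which is the sense in which the regression line is the "best" straight line for the data.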

What is the Correlation Coefficient?

The correlation coefficient is a statistical measure of the strength and direction of the linear association between two variables. It ranges from -1 to +1, where -1 indicates a perfect negative correlation, +1 indicates a perfect positive correlation, and 0 indicates no linear correlation. The correlation coefficient can be calculated using the following formula:

r = (nΣxy - ΣxΣy) / sqrt((nΣx² - (Σx)²)(nΣy² - (Σy)²))

where:

  • n is the number of data points
  • Σxy is the sum of the products of the corresponding values of x and y
  • Σx and Σy are the sums of the values of x and y
  • Σx² and Σy² are the sums of the squares of the values of x and y

Example

Suppose we want to investigate the relationship between the number of hours studied and the score obtained in a test. We collect data from 10 students and obtain the following results:

Hours studied (x)    Score (y)
2                    50
3                    60
4                    70
5                    80
6                    90
7                    100
8                    110
9                    120
10                   130
11                   140

Using the formula for the correlation coefficient, we get:

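With n = 10, Σx = 65, Σy = 950, Σxy = 7000, Σx² = 505, and Σy² = 98,500:

r = (10(7000) - (65)(950)) / sqrt((10(505) - (65)²)(10(98,500) - (950)²))
r = 8250 / sqrt((825)(82,500))
r = 8250 / 8250 = 1

Since the correlation coefficient is exactly +1, there is a perfect positive correlation between the number of hours studied and the score obtained in the test. Every data point lies on the straight line y = 10x + 30. Real-world data is rarely this tidy, and a value close to (but below) +1 would still indicate a strong positive correlation.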
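The hand calculation can be checked with a few lines of Python. This is a minimal sketch that applies the correlation formula term by term to the study-hours table; the function name pearson_r is an illustrative choice and does not appear in the article.

```python
import math

# Data from the study-hours table
hours = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
scores = [50, 60, 70, 80, 90, 100, 110, 120, 130, 140]

def pearson_r(x, y):
    """Correlation coefficient computed directly from the summation formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = pearson_r(hours, scores)
print(f"sum_x = {sum(hours)}, sum_y = {sum(scores)}, r = {r:.4f}")
```

Printing the intermediate sums alongside r makes it easy to spot an arithmetic slip in a manual calculation.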

What are Variables?

In simple linear regression, there are two types of variables: independent variable and dependent variable. The independent variable is the variable that is used to predict the value of the dependent variable. The dependent variable is the variable that is being predicted.

Equation of Simple Linear Regression

The equation of simple linear regression is given by:

y = mx + c

where:

  • y is the dependent variable
  • x is the independent variable
  • m is the slope of the line of best fit
  • c is the y-intercept of the line of best fit

The slope of the line of best fit can be calculated using the following formula:

m = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²)

The y-intercept of the line of best fit can be calculated using the following formula:

c = (Σy - mΣx) / n

where:

  • n is the number of data points
  • Σxy is the sum of the products of the corresponding values of x and y
  • Σx and Σy are the sums of the values of x and y
  • Σx² is the sum of the squares of the values of x
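The two formulas translate almost directly into code. The sketch below is one possible implementation, applied to the study-hours data from the earlier example; the function name fit_line is hypothetical and chosen for illustration.

```python
def fit_line(x, y):
    """Return (m, c) for the line y = m*x + c using the summation formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    # The denominator is zero when every x value is identical; no line can be fitted then
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    c = (sum_y - m * sum_x) / n
    return m, c

# Study-hours data from the correlation example
hours = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
scores = [50, 60, 70, 80, 90, 100, 110, 120, 130, 140]

m, c = fit_line(hours, scores)
print(f"score = {m} * hours + {c}")  # prints: score = 10.0 * hours + 30.0
```

Note that the intercept reuses the slope, so m has to be computed first.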

Example of a Regression Line

An example of a real-world regression line can be seen in predicting housing prices based on the size of the house. Suppose a real estate agent wants to predict the price of a house based on its size. They collect data from 10 houses and obtain the following results:

House size (square feet)    Price (INR)
1000                        10,00,000
1200                        12,00,000
1400                        14,00,000
1600                        16,00,000
1800                        18,00,000
2000                        20,00,000
2200                        22,00,000
2400                        24,00,000
2600                        26,00,000
2800                        28,00,000

Using simple linear regression, we can find the equation of the regression line that best fits this data. First, we calculate the slope of the line of best fit using the formula:

m = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²)

where:

  • n is the number of data points (in this case, n=10)
  • Σxy is the sum of the products of the corresponding values of x and y
  • Σx and Σy are the sums of the values of x and y
  • Σx² is the sum of the squares of the values of x

From the table, Σx = 19,000, Σy = 1,90,00,000, Σxy = 39,40,00,00,000, and Σx² = 3,94,00,000. Plugging in these values, we get:

m = (10(39,40,00,00,000) - (19,000)(1,90,00,000)) / (10(3,94,00,000) - (19,000)²)
m = 33,00,00,00,000 / 3,30,00,000
m = 1000

Next, we can calculate the y-intercept of the line of best fit using the formula:

c = (Σy - mΣx) / n

Plugging in the values, we get:

c = (1,90,00,000 - (1000)(19,000)) / 10
c = 0

Therefore, the equation of the regression line that best fits this data is:

Price = 1000(size) + 0

In other words, every house in this dataset is priced at exactly 1,000 INR per square foot, so the line passes through all ten points. This equation can be used to predict the price of a house based on its size. For example, if a house has a size of 1500 square feet, we can predict its price as:

Price = 1000(1500) + 0
Price = 15,00,000 INR

The following Python code computes the slope and intercept with NumPy, displays a scatter plot of the data points together with the regression line, and prints the predicted price of a 1500-square-foot house based on the regression line equation.

import matplotlib.pyplot as plt
import numpy as np

# define the dataset
house_size = [1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600, 2800]
price = [1000000, 1200000, 1400000, 1600000, 1800000, 2000000, 2200000, 2400000, 2600000, 2800000]

# calculate the regression line equation
n = len(house_size)
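# use floats so the products and sums below cannot overflow a 32-bit integer dtype
x = np.array(house_size, dtype=float)
y = np.array(price, dtype=float)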
xy = x * y
x_squared = x ** 2
m = (n * np.sum(xy) - np.sum(x) * np.sum(y)) / (n * np.sum(x_squared) - np.sum(x) ** 2)
c = (np.sum(y) - m * np.sum(x)) / n

# plot the data and regression line
plt.scatter(house_size, price)
plt.plot(house_size, m * x + c, color='red')
plt.xlabel('House size (square feet)')
plt.ylabel('Price (INR)')
plt.title('Regression line of house prices based on size')
plt.show()

# make a prediction using the regression line equation
house_size_pred = 1500
price_pred = m * house_size_pred + c
print(f'Predicted price of a {house_size_pred}-sqft house: {price_pred:.2f} INR')

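Output

Predicted price of a 1500-sqft house: 1500000.00 INR

In the plot, all ten data points lie on the red regression line, because the prices in this dataset are exactly proportional to the house sizes.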
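As a cross-check on the manual formulas, the same fit can be obtained from NumPy's built-in routines. This is a sketch rather than part of the original walkthrough: it uses np.polyfit with degree 1 for the slope and intercept, and np.corrcoef for the correlation coefficient.

```python
import numpy as np

# Same housing dataset as above
house_size = np.array([1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600, 2800], dtype=float)
price = np.array([1000000, 1200000, 1400000, 1600000, 1800000,
                  2000000, 2200000, 2400000, 2600000, 2800000], dtype=float)

# Degree-1 polynomial fit returns the coefficients as [slope, intercept]
slope, intercept = np.polyfit(house_size, price, 1)

# Off-diagonal entry of the 2x2 correlation matrix is r for the two variables
r = np.corrcoef(house_size, price)[0, 1]

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, r = {r:.4f}")
```

Because np.polyfit solves the problem numerically, its intercept may differ from the hand-computed value by a tiny floating-point error, so compare the results with a tolerance rather than exact equality.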

Thank you