Simple Linear Regression | Explanation and Code from scratch

Rishabh Agrawal
DS/AIwithRishabh

Introduction to Simple Linear Regression

Linear regression is a fundamental statistical technique used for understanding the relationship between two continuous variables. It’s widely used in various fields, including economics, biology, engineering, and social sciences, to model and predict outcomes. One of the simplest forms of linear regression is Simple Linear Regression, which involves a single independent variable and a dependent variable.

Why Linear Regression?

Linear regression is popular for several reasons:

  1. Simplicity: The linear relationship is straightforward and easy to understand.
  2. Interpretability: The results of linear regression are interpretable, providing insights into the relationship between variables.
  3. Predictive Power: Linear regression models can be used for prediction, allowing us to estimate the dependent variable based on the independent variable.
  4. Foundation for More Complex Models: It serves as the foundation for more complex regression models and machine learning algorithms.

Example: CGPA and Package Offers

To illustrate Simple Linear Regression, let’s consider a practical example involving students’ CGPA and their package offers after graduation. Here, the CGPA is the independent variable, and the package is the dependent variable.

Data Generation

We will generate a random dataset where:

  • The CGPA ranges from 5 to 10.
  • The package offers are linearly dependent on CGPA with some added noise.

Here’s the code to generate and plot the dataset:

import numpy as np
import pandas as pd
import plotly.graph_objects as go

# Seed for reproducibility
np.random.seed(42)

# Generate random CGPAs between 5 and 10 for 100 students
cgpa = np.random.uniform(5, 10, 100)

# Generate packages using a linear relationship with some noise
# Package = 5 + 1.5 * CGPA + noise
noise = np.random.normal(0, 2, 100)
package = 5 + 1.5 * cgpa + noise

# Create a DataFrame
data = pd.DataFrame({
    'CGPA': cgpa,
    'Package': package
})

# Plot the data using Plotly
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=data['CGPA'],
    y=data['Package'],
    mode='markers',
    name='Data Points'
))

fig.update_layout(
    title='Student CGPA vs Package',
    xaxis_title='CGPA',
    yaxis_title='Package (in LPA)',
    template='plotly_dark'
)

fig.show()

Simple Linear Regression Model

In Simple Linear Regression, we aim to fit a line that best describes the relationship between CGPA (independent variable) and package (dependent variable). The linear equation can be written as:

Package = β0 + β1 × CGPA + ϵ

or, to make it simpler:

Package (y) = m * CGPA (x) + b

Where:

  • β0 is the intercept (b in the simpler form), the predicted package when CGPA is zero.
  • β1 is the slope of the line (m), the change in package per unit change in CGPA.
  • ϵ (epsilon) is the error term, the noise the line cannot explain.
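For instance, with the true coefficients we used to generate the data (m = 1.5, b = 5), a student with a CGPA of 8 would be predicted a package of 5 + 1.5 × 8 = 17 LPA (before noise).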

Fitting the Model

So far, our aim has been to find a line that passes as close as possible to the points in the dataset. Let's make that idea precise.

The predicted value, written ŷ (y-hat), is given by:

ŷ = m * x + b

For each point, the residual is the gap between the true value yᵢ and the prediction ŷᵢ. We need the values of m and b that minimize the total error:

E = Σᵢ (yᵢ − ŷᵢ)²

i.e. the sum of squared residuals.

There are two common ways to calculate m and b:

  1. Ordinary Least Squares (a closed-form solution, typically used for low-dimensional data)
  2. Gradient Descent (an iterative method, typically used for high-dimensional data; a minimal sketch follows this list)
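For intuition, here is a minimal sketch of gradient descent for simple linear regression. The function name, learning rate, and epoch count are illustrative choices, not values from this post:

import numpy as np

def gradient_descent_slr(X, y, lr=0.001, epochs=50_000):
    '''Estimate m and b by iteratively descending the MSE gradient.'''
    m, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        y_hat = m * X + b
        # Gradients of the mean squared error with respect to m and b
        dm = (-2 / n) * np.sum(X * (y - y_hat))
        db = (-2 / n) * np.sum(y - y_hat)
        m -= lr * dm
        b -= lr * db
    return m, b

# Example usage on the generated data; unscaled features converge slowly,
# so feature scaling or more epochs may be needed in practice:
# m, b = gradient_descent_slr(cgpa, package)

In this post, though, we will implement the ordinary least squares solution.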

For the mathematics behind deriving m and b with ordinary least squares, go through this webpage — https://statproofbook.github.io/P/slr-ols.html

This method minimizes the sum of the squared differences between the observed values and the values predicted by the linear model.

The formulas for the slope (m) and intercept (b) are:

m (β1) = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²

b (β0) = ȳ − m × x̄

where x̄ and ȳ are the means of the CGPA and package values. The slope m (β1) is computed first, and the intercept b (β0) is then expressed in terms of it.

Having already created the dataset above, let's now split it into training and testing subsets. You can use the train_test_split() function from Scikit-learn to do so:

from sklearn.model_selection import train_test_split

X = data['CGPA']
y = data['Package']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Let’s write our own Mera_SLR class from scratch

class Mera_SLR:
    '''
    A class which implements a simple linear regression model
    '''
    def __init__(self):
        self.b = None
        self.m = None

    def fit(self, X, y):
        '''
        Calculates the slope and intercept coefficients

        :param X: array, single feature
        :param y: array, true values
        :return: None
        '''
        # Closed-form OLS solution derived above
        numerator = np.sum((X - np.mean(X)) * (y - np.mean(y)))
        denominator = np.sum((X - np.mean(X)) ** 2)

        self.m = numerator / denominator
        self.b = np.mean(y) - (self.m * np.mean(X))

    def predict(self, X):
        '''
        Makes predictions using the simple line equation
        :param X: array, single feature
        :return: array, predicted values
        '''
        # Check against None explicitly: `not self.b` would wrongly
        # trigger for a legitimate coefficient of 0
        if self.b is None or self.m is None:
            raise Exception('Please call `Mera_SLR.fit(X, y)` before making predictions.')
        return self.b + self.m * X

Finally, let’s make an instance of the Mera_SLR class, fit the training data, and make predictions on the test set. The following code snippet does just that, and also prints the values of b and m coefficients:

model = Mera_SLR()
model.fit(X_train, y_train)
preds = model.predict(X_test)

# Inspect the calculated coefficients b and m
print(model.b, model.m)
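As an optional sanity check, you can compare these coefficients against Scikit-learn's LinearRegression. Note that it expects a 2-D feature matrix, hence the reshape:

from sklearn.linear_model import LinearRegression

sk_model = LinearRegression()
# Scikit-learn expects a 2-D feature matrix, so reshape the single feature
sk_model.fit(X_train.values.reshape(-1, 1), y_train)

# These should closely match model.b and model.m
print(sk_model.intercept_, sk_model.coef_[0])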

And now we can draw our predicted line over the data.
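Here is a minimal sketch to plot it with Plotly, in the same style as the earlier chart:

# Overlay the fitted line on the scatter plot
line_x = np.linspace(data['CGPA'].min(), data['CGPA'].max(), 100)
line_y = model.predict(line_x)

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=data['CGPA'],
    y=data['Package'],
    mode='markers',
    name='Data Points'
))
fig.add_trace(go.Scatter(
    x=line_x,
    y=line_y,
    mode='lines',
    name='Regression Line'
))
fig.update_layout(
    title='Student CGPA vs Package with Fitted Line',
    xaxis_title='CGPA',
    yaxis_title='Package (in LPA)',
    template='plotly_dark'
)
fig.show()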

Now, to determine whether our model is good or bad, we need the help of regression metrics like MAE, MSE, RMSE, R² score, and adjusted R² score.
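As a quick preview, here is a minimal sketch computing them with Scikit-learn's metrics module (the adjusted R² formula is written out by hand, since Scikit-learn does not provide it directly):

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_test, preds)
mse = mean_squared_error(y_test, preds)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, preds)

# Adjusted R² penalizes R² for the number of features (k = 1 here)
n, k = len(y_test), 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(mae, mse, rmse, r2, adj_r2)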

We will discuss these metrics and their importance in upcoming blogs. Till then, make sure to follow my blog to support me, and let me know your views on my content.

Thanks for Reading, and please stay tuned to the blog if you’re interested in more machine learning from scratch articles.


Rishabh Agrawal
DS/AIwithRishabh

Hi everyone, I'm a data science enthusiast currently working on MLOps and creating a lot of content in the same domain.