Linear Regression End-to-End Analysis: Part 1

Devika Santhosh
4 min read · May 24, 2024


Photo by Dawid Zawiła on Unsplash

As the name suggests, linear regression is an algorithm that models the target as a linear combination of the features of the input data.

Problem Statement

We have a set of labelled data, say (xi, yi) for i = 1, 2, ..., N.

N is the size of the collection.

We want to build a model f(x) = wx + b as a linear combination of the features of x, with parameters w and b. Here w is a vector of weights and b is a real number (the bias).

The question is how to find values for w and b so that the prediction yhat = f(xi) is close to the true target yi for many, or ideally all, training examples (xi, yi).

Solution

To answer that question, let’s first take a look at how to measure how well a line fits the training data. To do that, we’re going to construct a cost function.
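
Concretely, the cost function we will construct here is the average squared error over the training set, with an extra factor of 1/2:

J(w, b) = (1 / (2N)) * Σ (f(xi) - yi)^2,   summing over i = 1, ..., N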

The prediction yhat minus the target yi is called the error.

The expression (f(xi) - yi)^2 in the above objective is called the loss function.

The average loss is called the cost function.

To build a cost function that does not automatically get bigger as the training set grows, we compute, by convention, the average squared error instead of the total squared error; that is what the division by N above achieves.

The extra division by 2 just makes some of the later calculations (such as the gradients) look neater; the cost function works whether or not you include it.

Possible queries

  1. Why is the loss in linear regression a quadratic function?

The squared (quadratic) loss is a convenient choice: it is smooth and differentiable everywhere, which makes minimization straightforward, and it penalizes large errors much more heavily than small ones. A related practical justification for the linear form of the model itself is simplicity: why use a complex model when you can use a simple one? Another consideration is that linear models rarely overfit.

Overfitting is the property of a model that predicts the labels of the training examples very well but frequently makes errors on examples the learning algorithm did not see during training.

2. How can we minimize the cost function?

Gradient descent is an algorithm that you can use to try to minimize any function, not just a cost function for linear regression.
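
As a rough illustration (not the method the notebook below uses; scikit-learn fits the model for us), here is a minimal NumPy sketch of gradient descent for this cost function. The function name, learning rate, and iteration count are my own choices:

import numpy as np

def gradient_descent(x, y, lr=0.01, n_iters=1000):
    # Minimize the averaged squared-error cost for f(x) = w*x + b
    # x and y: 1D NumPy arrays of equal length
    w, b = 0.0, 0.0
    N = len(x)
    for _ in range(n_iters):
        y_hat = w * x + b              # current predictions
        error = y_hat - y              # prediction minus target
        dw = np.sum(error * x) / N     # gradient of the cost w.r.t. w
        db = np.sum(error) / N         # gradient of the cost w.r.t. b
        w -= lr * dw                   # step downhill in w
        b -= lr * db                   # step downhill in b
    return w, b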

Now it's time to apply the theory to simple linear regression…

I will walk you through my code, incorporating a few modifications I picked up as I explored new techniques.

To understand the coding part, review my Kaggle notebook using a simple dataset on salary prediction here.

Step 1: Data Preprocessing

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('/kaggle/input/salary-dataset-simple-linear-regression/Salary_dataset.csv')
X = df.iloc[:, :1].values   # feature column(s), kept as a 2D array
y = df.iloc[:, 1].values    # target column, a 1D array

The iloc indexer selects rows and columns from a DataFrame by their integer positions. Using the slice :1 (rather than a single column index) ensures X is a 2D array.

This avoids shape errors when fitting the model: scikit-learn models require the feature array to be 2D.

Alternatively, selecting a single column such as df['salary'] gives a 1D Series, which you can convert to a 2D array with .values.reshape(-1, 1) so it matches the shape these models expect. That is why I used reshape in my notebook; a short example follows.
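
For instance (the column name 'salary' follows the explanation above; substitute the actual column name in your dataset):

col = df['salary']                    # selecting one column gives a 1D Series
col_2d = col.values.reshape(-1, 1)    # reshape to (N, 1) for use as a feature array
print(col.shape, col_2d.shape)        # (N,) versus (N, 1)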

from sklearn.model_selection import train_test_split

# Assuming X is your feature set and y is your target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

The random_state parameter is used in various functions and methods within machine learning libraries like scikit-learn to control the randomness involved in processes such as data splitting, random sampling, and initializing parameters. The primary purpose of setting a random_state is to ensure reproducibility of results.

Setting a random_state to a fixed integer value ensures that the sequence of random numbers generated is the same each time you run the code. This makes your results reproducible and easier to debug.
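
A quick way to see this, reusing the X and y defined above:

# Two splits with the same random_state produce identical partitions
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.2, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_te1, X_te2))   # True, so results are reproducible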

Step 2: Fitting Simple Linear Regression Model to the training set

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)   # fit returns the fitted estimator itself

# Print the coefficients
print("Intercept:", regressor.intercept_)
print("Coefficient:", regressor.coef_[0])

Step 3: Predicting the Result

# Make predictions
prediction = regressor.predict(X_test)

Step 4: Visualization

# Scatter plot of the training data
plt.scatter(X_train, y_train, color='brown')
# Fitted regression line over the training data
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.show()

Visualizing the test results

plt.scatter(X_test, y_test, color='red')
plt.plot(X_test, regressor.predict(X_test), color='blue')
plt.show()

Step 5: Metrics Calculation

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Calculate metrics
mae = mean_absolute_error(y_test, prediction)
mse = mean_squared_error(y_test, prediction)
r2 = r2_score(y_test, prediction)

# Print the metrics
print("Mean absolute error: %.2f" % mae)
print("Mean squared error: %.2f" % mse)
print("R2-score: %.2f" % r2)

This concludes the prediction and evaluation for simple linear regression.

Follow me for more insights😎
