Machine Learning Algorithm: Linear Regression
Background
The origins of linear regression can be traced back to the 18th century, when mathematician and astronomer Pierre-Simon Laplace used regression analysis to study the relationship between astronomical observations and celestial motion. However, it was not until the 19th century that regression analysis was formalized and developed as a statistical method.
One of the earliest and most influential contributions to the development of linear regression was made by the English statistician Francis Galton, who introduced the concept of regression while studying the relationship between the heights of parents and their children. Galton's work laid the foundation for the use of regression analysis in genetics and biology.
In the early 19th century, the German mathematician Carl Friedrich Gauss and the French mathematician Adrien-Marie Legendre made important contributions to the development of regression analysis, including the formulation of the method of least squares, which is still widely used in regression analysis today.
In the 1920s and 1930s, the development of linear regression was further advanced by the work of the British statistician Ronald A. Fisher, who made important contributions to the design of experiments, hypothesis testing, and the use of regression analysis in the social sciences.
Today, linear regression is widely used in many fields, including economics, finance, engineering, biology, and psychology, and is considered a fundamental tool in predictive modeling and data analysis.
The Basics
Linear Regression is a simple yet powerful machine learning algorithm used for predictive modeling.
It is a technique used for predicting continuous target variables.
It is a supervised learning algorithm that tries to fit a linear equation to the relationship between a dependent variable (target) and one or more independent variables (predictors).
The main objective of linear regression is to minimize the sum of squared differences between the observed target values and the predicted target values (the least-squares criterion).
The basic equation for a simple linear regression model with a single predictor variable x is:
y = b0 + b1 * x
where:
y is the target variable,
x is the predictor variable,
b0 is the intercept, and
b1 is the slope or coefficient of the predictor variable.
For multiple linear regression with more than one predictor variable, the equation becomes:
y = b0 + b1 * x1 + b2 * x2 + … + bn * xn
where
x1, x2, …, xn are the n predictor variables and
b1, b2, …, bn are the corresponding coefficients.
Linear regression is particularly useful when you want to understand the relationship between predictor variables and a target variable and make predictions based on that relationship.
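To make the notation concrete, here is a minimal sketch (with made-up values for b0 and b1, not fitted ones) that evaluates the simple regression equation for a few inputs:
import numpy as np
# Hypothetical parameters for y = b0 + b1 * x, chosen purely for illustration
b0, b1 = 2.0, 3.0
x = np.array([0.0, 1.0, 2.5])
# Evaluate the regression equation at each x
y_pred = b0 + b1 * x
print(y_pred)  # [2.  5.  9.5]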
Linear Regression Algorithms
Some popular linear regression algorithms that can be used for predicting a continuous target variable based on one or more predictor variables are listed below; a minimal usage sketch follows the list.
1. Ordinary Least Squares (OLS) Regression
2. Ridge Regression
3. Lasso Regression
4. Elastic Net Regression
5. Bayesian Linear Regression
6. Polynomial Regression
7. Stepwise Regression
8. Theil-Sen Regression
9. Principal Component Regression (PCR)
10. Support Vector Regression (SVR)
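Most of these estimators are available in scikit-learn and expose the same fit()/predict() interface as LinearRegression. A minimal usage sketch follows; the hyperparameter values shown are illustrative assumptions, not tuned choices:
from sklearn.linear_model import Ridge, Lasso, ElasticNet, BayesianRidge, TheilSenRegressor
from sklearn.svm import SVR
# Illustrative instantiations; the alpha, l1_ratio, and C values are assumptions
models = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "elastic_net": ElasticNet(alpha=0.1, l1_ratio=0.5),
    "bayesian": BayesianRidge(),
    "theil_sen": TheilSenRegressor(),
    "svr": SVR(kernel="linear", C=1.0),
}
# Each model can then be used as, for example, models["ridge"].fit(X, y)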
Implementation of Linear Regression in Python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# Load the data
data = pd.read_csv('data.csv')
# Split the data into target(dependent) and predictor (independent) variables
X = data[['predictor_variable']]
y = data['target_variable']
# Create the linear regression object
reg = LinearRegression()
# Fit the regression model to the data
reg.fit(X, y)
# Get the intercept and coefficient
intercept = reg.intercept_
coefficient = reg.coef_[0]
# Make predictions using the model
y_pred = reg.predict(X)
# Calculate the mean squared error to assess the performance of the model
mse = np.mean((y_pred - y) ** 2)
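Equivalently, the same error can be computed with scikit-learn's built-in metric, reusing the y and y_pred variables from the snippet above:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y, y_pred)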
# For multiple linear regression with multiple predictor variables
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# Load the data
data = pd.read_csv('data.csv')
# Split the data into target and predictor variables
X = data[['predictor_variable_1', 'predictor_variable_2', ..., 'predictor_variable_n']]
y = data['target_variable']
# Create the linear regression object
reg = LinearRegression()
# Fit the regression model to the data
reg.fit(X, y)
# Get the intercept and coefficients
intercept = reg.intercept_
coefficients = reg.coef_
# Make predictions using the model
y_pred = reg.predict(X)
# Calculate the mean squared error of the model
mse = np.mean((y_pred - y) ** 2)
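With several predictors, it can help to print each coefficient next to its column name. A small sketch, assuming the placeholder column names used above:
# Pair each predictor's (placeholder) name with its fitted coefficient
for name, coef in zip(X.columns, coefficients):
    print(f"{name}: {coef:.4f}")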
Limitations of Linear Regression
Linear regression is a simple and powerful tool for modeling the relationship between a dependent variable and one or more independent variables. However, it also has some limitations and drawbacks that must be considered:
- Linearity assumption: Linear regression assumes that the relationship between the independent and dependent variables is linear. This means that if the relationship between the variables is non-linear, the results of the regression will be incorrect and misleading.
- Outliers: Linear regression is sensitive to outliers, which can have a significant impact on the regression results. Outliers can be caused by errors in the data or by extreme values that are not representative of the underlying relationship between the variables.
- Collinearity: Linear regression assumes that the independent variables are not highly correlated with each other. This is known as the assumption of independence or lack of collinearity. If the independent variables are highly correlated, the regression results can be unstable and difficult to interpret (a small sketch of this effect follows the list).
- Overfitting: Overfitting occurs when the model fits the training data too well, but fails to generalize to new data. This can result in a model that is overly complex and has poor predictive power.
- Limited complexity: Linear regression is limited in its ability to model complex relationships between variables. For example, it cannot capture non-linear relationships or interactions between variables.
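As a small illustration of the collinearity point above (a sketch on synthetic data, not part of the original example), two nearly identical predictors make the individual OLS coefficients swing from sample to sample even though their sum stays close to the true combined effect:
import numpy as np
from sklearn.linear_model import LinearRegression
for seed in range(3):
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=200)
    x2 = x1 + rng.normal(scale=0.01, size=200)  # nearly collinear with x1
    y = 2 * x1 + 3 * x2 + rng.normal(scale=0.5, size=200)
    X = np.column_stack([x1, x2])
    coefs = LinearRegression().fit(X, y).coef_
    # Individual coefficients vary widely across seeds; their sum stays near 5
    print(f"seed={seed}  b1={coefs[0]:.2f}  b2={coefs[1]:.2f}  sum={coefs.sum():.2f}")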
Example of Linear Regression
Step 1: Generating data and plotting it
In this example, we generate 100 samples of the independent variable x using numpy.random.rand(). We also generate some random noise using numpy.random.normal(). We define the true underlying relationship between x and y to be y_true = 3 * x + 2, and we add the noise to y_true to generate the observed dependent variable y_observed. Finally, we plot the data using matplotlib.pyplot.scatter(). This generates a scatter plot of x versus y_observed.
import numpy as np
import matplotlib.pyplot as plt
# Set the random seed for reproducibility
np.random.seed(0)
# Generate some random data for the independent variable
x = np.random.rand(100, 1)
# Generate some random noise for the dependent variable
noise = np.random.normal(0, 0.1, size=(100, 1))
# Define the true underlying relationship between x and y
y_true = 3 * x + 2
# Add the noise to y_true to generate the observed dependent variable
y_observed = y_true + noise
# Plot the data
plt.scatter(x, y_observed, s=10)
plt.xlabel('x')
plt.ylabel('y')
plt.show()
# The plot of the generated data can be seen in the image below
Fig: scatter plot of the generated data
Step 2: Fitting a linear regression model to the generated data using the Ordinary Least Squares (OLS) algorithm
In this example, we first generated the data for the independent variable x, the noise for the dependent variable, and the observed dependent variable y_observed, as in the previous example. Then, we created an instance of the LinearRegression model from the sklearn.linear_model module, which implements the OLS algorithm. We fit the model to the data using the fit() method, passing the x and y_observed arrays as arguments. Finally, we printed the model coefficients, which include the intercept and the slope of the linear regression line.
import numpy as np
from sklearn.linear_model import LinearRegression
# Generate some random data for the independent variable
x = np.random.rand(100, 1)
# Generate some random noise for the dependent variable
noise = np.random.normal(0, 0.1, size=(100, 1))
# Define the true underlying relationship between x and y
y_true = 3 * x + 2
# Add the noise to y_true to generate the observed dependent variable
y_observed = y_true + noise
# Create an instance of the linear regression model
model = LinearRegression()
# Fit the model to the data
model.fit(x, y_observed)
# Print the model coefficients
print("Intercept: ", model.intercept_)
print("Slope: ", model.coef_[0])
Output:
Intercept: [2.01018978]
Slope: [2.95599675]
Step 3: Plotting the linear regression line fitted by the OLS algorithm on the generated data
We first generate the data for x, y_true, noise, and y_observed using the same code as in the previous examples. We then create an instance of the LinearRegression model and fit it to the data using the fit() method. The predict() method of the model is used to generate the predicted values of the dependent variable for each value of x. We use these predicted values to plot the linear regression line along with the scatter plot of the data. The linear regression line is plotted using the plot() method of matplotlib.pyplot. We pass x as the x-axis values and model.predict(x) as the y-axis values to generate the line. The color parameter is used to set the color of the line to red for better visibility. Finally, we add the axis labels and call the show() method of matplotlib.pyplot to display the plot.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Generate some random data for the independent variable
x = np.random.rand(100, 1)
# Generate some random noise for the dependent variable
noise = np.random.normal(0, 0.1, size=(100, 1))
# Define the true underlying relationship between x and y
y_true = 3 * x + 2
# Add the noise to y_true to generate the observed dependent variable
y_observed = y_true + noise
# Create an instance of the linear regression model
model = LinearRegression()
# Fit the model to the data
model.fit(x, y_observed)
# Plot the data and the linear regression line
plt.scatter(x, y_observed, s=10)
plt.plot(x, model.predict(x), color='r')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
The resulting plot shows the scatter plot of the data along with the linear regression line fitted using the OLS algorithm.
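As a quick numerical check of the fit quality (a small addition, not part of the original script), the model's R² score on the same data can be printed:
# Coefficient of determination (R^2) on the training data
print("R^2:", model.score(x, y_observed))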
Linear Regression Facts: beyond the ordinary
- Linear regression is not just for straight lines: Despite its name, linear regression can be used to model relationships that are not necessarily straight lines. For example, polynomial regression can be used to model relationships that are curved (a short sketch follows this list), while logistic regression can be used to model binary outcomes.
- Linear regression can be used for classification: Linear regression can be used for classification problems, such as determining whether a customer is likely to make a purchase or not. This is done by transforming the dependent variable into a binary outcome, such as 0 or 1, and using linear regression to model the relationship between the independent and dependent variables.
- Linear regression can handle non-linear relationships: Linear regression can handle non-linear relationships between variables by transforming the independent variables. For example, taking the logarithm of the independent variables can help to linearize the relationship and improve the results of the regression.
- Linear regression can be used for time series analysis: Linear regression can be used for time series analysis, such as forecasting sales or stock prices. This is done by using the time variable as the independent variable and the dependent variable as the value to be predicted.
- Linear regression can be used for feature selection: Linear regression can be used for feature selection by examining the coefficients of the independent variables. Features with a large coefficient (when the predictors are standardized to a comparable scale) contribute more to explaining the variation in the dependent variable and can be used to identify the most important predictors.
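To illustrate the first point above, here is a minimal polynomial-regression sketch on synthetic data; the degree and the data-generating function are assumptions made purely for illustration. A linear model fit on polynomial features of x captures a curved relationship:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
# Assumed curved ground truth: y = 0.5 * x^2 - x + noise
y = 0.5 * x[:, 0] ** 2 - x[:, 0] + rng.normal(scale=0.2, size=200)
# Ordinary linear regression on degree-2 polynomial features of x
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print("R^2:", model.score(x, y))  # close to 1 because the quadratic form matches the data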
Cheers for reading!!!
Please clap and subscribe!