Linear Regression in Machine Learning

Rizwana Yasmeen
13 min read · Jul 25, 2023


Linear regression is often referred to as a “boundary-based” method because it fits a linear boundary (a line, plane, or hyperplane) that maps input features to a continuous predicted output.

Regression is used to predict a real-valued target variable from data. So whenever the target column is real-valued, we can perform linear regression.

In linear regression, the goal is to find a linear equation that minimizes the distance between the predicted output and the actual output.

Linear regression tackles this regression task using nothing but the equation of a line.

Here, the “boundary” in the boundary-based method is nothing but a line.

Thinking in terms of dimensions: in one dimension, all we have is a point (a dot).

Extending to two dimensions, that point becomes the equation of a line, described by a slope and an intercept.

In 3 dimensions we get a plane, and in n dimensions we get a hyperplane.

The problem with n-dimensional data is that we cannot visualize it once there are many features; we can only visualize up to three dimensions. Beyond that, we use mathematics to work out what kinds of shapes we are getting in those higher dimensions.

When it comes to boundary-based methods, we deal with linear boundaries, which means we will be dealing with lines, planes, and hyperplanes.

So let's take an example: given the height of an individual, predict their weight.

Linear regression tries to find a line that fits the data in the best possible manner. This best-fit line has the equation

y = mx + c

We can draw many candidate lines, but linear regression will try to find the best-fit line, the one that captures the relationship between height and weight most closely.

The equation that linear regression produces for the above example is

weight = M*(height)+C

So if we know the values of M and C, we can predict a person's weight.

where “weight” represents the predicted output variable, “height” is the input feature, “M” is the slope, and “C” is the y-intercept (the value of the predicted output when the input feature is 0). This equation represents the relationship between the input feature and the predicted output, allowing us to make predictions based on the given data.
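
For instance, with hypothetical values of M and C (a real model would learn these from the data), the prediction is a one-line calculation. This is only an illustrative sketch, not fitted values:

```python
# Minimal sketch: predicting weight from height with assumed (hypothetical)
# slope M and intercept C. Real values would come from fitting the model.
M = 0.6    # hypothetical slope: kg per cm of height
C = -35.0  # hypothetical intercept

def predict_weight(height_cm):
    """Apply the line weight = M * height + C."""
    return M * height_cm + C

print(predict_weight(170))  # -> 67.0 kg with these illustrative numbers
```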

Now let's look at the equation of a Line/Plane/Hyperplane and see what these M and C values are.

Equation of Line/Plane/Hyperplane

The variables x and y represent the dimensions along the X-axis and the Y-axis, respectively. The X-axis values carry the input feature x, while the Y-axis values correspond to the predicted output y. The slope m is the tangent of the angle (θ) the line forms with the X-axis, as in a right-angled triangle. This tangent ratio tells us how the Y-axis values change with respect to the X-axis values, which is what lets us model the relationship between the variables and make predictions from this linear relationship.

Dropping a perpendicular onto the x-axis forms the 90-degree angle of that triangle, so using tan θ you can always find the angle the line makes with respect to the x-axis, and that value is nothing but the slope of the line (M = tan θ).

This θ is simply the angle between the line and the x-axis. The C is the intercept: the value of y where the line crosses the y-axis, i.e., the predicted output when x is zero. If C is zero, the line passes through the origin. The same idea carries over to higher dimensions, where the hyperplane still has a single intercept term.

The line meets the y-axis at the point (0, C).

Note

The slope is determined by the angle the line forms with the x-axis: take the tangent of that angle and you get the slope of the line. The intercept is simply the point on the Y-axis where the line cuts, or meets, the Y-axis.

We call this two-dimensional space a two-dimensional vector space because it considers not only magnitudes but also the direction in which the line is facing. The presence of both magnitude and direction gives rise to the concept of vectors, so whenever we encounter quantities characterized by both, we are essentially working within a two-dimensional vector space.

The parameters of a line are M and C. To draw a unique line in two dimensions, we need both of them.

There is an endless number of parallel lines that can be created by keeping the same angle and shifting the line. Because of this lack of uniqueness, we cannot identify a unique line from the angle (slope) alone. To define a line uniquely, we need both the slope M and the intercept C. The same holds in n-dimensional space, where we need multiple slopes (one per dimension) and one intercept to determine a specific hyperplane.

Boundary-based methods, like other machine learning algorithms, learn from data by utilizing various statistical and mathematical techniques. The process typically involves performing exploratory data analysis (EDA) to gain insight into the data, identifying patterns, and then representing those patterns mathematically to create a model.

During exploratory data analysis (EDA), we analyze the data to understand its distribution, calculate measures like the mean and standard deviation, and identify any outliers or missing values. Understanding the data distribution helps in gaining insights into the underlying characteristics of the dataset.

In bivariate analysis, we investigate the relationship between two continuous variables using statistical measures like the Pearson correlation coefficient, which quantifies the strength and direction of the linear relationship between them. The coefficient ranges from -1 to +1: values near +1 indicate a strong positive correlation, values near -1 a strong negative correlation, and a value of 0 indicates no linear relationship between the variables.

The sign indicates the direction of the proportionality: a positive coefficient (between 0 and +1) means the variables are directly proportional, while a negative coefficient (between -1 and 0) means they are inversely proportional.

Using the rate of change, we can identify the relationship between the variables.

The formula for the Pearson correlation coefficient:

ρ = Σᵢ (xᵢ − μx)(yᵢ − μy) / (n · σx · σy)

where μx is the mean of the x variable, μy is the mean of the y variable, xᵢ is each observation in x, yᵢ is each observation in y, σx is the standard deviation of x, and σy is the standard deviation of y. The range of ρ is −1 ≤ ρ ≤ +1.
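
As a quick illustration, the coefficient can be computed directly from this formula with NumPy. The small height/weight arrays below are made-up values, not real data:

```python
import numpy as np

# Minimal sketch: Pearson correlation between two variables, written out
# from the formula above (means, deviations, standard deviations).
x = np.array([150., 160., 165., 172., 180.])   # e.g. heights (hypothetical data)
y = np.array([52.,  58.,  63.,  70.,  77.])    # e.g. weights (hypothetical data)

mu_x, mu_y = x.mean(), y.mean()
sigma_x, sigma_y = x.std(), y.std()            # population standard deviations

rho = np.mean((x - mu_x) * (y - mu_y)) / (sigma_x * sigma_y)
print(rho)                                     # close to +1: strong positive correlation
print(np.corrcoef(x, y)[0, 1])                 # NumPy's built-in gives the same value
```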

Rate of change:

The rate of change refers to how much one variable (usually the dependent variable) changes for a unit change in the other variable (usually the independent variable). The Pearson correlation coefficient cannot give us this slope, and this is where linear regression comes in: in the context of linear regression, the rate of change is nothing but the slope (m).

Linear regression finds a straight line that best fits the relationship between two variables. The slope represents the change in the dependent variable (y) for a unit change in the independent variable (x), and it determines the steepness, or angle of inclination, of the line.

A high correlation coefficient (ρ) indicates a strong linear relationship between two variables, and consequently, the slope (m) will be significant, representing a substantial rate of change between the variables. Conversely, a low correlation coefficient (ρ) indicates a weak or no linear relationship, and the slope (m) will be close to zero, suggesting a negligible rate of change between the variables.

A disadvantage of the Pearson correlation coefficient is that it fails on non-linear data, and it cannot detect the rate of change, i.e., the slope.

So linear regression finds the line that best fits the data.

Equation of a Line (2D):

ax + by + c = 0 (a, b and c are constants, x and y are 2 dimensions)

Equation of a Plane (3D):

ax + by + cz + d = 0 (a, b, c, and d are constants, x, y, and z are 3 dimensions)

Equation of a Hyperplane (5D):

y = w₁x₁ + w₂x₂ + w₃x₃ + w₄x₄ + w₀ (4 slopes and 1 intercept; y together with x₁…x₄ gives the 5 dimensions)

Equation of a Hyperplane (100D):

y = w₁x₁ + w₂x₂ + … + w₉₉x₉₉ + w₀

A hyperplane in 100 dimensions therefore has 99 slopes and 1 intercept.

General equation of a Line/Plane/Hyperplane:

y = w₁x₁ + w₂x₂ + … + wdxd + w₀

Here w₀ is the intercept, the same in all dimensions, and the other wᵢ are the slopes. A D-dimensional hyperplane has D − 1 slopes (so d = D − 1) and 1 intercept.

The same formula can be written compactly using the dot product in vector space:

y = w · x + w₀, or equivalently y = wᵀx + w₀

In this context, w and x are both d-dimensional vectors.

Vector Space and Dot Product

Equation of a hyperplane in D dimensions:

y = w₁x₁ + w₂x₂ + w₃x₃ + … + wdxd + w₀

The expression w₁x₁ + w₂x₂ + w₃x₃ + … + wdxd + w₀ represents a linear combination of the variables x₁, x₂, x₃, …, xd with their respective coefficients w₁, w₂, w₃, …, wd, and an additional constant term or bias term w₀. This is a standard way to represent linear models, such as linear regression, where we multiply the input variables by their corresponding weights, sum the products, and add a bias term to get the final prediction or output.

So the hyperplane equation can be written in the three equivalent forms shown above: the expanded sum, the dot product w · x + w₀, and the matrix form wᵀx + w₀.
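
As a small sanity check, the three forms give the same number for any concrete w, x, and w₀; the values below are arbitrary choices for illustration:

```python
import numpy as np

# Minimal sketch: the three equivalent ways of writing the hyperplane equation
# for one input point x with weights w and bias w0 (all values hypothetical).
w  = np.array([0.4, -1.2, 0.7])   # slopes w1..wd
x  = np.array([2.0,  1.0, 3.0])   # one d-dimensional input
w0 = 0.5                          # intercept / bias term

y1 = sum(w_i * x_i for w_i, x_i in zip(w, x)) + w0   # expanded sum  w1x1 + ... + wdxd + w0
y2 = np.dot(w, x) + w0                               # dot-product form  w . x + w0
y3 = w.T @ x + w0                                    # matrix form  w^T x + w0

print(y1, y2, y3)   # all three print the same prediction
```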

This equation underlies both linear regression and logistic regression. In linear regression it defines the line being fitted to the data; without the equation of a line, we cannot determine the rate of change.

Linear regression finds the line that best fits the historical data, where “line” means a slope and an intercept (M and C).

Best Fit is nothing but the line with the minimum average squared error.

Minimum average squared error

In simple linear regression, the goal is to find the best-fitting line (a straight line) that represents the relationship between two variables: a dependent variable (Y) and an independent variable (X) (1 input and 1 output). The best-fitting line is the one that minimizes the overall error between the observed data points and the predicted values on the line.
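
A minimal sketch of what “average squared error” means for a candidate line; the data and the two candidate (m, c) pairs are made up for illustration:

```python
import numpy as np

# Minimal sketch: "average squared error" of a candidate line (m, c) on the data.
# The best-fit line is the (m, c) pair that makes this number as small as possible.
def mean_squared_error(m, c, x, y):
    y_pred = m * x + c                 # predictions from the candidate line
    return np.mean((y - y_pred) ** 2)  # average of the squared residuals

x = np.array([150., 160., 165., 172., 180.])   # hypothetical heights
y = np.array([52.,  58.,  63.,  70.,  77.])    # hypothetical weights

print(mean_squared_error(0.8, -70.0, x, y))    # error of one candidate line
print(mean_squared_error(0.1,  10.0, x, y))    # a worse line gives a larger error
```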

Multiple Linear Regression is an extension of simple linear regression that allows us to model the relationship between a dependent variable (we want to predict) and multiple independent variables (predictor variables). In multiple linear regression, we try to find the best-fitting hyperplane (a multi-dimensional plane) through the data points in a higher-dimensional space.

Though not a strict requirement, it is advisable to keep the input features independent of one another for linear regression, i.e., to avoid multicollinearity among them. This helps preserve the interpretability of the model.

How do we find the best line? Using gradient descent.

Gradient descent

Gradient Descent is a widely used optimization approach in machine learning. It minimizes the cost function by iteratively adjusting the model's parameters so as to reduce the error between actual and predicted outcomes; its primary goal is to minimize a convex cost function through parameter iteration.

Choosing a slower learning rate allows the algorithm to converge to the global minimum, but it can be computationally expensive and time-consuming. On the other hand, a faster learning rate may cause the model to overshoot and end up in an undesired position, making it hard to get back on track towards the global minimum. Therefore, an appropriate learning rate should be selected, neither too slow nor too fast, so the optimization reaches the global minimum efficiently.

Steps Required in Gradient Descent Algorithm

1. First, initialize any random line.

2. Find the error.

3. Then try to change the slope and intercept such that the error reduces.

The cost function, also known as the error function or loss function, is a measure of how well the model’s predictions match the actual target values in the training data. The goal of linear regression is to find the best-fitting line that minimizes the difference between the predicted values and the true target values.

The properties of the cost function are continuous and convex.
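
Here is a minimal sketch of these steps for simple linear regression, using the mean squared error as the cost function. The data, the feature scaling, the learning rate, and the iteration count are illustrative assumptions, not tuned values:

```python
import numpy as np

# Minimal sketch of the three steps above: start from an arbitrary line,
# measure the error, and nudge the slope and intercept downhill.
x = np.array([150., 160., 165., 172., 180.])   # hypothetical heights
y = np.array([52.,  58.,  63.,  70.,  77.])    # hypothetical weights

# scale x so the gradient steps are well behaved (an assumption of this sketch)
x_scaled = (x - x.mean()) / x.std()

m, c = 0.0, 0.0            # step 1: start with an arbitrary line
learning_rate = 0.1
for _ in range(1000):
    y_pred = m * x_scaled + c
    error = y_pred - y                       # step 2: residuals of the current line
    grad_m = 2 * np.mean(error * x_scaled)   # dMSE/dm
    grad_c = 2 * np.mean(error)              # dMSE/dc
    m -= learning_rate * grad_m              # step 3: move against the gradient
    c -= learning_rate * grad_c

print(m, c)  # slope and intercept of the (approximately) best-fit line on scaled x
```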

Steps for Linear Regression (a short worked sketch with scikit-learn follows this list):

1. EDA

2. Understanding the problem statement (input and output)

3. Train the algorithm using Linear Regression

4. Error/Residual analysis on training data

a. Distribution of residuals:

The distribution should be normal/Gaussian with zero mean.

b. IID (Independent and Identically Distributed):

Check the residuals for patterns; there should be none.

c. Homoscedasticity:

The variance of the residuals should be the same for any value of x.

5. Predictions

6. Evaluation metrics
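
Putting steps 3 to 6 together, here is a minimal sketch with scikit-learn. The tiny height/weight dataset is made up, and these particular checks are only one simple way to carry out the residual analysis and evaluation described above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Minimal sketch of steps 3-6 with scikit-learn (hypothetical data).
X = np.array([[150.], [160.], [165.], [172.], [180.]])   # input feature(s)
y = np.array([52., 58., 63., 70., 77.])                   # target

model = LinearRegression().fit(X, y)          # step 3: train
residuals = y - model.predict(X)              # step 4: residual analysis
print(residuals.mean())                       # should be close to zero
print(residuals.var())                        # eyeball for roughly constant variance

print(model.predict([[175.]]))                # step 5: prediction for a new height
print(model.score(X, y))                      # step 6: R^2 as an evaluation metric
```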

We can draw an infinite number of lines through the given data, but by using the gradient descent optimizer on the linear regression optimization objective we can obtain the best-fit line.

Gradient descent is an iterative algorithm that helps in solving the optimization equation (with a convex cost function).

Assumptions of Linear regression:

1. Linearity Assumption: Input and output must have a linear relationship.

2. Residuals should follow a normal distribution with zero mean.

3. IID: Residuals should be independent of each other and follow identical distribution.

4. Homoscedasticity: Residuals should follow a constant variance.

5. Independence of observations: the data points themselves should be independent of each other.

6. Multicollinearity check: the input features should not be strongly correlated with one another (a short sketch of this check follows the list).
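
One common way to perform the multicollinearity check is with variance inflation factors (VIF). The sketch below uses statsmodels for this; the feature names and values are hypothetical:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Minimal sketch of a multicollinearity check with variance inflation factors.
# A common rule of thumb is that a VIF well above 5-10 flags problematic collinearity.
X = pd.DataFrame({
    "height_cm": [150., 160., 165., 172., 180.],
    "shoe_size": [37., 40., 41., 43., 45.],     # likely highly correlated with height
    "age_years": [25., 32., 41., 29., 35.],
})

X_const = sm.add_constant(X)                     # add intercept column before computing VIFs
for i, col in enumerate(X_const.columns):
    if col != "const":
        print(col, variance_inflation_factor(X_const.values, i))
```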

Advantages:

Simple implementation: Implementing linear regression is simple, and it is easier to interpret the output coefficients.

Linear regression is highly effective at fitting datasets with linearly separable patterns, and it is frequently utilized to understand the underlying nature of relationships between variables.

Once the model is trained, making predictions becomes very fast and efficient.

Regularization helps reduce overfitting. Overfitting occurs when a machine learning model fits the training data extremely tightly and therefore also models noisy data, which has a detrimental effect on the model's performance and reduces its test-set accuracy. Regularization is a technique that is easy to implement and can effectively reduce the complexity of the fitted function, lowering the risk of overfitting (a short sketch follows this list of advantages).

Feature importance: Linear regression can be used for feature selection or variable ranking, as it assigns coefficients to each predictor, indicating their relative importance in predicting the target variable.
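
As a small illustration of the regularization point above, here is a sketch comparing ordinary least squares with ridge (L2-regularized) regression in scikit-learn. The data and the alpha value are arbitrary choices for demonstration, not tuned settings:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Minimal sketch: the same hypothetical data fitted with and without L2 regularization.
# Ridge shrinks the coefficients, which is one simple way to curb overfitting.
X = np.array([[150.], [160.], [165.], [172.], [180.]])
y = np.array([52., 58., 63., 70., 77.])

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print(plain.coef_, plain.intercept_)   # unregularized slope and intercept
print(ridge.coef_, ridge.intercept_)   # slope shrunk toward zero by the penalty
```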

Disadvantages:

The training phase of linear regression can be time-consuming, since it searches for the best-fit line among infinitely many candidates.

Because the boundaries produced by linear regression are linear, outliers can have a significant impact on the regression.

Linear regression is susceptible to underfitting, which is a situation where the model fails to adequately capture the underlying patterns in the data. This typically happens when the linear function is too simple to accurately represent the relationships within the data.

Sensitive to outliers: An outlier is an extreme value that deviates markedly from the other data points in the distribution. Outliers can seriously affect a machine learning model's performance and often result in models with poor accuracy.

Linear regression assumes that the data points are independent of each other. It also assumes there is no multicollinearity among the features, so any multicollinearity must be removed (or reduced) before applying linear regression.

Applications of linear regression

Finance: In finance, linear regression is employed to model stock returns, asset pricing, risk analysis, and portfolio optimization.

Economics: Linear regression is widely used in economics to analyze the relationship between economic variables such as demand and price, GDP and unemployment, or inflation and interest rates.

Marketing: Linear regression helps marketers understand the impact of marketing campaigns on sales and customer behavior.

Real Estate: Linear regression is employed in the real estate industry to predict housing prices based on various features such as location, size, and amenities.

Sports Analytics: Linear regression is used in sports analytics to evaluate player performance, assess team strategies, and predict match outcomes.

Conclusion

The linear regression model has only a handful of parameters, a slope for each input feature and an intercept, and it establishes a linear relationship between the dependent and independent variables. By employing the cost function, we can determine the optimal values for the intercept and slope(s), leading to the best-fit line for the given data points. Using Gradient Descent, the cost function is iteratively minimized in the direction of steepest descent, and the learning rate plays a crucial role in this optimization process.

Linear regression proves to be effective when the relationship between the dependent and independent variables follows a linear pattern. Despite its flexibility, it is essential to assess the dataset for its assumptions to determine whether linear regression is a suitable choice for modeling the data accurately. By understanding the principles of linear regression and its underlying assumptions, data analysts can make informed decisions about its applicability and ensure meaningful results in their analyses.

Thank you for reading. Please let me know if you have any feedback.

My Other Posts

K-Nearest Neighbor(KNN) Algorithm in Machine Learning

Naïve Bayes Algorithm | Maximum A Posteriori in Machine Learning
