Linear Regression From Scratch: 3 Methods

Alex
8 min read · Feb 25, 2022


Introduction:

In the first part of this article series, we implemented linear regression using gradient descent. In this second part, we will implement it using the maximum likelihood estimator (MLE) and maximum a posteriori (MAP) estimation.

I encourage you to watch this video, “MLE”, before reading so that you will have a better understanding of what will be discussed in this article.

Article Outline:

  • Data set-up
  • Maximum Likelihood Estimator
  • Maximum a Posteriori
  • Summary
  • What’s next

Data set-up:

Here is a copy of the same data used in the first part of the article series.

In machine learning, it is common to split our data into 20% testing and 80% training. You can visit this article for more information on this.
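The original data snippet is not reproduced here, so below is a minimal sketch of a comparable set-up, assuming made-up house-area and price values stored as NumPy column vectors and an 80/20 random split. The variable names (X, y, X_train, X_test, y_train, y_test) match the ones used by the code later in this article.

import numpy as np

# Made-up house areas (ft^2) and prices ($); the real values come from
# the first article in this series.
X = np.array([[650], [785], [1200], [1400], [1540],
              [1650], [1725], [1850], [2100], [2450]], dtype=float)
y = np.array([[77000], [94000], [142000], [168000], [182000],
              [196000], [204000], [221000], [250000], [291000]], dtype=float)

# Shuffle, then split into 80% training and 20% testing
rng = np.random.default_rng(0)
idx = rng.permutation(X.shape[0])
split = int(0.8 * X.shape[0])
X_train, X_test = X[idx[:split]], X[idx[split:]]
y_train, y_test = y[idx[:split]], y[idx[split:]]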

Maximum Likelihood Estimator:

Maximum Likelihood Estimator (MLE) is an approach to finding the distribution that best fits some data. In linear regression, we also use it to find the best parameters theta = {m, b} for our line y = mx + b.

Consider our x data on its own:

import matplotlib.pyplot as plt
samples = X.shape[0]
zero_axis = np.zeros(samples)
plt.scatter(X,zero_axis)
plt.xlabel('House area (ft^2)')
plt.show()

Our x data can be represented as a normal distribution: a bell-shaped curve described by two parameters, the mean (mu) and the standard deviation (sigma).

import scipy.stats as stats
import math
mean = np.mean(X)
variance = np.var(X)
std = math.sqrt(variance)  # standard deviation is the square root of the variance
samples = X.shape[0]

# generate sample points up to 3 standard deviations from the mean
data = np.linspace(mean - 3*std, mean + 3*std, samples)
plt.plot(data, stats.norm.pdf(data, mean, std))
plt.xlabel('House area (ft^2)')
plt.title("X distribution")
plt.show()

The likelihood that a given mean and standard deviation fit a sample data point can be represented by the Gaussian probability density function (PDF).

Gaussian PDF equation: p(x | μ, σ) = 1/√(2πσ²) · exp(−(x − μ)² / (2σ²))

Suppose we want to find the likelihood that some distribution with a random mean and std fits a single data point in our dataset:

Graph of likelihood from Gaussian PDF.

The likelihood of this distribution fitting our single data point x is 0.0024%. To find the likelihood of this distribution fitting all of our data, we multiply the individual likelihoods, since the data points are assumed to be independent.

This is also equivalent to the expression below:

Mathematical expression for independent events: p(x₁, …, xₙ | μ, σ) = p(x₁ | μ, σ) · p(x₂ | μ, σ) ⋯ p(xₙ | μ, σ)
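As a rough sketch of this calculation (the candidate mean, standard deviation, and data point below are illustrative, not the ones in the figure):

import numpy as np
import scipy.stats as stats

# an arbitrary candidate distribution
mean, std = 1500.0, 400.0

# likelihood of this distribution for a single data point
x_single = 650.0
likelihood_single = stats.norm.pdf(x_single, mean, std)

# likelihood for the whole dataset: the product of the individual
# likelihoods, since the data points are assumed independent
likelihood_all = np.prod(stats.norm.pdf(X, mean, std))

print(likelihood_single, likelihood_all)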

If we were to try all possible mean values for the same std in our Gaussian PDF, we would get the graph below:

Gaussian PDF for mean
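Here is a minimal sketch of how such a curve can be produced: sweep a range of candidate means while keeping the standard deviation fixed, and compute the likelihood of the data for each candidate (the sweep range below is an assumption, not the values behind the figure above).

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

std = np.std(X)                                     # fixed standard deviation
candidate_means = np.linspace(X.min(), X.max(), 200)

# likelihood of all the data for each candidate mean
likelihoods = [np.prod(stats.norm.pdf(X, m, std)) for m in candidate_means]

plt.plot(candidate_means, likelihoods)
plt.xlabel('Candidate mean')
plt.ylabel('Likelihood of the data')
plt.show()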

Our goal is to find the best mean value. The best mean value is where the slope is 0 as illustrated below.

Graph of the maximum mean value.

To find the best parameters for our machine learning model to predict new data, we will need to do a few variable changes to our Gaussian PDF:

We are interested in finding the best distribution not for our features (independent variables x) but for our label (dependent variable y).

Our mean is now represented by mx + b. We do this so that, once we find our parameters theta = {m, b}, we can predict an output y, or fit a distribution over y, for any input x.

We are interested in finding the best parameters, which sit where the slope of the likelihood with respect to them is 0. This can be expressed mathematically by this equation:

Argmax of the likelihood: θ* = argmax over θ of ∏ᵢ p(yᵢ | xᵢ, θ)

To find where the slope is 0, we need to differentiate this equation. However, it is much easier to work with logs so we will take the log of this:

Monotonic transformation.

The maximum of the likelihood (the Gaussian PDF) is at the same location as the maximum of its log, even though the values differ, because the log is a monotonic transformation.

It is common to use the negative log-likelihood version, as shown below:

Negative log-likelihood: NLL(θ) = −∑ᵢ log p(yᵢ | xᵢ, θ)

The maximum of the log-likelihood is at the same location as the minimum of the negative log-likelihood, just as the maximum of −x² and the minimum of x² from high school sit at the same x.

Tip: Always work with log-probabilities; products of many small probabilities are numerically awkward, while sums of logs are easy to work with.
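Here is a quick numerical sketch of why: multiplying many small probabilities underflows to zero in floating point, while summing their logs stays perfectly usable (the probabilities below are made up).

import numpy as np

probs = np.full(1000, 1e-4)        # 1000 small, made-up probabilities

product = np.prod(probs)           # underflows to 0.0
log_sum = np.sum(np.log(probs))    # a finite, usable number (about -9210.3)

print(product, log_sum)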

Note: Feel free to skip the math derivation and go straight to the code implementation if you find the math too complicated.

Math Derivation of MLE:

Here is the likelihood whose negative log we want to minimize:

Let us first simplify this expression using the log multiplication rule, which turns log(x₁ · x₂ ⋯ xₙ) into the sum log x₁ + log x₂ + ⋯ + log xₙ.

Caution: Lots of math (matrices and derivatives) for finding the best parameters coming up.

Now let's substitute our variables into the Gaussian PDF equation, with the line mx + b as our mean, and simplify.

Simplifying likelihood of Gaussian PDF.

Let us find the best value for our bias by differentiating and isolating it.

Partial derivative of bias.

We express our bias in terms of y, our weights, and our x data so that, once we solve for the weights, we can then solve for the bias. We substitute this back into our likelihood equation and simplify.

Changing scalar to vector form.
Partial derivative of weights.

Bingo! We have now found a closed-form formula for the best weights, and with it the best parameters theta = {m, b}.

MLE Code Implementation:

We will use the following two equations, derived earlier, to implement MLE for our weights and bias:

MLE for weights
MLE for bias
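The class itself is not reproduced here, so the following is a minimal sketch of what a LinearRegressionMLE class could look like. It folds the bias in as a column of ones and solves the normal equations, theta = (XᵀX)⁻¹ Xᵀ y, which reach the same optimum as the separate weight and bias formulas above; the exact code behind the original plots may differ.

import numpy as np

class LinearRegressionMLE:
    def __init__(self, X, y):
        # append a column of ones so the bias is learned as the last parameter
        self.X = np.hstack([X, np.ones((X.shape[0], 1))])
        self.y = y
        self.parameters = None

    def train(self):
        # closed-form MLE solution (normal equations): theta = (X^T X)^-1 X^T y
        self.parameters = np.linalg.inv(self.X.T @ self.X) @ self.X.T @ self.y
        return self.parameters

    def predict(self, X):
        X = np.hstack([X, np.ones((X.shape[0], 1))])
        return X @ self.parameters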

Let us now predict our test data set and see it on a graph.

model = LinearRegressionMLE(X_train, y_train)
parameters = model.train()
weights = parameters[0:-1, :]
bias = parameters[-1, :]
prediction = model.predict(X_test)

plt.title('House area vs Price')
plt.xlabel('House area (ft^2)')
plt.ylabel('Price($)')
linear_equation = "y={:0.2f}x+{:0.2f}".format(weights[0][0], bias[0])
plt.plot(X_test, prediction, color='k', label=linear_equation)
plt.scatter(X_test, y_test, label='Correct output')
plt.legend()
plt.show()
A graph of the line of best fit from our model.

Our parameters, as shown by the line, are a good fit for both our training and testing data. If you remember from the last article, this means we have a low bias (low training error) and low variance (low testing error).

Maximum a Posteriori:

Maximum a Posteriori (MAP) is similar to MLE but with a prior, that is, a belief we hold about the parameters before seeing the data. It uses Bayesian inference to estimate the distribution that best fits the data. Learn the meaning of the terms below here.

Bayesian inference equation: p(θ | data) = p(data | θ) · p(θ) / p(data)

Since the denominator does not depend on the parameters, we can drop it: it has no effect on which parameters maximize the posterior.

The posterior is proportional to the likelihood times the prior: p(θ | data) ∝ p(data | θ) · p(θ)

To get the best parameters to fit the distribution of our label or to predict it, we need to minimize the following:

The likelihood equation we will use for Bayesian Inference is the same Gaussian PDF.

Math Derivation of MAP:

Partial derivative of our bias and weights.

Bingo! We have now found formulas that give us the best parameters theta = {m, b} for both the weights and the bias. The bias formula remains the same as in MLE.

MAP Code Implementation:

We will be implementing the following two equations derived earlier:

MAP for weights
MAP for bias
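As with MLE, the class is not reproduced here. The sketch below assumes the standard ridge-style closed form, theta = (XᵀX + prior·I)⁻¹ Xᵀ y, with the bias left unregularized to match the note above that the bias formula is unchanged; the exact code behind the original plots may differ.

import numpy as np

class LinearRegressionMAP:
    def __init__(self, X, y, prior):
        # append a column of ones so the bias is learned as the last parameter
        self.X = np.hstack([X, np.ones((X.shape[0], 1))])
        self.y = y
        self.prior = prior
        self.parameters = None

    def train(self):
        # ridge-style closed form: theta = (X^T X + prior * I)^-1 X^T y
        reg = self.prior * np.eye(self.X.shape[1])
        reg[-1, -1] = 0.0   # do not penalize the bias term
        self.parameters = np.linalg.inv(self.X.T @ self.X + reg) @ self.X.T @ self.y
        return self.parameters

    def predict(self, X):
        X = np.hstack([X, np.ones((X.shape[0], 1))])
        return X @ self.parameters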

Let us now predict our test data set and see it on a graph.

prior = 10
model = LinearRegressionMAP(X_train, y_train, prior)
params = model.train()
weights = params[0:-1, :]
bias = params[-1, :]
prediction = model.predict(X_test)

linear_equation = "y={:0.2f}x + {:0.2f}".format(weights[0][0], bias[0])
plt.plot(X_test, prediction, color='k', label=linear_equation)
plt.scatter(X_test, y_test, label='True output')
plt.xlabel('House area (ft^2)')
plt.ylabel('Price($)')
plt.title('House area vs Price')
plt.legend()
plt.show()
A graph of the line of best fit from our model.

Our machine learning model is performing well, with parameters similar to those of MLE.

You may wonder, how do we choose a prior? Consider what the prior is made up of:

If the numerator (the noise variance) goes to 0, there is no variance: our true output equals the prediction exactly. If the denominator (the prior variance) gets very large, the prior is so spread out that it is useless. Both of these drive the prior term toward 0.

If we knew either of these values, we would use it. In practice, however, we usually don't know them, or only know one of the two, so we just pick an arbitrary value and experiment with it, as sketched below.
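A small sketch of what that experimentation might look like, reusing the LinearRegressionMAP sketch above and comparing test error for a few made-up prior values:

import numpy as np

for prior in [0.1, 1, 10, 100]:
    model = LinearRegressionMAP(X_train, y_train, prior)
    model.train()
    prediction = model.predict(X_test)
    mse = np.mean((y_test - prediction) ** 2)     # mean squared test error
    print("prior = {:>5}, test MSE = {:.2f}".format(prior, mse))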

MAP for linear regression is also known as Ridge Regression. The prior acts as regularization: it trades a little extra bias for lower variance, so that we do not overfit our training data.

Summary:

Gradient Descent: An iterative optimization algorithm for minimizing our loss and getting the best parameters. This is most commonly used in machine learning when the dataset is small to medium size.

Maximum Likelihood Estimator: A method to estimate the parameters by maximizing the likelihood. It is used in deep learning and is most common when we want to build a statistical model.

Maximum a Posteriori: A special case of MLE with a prior. Used when we know something about the prior, or in ridge regression, where regularization prevents overfitting. Otherwise, we use MLE or gradient descent.

What’s next:

MLE, MAP, and GD all give us a point estimate: the single best (peak) value of our parameters.

That’s it for this article series. In the next one, we will use machine learning to develop text classification and summarization for businesses from scratch. Stay tuned until then. Great job, hurray!^-^
