ML & DL — Linear Regression (Part 2)

fernanda rodríguez · Analytics Vidhya
May 8, 2020 · 4 min read

Linear regression defines the relationship between two variables, but how do we find the “best” combination of parameters?

Photo by Atharva Tulsi on Unsplash

In this article, you will find:

  • A brief introduction to linear regression,
  • Cost function and optimization problem,
  • Gradient descent and learning rate,
  • Back-propagation and optimizers,
  • Implementation of linear regression with Keras in a Jupyter Notebook,
  • Partial summary.

Linear regression

Linear regression is a simple machine learning algorithm that solves a regression problem [1].

In statistics, linear regression is a technique to model the relationship between a dependent variable y and one or more independent variables x.

The output is a linear function of the input [1]

  • ŷ = Wx + b, linear regression.
  • Hypothesis: ŷ is the value that the model predicts.
  • Parameters: W determines how much each feature affects the prediction, and b is the bias that controls the fixed offset of the prediction.
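
As a quick illustration, here is a minimal NumPy sketch of the hypothesis ŷ = Wx + b; the weights, bias, and input values below are made up for the example, not taken from the notebook.

    import numpy as np

    # Linear model ŷ = Wx + b (illustrative values, one weight per feature).
    W = np.array([2.0, -1.0])   # weights
    b = 0.5                     # bias: the fixed offset of the prediction
    x = np.array([3.0, 4.0])    # one example with two features

    y_hat = W @ x + b           # prediction: 2*3 - 1*4 + 0.5 = 2.5
    print(y_hat)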

Parameters

Parameters are values that control the behavior of the system.

  • The objective is to find the “best” possible set of parameters W and b to describe the data.
  • However, first, we need to define the error/cost.

Cost function

We first need to define a measure of model performance.

One way to measure model performance is to compute the mean squared error (MSE) on the test set [1].

  • In statistics, the MSE measures the mean of the squares of the errors or deviations.
  • L = (1/n) ∑ᵢ (yᵢ − ŷᵢ)², mean squared error (MSE).

That is, it averages the squared difference between the hypothesis ŷ and the real value y.
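
A minimal NumPy sketch of the MSE, with made-up values for y and ŷ:

    import numpy as np

    # L = (1/n) ∑ᵢ (yᵢ − ŷᵢ)², the mean squared error.
    y = np.array([1.0, 2.0, 3.0])       # real values (illustrative)
    y_hat = np.array([1.1, 1.9, 3.2])   # predictions (illustrative)

    mse = np.mean((y - y_hat) ** 2)
    print(mse)                          # ≈ 0.02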

Optimization problem

Next, we need to minimize the cost of the hypothesis ŷ with respect to the model parameters W and b.

  • min L = min (1/n) ∑ᵢ (yᵢ − ŷᵢ)², minimize the cost function over W and b.

Possible solutions:

  • Analytical (a closed-form solution, e.g., the normal equations; see the sketch after this list), or
  • Numeric: Optimization algorithms that iterate over the data set, for example, gradient descent.
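
For the analytical route, linear regression with the MSE cost has a closed-form solution (the normal equations). A minimal NumPy sketch on synthetic data chosen only for illustration:

    import numpy as np

    # Synthetic data: y ≈ 3x + 2 plus a little noise (illustrative).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 1))
    y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.1, size=100)

    # Append a column of ones so the bias b is learned along with W,
    # then solve the least-squares (normal-equations) problem directly.
    X_aug = np.hstack([X, np.ones((100, 1))])
    theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    W, b = theta
    print(W, b)   # close to 3.0 and 2.0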

Gradient descent

Gradient descent is a first-order iterative optimization algorithm for finding a minimum of a function [1].

  • The gradient is an operator that takes a multi-variable function and returns a vector pointing in the direction of steepest ascent of that function.
  • ∇L = [𝜕L/𝜕W, 𝜕L/𝜕b]ᵀ, gradient of the cost function.

If we want to go down, all we have to do is walk in the direction opposite to the gradient. This is the strategy for minimizing the cost function.

Learning rate ϵ

The learning rate is a positive scalar ϵ that determines the step size [1].

  • [W′, b′]ᵀ = [W, b]ᵀ − ϵ∇L, gradient-descent update with learning rate ϵ.

The learning rate ϵ can cause problems in two ways:

  • If the step is too small, we move toward the minimum very slowly,
  • If the step is too big, we may overshoot and jump past the minimum.
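
Putting the gradient, the update rule, and the learning rate together, here is a minimal NumPy sketch of gradient descent for ŷ = Wx + b with the MSE cost; the synthetic data, the learning rate ϵ, and the number of steps are illustrative choices, not values from the notebook.

    import numpy as np

    # Synthetic 1-D data: y ≈ 3x + 2 (illustrative).
    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=100)

    W, b = 0.0, 0.0
    epsilon = 0.1                      # learning rate: the step size
    for step in range(200):
        y_hat = W * x + b              # forward pass
        error = y_hat - y
        dW = 2 * np.mean(error * x)    # 𝜕L/𝜕W for L = (1/n) ∑ (ŷ − y)²
        db = 2 * np.mean(error)        # 𝜕L/𝜕b
        # Walk opposite to the gradient, scaled by the learning rate.
        W, b = W - epsilon * dW, b - epsilon * db

    print(W, b)   # approaches 3.0 and 2.0

With ϵ = 0.1 this loop settles near the minimum in a few dozen steps; a much larger ϵ would overshoot, and a much smaller one would need many more steps.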

Back-propagation

Back-propagation is a method for calculating the gradient, while another algorithm, such as gradient descent, uses that gradient to perform learning (optimization) [1].

There are two phases: forward-propagation and back-propagation.

  • Forward-propagation

The input x provides the initial information that propagates and finally produces ŷ.

Forward-propagation by ma. fernanda rodríguez r.

  • Back-propagation

We use the value of the cost function to calculate the error. The error value is propagated backward in order to calculate the gradient with respect to the weights.

Gradients are calculated using the chain rule.

Back-propagation by ma. fernanda rodríguez r.

Back-propagation algorithm

  • After forward-propagation, we obtain an output value, which is the predicted value ŷ.
  • We use a loss function L to calculate the error value.
  • We calculate the gradient of the error with respect to each weight.
  • We subtract the gradient value (scaled by the learning rate ϵ) from the weight value.

In this way, we approach the local minimum.
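
As a sketch of how a framework performs these steps automatically, the snippet below uses TensorFlow's GradientTape (the repository linked below uses Keras, which builds on the same machinery): the forward pass computes ŷ and the loss, and the tape then applies the chain rule backwards to obtain 𝜕L/𝜕W and 𝜕L/𝜕b.

    import tensorflow as tf

    # Illustrative data: y = 2x + 1.
    x = tf.constant([[1.0], [2.0], [3.0]])
    y = tf.constant([[3.0], [5.0], [7.0]])

    W = tf.Variable([[0.0]])
    b = tf.Variable([0.0])

    with tf.GradientTape() as tape:
        y_hat = x @ W + b                         # forward-propagation
        loss = tf.reduce_mean((y - y_hat) ** 2)   # MSE

    dW, db = tape.gradient(loss, [W, b])          # back-propagation (chain rule)
    W.assign_sub(0.1 * dW)                        # one gradient-descent update
    b.assign_sub(0.1 * db)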

Problems with gradient descent

The traditional method of gradient descent computes the gradient over the entire data set but performs only a single update per pass.

So it can be very slow and hard to apply to very large data sets that do not fit in memory.

As a solution, modifications of the original method, known as optimizers, were created.

Optimizers

These are extensions (modifications of the original method) that try to solve the problems of gradient descent.

Basic [1]:

  • SGD: Stochastic Gradient Descent.

Adaptive learning rates [1]:

  • AdaGrad [2].
  • RMSprop [3].
  • Adam [4].
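
For reference, these optimizers are available in Keras roughly as follows; the learning rates shown are illustrative, not recommendations from the article.

    from tensorflow import keras

    # Any of these can be passed to model.compile(optimizer=..., loss="mse").
    sgd = keras.optimizers.SGD(learning_rate=0.01)
    adagrad = keras.optimizers.Adagrad(learning_rate=0.01)
    rmsprop = keras.optimizers.RMSprop(learning_rate=0.001)
    adam = keras.optimizers.Adam(learning_rate=0.001)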

GitHub code

In this repository, you will find a step-by-step implementation of linear regression with Keras in a Jupyter Notebook.
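
Without reproducing the notebook, a minimal Keras sketch of the same idea might look like this; the synthetic data and hyper-parameters are illustrative assumptions, not the values used in the repository.

    import numpy as np
    from tensorflow import keras

    # Synthetic data: y ≈ 3x + 2 (illustrative).
    rng = np.random.default_rng(0)
    x = rng.normal(size=(200, 1))
    y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=(200, 1))

    # A single Dense(1) layer is exactly the linear model ŷ = Wx + b.
    model = keras.Sequential([keras.Input(shape=(1,)), keras.layers.Dense(1)])
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.1), loss="mse")
    model.fit(x, y, epochs=50, verbose=0)

    W, b = model.layers[0].get_weights()
    print(W, b)   # close to 3.0 and 2.0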

Partial summary

Linear regression defines the relationship between two variables.

How to find the “best” combination of parameters?

  • Cost function: mean squared error (MSE)
  • Gradient descent
    Back-propagation to calculate the gradient
    Optimizers (extensions of gradient descent)

The process of finding the best combination of parameters (optimization) is called training.

All theoretical and practical implementations: Linear regression, Logistic regression, Artificial neural networks, Deep neural networks, and Convolutional neural networks.

References

[1] Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.

[2] Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.

[3] Hinton, G. (2012). Neural networks for machine learning. Coursera, video lectures.

[4] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.
