ML & DL — Linear Regression (Part 2)
Linear regression defines the relationship between two variables, but how do we find the “best” combination of parameters?
In this article, you will find:
- A brief introduction to linear regression,
- Cost function and optimization problem,
- Gradient descent and learning rate,
- Back-propagation and optimizers,
- Implementation of linear regression with Keras in a Jupyter Notebook,
- Partial summary.
Linear regression
Linear regression is a simple machine learning algorithm that solves a regression problem [1].
In statistics, linear regression is a technique to model the relationship between a dependent variable y and one or more independent variables x.
The output is a linear function of the input [1]:
ŷ = Wx + b
, linear regression.
- Hypothesis: ŷ is the value that the model predicts.
- Parameters: W determines how each feature affects the prediction, and b controls a fixed offset of the prediction.
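As a minimal sketch, the hypothesis is a single NumPy expression; the data and the parameter values below are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical one-feature data set (n = 4 samples).
x = np.array([1.0, 2.0, 3.0, 4.0])

# Hand-picked parameters for illustration.
W, b = 2.0, 0.5

# Hypothesis: each prediction is a linear function of the input.
y_hat = W * x + b  # → [2.5, 4.5, 6.5, 8.5]
```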
Parameters
Parameters are values that control the behavior of the system.
- The objective is to find the “best” possible set of parameters W and b to describe the data.
- However, first, we need to define the error/cost.
Cost function
Define a model performance measure.
One way to measure model performance is to calculate the mean squared error (MSE) on the test set [1].
- In statistics, the MSE measures the mean of the squares of the errors or deviations.
L = (1/n) ∑ᵢ (yᵢ − ŷᵢ)²
, mean squared error (MSE).
That is, the average squared difference between the hypothesis ŷ and the real value y.
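With hypothetical true values and predictions, the MSE is one line of NumPy:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])      # real values (hypothetical)
y_hat = np.array([2.5, 4.5, 6.5, 8.5])  # model predictions

# L = (1/n) Σ (y_i - ŷ_i)²: the average of the squared errors.
L = np.mean((y - y_hat) ** 2)  # each error is 0.5, so L = 0.25
```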
Optimization problem
Nevertheless, we need to minimize the cost of the hypothesis ŷ with respect to the model parameters W and b.
min L = min (1/n) ∑ᵢ (yᵢ − ŷᵢ)²
, minimize the cost function.
Possible solutions:
- Analytic: solve for the parameters in closed form, or
- Numeric: Optimization algorithms that iterate over the data set, for example, gradient descent.
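For linear regression, the analytic route is ordinary least squares; a sketch using NumPy's least-squares solver, on noise-free data generated from the known parameters W = 2, b = 0.5:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 0.5  # data generated from known parameters

# Append a column of ones so the solver also fits the intercept b.
X = np.column_stack([x, np.ones_like(x)])
params, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
W, b = params  # recovers W ≈ 2.0, b ≈ 0.5
```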
Gradient descent
In calculus, gradient descent is a first-order iterative optimization algorithm to find the minimum of a function[1].
- The gradient is an operator that takes a multi-variable function and returns a vector pointing in the direction of maximum slope of the original function.
∇L = [𝜕L/𝜕W, 𝜕L/𝜕b]ᵀ
, gradient of the cost function.
If we want to go down, all we have to do is walk in the direction opposite to the gradient. This is the strategy to minimize the cost function.
Learning rate ϵ
The learning rate is a positive scalar ϵ that determines the step size [1].
[W’, b’]ᵀ = [W, b]ᵀ − ϵ∇L
, the update rule with learning rate ϵ.
The learning rate ϵ can be a problem in two ways:
- If the step is too small, we will move slowly to the minimum,
- If the step is too big, we may end up jumping beyond the minimum.
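A minimal gradient-descent loop for this cost; the step size ϵ = 0.05 and the iteration count are assumptions, chosen small enough not to overshoot on this hypothetical data set:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 0.5          # data generated from W = 2, b = 0.5
W, b = 0.0, 0.0            # initial parameters
eps = 0.05                 # learning rate (assumed)
n = len(x)

for _ in range(2000):
    y_hat = W * x + b
    # Gradient of L = (1/n) Σ (y_i - ŷ_i)² with respect to W and b.
    dW = (-2.0 / n) * np.sum((y - y_hat) * x)
    db = (-2.0 / n) * np.sum(y - y_hat)
    # Walk in the direction opposite to the gradient.
    W, b = W - eps * dW, b - eps * db
```

After the loop, W and b are very close to the generating values 2.0 and 0.5.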
Back-propagation
Back-propagation is a method for calculating the gradient, while another algorithm, such as gradient descent, is used to perform learning (optimization) using that gradient [1].
There are two phases: forward-propagation and back-propagation.
- Forward-propagation
The input x provides the initial information that propagates forward and finally produces the prediction ŷ.
- Back-propagation
We use the value of the cost function to calculate the error. The error value is propagated backward in order to calculate the gradient with respect to the weights.
Gradients are calculated using the chain rule.
Back-propagation algorithm
- After forward-propagation, we obtain an output value, which is the predicted value.
- We use a loss function L to calculate the error value.
- We calculate the error gradient for each weight.
- We subtract the gradient value (scaled by the learning rate) from the weight value.
In this way, we approach the local minimum.
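The two phases can be sketched for this model on a tiny hypothetical batch; the gradients follow from the chain rule, e.g. ∂L/∂W = Σᵢ (∂L/∂ŷᵢ)(∂ŷᵢ/∂W):

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([2.5, 4.5])   # real values (hypothetical)
W, b = 1.0, 0.0            # current parameters
n = len(x)

# Forward-propagation: from the input x to the prediction and the loss.
y_hat = W * x + b
L = np.mean((y - y_hat) ** 2)

# Back-propagation: chain rule from L back to the parameters.
dL_dyhat = (-2.0 / n) * (y - y_hat)  # ∂L/∂ŷ_i
dW = np.sum(dL_dyhat * x)            # since ∂ŷ_i/∂W = x_i
db = np.sum(dL_dyhat)                # since ∂ŷ_i/∂b = 1

# Update: subtract the scaled gradient from each parameter.
eps = 0.1
W, b = W - eps * dW, b - eps * db
```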
Problems with gradient descent
The traditional method of gradient descent calculates the gradient over the entire data set but performs only a single update.
So it can be very slow and difficult to control for very large data sets that do not fit in memory.
As a solution, modifications were created to the original method known as optimizers.
Optimizers
Optimizers are extensions (modifications of the original method) that try to solve the problems of gradient descent.
Basic [1]:
- SGD: Stochastic Gradient Descent.
Adaptive learning rates [1]:
- AdaGrad [2].
- RMSprop [3].
- Adam [4].
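As one concrete example, the Adam update from [4] can be sketched in NumPy for the same linear-regression gradients. The β₁, β₂, and ε values below are the defaults suggested in the paper; the step size α and the iteration count are assumptions for this toy problem:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 0.5
theta = np.zeros(2)              # parameters [W, b]
m, v = np.zeros(2), np.zeros(2)  # first and second moment estimates
alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
n = len(x)

for t in range(1, 5001):
    W, b = theta
    y_hat = W * x + b
    g = np.array([(-2.0 / n) * np.sum((y - y_hat) * x),   # ∂L/∂W
                  (-2.0 / n) * np.sum(y - y_hat)])        # ∂L/∂b
    m = beta1 * m + (1 - beta1) * g        # decaying mean of gradients
    v = beta2 * v + (1 - beta2) * g ** 2   # decaying mean of squares
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)

mse = np.mean((y - (theta[0] * x + theta[1])) ** 2)
```

Dividing by the running root-mean-square of the gradients gives each parameter its own effective step size, which is what makes these methods "adaptive".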
Github code
In this repository, you will find a step-by-step implementation of linear regression with Keras in a Jupyter Notebook.
Partial summary
Linear regression defines the relationship between two variables.
How to find the “best” combination of parameters?
- Cost function: Mean squared error
- Gradient descent
• Back-propagation to calculate the gradient
• Optimizers (gradient descent extensions)
The process of finding the best combination of parameters (optimization) is called training.
All theoretical and practical implementations: Linear regression, Logistic regression, Artificial neural networks, Deep neural networks, and Convolutional neural networks.
For those looking for all the articles in our ML & DL series, here is the link.
References
[1] Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
[2] Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.
[3] Hinton, G. (2012). Neural Networks for Machine Learning. Coursera, video lectures.
[4] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. abs/1412.6980.