Optimization Algorithms in Machine Learning: A Comprehensive Guide to the Concepts and Their Implementation

Koushik
Dec 6, 2023


Optimizer, the engine of machine learning.

[Figure: gradient descent. Source: dmitrijskass]

Definition: In the context of machine learning, optimization refers to the process of adjusting the parameters of a model to minimize (or maximize) some objective function. The objective function is a measure of how well the model performs on a task, such as the error on a set of training data, which training aims to minimize. The process involves finding the set of parameters that results in the best performance of the model.
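For concreteness, a common objective function for regression is the mean squared error (MSE). The following minimal sketch (a generic example, not tied to any particular model) shows how it scores predictions against labels:

import numpy as np

def mse_loss(y_true, y_pred):
    # Average of squared differences between predictions and labels;
    # lower is better, so training minimizes this value.
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5, 0.0, 2.0])
print(mse_loss(y_true, y_pred))  # 0.1666...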

There are various optimization algorithms used in machine learning to find the optimal set of parameters. These algorithms are responsible for updating the model parameters iteratively during the training process. Some common optimization algorithms include:

Gradient Descent: Gradient Descent is a first-order iterative optimization algorithm widely used in machine learning and optimization problems. Its primary purpose is to minimize a differentiable cost or loss function by iteratively adjusting the parameters of a model.

w = w − α * δloss/δw

Start with an initial set of parameter values (weights) for the model. In each iteration, compute the gradient of the loss function with respect to the parameters and apply the update above, where w represents the parameters, α is the learning rate, and δloss/δw is the gradient of the objective function. Intuitively, gradient descent approximates the function to be minimized by its first-order (linear) Taylor expansion and steps in the direction that decreases it. The expression δloss/δw represents the rate of change of the loss with respect to the weight parameter and is a crucial component in optimization algorithms like gradient descent: it gives the slope of the loss function at the current point in parameter space, and the weights are adjusted in the direction that minimizes the loss.
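As a quick worked example with made-up numbers: if loss(w) = w², then δloss/δw = 2w. Starting at w = 2.0 with α = 0.1, the gradient is 4.0, so one update gives w = 2.0 − 0.1 × 4.0 = 1.6, a step toward the minimum at w = 0.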

Objective: Gradient Descent aims to find the minimum of a cost or loss function. This function represents the difference between the predicted values of the model and the actual values (labels) in a training dataset.

Convergence and Challenges: Convergence is achieved when the algorithm reaches a point where further iterations do not significantly change the parameters. However, challenges such as choosing an appropriate learning rate and dealing with saddle points and local minima can affect convergence.
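To make the learning-rate challenge concrete, here is a minimal sketch, assuming the toy loss loss(w) = w² (so the gradient is 2w): a small α converges smoothly toward the minimum, while a too-large α makes the iterates overshoot and diverge.

def toy_gradient_descent(w, alpha, steps):
    # Gradient descent on loss(w) = w**2, whose gradient is 2*w
    for _ in range(steps):
        w = w - alpha * 2 * w
    return w

print(toy_gradient_descent(w=2.0, alpha=0.1, steps=20))  # ~0.023, converging toward 0
print(toy_gradient_descent(w=2.0, alpha=1.1, steps=20))  # ~76.7, overshooting and diverging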

Let’s see how to code it.

import numpy as np

def gradient_descent(X, y, w, learning_rate, iterations):
    m = len(y)  # number of training examples

    for _ in range(iterations):
        predictions = np.dot(X, w)          # model predictions for all examples
        errors = predictions - y            # prediction errors
        gradient = np.dot(X.T, errors) / m  # gradient of the MSE loss w.r.t. w
        # α = learning_rate and gradient = δloss/δw
        w = w - learning_rate * gradient    # step against the gradient

    return w

Stochastic Gradient Descent (SGD):

Stochastic Gradient Descent (SGD) is an optimization algorithm used in machine learning and deep learning for training models. It is an extension of the standard gradient descent algorithm: instead of computing the gradient over the entire dataset, SGD computes the gradient and updates the model parameters for each training example individually or in small batches.

Why Use SGD:

Computational Efficiency: Computing the gradient using the entire dataset can be computationally expensive, especially for large datasets. SGD allows for more frequent updates with lower computational cost.

Faster Convergence: Since updates are made more frequently, SGD often converges faster than traditional gradient descent, especially when dealing with non-convex loss functions.

Memory Efficiency: Working with individual or mini-batches of data requires less memory compared to the entire dataset, making SGD suitable for cases where memory is a constraint.

Escape Local Minima: The stochastic nature of SGD introduces randomness in the optimization process, helping the algorithm escape local minima and explore the parameter space more effectively.

In its basic form, the parameters are updated using a single randomly chosen data point at a time, making each update computationally cheap. The update rule is:

w = w − α * δloss(xᵢ, yᵢ)/δw

where i represents the randomly chosen data point. Let’s see how to code it.

import numpy as np

def stochastic_gradient_descent(X, y, theta, learning_rate, iterations):
    m = len(y)  # number of training examples

    for _ in range(iterations):
        for _ in range(m):  # m parameter updates per pass, one random example each
            rand_index = np.random.randint(0, m)
            X_i = X[rand_index, :].reshape(1, -1)  # a single random example
            y_i = y[rand_index]
            prediction = np.dot(X_i, theta)
            error = prediction - y_i
            gradient = X_i.T * error               # gradient from this one example
            theta = theta - learning_rate * gradient

    return theta
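The description above also mentions updating on small batches. Here is a sketch of that mini-batch variant (the function name minibatch_gradient_descent and the batching details are illustrative, not from the original article):

import numpy as np

def minibatch_gradient_descent(X, y, theta, learning_rate, iterations, batch_size=32):
    # Illustrative mini-batch variant: average the gradient over a small
    # random batch instead of a single example or the full dataset.
    m = len(y)

    for _ in range(iterations):
        # Shuffle once per pass so batches differ between iterations
        indices = np.random.permutation(m)
        for start in range(0, m, batch_size):
            batch = indices[start:start + batch_size]
            X_b = X[batch]
            y_b = y[batch]
            predictions = np.dot(X_b, theta)
            errors = predictions - y_b
            gradient = np.dot(X_b.T, errors) / len(batch)
            theta = theta - learning_rate * gradient

    return theta

This sits between full-batch gradient descent and pure SGD: the gradients are less noisy than single-example updates but still much cheaper to compute than full-dataset gradients.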

In summary, SGD is advantageous for its computational efficiency, faster convergence, and ability to handle large datasets. However, it introduces variance in updates, and the choice of an appropriate learning rate is crucial to balance exploration and exploitation during optimization. The decision to use traditional gradient descent or SGD depends on factors such as dataset size, computational resources, and the nature of the optimization problem.

RMSprop (Root Mean Square Propagation): RMSprop is an optimization algorithm designed to address some of the limitations of traditional gradient descent, especially when dealing with non-convex and poorly conditioned optimization problems. It is particularly useful for problems with sparse data and helps to adaptively adjust the learning rates for different parameters.

How RMSprop Works: RMSprop adapts the learning rates of each parameter individually by dividing the learning rate by the root mean square of recent gradients. This allows the algorithm to automatically decrease the learning rate for parameters that have steep and rapidly changing gradients and increase the learning rate for parameters with small or slowly changing gradients.
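In update-rule form, written in the same plain-text notation as above and matching the implementation below, with gradient g = δloss/δθ:

v = β * v + (1 − β) * g²
θ = θ − (α / (√v + ε)) * g

where β is the decay rate for the running average (typically 0.9) and ε is a small constant for numerical stability.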

Algorithm gist:

import numpy as np

def rmsprop(X, y, theta, learning_rate, iterations, beta=0.9, epsilon=1e-7):
    # Exponentially decaying average of squared gradients
    v_t = np.zeros_like(theta)

    # RMSprop optimization loop
    for _ in range(iterations):
        # Compute the gradient of the loss function
        predictions = X.dot(theta)
        errors = predictions - y
        gradient = X.T.dot(errors) / len(y)

        # Update the exponentially decaying average of squared gradients
        v_t = beta * v_t + (1 - beta) * gradient**2

        # Update the parameters using RMSprop
        theta = theta - (learning_rate / (np.sqrt(v_t) + epsilon)) * gradient

    return theta

In summary, RMSprop helps mitigate issues like slow convergence and oscillations in the learning rate by adapting the learning rates for each parameter based on the historical gradients. It is a popular choice in practice, especially for training deep neural networks, and can be particularly effective in scenarios where the landscape of the optimization problem is challenging or has varying curvature. The algorithm provides a balance between the aggressive adaptability of methods like AdaGrad and the more stable behavior of methods like SGD.
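Concretely, AdaGrad accumulates v ← v + g², so the effective step size α / (√v + ε) can only shrink as training proceeds; RMSprop's v ← β·v + (1 − β)·g² is an exponential moving average that forgets old gradients, letting the step size recover when gradients become small again.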

Adam (Adaptive Moment Estimation): Adam is an optimization algorithm used for training machine learning models. It combines ideas from both momentum optimization and RMSprop (Root Mean Square Propagation). Adam adapts the learning rates for each parameter individually based on the historical gradients of those parameters.

Why Adam?

Adaptive Learning Rates: One of the key advantages of Adam is its adaptive learning rate mechanism. It adjusts the learning rates for each parameter based on the historical gradients, allowing it to perform well across different types of parameters and features.

Efficiency in Sparse Gradients: Adam is well-suited for sparse gradients, which is common in tasks like natural language processing. It maintains a separate adaptive learning rate for each parameter, making it less sensitive to the scale of the gradients.

Combining Momentum and RMSprop: Adam combines the momentum term to accelerate convergence in the parameter space with the RMSprop term to adaptively scale the learning rates. This combination helps handle noisy gradients and navigate through saddle points efficiently.

Effective in a Wide Range of Applications: Adam has shown effectiveness in a wide range of deep learning applications and is widely used in practice due to its robust performance and ease of use.

Algorithm

[Figure: Adam algorithm. Source: Springer]

Explanation of Adam parameters:

First Moment Estimate (m_t): This moving average is similar to the momentum term and represents the exponentially decaying average of past gradients. It helps the optimizer continue moving in the right direction even if the gradient updates become noisy.

Second Moment Estimate (v_t): This moving average is similar to the squared gradients used in RMSprop. It represents the exponentially decaying average of the past squared gradients and helps adapt the learning rates for each parameter.

The moment estimates and the updated parameter (θ_{t+1}) are calculated using the following formulas (reconstructed to match the code below):

m_t = β₁ * m_{t−1} + (1 − β₁) * J_t
v_t = β₂ * v_{t−1} + (1 − β₂) * J_t²
m̂_t = m_t / (1 − β₁ᵗ)    (bias-corrected first moment)
v̂_t = v_t / (1 − β₂ᵗ)    (bias-corrected second moment)
θ_{t+1} = θ_t − α * m̂_t / (√v̂_t + ϵ)

  • α is the learning rate.
  • β₁ and β₂ are the decay rates of the two moment estimates (commonly 0.9 and 0.999).
  • ϵ is a small constant to prevent division by zero.
  • t is the iteration step.
  • J_t is the gradient of the loss function with respect to the parameters at iteration t.

Let’s see how to code it.

import numpy as np

def adam(X, y, theta, learning_rate, iterations, beta1=0.9, beta2=0.999, epsilon=1e-8):
    m = len(y)                   # number of training examples
    m_t = np.zeros(theta.shape)  # first moment estimate
    v_t = np.zeros(theta.shape)  # second moment estimate
    t = 0                        # time step for bias correction

    for _ in range(iterations):
        for i in range(m):       # stochastic updates, one example at a time
            t += 1
            X_i = X[i, :].reshape(1, -1)
            y_i = y[i]
            prediction = np.dot(X_i, theta)
            error = prediction - y_i
            gradient = X_i.T * error

            # Update biased first and second moment estimates
            m_t = beta1 * m_t + (1 - beta1) * gradient
            v_t = beta2 * v_t + (1 - beta2) * (gradient**2)

            # Bias-corrected moment estimates
            m_t_hat = m_t / (1 - beta1**t)
            v_t_hat = v_t / (1 - beta2**t)

            # Parameter update
            theta = theta - learning_rate * m_t_hat / (np.sqrt(v_t_hat) + epsilon)

    return theta

In summary, Adam combines the advantages of momentum and RMSprop to provide an adaptive and efficient optimization algorithm. It has become a popular choice for training deep neural networks due to its robust performance and ease of use.

Run the main script below to compare the optimizers on synthetic data. Remember to tune the hyperparameters and try real-world datasets as well.

import numpy as np

def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)

# Generate synthetic data
np.random.seed(40)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Add bias term to X
X_b = np.c_[np.ones((100, 1)), X]

# Initial parameters
theta_initial = np.random.randn(2, 1)

# Set hyperparameters
learning_rate = 0.003
iterations = 1000
batch_size = 32  # only needed if you also try the mini-batch variant sketched earlier

# Apply optimization algorithms (each starts from a copy of the same initial parameters)
theta_gd = gradient_descent(X_b, y, theta_initial.copy(), learning_rate, iterations)
theta_sgd = stochastic_gradient_descent(X_b, y, theta_initial.copy(), learning_rate, iterations)
theta_rmsprop = rmsprop(X_b, y, theta_initial.copy(), learning_rate, iterations, beta=0.9, epsilon=1e-7)
theta_adam = adam(X_b, y, theta_initial.copy(), learning_rate, iterations)

# Evaluate the models
predictions_gd = np.dot(X_b, theta_gd)
predictions_sgd = np.dot(X_b, theta_sgd)
predictions_rmsprop = np.dot(X_b, theta_rmsprop)
predictions_adam = np.dot(X_b, theta_adam)

# Print Mean Squared Error for each model
print("Mean Squared Error (Gradient Descent):", mean_squared_error(y, predictions_gd))
print("Mean Squared Error (Stochastic Gradient Descent):", mean_squared_error(y, predictions_sgd))
print("Mean Squared Error (RMSprop):", mean_squared_error(y, predictions_rmsprop))
print("Mean Squared Error (Adam):", mean_squared_error(y, predictions_adam))

Thank You!
