Optimization Algorithms and Interactive Visualization (Part 2)

Shuvam Das · Published in deepkapha notes · 9 min read · Mar 13, 2023

Defining the RMSprop optimization algorithm for numerical optimization in Python. (Equation explanation)

RMSprop is a gradient-based optimization algorithm commonly used in deep learning. It adapts the learning rate for each parameter based on the magnitude of recent gradients, allowing for faster convergence and more stable training.

The equation for RMSprop can be expressed as:

g_t = ∇J(θ_t)
E[g²]_t = β·E[g²]_{t-1} + (1-β)·g_t²
θ_t = θ_{t-1} - α/(sqrt(E[g²]_t) + ε) * g_t
where:
θ_t is the updated value of the parameters at time step t.
θ_{t-1} is the previous value of the parameters.
α is the learning rate, which determines the step size at each iteration.
g_t is the gradient of the loss function at time step t.
E[g²]_t is the exponentially weighted moving average (EWMA) of the squared gradients at time step t.
β is the decay rate for the EWMA, typically set to 0.9 or 0.99.
ε is a small constant added to the denominator to prevent division by zero.

The algorithm maintains an EWMA of the squared gradients at each time step, which acts as a running estimate of how large the recent gradients have been for each parameter. This estimate is used to scale the learning rate per parameter: where gradients have been consistently large, the denominator sqrt(E[g²]_t) is large and the effective step size shrinks, preventing overshooting; where gradients have been small, the denominator is small and the effective step size grows, so progress is still made along flat directions.

The RMSprop algorithm adapts the learning rate based on the history of the gradients, allowing it to converge faster and handle varying gradient magnitudes. It is particularly useful when dealing with sparse data, where the gradients can be very noisy and vary greatly in magnitude.
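To make this adaptive scaling concrete, here is a minimal single-parameter sketch of the update rule above; the gradient values and hyperparameters are purely illustrative and are not taken from the article's experiments.

alpha, beta, eps = 0.01, 0.9, 1e-8      # learning rate, decay rate, stability constant
theta, avg_sq_grad = 2.0, 0.0           # parameter value and EWMA of squared gradients

for g in [4.0, 3.5, 0.2]:               # a few example gradient values
    # E[g²]_t = β·E[g²]_{t-1} + (1-β)·g_t²
    avg_sq_grad = beta * avg_sq_grad + (1 - beta) * g ** 2
    # the effective step size shrinks when recent gradients have been large
    effective_step = alpha / (avg_sq_grad ** 0.5 + eps)
    theta = theta - effective_step * g
    print('g=%.1f  effective step=%.4f  theta=%.4f' % (g, effective_step, theta))

Running this shows the effective step size decreasing while the recent gradients are large and recovering gradually once they become small.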

Defining the RMSprop optimization algorithm for numerical optimization in Python. (Code explanation)

# gradient descent algorithm with rmsprop
import numpy as np

def rmsprop(objective, derivative, bounds, n_iter, step_size, rho):
    # track all solutions
    solutions = list()
    # generate an initial point
    x = bounds[:, 0] + np.random.rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of the average square gradients for each variable
    sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate gradient
        gradient = derivative(x[0], x[1])
        # update the moving average of the squared partial derivatives
        for i in range(gradient.shape[0]):
            sg = gradient[i]**2.0
            sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0 - rho))
        # build a solution one variable at a time
        new_solution = list()
        for i in range(x.shape[0]):
            # calculate the learning rate for this variable
            alpha = step_size / (1e-8 + np.sqrt(sq_grad_avg[i]))
            # calculate the new position in this variable
            value = x[i] - alpha * gradient[i]
            new_solution.append(value)
        # store the new solution and evaluate the candidate point
        x = np.asarray(new_solution)
        solutions.append(x)
        solution_eval = objective(x[0], x[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, x, solution_eval))
    return solutions

The given code defines two optimization algorithms, namely RMSProp and Adamax, and demonstrates their usage with an objective function and derivative function. The code is written in Python, and the numpy library is used to perform numerical computations.

The first algorithm, RMSProp, is used to optimize a given objective function via its derivative. The numpy library is used to seed the pseudo-random number generator so the results are reproducible, and the bounds of the input are defined with numpy's array function. The number of iterations is set to 60, the step size for the gradient descent search to 0.01, and the decay factor rho (referred to as momentum in the text) to 0.99. The optimization algorithm is called by passing the objective and derivative functions, the bounds, the number of iterations, the step size, and the decay factor as arguments. The solutions obtained during the optimization process are stored in a list, which is then used to generate a plot of the solutions.
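The driver code described above is not reproduced in this excerpt. The following is a minimal sketch of how the rmsprop function might be invoked, using a simple bowl-shaped objective as a stand-in (the article's own objective and derivative functions are defined in Part 1); the [-1, 1] bounds are an assumption.

import numpy as np

# stand-in objective and derivative (placeholders for the functions defined in Part 1)
def objective(x, y):
    return x ** 2.0 + y ** 2.0

def derivative(x, y):
    return np.asarray([2.0 * x, 2.0 * y])

# seed the pseudo-random number generator for reproducibility
np.random.seed(1)
# assumed bounds of the search space: [-1, 1] in each dimension
bounds = np.asarray([[-1.0, 1.0], [-1.0, 1.0]])
# hyperparameters quoted in the text
n_iter = 60
step_size = 0.01
rho = 0.99
solutions = rmsprop(objective, derivative, bounds, n_iter, step_size, rho)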

Implementing Adamax Optimization Algorithm to Minimize Objective Function. (Equation Explanation)

Adamax is a variant of the Adam optimization algorithm that uses the infinity norm (maximum absolute value) of the gradients instead of the L2 norm to perform adaptive learning rate adjustment. It is particularly useful when dealing with very large gradients, such as in deep learning models with many parameters.

The equation for Adamax can be expressed as:

g_t = ∇J(θ_t)
m_t = β_1 * m_{t-1} + (1-β_1)g_t
u_t = max(β_2*u_{t-1}, abs(g_t))
θ_t = θ_{t-1} - α/(1-β_1^t) * m_t / (u_t + ε)
where:
θ_t is the updated value of the parameters at time step t.
θ_{t-1} is the previous value of the parameters.
α is the learning rate, which determines the step size at each iteration.
g_t is the gradient of the loss function at time step t.
m_t is the exponentially weighted moving average (EWMA) of the gradients at time step t.
β_1 and β_2 are the decay rates for the first moment and the infinity norm, typically set to 0.9 and 0.999, respectively.
u_t is the exponentially weighted infinity norm of the gradients (a decayed running maximum of their absolute values) at time step t.
ε is a small constant added to the denominator to prevent division by zero.

The algorithm first calculates the EWMA of the gradients (m_t) and the EWMA of the infinity norm of the gradients (u_t). The learning rate for each parameter is then adjusted based on the ratio of these two quantities. If the gradients are very large in some dimensions, the learning rate is scaled down by the infinity norm to avoid overshooting the minimum.

Adamax keeps the bias correction that Adam applies to the first moment estimate; in the update above it appears as the factor 1/(1 - β_1^t) applied to the step size, which is equivalent to using

m̂_t = m_t / (1 - β_1^t)

Unlike Adam, the infinity-norm term u_t needs no bias correction, because the max operation does not shrink it toward zero during the early iterations.

Overall, Adamax is a powerful optimization algorithm that can handle large gradients and provide stable and efficient convergence. It is widely used in deep learning and other machine learning applications.
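As a quick illustration of how the infinity norm caps the step, here is a single-parameter sketch of the Adamax update above; the gradient sequence and hyperparameters are illustrative only.

alpha, beta1, beta2, eps = 0.002, 0.9, 0.999, 1e-8
theta, m, u = 1.0, 0.0, 0.0

for t, g in enumerate([0.5, 8.0, 0.3], start=1):    # includes one very large gradient
    # m_t = β_1·m_{t-1} + (1-β_1)·g_t
    m = beta1 * m + (1 - beta1) * g
    # u_t = max(β_2·u_{t-1}, |g_t|): dominated by the largest recent gradient
    u = max(beta2 * u, abs(g))
    # bias-corrected step size, then an update scaled by the infinity norm
    step = alpha / (1 - beta1 ** t)
    theta = theta - step * m / (u + eps)
    print('t=%d  u=%.3f  theta=%.5f' % (t, u, theta))

When the large gradient arrives, u jumps to its absolute value and divides the update, so the parameter change stays on the order of the step size rather than growing with the gradient.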

Implementing Adamax Optimization Algorithm to Minimize Objective Function. (Code Explanation)

# gradient descent algorithm with adamax
def adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2):
    # track all solutions
    solutions = list()
    # generate an initial point
    x = bounds[:, 0] + np.random.rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # initialize moment vector and weighted infinity norm
    m = [0.0 for _ in range(bounds.shape[0])]
    u = [0.0 for _ in range(bounds.shape[0])]
    # run iterations of gradient descent
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # build a solution one variable at a time
        for i in range(x.shape[0]):
            # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            # u(t) = max(beta2 * u(t-1), abs(g(t)))
            u[i] = max(beta2 * u[i], abs(g[i]))
            # step_size(t) = alpha / (1 - beta1^(t+1))
            step_size = alpha / (1.0 - beta1**(t + 1))
            # delta(t) = m(t) / (u(t) + eps), small eps guards against division by zero
            delta = m[i] / (u[i] + 1e-8)
            # x(t) = x(t-1) - step_size(t) * delta(t)
            x[i] = x[i] - step_size * delta
        # evaluate candidate point
        score = objective(x[0], x[1])
        solutions.append(x.copy())
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return solutions

The second algorithm, Adamax, is defined as a function that takes the same objective, derivative, bounds, and iteration-count arguments as the RMSProp function, but with the parameters alpha, beta1, and beta2 in place of the step size and decay factor. An initial point is randomly generated within the bounds defined earlier, and the moment vector and weighted infinity norm are initialized to zero. During each iteration the gradient is calculated and the solution is built one variable at a time: the moment vector and infinity norm are updated from the gradient, the step size is bias-corrected according to the iteration number, and each variable is moved by the resulting delta. Finally, the new candidate point is evaluated and appended to the list of solutions. The progress of the optimization is reported by printing the iteration number, the candidate point, and its score.
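For completeness, here is a minimal sketch of how the adamax function might be driven. It reuses the stand-in objective and derivative from the RMSProp example, and the hyperparameter values are assumptions rather than the article's exact settings.

# assumes objective, derivative and adamax are defined as above
np.random.seed(1)
bounds = np.asarray([[-1.0, 1.0], [-1.0, 1.0]])
n_iter = 60
alpha = 0.02       # assumed learning rate
beta1 = 0.9        # assumed decay rate for the moment vector
beta2 = 0.999      # assumed decay rate for the weighted infinity norm
solutions = adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2)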

Overall, the given code demonstrates the usage of two optimization algorithms, RMSProp and Adamax, for a given objective function and derivative function. The code can be modified by changing the input bounds, the number of iterations, the step size, and the momentum/alpha/beta parameters to obtain different solutions. The progress of the optimization process can be visualized by plotting the solutions obtained during the process.
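As an example of the kind of visualization described, a minimal Matplotlib sketch is shown below; it assumes the solutions list and the stand-in objective function from the examples above, and the plotting choices (contour levels, color map) are arbitrary.

import numpy as np
import matplotlib.pyplot as plt

# contour plot of the objective with the optimization path overlaid
solutions = np.asarray(solutions)               # list returned by rmsprop() or adamax()
xaxis = np.arange(-1.0, 1.0, 0.01)
yaxis = np.arange(-1.0, 1.0, 0.01)
xgrid, ygrid = np.meshgrid(xaxis, yaxis)
zgrid = objective(xgrid, ygrid)
plt.contourf(xgrid, ygrid, zgrid, levels=50, cmap='jet')
plt.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
plt.xlabel('x')
plt.ylabel('y')
plt.show()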

The given code block includes two main parts: the optimization algorithms (Adamax and RMSprop) and a data visualization with Plotly. The first part is focused on implementing the Adamax and RMSprop optimization algorithms, whose goal is to find the minimum of a given objective function. The algorithms require the objective function, its derivative, the bounds of the input values, and some hyperparameters such as the step size, the number of iterations, and the decay factors used for averaging the gradients.

The Adamax algorithm starts by setting up the range of input values, the number of iterations, and the hyperparameters: the step size alpha, beta1 (the decay rate for the moment vector of gradients), and beta2 (the decay rate for the weighted infinity norm of the gradients). It then initializes the solutions list and generates an initial random point within the input bounds. At each iteration it calculates the gradient of the objective function at the current point, updates the moment vector and weighted infinity norm for each variable, computes the bias-corrected step size and the delta for each variable, and updates the variables accordingly. Finally, it evaluates the objective function at the new point and stores the solution in the solutions list. The process is repeated for the predefined number of iterations, and the function returns the solutions list.

Optimization Algorithms and Data Visualization with Plotly in Python

The RMSprop algorithm follows the same overall structure as Adamax, but instead of a moment vector and infinity norm it maintains an exponentially weighted average of the squared gradients. It initializes the solutions list, generates an initial random point, and at each iteration calculates the gradient of the objective function, updates the squared-gradient average using the decay factor, derives a per-variable step size from that average, updates the variables, evaluates the objective function, and stores the solution. The process repeats for a predefined number of iterations.

import plotly.express as px

# load the built-in tips dataset and draw a 2D density contour of tip vs. total bill
df = px.data.tips()
fig = px.density_contour(df, x="total_bill", y="tip", z="size")
fig.update_traces(contours_coloring="fill", contours_showlabels=True)
fig.show()

The second part of the code block uses the Plotly library to create a 2D density contour plot. The dataset used in the visualization is the "tips" dataset bundled with Plotly, which contains information on the total bill, tip amount, and size of the party at a restaurant. The density contour plot shows the distribution of points in the dataset, with the x-axis representing the total bill, the y-axis representing the tip amount, and the fill color representing the aggregated party size (the z column) in each region. The contour lines indicate regions of high and low density. Plotly provides a variety of tools for customizing the appearance of the plot, including the color scheme, contour line thickness, and axis labels. The call fig.show() displays the resulting plot in the output cell.
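As an illustration of that flexibility, a possible customization of the same figure is sketched below; the color scale, line width, and axis titles are arbitrary choices rather than the article's settings.

import plotly.express as px

df = px.data.tips()
fig = px.density_contour(df, x="total_bill", y="tip", z="size")
# fill the contours, label them, and adjust color scale, line width and axis titles
fig.update_traces(contours_coloring="fill", contours_showlabels=True,
                  colorscale="Viridis", line_width=1)
fig.update_layout(xaxis_title="Total bill ($)", yaxis_title="Tip ($)")
fig.show()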

Summary: Implementing Optimization Algorithms and Visualization with Python Libraries

In conclusion, the provided code demonstrates the implementation of gradient-based optimization algorithms such as RMSProp and Adamax, along with visualization using the NumPy, Matplotlib, and Plotly libraries. Both optimizers build on plain gradient descent, which moves iteratively toward the minimum along the negative gradient of the objective function, and they adapt the step size during optimization to improve convergence. The code also includes plotting of the solutions, which is useful for understanding the path taken by the algorithm toward the minimum. Overall, this code provides an excellent starting point for anyone interested in implementing optimization algorithms or exploring the power of visualization in optimization.

Here is the working demo video of the live application

You can visit the live website through this link: http://ec2-3-145-64-26.us-east-2.compute.amazonaws.com:8502/

Please find the link to code here: https://github.com/deepkapha/EarthScanWebinar/tree/main/notebook
