Optimize Neural Networks With Gradient-Free Methods Using PyTorch and Nevergrad

Optimize neural network parameters (weights and biases) with gradient-free optimization methods by leveraging the power of PyTorch and Nevergrad.

Neural networks are one of the biggest revolutions in the growth of Artificial Intelligence. They are the main component of all deep learning algorithms used in computer vision, natural language processing, and beyond, such as CNNs, RNNs, and the very famous Transformer, which is the foundation of all LLMs (Large Language Models). Gradient-based optimization methods like stochastic gradient descent or BFGS are very powerful, deterministic, and considered the default way of training many machine learning algorithms, especially neural networks. However, in some cases gradient-free methods are better suited to the problem statement. In this post, I will show you how to implement the latter with two powerful tools: PyTorch and Nevergrad.

What are the motivations?

There are several reasons to use derivative-free methods for model training:

  • Gradient-based optimizers can get stuck in local minima, because they do not explore the entire search space.
  • The cost function is not always differentiable, or computing the partial derivatives is too hard.
  • They are the easiest way to train a model with multiple objectives and/or many constraints (this is my own motivation).

Required knowledge:

I assume that you have a basic understanding of the following points:

  • The global concepts of machine learning and neural networks.
  • The main idea behind mathematical optimization and gradient-based methods.
  • And finally, familiarity with Python programming.

OK, let's start our study.

A quick recap of neural network training (with gradient descent)

We will focus only on the training stage, as stated in the title. In neural networks, there are three main stages during training: forward propagation, backpropagation, and weight updating; the last one is done with a technique called gradient descent. Indeed, this training method requires the partial derivatives of all the weights, as we can see in the equation below (Eq. 2):

ŷ = φ(W·x + B)

Eq 1. Very basic formulation of forward propagation.

where ŷ is the prediction, x the input, W the weight matrix, B the bias, and φ the activation function.

∂L/∂W = (∂L/∂ŷ) · (∂ŷ/∂h) · (∂h/∂W)

Eq 2. Simplified backpropagation formulation with the chain rule.

We suppose that we have just one hidden layer; h refers to the layer activation between the output and the given weight W under evaluation, and L is the loss function.

W ← W − α · ∂L/∂W

Eq 3. Weight updating (gradient descent).

This is the formulation of vanilla gradient descent, where α is the learning rate.

The following scheme shows the entire training process with gradient descent:

Fig.1 Global scheme of neural net gradient-based training
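To make this concrete, here is roughly what one pass of this gradient-based loop looks like in PyTorch. This is a minimal sketch only; model, dataloader_train_wine, and criterion refer to the objects defined later in the experimentation section.

import torch

# one epoch of standard gradient-based training (sketch)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # lr plays the role of alpha

for x, y in dataloader_train_wine:
    optimizer.zero_grad()
    y_hat = model(x)            # forward propagation (Eq. 1)
    loss = criterion(y_hat, y)  # loss computation
    loss.backward()             # backpropagation (Eq. 2)
    optimizer.step()            # weight update (Eq. 3)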

Gradient-free optimization for the learning process

Unlike the above method, the partial derivatives of the weights in Eq. 2 are no longer necessary. I assume that the loss function is the mean squared error (the usual choice for regression problems). We can directly formulate our learning process as follows:

min over W, B of (1/N) · Σᵢ (yᵢ − ŷᵢ)²

Eq 4. Optimization problem design.

where N is the size of the dataset/batch and y is the true target value.

The global scheme of our training step is shown below:

Fig.2 Global scheme of neural net gradient-free training

All new weight values are found directly by the gradient-free optimization algorithm and do not require any partial derivatives.

In addition, we can add as many objective functions as we need, as well as many constraints.
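To give an intuition of Fig. 2, the training loop boils down to "propose weights, evaluate the loss, keep the best". Below is a deliberately naive random-search sketch of that pattern, for intuition only; model, dataloader_train_wine, and criterion are defined later in the experimentation section, and a real gradient-free optimizer such as NGOpt proposes candidates far more cleverly than this.

import copy
import torch

best_loss, best_state = float("inf"), None
x, y = next(iter(dataloader_train_wine))

for _ in range(100):
    # propose candidate weights (here: purely at random)
    for p in model.parameters():
        p.data = torch.randn_like(p.data)
    # evaluate the objective of Eq. 4 on one batch
    loss = criterion(model(x), y).item()
    # keep the best candidate seen so far
    if loss < best_loss:
        best_loss, best_state = loss, copy.deepcopy(model.state_dict())

model.load_state_dict(best_state)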

A few examples of gradient-free optimization methods:

There are plenty of available methods; the following list is not exhaustive:

  • Genetic Algorithm
  • Particle Swarm Optimization
  • Simulated Annealing
  • Covariance Matrix Adaptation
  • etc.

We'll be using the NGOpt solver in the experimentation part, which is a "meta-optimizer" that adapts to the parameters provided (budget, number of workers, parametrization) and should therefore be a good default option in Nevergrad. If you would like more information, you can check the original research paper.

Quick Introduction to PyTorch

PyTorch is an open-source deep learning framework (my preferred one :D) primarily developed by Facebook. It is widely used in the research community and in industry for developing and training deep learning models. In other words, it contains (almost) all the useful pieces for creating deep learning models.

To install PyTorch, you need Python installed; then you can run the following command line:

pip install torch torchvision torchaudio

For more information about installation (e.g. OS-specific options, CUDA support, …), please follow this link: https://pytorch.org/
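Once installed, a quick sanity check that PyTorch is usable:

import torch

print(torch.__version__)  # installed version
print(torch.rand(2, 3))   # a random 2x3 tensor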

An Overview of Nevergrad

Nevergrad is a Python library developed by Facebook AI Research (FAIR) that serves as a versatile and efficient optimization platform. Its primary focus is on derivative-free optimization (DFO), where the objective function to be optimized might not have a known gradient. Nevergrad provides a wide range of gradient-free optimization algorithms.
To install this library, run this command:

pip install nevergrad

For further information, please check the official documentation.
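To get a feel for the API before tackling the neural network, here is a tiny toy example (a sketch in the spirit of the library's quickstart; the function and the numbers are just an illustration):

import nevergrad as ng

# minimize a simple quadratic function without using any gradient
def square(x):
    return sum((x - 0.5) ** 2)

optimizer = ng.optimizers.NGOpt(parametrization=2, budget=100)
recommendation = optimizer.minimize(square)
print(recommendation.value)  # should end up close to [0.5, 0.5]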

Experimentation

Let's take a real example of a machine learning problem to make the story more interesting. The task consists of predicting wine quality; you can download the dataset here.

We start by importing all the necessary libraries:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

import nevergrad as ng

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

Load the dataset from the CSV file with pandas and inspect its content:

wine_data = pd.read_csv("wine/winequality-red.csv", sep=";")
wine_data.tail(3)

As you can see, we have 12 variables, and the main goal is to predict the wine quality from the other wine characteristics.
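If you want a quick look at the data before going further (the red wine dataset has 1599 rows):

print(wine_data.shape)                  # (1599, 12): 11 features + the quality target
print(wine_data["quality"].describe())  # distribution of the target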

For now, let's split our dataset into train/test sets and normalize it using min-max normalization:

data_train, data_test = train_test_split(wine_data, random_state=104, test_size=0.25, shuffle=True)

# normalize data
scaler = MinMaxScaler()
columns_to_normalize = data_train.columns.tolist()

data_train[columns_to_normalize] = scaler.fit_transform(data_train[columns_to_normalize])
data_test[columns_to_normalize] = scaler.transform(data_test[columns_to_normalize])

For more convenience, we use the Dataset and DataLoader classes provided by PyTorch:

class create_dataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __getitem__(self, index):
        # the last column is the target (quality), the rest are the features
        x = torch.from_numpy(self.data.iloc[index][:-1].to_numpy())
        y = torch.from_numpy(self.data.iloc[index][-1:].to_numpy())
        return x, y

    def __len__(self):
        return len(self.data)

dataset_train_wine = create_dataset(data_train)
dataset_test_wine = create_dataset(data_test)
dataloader_train_wine = DataLoader(dataset_train_wine, batch_size=128, shuffle=True)
dataloader_test_wine = DataLoader(dataset_test_wine, batch_size=1, shuffle=False)
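A quick check that the batches have the expected shape and dtype:

x, y = next(iter(dataloader_train_wine))
print(x.shape, x.dtype)  # torch.Size([128, 11]) torch.float64
print(y.shape, y.dtype)  # torch.Size([128, 1]) torch.float64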

The dataset is ready, so let's create our model: a very small neural network composed of two layers with ReLU as the activation function:

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(11, 10).to(dtype=torch.float64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(10, 1).to(dtype=torch.float64)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

Let's create an instance of the model:

model = Net()
model = model.to(dtype=torch.float64)
criterion = nn.MSELoss().to(dtype=torch.float64)
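It is worth listing the trainable parameters, because these are exactly the arrays the gradient-free optimizer will have to propose (and their shapes will reappear in the Nevergrad parametrization below):

for name, p in model.named_parameters():
    print(name, tuple(p.shape))
# fc1.weight (10, 11), fc1.bias (10,), fc2.weight (1, 10), fc2.bias (1,)

n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # 131-dimensional search space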

Now, we create our trainer:

import copy

class train_manager:
    def __init__(self, model, dataloader_train, dataloader_test, criterion):
        self.model = model
        self.best_model = None
        self.dataloader_train = dataloader_train
        self.dataloader_test = dataloader_test
        self.batch_index = 0
        self.max_batch = len(list(dataloader_train))
        self.x = None
        self.y = None
        self.best_score = 1e9
        self.criterion = criterion

    def batch_loop(self):
        # pick the current batch, restarting from the beginning of the
        # dataset once every batch has been seen
        if self.max_batch - self.batch_index == 0:
            self.batch_index = 0
        batch = list(self.dataloader_train)[self.batch_index]
        self.x, self.y = batch[0], batch[1]

    def weights_updating(self, weights):
        # replace every parameter tensor with the values proposed by the optimizer
        for n, layer in enumerate(self.model.parameters()):
            layer.data = torch.from_numpy(weights[n]).to(dtype=torch.float64)

    def evaluate(self):
        # compute the MSE on the test set, back in the original (un-normalized) scale
        preds = []
        reals = []
        for x, y in self.dataloader_test:
            yhat = self.model(x)
            preds.append(yhat)
            reals.append(y)
        preds = torch.cat(preds).detach().cpu().squeeze(dim=1)
        reals = torch.cat(reals).detach().cpu().squeeze(dim=1)
        inverse_transform = lambda t: t * (scaler.data_max_[-1] - scaler.data_min_[-1]) + scaler.data_min_[-1]
        preds = inverse_transform(preds)
        reals = inverse_transform(reals)
        loss = self.criterion(preds, reals)
        return loss

    def cost_function(self, input_weight, first_hidden_weight, second_hidden_weight, output_weight):
        self.batch_loop()

        weights = [input_weight, first_hidden_weight, second_hidden_weight, output_weight]
        self.weights_updating(weights)

        output = self.model(self.x)
        loss = self.criterion(output, self.y)

        self.batch_index += 1
        test_loss = self.evaluate()
        if self.best_score > test_loss:
            self.best_score = test_loss
            self.best_model = copy.deepcopy(self.model)
        print(f"test loss (mse): {test_loss}, best score: {self.best_score}")
        return loss.item()  # Nevergrad expects a plain float

The batch_loop function is used to loop over the dataset; in other words, the optimization process is performed batch by batch.

The cost_function method contains the fitness/loss computation; it is the function the optimizer calls to evaluate the quality of each proposed solution.

The evaluate function computes the loss on the test set. Our aim is to save the parameters that give the best score on the test set.

And finally, the weights_updating function is responsible for replacing the old weights with the new values found by the gradient-free algorithm. Technically, the assignment to layer.data plays the major role in the weight-updating stage.
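To make that last point concrete, here is what the assignment does on a single standalone layer (a small illustration, independent of the wine model):

import numpy as np
import torch
import torch.nn as nn

layer = nn.Linear(3, 2).to(dtype=torch.float64)
proposed = np.zeros((2, 3))  # e.g. values proposed by the optimizer
# assigning to .data replaces the weights in place, without any gradient bookkeeping
layer.weight.data = torch.from_numpy(proposed).to(dtype=torch.float64)
print(layer.weight)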

parametrization = ng.p.Instrumentation(
    input_weight=ng.p.Array(shape=(10, 11)),
    first_hidden_weight=ng.p.Array(shape=(10,)),
    second_hidden_weight=ng.p.Array(shape=(1, 10)),
    output_weight=ng.p.Array(shape=(1,))
)

trainer = train_manager(model, dataloader_train_wine, dataloader_test_wine, criterion)
fitness = trainer.cost_function
optimizer = ng.optimizers.NGOpt(parametrization=parametrization, budget=500)
learned_param = optimizer.minimize(fitness)

ng.p.Instrumentation in Nevergrad allows us to define and manage the optimization variables and parameters used in the optimization task. It provides a convenient and flexible way to represent the search space of the optimization problem. I directly reused the shapes of the weights of the neural network model to make the code easier to read.

ng.optimizers contains all the solvers; here we choose the NGOpt algorithm, which takes as parameters the search space (parametrization) and the budget.

optimizer.minimize is the starting point of our optimization/training process; it loops over the dataset a number of times depending on the budget and uses the fitness function to check the quality of each solution.
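As a side note, Nevergrad also exposes an ask/tell interface, which is roughly what minimize does under the hood; it can be useful if you want full control over the loop (logging, early stopping, …). A sketch with the same parametrization and fitness as above, using a fresh optimizer instance:

for _ in range(optimizer.budget):
    candidate = optimizer.ask()                          # proposed weights
    loss = fitness(*candidate.args, **candidate.kwargs)  # evaluate them
    optimizer.tell(candidate, loss)                      # report the result back
learned_param = optimizer.provide_recommendation()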

Once training is done, let's evaluate the model on the test set:

test_loss = trainer.evaluate()
print(test_loss)

Before training, we had an MSE equal to 7.55, and after training, we obtain an MSE equal to 0.69. We didn't make any comparison with other methods because our aim is only to show the idea behind the gradient-free approach and how to implement it for neural network training.
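If you want to reuse the result afterwards, the recommendation returned by minimize carries the best weights found, and the trainer kept a copy of the model that scored best on the test set (a quick sketch):

# weights recommended by the optimizer, keyed by the names used in the parametrization
print(learned_param.kwargs["input_weight"].shape)  # (10, 11)

# predictions with the best model saved by the trainer
x, y = next(iter(dataloader_test_wine))
print(trainer.best_model(x), y)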

Conclusion

In this blog post, we explored another approach to neural network optimization using gradient-free methods with PyTorch and Nevergrad. Traditional optimization techniques rely on gradients, which can be unavailable or unsuitable for our problem. By using Nevergrad's instrumentation and optimizer modules, we can easily define and manage our optimization variables (weights) and train our model with no need for partial derivative calculations.
For further work, we will try multi-objective training.
