Understanding FastAI v2 Training with a Computer Vision Example- Part 2: FastAI Optimizers

Rakesh Sukumar · Published in Analytics Vidhya · Oct 20, 2020
Image Source: Spatial Uncertainty Sampling for End-to-End Control

This is the second article in this series. This series is aimed at those who are already familiar with FastAI and want to dig a little deeper and understand what is happening behind the scenes. The overall structure of this series is as follows:

  1. Study the resnet34 model architecture and build it using plain Python & PyTorch.
  2. Deep dive into FastAI optimizers & implement a NAdam optimizer.
  3. Study FastAI Learner and Callbacks & implement learning rate finder (lr_find method) with callbacks.

In this article, we will use the resnet model built in the previous article to understand FastAI optimizers. We will use Google Colab to run our code. You can find the code file for this series here. Let’s get started directly.

First, we will quickly recreate our model from the previous article (refer to the first article for explanations).
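Here is a minimal sketch of that setup, assuming the Imagenette dataset and fastai's built-in xresnet34 as a stand-in for the model we built by hand in part 1 (the image size, batch size and transforms are assumptions, not the exact values used in the article):

```python
from fastai.vision.all import *

# Assumed setup: Imagenette at 160 px, resized to 128, batch size 64
path = untar_data(URLs.IMAGENETTE_160)
dls = ImageDataLoaders.from_folder(path, train='train', valid='val',
                                   item_tfms=Resize(128), bs=64)

# Stand-in for the hand-built xresnet34 from the first article
model = xresnet34(n_out=dls.c)
learn = Learner(dls, model, metrics=accuracy)
```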

The optimizer used to fit the model is saved in the learn.opt attribute of the learner object. The optimizer object is not created until we fit the model using the fit() method or one of its variants (like fit_one_cycle()). The “opt_func” argument to the Learner constructor is used to create the optimizer; by default, FastAI uses an Adam optimizer. Also by default, all trainable parameters of the model (i.e. parameters of all layers that are not frozen) are passed to the optimizer as a single list at the time of its creation.
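A quick way to see this lazy creation:

```python
print(learn.opt_func)   # the optimizer factory, Adam by default
print(learn.opt)        # None — the Optimizer is only built when fit() runs
```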

FastAI optimizers create parameter groups to let us set different hyperparameter values for different parts of the network. The default behavior is to create a single parameter group containing all parameters of the model. We can create multiple parameter groups by passing a list of collections/generators of parameters to the optimizer. We will see an example of this in the third article; for now, let’s find a good learning rate and train the model for one epoch.
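A minimal sketch of that step; the learning rate below is an assumed value read off the lr_find() plot, not the article's exact number:

```python
learn.lr_find()                 # plots loss vs. learning rate
learn.fit_one_cycle(1, 3e-3)    # train for one epoch; 3e-3 is an assumption
```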

A FastAI optimizer has 4 main attributes:

  1. param_list: A list of lists of parameters. Each inner list forms a parameter group (explained later). FastAI uses a customized list class called an ‘L’.
  2. hypers: A list (an L) of hyperparameter dictionaries (learning rate, weight decay, momentum etc.), with one dictionary per parameter group.
  3. state: A dictionary containing the state variables, such as average gradient, average squared gradient etc. for all parameters of the model. The state for a parameter p can be accessed by opt.state[p]. The state variables are used to implement adaptive optimizers like Adam.
  4. Optimizer callbacks: FastAI optimizers use callback functions to update parameters & state variables during the opt.step() operation. We will see how this works with an example below. Let’s examine the default optimizer that FastAI has added to our model.

param_list: Let’s check out the param_list first.
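A short sketch of what to expect (note that the attribute is spelled param_lists in current fastai releases):

```python
opt = learn.opt
print(len(opt.param_lists))      # 1  — a single parameter group by default
print(len(opt.param_lists[0]))   # 116 — one entry per parameter tensor of xresnet34
```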

We see that param_list is an L containing an L of parameters, i.e. L(L(parameters), L(parameters), …). Each inner L corresponds to a parameter group. In our case, we have only a single L inside the outer L, as FastAI creates only one parameter group by default. The inner L contains 116 items, which is the number of parameter tensors in our xresnet34 model (check out the first article for the details).

No. of parameter tensors in xresnet34 architecture

Let’s understand the first parameter “0.0.weight”. The first ‘0’ represents the nn.Sequential of the input stem, the second ‘0’ indicates the first layer of the input stem which is an nn.Conv2d(). We do not have a bias for this layer as we have passed bias=False to nn.Conv2d(). “0.0.weight” should be of shape [32, 3, 3, 3] corresponding to 32 output channels, 3 input channels, and 3x3 kernel. Let’s check this.
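A quick check, assuming the parameter names follow the numbering described above:

```python
w = dict(learn.model.named_parameters())['0.0.weight']
print(w.shape)   # expected: torch.Size([32, 3, 3, 3])
```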

Other parameters of the model can be identified in a similar way.

Partial output from named_children method on xresnet34 model

hypers: The ‘hypers’ attribute of the optimizer stores the hyperparameter values. It is an L of hyperparameter dictionaries, with one dictionary per parameter group. This allows us to set different hyperparameter values (such as discriminative learning rates) for different layers (parameter groups) in our neural network model. Note that we have only one parameter group here. Let’s check out the hyperparameter values after training for one epoch above.
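For example (the exact lr and mom values depend on where the one-cycle schedule left them):

```python
learn.opt.hypers
# e.g. (#1) [{'wd': 0.01, 'sqr_mom': 0.99, 'lr': ..., 'mom': ..., 'eps': 1e-05}]
```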

hyperparameters of the Adam optimizer

‘mom’ and ‘sqr_mom’ above are beta1 & beta2 in the Adam parameter update rule. Refer to the post An overview of gradient descent optimization algorithms if you want a refresher on different optimization algorithms. We get as many hyperparameter dictionaries as there are parameter groups. Hyperparameter values are updated using the set_hypers() method of the optimizer object. The set_hypers() method accepts hyperparameter name & value pairs as keyword arguments (a short sketch follows the list below). Hyperparameter values can be specified in one of the following formats:

  • A single value: The hyperparameter will be set to the same value across all parameter groups.
  • A list-like object: The length of the list must equal the number of parameter groups. The hyperparameter in each parameter group will be set to the corresponding value in the list.
  • A slice object: If both “start” and “stop” are specified, the hyperparameter in the parameter groups will be set to multiplicatively even (log-spaced) values between start & stop, with the first parameter group getting the start value & the last parameter group getting the stop value. If the “start” value is not specified (e.g. slice(1e-3)), the hyperparameter in the last parameter group will get the “stop” value & all other parameter groups will get a value of “stop”/10.
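A short sketch of set_hypers(); the multi-group calls are shown commented out since our model currently has a single parameter group:

```python
opt = learn.opt
opt.set_hypers(wd=1e-2)                    # single value: same wd for every group
# With, say, three parameter groups the following would also be valid:
# opt.set_hypers(lr=[1e-5, 1e-4, 1e-3])    # one value per parameter group
# opt.set_hypers(lr=slice(1e-5, 1e-3))     # log-spaced from start to stop
# opt.set_hypers(lr=slice(1e-3))           # last group gets 1e-3, the rest 1e-4
opt.hypers
```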

state: The state attribute of the optimizer stores the state for each parameter. The state values are used to implement adaptive optimizers such as Adagrad, RMSprop, Adam etc. Let’s explore the state for a couple of parameters in our model after training for 1 epoch.
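A quick look at the state of the very first parameter (the input-stem conv weight):

```python
opt = learn.opt
print(len(opt.state))        # 116 — one entry per parameter tensor
p = opt.param_lists[0][0]    # the first conv weight of the input stem
print(opt.state[p].keys())   # e.g. dict_keys(['grad_avg', 'sqr_avg', 'step'])
```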

You can see that state is a dictionary of length 116 with parameters as the keys & their state as the values.

The last layer in our model is a linear layer with 512 input activations and 10 output activations. Let’s check out the bias parameter for this layer, as it is just a one-dimensional tensor of length 10. The all_params() method of the optimizer object returns an L((p, pg, state, hyper), …) for all parameters p in the model, where pg is the parameter group that p belongs to, state is the state for p, and hyper is the hyperparameter dictionary for the parameter group pg. Let’s use this method to study the state.
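A sketch of that inspection; the expected values follow the description in the text:

```python
p, pg, state, hyper = learn.opt.all_params()[-1]   # last parameter = bias of the final Linear layer
print(p.shape)   # torch.Size([10])
print(state)     # {'do_wd': False, 'grad_avg': tensor([...]), 'sqr_avg': tensor([...]), 'step': 147}
print(hyper)     # the hyperparameter dictionary of the (only) parameter group
```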

Above, we can see that state tracks grad_avg & sqr_avg required for Adam optimizer. ‘step’ stores the number of optimizer steps (i.e. parameter updates) made which is also required for the Adam update rule. Since we have just trained the model for 1 epoch and the size of our training dataset is 147 batches, the step value is 147.

The ‘do_wd’ variable decides whether weight decay is applied to this parameter. This is controlled by the ‘wd_bn_bias’ argument (default value: False) passed to the Learner constructor; the default value specifies that weight decay is not to be applied to any bias parameters or to parameters of batchnorm type (batchnorm/instance norm/layer norm). The create_opt() method of the learner object inserts the variable ‘do_wd’ with a value of False into the state dictionary for these parameters. Weight decay is not applied during the step() operation if the parameter has ‘do_wd’ in its state with a value of False. Note that the create_opt() method creates the optimizer object by calling the ‘opt_func’ function passed to the Learner constructor.

The create_opt() method also stores a variable called ‘force_train’ (with a value of True) in the state dictionary of batchnorm-type parameters if the ‘train_bn’ argument to the Learner constructor is set to True (the default). This variable causes the batchnorm parameters to be trained even if they belong to frozen layers in a transfer learning setting. We will see this in action in the next article.

The third parameter from the last in our model is a weight tensor for a batchnorm layer. Let’s check out the state for this parameter:
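A sketch of the same inspection for that batchnorm weight; the expected flags follow the text above:

```python
p, pg, state, hyper = learn.opt.all_params()[-3]   # third parameter from the end
print(state['do_wd'], state['force_train'])        # expected: False True
```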

callbacks: FastAI optimizers use callback functions to update parameters & state variables during the opt.step() operation. The step() method calls the callbacks one after another on every trainable parameter of the model. These callbacks may update the parameter value and/or return new state values, which are then saved to the parameter’s state. Let’s implement a NAdam optimizer to complete our study of FastAI optimizers. I have shamelessly copied most of the code below from the FastAI repo.

Also note that optimizer callbacks can have default hyperparameter values specified in a .defaults attribute. These defaults will be used by the __init__() method of the optimizer object to set hyperparameter values. However, if hyperparameters are also passed in as arguments to the __init__() method, the passed-in values take precedence over the callback defaults.

Nadam Optimizer

Nadam (Nesterov-accelerated Adaptive Moment Estimation) is a combination of Adam and NAG (Nesterov accelerated gradient). The parameter update rule for the Nadam optimizer is:

Nadam parameter update rule
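Written out, this is the standard formulation from the overview post referenced above:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{\nu}_t} + \epsilon}\left(\beta_1 \hat{m}_t + \frac{(1-\beta_1)\,g_t}{1-\beta_1^t}\right)$$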

where,

  • θt+1 and θt are the parameter values at time steps t+1 and t.
  • η is the learning rate.
  • νt is the exponentially decaying average of past squared gradients and νt-hat is its bias-corrected estimate.
  • mt is the exponentially decaying average of past gradients and mt-hat is its bias-corrected estimate.
  • gt is the current gradient.
  • ϵ is a smoothing term to avoid division by zero.
  • β1 and β2 are mom and sqr_mom in FastAI parlance.

Let’s create callbacks to implement the Nadam parameter update rule:
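A sketch of those callbacks, closely following the fastai Adam implementation with the step function swapped for a Nadam step; the stat callbacks average_grad, average_sqr_grad, step_stat and weight_decay come straight from fastai, and debias() is copied from the fastai repo:

```python
from fastai.vision.all import *          # Optimizer, average_grad, average_sqr_grad, step_stat, weight_decay
from IPython.core.debugger import set_trace

def debias(mom, damp, step): return damp * (1 - mom**step) / (1 - mom)

def nadam_step(p, lr, mom, sqr_mom, step, grad_avg, sqr_avg, eps, **kwargs):
    "Nadam update for `p`: Adam with a Nesterov-style look-ahead on the momentum term"
    debias1 = debias(mom,     1-mom,     step)   # = 1 - mom**step
    debias2 = debias(sqr_mom, 1-sqr_mom, step)   # = 1 - sqr_mom**step
    set_trace()                                  # pause here to inspect the update
    num = grad_avg.mul(mom).add(p.grad.data, alpha=1-mom)    # β1*m_t + (1-β1)*g_t
    denom = (sqr_avg/debias2).sqrt().add_(eps)               # sqrt of bias-corrected ν_t, plus ϵ
    p.data.addcdiv_(num, denom, value=-lr/debias1)           # debias1 applies the remaining bias correction
    return p

nadam_step.defaults = dict(eps=1e-5)   # callback-level default, overridden by values passed to the optimizer
```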

Note that the debias() function simply returns (1 - mom**step) when damp = (1 - mom). So, debias1 = (1 - mom**step) and debias2 = (1 - sqr_mom**step). We have added a set_trace() in nadam_step() to check our optimizer.

‘num’ above computes mom*grad_avg + (1-mom)*p.grad (i.e. β1*mt + (1-β1)*gt in the Nadam parameter update rule). Note that the bias correction for grad_avg (debias1) is not applied in the calculation of ‘num’ here; it is applied in the final addcdiv_() update instead.

Next, let’s create a function to return a Nadam optimizer after adding the required callbacks.
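A sketch of such a factory, mirroring fastai's own Adam() function with nadam_step swapped in:

```python
from functools import partial

def Nadam(params, lr, mom=0.9, sqr_mom=0.99, eps=1e-5, wd=0.01, decouple_wd=True):
    "An `Optimizer` that applies the Nadam update rule"
    cbs = [weight_decay] if decouple_wd else [l2_reg]   # decoupled (AdamW-style) weight decay by default
    cbs += [partial(average_grad, dampening=True), average_sqr_grad, step_stat, nadam_step]
    return Optimizer(params, cbs, lr=lr, mom=mom, sqr_mom=sqr_mom, eps=eps, wd=wd)
```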

Let’s validate the Nadam parameter update rule using a test parameter, the FastAI way (check the Optimizer notebook in the FastAI repo). An Excel implementation of this test can be found here (file name: Nadam step test example.xlsx).
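A hedged reconstruction of that test; the tst_param helper and the lr/wd values are assumptions chosen so that lr*wd = 0.01, which reproduces the weight-decayed values [0.99, 1.98, 2.97] discussed below:

```python
def tst_param(val, grad):
    "A test parameter with a preset gradient"
    p = tensor(val).float()
    p.grad = tensor(grad).float()
    return p

p = tst_param([1., 2., 3.], [0.1, 0.2, 0.3])
opt = Nadam([p], lr=0.1, wd=0.1)
opt.step()    # drops into the debugger because of the set_trace() in nadam_step()
```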

Note that weight decay is already applied (p has values [0.99, 1.98, 2.97]). Both grad_avg and sqr_avg are initialized with zeros and have values (1-mom)*p.grad and (1-sqr_mom)*p.grad**2, respectively, for the first time step. debias1 and debias2 have values 0.1 (= 1 - mom**1) and 0.01 (= 1 - sqr_mom**1), respectively.

As before, ‘num’ computes mom*grad_avg + (1-mom)*p.grad (i.e. β1*mt + (1-β1)*gt in the Nadam parameter update rule). Let’s check the parameter values after the opt.step() operation.

Note that we did not zero the gradients after our step operation. Let’s go for one more round.
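Continuing the same sketch:

```python
opt.step()               # second update with the same (un-zeroed) gradients
print(p)                 # parameter values after two steps
print(opt.state[p])      # 'step' is now 2; grad_avg and sqr_avg keep accumulating
```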

Now, let’s fit our model with our own implementation of the Nadam optimizer.

We redefine the nadam_step() function below to comment out the set_trace().
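A sketch of that final run; the learning rate is an assumed value, so the exact accuracy will vary from run to run:

```python
def nadam_step(p, lr, mom, sqr_mom, step, grad_avg, sqr_avg, eps, **kwargs):
    debias1 = debias(mom,     1-mom,     step)
    debias2 = debias(sqr_mom, 1-sqr_mom, step)
    # set_trace()   # commented out for full training
    num = grad_avg.mul(mom).add(p.grad.data, alpha=1-mom)
    p.data.addcdiv_(num, (sqr_avg/debias2).sqrt().add_(eps), value=-lr/debias1)
    return p

learn = Learner(dls, xresnet34(n_out=dls.c), opt_func=Nadam, metrics=accuracy)
learn.fit_one_cycle(10, 3e-3)   # the article reports 80.96% accuracy after 10 epochs
```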

We got 80.96% accuracy in 10 epochs with our Nadam optimizer.

In this article we have learned how to create a custom FastAI optimizer. However, if you want to use a standard PyTorch optimizer with FastAI, you can use the OptimWrapper class to wrap the PyTorch optimizer & use it within your learner. Find a tutorial for this here.

In the next article we will study the FastAI Learner & callbacks, and build our own implementation of the lr_find() method. You can find the code files for this series here.

Links to other articles in this series:

References:

  1. Practical Deep Learning for Coders
  2. FastAI GitHub Repo
  3. FastAI Book
  4. FastAI Documentation
  5. An overview of gradient descent optimization algorithms
  6. Adam: A Method for Stochastic Optimization
  7. Pytorch Documentation
