# Optimization using Adam on Sparse Tensors

Adaptive optimization methods, such as Adam and Adagrad, maintain some statistics over time about the variables and gradients (e.g. moments) which affect the learning rate. These statistics won’t be very accurate when working with sparse tensors, where most of its elements are zero or near zero. We investigated the effects of using Adam with varying learning rates on sparse tensors, and explored solutions to allow Adam to properly handle the sparsity.

In our experiments, we observed a lack of responsiveness to learning rate changes. We speculate this is due to fact that Adam update its moment estimates from a much smaller sample than the batch size suggests due to the sparsity, and also unnecessarily attempt to update the cells even if they are zero or unused. We hoped that a sparse variant of Adam might address this problem. Sparse variants are found in a number deep learning libraries, although TensorFlow is a bit behind on this compared to PyTorch.

This article will provide some background for Adam and sparse representations, the implementation details for using TensorFlow sparse variant of Adam with sparse tensors as well as the outcome of our experiments.

### Sparse Representations

A sparse representation is where most elements of the representation are near-zero with a few nonzero elements, which leads to a more semantic, efficient and robust representation [2]. There are many different methods for creating sparse representations such as [1, 3, 4]. For simplicity, we can create a sparse representation by masking the activation outputs in a hidden layer. A mask, for example, can be based on the top k cells in the output, to be represented as a one in the mask with the rest being zeros. This ends up creating sparse gradients as some cells will not get updated when they’re zeroed out in the sparse representation.

The Adam optimizer works by maintaining a per-parameter learning rate that is based on the moment estimates, see Figure 1 for formulas. It keeps track of an exponential moving average controlled by the decay rates beta1 and beta2, which are recommended to be close to 1.0 (the default values are 0.9 and 0.999 respectively). Epsilon, used for numerical stability, might be worth tuning as it has been noted that it can vary in orders of magnitude depending on the architecture and the problem. For example, an Inception network training on ImageNet, an optimal epsilon value might be 1.0 or 0.1.

The sparse variant of Adam would use its ‘sparse update’ mechanism when it sees that the gradient of a variable is of type IndexedSlices, which means an operation like tf.gather was used on a variable to create a sparse version of it. In this case, Adam will sparsely perform weight updates, but update its moment estimates as the dense Adam would. The Lazy Adam Optimizer addresses that part by also performing sparse moment updates. This optimizer is currently in the tf.contrib package and seems to not be as stable as the Adam optimizer.

### Implementation

TensorFlow operations that create sparse representations from variables, such as tf.gather and tf.nn.embedding_lookup, must be used directly on the variable itself before any other operations. This ended up being a bit complicated especially when dealing with convolutional architectures, as we the sparse mask was applied after the activation function. Instead, we focused on our a non-convolutional implementation of a Sparse Autoencoder. In that implementation, we compute the activation output for the encoding and create the sparse masks based on those outputs. The sparse mask used in training is reduced over the batch using tf.reduce_max to be used as the indices for active cells when using the tf.gather operation. This operation is then applied directly to the weights matrix, resulting in a sparse weights matrix. This is then used to recompute the encoding.

### Experiments

We initially ran some experiments comparing the standard and sparse Adam implementations using a Sparse Autoencoder on the MNIST dataset. Initially we noticed superior training performance with the Lazy Adam Optimizer (sparse Adam); however, this did not translate to better classification performance on either the training or the test sets. This led us to dig more into how exactly the sparsening methods in the training and test settings differ.

We noticed that the lifetime sparsity constraint [1, 4] was causing all cells to be updated every step. This was due to the fact that when we reduced the mask over the batch to retrieve the indices, all the cells were active and resulted in a non-sparse weights matrix, and resulted in a non-sparse output. The consequence was much worse performance as we were essentially training on a dense encoding, then evaluating on a sparse one.