Data Augmentation and Preliminary Model Training

Authors: Maitreya Venkataswamy, Adrian Lam, Nathaniel Lanier

Building a Data Augmentation Pipeline

One of the biggest steps we’ve taken to boost performance during training is data augmentation. Data augmentation is the process of modifying the training data in real time during training in order to generate new “synthetic” data points. This effectively increases the size of the training set, which matters in deep learning because deep neural networks benefit from large amounts of training data. The original training data is stored in TFRecord files, which we load into a TensorFlow Dataset using the TFRecordDataset class. We then batch the dataset so that during training the data is loaded into memory directly as mini-batches. The images are resized to the desired input size before the augmentation process begins.
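As a rough sketch of this loading step (the feature keys "image" and "target", the file pattern, and the exact size values are assumptions for illustration, not necessarily what our notebook uses):

```python
import tensorflow as tf

IMAGE_SIZE = [258, 258]  # input resolution; see the batch-size tradeoff discussed below
BATCH_SIZE = 16          # illustrative value, tuned against Colab memory limits

def parse_example(serialized):
    # The feature keys here are assumptions about the TFRecord schema.
    features = tf.io.parse_single_example(serialized, {
        "image": tf.io.FixedLenFeature([], tf.string),
        "target": tf.io.FixedLenFeature([], tf.int64),
    })
    image = tf.io.decode_jpeg(features["image"], channels=3)
    image = tf.image.resize(image, IMAGE_SIZE)  # resize before any augmentation
    # Keep raw [0, 255] pixel values; the Keras EfficientNet base normalizes internally.
    return tf.cast(image, tf.float32), tf.cast(features["target"], tf.int32)

# Hypothetical file pattern; batching, augmentation, and prefetching are added
# in the full loader sketched below.
raw_ds = tf.data.TFRecordDataset(tf.io.gfile.glob("train/*.tfrec"))
parsed_ds = raw_ds.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
```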

The augmentation process involves randomly flipping the image, both vertically and horizontally, followed by a random rotation. The angle range of the rotations is left as a tunable parameter, but generally we allow any rotation to be selected. We set the rotation’s “fill” mode to “reflect”, since that maintains a realistic-looking image and also gives the CNN more copies of a diseased/healthy plant to extract features from. Finally, to improve throughput, we “prefetch” the augmented images: while the previous batch is training on the GPU, the next batch is prepared concurrently by the CPU.
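A minimal sketch of what that augmentation might look like, here using the Keras RandomRotation preprocessing layer as one way to get reflect-filled rotations (a factor of 0.5 spans the full ±180° range); our notebook may implement the rotation differently:

```python
# ±180° rotation (factor is a fraction of 2*pi), empty corners filled by reflection
rotate = tf.keras.layers.RandomRotation(factor=0.5, fill_mode="reflect")

def augment(image, label):
    image = tf.image.random_flip_left_right(image)  # random horizontal flip
    image = tf.image.random_flip_up_down(image)     # random vertical flip
    image = rotate(image, training=True)            # random rotation, reflect fill
    return image, label

def build_loader(filenames, training):
    ds = tf.data.TFRecordDataset(filenames)
    ds = ds.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    if training:
        ds = ds.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    # Prefetch: the CPU prepares the next batch while the GPU trains on the current one.
    return ds.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
```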

We split the TFRecord files into the training set and validation set beforehand, and create separate data loaders for each. We do not perform any rotations or flipping of the validation data, since we want a single consistent dataset with which to compare validation scores across the training history.
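With the hypothetical loader sketched above, that split looks like the following (train_filenames and val_filenames stand in for the pre-split TFRecord lists):

```python
train_ds = build_loader(train_filenames, training=True)  # flips/rotations applied
val_ds = build_loader(val_filenames, training=False)     # fixed, un-augmented data
```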

The code for the data loader generation can be found here.

Preliminary Model Training

The next step was to train preliminary models to get a sense of which architectures are effective on this particular problem. One of the first issues we faced was the tradeoff between resolution and batch size: Colab does not have enough memory for both a high resolution and a large batch size, so we had to experiment with different values. Any increase in one of these hyperparameters required a commensurate decrease in the other to avoid running into memory limits. Our intuition was that increasing resolution would be important, because the fine-grained details and textures of the leaves seem to be a major factor in differentiating between the various classes, and those details are lost as resolution decreases. During training we found this intuition to be correct: resolution does appear to be one of the most consequential hyperparameters. Adjusting the tradeoff accordingly, we got up to a resolution of 258 x 258.

The decrease in batch size needed to reach 258 x 258 is somewhat concerning, because each batch will likely be less representative of the dataset as a whole. As a result, each step taken by the RMSProp/Adam optimizer will probably be less precisely oriented towards a global or local minimum than it would be with a larger batch size. We hope that the smaller batch sizes will only lead to longer training times rather than more serious issues, and we will keep an eye on this during further stages of model training. In future iterations we will look to increase the resolution even further, either by using Colab Pro or by decreasing the batch size further, as long as this does not cause serious problems with training.

The models we trained were ResNet, InceptionResNetV2, DenseNet, and EfficientNet, each initialized with pre-trained ImageNet weights. Many hyperparameters were tuned for these models by watching training accuracy over several epochs and adjusting accordingly: the learning rate, the decay steps and decay rate of the exponential decay scheduler, and the resolution and batch size described earlier. All of the training for these models can be seen in the notebook here.
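As an illustration, a preliminary model with an ImageNet-initialized base and an exponential decay schedule might be assembled like this; the head layers, learning rate, and decay values are placeholders rather than our tuned settings, and the five output classes are an assumption about this dataset:

```python
def build_model(num_classes=5, image_size=258):
    # ImageNet weights initialize the convolutional base; the top is replaced.
    base = tf.keras.applications.EfficientNetB0(
        include_top=False, weights="imagenet",
        input_shape=(image_size, image_size, 3))
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    # Exponential decay: lr = initial_lr * decay_rate ** (step / decay_steps)
    lr = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.9)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```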

It is worth pointing out that the aforementioned models performed better when we left all the parameters trainable, as opposed to freezing the base layers. With the base layers frozen, the validation accuracy was unable to break 0.7, even after extensive experimentation with various learning rates and with ReduceLROnPlateau(). It appears that for this particular classification problem, performance is superior when all the weights are left trainable. The reason could be that the distinction between classes here rests largely on very specific low-level, local features. As seen in our EDA (and as mentioned above), the main distinguishing features between the disease classes are leaf texture and color (very specific low-level features), as opposed to the higher-level features that distinguish classes in most of the object classification problems ImageNet was originally trained on. It thus makes sense to give the earlier layers some flexibility to extract the low-level features pertaining to leaf texture and color. Leaving all layers unfrozen can in some cases lead to overfitting, but this should not be a significant problem given the large dataset we have, and our results confirm this.
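The two variants differ only in the base model's trainable flag. A sketch, reusing the hypothetical model and loaders above (the callback settings are placeholders; note that ReduceLROnPlateau adjusts a scalar learning rate, so this variant compiles with a fixed value rather than the decay schedule):

```python
model = build_model()
base = model.layers[0]  # the EfficientNet base inside the Sequential model

# Frozen-base variant we tried first (validation accuracy stalled below 0.7):
# base.trainable = False

# Variant we settled on: leave every layer trainable.
base.trainable = True

# Halve the learning rate whenever validation loss plateaus.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=2)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=20, callbacks=[reduce_lr])
```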

After reviewing the training scores for each model, as well as this paper and this blog post, we feel that EfficientNet is the most promising of the models we trained, with a test score of 0.83 on Kaggle. Plots of our training and validation accuracy, loss, and learning rate schedule per epoch for our preliminary EfficientNet-based model can be seen below. For our next steps we will look to optimize the layers we have added to the existing architecture and adjust hyperparameters. While we saw better performance with all layers unfrozen than with the base model frozen, we have not tried tuning the number of frozen layers; that is itself a hyperparameter to tune, and we will look into model performance when we freeze a subset of the base layers (a sketch follows the plots below). We also realized it might be a good idea to crop the background regions of our images instead of resizing them in the preprocessing step, saving computational cost without losing any of the important fine-grained details of the leaf texture. Finally, if we can get some of the other architectures to reach performance comparable to our EfficientNet, we may look into ensembling multiple models to push our validation score even higher.

Figure: accuracy (y-axis) vs. epoch (x-axis); orange: training, blue: validation.
Figure: loss (y-axis) vs. epoch (x-axis); orange: training, blue: validation.
Figure: learning rate (y-axis) vs. epoch (x-axis).
Figure: Kaggle submission of our baseline EfficientNet-based model.
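For that frozen-subset experiment, the split point is itself the hyperparameter in question; a sketch with a hypothetical value:

```python
N_FROZEN = 100  # hypothetical split point; this is itself a hyperparameter to tune
for layer in base.layers[:N_FROZEN]:
    layer.trainable = False  # freeze the earliest (most generic) layers
for layer in base.layers[N_FROZEN:]:
    layer.trainable = True   # fine-tune everything after the split
# Recompile so the changed trainable flags take effect during training.
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```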

Sources:

Tan, Mingxing, and Quoc Le. “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.” arXiv, 2020, https://arxiv.org/abs/1905.11946. Accessed 15 March 2021.

Tan, Mingxing, and Quoc Le. “EfficientNet: Improving Accuracy and Efficiency through AutoML and Model Scaling.” Google AI Blog, 29 May 2019, https://ai.googleblog.com/2019/05/efficientnet-improving-accuracy-and.html. Accessed 15 March 2021.

Oliveira, Dimitre. “Cassava Leaf Disease - TPU Tensorflow - Training.” Kaggle, https://www.kaggle.com/dimitreoliveira/cassava-leaf-disease-tpu-tensorflow-training.
