Flower classification with TPUs

Mr Mspb Study
10 min read · Jun 15, 2020

The challenge is to build a machine learning model that identifies the type of flowers in a dataset of images.

In this competition we’re classifying 104 types of flowers based on their images drawn from five different public datasets. Some classes are very narrow, containing only a particular sub-type of flower (e.g. pink primroses) while other classes contain many sub-types (e.g. wild roses).

The goal of the project:

Investigate different pretrained models with the competition dataset

Tasks for the project:

  • Analyse existing pretrained models;
  • Choose which one is the most suitable for the competition;
  • Analyse the impact of different data augmentations and hyperparameter tuning on the model’s metrics;
  • Add layers to the neural network and analyse their impact.

The dataset consists of 12,753 training images, 3,712 validation images and 7,382 unlabeled test images. According to the competition rules, it is forbidden to use any additional datasets. Images are provided in several sizes: 192×192, 224×224, 331×331 and 512×512.
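
Since the competition runs on TPUs, all models below were trained inside a TPU distribution strategy. A minimal sketch of the setup, assuming the Kaggle TPU runtime and TensorFlow 2.x (the constants are illustrative):

```python
import tensorflow as tf

try:
    # Detects the TPU attached to the Kaggle runtime.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
except ValueError:
    # No TPU found: fall back to the default CPU/GPU strategy.
    strategy = tf.distribute.get_strategy()

print("Replicas in sync:", strategy.num_replicas_in_sync)

IMAGE_SIZE = [224, 224]   # one of the provided resolutions
NUM_CLASSES = 104
BATCH_SIZE = 16 * strategy.num_replicas_in_sync  # illustrative; 128 was used in the experiments
```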

Train images:

Test images:

To achieve this goal, three pretrained models were chosen:

  1. DenseNet201 — the authors of the competition suggested starting training with this neural network.
  2. Xception — a popular neural network among Kagglers in this competition.
  3. EfficientNet — another popular neural network.

Let’s take a look at the architecture and advantages of each of them.

DenseNet201 (Dense Convolutional Network)

DenseNet was introduced in a 2017 CVPR paper that received the Best Paper Award and has over 2000 citations. It was jointly developed by Cornell University, Tsinghua University and Facebook AI Research (FAIR).

Thanks to dense connections, it achieves high accuracy with fewer parameters compared with ResNet and Pre-Activation ResNet.

The following topics will be briefly covered:

  1. Dense Block
  2. DenseNet Architecture
  3. Advantages of DenseNet

Dense Block

In a standard ConvNet, the input image goes through multiple convolutions to obtain high-level features.

In DenseNet, each layer obtains additional inputs from all preceding layers and passes on its own feature maps to all subsequent layers. Concatenation is used, so each layer receives the “collective knowledge” of all preceding layers.

DenseNet Architecture

Each composition layer applies pre-activation Batch Normalization (BN) and ReLU, followed by a 3×3 convolution that produces output feature maps with k channels (the growth rate).
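
A minimal sketch of such a dense block in Keras (the 1×1 bottleneck layer used in DenseNet-B is omitted for brevity):

```python
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=32):
    # Each composition layer: BN -> ReLU -> 3x3 Conv producing k = growth_rate channels,
    # whose output is concatenated with all previously computed feature maps.
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.ReLU()(y)
        y = layers.Conv2D(growth_rate, 3, padding="same", use_bias=False)(y)
        x = layers.Concatenate()([x, y])   # dense connectivity via concatenation
    return x
```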

Advantages of DenseNet

  1. Strong gradient flow — the error signal can be propagated to earlier layers more easily and more directly.
  2. Parameter and computational efficiency — for each layer, the number of parameters in ResNet is directly proportional to C×C, while in DenseNet it is directly proportional to l×k×k (and the growth rate k is much smaller than C).
  3. Diversified features — since each layer in DenseNet receives all preceding layers as input, its features are more diversified and tend to have richer patterns.

Specifically, in this competition a DenseNet201 model pretrained on the ILSVRC 2012 classification dataset was used.

Xception (Extreme Inception)

This neural network relies on depthwise separable convolutions and outperforms Inception-v3 on image classification.

Original Depthwise Separable Convolution

The original depthwise separable convolution is the depthwise convolution followed by a pointwise convolution.

  1. Depthwise convolution is the channel-wise n×n spatial convolution.
  2. Pointwise convolution is the 1×1 convolution that changes the channel dimension (see the sketch after this list).
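
A minimal sketch of both steps with Keras layers; Keras also provides SeparableConv2D, which fuses them into one layer:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(224, 224, 3))

# 1. Depthwise convolution: one n x n spatial filter per input channel.
x = layers.DepthwiseConv2D(kernel_size=3, padding="same", use_bias=False)(inputs)
# 2. Pointwise convolution: 1 x 1 convolution that mixes channels and changes the dimension.
x = layers.Conv2D(64, kernel_size=1, use_bias=False)(x)

# Equivalent single layer:
y = layers.SeparableConv2D(64, kernel_size=3, padding="same", use_bias=False)(inputs)
```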

A normal convolutional layer simultaneously processes both spatial information (the correlation of neighboring points within a single channel) and inter-channel information, since convolution is applied to all channels at once.

The Xception architecture is based on the assumption that these two types of information can be processed sequentially without loss of network quality, and decomposes the usual convolution into pointwise convolution and spatial convolution.

On the left is a canonical Inception module (Inception V3). On the right is an “extreme” version of the Inception module, with one spatial convolution per output channel of the 1×1 convolution.

The modified depthwise separable convolution is the pointwise convolution followed by a depthwise convolution. This modification is motivated by the Inception module in Inception-v3, where the 1×1 convolution is done first, before any n×n spatial convolutions. Thus, it is slightly different from the original version.

Advantages of Xception

  • Depthwise separable convolution — decoupling channel information from spatial information provides a more efficient way of detecting deep features.

Specifically, in this competition an Xception model pretrained on the ILSVRC 2012 classification dataset was used.

EfficientNet

EfficientNet is a class of new models derived from studying how to scale models by balancing the depth and width (number of channels) of the network, as well as the resolution of the input images. The authors of the paper propose a new compound scaling method, which uniformly scales depth/width/resolution with fixed ratios between them.

The traditional manual scaling method has one problem: after a certain level, scaling no longer improves performance and instead starts to degrade it.

The following topics will be briefly covered:

  1. Compound scaling;
  2. Baseline network setting;
  3. What is new in EfficientNet;
  4. EfficientNet Architecture.

Compound scaling

The compound scaling method uses a compound coefficient φ to scale width, depth and resolution together. The target is to maximize the model accuracy for any given resource constraints, which can be formulated as an optimization problem. The scaled attributes follow the formulas below (as given in the EfficientNet paper):
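
depth: d = α^φ
width: w = β^φ
resolution: r = γ^φ
subject to α · β² · γ² ≈ 2 and α ≥ 1, β ≥ 1, γ ≥ 1

and the scaled network itself is a composition of stages, N(d, w, r) = ⊙ᵢ Fᵢ^(d·Lᵢ)( X⟨r·Hᵢ, r·Wᵢ, w·Cᵢ⟩ ),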

where w, d, r are the coefficients for scaling network width, depth and resolution, and N, F, L, W, C, H are predefined parameters of the baseline network.

α, β, γ are constants that can be determined by a small grid search. Intuitively, φ is a user-specified coefficient that controls how many more resources are available for model scaling, while α, β, γ specify how to assign these extra resources to network.

Baseline network settings

Since model scaling does not change the layer operators F of the baseline network, having a good baseline network is critical. The baseline network was developed by leveraging a multi-objective neural architecture search that optimizes both accuracy and FLOPS.

The optimization goal can be written as ACC(m) × [FLOPS(m)/T]^w, where ACC(m) and FLOPS(m) denote the accuracy and FLOPS of model m, T is the target FLOPS, and w = -0.07 is a hyperparameter controlling the trade-off between accuracy and FLOPS.

The authors observed that model scaling can be applied to any CNN architecture and works well, but the overall performance depends significantly on the baseline architecture. As for EfficientNet, its baseline is similar to MnasNet:

MnasNet-A1 architecture — (a) is a representative model selected from Table 1; (b)-(d) are a few corresponding layer structures. MBConv denotes mobile inverted bottleneck conv, DWConv denotes depthwise conv, k3x3/k5x5 denotes kernel size, BN is batch norm, H×W×F denotes tensor shape (height, width, depth), and ×1/2/3/4 denotes the number of repeated layers within the block.

What is new in EfficientNet

EfficientNet uses 7 MBConv blocks, which also use squeeze & excitation block along with swish activation.

Swish activation — a new activation function, which is the multiplication of a linear (identity) function and a sigmoid activation: swish(x) = x · sigmoid(x).

As the activation plot shows, Swish avoids the problem of zeroed-out derivatives for negative input values.
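
A one-line definition in TensorFlow, for consistency with the rest of the code (the same function ships as tf.nn.swish):

```python
import tensorflow as tf

def swish(x):
    # swish(x) = x * sigmoid(x)
    return x * tf.math.sigmoid(x)
```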

Inverted residual block — the opposite of the residual block introduced in ResNet. The skip connections link the narrow layers, while the wider (expanded) layers sit between them. Overall, this type of block requires fewer parameters than the original one.

Squeeze-and-Excitation (SE) block — a method of assigning a weight to each channel instead of treating them all equally.
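
A minimal sketch of an SE block in Keras; the reduction ratio and the swish activation inside it are illustrative choices:

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_block(x, reduction=4):
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                        # squeeze: one value per channel
    s = layers.Dense(channels // reduction, activation=tf.nn.swish)(s)
    s = layers.Dense(channels, activation="sigmoid")(s)           # excitation: per-channel weight in (0, 1)
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                              # reweight the input feature maps
```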

EfficientNet Architecture

An MBConv block takes two inputs: the first is the data and the second is the block arguments. The data is the output of the previous layer. The block arguments are a collection of attributes used inside an MBConv block, such as input filters, output filters, expansion ratio, squeeze ratio, etc. The standard EfficientNet architecture with MBConv blocks is presented below:

To sum up, it can be seen that the total execution process can be divided into 4 main phases:

Advantages of EfficientNet

  • Transfer learning — a large amount of data was used for training the model, which makes it a strong starting point for new tasks;
  • Scalability — a baseline ConvNet can easily be scaled up to any target resource constraints in a more principled way.

The weights for this module were obtained by training on the ILSVRC-2012-CLS dataset for image classification (“Imagenet”) with AutoAugment preprocessing.

Comparing models

This section describes the results of comparing the pretrained models presented above, trained on the competition dataset. The following settings were used for the experiment:

  • Number of epochs = 45
  • Batch size = 128
  • Image size = 224 * 224
  • Optimizer = Adam
  • Loss = sparse_categorical_crossentropy
  • Metrics = sparse_categorical_accuracy

It should be noted that random image rotation was used as the default data augmentation.

The model architecture is sequential: a GlobalAveragePooling2D layer is added on top of the pretrained model, and the last layer is a fully connected layer whose size equals the number of classes (104). A minimal sketch of this setup is shown below.
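
The sketch below shows DenseNet201 plugged in this way; Xception and EfficientNet were used in the same fashion. It assumes the TPU strategy set up earlier, and the exact training code in the experiments may differ:

```python
import tensorflow as tf

with strategy.scope():
    base_model = tf.keras.applications.DenseNet201(
        weights="imagenet", include_top=False, input_shape=(224, 224, 3)
    )
    base_model.trainable = True   # fine-tuning; set to False for transfer learning

    model = tf.keras.Sequential([
        base_model,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(104, activation="softmax"),
    ])

    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["sparse_categorical_accuracy"],
    )
```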

Fine-tuning

After this setup, the following numbers of parameters were initialized for fine-tuning the models:

After training, the following loss and accuracy for the validation dataset were obtained:

Results for the confusion matrix are presented below:

Confusion matrix on validation dataset with fine-tuning

Transfer learning

After this setup, the following numbers of parameters were initialized for transfer learning:
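
The only difference from the fine-tuning setup above is whether the pretrained base is trainable; a sketch of the switch:

```python
# Transfer learning: freeze the pretrained base so only the new head is trained.
base_model.trainable = False

# Fine-tuning: unfreeze the base so the pretrained weights are updated as well.
# base_model.trainable = True

# Recompile after changing trainability so the setting takes effect.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["sparse_categorical_accuracy"])
```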

After training, the following loss and accuracy for the validation dataset were obtained:

I apologize for the mistake with the Xception accuracy and loss, which are shown for the second training run. In any case, the accuracy of Xception is close to 0.8.
Confusion matrix on validation dataset with transfer learning

Preliminary results

As can be seen from the graphs, the most suitable neural network for this competition is EfficientNet. There are two remarkable things that should be noted:

  • In the fine-tuning setting, EfficientNet is less susceptible to overfitting. Such behaviour is important for further investigations.
  • In the transfer learning setting, DenseNet achieves higher accuracy compared to the others. Taking into account that DenseNet has half as many parameters, this neural network can be considered a candidate for future research.

Now we will experiment with the data and hyperparameters to increase accuracy on the validation dataset.

Achieving better results

The following steps were taken in pursuit of higher accuracy on the validation dataset:

  1. Image transformation using random rotation and shearing (Image_RS)
  2. Adding a custom learning rate scheduler (Model_LR)
  3. Adding a fully connected layer with the ReLU activation function (Model_ReLU)
  4. The previous image transformations + random zooming and shifting (Image_RSZS)

It should be noted that every 5 epochs the dataset was refreshed with new random transformations. If we suppose that every dataset refresh changes the images substantially, then we can multiply the default number of training images by 9 (45 epochs / 5 epochs per refresh = 9 augmented versions).

The set on the left shows pictures that were randomly rotated and sheared. On the right, pictures were rotated, shifted, zoomed and sheared.

The hypothesis is that the Image_RSZS transformation is strong enough to provide an effectively new set of pictures every time it is called. So, for this last transformation and according to the hypothesis, the effective training dataset size was above 100k images (12,753 × 9 ≈ 115k).
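
For illustration, the four transformations map onto standard Keras augmentation parameters roughly as follows; the values are hypothetical, since the post does not state the exact ranges used, and the actual TPU pipeline may implement them differently:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=15,        # random rotation (Image_RS and Image_RSZS)
    shear_range=10.0,         # random shearing (Image_RS and Image_RSZS)
    zoom_range=0.2,           # random zooming (Image_RSZS only)
    width_shift_range=0.1,    # random horizontal shifting (Image_RSZS only)
    height_shift_range=0.1,   # random vertical shifting (Image_RSZS only)
    fill_mode="nearest",
)
```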

As can be seen from the table below, the main impact came from introducing the custom learning rate scheduler.
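
The exact schedule is not given in the post; a scheduler commonly used in this competition ramps the learning rate up for a few epochs and then decays it exponentially, wired in through a Keras callback. A hypothetical sketch:

```python
import tensorflow as tf

# Hypothetical values; the schedule used in the experiments may differ.
LR_START, LR_MAX, LR_MIN = 1e-5, 1e-3, 1e-5
WARMUP_EPOCHS, SUSTAIN_EPOCHS, DECAY = 5, 0, 0.8

def lr_schedule(epoch):
    if epoch < WARMUP_EPOCHS:
        # linear ramp-up from LR_START to LR_MAX
        return (LR_MAX - LR_START) / WARMUP_EPOCHS * epoch + LR_START
    if epoch < WARMUP_EPOCHS + SUSTAIN_EPOCHS:
        return LR_MAX
    # exponential decay towards LR_MIN afterwards
    return (LR_MAX - LR_MIN) * DECAY ** (epoch - WARMUP_EPOCHS - SUSTAIN_EPOCHS) + LR_MIN

lr_callback = tf.keras.callbacks.LearningRateScheduler(lr_schedule, verbose=1)
# model.fit(..., epochs=45, callbacks=[lr_callback])
```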

Results

Modern neural network architectures were analysed, and the top 3 pretrained neural nets were chosen for the research. Two different methods were applied for training the models — fine-tuning and transfer learning. It was shown that the most suitable pretrained neural network for the competition dataset is EfficientNet.

Different methods for increasing accuracy were applied. It was shown that the best scenario is applying data augmentation with rotation, shearing, zooming and shifting, together with a custom learning rate scheduler.

References

DenseNet:

Xception:

EfficientNet:
