SpineNet: An Unconventional Backbone Architecture from Google Brain

Analyzing the newly introduced scale-permuted backbone architecture for recognition and localization

Yesha R Shastri
VisionWizard
7 min read · Jul 15, 2020


The problem of classification has been solved quite efficiently by scale-decreased networks — the encoder part of encoder-decoder architectures, in which resolution decreases progressively. However, this design fails to generate the strong multi-scale features required for object detection (simultaneous recognition and localization).

How is SpineNet different from previous backbones?

“High resolution may be needed to detect the presence of a feature, while its exact position need not be determined with equally high precision.” [1]

The Drawback of Scale-Decreased Backbone

  • Normally, a backbone model refers to the scale-decreased network in an encoder-decoder architecture, i.e. the encoder.
  • Since the task of the encoder is to compute feature representations of the input, a scale-decreased backbone discards spatial information as the resolution drops.
  • As the layers get deeper, features become more abstract and less localized, making it difficult for the decoder to recover the precise spatial detail it needs.

Proposed Innovation

In order to overcome the difficulty of obtaining and retrieving multi-scale features for localization, a scale-permuted model with cross-scale connections is introduced with the following improvements:

  1. The scales of feature maps are given the flexibility to increase or decrease at any point in the architecture by permuting the ordering of blocks, as opposed to the earlier strictly decreasing pattern. This helps the network retain spatial information.
  2. Connections between feature maps are allowed to go across feature scales, enabling feature fusion from multiple scales.
Figure 1: An example of a scale-decreased network (left) vs. scale permuted network (right). The width denotes the resolution and height denotes the feature dimension (number of channels). [Source: [1]]
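The difference can be made concrete with a toy sketch. The block orderings below are illustrative, not the paper's actual architectures; level L here means a feature map at 1/2^L of the input resolution:

```python
# A scale-decreased backbone visits strictly non-decreasing levels:
# resolution only ever shrinks as depth grows.
scale_decreased = [2, 2, 3, 3, 4, 4, 5, 5]

# A scale-permuted network may revisit higher resolutions later,
# preserving spatial detail deep in the network (illustrative ordering).
scale_permuted = [2, 4, 3, 5, 3, 4, 2, 5]

def is_scale_decreased(levels):
    """True if feature levels never decrease, i.e. resolution never grows."""
    return all(a <= b for a, b in zip(levels, levels[1:]))

print(is_scale_decreased(scale_decreased))  # True
print(is_scale_decreased(scale_permuted))   # False
```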

The Methodology and Architecture

Neural Architecture Search (NAS)

  • The proposed SpineNet architecture was found by using Neural Architecture Search (NAS) [1].
  • NAS uses a reinforcement-learning controller that proposes candidate architectures; each candidate is sent to the environment, where it is trained fully.
  • The resulting accuracy acts as a reward, and the choice of architecture depends on it.
Figure 2: Neural Architecture Search Method in the context of [1].
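The propose-train-reward loop can be caricatured in a few lines. This is purely illustrative: the real controller is a learned RL policy, which is replaced here by a caller-supplied proposal function, and full training is replaced by a mock accuracy:

```python
import random

random.seed(1)

def train_and_evaluate(architecture):
    """Stand-in for fully training a candidate; returns a mock accuracy."""
    return random.random()

def nas_search(propose_fn, num_trials=10):
    best_arch, best_reward = None, float("-inf")
    for _ in range(num_trials):
        arch = propose_fn()                # controller proposes a candidate
        reward = train_and_evaluate(arch)  # environment trains it fully
        if reward > best_reward:           # accuracy serves as the reward
            best_arch, best_reward = arch, reward
    return best_arch, best_reward

best_arch, best_reward = nas_search(lambda: {"num_blocks": random.choice([40, 49, 60])})
print(best_arch, best_reward)
```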

The SpineNet architecture consists of a fixed stem network (a scale-decreased network) followed by a learned scale-permuted network. The NAS search space for building the scale-permuted network comprises scale permutations, cross-scale connections, and block adjustments.

  1. Scale permutations: A block can only take input from parent blocks with lower orderings, so the ordering of blocks matters. The permutations are searched for the intermediate and output blocks.
  2. Cross-scale connections: For each block in the search space, two input connections from earlier blocks are defined.
  3. Block adjustments: Each block can adjust its scale level and block type. The scale level of an intermediate block can be adjusted by an amount in {−1, 0, 1, 2}, and the block type can be either a bottleneck block or a residual block.
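Sampling one point from such a search space can be sketched as follows. This is a hypothetical simplification (block count, stem handling, and field names are my own), meant only to show how the three components above combine into a candidate:

```python
import random

random.seed(0)

NUM_BLOCKS = 8
BLOCK_TYPES = ["bottleneck", "residual"]
LEVEL_ADJUSTMENTS = [-1, 0, 1, 2]  # allowed scale-level adjustments

def sample_candidate(num_blocks=NUM_BLOCKS):
    # 1. Scale permutation: a random ordering of the blocks.
    ordering = list(range(num_blocks))
    random.shuffle(ordering)

    blocks = []
    for position, block_id in enumerate(ordering):
        # 2. Cross-scale connections: two parents chosen from blocks that
        #    appear earlier in the ordering (-1 stands in for the stem).
        pool = ordering[:position] or [-1]
        parents = [random.choice(pool), random.choice(pool)]
        # 3. Block adjustments: scale-level offset and block type.
        blocks.append({
            "block": block_id,
            "parents": parents,
            "level_adjust": random.choice(LEVEL_ADJUSTMENTS),
            "type": random.choice(BLOCK_TYPES),
        })
    return blocks

candidate = sample_candidate()
print(candidate[0])
```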

Resampling in Cross-Scale Connections

  • While performing cross-scale connections, a challenge arises: the parent and target blocks may differ in resolution and feature dimension, and their features must still be fused.
  • Spatial and feature resampling is therefore performed to match the parent's output to the target block.
  • In resampling, nearest-neighbor interpolation is used for upsampling, whereas a stride-2 3×3 convolution downsamples the feature map to match the target resolution.
Figure 3: Resampling operations [Source: [1]]

For a detailed understanding, refer to section 3.2 of [1].
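As a rough illustration of the spatial half of resampling, here is nearest-neighbor upsampling and a crude stride-2 subsampling on a tiny 2-D map. The subsampling is only a stand-in for the stride-2 3×3 convolution, and the real operation also matches feature dimensions, which is omitted here:

```python
def upsample_nearest(fmap, factor=2):
    """Nearest-neighbor upsampling: repeat each value `factor` times per axis."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

def downsample_stride2(fmap):
    """Crude stride-2 subsampling (stand-in for the stride-2 3x3 conv)."""
    return [row[::2] for row in fmap[::2]]

fmap = [[1, 2],
        [3, 4]]

up = upsample_nearest(fmap)    # 4x4 map
down = downsample_stride2(up)  # back to a 2x2 map
print(up)    # [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
print(down)  # [[1, 2], [3, 4]]
```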

Evolution of SpineNet Architecture from ResNet

  • The scale-permuted model is formed by permuting the blocks of the ResNet architecture.
  • To compare the fully scale-decreased network with the scale-permuted network, a series of intermediate models is generated that gradually shifts the architecture toward the scale-permuted form.
Figure 4: Building a scale-permuted network by permuting ResNet. [Source: [1]]
  • In the above figure, part (a) shows ResNet-50 followed by a Feature Pyramid Network (FPN) output layer.
  • In part (b), 7 blocks are part of ResNet and 10 blocks are used in the scale-permuted network.
  • In part (c), all blocks belong to the scale-permuted network, and in (d) SpineNet-49 is introduced, achieving the highest AP of 40.8% while requiring ~10% fewer FLOPs (85.4B vs. 95.2B).

Proposed SpineNet Architectures

Based on the SpineNet-49 architecture derived in figure 4 (d), four more architectures are constructed in the SpineNet family.

  • SpineNet-49S has the same architecture as SpineNet-49, with the feature dimension uniformly scaled down by a factor of 0.65.
  • SpineNet-96 repeats each block twice, doubling the model size relative to SpineNet-49.
  • SpineNet-143 repeats each block three times, with the scaling factor in the resampling operation kept at 1.0.
  • SpineNet-190 repeats each block four times and raises the resampling scaling factor to 1.3 to further scale up the feature dimension.
Figure 5: Increase model depth by block repeat. From left to right: blocks in SpineNet-49, SpineNet-96, and SpineNet-143. [Source: [1]]
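The family above can be summarized in a small config table. This is a sketch built only from the scaling rules just described; the field names are my own, and `None` marks values the text does not state:

```python
# Hypothetical config table for the SpineNet family:
# repeats multiplies depth, feature_scale scales feature dimensions,
# resample_alpha is the scaling factor in the resampling operation.
SPINENET_FAMILY = {
    "SpineNet-49S": {"repeats": 1, "feature_scale": 0.65, "resample_alpha": None},
    "SpineNet-49":  {"repeats": 1, "feature_scale": 1.0,  "resample_alpha": None},
    "SpineNet-96":  {"repeats": 2, "feature_scale": 1.0,  "resample_alpha": None},
    "SpineNet-143": {"repeats": 3, "feature_scale": 1.0,  "resample_alpha": 1.0},
    "SpineNet-190": {"repeats": 4, "feature_scale": 1.0,  "resample_alpha": 1.3},
}

# Depth relative to SpineNet-49 follows directly from the repeat factor.
print(SPINENET_FAMILY["SpineNet-96"]["repeats"])  # 2: twice as deep
```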

Comparative Results

The experiments are conducted on object detection as well as image classification to demonstrate the versatility of the proposed architecture.

Object Detection

For object detection, the ResNet-FPN backbone in the RetinaNet detector is replaced with SpineNet. The model is trained on the COCO train2017 split and evaluated on the test-dev set.

  • The following results (Figure 6) demonstrate that SpineNet models outperform other popular detectors by large margins; the largest model, SpineNet-190, achieves the highest AP of 52.1%. In general, SpineNet architectures require fewer FLOPs and fewer parameters, making the models computationally less expensive.
Figure 6: One-stage object detection results on COCO test-dev. Different backbones with RetinaNet are employed on single model. By default, training is done using multi-scale training and ReLU activation for all models in this table. Models marked by dagger (†) are trained by applying stochastic depth and swish activation for a longer training schedule. [Source: [1]]
  • The following results (Figure 7) on COCO val2017 show that SpineNet-49 requires ~10% fewer FLOPs while improving AP to 40.8%, compared with 37.8% for R50-FPN.
Figure 7: Results comparisons between R50-FPN and scale-permuted models on COCO val2017. [Source: [1]]
  • RetinaNet models adopting SpineNet backbones achieve higher AP scores with considerably fewer FLOPs than those with ResNet-FPN and NAS-FPN backbones (Figure 8).
Figure 8: The comparison of RetinaNet models adopting SpineNet, ResNet-FPN, and NAS-FPN backbones. [Source: [1]]

Image Classification

SpineNet is trained on two datasets, ImageNet ILSVRC-2012 and iNaturalist-2017, for the task of image classification.

  • On ImageNet, the Top-1 and Top-5 accuracy are on par with ResNet, while the number of FLOPs is considerably reduced.
  • On iNaturalist, SpineNet outperforms ResNet by a large margin of about 5%, again with a reduction in FLOPs.
Figure 9: Image classification results on ImageNet and iNaturalist. [Source: [1]]

The above results demonstrate that SpineNet not only works well for object detection but is also versatile enough for other visual learning tasks such as image classification.

Importance of Scale-Permutation and Cross-Scale Connections

According to [1], two popular encoder-decoder architecture shapes, Fish and Hourglass, are chosen for comparison with the proposed R0-SP53 model. Cross-scale connections in all models are learned using NAS.

Scale-Permutation

  • The insight is that jointly learning scale permutations and cross-scale connections (R0-SP53) is more beneficial than learning only the connections on a fixed architecture with fixed block orderings (Hourglass and Fish).
  • The proposed R0-SP53 model achieves the higher AP score of 40.7%.
Figure 10: Importance of learned scale permutation [Source: [1]]

Cross-Scale Connections

  • The importance of cross-scale connections is studied via graph damage.
  • The cross-scale connections are damaged in three ways: (1) removing short-range connections, (2) removing long-range connections, and (3) removing both.
  • Results show that the AP score drops severely in cases (2) and (3). Long-range connections can effectively handle frequent resolution changes, so damaging them hurts overall accuracy the most.
Figure 11: Importance of learned cross-scale connections [Source: [1]]
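A toy version of the graph-damage study can clarify the setup. The connection list and the short/long threshold below are illustrative, not taken from the paper:

```python
# Each connection is (parent_level, target_level); call a connection
# "long-range" when the scale gap exceeds one level.
connections = [(2, 3), (3, 3), (2, 5), (4, 2), (3, 4), (5, 2)]

def is_long_range(conn, threshold=1):
    parent, target = conn
    return abs(parent - target) > threshold

def damage(conns, remove="long"):
    """Return the connections that survive a given damage mode."""
    if remove == "short":
        return [c for c in conns if is_long_range(c)]
    if remove == "long":
        return [c for c in conns if not is_long_range(c)]
    if remove == "both":
        return []
    return list(conns)

print(damage(connections, "long"))   # [(2, 3), (3, 3), (3, 4)]
print(damage(connections, "short"))  # [(2, 5), (4, 2), (5, 2)]
```

Damaging the long-range edges removes exactly the connections that bridge large resolution changes, which is why cases (2) and (3) hurt AP the most.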

For detailed implementation and experimentation, refer to section 5 of [1].

Final Insights

  • In [1], a new meta-architecture, the scale-permuted model, is proposed to effectively solve simultaneous object recognition and localization, a task that scale-decreased backbones handle poorly.
  • Neural Architecture Search (NAS) is used to obtain the SpineNet-49 architecture; by increasing the model depth, four more robust architectures are derived.
  • SpineNet is evaluated for the object detection task using the COCO test-dev set and it achieves a 52.1% AP which is higher than existing state-of-the-art detectors.
  • For image classification, SpineNet achieves comparable Top-1 accuracy on ImageNet and improved Top-1 accuracy on iNaturalist.
  • In summary, higher accuracy is achieved with less compute and approximately the same number of parameters by using the new architecture.

References

[1] X. Du, T. Lin, P. Jin, G. Ghiasi, M. Tan, Y. Cui, Q. V. Le, and X. Song. “SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization.” IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020.

[2] https://www.youtube.com/watch?v=qFRfnIRMNlk&feature=youtu.be

[3] https://towardsdatascience.com/neural-architecture-search-nas-the-future-of-deep-learning-c99356351136
