RegNet, or how to methodically design effective networks.

Chris Ha
Published in Analytics Vidhya · 15 min read · Apr 5, 2020

Designing Network Design Spaces

A new SoTA network has been published, but in a surprisingly understated way. That is understandable, since it is introduced in a paper that aims to further a more powerful idea. If one were to take away only “new SoTA” from this paper, it would be a great loss.

The paper in question is Designing Network Design Spaces(link), published by Facebook AI Research (FAIR). The authors aim to further the idea of “design space design”. Their goal was not to discover any particular network, or even a certain family of networks. As they put it, they sought “to discover general design principles that describe networks that are simple, work well, and generalize across settings”.

Design Space Design

our focus is on exploring network structure (e.g., width, depth, groups, etc.) assuming standard model families including VGG, ResNet, and ResNeXt

From a simple, unconstrained ResNet-like base, which they call AnyNet, they conduct population-based experiments to arrive at a design space they refer to as RegNet. I believe it is paramount to understand and explore how they conducted their experiments, as that is arguably a more important takeaway from this paper than the RegNet design space they arrived at.

Although their RegNet design space was aimed at a low-compute, low-epoch regime, they show that it generalizes to other regimes too. In the end, the RegNet design space yields models that are comparable to the previous SoTA EfficientNet models across the board.

However, I would like to emphasize that they have clearly left a lot of potential on the table for the sake of brevity and fair comparison. Their aim, as far as I can tell, was not to simply introduce another SoTA model. Rather, it was to push forward the underlying principles that enabled their discovery.

Rather than designing or searching for a single best model under specific settings, we study the behavior of populations of models.

Their approach is to take a design space, sample models from it, analyze them to understand what works better or worse, and then develop a new design space from that intuition.

The AnyNet Design space

The initial design space from which they begin their process is AnyNet. Fundamentally, it is a very simplified and unconstrained ResNet. It has a Stem where the image is input, the Body where the bulk of computation is done, and the Head where the classification is done. The body has 4 stages, and each stage has a varying number of identical blocks (save for certain blocks with stride 2 functioning as the “downsampling” block).

Network Structure.

Each block has 3 parameters: the width w, bottleneck ratio b, and group width g. The resolution r is fixed at 224.

The basic block of AnyNetX, the X block. (b) depicts the “downsampling” block

While the general structure is simple, the total number of possible networks in the AnyNet design space is vast.

The paper does a good job summarizing their parameter space:

“The AnyNetX design space has 16 degrees of freedom as each network consists of 4 stages and each stage i has 4 parameters: the number of blocks di, block width wi, bottleneck ratio bi, and group width gi. We fix the input resolution r = 224 unless otherwise noted. To obtain valid models, we perform log-uniform sampling of di ≤ 16, wi ≤ 1024 and divisible by 8, bi ∈ {1, 2, 4}, and gi ∈ {1, 2, . . . , 32}. We repeat the sampling until we obtain n = 500 models in our target complexity regime (360MF to 400MF), and train each model for 10 epochs.”

To obtain a distribution of models, we sample and train n models from a design space. For efficiency, we primarily do so in a low-compute, low-epoch training regime. In particular, in this section we use the 400 million flop (400MF) regime and train each sampled model for 10 epochs on the ImageNet dataset. We note that while we train many models, each training run is fast: training 100 models at 400MF for 10 epochs is roughly equivalent in flops to training a single ResNet-50 model at 4GF for 100 epochs (emphasis mine)
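To make the sampling procedure concrete, here is a small sketch of how one might draw AnyNetX configurations as described in the quote above. The helper names are my own, and exactly how b and g are drawn (plain uniform choices here, versus the paper's log-uniform scheme) is a simplification:

```python
import math
import random

def sample_anynetx(rng):
    """Sample one AnyNetX configuration: 16 degrees of freedom,
    4 stages x 4 parameters (d, w, b, g), per the paper's ranges."""
    def log_uniform_int(lo, hi):
        # Sample uniformly in log space, then round to an integer.
        return int(round(math.exp(rng.uniform(math.log(lo), math.log(hi)))))

    stages = []
    for _ in range(4):
        d = log_uniform_int(1, 16)        # blocks per stage, d_i <= 16
        w = 8 * log_uniform_int(1, 128)   # width w_i <= 1024, divisible by 8
        b = rng.choice([1, 2, 4])         # bottleneck ratio b_i (uniform: my assumption)
        g = rng.randint(1, 32)            # group width g_i (uniform: my assumption)
        stages.append(dict(d=d, w=w, b=b, g=g))
    return stages
```

The paper then rejects samples outside the 360MF–400MF complexity regime and repeats until 500 valid models are collected; that rejection step is omitted here.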

The emphasized portions above, where they describe their population-based methodology, are one of the key contributions of this paper. Note that the 400MF regime roughly corresponds to EfficientNet-B0.

Rather than searching for the single best model out of these ∼10^18 configurations, we explore whether there are general design principles that can help us understand and refine this design space.

Their fundamental aims in approaching the design are as follows:

  1. to simplify the structure of the design space,
  2. to improve the interpretability of the design space,
  3. to improve or maintain the design space quality,
  4. to maintain model diversity in the design space

This shows that they pursued not only a design space that is effective and efficient, but one that would be understandable.

The aforementioned unconstrained design space is designated AnyNetXA. They sample from this space, analyze the population, design a new space, and repeat.
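The statistical tool the paper uses for these population comparisons is the error empirical distribution function (EDF): sample n models from a space, train each briefly, and plot the fraction of models at or below each error threshold. A minimal sketch (toy numbers, not from the paper):

```python
def error_edf(errors):
    """Empirical distribution function of model errors: returns sorted
    (error, fraction of models with error <= that value) pairs.
    A design space whose EDF rises earlier and steeper contains a
    higher proportion of good models."""
    xs = sorted(errors)
    n = len(xs)
    return [(e, (i + 1) / n) for i, e in enumerate(xs)]

# Toy comparison: space B's errors are shifted lower, so its EDF dominates.
space_a = [42.0, 40.5, 44.1, 39.9, 43.2]
space_b = [39.0, 38.2, 41.0, 37.5, 40.1]
```

Comparing two design spaces then reduces to comparing two curves rather than two cherry-picked best models.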

Note that most canonical ResNe(X)ts are not included in AnyNetX. This is due, at least, to differences in the stem layer and the intervening maxpool layer between the stem and body.

Plotting against a specific parameter can provide insights into effective value ranges.

In AnyNetXB, they constrain the bottleneck ratio to be the same across all stages. The population sampled under this constraint shows effectively the same distribution as AnyNetXA. Although this does not improve any objective metric per se, it simplifies the design space without any harm.

In AnyNetXC, they additionally clamp the group parameter across stages: each stage shares the same group width (to understand group widths further, refer to the cardinality discussion in Aggregated Residual Transformations for Deep Neural Networks).
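A note on conventions, since it trips people up: g here is the group *width* (channels per group), not the number of groups. A tiny helper (naming is mine) makes the relationship explicit:

```python
def num_groups(width, group_width):
    """In the AnyNet/RegNet convention, g is the width of each group
    (channels per group), so a convolution of the given width runs
    width // g groups. g = 1 recovers a depthwise convolution;
    g = width recovers a dense (ungrouped) convolution."""
    assert width % group_width == 0, "width must be divisible by group width"
    return width // group_width
```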


In the course of testing the B and C variants, they discover that b = 1 (effectively no bottleneck expansion) and a group width of 2 or greater yield the best results.

Those familiar with ResNets will remember that the bottleneck ratio and expansion existed to increase channel width without unduly increasing FLOPs. It is also worth noting that this required 1x1 convolutions before and after the 3x3 convolution to match dimensions, and those 1x1 convolutions increased memory access costs while reducing FLOPs. Thus, this “bottleneck” block design also caused networks based on it to become bottlenecked (pun not intended) on GPU-based accelerators.

MNAS-derived networks, including MobileNets and EfficientNets, extensively use depthwise convolutions to achieve SoTA performance. These can be understood as group convolutions with a group width of 1. The finding from the AnyNetXC populations that g > 1 is best does not conflict with this. That such networks can and do perform excellently is not in question. The paper shows empirically, with statistical backing, that as a design space, g = 1 might best be avoided, even though the MNAS search has found particular instances of good-performing models to build upon.

And to reiterate, it is the statistical tools and population analyses that we should be focusing on.

Above is an example of a good network with increasing width. Below is a bad network. Notice the error values shown as e = %

Next, the authors examine both the good and bad networks from AnyNetXC and discover that good networks have increasing widths. AnyNetXD encapsulates this principle.

Through further testing, they conclude that increasing depth, as well as width, is an important aspect of good networks. This is integrated as AnyNetXE.

Interestingly, the increasing-depth condition is not necessarily required for the last stage. Remember that canonical ResNe(X)t models are designed with 3 blocks in the 4th stage even as they scale across the 50-, 101-, and 152-layer depths.

The RegNet Design Space

To summarize, the AnyNetXE design space is as follows:

  • a very basic ResNet-like skeleton consisting of a simple Stem, Body, and Head, with 4 stages of varying numbers of identical blocks
  • (AnyNetXB) the same bottleneck ratio across stages, preferably b = 1
  • (AnyNetXC) the same group width across stages, preferably g > 1
  • (AnyNetXD,E) progressively increasing channel width and block depth, with the possible exception of the last stage

From this foundation, the authors identify and analyze the best performing models.

Emphasis mine. Top right shows two models with the best linear fit

They find that it is possible to fit a linear function that explains and predicts the best models within this space.

According to the linear fit they found, a network structure is specified via 6 parameters: d (depth), w0 (initial width), wa (slope), wm (width multiplier), b, and g (the latter two being the bottleneck ratio and group width, respectively). These parameters are plugged into the following equations to generate block widths and depths.

A different block width u_j is computed for each block j < d via u_j = w0 + wa · j (w0: initial width, wa: slope, d: depth).
wm is an additional parameter controlling quantization: for each block j, s_j is computed such that u_j = w0 · wm^(s_j).
The quantized per-block widths w_j are then obtained via w_j = w0 · wm^(round(s_j)).
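Putting the equations above into code, here is a sketch of the width-generation rule as I read it. Rounding the final widths to multiples of 8 follows the reference pycls implementation and is an assumption on my part:

```python
import math

def regnet_widths(d, w0, wa, wm):
    """Generate quantized per-block widths from RegNet's linear rule:
    the continuous width u_j = w0 + wa * j is quantized by solving
    u_j = w0 * wm**s_j for s_j, rounding s_j to an integer, and
    recomputing the width from the rounded exponent."""
    widths = []
    for j in range(d):
        u = w0 + wa * j                       # continuous width for block j
        s = round(math.log(u / w0, wm))       # quantize the exponent
        w = w0 * wm ** s                      # quantized width
        widths.append(int(round(w / 8)) * 8)  # snap to a multiple of 8 (assumption)
    return widths
```

For example, parameters in the ballpark of RegNetX-200MF (d = 13, w0 = 24, wa ≈ 36.44, wm ≈ 2.49) produce a non-decreasing width sequence; the runs of identical widths define the per-stage widths and depths.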

The resulting design space is referred to as RegNet.

As the above table shows, RegNet is not a single network, nor even a scaled family of networks like the EfficientNets. It is a design space, restricted by a quantized linear rule, that is expected to contain good models.

EDFs of the respective populations for AnyNetXA, AnyNetXE, and RegNetX with various additional restrictions. (Right) random search efficiency is much higher for RegNetX

Further restrictions were tested, including wm = 2 and w0 = wa. Although these additional conditions can improve performance, the authors decided not to incorporate them into the RegNetX design space. The intent, as far as I can tell, is to let readers apply their own restrictions according to their own requirements, hopefully in a similarly structured and disciplined manner.

It is also discovered that the design space defined by RegNetX is a good candidate for network architecture search. This shows that design space engineering is not mutually exclusive with network architecture search methods; one could imagine a pipeline incorporating both methodologies.

Design space Generalization

The AnyNet and subsequent RegNet design spaces were geared toward low-compute, low-epoch training regimes. This goal is useful in and of itself, and for the population-based methodologies shown in the paper. Remarkably, the authors show that their design space also generalizes to larger, more complex models.

Emphasis mine. Design space overfitting in this context would imply a decay in performance as a result of a specific constraint on a design space.

The EDFs above show that RegNetX generalizes to regimes outside of its initial design target.

Analyzing the RegNetX Design Space

The analyses by the authors expose several trends backed by empirical data that do not match popular design choices. The specific trends and data are well described in the paper and are summarized below.

  • the depth of the best models is stable across regimes, with an optimal depth of ∼20 blocks (60 layers, since each block has 3 layers: 1x1, 3x3, and 1x1 convolutions, in that order)
  • the best models use a bottleneck ratio b of 1.0, which effectively removes the bottleneck. Although the authors do not dwell on this, I believe it is an especially important point. The whole point of the three-layer bottleneck block (with its 1x1 and 3x3 convolutions) was to introduce channel width expansion (at least in the original paper). If that expansion is unnecessary or harmful, it would be valuable to revisit the fundamental block structure itself. Maybe a new design space consisting of “basic blocks” with two consecutive 3x3 convolutions would yield different, more GPU-friendly models. However, such design space changes would warrant extra testing.
  • the width multiplier of good models is shown to be ∼2.5, which closely reflects the design choice of established models that tend to double width per stage. The difference is significant, though: compounded over the stage transitions, a multiplier of 2.5 versus 2 leads to nearly a twofold difference in width by the last (4th) stage
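As a quick sanity check on that last point, compare stage widths under the two multipliers (toy starting width, my own numbers):

```python
def stage_widths(w1, wm, stages=4):
    """Stage widths starting from a stage-1 width w1, multiplying the
    width by wm at each stage transition (rounded to integers)."""
    return [int(round(w1 * wm ** i)) for i in range(stages)]

doubling = stage_widths(32, 2.0)  # classic per-stage doubling
regnet = stage_widths(32, 2.5)    # RegNet's empirically best multiplier
```

Over the three stage transitions the ratio compounds to (2.5/2)^3 ≈ 1.95, so the final-stage widths differ by nearly a factor of two.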

Again, I would like to point out that, as valuable as the insights themselves are, it is important to understand the constraints under which they are valid and the methodology that led to them.

One interesting avenue of analysis the authors pursue is complexity analysis. It has been common for architecture papers to study and publish statistics on parameters and FLOPs, and more recent papers include inference times on select hardware. Here, the authors also investigate network activations, which can heavily affect runtime on memory-bound hardware such as GPUs. Other studies and field experience show that while networks that extensively utilize depthwise separable convolutions and inverted bottlenecks may reduce parameter count and FLOPs significantly, they increase memory access cost significantly. The authors conclude that the best models are those whose activations increase with the square root of flops and linearly with parameters.
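To see why activations are worth tracking separately from parameters and FLOPs, here is a rough per-layer accounting sketch (my own simplification: it counts only multiply-adds and output elements, ignoring bias, padding, and batch norm):

```python
def conv_complexity(w_in, w_out, k, stride, r, groups=1):
    """Rough complexity of a k x k convolution on an r x r feature map:
    returns (params, flops, activations). Activations here means the
    number of output elements written to memory, the quantity the paper
    tracks as a proxy for memory cost on GPUs."""
    r_out = r // stride
    params = (w_in // groups) * w_out * k * k
    flops = params * r_out * r_out   # one multiply-add per weight per position
    acts = w_out * r_out * r_out     # output tensor size
    return params, flops, acts

# A depthwise 3x3 conv (one channel per group) versus a dense 3x3 conv
# at the same width and resolution:
dense = conv_complexity(128, 128, 3, 1, 28)
depthwise = conv_complexity(128, 128, 3, 1, 28, groups=128)
```

The depthwise variant cuts parameters and FLOPs by a factor of the width (128x here), but writes exactly the same number of activations, which is why it does not help on memory-bound hardware.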

Following these insights, the RegNetX design space is further constrained with the conditions listed below.

  • bottleneck ratio (b) = 1, depth (d) ≤ 40, width multiplier (wm) ≥ 2
  • parameters and activations are limited accordingly

As RegNetX was initially designed for the low-flop, low-epoch regime, the additional constraints ensure that its models are also low in memory cost while maintaining high accuracy across all flop regimes.
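The extra constraints are simple enough to express as a predicate over candidate configurations (a sketch with hypothetical field names; the per-regime parameter and activation budgets are omitted):

```python
def in_constrained_regnetx(cfg):
    """Check the additional RegNetX constraints listed above:
    bottleneck ratio b = 1, depth d <= 40, width multiplier wm >= 2.
    The per-flop-regime parameter/activation limits are not modeled here."""
    return cfg["b"] == 1 and cfg["d"] <= 40 and cfg["wm"] >= 2.0
```

In a sampling loop, such a predicate simply rejects candidates before training, shrinking the space without changing the methodology.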

The constrained variant limits the bottleneck ratio, depth, width multiplier, parameters, and activations

Alternate design choices were also fully investigated. A bottleneck ratio below 1 (b < 1) effectively signifies an inverted bottleneck. While this condition is fundamentally outside the original AnyNetX space (where b is 1, 2, or 4), it is a very common choice in many current SoTA models (such as MobileNet or EfficientNet). Also incorporated in such models is the depthwise convolution, where g = 1. Although this case has already been shown here to be inferior, they conduct population analyses to understand the EDFs further. At least within the RegNetX design space (which, if b < 1 were allowed, would be expansive enough to roughly incorporate MNASNet-like models, but not quite), both choices are shown to yield inferior EDFs. In other words, selecting models from the RegNetX space under such constraints harms the distribution of model performances.

It is important to fully and properly understand the implication. MNAS-based networks such as MobileNets and EfficientNets ARE indeed strong and powerful models, with various strengths and certain drawbacks. Again, this has already been shown and is not in question. MNAS is a powerful network architecture search tool, and the architects involved used an effective tool to search for specific effective models and succeeded, in those instances. The research laid out in this paper shows that, as a design space, there is a superior approach, and the RegNetX design space as a whole may be more fruitful for finding effective models.

It is very easy to imagine researchers using MNAS or other network architecture search methods on the RegNetX search space to find models that outperform both previous MNAS-based models and the RegNetX models.

This paper, however, does in fact directly contradict research from the paper that introduced EfficientNet. In “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks”, the authors conclude that a principled approach to model scaling should compound width, depth, and resolution together.

From https://arxiv.org/abs/1905.11946

This paper, however, shows that even for higher flop regimes, the fixed resolution of 224 x 224 is best. I argue that this difference comes from different design constraints: within different design spaces, it is likely that different constraints produce different scaling.

EDF for various networks and resolutions (y axis shows cumulative error probability)

Comparison to Existing Networks

To compare models from the design space against existing networks, the authors pick the best model for each FLOP regime out of 25 random settings, then retrain the top models 5 times each at 100 epochs.

In the past few years, network training and regularization schemes have advanced by leaps and bounds. Many SoTA papers incorporated various enhancements and longer schedules in lieu of actual architecture improvements. For a fair comparison, the authors control the training scheme as closely as possible.

I would argue that they are actually competing with one arm tied behind their back, since they report the other models' results according to their original papers while training the RegNet models with only a basic 100-epoch schedule, without reasonable and modern training-time improvements (including, but not limited to, Cutout, DropPath, and AutoAugment).

For comparison, ResNe(X)t-50 is in the ~4.0GF range.
RegNetY incorporates the Squeeze-and-Excitation block. EfficientNet-B4 is in the ~4.0GF range

Mobile Regime

In the mobile regime, up to ~600MF, the comparable networks are MobileNet, ShuffleNet, and certain other NAS-derived models. RegNets show SoTA-like results without even trying too hard.

Standard Baseline Comparisons

Another family of important baseline models that is still relevant, and that was also important to the inception of the AnyNet design space in the first place, is the ResNe(X)t family. Although the paper argues that they share the exact same design space, I argue this isn't strictly true: as you can see below, the stem is different. Surprisingly, they do share a common flaw. The “downsampling” block has a residual connection with a kernel size of 1 and stride of 2 in both the original ResNet and the AnyNet block design. This is often cited as a flaw, as it discards a full 3/4 of the input space. Many modern ResNet implementations (such as ResNet-D, Assembled ResNet, etc.) rectify this by incorporating a pooling layer.

The fact that these SoTA models were born from the same faulty design space is mind boggling to me.

AnyNet: the stem is one 3x3 convolution with stride 2
ResNet: the stem is one 7x7 convolution with stride 2 followed by a 3x3 maxpool with stride 2. Notice the “faulty” downsampling block with a kernel size of 1 and stride of 2. Source: https://arxiv.org/pdf/1812.01187.pdf
“Faulty” downsampling block that only reflects 1/4 of the input space

No matter how you cut it, whether by activations or flops, RegNetXs come out on top. It is also worth noting that in this case, the authors do NOT report ResNe(X)t performance according to the original papers. Using the same training as the RegNetXs, ResNe(X)ts actually improve over their original papers (where they were trained for 90 epochs, versus 100 here).

State of the Art comparison : Full regime

The current state of the art in image classification, across ALL regimes, is the EfficientNet family. Improving upon MobileNet-V3, the researchers from Google designed (a) a highly performant baseline, EfficientNet-B0, and (b) an effective scaling strategy to reach all FLOP regimes up to EfficientNet-B7. (This consistent scaling strategy was later employed by other researchers to develop EfficientNet-B8, a natural evolution of B7.) These models are highly effective in many use cases and have been the backbone of many projects within Google (Noisy Student, EfficientDet, etc.) and elsewhere.

In my eyes, their main drawback is that they are hard to train on GPUs. As noted several times before, they extensively utilize depthwise convolutions and inverted bottlenecks, which make them hard to train and infer with high throughput on GPUs. This does not seem to be a problem for Google, which has access to extensive cloud ML systems and cares more about inference time on mobile.

As with the ResNe(X)ts, the authors employ the same vanilla 100-epoch training for both EfficientNets and RegNets in the following comparisons.

RegNetXs lose in the sub-800MF regime but come out on top in the others. And across ALL regimes they are invariably faster to infer and train; by the 8.0GF regime (B5) they are 5 times faster while showing better performance.

Conclusion

I consider the fundamental contributions of this paper to be threefold.

  1. A design space design principle is presented.
  2. An effective design space, designed according to those principles, is introduced (RegNet).
  3. A family of SoTA networks (RegNetX and RegNetY) is introduced.

It is very obvious that they wish for the world to focus on numbers 1 and 2. If others merely take number 3 and run with it, the authors would be very disappointed. Again, please remember the process by which the RegNetX models were selected: from the RegNetX design space (still with more than 10⁶ models to choose from), they took 25 RANDOM models in each FLOP regime, picked the best, and trained it for 100 epochs. It is very likely that even a simple NAS within that space would yield better models, at least for specific uses, with relative ease. Better yet, one could employ the population analysis methods described in this paper to add further design constraints and arrive at an even better design space.

Even if merely taking the latest and greatest is your aim, I hope you take the time and effort to read the paper and hopefully this article.
