Review: FractalNet (Image Classification)

An Ultra Deep Network Without Using Residuals

After the invention of ResNet in 2015 and its numerous championship wins, plenty of researchers worked on improving ResNet, such as Pre-Activation ResNet, RiR, RoR, Stochastic Depth, and WRN. In this story, conversely, a non-residual-network approach, FractalNet, is briefly reviewed. While VGGNet starts to degrade when going from 16 layers (VGG-16) to 19 layers (VGG-19), FractalNet can go up to 40 or even 80 layers. It was published in 2017 ICLR with more than 100 citations. (Sik-Ho Tsang @ Medium)

What Are Covered

  1. FractalNet Architecture
  2. Drop-Path as Regularization
  3. Ablation Study
  4. Results

1. FractalNet Architecture

Fractal Architecture: A Simple Fractal Expansion (Left), Recursively Stacking of Fractal Expansion as One Block (Middle), 5 Blocks Cascaded as FractalNet (Right)

For the base case, f1(z) is a single convolutional layer:

f1(z) = conv(z)

After that, the recursive fractal expansion is:

f(C+1)(z) = [fC ∘ fC](z) ⊕ conv(z)

where ∘ denotes composition, ⊕ denotes the join layer, and C is the number of columns, as in the middle of the figure. The deepest path within a block has 2^(C−1) convolutional layers. In this case, C=4, so the deepest path has 2³=8 layers.
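The recursive expansion can be made concrete with a toy sketch. Here `conv` is a hypothetical stand-in (it just adds 1.0 to every element, instead of a real convolution) so that the recursion and the join-by-mean are easy to trace; none of these helper names come from the paper's code.

```python
# Toy sketch of the fractal expansion: f_{c+1}(z) = join(conv(z), f_c(f_c(z))),
# with join computed as the element-wise mean, as in the paper.

def conv(z):
    """Hypothetical stand-in for a conv layer: adds 1.0 to every element."""
    return [x + 1.0 for x in z]

def join(a, b):
    """Join layer: element-wise mean of the incoming paths."""
    return [(x + y) / 2.0 for x, y in zip(a, b)]

def fractal(z, c):
    if c == 1:
        return conv(z)                            # base case: f1 is one conv
    shallow = conv(z)                             # the new, shortest column
    deep = fractal(fractal(z, c - 1), c - 1)      # two stacked copies of f_{c-1}
    return join(shallow, deep)

print(fractal([0.0, 0.0], 4))  # deepest path applies 8 "convs" before joining
```

Tracing the recursion shows how the shortest column (one conv) is averaged with ever-deeper compositions at each join.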

For the join layer (green), the element-wise mean is computed; it is neither concatenation nor addition.

With 5 blocks (B=5) cascaded as FractalNet, as on the right of the figure, the deepest path through the whole network has B×2^(C−1), i.e. 5×2³=40 convolutional layers.
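The depth arithmetic above can be checked with a few lines of Python (hypothetical helper names, not from the paper's code):

```python
# Sketch: count conv layers along the deepest path of a FractalNet.

def block_depth(C: int) -> int:
    """Deepest path within one block: doubles with every added column."""
    return 1 if C == 1 else 2 * block_depth(C - 1)

def network_depth(B: int, C: int) -> int:
    """Deepest path through B cascaded blocks."""
    return B * block_depth(C)

print(block_depth(4))       # 2^(4-1) = 8
print(network_depth(5, 4))  # 5 * 8 = 40
```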

Between every two blocks, 2×2 max pooling reduces the size of the feature maps. Batch Norm and ReLU are used after each convolution.

2. Drop-Path as Regularization

Iterations of Alternative Local and Global Drop-Path

There are local and global drop-paths:

  • Local: A join drops each input with fixed probability, but guarantees that at least one input survives.
  • Global: A single path is selected for the entire network.

By dropping paths randomly, noise is effectively injected into the network, which acts as regularization and reduces overfitting.
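The two sampling modes can be sketched as follows. This is a minimal illustration of the sampling rule only, with hypothetical names (`sample_paths`, `keep_prob`); the paper's actual training alternates the two modes between mini-batches.

```python
import random

def sample_paths(n_inputs, mode, keep_prob=0.85, rng=random):
    """Return a keep-mask over the inputs arriving at a join layer."""
    if mode == "global":
        chosen = rng.randrange(n_inputs)          # one surviving path overall
        return [i == chosen for i in range(n_inputs)]
    # local: drop each input independently ...
    keep = [rng.random() < keep_prob for _ in range(n_inputs)]
    if not any(keep):                             # ... but keep at least one
        keep[rng.randrange(n_inputs)] = True
    return keep

print(sample_paths(4, "local", rng=random.Random(0)))
print(sample_paths(4, "global", rng=random.Random(1)))
```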

3. Ablation Study

3.1. Depth

Number of Depths on CIFAR-100++

The network with a depth of 80 yields the best performance, while the network with a depth of 160 suffers from overfitting.

3.2. Effectiveness of Fractal Structure

Effectiveness of Fractal Structure on CIFAR-100++

When a plain network with a depth of 40 is used, overfitting occurs, while the fractal structure still obtains good performance, which shows that FractalNet may be less prone to overfitting.

3.3. Student-Teacher Information Flow

Training Loss Against Epochs for Plain Network (Left) and FractalNet (Right)

For the plain network with 40 layers, convergence is very slow.

But for column #4 of FractalNet, which has the same 40 layers, convergence is much faster. The authors attribute this to help from the other columns, i.e. student-teacher information flow.

4. Results

4.1. CIFAR-10 (C10), CIFAR-100 (C100), SVHN (Street View House Numbers)

CIFAR-10, CIFAR-100, SVHN Results, (+: Data augmentation, ++: Heavy data augmentation)
  • Without any data augmentation, FractalNet obtains the lowest test error except for DenseNet-BC, and it outperforms ResNet by a large margin, especially on C10 and C100.
  • With data augmentation (C10+ and C100+), FractalNet still outperforms almost all ResNet variants except the wide one (WRN).
  • With heavy data augmentation (C10++ and C100++), however, FractalNet does not obtain better results than with standard augmentation (C10+ and C100+).
  • And DenseNet-BC, which was concurrent work at the time, outperforms FractalNet on all tasks. But the main purpose of this paper is the comparison with ResNet, i.e. non-residual vs residual.

4.2. ImageNet

ImageNet Results

FractalNet-34 obtains slightly better results than ResNet-34 C.

When comparing with non-residual networks, FractalNet-34 outperforms VGG-16 by a large margin.