Creating ShuffleNet in TensorFlow

Sumeet Badgujar · Published in Analytics Vidhya · Jul 25, 2021 · 5 min read

ShuffleNet, introduced by Megvii Inc. (Face++), is an extremely computation-efficient CNN architecture designed specifically for mobile devices with very limited computing power (10–150 MFLOPs). The architecture uses two operations to reduce computation cost while maintaining or improving accuracy: pointwise group convolution and channel shuffle. ShuffleNet achieves a lower top-1 error (by an absolute 6.7%) than MobileNet on ImageNet classification.

Channel shuffle is the main highlight of the paper: a new operation that, for a given computation budget, allows more feature map channels, which helps encode more information and makes feature detection more robust.

ShuffleNet paper link — https://arxiv.org/pdf/1707.01083v1.pdf

Now let's have a look at the building blocks of ShuffleNet.

Pointwise Group Convolutions

In tiny networks, expensive pointwise convolutions result in a limited number of channels to meet the complexity constraint, which can significantly hurt accuracy. To address this, a straightforward solution is to apply channel-sparse connections, such as group convolutions, to the 1 × 1 layers as well. By ensuring that each convolution operates only on the corresponding input channel group, group convolution significantly reduces computation cost.

But what is Group Convolution?

Group convolution, introduced in AlexNet, is a type of convolution that splits the input channels into groups, convolves a separate set of filters on each group, and then concatenates the group outputs back together. This sparsifies the connections between input and output channels and lowers both the parameter and computation counts.

Here’s a visual for understanding.

Figure 2 — Grouped convolution with 2 filter groups
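To make the saving concrete, here is a minimal sketch (the layer sizes are hypothetical, and it assumes TensorFlow ≥ 2.3, where Conv2D accepts a groups argument) comparing the parameter count of a dense 1 × 1 convolution with a grouped one:

import tensorflow as tf
from tensorflow.keras.layers import Conv2D, Input

# hypothetical toy feature map: 32 x 32 spatial size, 64 channels
inputs = Input((32, 32, 64))

# standard pointwise convolution: every filter sees all 64 input channels
dense_out = Conv2D(128, kernel_size=1)(inputs)

# grouped pointwise convolution: each filter sees only 64 / 4 = 16 channels
grouped_out = Conv2D(128, kernel_size=1, groups=4)(inputs)

print(tf.keras.Model(inputs, dense_out).count_params())    # 8320 = 64*128 + 128
print(tf.keras.Model(inputs, grouped_out).count_params())  # 2176 = (64/4)*128 + 128

The weight count shrinks by roughly the group factor; only the biases stay the same.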

But it has a drawback: the outputs from a certain group only relate to the inputs within that group. This property blocks information flow between channel groups and weakens representation.

What if we mix the inputs of different groups before applying group convolutions, so that information can flow from every input channel to every output channel? That is exactly what shuffling does.

Channel Shuffle

Figure 3 — Channel Shuffle with Group Convolution (Source: ShuffleNet paper)

Suppose a convolutional layer with g groups whose output has g × n channels. We first reshape the output channel dimension into (g, n), then transpose it and flatten it back as the input of the next layer.

How to understand channel shuffle? Here’s a diagram.

Figure 4 — Channel Shuffle

For groups = 3 and an RGB input, each of the three channels is split so that the original RGB image is represented by three smaller RGB-like groups. The first split of the red channel becomes the red channel of the first group, the second split becomes the red channel of the second group, and the third split becomes the red channel of the third group.
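As a minimal sketch of the index permutation described above (plain NumPy, with hypothetical sizes g = 3 and n = 4), the reshape, transpose, and flatten steps interleave channels from every group:

import numpy as np

groups, n = 3, 4                    # 3 groups, 4 channels per group
channels = np.arange(groups * n)    # channel indices 0..11, groups laid out contiguously
print(channels)                     # [ 0  1  2  3  4  5  6  7  8  9 10 11]

# reshape to (g, n), transpose to (n, g), flatten back
shuffled = channels.reshape(groups, n).T.reshape(-1)
print(shuffled)                     # [ 0  4  8  1  5  9  2  6 10  3  7 11]

After the shuffle, each group seen by the next grouped convolution receives channels coming from all of the previous groups.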

ShuffleNet Unit

Figure 5 — ShuffleNet Units (Source: Original ShuffleNet Paper)

Part (a) is the original bottleneck unit with depthwise convolution. Part (b) is the new ShuffleNet bottleneck unit with pointwise group convolution and channel shuffle. Part (c) is the ShuffleNet unit with stride = 2.

Now let's look at the Python code.

First, the channel shuffle. We reshape the channel dimension into groups, use Permute to swap the group axis with the per-group channel axis, and then reshape the result back into the original format.

from tensorflow.keras.layers import Reshape, Permute

def channel_shuffle(x, groups):
    # input shape is (batch, height, width, channels) with channels_last data format
    _, height, width, channels = x.get_shape().as_list()
    group_ch = channels // groups
    # split the channel axis into (channels_per_group, groups), swap the two, flatten back
    x = Reshape([height, width, group_ch, groups])(x)
    x = Permute([1, 2, 4, 3])(x)
    x = Reshape([height, width, channels])(x)
    return x
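A quick sanity check (with a hypothetical input shape) shows that the operation only permutes channels and leaves the tensor shape untouched:

from tensorflow.keras.layers import Input

inp = Input((28, 28, 12))                    # hypothetical 28x28 feature map with 12 channels
print(channel_shuffle(inp, groups=3).shape)  # (None, 28, 28, 12)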

Now the code for the ShuffleNet unit. If stride = 2, the bottleneck unit's input and output are concatenated; if stride = 1, the Add function is used instead.

from tensorflow.keras.layers import (Conv2D, DepthwiseConv2D, BatchNormalization,
                                     ReLU, Add, AvgPool2D, concatenate)

def shuffle_unit(x, groups, channels, strides):
    y = x
    # 1x1 pointwise group convolution (bottleneck), followed by channel shuffle
    x = Conv2D(channels // 4, kernel_size=1, strides=(1, 1), padding='same', groups=groups)(x)
    x = BatchNormalization()(x)
    x = ReLU()(x)
    x = channel_shuffle(x, groups)
    # 3x3 depthwise convolution (no ReLU after it, as in the paper)
    x = DepthwiseConv2D(kernel_size=(3, 3), strides=strides, padding='same')(x)
    x = BatchNormalization()(x)
    # when downsampling, leave room for the shortcut channels that get concatenated
    if strides == (2, 2):
        channels = channels - y.shape[-1]
    # second 1x1 pointwise group convolution to restore the channel count
    x = Conv2D(channels, kernel_size=1, strides=(1, 1), padding='same', groups=groups)(x)
    x = BatchNormalization()(x)

    if strides == (1, 1):
        x = Add()([x, y])
    if strides == (2, 2):
        y = AvgPool2D((3, 3), strides=(2, 2), padding='same')(y)
        x = concatenate([x, y])
    x = ReLU()(x)
    return x
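A quick shape check (hypothetical stage-2 sizes, loosely following the g = 2 configuration in the paper) illustrates the two cases: the stride-1 unit keeps the shape through the residual Add, while the stride-2 unit halves the spatial size and concatenates the average-pooled shortcut.

from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model

inp = Input((56, 56, 200))                                  # hypothetical stage-2 feature map
out1 = shuffle_unit(inp, groups=2, channels=200, strides=(1, 1))
out2 = shuffle_unit(inp, groups=2, channels=400, strides=(2, 2))
print(Model(inp, out1).output_shape)   # (None, 56, 56, 200)
print(Model(inp, out2).output_shape)   # (None, 28, 28, 400)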

ShuffleNet Architecture

Figure 6 — ShuffleNet Architecture (Source: Original ShuffleNet Paper)

Three stages of ShuffleNet units are stacked after the initial convolution and max pooling. Each stage begins with a stride-2 unit that halves the spatial dimensions, and the number of output channels is doubled from one stage to the next.

Here’s the complete code for groups=2 —

from tensorflow.keras.layers import Input, MaxPool2D, GlobalAveragePooling2D, Dense
from tensorflow.keras.models import Model

def Shuffle_Net(nclasses, start_channels, input_shape=(224, 224, 3)):
    groups = 2
    input = Input(input_shape)

    x = Conv2D(24, kernel_size=3, strides=(2, 2), padding='same', use_bias=True)(input)
    x = BatchNormalization()(x)
    x = ReLU()(x)
    x = MaxPool2D(pool_size=(3, 3), strides=2, padding='same')(x)

    # three stages: each starts with a stride-2 unit, followed by stride-1 units
    repetitions = [3, 7, 3]
    for i, repetition in enumerate(repetitions):
        channels = start_channels * (2 ** i)
        x = shuffle_unit(x, groups, channels, strides=(2, 2))
        for _ in range(repetition):
            x = shuffle_unit(x, groups, channels, strides=(1, 1))

    x = GlobalAveragePooling2D()(x)
    output = Dense(nclasses, activation='softmax')(x)

    model = Model(input, output)
    return model
Figure 7 — Model Summary
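To reproduce a summary like the one above, a minimal instantiation might look like this (start_channels = 200 follows the paper's g = 2 configuration; nclasses = 1000 assumes ImageNet):

model = Shuffle_Net(nclasses=1000, start_channels=200, input_shape=(224, 224, 3))
model.summary()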

And that’s how we implement ShuffleNet in TensorFlow.

But how effective is channel shuffle?

The authors of the paper compared ShuffleNet with and without channel shuffle.

Figure 8 — Comparison

The table compares the performance of ShuffleNet structures (with the group number set to 3 or 8) with and without channel shuffle. The evaluations are performed under three different complexity scales. It is clear that channel shuffle consistently boosts classification scores across the different settings.

Check out the code on GitHub.
