Fine-Grained Image Classification Using Bilinear Convolutional Neural Networks - TensorFlow 2

Saaketh · Published in Analytics Vidhya · May 17, 2021

This article is a TensorFlow implementation of Bilinear CNNs, published in https://arxiv.org/abs/1504.07889. Please do read the paper to get a better understanding.

Prerequisites: Python, CNNs, Keras, TensorFlow

The code is hosted on https://github.com/tommarvoloriddle/Bilinear-CNN-Tensorflow2.4-implementation

Idea

CNNs such as VGG have been found to perform poorly on fine-grained image recognition, and if the dataset being trained on does not contain images similar to ImageNet, performance is even worse. So the idea is to create a new architecture without losing the capabilities of previous linear SOTA networks.

Intuition

The intuition behind Bilinear CNNs can be understood as simple parallel CNNs, each trying to identify a different feature of the same image. Crudely put, to identify a particular bird species, two parallel CNNs can be used: one would identify the beak and the other the tail. This example only conveys the intuition; in actuality the multiple CNNs identify different features, but these features can be very small edges, not as distinct as a tail or a beak. Running a heat-map analysis on the input will give a better understanding of the B-CNN.

Uses

B-CNNs overcome some of the problems of linear CNNs and improve fine-grained classification, such as classifying birds into 200 species (http://www.vision.caltech.edu/visipedia/CUB-200.html) or classifying retail store products.

Bilinear Convolutional Neural Network

Bilinear CNNs are simple parallel CNNs whose outputs are combined using the matrix outer product (https://en.wikipedia.org/wiki/Outer_product). The outputs from the CNNs are taken before the fully connected (FC) layers.

Fig 1 : B-CNN architecture.
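To make the combination step concrete, here is a toy sketch (not from the paper) of the outer product of two small feature vectors:

import tensorflow as tf

# toy feature vectors from two hypothetical branches
a = tf.constant([1., 2., 3.])   # M = 3 features from branch A
b = tf.constant([4., 5.])       # N = 2 features from branch B

# outer product: every pairwise feature interaction, as an M x N matrix
outer = tf.tensordot(a, b, axes=0)
print(outer.numpy())  # [[ 4.  5.] [ 8. 10.] [12. 15.]]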

Since the overall architecture is a directed acyclic graph, the parameters can be trained by back-propagating the gradients of the classification loss (e.g., cross-entropy). The bilinear form simplifies the gradient computations. If the outputs of the two networks are matrices A and B of size L × M and L × N respectively, then the bilinear feature is x = AᵀB (the transpose of A multiplied by B) of size M × N. Let dl/dx be the gradient of the loss function l with respect to x; then by the chain rule of gradients we have

Fig 2: Chain rule of the gradients.
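Concretely, the gradient expressions from the paper (this is what Fig 2 shows) are dl/dA = B (dl/dx)ᵀ of size L × M and dl/dB = A (dl/dx) of size L × N, matching the shapes of A and B.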

Flow of gradients in back propagation.

Fig 3: Gradient flow.

Implementation

We will have to define our outer product, L2 normalisation, and signed square root functions to implement the flow shown in Fig 3.

The dot product function assumes input tensors of the same size; the inputs can also be of different sizes, as shown in Fig 3, in which case a few changes around the dot product will be required.

"""
Calculates dot product of x[0] and x[1] for mini_batch

Assuming both have same size and shape

@param
x -> [ (size_minibatch, total_pixels, size_filter), (size_minibatch, total_pixels, size_filter) ]

"""
def dot_product(x):

return keras.backend.batch_dot(x[0], x[1], axes=[1,1]) / x[0].get_shape().as_list()[1]
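As noted above, the two branches may also output different filter counts (M ≠ N); the batch_dot contraction itself still works as long as the spatial dimensions match, though the reshapes later in the model would need adjusting. A hedged sketch with assumed shapes:

import tensorflow as tf
from tensorflow import keras

# assumed shapes: both branches see the same 81 pixels, but 512 vs 256 filters
a = tf.random.normal([8, 81, 512])
b = tf.random.normal([8, 81, 256])

# contracting over the pixel axis gives an (8, 512, 256) bilinear feature
phi = keras.backend.batch_dot(a, b, axes=[1, 1]) / a.get_shape().as_list()[1]
print(phi.shape)  # (8, 512, 256)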

"""
Calculate signed square root

@param
x -> a tensor

"""

def signed_sqrt(x):

return keras.backend.sign(x) * keras.backend.sqrt(keras.backend.abs(x) + 1e-9)

"""
Calculate L2-norm

@param
x -> a tensor

"""

def L2_norm(x, axis=-1):

return keras.backend.l2_normalize(x, axis=axis)
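A minimal end-to-end sketch of the three helpers chained in the order the model uses them (the shapes are assumptions matching the 9 × 9 × 512 VGG16 feature map produced by a 150 × 150 input):

import tensorflow as tf

# two branch outputs of shape (size_minibatch, total_pixels, size_filter)
a = tf.random.normal([8, 81, 512])
b = tf.random.normal([8, 81, 512])

x = dot_product([a, b])              # (8, 512, 512) bilinear feature
x = tf.reshape(x, [8, 512 * 512])    # flatten before normalisation
x = signed_sqrt(x)                   # signed square-root normalisation
x = L2_norm(x)                       # L2 normalisation
print(x.shape)                       # (8, 262144)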

Building Model

The build_model function returns the bilinear model. In this case we take the outputs of two VGG16 networks from their last convolutional layer. For a custom CNN, please check the code in the repo (https://github.com/tommarvoloriddle/Bilinear-CNN-Tensorflow2.4-implementation/blob/main/BILINEAR-Custom.ipynb).

The outputs are reshaped to match the tensor shapes expected by the outer product, which is useful when building a custom CNN.

The outer product, L2 norm, and signed square root are added as Lambda layers to the model; these layers have no trainable weights.

'''
Take the outputs of the last layer of each VGG and feed them into a Lambda
layer that calculates the outer product.

Here both bilinear branches have the same shape.

z -> output shape tuple
x -> output of VGG tensor
y -> copy of x; since we modify x, we use x and y for the outer product.
'''

def build_model():
    # input layer shared by both VGG16 branches
    tensor_input = keras.layers.Input(shape=[150, 150, 3])

    # load the pre-trained models
    model_detector = keras.applications.vgg16.VGG16(
        input_tensor=tensor_input,
        include_top=False,
        weights='imagenet')

    model_detector2 = keras.applications.vgg16.VGG16(
        input_tensor=tensor_input,
        include_top=False,
        weights='imagenet')

    model_detector2 = keras.models.Sequential(layers=model_detector2.layers)

    # rename the layers of the second branch to avoid name collisions
    for layer in model_detector2.layers:
        layer._name = layer.name + "_second"

    model2 = keras.models.Model(inputs=[tensor_input],
                                outputs=[model_detector2.layers[-1].output])

    x = model_detector.layers[17].output
    z = model_detector.layers[17].output_shape
    y = model2.layers[17].output

    model_detector.summary()
    model2.summary()

    # reshape to (batch_size, total_pixels, filter_size)
    x = keras.layers.Reshape([z[1] * z[2], z[-1]])(x)
    y = keras.layers.Reshape([z[1] * z[2], z[-1]])(y)

    # outer product of x, y
    x = keras.layers.Lambda(dot_product)([x, y])

    # reshape to (batch_size, filter_size_vgg_last_layer * filter_size_vgg_last_layer)
    x = keras.layers.Reshape([z[-1] * z[-1]])(x)

    # signed square root
    x = keras.layers.Lambda(signed_sqrt)(x)

    # L2 normalisation
    x = keras.layers.Lambda(L2_norm)(x)

    # FC layer
    initializer = tf.keras.initializers.GlorotNormal()
    x = keras.layers.Dense(units=258,
                           kernel_regularizer=keras.regularizers.l2(0.0),
                           kernel_initializer=initializer)(x)

    tensor_prediction = keras.layers.Activation("softmax")(x)

    model_bilinear = keras.models.Model(inputs=[tensor_input],
                                        outputs=[tensor_prediction])

    # freeze the VGG layers
    for layer in model_detector.layers:
        layer.trainable = False

    sgd = keras.optimizers.SGD(learning_rate=1.0,
                               decay=0.0,
                               momentum=0.9)

    model_bilinear.compile(loss="categorical_crossentropy",
                           optimizer=sgd,
                           metrics=["categorical_accuracy"])

    model_bilinear.summary()

    return model_bilinear
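A quick sanity check of the returned model on a random batch (a sketch; the shapes follow from the 150 × 150 × 3 input and the 258-unit softmax):

import numpy as np

model = build_model()

dummy = np.random.rand(2, 150, 150, 3).astype("float32")
preds = model.predict(dummy)
print(preds.shape)  # (2, 258) -> one softmax score per class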

Model summary

Below is the summary of the custom B-CNN, as the one with two VGG16 networks would take many pages and be confusing.

In this summary we can see that the inputs for layers 101 and 104 are the same; these are our starting points. We take the outputs from layers 103 and 106 into lambda_18 (the outer product).

Fig 4: Model summary of custom B-CNN

Other generic methods like model fitting and prediction are skipped in this article, as it is already very long; however, they are available at https://github.com/tommarvoloriddle/Bilinear-CNN-Tensorflow2.4-implementation.
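For completeness, here is a minimal hedged sketch of fitting the model with a Keras ImageDataGenerator; the directory path, batch size, and epoch count are placeholders, not values from the repo:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# hypothetical data pipeline; 'data/train' is a placeholder path
train_gen = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    "data/train",
    target_size=(150, 150),
    batch_size=32,
    class_mode="categorical")

model = build_model()
model.fit(train_gen, epochs=10)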

[1] Tsung-Yu Lin et al., Bilinear CNNs for Fine-grained Visual Recognition (https://arxiv.org/pdf/1504.07889.pdf)

[2] Ryan, BCNN-keras-clean (https://github.com/ryanfwy/BCNN-keras-clean)
