Stories by Praveen Kumar Rajendran on Medium

UP-DETR: Unsupervised Pre-training for Object Detection with Transformers (A Review)

Praveen Kumar Rajendran — Thu, 23 Sep 2021 14:01:12 GMT

Unsupervised pretraining, to the rescue!

Researchers from SCUT (South China University of Technology) and Tencent WeChat AI in China have proposed UP-DETR, an unsupervised pre-training approach for object detection, which this article explores. It builds on DETR, the object detection approach put forth by Facebook AI.

Inspired by the great success of pre-training transformers in NLP, authors of UP-DETR propose a pretext task named random query patch detection to Unsupervisedly Pre-train DETR (UP-DETR) for object detection.

Before delving into the inner workings of UP-DETR, it’s important to understand what transformers do in deep learning and why they’re needed for computer vision tasks.

1. Attention Is All You Need

In 2017, Vaswani et al. (from Google) proposed a network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. The model achieved superior results on machine translation tasks while being more parallelizable, which allows faster training.

Recurrent neural networks work well at capturing long-term dependencies in sequence-to-sequence tasks such as machine translation, yet they are slow because of their sequential computation and easily suffer from vanishing/exploding gradients.

Transformers do not use any recurrent units, so how do they actually capture long-term dependency patterns, you might wonder! The answer in 1, 2, 3…

“ATTENTION” mechanism.

Source: Link

I can’t tell you to treat the attention mechanism as a black box. For a deep understanding of how the Transformer works, I strongly recommend reading Jay Alammar's article (nicely explained with visual aids).

It is essential to learn the roles of the Query (Q), Key (K) and Value (V) vectors.

For a further understanding of the Attention Is All You Need paper, watch the video.
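To make the roles of Q, K and V a little more concrete, here is a minimal pure-Python sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, as defined in the Attention Is All You Need paper. The toy vectors are made-up numbers, and a real implementation would use batched matrix operations and multiple heads.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """For each query, return the softmax(q . k / sqrt(d_k))-weighted sum of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy example: two tokens with 2-dimensional Q, K and V vectors (made-up numbers).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
attended = scaled_dot_product_attention(Q, K, V)
print(attended)
```

Each query attends most strongly to the key it is most similar to, so the first output leans towards the first value vector and the second output towards the second.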

https://medium.com/media/8219452f400a63ace3780aae70b1bbc0/href

2. Why do you need transformers for vision tasks?

In comparison to RNNs, transformers allow for the modelling of long dependencies between input sequence elements and support parallel processing of sequences. Transformers’ uncomplicated design allows them to process multiple modalities (e.g., images, videos, text, and speech) with similar processing blocks and demonstrates excellent scalability to very large capacity networks and massive datasets. These advantages have resulted in exciting progress on a variety of vision tasks involving Transformer networks. — link

3. DETR (a simple review)

DETR, proposed in 2020, treats object detection as a set prediction problem using a transformer encoder-decoder architecture. It uses a global loss that forces unique predictions via bipartite matching: given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel.

Fig. 2: DETR

DETR is a supervised learning approach that outputs a fixed-size set of N predictions. Here the bipartite matching loss plays a pivotal role in ensuring that a single object is not detected multiple times in a single input image.

It is important to note that this loss function considers the classification loss as well as regression loss of the bounding box.

1. Assume the given input image has 2 labelled ground-truth objects.
2. Assume the total number of predictions (N) made by DETR is 4.

The loss function then encourages the model to produce two predictions with their classes and bounding boxes and two predictions with no class ("no object"). Anything else is penalized.
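The 2-objects, 4-predictions example above can be sketched in a few lines. This is only an illustration under simplifying assumptions: DETR uses the Hungarian algorithm and a box cost that also includes a generalized IoU term, whereas this sketch brute-forces every one-to-one assignment and uses only the class probability plus an L1 box distance. All numbers are made up.

```python
from itertools import permutations

def box_l1(a, b):
    """L1 distance between two (cx, cy, w, h) boxes."""
    return sum(abs(x - y) for x, y in zip(a, b))

def match_cost(pred, target):
    """Lower is better: high probability for the target class, small box distance."""
    return -pred["probs"].get(target["label"], 0.0) + box_l1(pred["box"], target["box"])

def best_assignment(preds, targets):
    """Try every one-to-one assignment of targets to predictions (fine for tiny N)."""
    best_cost, best = float("inf"), None
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[p], t) for p, t in zip(perm, targets))
        if cost < best_cost:
            best_cost, best = cost, perm
    return best  # best[i] is the index of the prediction matched to targets[i]

targets = [
    {"label": "cat", "box": (0.30, 0.30, 0.20, 0.20)},
    {"label": "dog", "box": (0.70, 0.60, 0.30, 0.40)},
]
preds = [
    {"probs": {"cat": 0.90, "dog": 0.05}, "box": (0.31, 0.29, 0.20, 0.22)},
    {"probs": {"cat": 0.10, "dog": 0.80}, "box": (0.68, 0.62, 0.30, 0.38)},
    {"probs": {"cat": 0.60, "dog": 0.10}, "box": (0.35, 0.35, 0.25, 0.25)},
    {"probs": {"cat": 0.05, "dog": 0.05}, "box": (0.50, 0.50, 0.10, 0.10)},
]

matched = best_assignment(preds, targets)
# Predictions left unmatched are trained to output the "no object" class.
unmatched = [i for i in range(len(preds)) if i not in matched]
print(matched, unmatched)
```

Prediction 2 is a second, weaker "cat" guess for the same object. Because each ground-truth object is matched to exactly one prediction, that duplicate ends up in the unmatched list and is pushed towards "no object".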

4. Unsupervised Pretraining

Deep feedforward neural network training can be difficult due to local optima in the objective function and complex models’ proclivity to overfitting. Unsupervised pre-training is the process of starting a discriminative neural network from one that has been trained using an unsupervised criterion, such as a deep belief network or a deep autoencoder. This method can occasionally aid in optimization.

Source: link

The idea is simple and straightforward. Instead of initializing the weights randomly, we pre-train them on a pretext task (usually feature reconstruction in autoencoders) and keep the resulting weights as the starting point. Then we fine-tune the network for the downstream task: it starts from a more favourable region of feature space, so the model learns faster than it would if its weights were initialized randomly.

5. UP-DETR

The main picture begins…

The UP-DETR approach randomly crops patches from the given image and feeds them as queries to the decoder. The model is pre-trained to detect these query patches in the original image. Two critical issues are addressed in pre-training:

Multi-task learning.
Multi-query localization.
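The random cropping of query patches described above can be sketched as follows. This is a minimal illustration, not the authors' code: the size range (10% to 50% of the image side) and the function names are assumptions. The sampled box doubles as the ground truth that the model has to predict for that patch.

```python
import random

def random_query_boxes(num_patches, min_size=0.1, max_size=0.5, rng=None):
    """Sample boxes as normalized (cx, cy, w, h) that lie inside the image."""
    rng = rng or random.Random()
    boxes = []
    for _ in range(num_patches):
        w = rng.uniform(min_size, max_size)
        h = rng.uniform(min_size, max_size)
        # Keep the centre far enough from the border that the box fits.
        cx = rng.uniform(w / 2, 1 - w / 2)
        cy = rng.uniform(h / 2, 1 - h / 2)
        boxes.append((cx, cy, w, h))
    return boxes

def to_pixel_corners(box, image_width, image_height):
    """Convert a normalized (cx, cy, w, h) box to clamped pixel (left, top, right, bottom)."""
    cx, cy, w, h = box
    left = max(0, int((cx - w / 2) * image_width))
    top = max(0, int((cy - h / 2) * image_height))
    right = min(image_width, int((cx + w / 2) * image_width))
    bottom = min(image_height, int((cy + h / 2) * image_height))
    return left, top, right, bottom

query_boxes = random_query_boxes(3, rng=random.Random(0))
for box in query_boxes:
    print(box, to_pixel_corners(box, 640, 480))
```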

UP-DETR argues that even though DETR performs well on object detection tasks, it comes with hurdles in training and optimization: it requires large-scale training data and comparatively long training schedules.

You can infer from the figures below that UP-DETR takes less time to converge and performs better in the long run. It is also evident that DETR performs inadequately on PASCAL VOC [link], which has relatively fewer training images and instances than COCO [link].

This suggests that pre-training transformers is indispensable when training data is insufficient.

Multi-task learning

To put it simply, object detection is a combination of object classification and localization.

To prevent query patch detection from destroying classification features, the authors introduce a frozen pre-trained backbone and patch feature reconstruction, which preserve the feature discrimination of the transformer.

Furthermore, an ablation study demonstrates that freezing the CNN backbone plays an important role in feature discrimination during the pretraining stage.

Multi-query localization

Different object queries concentrate on different position areas and box sizes. A simple single-query pre-training is proposed and expanded to a multi-query version to demonstrate this property.

Object query shuffle and attention mask are introduced to solve the assignment problems between query patches and object queries in multi-query patches.

A Two-stage Attack!

I) Pretraining of transformers in an unsupervised manner.

Source

UP-DETR is pre-trained on the ImageNet training set without any labels. The CNN backbone (ResNet-50) is pre-trained with SwAV.

II) Finetuning

The model is initialized with the pre-trained UP-DETR parameters, and all parameters (including the CNN) are fine-tuned on VOC and COCO with labelled data.

As mentioned before, this stage starts from a favourable feature space, so the model performs well and converges quickly.

The model is fine-tuned with short/long schedule for 150/300 epochs and the learning rate is multiplied by 0.1 at 100/200 epochs, respectively.
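The two fine-tuning schedules can be written down as a simple step schedule. This is a sketch under assumptions: the base learning rate of 1e-4 is a placeholder chosen for illustration, and epochs are counted from 0 so that the drop takes effect at epoch 100 (short) or 200 (long).

```python
SCHEDULES = {
    "short": {"epochs": 150, "drop_epoch": 100},
    "long": {"epochs": 300, "drop_epoch": 200},
}

BASE_LR = 1e-4  # assumption: placeholder value for illustration only

def step_lr(epoch, base_lr, drop_epoch, factor=0.1):
    """Return the learning rate for a given epoch: base_lr, then base_lr * factor."""
    return base_lr * factor if epoch >= drop_epoch else base_lr

for name, cfg in SCHEDULES.items():
    first = step_lr(0, BASE_LR, cfg["drop_epoch"])
    last = step_lr(cfg["epochs"] - 1, BASE_LR, cfg["drop_epoch"])
    print(f"{name}: lr starts at {first:g} and ends at {last:g}")
```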

Source

ARCHITECTURE DETAILS

Source: Link

As you can see, the input image is first passed through a CNN backbone to extract a feature map (f), which is added to positional encodings and fed into multiple transformer encoder layers. The output of the encoder feeds into the decoder.

C=Channel; H=Height; W=Width

A randomly cropped query patch from the same input image is fed into the CNN backbone followed by GAP (Global Average Pooling) to produce the patch feature (p), which is then added to object queries of the same dimension and fed into the decoder.

Source

There are N object queries. They are learnable embeddings that are updated as the model trains.

“The role of object queries is like a group of people.”

Each of them is responsible for asking about a certain position and box size, which in turn helps the model make its predictions.

😆 Object queries — Courtesy Zoo Zoo

From DETR Paper

Loss Function

For the loss calculation in the pre-training stage, the predicted result for each object query consists of three elements:

ĉ_i ==> Binary classification of whether the object query matches the query patch or not

b̂_i ==> Vector that defines the box center coordinates and its width and height {x, y, w, h}

p̂_i ==> Reconstructed feature with C = 2048 for the ResNet-50 backbone

The L_rec component is the reconstruction loss proposed in this paper to balance classification and localization during unsupervised pre-training. It is a mean squared error between the L2-normalized reconstructed feature and the L2-normalized patch feature from the CNN backbone, which preserves feature discrimination.
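A minimal sketch of such a reconstruction loss follows: both feature vectors are L2-normalized and then compared with a mean squared error. Averaging over the feature dimension and the epsilon guard are assumptions of this sketch, and the short vectors stand in for the 2048-dimensional features.

```python
import math

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]

def reconstruction_loss(patch_feature, reconstructed_feature):
    """Mean squared error between the two L2-normalized feature vectors."""
    p = l2_normalize(patch_feature)
    p_hat = l2_normalize(reconstructed_feature)
    return sum((a - b) ** 2 for a, b in zip(p, p_hat)) / len(p)

# Same direction, different scale: the loss is (numerically) zero.
print(reconstruction_loss([1.0, 2.0, 2.0], [3.0, 6.0, 6.0]))
# Orthogonal features: the loss is large.
print(reconstruction_loss([1.0, 0.0], [0.0, 1.0]))
```

Because of the normalization, only the direction of the reconstructed feature matters, not its magnitude.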

With multi-query patches:

If we have M query patches and N object queries, we divide the N object queries into M groups, so that each query patch is assigned to N/M object queries.

The authors hypothesize two requirements for better generalization:
i) Independence of query patches (attention mask)
ii) Diversity of object queries (object query shuffle)

To satisfy the independence of query patches, we utilize an attention mask matrix to control the interactions between different object queries.
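As I understand the paper, the mask is added to the attention scores of the decoder self-attention: 0 where two object queries belong to the same query patch group, and negative infinity elsewhere, so that queries of different patches cannot attend to each other. A small sketch, assuming N is divisible by M and that consecutive queries form a group:

```python
def group_attention_mask(num_queries, num_patches):
    """Additive mask: 0.0 inside a group, -inf between different groups."""
    if num_queries % num_patches != 0:
        raise ValueError("num_queries must be divisible by num_patches")
    group_size = num_queries // num_patches
    neg_inf = float("-inf")
    return [
        [0.0 if i // group_size == j // group_size else neg_inf
         for j in range(num_queries)]
        for i in range(num_queries)
    ]

# N = 6 object queries shared by M = 3 query patches: groups {0,1}, {2,3}, {4,5}.
mask = group_attention_mask(6, 3)
for row in mask:
    print(row)
```

After the softmax, an entry of negative infinity becomes an attention weight of exactly zero, which is what makes the query patches independent of each other.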

To simulate implicit group assignment between object queries, the embeddings of all object queries are randomly shuffled during pre-training. In addition, 10% of the query patches are masked to zero during pre-training, similarly to dropout. In a further study, however, the authors report that “the object query shuffle is not helpful”.

The results suggest that pre-training transformers is still indispensable even with sufficient training data (i.e. ∼118K images on COCO).

UP-DETR is further extended to one-shot detection and panoptic segmentation, and it performs well on those tasks too.

The following curves and results summarize why an unsupervised approach is important.

With unsupervised pre-training, UP-DETR significantly outperforms DETR on object detection, one-shot detection and panoptic segmentation.

See ya next time!

connect with me on LinkedIn
linkedin.com/in/praveenkumar-rajendran/

UP-DETR: Unsupervised Pre-training for Object Detection with Transformers (A Review) was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.

AlexNet TensorFlow 2.1.0

Praveen Kumar Rajendran — Thu, 30 Apr 2020 16:51:42 GMT

Training AlexNet from scratch in TensorFlow 2.1.0 for our own classification task.

“AI is the new electricity.”— Andrew Ng

Demystifying Deep Learning

Hellooooo Everyone! This is my first ever post on Medium. It really took a long time to get here. I’m an automotive software test engineer with an electrical engineering background. Well, I know what you've got in your mind now: “What the heck is he doing with deep learning?” Wait for it, I’ll answer that later, because first things first.

I’m going to walk through creating AlexNet and training it on the five-class Flowers dataset from scratch. This post talks exclusively about creating AlexNet in TensorFlow 2.1.0, an end-to-end open-source machine learning platform.

Why TensorFlow 2.x?

TensorFlow 2.x makes the development of ML applications much easier, with tight integration of Keras into TensorFlow, eager execution by default, and Pythonic function execution.

You no longer need to create a Session to run the computational graph; you see the result of your code directly, unlike in TensorFlow 1.x.

HOW COOL IS THAT!

AlexNet

AlexNet is an influential computer vision paper that employed CNNs and GPUs to accelerate deep learning. As of 2020, the AlexNet paper has been cited over 61,015 times according to the author’s Google Scholar profile.

AlexNet was a large-margin winner of ILSVRC-2012. The network demonstrated the potential of training large neural networks quickly on massive datasets using widely available gaming GPUs.

Availability of high computation power and large datasets together is❤️️
Yeah! One of the reasons why deep learning is taking off.
“Seismic shift that broke the Richter scale!”

AlexNet Architecture as given in the research paper

Six Main Ideas of AlexNet

1.ReLU nonlinearity

ReLU is a so-called non-saturating activation. This means that the gradient does not shrink towards zero for a positive activation, and as a result training is faster. In other words, when the activation (a) is negative, ReLU(a) = 0; when the activation (a) is positive, ReLU(a) = a.

ReLU Visualization
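In code, ReLU and its gradient are one-liners. The tiny sketch below (plain Python, scalar inputs) shows why it is called non-saturating: for any positive activation the gradient stays at 1, no matter how large the activation gets.

```python
def relu(a):
    """ReLU(a) = a for positive a, 0 otherwise."""
    return a if a > 0 else 0.0

def relu_grad(a):
    """Derivative of ReLU: 1 for positive a, 0 otherwise."""
    return 1.0 if a > 0 else 0.0

for a in (-2.0, -0.5, 0.5, 2.0, 100.0):
    print(a, relu(a), relu_grad(a))
```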

2.Multiple GPUs for training

3.Local response normalization

4.Data augmentation

5.Test time data augmentation

Five crops of a single test image (4 corners and center) and their horizontal flips are taken, and predictions are made on these 10 augmented images. The predictions are then averaged to make the final prediction.
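Here is a small sketch of the bookkeeping behind this ten-crop scheme: it computes the five crop boxes, pairs each with a "flipped" flag, and averages per-crop class probabilities. The 256x256 image and 224x224 crop sizes follow the AlexNet paper; the function names and the two-class probabilities are made up for illustration.

```python
def ten_crop_boxes(image_width, image_height, crop_size):
    """Return 10 (box, flipped) pairs: 4 corners + center, plain and mirrored."""
    w, h, c = image_width, image_height, crop_size
    top_lefts = [
        (0, 0), (w - c, 0), (0, h - c), (w - c, h - c),  # four corners
        ((w - c) // 2, (h - c) // 2),                     # center
    ]
    boxes = [(left, top, left + c, top + c) for left, top in top_lefts]
    return [(box, False) for box in boxes] + [(box, True) for box in boxes]

def average_predictions(per_crop_probs):
    """Average class probabilities over all crops."""
    n = len(per_crop_probs)
    return [sum(column) / n for column in zip(*per_crop_probs)]

crops = ten_crop_boxes(256, 256, 224)
print(len(crops), crops[4])
print(average_predictions([[1.0, 0.0], [0.0, 1.0]]))
```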

6.Dropout

It uses a dropout rate of 0.5 during training. This means that during the forward pass, 50% of the activations of the layer are set to zero and do not participate in backpropagation. During testing, no neuron is dropped, just as in real-time inference.

Intuition for dropouts
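A minimal sketch of dropout on a list of activations follows. Note one deliberate difference: this is the "inverted" variant used by modern frameworks, which rescales the surviving activations during training, whereas the original AlexNet paper instead multiplied the outputs by 0.5 at test time. The function name and signature are my own.

```python
import random

def dropout(activations, rate=0.5, training=True, rng=None):
    """Inverted dropout: zero each activation with probability `rate` while training."""
    if not training:
        return list(activations)  # nothing is dropped at test time
    rng = rng or random.Random()
    keep = 1.0 - rate
    # Survivors are scaled by 1 / keep so the expected activation stays the same.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

activations = [1.0, 2.0, 3.0, 4.0]
print(dropout(activations, rate=0.5, training=True, rng=random.Random(0)))
print(dropout(activations, rate=0.5, training=False))
```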

TensorFlow Implementation

Yes, it’s finally happening! You just came across a little theory that will be useful for you. Honestly, Seeing that working out for yourself is a joy. Man, That’s the thing, let’s get it.

ENVIRONMENT USED:

Editor: PyCharm IDE
OS: Windows 10 (64bit)
GPU: Nvidia GeForce GTX 1050
CPU: Intel i7–8750H

Training Time: ~17 Minutes(Approx)

What are we up to?

Import necessary packages.
Getting the dataset & Analyzing them.
Defining the Model Architecture, Yey!!!!! The AlexNet is coming…
Preprocessing the images in the dataset for the training process of our deep learning model.
Compiling it with the Loss function and Optimizer to be used for training.
Define callbacks to be used while training.
Finally, we train the model and save it.
Visualization of the training process and model in TensorBoard.dev
Doing Evaluation of the trained model.
Importance of validation dataset.

Step 1:
I will start off by importing the necessary packages: TensorFlow, NumPy, pathlib and datetime. I will print out the versions for reference.

https://medium.com/media/a25935db6c94c411169afe83e63b4433/href

Tensor Flow Version: 2.1.0
numpy Version: 1.18.2

Step 2:

In this section, I’ve specified the directory of the unzipped dataset.

i) The total number of images is then printed.
ii) Class names are printed as a list by reading the names of the subdirectories in the dataset.
iii) The total number of classes is printed.

The folder structure of the unzipped Dataset is given below.

flower_photos
|__daisy
|__dandelion
|__roses
|__sunflowers
|__tulips
|__LICENSE.txt

https://medium.com/media/7c4ec358ad54a0fd4dedde90c850036f/href

3670
['daisy' 'dandelion' 'roses' 'sunflowers' 'tulips']
5
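The embedded code for this step is not reproduced here, so the following is only a sketch of how such a summary could be computed with pathlib alone. The function name, the `*/*.jpg` pattern and the `flower_photos` path are assumptions; the post prints the class names as a NumPy array, while this sketch returns a plain sorted list.

```python
from pathlib import Path

def summarize_dataset(data_dir, pattern="*/*.jpg"):
    """Count images and list class names (one subdirectory per class)."""
    data_dir = Path(data_dir)
    image_count = len(list(data_dir.glob(pattern)))
    # Files such as LICENSE.txt are skipped because they are not directories.
    class_names = sorted(p.name for p in data_dir.iterdir() if p.is_dir())
    return image_count, class_names

if Path("flower_photos").is_dir():  # assumed location of the unzipped dataset
    count, names = summarize_dataset("flower_photos")
    print(count)
    print(names)
    print(len(names))
```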

Step 3:

Here we define a model architecture of AlexNet.

i) As you can see, batch normalization is used after each convolution layer instead of local response normalization.
ii) The dropout layers are not added, but they are left as comments at the two fully connected layers so that you can tweak them if you want.
iii) Parameters like strides and kernel sizes are tweaked a little bit (Yey! we are becoming deep learning practitioners); however, the number of kernels is kept the same as in AlexNet.

The reason why I did not add a dropout layer is that it sometimes behaves weirdly during backpropagation of the neural network.

The benefits of batch normalization go beyond reducing overfitting: it also speeds up training by letting us use a higher learning rate for the optimizer of the network.

As Andrew NG Explains, I’m talking about those “tiny tiny baby steps” ❤️

https://medium.com/media/dbd8e6a1c42e9705c7d3ea1fc0785bb5/href

Step 4:

In this section, we prepare the data for training, which means preprocessing it before we feed it to the neural network. We define the batch size, height, width and steps per epoch, then resize and preprocess the images as needed using ImageDataGenerator, which lets you do everything on the fly. That’s a nice gift from Keras.

I can’t stress enough how useful ImageDataGenerator is for deep learning.

ImageDataGenerator accepts the raw data, randomly transforms it according to the arguments we give, and returns only the new, transformed data to be used for training.

https://medium.com/media/6cb2f994bf3ce18bff9567e1d49166e6/href

Found 3670 images belonging to 5 classes.
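The steps-per-epoch value mentioned in this step is just the number of batches needed to see every image once. A quick sketch of the arithmetic, assuming a batch size of 32 (the post does not state the value used):

```python
import math

def steps_per_epoch(num_images, batch_size):
    """Number of batches needed to cover the dataset once (last batch may be smaller)."""
    return math.ceil(num_images / batch_size)

BATCH_SIZE = 32  # assumption: not stated in the post
print(steps_per_epoch(3670, BATCH_SIZE))
```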

Step 5:

In this section, we get the model ready to be trained on the data we have prepared: we compile it, specifying the loss function and the optimizer. To learn more about the stochastic gradient descent optimizer and how it differs from regular gradient descent, look at the video below.

https://medium.com/media/caec91525c5b412b468d6670bc5efb6a/href

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 55, 55, 96)        34944     
_________________________________________________________________
batch_normalization (BatchNo (None, 55, 55, 96)        384       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 27, 27, 96)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 27, 27, 256)       2973952   
_________________________________________________________________
batch_normalization_1 (Batch (None, 27, 27, 256)       1024      
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 27, 27, 384)       885120    
_________________________________________________________________
batch_normalization_2 (Batch (None, 27, 27, 384)       1536      
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 27, 27, 384)       1327488   
_________________________________________________________________
batch_normalization_3 (Batch (None, 27, 27, 384)       1536      
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 27, 27, 256)       884992    
_________________________________________________________________
batch_normalization_4 (Batch (None, 27, 27, 256)       1024      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 256)       0         
_________________________________________________________________
flatten (Flatten)            (None, 43264)             0         
_________________________________________________________________
dense (Dense)                (None, 4096)              177213440 
_________________________________________________________________
dense_1 (Dense)              (None, 4096)              16781312  
_________________________________________________________________
dense_2 (Dense)              (None, 5)                 20485     
=================================================================
Total params: 200,127,237
Trainable params: 200,124,485
Non-trainable params: 2,752
_________________________________________________________________
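The Param # column in the summary above can be reproduced by hand. The sketch below does the arithmetic; the kernel sizes (11x11, 11x11, 3x3, 3x3, 3x3) are not stated in the text but inferred from the parameter counts themselves, and each batch normalization layer holds 4 values per channel (gamma and beta are trainable, the moving mean and variance are not).

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Weights (k * k * in * out) plus one bias per filter."""
    return kernel_size * kernel_size * in_channels * filters + filters

def dense_params(in_features, units):
    """Weights (in * out) plus one bias per unit."""
    return in_features * units + units

def batchnorm_params(channels):
    """gamma, beta, moving mean and moving variance per channel."""
    return 4 * channels

# (kernel_size, in_channels, filters) for the five convolution layers.
conv_specs = [(11, 3, 96), (11, 96, 256), (3, 256, 384), (3, 384, 384), (3, 384, 256)]

conv_total = sum(conv2d_params(*spec) for spec in conv_specs)
bn_total = sum(batchnorm_params(filters) for _, _, filters in conv_specs)
flatten_size = 13 * 13 * 256  # output shape of the last max-pooling layer
dense_total = (dense_params(flatten_size, 4096)
               + dense_params(4096, 4096)
               + dense_params(4096, 5))

total_params = conv_total + bn_total + dense_total
non_trainable_params = sum(2 * filters for _, _, filters in conv_specs)
print(total_params, total_params - non_trainable_params, non_trainable_params)
```

Almost 89% of the parameters sit in the first dense layer, which is a direct consequence of flattening a 13x13x256 feature map into 43,264 inputs.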

https://medium.com/media/0b23d8fd56a651f73133e15aed404da1/href

Step 6:

Here we define the callbacks to be used while our model is training.

https://medium.com/media/f0ac6ec3eb57a457aa48410fc1097b55/href

Step 7:

Finally, we train the model. It’s useful to note that even though we specified epochs = 50, the model was trained for only 17 epochs. That's because of the callbacks we used to stop the training once a certain level of accuracy and loss is reached.

I’ve been saying “training the model” for a long time. Well, what does that mean?
It means our model is learning the weights of the neurons that map the given input to the output.
What we are dealing with is a supervised learning problem, i.e. we show the neural network that for this input, this is the output. The model then learns from the input and output data, using the optimizer, which tries to reduce the loss that we specified.
The trained model can in turn be deployed to make predictions on images that we give it in real time.

Saving the model is important because you can later deploy it wherever you want. We can deploy it on lightweight embedded devices like the Raspberry Pi or on mobile devices by converting it into a TFLite model. You can even deploy it in the browser using TensorFlow.js.
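The stopping logic behind such a callback can be sketched without TensorFlow. In Keras, a check like `should_stop` would typically live in the `on_epoch_end(self, epoch, logs=None)` method of a `tf.keras.callbacks.Callback` subclass, which sets `self.model.stop_training = True` when it fires. The thresholds and the simulated training history below are made-up values, not the ones used in the post.

```python
def should_stop(logs, min_accuracy=0.95, max_loss=0.1):
    """Return True once both the accuracy and the loss targets are met."""
    accuracy = logs.get("accuracy")
    loss = logs.get("loss")
    if accuracy is None or loss is None:
        return False
    return accuracy >= min_accuracy and loss <= max_loss

def epochs_run(history, max_epochs):
    """Simulate training: return how many epochs run before stopping."""
    for epoch, logs in enumerate(history[:max_epochs], start=1):
        if should_stop(logs):
            return epoch
    return min(len(history), max_epochs)

# Made-up per-epoch metrics standing in for a real training run.
history = [
    {"accuracy": 0.60, "loss": 1.10},
    {"accuracy": 0.85, "loss": 0.40},
    {"accuracy": 0.97, "loss": 0.08},
    {"accuracy": 0.99, "loss": 0.03},
]
print(epochs_run(history, max_epochs=50))
```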

https://medium.com/media/56d45331f9ffe7cefd15de806a3a4b99/href

My Blog is getting very long 🤯😅… So you can find the training progress here

Step 8:

TensorBoard is a nice tool for making implementations transparent, so that you can ask other deep learners to debug your model or demonstrate why your model is performing well.
You can run the command below in cmd to upload the logs to TensorBoard.dev and get the link for the TensorBoard visualization.

PS: ‘logs’ is the directory where the logs are stored during training.

An interesting thing about TensorBoard is that you can track how your model is performing during and after training. Cool!

tensorboard dev upload --logdir logs \
    --name "AlexNet TensorFlow 2.1.0" \
    --description "AlexNet Architecture Implementation in TensorFlow 2.1.0 from scratch with list of callbacks for stopping training when the required metrics are met. Callbacks are also used for Tensorboard Visuals."

You can see the TensorBoard visualization here.

Plots of Accuracy (y-axis) Vs epochs(x-axis) AND Loss(y-axis) Vs epochs(x-axis)

Model Graph at TensorBoard

Step 9:

Example: Neural Network Recognizing Hand written Digits

In this section, we evaluate the model's performance. Even though I could do it in the training file itself, I’m doing it in a separate file to show that the saved model can be used later for inference or evaluation with real-world data.
I randomly downloaded 10 images from Google Images for each of the 5 classes (50 in total) and stored them in the same directory structure as the training dataset, to make use of ImageDataGenerator for the evaluation.

Test_set
|__daisy
|__dandelion
|__roses
|__sunflowers
|__tulips

The code is the same as already explained for training the model. The difference is that we load a saved model and then use the test data acquired from the web to run inference and find out how our model performs on unseen data.

Accuracy is then printed.

https://medium.com/media/bd190946d02693405720ef9120d405ad/href

48
['daisy' 'dandelion' 'roses' 'sunflowers' 'tulips']
5
Found 50 images belonging to 5 classes.

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 55, 55, 96)        34944     
_________________________________________________________________
batch_normalization (BatchNo (None, 55, 55, 96)        384       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 27, 27, 96)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 27, 27, 256)       2973952   
_________________________________________________________________
batch_normalization_1 (Batch (None, 27, 27, 256)       1024      
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 27, 27, 384)       885120    
_________________________________________________________________
batch_normalization_2 (Batch (None, 27, 27, 384)       1536      
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 27, 27, 384)       1327488   
_________________________________________________________________
batch_normalization_3 (Batch (None, 27, 27, 384)       1536      
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 27, 27, 256)       884992    
_________________________________________________________________
batch_normalization_4 (Batch (None, 27, 27, 256)       1024      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 256)       0         
_________________________________________________________________
flatten (Flatten)            (None, 43264)             0         
_________________________________________________________________
dense (Dense)                (None, 4096)              177213440 
_________________________________________________________________
dense_1 (Dense)              (None, 4096)              16781312  
_________________________________________________________________
dense_2 (Dense)              (None, 5)                 20485     
=================================================================
Total params: 200,127,237
Trainable params: 200,124,485
Non-trainable params: 2,752
_________________________________________________________________

1/2 [==============>...............] - ETA: 3s - loss: 1.4212 - accuracy: 0.7188
2/2 [==============================] - 5s 2s/step - loss: 1.1020 - accuracy: 0.7000
accuracy:70.00%

Hurray! 70% ACCURACY. That’s fair for a model that didn’t even use a validation set while training. Hmm! A model can perform even better on unseen data if it is made to generalize well to data it was not exposed to.

However, we did a great job making the model classify 5 classes of flower images downloaded randomly from the web.

Step 10:

Our model might be overfitting the training data a little bit if it does not perform well on the test data. So we should use validation data during training itself, so that we can debug our model easily. We should also consider tuning the hyperparameters of the network.

validation_split can be specified in ImageDataGenerator to use a portion of the available data as the validation set.
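To show the idea of a hold-out split, here is a framework-free sketch that sets aside a fraction of every class for validation. It is not how Keras implements it: with Keras you would pass `validation_split` to `ImageDataGenerator` and select `subset="training"` or `subset="validation"` in `flow_from_directory`. The function name and the file names below are hypothetical.

```python
def split_per_class(files_by_class, validation_split=0.2):
    """Hold out a fraction of each class's files for validation."""
    train, validation = {}, {}
    for class_name, files in files_by_class.items():
        files = sorted(files)  # deterministic order before splitting
        n_val = int(len(files) * validation_split)
        validation[class_name] = files[:n_val]
        train[class_name] = files[n_val:]
    return train, validation

# Hypothetical file names standing in for the real dataset.
files_by_class = {
    "daisy": [f"daisy_{i}.jpg" for i in range(10)],
    "roses": [f"roses_{i}.jpg" for i in range(10)],
}
train_files, val_files = split_per_class(files_by_class, validation_split=0.2)
print({name: len(files) for name, files in train_files.items()})
print({name: len(files) for name, files in val_files.items()})
```

Splitting per class keeps the class balance of the validation set close to that of the training set.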

References :

💭“Winter is here.”

Link to flower dataset is here.
Link to Randomly downloaded test image dataset is here.
Link to the saved model is here.
Link to the repository is here.
Link to the AlexNet paper is here.

Finishing Things off

Most readers don’t make it to the end of a blog, but you did, because you are special and just don’t give up reading.

I’m hoping that I’ve taught you something. If you’ve found this post useful, then do clap and hold it for a while, so the blog reaches those who need it.
If you have any doubts, clarifications or suggestions for improvement, contact me on LinkedIn or raise an issue on GitHub.

Haa! I almost forgot to answer the question you had in your mind at the start. No problem, I gotcha! Well, I’m a software tester by profession, but that does not stop me from doing what I love.

“When something is important enough, you do it even if the odds are not in your favor.” — Elon Musk

AlexNet TensorFlow 2.1.0 was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.