Implementing Object Detection in Machine Learning for Flag Cards with MXNet

Prasad Pai
YML Innovation Lab
Nov 7, 2017

We at Y Media Labs have been working on applying Artificial Intelligence in sectors where it brings value to our daily lives. One field we have targeted is education. We have performed a few experiments on recognizing and localizing objects in images and videos, and we would like to leverage this capability to teach children about the flags of the various countries of the world. Check the video below to see what we are going to discuss in this article.

Youtube link

Every year, with the vast improvements among the winners of the ImageNet competition, many new architectures emerge. Instead of building our own neural network from scratch, we thought of making use of some of these models, as this would not only give us better accuracy but also reduce our effort. Almost every deep learning framework today hosts the weight files of these well-known networks, pre-trained on the ImageNet dataset.

Now let us look at the approach we followed while building our flag detection module.

What to choose? Image Classification vs Object Detection

This is an important decision to make, as the network will vary depending on the choice. Image classification would only help us identify a single flag per image, whereas object detection allows us to identify and localize multiple flags in the same image. Take a look at the slide from Stanford's CS231n lectures showing the various types of recognition tasks.

As we wanted the flexibility of allowing the user to show multiple flags at one go, we decided to go with object detection. Object detection also gives us a localized bounding box for each flag.

Selecting the right Framework

Besides the desktop version, we have plans of porting the network to mobile devices. While Tensorflow is widely used across AI-related projects, there appears to be no official support for converting Tensorflow models to Apple's CoreML as of now, and Apple's CoreML documentation does not show how to convert Tensorflow's trained models. For this reason, we decided to forgo Tensorflow's Object Detection models.

After this initial hiccup, we quickly set our eyes on MXNet. MXNet has a very good collection of leading models, with both architectures and learned weight files, in its model zoo. It also has detailed examples for several types of machine learning problems. Lastly, it provides support for converting trained MXNet models to Apple's CoreML through its built-in mxnet-to-coreml tool. Hence, we decided to select MXNet.

If you are new to MXNet, you can start learning about this wonderful framework from their tutorials.

Which Object Detection Network

The various object detection techniques make use of a base network (like ResNet, Inception, VGG, etc.) and implement their own technique on top of it. The bounding-box detection techniques we have come across so far are YOLO, SSD and Faster R-CNN. Among these three, SSD (Single Shot MultiBox Detector) achieves the best speed in processing objects from the image, and hence we chose the SSD method for object detection. For the base network, we decided to go with VGG, as its implementation is very straightforward.

SSD network architecture
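To give a flavor of how SSD works: it tiles each feature map with default (anchor) boxes whose scales grow linearly across layers. The sketch below uses the scale formula from the SSD paper with the paper's suggested s_min and s_max; the exact values in MXNet's SSD example may differ.

```python
def default_box_scales(num_layers=6, s_min=0.2, s_max=0.9):
    """Scale of the SSD default boxes for each feature map layer.

    Implements s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1)
    from the SSD paper; lower layers detect smaller objects.
    """
    step = (s_max - s_min) / (num_layers - 1)
    return [round(s_min + step * k, 2) for k in range(num_layers)]

scales = default_box_scales()
print(scales)  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```

Each scale is combined with several aspect ratios per feature-map cell, which is how a small flag near the head and a larger flag held at arm's length can both be matched to a default box.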

Dataset Generation

Now that we are clear on the framework and network, the next step is to feed in the right type of data. It is important to remember that the kind of data that goes into the network during the training phase is what it will see during the evaluation (testing) phase. If we observe, an evaluation image will mostly comprise the upper part of a person's body (dominantly the head) along with one or two flags in his/her hands. Finding this type of specific raw dataset on the internet is next to impossible, and generating it physically would require hundreds of volunteers willing to pose with flag images, which in turn means days of hard labor.

To overcome this problem, we scavenged the internet for a person-centric image dataset, and the best one we found was the CelebA dataset. CelebA comprises more than 200,000 images of celebrities, with diversity in pose and distance from the camera. The next step is to add our flag images onto the CelebA images.

While it may look ideal to paste the flag photos next to the hands of the celebrities (which again brings the extra work of detecting faces and hands and scaling the flag images accordingly), we found that randomizing the flag images with data augmentation techniques, combined with random positioning of the flags, does not hurt the performance of the model at all! We generated a total of 120,000 training images and 10,000 testing images. If you are interested in the various data augmentation techniques, you can read about them in another Medium post of mine.
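The compositing step can be sketched as follows; the function name and the scale range here are illustrative, not taken from our repository. The idea is to pick a random scale and position for the flag inside a background image and record the resulting box as the ground-truth label.

```python
import random

def place_flag(bg_w, bg_h, flag_w, flag_h, rng=random):
    """Choose a random scale and position for a flag inside a background image
    and return the ground-truth bounding box (xmin, ymin, xmax, ymax)."""
    # Scale the flag to roughly 20%-40% of the background width (illustrative range).
    scale = rng.uniform(0.2, 0.4) * bg_w / flag_w
    w, h = int(flag_w * scale), int(flag_h * scale)
    # Random top-left corner such that the flag stays fully inside the image.
    x = rng.randint(0, bg_w - w)
    y = rng.randint(0, bg_h - h)
    return (x, y, x + w, y + h)

box = place_flag(178, 218, 300, 200)  # CelebA aligned images are 178x218
```

The same box is later written into the annotation file, so the label generation comes for free with the compositing.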

This is how a sample dataset image looks.

Dataset images with flags and added noise

Bounding boxes and data format

As our object detection model outputs the coordinates of the flags, it is obvious that we will have to feed the network with the coordinates of the flags along with the input images during the training phase. As we are generating our own dataset, we have exact information about the coordinates of the flags. For each output image, we create an XML file in Pascal VOC format with information on the coordinates of the objects of interest, occlusion, truncation, whether the object is difficult to detect, etc.
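A minimal Pascal VOC annotation with a single flag can be generated with the Python standard library; the field values below are illustrative, not from our actual dataset:

```python
import xml.etree.ElementTree as ET

def voc_annotation(filename, width, height, cls, box):
    """Build a minimal Pascal VOC annotation for one object."""
    xmin, ymin, xmax, ymax = box
    ann = ET.Element("annotation")
    ET.SubElement(ann, "filename").text = filename
    size = ET.SubElement(ann, "size")
    for tag, val in (("width", width), ("height", height), ("depth", 3)):
        ET.SubElement(size, tag).text = str(val)
    obj = ET.SubElement(ann, "object")
    ET.SubElement(obj, "name").text = cls
    ET.SubElement(obj, "truncated").text = "0"
    ET.SubElement(obj, "difficult").text = "0"
    bb = ET.SubElement(obj, "bndbox")
    for tag, val in zip(("xmin", "ymin", "xmax", "ymax"), box):
        ET.SubElement(bb, tag).text = str(val)
    return ET.tostring(ann, encoding="unicode")

xml_str = voc_annotation("000001.jpg", 178, 218, "india", (40, 60, 110, 105))
```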

The MXNet image iterators accept data in either .lst or .rec format, .rec being the preferred binary record format. We extended the provided example code with our own FlagsCeleba class to process the generated 120,000 training and 10,000 validation images and XML files into separate train and validation .rec files. In our case, the train.rec file came to 7 GB.
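For reference, a detection .lst line packs an index, a small header, the normalized object labels and the image path into one tab-separated record. The header layout below follows our reading of the MXNet SSD example and should be treated as an assumption; what matters is that box coordinates are normalized to [0, 1] by the image size.

```python
def lst_line(index, img_path, img_w, img_h, objects):
    """Build one line of a detection .lst file.

    objects: list of (class_id, xmin, ymin, xmax, ymax) in pixels.
    Header layout (the 2 and 5 below) is our assumption from the MXNet SSD
    example: 2 extra header entries (width, height), 5 values per object.
    """
    header = [index, 2, 5, img_w, img_h]
    labels = []
    for cls, xmin, ymin, xmax, ymax in objects:
        labels += [cls, xmin / img_w, ymin / img_h, xmax / img_w, ymax / img_h]
    return "\t".join(str(f) for f in header + labels + [img_path])

line = lst_line(0, "images/000001.jpg", 178, 218, [(3, 40, 60, 110, 105)])
```

The im2rec tool then packs these lines together with the JPEG bytes into the binary .rec file that the training iterator reads.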

Training

With the binary record file ready, we started our training procedure. Using MXNet, we can explicitly set the context to run our code on GPU or CPU. We ran our code on a GTX 1080 GPU for a total of 4 epochs, each taking about 45 minutes, with the default learning rate of 0.004 and a batch size of 32 samples.
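For a sense of scale, those numbers pin down the throughput of the run:

```python
num_images, batch_size, epoch_minutes = 120_000, 32, 45

iters_per_epoch = num_images // batch_size
secs_per_batch = epoch_minutes * 60 / iters_per_epoch

print(iters_per_epoch)           # 3750 iterations per epoch
print(round(secs_per_batch, 2))  # 0.72 seconds per batch on the GTX 1080
```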

Evaluation

For evaluation purposes, we provided support for testing with images, pre-recorded videos and even live detection using a video stream as input. Some sample evaluated images are shown below.
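The article does not spell out how a detection is judged correct; for anyone reproducing this, the standard yardstick is intersection-over-union (IoU) between the predicted and ground-truth boxes:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

# A detection is conventionally counted correct when IoU >= 0.5.
score = iou((0, 0, 10, 10), (5, 5, 15, 15))
```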

Input images on top row and flag detected images in bottom row

You can also check the video output for the series of flags in this Youtube video.

Future Work

Up until now, I have not mentioned how many flags we trained the network on. We have currently trained it for only 25 country flags, and we would like to extend it to all the countries in the world (190+). The challenges here are that the dataset will become extremely large, and that the network must be able to distinguish 190+ classes (types of flags). Note that the ImageNet Object Detection competition has 200 classes, but those classes are very distinct from one another, whereas in our case most flags are just rectangular bars differing primarily in color.

These limitations may make it difficult to train the model for all the country flags, and it might even become necessary to change our network architecture pipeline. There may be a part 2 of this article in which I discuss our learnings from making the network learn 190+ classes (flags).

You can run the demo or train your own network with your own country flags from my Github repository.

Do let me know in the comments what you felt about this article. Also feel free to make suggestions or point out mistakes you find in our approach.

Disclaimer: As the CelebA dataset restricts commercial usage, we honour their policies; we do not intend to use this dataset anywhere the outcome of the work could serve a commercial purpose.



Software developer @ Flipkart. An aspiring data scientist moving ahead with one step at a time.