Object Detection in X-ray Images

Rui Wang
SFU Professional Computer Science
26 min read · Apr 16, 2020

Authors: Nattapat Juthaprachakul, Rui Wang, Siyu Wu, Yihan Lan

This blog is written and maintained by students in the Professional Master’s Program in the School of Computing Science at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit sfu.ca/computing/pmp.

Table of contents:

  1. Motivation and Background
  2. Problem Statement
  3. Data Science Pipeline
  4. Methodology
  5. Evaluation
  6. Data Product
  7. Lessons Learnt
  8. Summary

1. Motivation and Background:

Every day, millions of people travel on public transport such as subways and civil airliners. Security inspection, such as baggage screening, plays a critical role in the security-clearance process by helping protect public spaces from safety hazards such as terrorism. However, as cities grow and public transport is used more heavily, people continue to expect security without substantial inconvenience. It is therefore important to design a system that can speed up and increase the efficiency of the security inspection process. With the accelerating pace of development in Deep Learning algorithms such as Convolutional Neural Networks (CNNs), their applications have proven useful in a variety of fields such as Machine Translation and Image Processing; for example, they can help discover and classify objects of interest quickly, automatically, and accurately. In this project, we explore several Deep Learning-based Object Detection models to locate and classify prohibited objects in X-ray images and compare their performance on different metrics.

As one of the fundamental computer vision problems, object detection provides valuable information for the semantic understanding of images and videos and is related to many applications, including image classification, robotics, face recognition, and autonomous driving. In our project, we focus on a generic object detection task that aims to locate and classify the objects present in an image, labeling them with rectangular bounding boxes along with confidence scores for their existence. There has been considerable research in this area. R. Girshick, J. Donahue, T. Darrell, and J. Malik [29] introduced a region-based Object Detection network called R-CNN, which uses a selective search algorithm to propose bounding boxes around objects of interest; however, this model is very slow to train. Shortly afterwards, R. Girshick [30] improved R-CNN by sharing the convolutional computation across region proposals instead of running the CNN once per region, which reduces training time; the resulting model is called Fast R-CNN. One year later, S. Ren, K. He, R. Girshick, and J. Sun [31] introduced Faster R-CNN, which replaces the selective search algorithm with an additional network called the Region Proposal Network, reducing training time dramatically.

In 2017, K. He, G. Gkioxari, P. Dollár, and R. B. Girshick [32] introduced a new architecture called Mask R-CNN, which can locate the exact pixels of each object instead of just bounding boxes. Apart from these region-proposal-based methods, earlier research also introduced what is known as the regression/classification-based approach. D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov [33] introduced MultiBox in 2014; J. Redmon, S. Divvala, R. Girshick, and A. Farhadi [34] invented YOLO in 2016, and J. Redmon and A. Farhadi [35] later introduced a faster model called YOLOv2. In 2016, another research team consisting of W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg [36] introduced the Single Shot MultiBox Detector (SSD). Compared to Faster R-CNN, SSD achieves comparable accuracy with lower training and inference time. All of these region-proposal-based and regression/classification-based frameworks have been explored in our project.

Dataset: The SIXray dataset consists of X-ray images collected from Beijing subway stations: https://github.com/MeioJane/SIXray

Credit: Miao, Caijing; Xie, Lingxi; Wan, Fang; Su, Chi; Liu, Hongye; Jiao, Jianbin; Ye, Qixiang. “SIXray: A Large-scale Security Inspection X-ray Benchmark for Prohibited Item Discovery in Overlapping Images.” In CVPR, 2019.

2. Problem Statement:

The goal of this project is to find the best algorithms for detecting prohibited objects in X-ray images by selecting multiple algorithms, training multiple models, and reporting their comparative performance. The prohibited items consist of Gun, Knife, Wrench, Pliers, and Scissors. The Hammer class is not included in this project since there are too few images of it. Model performance is described by mean Average Precision (mAP), the standard Object Detection metric, as well as Precision and Recall scores. The following sections describe the problems we are trying to solve and why they are challenging.

2.1 Algorithms(Object Detection vs Image Classification)

In image classification, a CNN is used as a feature extractor, computing features directly from all pixels in the image. These features then serve as the basic information for detecting and classifying prohibited items in X-ray images. However, extracting features from every candidate region of an image in this way is computationally expensive and produces a large amount of redundant information. In addition, a standard CNN followed by a fully connected layer has a fixed-length output, which implicitly assumes a fixed number of occurrences of the prohibited items. In our dataset, however, a single image can contain many prohibited items of the same or different classes, as well as several unrelated items. Another problem is that prohibited items can appear at different spatial locations and with different aspect ratios within our images. Since we would need to evaluate a large number of different regions and aspect ratios, the computational cost and time would be prohibitive. We therefore conclude that this dataset is better suited to Object Detection algorithms. In Object Detection, the goal is not only to classify the prohibited objects but also to locate them by drawing a bounding box around each one.

Credit: https://machinelearningmastery.com/object-recognition-with-deep-learning/

2.2 Imbalanced Dataset

The second problem is that our dataset is highly imbalanced: it contains far more negative samples than positive samples. A negative sample is an image that does not contain any object of interest, while a positive sample is an image that contains at least one. In our case, the objects we try to detect are prohibited items in an X-ray image: Knife, Gun, Wrench, Pliers, and Scissors.

However, one benefit of using Object Detection models instead of Classification models is that we can train on positive samples alone, without incorporating negative images into our training dataset. The reason is that negative samples already exist implicitly in the images: every region that does not correspond to an annotated bounding box is a negative sample. Therefore, we can save time and cost when training on a large dataset without sacrificing much accuracy despite the imbalance.

2.3 Challenging Images

Lastly, our X-ray image dataset suffers not only from class imbalance but also from unclear images. By nature, security inspection often deals with baggage images in which several items cluster, overlap, and stack randomly on top of each other. Normal objects and prohibited items are mixed together in an arbitrary manner, leading to major detection problems such as false alarms or missed detections with current technology like simple metal detectors, or even with human inspectors.

However, by carefully choosing suitable Object Detection models, this challenging problem can be addressed with models that not only classify dangerous items correctly but also locate them precisely in cluttered images. In the next section, we present the Object Detection architecture behind each model selected for our project.

3. Data Science Pipeline:

3.1 Obtaining dataset

The SIXray dataset contains Positive samples (images containing objects of interest, namely the prohibited items we want to locate and classify) and Negative samples (images containing only non-prohibited items), which are later used for evaluating our models. In addition, the label files for the whole dataset are split across three separate folders. The location annotation files for the objects of interest are in XML format.

3.2 Pre-process images and label files to create training data

We use a subset of the Positive samples for training, and another subset of Positive samples combined with Negative samples for testing and evaluation. Due to limitations on computational cost and power, we did not use the whole SIXray dataset in this project. There are three main pre-processing steps for our dataset.

The first step is to get correct labels for each image we intend to use. Since we are working with a subset of the main dataset, we need to generate new label files for the images in our subset. These labels are later used for testing and evaluating our trained models.

The second step is to create a readable dataset by parsing the annotation files; each is an XML file containing metadata for an image, such as the classes and locations of the objects it contains.
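
To make this step concrete, below is a minimal sketch of how one annotation file can be parsed in Python. It assumes a Pascal VOC-style layout (an object tag with a class name and xmin/ymin/xmax/ymax pixel coordinates); the tag names and the file name in the usage comment are assumptions and may need adjusting to the actual SIXray files.

```python
# Minimal sketch: parse one XML annotation into (class, xmin, ymin, xmax, ymax)
# tuples, assuming a Pascal VOC-style layout. Tag names are assumptions.
import xml.etree.ElementTree as ET

def parse_annotation(xml_path):
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.findall("object"):
        name = obj.find("name").text
        bndbox = obj.find("bndbox")
        coords = tuple(int(float(bndbox.find(tag).text))
                       for tag in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((name,) + coords)
    return boxes

# Hypothetical usage:
# print(parse_annotation("P00001.xml"))  # e.g. [("Gun", 120, 45, 260, 180), ...]
```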

The last step is to convert the images and annotation files of the Positive samples into the TFRecord format used for training the Object Detection models.
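
The conversion follows the pattern used by the TensorFlow Object Detection API's dataset tools. The sketch below (written for the TF 1.x API we used) shows how one image and its parsed boxes could be packed into a tf.train.Example with the feature keys the API expects; the label_map dictionary and helper name are our own illustration.

```python
# Simplified sketch of building a tf.train.Example for the TF Object Detection
# API (TF 1.x). Box coordinates are normalized to [0, 1]; label_map is e.g.
# {"Gun": 1, "Knife": 2, "Wrench": 3, "Pliers": 4, "Scissors": 5}.
import tensorflow as tf

def make_tf_example(jpeg_bytes, width, height, boxes, label_map):
    xmins = [b[1] / width for b in boxes]
    ymins = [b[2] / height for b in boxes]
    xmaxs = [b[3] / width for b in boxes]
    ymaxs = [b[4] / height for b in boxes]
    names = [b[0].encode("utf8") for b in boxes]
    labels = [label_map[b[0]] for b in boxes]

    feature = {
        "image/encoded": tf.train.Feature(bytes_list=tf.train.BytesList(value=[jpeg_bytes])),
        "image/format": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"jpeg"])),
        "image/height": tf.train.Feature(int64_list=tf.train.Int64List(value=[height])),
        "image/width": tf.train.Feature(int64_list=tf.train.Int64List(value=[width])),
        "image/object/bbox/xmin": tf.train.Feature(float_list=tf.train.FloatList(value=xmins)),
        "image/object/bbox/ymin": tf.train.Feature(float_list=tf.train.FloatList(value=ymins)),
        "image/object/bbox/xmax": tf.train.Feature(float_list=tf.train.FloatList(value=xmaxs)),
        "image/object/bbox/ymax": tf.train.Feature(float_list=tf.train.FloatList(value=ymaxs)),
        "image/object/class/text": tf.train.Feature(bytes_list=tf.train.BytesList(value=names)),
        "image/object/class/label": tf.train.Feature(int64_list=tf.train.Int64List(value=labels)),
    }
    # Serialized examples are then written out with tf.python_io.TFRecordWriter.
    return tf.train.Example(features=tf.train.Features(feature=feature))
```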

3.3 Creating Training pipeline and training models

Our training uses the TensorFlow Object Detection API, which can be downloaded and installed from the link below, together with config files and pre-trained models from the TensorFlow detection model zoo. We also tried a pure Classification model, but it did not work well, so we switched to Object Detection models instead.

Tensorflow Object Detection API: https://github.com/tensorflow/models/tree/master/research/object_detection

Tensorflow Object Detection Model Zoo:

https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md

Tensorflow Object Detection config for each model:

https://github.com/tensorflow/models/tree/master/research/object_detection/samples/configs

The training is done on Google Cloud Platform with Deep Learning VM. We trained eight different Object Detection models.

The total number of training images is 7,200 Positive samples. In this project, we did not add Negative samples to the training set because the Detection models already use the parts of an image that do not belong to annotated objects as negative examples. In addition, the training process is monitored with TensorBoard, where we can follow training progress online, such as the number of completed steps, training loss, and validation loss.
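
For reference, the snippet below illustrates the handful of fields we typically edit in a model-zoo pipeline config before launching training (shown here for an SSD model). It is a trimmed illustration only: the exact fields differ between models and API versions, and every path is a placeholder.

```
model {
  ssd {
    num_classes: 5  # Gun, Knife, Wrench, Pliers, Scissors
    # ... anchor, box predictor, and feature extractor settings omitted
  }
}
train_config {
  fine_tune_checkpoint: "PATH_TO_PRETRAINED_MODEL/model.ckpt"
  num_steps: 200000
  # ... optimizer and data augmentation settings omitted
}
train_input_reader {
  label_map_path: "PATH_TO/label_map.pbtxt"
  tf_record_input_reader { input_path: "PATH_TO/train.record" }
}
eval_input_reader {
  label_map_path: "PATH_TO/label_map.pbtxt"
  tf_record_input_reader { input_path: "PATH_TO/eval.record" }
}
```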

3.4 Evaluate each model with different ratios of negative-positive image sets

We export an inference graph from each trained model and use it to evaluate on another subset of Positive samples together with the whole set of Negative samples. Performance is measured by Precision-Recall curves and mean Average Precision (mAP) scores.
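
As a rough illustration of this step, the sketch below loads a frozen inference graph (exported with the API's export_inference_graph.py under TF 1.x) and runs it on one test image. The tensor names are the ones typically produced by exported graphs, and the file names are placeholders.

```python
# Minimal sketch: run a frozen TF 1.x Object Detection graph on one image.
import numpy as np
import tensorflow as tf
from PIL import Image

graph = tf.Graph()
with graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile("frozen_inference_graph.pb", "rb") as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name="")

with tf.Session(graph=graph) as sess:
    image = np.array(Image.open("test_image.jpg"))  # HxWx3 uint8
    boxes, scores, classes, num = sess.run(
        ["detection_boxes:0", "detection_scores:0",
         "detection_classes:0", "num_detections:0"],
        feed_dict={"image_tensor:0": image[None, ...]})  # add batch dimension
    # Boxes come back as [ymin, xmin, ymax, xmax] in normalized coordinates;
    # keep detections above a confidence threshold, e.g. 0.5.
    keep = scores[0] >= 0.5
    print(list(zip(classes[0][keep], scores[0][keep])))
```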

Three ratios of our test dataset for model evaluation:

  1. 1,800 Positive Samples + 50,000 Negative Samples
  2. 1,800 Positive Samples + 100,000 Negative Samples
  3. 1,800 Positive Samples + 150,000 Negative Samples

4. Methodology:

In Image Classification, a model takes an image as input and predicts which object class the image contains, while Image Localization specifies the location of objects in the image; localization alone, however, does not tell us the class of each object. Object Detection does both: it specifies the location of each object in the image and predicts its class. This is what makes Object Detection models well-suited for our X-ray image dataset.

In our project we implement eight Object Detection models, whose structures differ as described in the following sections.

  1. SSD Mobilenet_v1
  2. SSD Mobilenet_v1_fpn
  3. SSD Inception_v2
  4. SSD Resnet50
  5. R-FCN Resnet101
  6. Faster R-CNN Resnet50
  7. Faster R-CNN Resnet101
  8. Faster R-CNN Inception_v2

4.1 Object Detection Architectures

SSD is an approach for detecting objects in images using a single deep neural network. It discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object class in each default box and produces adjustments to the box so that it better matches the object's shape.
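
To make the idea of default boxes concrete, here is a small sketch of how their shapes can be derived from the scale formula in the SSD paper (scales spaced between s_min = 0.2 and s_max = 0.9 across the feature maps); the exact scales and aspect ratios used by each model-zoo config may differ.

```python
# Sketch of SSD-style default box shapes, following the scale formula in the
# SSD paper; values are relative to the input image size.
import math

def default_box_shapes(num_feature_maps=6, aspect_ratios=(1.0, 2.0, 0.5),
                       s_min=0.2, s_max=0.9):
    shapes = []
    for k in range(1, num_feature_maps + 1):
        # Scale grows linearly from s_min (finest map) to s_max (coarsest map).
        s_k = s_min + (s_max - s_min) * (k - 1) / (num_feature_maps - 1)
        shapes.append([(s_k * math.sqrt(a), s_k / math.sqrt(a))
                       for a in aspect_ratios])  # (width, height) pairs
    return shapes

# Every location of every feature map is tiled with these boxes; the network
# predicts class scores and coordinate offsets for each one.
print(default_box_shapes()[0])  # boxes for the finest feature map
```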

SSD is simple relative to methods that require region proposals because it encapsulates all computation in a single network. A VGG16 network is used as the feature extractor, playing the role of the CNN backbone in Faster R-CNN. This makes SSD easy to train, fast at detection, and straightforward to integrate into systems that require real-time detection. Because it uses a pyramidal feature hierarchy, SSD has a fast detection speed but performs poorly on small objects, since it misses the opportunity to reuse the higher-resolution feature maps; as shown in the following picture, SSD uses only the upper layers for detection.

Credit:https://medium.com/@jonathan_hui/what-do-we-learn-from-single-shot-object-detectors-ssd-yolo-fpn-focal-loss-3888677c5f4d

FPN consists of two main pathways: a bottom-up pathway (the usual feedforward backbone, whose deeper feature maps are lower-resolution but semantically stronger) and a top-down pathway (which upsamples the semantically strong, low-resolution maps back to higher resolutions). The network also adds lateral connections between the reconstructed top-down layers and the corresponding bottom-up feature maps to help the detector predict object locations more precisely. The resulting feature pyramid has rich semantics at all levels and is built quickly from a single input image scale without sacrificing representational power, speed, or memory.

In summary, FPN is a feature extractor designed to build high-level semantic feature maps at all scales (the pyramid concept). As a multi-scale feature extractor, FPN provides better-quality information than the plain feature extractors used in Object Detection models such as the Faster R-CNN architectures.

Credit:https://medium.com/@jonathan_hui/understanding-feature-pyramid-networks-for-object-detection-fpn-45b227b9106c

Credit:http://presentations.cocodataset.org/COCO17-Stuff-FAIR.pdf
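
To illustrate the top-down pathway and lateral connections, here is a compact Keras-style sketch (our own illustration, not the exact graph used by the TensorFlow Object Detection API). It assumes three backbone maps c3, c4, c5, each half the spatial size of the previous one.

```python
# Sketch of an FPN top-down pathway with lateral connections.
from tensorflow.keras import layers

def fpn_top_down(c3, c4, c5, channels=256):
    # 1x1 lateral convolutions bring every backbone map to the same depth.
    p5 = layers.Conv2D(channels, 1)(c5)
    p4 = layers.Add()([layers.Conv2D(channels, 1)(c4),
                       layers.UpSampling2D()(p5)])   # upsample coarse map 2x
    p3 = layers.Add()([layers.Conv2D(channels, 1)(c3),
                       layers.UpSampling2D()(p4)])
    # 3x3 convolutions smooth the merged maps before the prediction heads.
    p4 = layers.Conv2D(channels, 3, padding="same")(p4)
    p3 = layers.Conv2D(channels, 3, padding="same")(p3)
    return p3, p4, p5  # semantically strong maps at three resolutions
```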

In a naive approach to Object Detection, we would apply a CNN classifier to a single image many times in order to detect an object of interest: the network is run on different sliding windows over different regions, since the object could be at any position in the image. This approach is very computationally expensive and time-consuming, so there have been attempts to reduce the number of regions that must be evaluated.

With R-CNN, Ross Girshick proposed a selective search method to extract about 2,000 regions from an image; these regions are called region proposals. Selective search uses local cues such as texture, intensity, and color to generate possible object locations. A CNN then acts as a feature extractor for each candidate region, and a linear SVM classifier is used to classify the presence of an object within each region proposal. However, R-CNN is still computationally expensive to train, since the CNN must be run on up to 2,000 region proposals per image.

The same researcher introduced an upgraded version called Fast R-CNN, which takes a similar approach (still using selective search) with some key modifications. Instead of running the CNN on 2,000 cropped regions, two main operations are performed: first, the CNN extracts features from the whole image once, producing a convolutional feature map; second, a Region of Interest (RoI) pooling layer extracts a fixed-size feature for each region proposal from that feature map. This results in far less computation.
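
The RoI pooling idea can be illustrated with a toy NumPy sketch: every proposal, whatever its size, is divided into a fixed grid over the feature map and max-pooled, so each proposal yields a feature of the same shape. This is a simplified illustration, assuming the region is at least as large as the output grid.

```python
# Toy RoI pooling: max-pool a variable-sized region into a fixed output grid.
import numpy as np

def roi_pool(feature_map, roi, output_size=(2, 2)):
    """feature_map: HxWxC array; roi: (x1, y1, x2, y2) in feature-map coordinates."""
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2, :]
    h_bins = np.array_split(np.arange(region.shape[0]), output_size[0])
    w_bins = np.array_split(np.arange(region.shape[1]), output_size[1])
    pooled = np.zeros(output_size + (feature_map.shape[2],))
    for i, hb in enumerate(h_bins):
        for j, wb in enumerate(w_bins):
            pooled[i, j] = region[hb][:, wb].max(axis=(0, 1))
    return pooled  # shape: output_size x C, regardless of the RoI's size
```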

Because the selective search step is still very time-consuming, a newer model called Faster R-CNN was proposed. Instead of using a selective search algorithm, a separate network, the Region Proposal Network, is introduced to generate region proposals. This makes Faster R-CNN faster than both R-CNN and Fast R-CNN.

According to the authors of the paper, in contrast to previous region-based detectors such as R-CNN, Fast R-CNN, and Faster R-CNN, which apply a costly per-region subnetwork hundreds of times, the region-based model R-FCN has a fully convolutional architecture with almost all computation shared across the entire image. The authors propose position-sensitive score maps to address the dilemma between translation invariance in image classification and translation variance in object detection. As a result, this method can naturally adopt fully convolutional image classification backbones, such as the latest Residual Networks (ResNet), for object detection.

Credit: https://medium.com/@jonathan_hui/understanding-region-based-fully-convolutional-networks-r-fcn-for-object-detection-828316f07c99

Credit: https://arxiv.org/abs/1605.06409

  • Model Comparisons between Accuracy and Speed

Credit: https://mc.ai/object-detection-speed-and-accuracy-comparison-faster-r-cnn-r-fcn-ssd-and-yolo/

4.2 Key features of the underlying CNN models used as network backbones for the Object Detection models:

ResNet is a very deep network with many layers. It was among the first networks to use skip connections in order to combat the accuracy degradation, linked to the vanishing gradient problem, that occurs as networks get deeper. It also applies batch normalization. Note that ResNet101 is a deeper network than ResNet50.
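
A residual block is easy to sketch in Keras: the block's input is added back to the output of its convolutions, so the layers only need to learn a residual on top of the identity mapping. This is an illustrative basic block (assuming the input already has the same number of channels), not the exact ResNet50/101 bottleneck design.

```python
# Sketch of a basic ResNet residual block with a skip connection.
from tensorflow.keras import layers

def residual_block(x, filters=64):
    shortcut = x                                   # assumes x already has `filters` channels
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])                # the skip connection
    return layers.ReLU()(y)
```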

There are three notable components of the Inception_v2 architecture. Firstly, it introduces two additional auxiliary classifiers in the middle of the network to help combat the vanishing gradient problem. Secondly, because it uses filters of different sizes within the same layer, its layers are wider than those of a plain backbone such as ResNet. Lastly, to avoid the information loss caused by reducing the size of the inputs too drastically, the network improves on Inception_v1 by replacing each 5x5 convolution with two stacked 3x3 convolutions, easing the representational bottleneck problem.

The key feature of MobileNet is that it uses depthwise separable convolutions to build lightweight deep networks. The network applies a depthwise convolution followed by a 1x1 pointwise convolution. A regular convolution does the filtering and combining in a single operation, whereas a depthwise separable convolution performs these two operations in separate steps, resulting in far less computation.

Credit: https://towardsdatascience.com/review-mobilenetv1-depthwise-separable-convolution-light-weight-model-a382df364b69
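
The sketch below contrasts a standard convolution with the depthwise separable version, again in Keras for illustration (the real MobileNet also interleaves batch normalization between the two steps).

```python
# Standard convolution vs. MobileNet-style depthwise separable convolution.
from tensorflow.keras import layers

def standard_conv(x, filters=64):
    # Filters and combines channels in one 3x3 operation.
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

def depthwise_separable_conv(x, filters=64):
    # Step 1: one 3x3 filter per input channel (spatial filtering only).
    x = layers.DepthwiseConv2D(3, padding="same", activation="relu")(x)
    # Step 2: 1x1 pointwise convolution mixes the channels.
    return layers.Conv2D(filters, 1, activation="relu")(x)
```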

  • Model Comparisons between Accuracy and Speed

Credit: https://arxiv.org/pdf/1810.00736.pdf

Note:

1. Complexity can be expressed in terms of the floating-point operations, or FLOPs, required to find the solution; a FLOP serves as a basic unit of computation, and the number of FLOPs indicates the cost of performing a sequence of operations.

2. Inception v3 has the same architecture as Inception v2 with some minor changes.

From the graph above, in terms of computation time, we can rank the backbones we use from slowest to fastest as follows: Resnet101, Inception_v3, Resnet50, and Mobilenet_v1. In terms of accuracy, from highest to lowest, we get Inception_v3, Resnet101, Resnet50, and Mobilenet_v1.

5. Evaluation:

Evaluating Object Detection models involves two distinct tasks that must both be measured. The first is a classification task: determining whether an object of interest is present in the image. The second is a localization task: determining where in the image that object is. In addition, our dataset is both highly imbalanced between positive and negative samples and non-uniformly distributed among the different classes of prohibited objects. Using an accuracy metric alone would therefore not be enough, as we also need to assess how often our models misclassify objects and non-objects of interest. Hence, a confidence score is produced for each predicted bounding box, and both the locations and the classes of the predictions are assessed at a variety of acceptance thresholds. Average Precision (AP) is the main metric used in Object Detection tasks, so we first need to understand some important underlying concepts: the Intersection over Union threshold (IoU), the Precision-Recall curve, and Average Precision.

5.1 Intersection over Union threshold (IoU)

The key threshold that determines whether an Object Detection model has correctly classified a prohibited object and predicted its location is the Intersection over Union (IoU) threshold. IoU is defined as the area of the intersection divided by the area of the union of a ground-truth bounding box and a predicted bounding box.

Credit: https://github.com/rafaelpadilla/Object-Detection-Metrics
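
Computing IoU for two axis-aligned boxes is straightforward; the short helper below (boxes given as (x1, y1, x2, y2)) computes the quantity that gets compared against the IoU threshold.

```python
# Intersection over Union for two axis-aligned boxes (x1, y1, x2, y2).
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)        # intersection rectangle
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # half-overlapping boxes -> 1/3
```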

5.2 Precision-Recall curve

The Precision-Recall metric is used in this project as a measure of prediction success that remains meaningful when samples and classes are imbalanced.

Precision(P) is defined as the number of true positives(Tp) over the number of true positives plus the number of false positives(Fp). [P = Tp /(Tp+Fp)]

Recall(R) is defined as the number of true positives(Tp) over the number of true positives plus the number of false negatives(Fn). [R = Tp/(Tp+Fn)]

In order to compute these metrics, we first need to pick an IoU threshold that decides whether each detection counts as correct.

True Positive(Tp) is a correct detection with IoU ≥ threshold.

False Positive(Fp) is an incorrect detection with IoU < threshold.

False Negative(Fn) is a missing detection on objects of our interests.

True Negatives (Tn) are only measured implicitly in Object Detection: a true negative would be any bounding box over a region containing no object of interest, and there are countless such possible boxes in every image. This is why we do not measure True Negatives explicitly; the three quantities above already capture the model's behaviour.
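
Putting these definitions together, the sketch below shows a simplified way to count TP, FP, and FN for one image and one class and turn them into Precision and Recall: detections are processed in order of confidence, and each one is a True Positive only if it matches a not-yet-matched ground-truth box with IoU at or above the threshold (it reuses the iou() helper from the IoU sketch above). Real evaluation tools apply the same idea across all images and thresholds.

```python
# Simplified TP/FP/FN counting for one image and one class at a fixed IoU threshold.
def precision_recall(pred_boxes, pred_scores, gt_boxes, iou_threshold=0.5):
    order = sorted(range(len(pred_boxes)), key=lambda i: -pred_scores[i])
    matched = set()                # indices of ground-truth boxes already claimed
    tp = fp = 0
    for i in order:
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gt_boxes):
            overlap = iou(pred_boxes[i], gt)
            if j not in matched and overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j is not None and best_iou >= iou_threshold:
            tp += 1                # True Positive: sufficiently overlapping match
            matched.add(best_j)
        else:
            fp += 1                # False Positive: no matching ground truth
    fn = len(gt_boxes) - len(matched)   # False Negatives: missed ground-truth boxes
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```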

In short, Precision is the ability of our models to return ONLY relevant detections of the objects of interest, while Recall is the ability of our models to find ALL the ground-truth bounding boxes of the objects of interest.

An important observation from the Precision and Recall formulas is that Precision does not necessarily decrease as Recall increases. The definition of Precision, Tp/(Tp+Fp), shows that lowering the confidence threshold of our models may increase the denominator by increasing the number of results returned. If the threshold was previously set too high, the new results may all be True Positives, which will increase Precision. However, if the previous threshold was about right or too low, further lowering the threshold will introduce False Positives, which decreases Precision. Recall, on the other hand, is defined as Tp/(Tp+Fn), where Fn does not depend on the threshold, so lowering the threshold can only increase Recall (or leave it unchanged) by increasing the number of True Positives. Consequently, lowering the threshold may cause Precision to fluctuate while Recall stays the same. Selecting a single right threshold is hard; instead, we evaluate over all possible thresholds and average the results, which is exactly what Average Precision does.

The Precision-Recall curve shows the tradeoff between Precision and Recall for different thresholds. A high area under the curve represents both high Recall and high Precision where high Precision relates to a low False Positive rate and high Recall relates to a low False Negative rate. High scores for both show that our models are returning accurate results (high Precision), as well as returning a majority of all positive results (high Recall).

Models with high Recall but low Precision locate most of the ground-truth bounding boxes, but many of their detections are incorrect when compared to the true labels. Models with high Precision but low Recall are the opposite: they return only a few relevant bounding boxes, but most of their predictions are correct. In summary, we want models with both high Precision and high Recall, since they return many relevant bounding boxes with nearly all of them labeled correctly.

5.3 Average Precision (AP) and Mean Average Precision (mAP)

Average Precision (AP) summarizes the Precision-Recall curve as a weighted mean of the Precision achieved at each threshold, with the increase in Recall from the previous threshold used as the weight: [AP = Σn (Rn − Rn−1) Pn], where Pn and Rn are the Precision and Recall at the n-th threshold. In other words, AP is the Precision averaged across all Recall levels. Mean Average Precision (mAP) is then defined as the mean of the Average Precision across all classes. There are two variants of mAP: Micro mAP and Macro mAP. Macro mAP computes an AP metric independently for each class of interest and then averages them, so it treats all classes equally. In contrast, Micro mAP aggregates the contributions of all classes before computing a single AP metric. Since our dataset is highly imbalanced, Micro mAP is more suitable for evaluating our models.
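
As a small illustration of the difference, the sketch below computes Macro and Micro mAP with scikit-learn from toy per-class detection results, where each detection is marked as a true or false positive together with its confidence score. It is simplified: in a full detection evaluation, ground-truth objects that are never detected also enter the Recall denominator.

```python
# Toy Macro vs. Micro mAP with scikit-learn's average_precision_score,
# which implements AP = sum_n (R_n - R_{n-1}) * P_n.
import numpy as np
from sklearn.metrics import average_precision_score

# Per-class detections: (is_true_positive labels, confidence scores) -- toy numbers.
results = {
    "Gun":   (np.array([1, 1, 0, 1]), np.array([0.9, 0.8, 0.7, 0.6])),
    "Knife": (np.array([1, 0, 0]),    np.array([0.9, 0.5, 0.4])),
}

# Macro mAP: average the per-class APs, so every class counts equally.
macro_map = np.mean([average_precision_score(y, s) for y, s in results.values()])

# Micro mAP: pool every detection from every class into a single AP computation.
y_all = np.concatenate([y for y, _ in results.values()])
s_all = np.concatenate([s for _, s in results.values()])
micro_map = average_precision_score(y_all, s_all)

print(round(macro_map, 3), round(micro_map, 3))
```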

  • Result:

We train all of our models with 7,200 Positive samples and evaluate them with another 1,800 Positive samples combined with different numbers of Negative samples: 50,000, 100,000, and 150,000 respectively. The graphs above are Micro-Average Precision-Recall curves for the different models on test datasets with different ratios of Positive and Negative samples. The higher the area under a curve, the higher both Precision and Recall across thresholds.

In the top-left image, we test our models with only 1,800 Positive samples and no Negative samples. The areas under the curves are very high for every model, although SSD_Mobilenet_v1 has a somewhat lower area than the others. The remaining three images show the performance of each model when 50,000, 100,000, and 150,000 Negative samples are added to the test dataset. The SSD_Inception_v2 model has the highest area under the curve on every test dataset. Additionally, at every ratio of Positive to Negative samples, the other SSD-based models, SSD_Mobilenet_v1_fpn and SSD_Resnet50, also have higher areas under their curves than the R-FCN and Faster R-CNN architectures, with the exception of SSD_Mobilenet_v1, which has the lowest area under the curve on every test dataset. This implies that model performance depends not only on the detection network but also on the CNN backbone used for feature extraction. We observe that SSD-based detectors with Inception_v2, Mobilenet_v1_fpn, and Resnet50 backbones outperform R-FCN and Faster R-CNN models with similar backbones, whereas the SSD-based model with a simple extraction network like Mobilenet_v1 performs the worst of all our models.

The table above shows the Average Precision (AP) of each model for each category of dangerous item at different Positive-to-Negative ratios of the test dataset, while the last three columns show the Micro mean Average Precision (mAP) over all classes for each model and ratio. A clear observation from this table is that as we include more Negative samples, from 50k to 150k, both AP and mAP decrease correspondingly. For the Gun class, RFCN_Resnet101 performs best, with Faster_RCNN_Resnet50/101 and SSD_Inception_v2 coming very close. For the Knife class, SSD_Inception_v2 has the highest AP and outperforms the other models by a very large margin. For both the Gun and Knife classes, our best models reach an AP of up to 90%. For the Wrench and Pliers classes, Faster_RCNN_Resnet50 and SSD_Mobilenet_v1_fpn respectively have the highest AP, in the 60–80% range. For the Scissors class, SSD_Resnet50 is the best model, with the highest AP only in the 20–40% range. This implies that Scissors may be the most difficult prohibited item to detect; we would therefore recommend that a Machine Learning engineer modify the models or add more data for the Scissors class.

Overall, our project uses Micro mAP to assess the general performance of each model. SSD_Inception_v2 has the highest Micro mAP, which is consistent with our earlier analysis of the Precision-Recall curves.

The line chart above summarizes the last three columns of the previous table using the Micro mAP scores of each model. SSD_Inception_v2 is the best model in our project, followed by SSD_Mobilenet_v1_fpn, while the performance of SSD_Mobilenet_v1 is relatively disappointing compared to all of our other models.

6. Data Product:

This test image shows the performance of our different trained Object Detection models together with a ground-truth image. From the ground truth, we can observe four dangerous items in this baggage image, including two guns and three overlapping knives. Three of the models, SSD_Mobilenet_v1_fpn, SSD_Inception_v2, and SSD_Resnet50, detect only the guns and miss all the knives, while the rest of our models detect both guns and knives. RFCN_Resnet101 and Faster_RCNN_Resnet101 perform best among the models because they detect all four dangerous objects with very high confidence scores, although RFCN_Resnet101 places extra bounding boxes on the prohibited objects.

The second test image is more challenging than the first. It contains three different types of dangerous item, wrench, gun, and knife, with several instances of each in a single image. From the ground-truth image, we can observe three wrenches, two guns, and one knife, randomly scattered and overlapping each other. SSD_Mobilenet_v1_fpn and SSD_Inception_v2 detect the wrenches and guns but miss the knife, whereas the other models, except SSD_Resnet50, detect all three kinds of prohibited object. SSD_Resnet50 detects guns and wrenches with very low confidence scores while missing the knife and a wrench entirely. RFCN_Resnet101, Faster_RCNN_Resnet101, and Faster_RCNN_Resnet50 perform best on this image, since they identify all the dangerous items and locate them with the highest confidence scores.

7. Lessons Learnt:

The three most important lessons we have learned from this project are: how Object Detection models work, why we need Object Detection models, and how to assess their performance.

Usually, we would choose a CNN classifier to solve an image classification problem; in this project, however, plain CNNs failed to both identify and locate the dangerous items in our X-ray dataset. For example, we tried both VGG16 and Resnet50 classification models, but the results were disappointing. To explain this, we did some research on Computer Vision and found that a Classification model alone is not suitable for our problem, which requires both feature extraction and locating multiple objects per image. We therefore implemented a better-suited alternative: Object Detection models.

In addition, we have learned about a variety of Object Detection architectures such as Faster R-CNN, SSD, R-FCN, and FPN; their structures, features, and advantages are explained in detail in the previous sections. To implement the Object Detection models, we used the TensorFlow Object Detection API and set up the training pipeline on Google Cloud Platform, then trained several models and evaluated their performance. In the evaluation part, we learned new concepts for model evaluation, including the Precision-Recall curve, Average Precision (AP), mean Average Precision (mAP), and the Intersection over Union threshold (IoU). We used AP and Micro mAP as our main metrics to evaluate all the trained Object Detection models and selected the one with the best performance.

Future works:

To improve the accuracy of our Object Detection models, we need to add more positive images, especially images containing the Scissors class. Our current dataset is unbalanced both between Positive and Negative samples (8,929 Positive versus 1,050,302 Negative images) and in the number of images containing prohibited objects of each class. In addition, we only used Positive images to train our models, even though Positive images make up less than 1% of the dataset and some of them must be reserved for testing. In the future, we could integrate some Negative samples into our training dataset.

Furthermore, among the five classes of dangerous items, scissors appear to be the most difficult for all of our models to identify. The best-performing model for scissors reaches only 42% AP on the smallest subset of our test dataset. One possible reason is the lack of scissors images: we have only 983 images containing scissors, far fewer than for the other classes.

In addition, in the future we could evaluate the trade-off between speed and accuracy among our models. Some applications require real-time Object Detection, so the models with the highest accuracy but the slowest training and inference may not be appropriate in those situations.

8. Summary:

In this project, our goal is to find the best algorithm that can correctly classify prohibited objects in X-ray images and precisely locate all of them. We used the SIXray dataset, a large-scale dataset of more than one million X-ray images containing varying numbers of dangerous and non-dangerous objects.

Due to the disappointing performance of plain CNN classifiers, we instead implemented Object Detection models to solve this problem. We selected several Object Detection architectures, including SSD, Faster R-CNN, FPN, and R-FCN, with different CNN feature-extraction backbones such as ResNet, Inception, and MobileNet, and successfully trained eight Object Detection models. To find the best-performing model on our unbalanced dataset, we assessed each model and used Micro mean Average Precision to measure its overall performance across the different classes of dangerous items. SSD_Inception_v2 proved to be the most suitable model, with the highest mean Average Precision score in this project. Future work includes improving the performance of our models on certain prohibited items such as Scissors; since Scissors images make up less than 0.1% of the whole dataset, one possible solution is to enlarge the training dataset, including adding more Positive samples.

Additional Figures:

References:

  1. SIXray Dataset: https://github.com/MeioJane/SIXray
  2. SIXray: A Large-scale Security Inspection X-ray Benchmark for Prohibited Item Discovery in Overlapping Images: https://arxiv.org/pdf/1901.00303.pdf
  3. Tensorflow Object Detection API document: https://github.com/tensorflow/models/tree/master/research/object_detection
  4. Tensorflow Object Detection Model zoo: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md
  5. Tensorflow Object Detection Model Config files: https://github.com/tensorflow/models/tree/master/research/object_detection/samples/configs
  6. Feature Pyramid Network: https://medium.com/@jonathan_hui/understanding-feature-pyramid-networks-for-object-detection-fpn-45b227b9106c
  7. COCO Dataset: http://presentations.cocodataset.org/COCO17-Stuff-FAIR.pdf
  8. R-FCN Research paper: https://arxiv.org/abs/1605.06409
  9. R-FCN explanation: https://medium.com/@jonathan_hui/understanding-region-based-fully-convolutional-networks-r-fcn-for-object-detection-828316f07c99
  10. Resnet Research paper: https://arxiv.org/abs/1512.03385
  11. Inception Research paper: https://arxiv.org/pdf/1512.00567v3.pdf
  12. Mobilenet Research paper: https://arxiv.org/abs/1704.04861
  13. Mobilenet explanation: https://towardsdatascience.com/review-mobilenetv1-depthwise-separable-convolution-light-weight-model-a382df364b69
  14. Object Detection metric reviews: https://blog.zenggyu.com/en/post/2018-12-16/an-introduction-to-evaluation-metrics-for-object-detection/
  15. Object Detection metric explanation: https://medium.com/@timothycarlen/understanding-the-map-evaluation-metric-for-object-detection-a07fe6962cf3
  16. Faster R-CNN explanation: https://towardsdatascience.com/review-faster-r-cnn-object-detection-f5685cb30202
  17. Introduction to Object Detection: https://machinelearningmastery.com/object-recognition-with-deep-learning/
  18. Object Detection model reviews: https://cv-tricks.com/object-detection/faster-r-cnn-yolo-ssd/
  19. History of R-CNN, Fast R-CNN, and Faster R-CNN: https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-detection-algorithms-36d53571365e
  20. Should we integrate Negative samples into training dataset: https://stats.stackexchange.com/questions/315748/object-detection-how-to-annotate-negative-samples
  21. Object Detection metrics: https://github.com/rafaelpadilla/Object-Detection-Metrics
  22. Scikit-learn Precision-Recall: https://scikit-learn.org/0.15/auto_examples/plot_precision_recall.html
  23. Mean Average Precision: Micro vs Macro: https://datascience.stackexchange.com/questions/15989/micro-average-vs-macro-average-performance-in-a-multiclass-classification-settin
  24. Mean Average Precision: https://medium.com/@jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173
  25. Benchmark Analysis of Representative Deep Neural Network Architectures: https://arxiv.org/pdf/1810.00736.pdf
  26. Flop Definition: https://www.stat.cmu.edu/~ryantibs/convexopt-S15/scribes/09-num-lin-alg-scribed.pdf
  27. Object Detection speed and accuracy comparison: https://mc.ai/object-detection-speed-and-accuracy-comparison-faster-r-cnn-r-fcn-ssd-and-yolo/
  28. Single Shot Detection(SSD): https://medium.com/@jonathan_hui/what-do-we-learn-from-single-shot-object-detectors-ssd-yolo-fpn-focal-loss-3888677c5f4d
  29. R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in CVPR, 2014.
  30. R. Girshick, “Fast R-CNN,” in ICCV, 2015.
  31. S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in NIPS, 2015, pp. 91–99.
  32. K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, “Mask R-CNN,” in ICCV, 2017.
  33. D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, “Scalable object detection using deep neural networks,” in CVPR, 2014.
  34. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in CVPR, 2016.
  35. J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” arXiv:1612.08242, 2016.
  36. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in ECCV, 2016.
