Neural Architecture Search for Object Detection in Point Cloud

Manuel
Seoul Robotics
Aug 19, 2019

A Summer Internship at Seoul Robotics

I had the opportunity to be part of Seoul Robotics for a six-week summer internship. In the following, I will give an overview of the work I conducted during that time. The aim of my project was to implement Neural Architecture Search for object detection in point cloud data.
Questions and inquiries can be addressed directly to me or Seoul Robotics.

“Everything that can be automated will be automated” — Zuboff’s first law

Neural Architecture Search (NAS) is an emerging field in Deep Learning, and with good reason. Until now, breakthroughs in Deep Learning, such as the ResNet or Inception models, have mostly been driven by manually designed novel network architectures. Finding new well-performing architectures is both error-prone and time-consuming, and architectures are typically proposed based on vague arguments and intuitions. It is hard to imagine how one would come up with an architecture as complex as the Inception Network (see figure 1).

NAS aims to free researchers from this tedious and labor-intensive search. In addition, it promises a more rigorous and scientific approach. Finally, NAS can tap into so-far unexplored architecture spaces and thus has the potential to outperform manually designed networks.

As you can see, NAS seems like one of the obvious next steps in Deep Learning. So far so good, but how do you do it? In the following, I give a short overview of the literature and then introduce the approach I explored during my summer internship.

Figure 1: The Manually Designed Inception Network

State of the Art

A naive approach to finding the best architecture would be to simply search a given search space exhaustively. This is of course not feasible for most search spaces, as they are too large to search efficiently. Nevertheless, this is exactly what Google did in an effort to create a benchmark dataset for NAS. In the following, I give an overview of more subtle methods.

Fundamentally there are three different approaches for the Search Strategy, i.e. Reinforcement Learning (RL), Evolutionary Methods and Gradient-Based methods.
Reinforcement Learning (e.g. Zoph et al., 2018) frames the problem with an agent that aims to find a suitable architecture in order to maximize its reward, which is the network performance.
Evolutionary Methods (e.g. Real et al., 2019) use genetic algorithms: a parent is sampled from a population of networks and offspring models are produced by applying mutations to its architecture, for instance changing connections, operations, or the like.
Gradient-Based methods (e.g. Liu et al., 2018) use continuous relaxation to make the architecture differentiable, so that gradient descent can be applied to optimize it. Instead of fixing a single architecture, this approach uses a convex combination of multiple architectures.
The major downside of Reinforcement Learning and Evolutionary Methods is that they both tend to be computationally expensive, with the search needing as much as 2,000 and 3,150 GPU days, respectively. On the other hand, these methods can be used for multi-objective optimization by adapting the optimized metric accordingly and are therefore more flexible.
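To make the evolutionary flavor concrete, the loop below is a minimal, illustrative sketch of aging-evolution-style search in the spirit of Real et al. (2019): architectures are toy lists of operation names, the fitness function is a stand-in for validation accuracy, and all names (`evolve`, `mutate`) are my own, not from any of the cited papers.

```python
import random

OPS = ["conv3x3", "conv5x5", "skip", "maxpool"]

def mutate(arch):
    """Produce a child by randomly changing one operation of the parent."""
    child = list(arch)
    child[random.randrange(len(child))] = random.choice(OPS)
    return child

def evolve(fitness, n_layers=4, population_size=8, cycles=20, sample_size=3):
    """Aging-evolution-style loop: sample a parent from a random subset,
    mutate it, append the child, and drop the oldest population member."""
    population = [[random.choice(OPS) for _ in range(n_layers)]
                  for _ in range(population_size)]
    best = max(population, key=fitness)
    for _ in range(cycles):
        parent = max(random.sample(population, sample_size), key=fitness)
        child = mutate(parent)
        population.append(child)
        population.pop(0)  # age-based removal of the oldest member
        if fitness(child) > fitness(best):
            best = child
    return best
```

In the real setting, evaluating `fitness` means training a network, which is exactly why these methods cost thousands of GPU days.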

A recent critique of the above approaches was voiced by Li et al. (2019). They achieved the same performance as the leading approaches by randomly sampling network architectures. They therefore argue that the carefully restricted architecture search space that most NAS methods require is the biggest reason for the good performance of recent work in the field.

Figure 2: Point Cloud Data from KITTI Dataset

Neural Architecture Search for Object Detection in Point Cloud Data

Introduction

The goal of my summer internship was to develop Neural Architecture Search for object detection in point cloud data.

In contrast, most work in NAS has been conducted on image classification (e.g. Zoph et al., 2018; Real et al., 2019) and only a handful on object detection (Tan et al., 2018; Liu et al., 2019). To the best of our knowledge, no work has been published on NAS for object detection in point cloud data, which makes it both a challenging and promising problem to solve.

In order to avoid the immense computational cost of Reinforcement Learning and Evolutionary Methods, I aimed to extend a recent gradient-based approach by Liu et al. (2019) dubbed Auto-DeepLab.
As Auto-DeepLab was developed for image segmentation, it does not directly translate to point cloud data. I therefore mixed in some of the manually designed architecture elements and other methods for object detection in point cloud data reported by Yang et al. (2018) in a paper called PIXOR.

Figure 3: Cell Architecture Search Space (Liu et al. 2018)

Methods

The network was trained on Lidar data from the KITTI dataset (see figure 2). The point cloud data is first projected into the bird’s eye view and then voxelized into a discrete grid. Subsequently, the data is downsampled by a factor of 4 with the stem used in PIXOR. The PIXOR stem was chosen because it downsamples much more gradually than the stem of Auto-DeepLab, which should help extract more information from objects that can be very small in point cloud data.
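The bird’s-eye-view projection step can be sketched with plain numpy. The grid extents and the 0.1 m cell size below are illustrative choices, not the exact PIXOR or KITTI configuration, and `bev_occupancy` is a name of my own:

```python
import numpy as np

def bev_occupancy(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0), cell=0.1):
    """Project (N, 3) Lidar points onto a bird's-eye-view occupancy grid.

    Each point's (x, y) coordinates select a grid cell; points outside
    the chosen ranges are discarded. Height (z) is ignored here, though
    real pipelines typically encode it in extra channels."""
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    grid = np.zeros((ny, nx), dtype=np.float32)
    ix = ((points[:, 0] - x_range[0]) / cell).astype(int)
    iy = ((points[:, 1] - y_range[0]) / cell).astype(int)
    valid = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    grid[iy[valid], ix[valid]] = 1.0  # mark occupied cells
    return grid
```

The resulting 2D grid can then be fed to an ordinary convolutional stem, which is what makes the bird’s-eye-view representation attractive for reusing image-based architectures.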
The output of the stem is then fed into the actual architecture search space, which makes use of convex combinations of operations and connections. The search is conducted both for the lower-level network cells (see figure 3) and for the higher-level architecture of the network, i.e. when to spatially down- or upsample (see figure 4). The continuous relaxation is best illustrated by the example of computing the output y of a layer as the convex combination of multiple operations o (convolution, skip-connection, etc.) applied to the input x, i.e. y = Σ_o α_o · o(x) (see figure 3 a)). In the implemented approach, convex combinations are used to combine multiple operations and inputs in the cell space, as well as to combine multiple spatial resolutions as cell inputs.
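A minimal sketch of this mixed operation, with toy lambdas standing in for real convolutions and the α weights obtained by a softmax so they form a convex combination (the function name `mixed_op` is my own):

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax, turning raw alphas into convex weights."""
    e = np.exp(a - a.max())
    return e / e.sum()

def mixed_op(x, alphas, ops):
    """Continuous relaxation of an architectural choice:
    y = sum over ops of softmax(alpha)_o * o(x)."""
    w = softmax(alphas)
    return sum(wi * op(x) for wi, op in zip(w, ops))
```

After the search, the relaxation is discretized by keeping, at each junction, the operation with the largest α.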

Figure 4: Manually Designed Conv-Deconv Network (left) and Network Architecture Space (right) (from Liu et al. 2019)

To decode the network output, different network heads were tested at the last layer, including Atrous Spatial Pyramid Pooling (ASPP), stacked deconvolution layers, and bilinear upsampling combined with convolutional layers.
Finally, focal loss was used in combination with the dense, proposal-free object detection approach from PIXOR.
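The binary focal loss (Lin et al., 2017) used in PIXOR down-weights easy background pixels so the dense detector focuses on hard examples; a small numpy sketch (the default α and γ are the commonly used values, not necessarily the ones used in my experiments):

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    p: predicted foreground probability per pixel, y: 0/1 ground-truth label.
    The (1 - p_t)^gamma factor shrinks the loss of well-classified pixels."""
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(np.clip(p_t, 1e-7, 1.0))
```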
To find the network architecture, gradient descent is used on the parameters of the convex combination of network architectures. Simplified, this strengthens those connections at each junction of the architecture search space that minimize the total loss (as can be seen in figure 3 c)).
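In gradient-based NAS of this kind, weight updates and architecture updates are typically alternated: the network weights w descend the training loss, while the architecture parameters α descend the validation loss. The toy sketch below illustrates only this alternation, with simple quadratic functions standing in for the real losses (the name `architecture_search` and the losses are invented for illustration):

```python
import numpy as np

def architecture_search(steps=500, lr_w=0.1, lr_a=0.05):
    """First-order alternating optimization, DARTS-style:
    w minimizes a stand-in 'training loss', alpha a stand-in
    'validation loss', each holding the other fixed."""
    w, alpha = np.array([2.0]), np.array([2.0])
    for _ in range(steps):
        # training loss L_train(w; alpha) = (w - alpha)^2  -> update w
        w = w - lr_w * 2.0 * (w - alpha)
        # validation loss L_val(alpha; w) = (alpha - 1)^2 + 0.1 * (alpha - w)^2
        alpha = alpha - lr_a * (2.0 * (alpha - 1.0) + 0.2 * (alpha - w))
    return w, alpha
```

In the real search, each α corresponds to a candidate connection or operation, and the hope is that they converge toward a near-one-hot distribution from which a discrete architecture can be read off.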

Limitations

Whereas Reinforcement Learning and Evolutionary Methods tend to take many GPU days until an architecture is found, the continuous-relaxation approach turned out to produce a very large model, because all possible connections and operations at both the cell and the higher architecture level need to be kept in memory simultaneously. Metaphorically speaking, this is like stacking hundreds of normal neural networks into a single giant network and then training that giant network as a whole. I therefore had to compromise on the search space and limit the number of search layers, considered operations, channels, and the batch size to fit the model onto a single Tesla V100 GPU with 16 GB.

Training the network proved difficult because this large model takes a very long time to train, whereas my time in the internship was very limited. Training the model with 12 search-space layers took 10 days, which limited the number of different hyperparameter settings and header networks I was able to try.

Results

As can be seen in figure 5, the network architecture weights lie in the very narrow range from 0.22 to 0.46 during the last epoch of training. This shows that the network architecture did not converge to one distinct path; otherwise, the distribution would be clearly divided into clusters close to 1.0 and 0.0.

Figure 5: Histogram of Network Architecture Weights of Best Model

This clearly shows that the approach was not yet successful. I suspect that the hyperparameter settings, as well as the selected header networks, are the reason for the poor performance. As this is only speculation, it needs to be investigated further. In a similar vein, the validation results in mean average precision (mAP) were not up to par with state-of-the-art models.
Of the various tested header networks, the ASPP header performed best.

UPDATE: On the very last day of my internship, I found a hyperparameter setting that did converge to a clearly more distinct path (see figure 6)! These are preliminary results, i.e. the network had not finished training, but they hint that the approach is going in the right direction and can be tweaked further.

Figure 6: Preliminary Results: Histogram of Network Architecture Weights of Best Model

Conclusion

As you can see, my efforts were not yet successful. I set out to solve a difficult problem and was not able to crack the case just yet. Nevertheless, I learned a great deal from this experience, and I am hopeful that my work will be continued and that a suitable hyperparameter setting or header network will be found.

Outlook

To come back to Zuboff’s first law, “Everything that can be automated will be automated”: hyperparameters are still selected manually, and both the stem and the network header are chosen by hand. In addition, the search space is highly restricted and manually chosen. We can therefore conclude that there is much left to be done in automating Deep Learning. Finally, I believe that Neural Architecture Search will continue to help us uncover novel high-performing architectures. It is an exciting field to be part of, and there is much more left to explore!

The Internship

I highly enjoyed my summer internship at Seoul Robotics. The team is highly skilled and motivated. I was able to feel the momentum the company has. I see a bright future for Seoul Robotics and I am excited to learn about all the team’s upcoming accomplishments and the company’s path.
I am grateful to have had the opportunity to tackle a novel problem in Deep Learning and discuss it with my fellow colleagues.

Further Readings

Interested readers can find a very good overview of the NAS field and its different approaches in the recent survey paper from the University of Freiburg (Elsken et al., 2018). In addition, the university’s Machine Learning Lab maintains an updated and curated list of recent publications in the field.

A Note from Captain (CEO)

The Last Supper with Manuel in Gangnam

We were happy to have Manuel, a bright engineer from ETH Zürich, with us over the summer to apply NAS to Lidar output. Although the result did not converge, it was a very successful investigation of a potential research topic for Seoul Robotics, as we focus on delivering a state-of-the-art 3D perception engine to the industry. Outside of work, we tried to give him the full Seoul experience, including Korean BBQ in Gangnam.


Robotics, Systems and Control Master student at ETH Zurich with interests in Deep Learning, Computer Vision and Robotics. https://www.linkedin.com/in/manuelbre