How a neural network can see more by actually seeing less

Like in life, taking a step back might actually be a good approach for designing neural networks too.

Aaron Baw
Dec 15, 2019

As mentioned in my last post, object detection is a hard thing for computers to do.

That feeling when you dilute too many convolutions. Image by Michael Kemp.

At its core, like with many things, achieving good performance on the problem is about choosing where you want to compromise.

Like adjusting different video game character attributes with sliders, optimising for one aspect of performance typically comes at the cost of others. Want more muscle? That’s probably going to come at the cost of decreased aerobic capacity and agility. You get the picture.

In object detection, scale invariance becomes a key point where we might choose to adjust the sliders. How can we detect what might be the same object at different sizes? Just think about how difficult it is to recognise a face in a photograph when it’s held right up against your own face.

Detecting the same corner at different scales becomes a challenge when your detector’s receptive field is only accustomed to a given size. Corners start looking a lot like edges at larger scales. Image courtesy of OpenCV.

In Computer Vision, therein lies the dilemma between opposing scales — how can detectors learn to identify objects at small scales while still retaining enough of the global context to identify what might be the same object at larger scales? Previously I mentioned some of the state-of-the-art object detectors like YOLO and RCNN. An issue with these single- and multi-stage detectors is their inability to cope with detecting very large or very small objects. I definitely found this to be the case while leveraging YOLOv3 to detect small handwritten symbols (that — or my handwriting isn’t the best).

Traditional approaches in the literature tend to address the problem by diversifying the training set. By representing the same object at different scales and sizes, the hope is that the network will learn the corresponding features at each scale, so that it’s better able to detect the object when presented with an unseen sample at a similar scale.

Traditional approaches use image pyramids — creating many variations of the same image at different sizes and blur levels. The idea is that the network will then learn to recognise the object at whatever size it encounters. Image courtesy of Wikipedia.
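To make the idea concrete, here is a minimal sketch of building such a pyramid, assuming Pillow is available (the filename, blur radius, number of levels, and scale factor are placeholders, not values from any particular paper):

```python
from PIL import Image, ImageFilter

def image_pyramid(img, levels=4, scale=0.5):
    """Build a simple Gaussian pyramid: blur, then downsample, repeatedly."""
    pyramid = [img]
    for _ in range(levels - 1):
        blurred = pyramid[-1].filter(ImageFilter.GaussianBlur(radius=1))
        w, h = blurred.size
        pyramid.append(blurred.resize((max(1, int(w * scale)), max(1, int(h * scale)))))
    return pyramid

# Each level can then be fed to the detector, or used to augment the training set.
levels = image_pyramid(Image.open("sample.jpg"))
print([im.size for im in levels])
```

Every level is yet another copy of the same image that the network has to be trained on, which is exactly where the cost comes from.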

The obvious issue here is training. By creating many different variations of the same object at different scales and blur levels, our dataset quickly explodes. At the same time, we aren’t really tackling the inherent architectural weakness. So instead of addressing this at the level of the training data, what if we could instead change how the network ‘sees’ each sample?

Enter: TridentNet

TridentNet is a novel approach to object detection which takes a step back (quite literally) from the problem. Instead of requiring many variations of the same object at different scales, it leverages a novel parallel-branch architecture where each branch interprets the image differently — almost as if the network were looking at the same image from up close and from far away at the same time.

The team leveraged a novel insight that the size of the receptive field for each convolution kernel played a key role in the specific scale at which objects could be recognised.

You can think of a ‘receptive field’ as the size of a kernel’s awareness, or visual field. In the example above, the kernel has a receptive field of 3x3 pixels, operating over the darker-blue plane. Image courtesy of dennlinger on Stackoverflow.

This makes sense intuitively. The more of an image you’re taking in at once, the more context you can appreciate and use to identify large objects. Along the same thread, a smaller receptive field helps focus on details and allows for better detection of smaller objects — think applying a magnifying glass to a newspaper.

Changing the receptive field.

So how can we change the size of a kernel’s receptive field? Dilated Convolutions. Dilated Convolutions allow for sparser sampling of data points while retaining the same stride and number of parameters. In effect, we can take an existing kernel, insert zeros in between each weight, and enlarge its receptive field without adding any new parameters to worry about.

Zero-padding a convolution kernel enlarges its receptive field without adding any new parameters. Image courtesy of me + ☕️ + ♥️.
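Here is a minimal sketch of the difference, assuming PyTorch: a 3x3 kernel with a dilation rate of 2 covers a 5x5 area of the input, yet still has only nine weights.

```python
import torch
import torch.nn as nn

# A standard 3x3 convolution: receptive field of 3x3.
standard = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1)

# The same 3x3 kernel with dilation=2: zeros are (conceptually) inserted between
# the weights, so it now covers a 5x5 area but still learns only 3x3 = 9 weights.
dilated = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, 1, 32, 32)
print(standard(x).shape, dilated(x).shape)  # both torch.Size([1, 1, 32, 32])

# Same parameter count (9 weights + 1 bias), larger receptive field.
print(sum(p.numel() for p in standard.parameters()))  # 10
print(sum(p.numel() for p in dilated.parameters()))   # 10
```

In general, a kernel of size k with dilation rate d covers d·(k − 1) + 1 pixels per side.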

By enlarging the receptive field, we get a similar effect to training the network at a different scale — without actually changing the training data. It’s quite literally like telling the network to “take a step back” to better appreciate the global image context.

So we know we can change the receptive field to better learn features at different scales — but we still need several different receptive fields at once. How do we proceed? This:

Multiple parallel branches used in TridentNet. From top to bottom we can see branches dedicated to detecting objects of small, medium, and large scale respectively, each with an increasing receptive field size. Image courtesy of Yanghao Li et al.

The team created an architecture where the network had three parallel branches. Each branch used kernels with a different dilation rate (and therefore a different receptive field size), specialising it for detecting objects at a particular scale.

So at this point you may be thinking — “Hang on a second, instead of replicating the training data, haven’t we just replicated the architecture itself? Isn’t that still really costly?” You would be mostly correct in thinking so! The key difference here is that each branch shares its weights with all the other branches. What that means is that during training, every object seen at a particular scale updates the weights for all branches — the only difference between branches being the size of the receptive field.

By leveraging this approach, the network can learn to identify an object at different scales by only seeing a single sample, but interpreting it at different scales. Almost like moving a photograph back and forth in front of you to better appreciate how you might recognise someone! Fascinating.
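As a toy sketch of the idea, assuming PyTorch (this is not the paper’s actual implementation, where the trident blocks sit inside a ResNet backbone), the three branches can literally reuse one weight tensor and differ only in their dilation rate:

```python
import torch
import torch.nn.functional as F

# One shared set of 3x3 weights (plus bias), reused by every branch.
weight = torch.nn.Parameter(torch.randn(16, 16, 3, 3))
bias = torch.nn.Parameter(torch.zeros(16))

def trident_block(x):
    # Three branches with identical weights but different dilation rates:
    # small, medium, and large receptive fields respectively.
    return [
        F.conv2d(x, weight, bias, padding=d, dilation=d)
        for d in (1, 2, 3)
    ]

features = trident_block(torch.randn(1, 16, 64, 64))
print([f.shape for f in features])  # three maps, all torch.Size([1, 16, 64, 64])
```

A gradient flowing through any one branch updates the same `weight` tensor, which is exactly why a sample seen at one scale benefits all three branches.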

Challenges during training.

It’s not quite all that simple, however. Or rather, nothing ever is simple in Machine Learning. One major challenge with training TridentNet this way is that the branch dedicated to detecting small objects, for example, would still have to learn to detect the larger objects — because it’s being fed the same data.

To address this, the training targets were filtered per branch, so that each branch only learned from objects within its own size range: small objects for the small-scale branch, larger objects for the larger-scale branches, and so on — while still sharing weights across all branches.
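A rough sketch of what that per-branch filtering could look like (the size thresholds below are illustrative placeholders, not the values used in the paper):

```python
import math

# Illustrative "valid" object-size ranges per branch, measured as sqrt(box area)
# in pixels. The actual thresholds used by TridentNet differ from these.
VALID_RANGES = [(0, 90), (30, 160), (90, math.inf)]  # small, medium, large branch

def boxes_for_branch(boxes, branch_idx):
    """Keep only the ground-truth boxes whose size falls inside the branch's range."""
    lo, hi = VALID_RANGES[branch_idx]
    kept = []
    for (x1, y1, x2, y2) in boxes:
        size = math.sqrt((x2 - x1) * (y2 - y1))
        if lo <= size <= hi:
            kept.append((x1, y1, x2, y2))
    return kept

# Example: a small box and a large box from the same image.
boxes = [(10, 10, 40, 40), (0, 0, 300, 300)]
for b in range(3):
    print(f"branch {b}:", boxes_for_branch(boxes, b))
```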

As you can imagine, running a prediction through three parallel branches adds computation at inference time, but the team found that the architecture performed well even when using only a single branch (the middle one) at test time — giving better performance than baseline models without any added computation, thanks to that weight sharing. Sharing really is caring after all.
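Continuing the toy sketch above (reusing `weight`, `bias`, and `F.conv2d` from the earlier block), that single-branch approximation at inference time is as simple as:

```python
def trident_fast(x):
    # Run only the middle branch (dilation=2) with the same shared weights,
    # skipping the small- and large-scale branches entirely.
    return F.conv2d(x, weight, bias, padding=2, dilation=2)
```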

Like in many cases in life, by taking a step back, the team was able to achieve a resounding breakthrough. If you’re interested, go and read the full paper!

While not always the case, I find it fascinating when advancements in Machine Learning seem to mimic or take inspiration from biology. My guess is that this approach more closely mimics our own ability to perceive and recognise objects at different scales — since we aren’t quite seeing many variations of the same object. We just see it once and kind of, well, know how to recognise it.
