Challenges in Small Object Detection (e.g., Drones) and a Cascaded Classifier Approach

Sinan Onur ALTINUÇ
Published in Codable
Jan 26, 2023

This post is mostly about drone detection, but the main point is the approach we can take in computer vision problems, especially when we are dealing with:

  • Small objects that are hard to detect or discriminate
  • A system that is expected to run with modest hardware
  • Where the data is limited and/or available in different formats or modalities

Spoiler Alert: I will be proposing using synthetic data and taking a cascaded approach.

Stay tuned to see how and why.

About drone detection

In a world with growing demand for and easy access to drones, cameras and computer vision offer a passive (not emitting any signals), cost-effective, but challenging way to detect drones and help maintain security, safety, and privacy. In this article I will discuss the challenges of drone detection, such as working with small objects on complex backgrounds, and ways to overcome them: using synthetic data, exploiting temporal information about the object, and building a cascaded system to compensate for individual weaknesses.

Photo by Alessio Soggetti on Unsplash

Who am I?

I am an AI enthusiast and professional, currently the Head of AI at OBSS Technologies. Here at OBSS AI we mainly focus on solving problems with strong real-world use cases, and we have long been interested in the problem of drone detection. We are the winners of the WOSDETC Drone vs Bird Challenge two years in a row (2021 and 2022).

We also publish open source AI libraries on the OBSS GitHub page, covering vision, NLP, biomedical signal processing, and more. The most popular of these is SAHI, which you can use to improve your model's performance on small objects without retraining. Check out Fatih Cagatay Akyon's blog post about the library, especially if you are interested in detecting small objects.

Why does drone detection matter?

As drone technologies get cheaper and easier to access, they start to pose risks to security, public safety, and privacy. They can easily be used to cause physical harm, spy on targets, or conduct espionage. Even without any ill intent, they may create safety issues, such as interfering with aircraft flight, or invade your privacy on your own property.

Although there are many regulations about how and where you can fly drones, there can be no sanctions if the event goes unnoticed or is noticed too late.

From a technological perspective there is a variety of solutions for detecting drones. RF listening solutions can identify the radio signals that drones emit (and in some cases receive), but this technology cannot detect drones designed to fly autonomously without any RF transmission. Radar systems are another popular solution, but they usually need to emit and receive RF signals. This makes them both expensive and difficult to place due to the regulations governing the signals they emit.

Passive detection systems, mainly cameras, fill an important gap in this regard. Using daylight or IR light, they can be deployed almost anywhere at lower cost than the alternatives. But this solution comes with a price of its own: it is a challenging problem for computer vision systems to detect drones from a distance, especially when the background is complex and environmental conditions vary.

Why is it hard?

We have seen great success in a variety of domains in computer vision. We have models capable of classifying images or even generating them from scratch. As architectures and datasets grew, models became able to pick up more important information from images in a more nuanced way. We have (for the most part) left behind the days when models comically confused animals with each other. Models can now exploit more abstract, higher-level features of images, and their visual understanding is much better.

Photo by Dion Tavenier on Unsplash

But what happens when the object is very small in the visual field (since we want to detect it from as far away as possible) and does not contain much information by itself? Detection then becomes less about the object and more about its relation to the environment and to changes in the image. When you are trying to detect and track an object consisting of a handful of pixels, you are more prone to both false positives and false negatives, because the detected patches themselves carry very little distinguishing information.

In summary the main factors that make drone detection challenging are as follows:

  • They are small objects in the visual field
  • Complex backgrounds and varying environmental conditions
  • Limited data covering the various scenarios
  • Detection has to run efficiently in real time

How can we do it?

For some vision problems human vision is still superior to machine vision. Drone detection (mostly) falls into such a category. But it is a tedious task and humans might get tired and make mistakes over time. When we analyze how humans detect drones, we find that a few key factors stand out.

When only a single frame is available, humans actually perform worse than machine learning models in complex environments. When a small object is hidden in a complex background, it is hard for our vision system to recognize it. What we actually rely on is how the image changes over time. Incorporating temporal information is crucial for effective drone detection.
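
The article does not prescribe a specific method, but the simplest way to exploit change over time is frame differencing. The sketch below is a minimal illustration under assumed conditions (a static camera, grayscale frames as 2D lists of 0–255 intensities); the threshold value is an arbitrary choice for the example.

```python
# Hedged sketch: surface a small moving object by differencing two frames.
# Assumes a static camera and grayscale frames as 2D lists of intensities.

def moving_pixels(prev_frame, curr_frame, threshold=25):
    """Return (row, col) positions whose intensity changed by more than `threshold`."""
    changed = []
    for r, (prev_row, curr_row) in enumerate(zip(prev_frame, curr_frame)):
        for c, (p, q) in enumerate(zip(prev_row, curr_row)):
            if abs(p - q) > threshold:
                changed.append((r, c))
    return changed

# A 4x4 background with one bright pixel that moves between frames:
frame_a = [[10, 10, 10, 10],
           [10, 200, 10, 10],
           [10, 10, 10, 10],
           [10, 10, 10, 10]]
frame_b = [[10, 10, 10, 10],
           [10, 10, 200, 10],
           [10, 10, 10, 10],
           [10, 10, 10, 10]]

print(moving_pixels(frame_a, frame_b))  # [(1, 1), (1, 2)]: old and new positions
```

A two-pixel object that is invisible against the background in a single frame becomes obvious the moment it moves, which is exactly the effect the human visual system exploits.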

We as humans are able to use a lot of contextualized information to generalize the problem. It is not very easy to do with vision systems, but there are several things we can do about it.

Synthetic data

Machine learning models struggle to predict how a drone would appear in environments they have not seen. If we can add those kinds of situations to the training data, it helps. On the other hand, obtaining labeled data for such scenarios is exhausting. What we can do is use synthetic data to generate images of various environments and scenarios. The good part is that labels come for free, since you decide (and therefore know) where the object is. The bad part is the so-called domain gap between synthetic and real images: if you rely on them too much (in our experience), it can actually hurt real-world performance. Used strategically, however, synthetic data helps improve generalization and produces more robust models.

If you do not have enough data for a specific environment, such as a cityscape unlike anything in your dataset, recreating a similar environment, or taking existing pictures of it and adding drones synthetically, improves performance.
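
As a toy illustration of why the labels come for free, here is a minimal compositing sketch. It is an assumption-laden stand-in (flat grayscale "sky", a tiny dark patch as the "drone"); a real pipeline would blend rendered drone models into photographs, but the principle is the same: you place the object, so you already know the bounding box.

```python
# Hedged sketch: paste a small "drone" patch onto a background image and
# get the bounding-box label for free. Images are 2D lists of grayscale pixels.

import random

def composite(background, patch, seed=None):
    """Paste `patch` at a random valid position; return (image, bbox).

    bbox is (x_min, y_min, x_max, y_max) in pixel coordinates.
    """
    rng = random.Random(seed)
    h, w = len(background), len(background[0])
    ph, pw = len(patch), len(patch[0])
    y = rng.randint(0, h - ph)
    x = rng.randint(0, w - pw)
    image = [row[:] for row in background]  # copy so the background is reusable
    for dy in range(ph):
        for dx in range(pw):
            image[y + dy][x + dx] = patch[dy][dx]
    return image, (x, y, x + pw, y + ph)

sky = [[135] * 16 for _ in range(16)]  # a flat "sky" background
drone = [[30, 30], [30, 30]]           # a tiny dark patch standing in for a drone
image, bbox = composite(sky, drone, seed=0)
print("label:", bbox)  # the label is known exactly: we chose the position
```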

Cascaded system

A very common approach is to use "a machine learning model" as the solution. But this approach is limited by the capabilities of that one model and the amount of data you have. Most of the time, the reality is that data is not plentiful and/or there are constraints on what kinds of models you can use.

When it comes to drone detection, the problem is really about videos rather than pictures. But labeled videos are much harder and more costly to obtain, and getting the performance you want on the video modality is harder. Image-based detection models are a different story: there are very efficient object detection algorithms that work on images, and it's much easier to increase the variety and amount of image data. Each frame of a video also counts as an image.

But you would still want to use temporal information to classify drones accurately, because a single frame contains very limited information. You gain a lot more when you track objects over time.

Tracking system

If you are using an image-based object detector, it is essential to pair it with a good tracking algorithm that follows each object across time. Apart from providing continuity, tracking gives you temporal information about the object that can be used to improve classification:

  • How the predictions change over time
  • How fast the object moves
  • What kind of trajectory it has
  • Whether detections really have continuity or are only momentary
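
The features above can be turned into a simple filter over tracks. The sketch below is illustrative only: it assumes a track is a list of (frame_index, x, y, confidence) tuples from some tracker, and the thresholds are made-up values, not the system's actual parameters.

```python
# Hedged sketch: compute simple temporal features for a track and reject
# momentary or low-confidence tracks. Track format and thresholds are
# assumptions for illustration.

def track_features(track):
    """track: list of (frame_index, x, y, confidence) tuples."""
    frames = [t[0] for t in track]
    xs = [t[1] for t in track]
    ys = [t[2] for t in track]
    confs = [t[3] for t in track]
    length = frames[-1] - frames[0] + 1
    # average displacement per step as a crude speed estimate
    dist = sum(((xs[i + 1] - xs[i]) ** 2 + (ys[i + 1] - ys[i]) ** 2) ** 0.5
               for i in range(len(track) - 1))
    speed = dist / max(len(track) - 1, 1)
    return {"length": length, "speed": speed, "mean_conf": sum(confs) / len(confs)}

def keep_track(track, min_length=5, min_conf=0.3):
    """Reject momentary detections and low-confidence tracks."""
    f = track_features(track)
    return f["length"] >= min_length and f["mean_conf"] >= min_conf

blip = [(0, 10, 10, 0.9)]                                # a one-frame detection
steady = [(i, 10 + 2 * i, 10 + i, 0.6) for i in range(8)]  # a consistent track
print(keep_track(blip), keep_track(steady))  # False True
```

A single-frame blip is filtered out regardless of its detector confidence, while a consistent track survives: exactly the "continuity vs. momentary detection" distinction in the list above.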

This is already a decent solution, but it still has some limitations:

  • We still have false positives and false negatives (especially many false positives)
  • We are still not making full use of temporal information (like birds flapping their wings)

Combining the strengths of different models

Using different models together, a.k.a. ensemble learning, is a widely used and useful method. The most common approaches are perhaps bagging and boosting, but those usually rely on a large number of "weak learners" that are easy to train and cheap to run. The image modality is not well suited to weak learners, because detected objects are usually the result of higher-level perception. Stacking is another ensemble method that can improve accuracy, but it suffers the same fate. None of these is feasible if you want to run the system in real time on modest computing resources.

Cascading is a lesser-known ensemble learning technique, usually used for making accurate decisions efficiently. Again you use many models, but this time they do not run in parallel; instead, each uses the others' outputs to make a decision.

In the figure you can see an example of a cascaded system for detecting pedestrians. You can use several different models to single out pedestrians and increase accuracy. But keep in mind that this does not have to be applied strictly; you can build whatever graph you want. You are not limited to eliminating and filtering: you can also create parallel flows and combine them when necessary.
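
To make the idea concrete, here is a minimal sketch of a cascade as a chain of cheap stages, each filtering or re-scoring the candidates left by the previous one. The stage names, score fields, and thresholds are all invented for illustration; a real system would plug in actual detector, tracker, and classifier outputs.

```python
# Hedged sketch of a cascade: each stage filters or re-scores the survivors
# of the previous stage. All names and thresholds are illustrative.

def detector_stage(candidates):
    # keep anything the (hypothetical) image detector scored above a low bar
    return [c for c in candidates if c["det_score"] > 0.2]

def track_stage(candidates):
    # drop momentary detections with no temporal continuity
    return [c for c in candidates if c["track_len"] >= 5]

def classifier_stage(candidates):
    # a small patch classifier re-scores survivors; combine both scores
    for c in candidates:
        c["final_score"] = 0.5 * c["det_score"] + 0.5 * c["cls_score"]
    return [c for c in candidates if c["final_score"] > 0.5]

def run_cascade(candidates, stages=(detector_stage, track_stage, classifier_stage)):
    for stage in stages:
        candidates = stage(candidates)
    return candidates

candidates = [
    {"id": "bird",  "det_score": 0.6, "track_len": 9,  "cls_score": 0.1},
    {"id": "drone", "det_score": 0.7, "track_len": 12, "cls_score": 0.8},
    {"id": "noise", "det_score": 0.3, "track_len": 1,  "cls_score": 0.9},
]
print([c["id"] for c in run_cascade(candidates)])  # ['drone']
```

Each stage is tiny and runs only on what earlier stages let through, which is what keeps the whole chain cheap; the noise candidate never even reaches the patch classifier.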

What makes this approach feasible for this problem is that only the first step needs to process the full, very large image. The following steps can work on much smaller data, such as:

  • Only patches of the image
  • Track information
  • Confidence of different models

And the beauty of this approach is:

  • You can have very small models with a higher impact
  • You can train the models separately and create data for them separately
  • You can attack specific problems without touching existing models

This approach works especially well when you have many false positives that need filtering. A potential problem with cascaded classifiers, however, is that the false negatives of the first classifier set a hard limit on the whole system. But as mentioned earlier, if you do not apply the filtering too strictly, you can add supporting paths that recover predictions the first model missed.

Conclusion

Dealing with real-world scenarios and hardware constraints is sometimes very different from chasing performance on some dataset, because regardless of the metrics, you are responsible for the mistakes your models make. Good metrics do not always mean preferable results; problems with little impact on metrics can still be serious. You are responsible for both the models and the data, you have to mind development and modeling costs, and you have to find intelligent ways to overcome problems. Most of all, you have to be adaptable and versatile in building a solution.

In my experience, two things provided a great deal of versatility in dealing with these kinds of scenarios:

  1. Using synthetic data mindfully to really improve where you are lacking
  2. Having a cascaded classifier with many different small models in the system. (And using the temporal data in these steps.)

But this cascaded approach does not need to eliminate candidates strictly at each step. Think of it as a prediction graph of models, where you can eliminate detections at some stages and combine and enrich the models' predictions at others.

This allows you to experiment creatively with many small models, for which you can train or find data (mostly) independently.
