Power Line Corridor Detection (a YOLO application)

Roger Fong
Published in Picterra
May 23, 2018

Convolutional Neural Networks (CNNs) have proved to be a powerful ally in the world of computer vision and can be applied to a great many tasks. They have revolutionized the field of image understanding and have opened up possibilities for advanced technologies that only a decade ago would have seemed like science fiction. Self-driving cars and augmented reality are both examples of technologies that have been made plausible thanks to CNNs. A CNN's performance is heavily dependent on the data you give it and how you choose to train it. In this post we will be talking about the task of object detection and how adding more class annotations to your data can increase the performance of your detector. Let's begin with some context.

CNNs and Computer Vision: CNNs are not perfect. Indeed, for technologies such as self-driving cars it is important to have near-perfect accuracy in detection and recognition tasks, and the state of the art is simply not there yet. This is because, while CNNs are currently the best tool we have, arbitrary object recognition and comprehensive scene understanding is an extremely difficult problem, one that takes us humans years of practice to master ourselves. Given a much more specific task, however, CNNs excel. There are three main vision tasks that are actively being researched and benchmarked: image classification, segmentation and object detection.

In image classification you are given an image and the goal is to determine which object category it belongs to (cat, dog, car, plane, etc.). The idea here is that each image is focused on just one class. There won't be an image with both a cat and a dog, or if there is, the desired category would instead be something broader like “pets”.

Image from [2]

In segmentation you are given an image and run classification on each individual pixel, producing a dense classification map. To represent the output we usually create an image where each pixel is colored according to its predicted class.

Image from [3]
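As a tiny illustration of what such an output looks like, a segmentation network produces a map of class ids (one per pixel), which we can then color with a palette. This is just a sketch; the class count and colors below are made up:

```python
import numpy as np

# Hypothetical per-pixel class map, as a segmentation network would produce.
class_map = np.random.randint(0, 3, size=(4, 4))

# One RGB color per class; numpy indexing colors the whole map at once.
palette = np.array([[255, 0, 0], [0, 255, 0], [0, 0, 255]], dtype=np.uint8)
colored = palette[class_map]  # shape (4, 4, 3)
```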

In this post we will be focusing on object detection. The goal of object detection is to find a tight-fitting bounding box around each object of interest. Object detection is usually coupled with classification, where the class of each discovered bounding box is also learned.

Image from [1]

We will be looking at one problem in particular: the detection of power pylon structures from aerial imagery, a problem that we at Picterra were recently asked to solve.

The problem: We are given about a thousand images taken along power line corridors and asked to produce a detector that finds power pylon structures. The images look like this:

Sample images over the power line corridors. Images courtesy of AerotecUSA (http://www.aerotecusa.xyz/).

What we notice immediately about this problem is that the pylons really stand out in our images: the backgrounds are very homogeneous and the structure of the pylons is very distinct. Our dataset also contains 3 distinct types of pylon structures (as seen above). From a computer vision standpoint this seems like it would be a pretty easy problem to solve. Let's see if that is the case!

What can we do without neural networks?

Given the simplicity of the pylons and of the backgrounds we see them in, we could first try some more traditional computer vision techniques. We can find edges with a Canny Edge Detector:

They don't look great, but you can kind of make out where the pylons are. We could then use a Hough transform to look for straight line segments; if we find enough segments criss-crossing each other in the same region, it's probably a pylon. Alternatively, we could do some kind of template matching on the different types of pylons. But there are various considerations to take into account: the objects are not always seen at the same scale, shadows cast by the sun can throw off our results, and other objects crossing under the power lines could be confused with pylons. As with many traditional computer vision methods, the output would probably be very sensitive to context and would require a lot of parameter tuning, though given the nature of the problem it might do the trick. Admittedly, to get those not-so-great images above I had to tune the Canny Edge Detector's threshold parameters for each image, which is not ideal. We did not investigate this approach further and went straight for neural networks, so let's talk about that now.
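To make this concrete, here is a minimal sketch of such an edge-plus-lines pipeline using OpenCV; the file name and all thresholds are hypothetical and, as noted above, would likely need re-tuning per image:

```python
import cv2
import numpy as np

# Load one corridor image in grayscale (file name is hypothetical).
gray = cv2.imread("corridor.jpg", cv2.IMREAD_GRAYSCALE)

# Canny edge detection; the two hysteresis thresholds are the knobs
# that had to be re-tuned for each image.
edges = cv2.Canny(gray, 50, 150)

# Probabilistic Hough transform: find straight line segments in the edge map.
lines = cv2.HoughLinesP(
    edges,
    rho=1,                # distance resolution in pixels
    theta=np.pi / 180,    # angular resolution in radians
    threshold=80,         # minimum votes for a segment
    minLineLength=30,
    maxLineGap=5,
)

# A region dense with criss-crossing segments would then be a pylon candidate.
```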

What can we do with neural networks?

A CNN is specifically designed to learn and use features of interest in an image. Indeed, by visualizing the filters learned in a CNN used for image-based tasks, we can see that the initial layers learn filters corresponding to basic image processing operations like edge detection, while subsequent layers learn how to transform these features into more conceptual representations. Long story short, we can use a CNN to learn about the edge structure of our pylons implicitly via supervised learning, rather than trying to determine it explicitly using the aforementioned traditional computer vision approaches.

There are a great many architectures and network models for accomplishing this task, but one of the most recent and well-known ones is the You Only Look Once (YOLO) model [1]. YOLO belongs to a class of models called one-shot models, in which a single neural network is responsible for the complete task of object detection: the input is an image, and the outputs are the bounding boxes and class of each object in the scene.

The YOLO detection model, figure from [1]

This is in contrast to architectures that divide the process into two stages, the first of which generates a large number of generic object proposals and the second of which classifies and refines those proposals (Faster R-CNN being a well-known example). Both families are powerful and perform well on object detection benchmarks, though one-shot methods have the added benefit of being very fast (assuming you're using a decent GPU). Speed isn't a requirement in this task, but it's certainly nice to have.
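For a feel of what running such a model looks like in practice, here is a hedged sketch of YOLO inference using OpenCV's dnn module. The file names, thresholds and the 416×416 input size are assumptions, not the exact pipeline we used:

```python
import cv2
import numpy as np

# Load darknet-format config and weights (file names are hypothetical).
net = cv2.dnn.readNetFromDarknet("yolo-pylons.cfg", "yolo-pylons.weights")

image = cv2.imread("corridor.jpg")
h, w = image.shape[:2]

# YOLO takes a fixed-size, normalized RGB blob as input.
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

boxes, scores, class_ids = [], [], []
for output in outputs:
    for det in output:
        # Each row: [cx, cy, bw, bh, objectness, per-class scores...]
        class_scores = det[5:]
        class_id = int(np.argmax(class_scores))
        score = float(det[4] * class_scores[class_id])
        if score > 0.5:
            cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            scores.append(score)
            class_ids.append(class_id)

# Non-maximum suppression removes duplicate, overlapping boxes.
keep = cv2.dnn.NMSBoxes(boxes, scores, score_threshold=0.5, nms_threshold=0.4)
```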

Single-class experiment:

In our experiments with YOLO, we split our dataset into a 70% / 30% train / test split and annotate our training set with bounding boxes. Our initial inclination was to group all pylons into a single class and train YOLO on those.
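The split itself is simple bookkeeping; here is a minimal sketch of how it could be done (the directory name and random seed are hypothetical):

```python
import glob
import random

# Gather all corridor images (path is hypothetical) and shuffle reproducibly.
random.seed(0)
images = sorted(glob.glob("corridor_images/*.jpg"))
random.shuffle(images)

# 70% of the images go to training, the remaining 30% to testing.
split = int(0.7 * len(images))
train_images, test_images = images[:split], images[split:]
```

Seems simple enough. Here are the results on the same 3 images as before (from our test set):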

Predicted bounding boxes over pylons from the trained YOLO model. Images courtesy of AerotecUSA (http://www.aerotecusa.xyz/).

Seems like it’s working. But wait! What is this?

Ambiguous or false bounding boxes predicted. Images courtesy of AerotecUSA (http://www.aerotecusa.xyz/).

And this?

Ambiguous or false bounding boxes predicted. Images courtesy of AerotecUSA (http://www.aerotecusa.xyz/).

Okay, so from the first set of images it looks like we sometimes detect trees or scattered piles of logs as power pylons. One hypothesis is that our network is only learning the same criss-crossing edge structure we would have looked for with a traditional vision method, and thus it suffers from some of the same pitfalls mentioned above. From the second set, it seems the detector has some kind of bias towards the positions of pylons. Indeed, in our annotations many cropped pylons can be found near the top and bottom of the image. The network may have learned that near the edge of the image fewer edges are required to count as a pylon, which would make the detector even more sensitive to lines created by a few instances of trees and shrubbery near the top and bottom edges of the image. At this point things are getting a bit hand-wavy, so let's move on from conjecture and try something new.

So how can we solve this? How can we get the network to learn the structure and shape of the pylons more specifically, instead of relying on information like general “criss-crossy-ness” and image position? The surprising answer: make the problem more complex.

Multi-class experiment:

A large CNN like YOLO has the potential to learn very complex detection problems, but it needs guidance. You have to explicitly give it a complex problem to learn in the first place, along with the corresponding data. What is the key advantage of doing this?

When training with just a single class, the CNN doesn't really know that it's trying to learn the structure of power pylons, only that there are a bunch of lines that stand out against the background; it doesn't need to learn the organization of those lines. However, if we separate our “pylon” class into three different classes (the 3 types of pylons) and reformulate our problem as detecting and correctly classifying these pylon types, the network is forced to learn the structural features of the pylons, since it must do so to distinguish between different groupings of lines.
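In darknet-style YOLO annotations, each object is one line of the form "class_id cx cy w h" with coordinates normalized to [0, 1], so the reformulation amounts to re-annotating the class ids and updating the network's class count (in darknet-style configs, the final layer's filter count also depends on the number of classes). Here is a hypothetical sketch of the relabeling step, assuming a per-image record of pylon type and, for brevity, one pylon type per image:

```python
from pathlib import Path

# Hypothetical mapping from image name to pylon type, produced during
# re-annotation: 0 = two-legged, 1 = three-legged, 2 = single-leg pylon.
pylon_type = {"img_0001": 0, "img_0002": 2}  # ...and so on

for label_file in Path("labels").glob("*.txt"):
    type_id = pylon_type[label_file.stem]
    new_lines = []
    for line in label_file.read_text().splitlines():
        # The old single-class id is discarded; box coordinates stay unchanged.
        _, cx, cy, w, h = line.split()
        new_lines.append(f"{type_id} {cx} {cy} {w} {h}")
    label_file.write_text("\n".join(new_lines))
```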

If our CNN can learn the organization of these lines, it will learn more than just that we are looking for groupings of lines. It will learn that one pylon structure has two legs with a cross beam in between, one has three legs standing side by side, and one has just a single leg with a pentagonal-looking structure on top. Here are the same problem images from the single-class case, now run through our multi-class detector.

Predicted bounding boxes with multiple classes of pylons. Images courtesy of AerotecUSA (http://www.aerotecusa.xyz/).

Nice! In fact the method works so well that when I manually scrolled through all 300 test images, I couldn't find a single unreasonable detection. I hesitate to call that 100% accuracy, but it's certainly pretty darn great, and it even handles some pretty tough edge cases. In addition, we now get the added benefit of knowing what type of pylon we're detecting. Images are nice, so here's a bunch more:

Predicted bounding boxes with multiple classes of pylons. Images courtesy of AerotecUSA (http://www.aerotecusa.xyz/).

Conclusions

In this post we have demonstrated how CNNs can be applied to the problem of pylon detection in oblique aerial imagery. This problem was relatively easy to handle, but that is not to say that CNNs should not be used for much more challenging ones. Certainly for challenges such as ImageNet, VOC and MS COCO, where scenes are highly varied and there are significantly more classes to identify, CNNs unquestionably outperform traditional computer vision methods by huge margins. Here at Picterra we are interested in providing our users with accurate detection models for aerial imagery across a wide variety of objects and scenes. While this is a very tough problem, we are confident that CNN-based methods will be key to our solutions. But more on this in a future blog post!

Thanks for reading and don’t hesitate to get in contact with us at https://www.picterra.ch/!

[1] Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You Only Look Once: Unified, Real-Time Object Detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/cvpr.2016.91

[2] https://medium.com/@tifa2up/image-classification-using-deep-neural-networks-a-beginner-friendly-approach-using-tensorflow-94b0a090ccd4

[3] https://blog.deepsense.ai/deep-learning-for-satellite-imagery-via-image-segmentation/
