Ecological Protection through Object Detection at Wind Farms

Will Seaton
Institute for Applied Computational Science
Dec 28, 2020 · 11 min read

This article was produced as part of the final project for Harvard University’s AC295 Fall 2020 course within the Institute for Applied Computational Science.

Authors: Johannes Kolberg, Nikhil Vanderklaauw, Will Seaton, Hardik Gupta


Electricity generation from onshore wind is a critical contributor to meeting international sustainability goals, but the recent rapid growth in wind farms has damaged local bird habitats and increased avian mortality through blade collisions or environmental degradation. Current mitigation strategies involve lengthy impact assessments prior to construction. Through continuous, rapid, and accurate monitoring, object detection methods can increase the speed and accuracy of impact assessments and allow autonomous slowdowns of turbines to reduce risks to passing birds.

We demonstrate how an object detection pipeline using the “You Only Look Once” (YOLO) model architecture can achieve actionable and useful predictive accuracy on large-scale image data, as might be collected around wind turbines.

To investigate and illustrate the problem space, we perform both high-level separation (between birds and aircraft) and more fine-grained classification (of specific bird species or types of aircraft).

Context

Technological improvements have reduced the cost of onshore wind electricity by 70% over the ten-year period ending in 2019, currently averaging $41 per megawatt hour (MWh), making it the second cheapest source of electricity.¹ This drop in price corresponds with a growth in wind power generators, surpassing 100 gigawatts of total capacity in the U.S. in 2020, representing 182% growth over the prior ten-year period.² The wind power industry currently provides the United States with 7% of its total electricity production, employs 120,000 Americans, and represented 39% of spending on new utility projects in 2019.³ New construction must respect environmental and ecological regulations that require an upfront impact assessment paired with ongoing monitoring and reporting.

One of the largest negative ecological impacts a new wind farm can have is on avian nesting and migratory patterns. Wind turbines can increase bird mortality through collisions with blades, loss of nesting and feeding grounds, and disruption of migratory paths. Manual impact assessments are costly and time-consuming, as humans must monitor the area over time for bird inhabitants and travel patterns. Researchers have begun to focus on this challenge of balancing the needs of local bird populations with the necessity of rapidly constructing sustainable power generators.

To move forward, we must align our renewable energy and ecological preservation goals around these public utility investments. Recent advances in machine learning techniques can support this effort.

Data

Our dataset is Google Open Images Dataset V6, a store of 9 million images that have been human-annotated for image classification, object detection, and segmentation, including bounding boxes for 1.9 million images spanning 600 classes. It is one of the largest publicly available image stores, totaling roughly 18 TB at its hosted location. We chose this dataset because pairing a general, open-source computer vision dataset with publicly available transfer learning models demonstrates that the renewable energy industry does not need heavy investment in collecting custom training data to get a good start.

You can explore categories and images of the dataset using Google’s Open Images Visualizer.

Sample images of birds with associated classification and bounding box localization

Data Augmentation and Pipeline

The 18 TB dataset contains 600 classes, so as a first step, we filter it to focus on categories relevant to our use case of identifying objects in flight: six species of birds and three types of aircraft. We also include some clearly unrelated, general categories, such as “man”, “woman”, “cat” and “dog”. These latter four prove useful both for exploring how the models perform with less specialized training data and as intuitively out-of-distribution examples for testing.

The data is available in two ways: A) downloading individual images from their various source websites via direct URLs, or B) downloading and processing a hosted archive. Since about 20% of images were missing from their original source websites, we elected to download and process the hosted archive. There is no way to filter images or classes within the hosted archive, so we had to download it in its entirety before keeping only the categories we wanted.

After filtering to our chosen classes, we are left with about 160GB of images. We store these in raw form and as compressed archives in a Google Cloud Storage bucket for easy access during data exploration and modelling. The six species of birds and three types of aircraft comprise about 5GB of the total, so the vast majority consists of images of humans, cats, and dogs.
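As a rough sketch of this filtering step, the Open Images box-annotation CSVs can be reduced to the chosen classes with pandas. The file paths below are placeholders and the class list is illustrative rather than our exact configuration.

```python
import pandas as pd

# Placeholder paths; Open Images ships per-split CSVs of box annotations and a
# class-description file mapping label IDs (e.g. "/m/015p6") to display names.
BOXES_CSV = "train-annotations-bbox.csv"
CLASSES_CSV = "class-descriptions-boxable.csv"

KEEP = {
    # Illustrative subset of the nine flight-related classes plus the four
    # general categories; the article names Sparrow, Magpie, Airplane,
    # Helicopter, and Rocket explicitly.
    "Sparrow", "Magpie", "Airplane", "Helicopter", "Rocket",
    "Man", "Woman", "Cat", "Dog",
}

# Map human-readable class names to Open Images label IDs.
classes = pd.read_csv(CLASSES_CSV, names=["LabelName", "DisplayName"])
keep_ids = set(classes.loc[classes.DisplayName.isin(KEEP), "LabelName"])

# Keep only box annotations for those classes; each row is one object instance.
boxes = pd.read_csv(BOXES_CSV)
boxes = boxes[boxes.LabelName.isin(keep_ids)]
boxes.to_csv("filtered-annotations.csv", index=False)

# The surviving ImageIDs tell us which image files to retain from the archive.
image_ids = set(boxes["ImageID"])
```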

Many images contain multiple objects so that observations greatly outnumber images. A cached version of the full dataset would contain a lot of duplicate image data and would not fit into memory. We use the TensorFlow API to set up a Dataset pipeline that reads in each record as a combination of: an ImageID, which it uses to load the image data from disk into memory; four numbers defining the bounding box of the object (bottom-left corner coordinates along with its height and width on the normalized image scale); and a sparse integer-encoded class.

Single Input (green ImageID) for Multi Output (red Bounding Box; yellow Class)

Each observation can be loaded independently, and the entire chain of dataset processing steps (loading and normalizing images, one-hot-encoding the class, and so on) is defined as a lazily evaluated graph that is only executed at runtime and in batches. When run, this pipeline pre-fetches and prepares a new batch of records while training (or evaluation) is performed on the preceding batch. Only a set number of batches and records is loaded into memory at any given time, enabling us to leverage the full dataset without having to fit it all into memory at once, and minimizing wait times by parallelizing I/O with training.

Simple multi-task network using tf.data Pipeline
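A minimal sketch of such a pipeline with tf.data, assuming the filtered annotations from above are already loaded as in-memory arrays (image_ids, bboxes, labels); the image directory, input size, and class count are placeholder assumptions rather than our exact setup.

```python
import tensorflow as tf

IMG_SIZE = 224      # placeholder input size for the baseline model below
NUM_CLASSES = 13    # nine flight-related classes plus the four general ones

def load_example(image_id, bbox, label):
    """Load one observation: image from disk, normalized box, one-hot class."""
    # Placeholder layout: one JPEG per ImageID in a local "images/" directory.
    img = tf.io.read_file(tf.strings.join(["images/", image_id, ".jpg"]))
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (IMG_SIZE, IMG_SIZE)) / 255.0
    return img, {"bbox": bbox, "cls": tf.one_hot(label, NUM_CLASSES)}

# image_ids (strings), bboxes (N x 4, normalized), labels (N,) integer-encoded.
ds = tf.data.Dataset.from_tensor_slices((image_ids, bboxes, labels))
ds = (ds.shuffle(10_000)
        .map(load_example, num_parallel_calls=tf.data.AUTOTUNE)  # lazy, parallel I/O
        .batch(32)
        .prefetch(tf.data.AUTOTUNE))  # prepare the next batch while training on the current one
```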

Baseline Model

To establish a baseline of performance, we deployed a transfer model based on the VGG16 architecture plus a little fine-tuning. VGG16 is a convolutional neural network whose innovation at its introduction was replacing large kernel-sized filters with stacks of 3x3 filters. It is expensive to train and its trained weights can be quite large, but for transfer learning, with minimal retraining over our target dataset, it serves as an excellent visual feature extraction base. Our VGG16-based model added a few fully-connected layers and basic dropout to fine-tune it for our task, using Mean Squared Error as the loss function for the bounding box task and cross-entropy for the classification task.

VGG16 Architecture. Source.
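A hedged sketch of this kind of VGG16-based baseline in Keras follows; the head sizes, output names, and class count are illustrative assumptions rather than our exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import VGG16

NUM_CLASSES = 13  # matches the pipeline sketch above

# Frozen VGG16 backbone acting as a visual feature extractor.
base = VGG16(include_top=False, weights="imagenet", input_shape=(224, 224, 3))
base.trainable = False

x = layers.Flatten()(base.output)
x = layers.Dense(256, activation="relu")(x)
x = layers.Dropout(0.5)(x)

# Two heads: box regression (normalized corner, height, width) and classification.
bbox_out = layers.Dense(4, activation="sigmoid", name="bbox")(x)
cls_out = layers.Dense(NUM_CLASSES, activation="softmax", name="cls")(x)

model = Model(inputs=base.input, outputs=[bbox_out, cls_out])
model.compile(
    optimizer="adam",
    loss={"bbox": "mse", "cls": "categorical_crossentropy"},
    metrics={"cls": "accuracy"},
)
# model.fit(ds, epochs=5)  # ds is the tf.data pipeline sketched earlier
```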

It struggled on two primary types of images: Multi Object and Partial Object. When multiple objects were present in a single image, the transfer model would succeed at classification but incorrectly predict a single bounding box resembling an area-weighted centroid box calculated across all objects in that image. As a baseline this was not surprising, since in expectation the minimum loss is achieved by such a centroid box. For example, in the first row below, the predicted bounding box is one that encompasses the two prominent sparrows or the three airplanes together.

Our baseline model struggled with Multi Object and Partial Object images

When partial objects were present, the model showed mixed classification accuracy and very poor bounding box localization accuracy. Partial objects are relatively rare in the data, hence the model was effectively trained to recognize whole objects.

To improve upon this performance, we would consider a different architecture (for example, a deeper convolutional network that learns a more tailored succession of increasingly abstract representations, such as parts of objects) or a different pipeline (for example, one that augments the data with randomly cropped images of every object).

Customized YOLO Model

Quick, efficient, and accurate detection of multiple (transient) objects poses structural challenges that require a different kind of architecture than traditional classification or single-bounding-box computer vision problems.

Earlier methods like R-CNN involve multiple stages: selective search proposes parts of the image with a high likelihood of containing an object, and each proposal is then classified, with later variants pooling Region of Interest (RoI) features from a shared feature map. These approaches can be very accurate, but they are costly to train and slow to serve predictions. Faster R-CNN improves on its predecessors by replacing selective search with a learned region proposal network that shares a convolutional feature extractor (such as ResNet) with the detector, removing the need for multiple passes over the image in search of promising regions. Still, its multiple stages (feature extraction into region proposal, repeated cropping and pooling, then finally predicting) remain computationally expensive.

More recently, single-stage methods like You Only Look Once (YOLO) and Single-Shot Detectors (SSD) have grown in popularity as they are able to perform fast object detection in real time at little to no loss in accuracy.

To improve performance beyond our simple baseline and to overcome the Multi Object and Partial Object dataset challenges, we implemented the YOLO model developed by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi [Paper]. This model “frames object detection as a regression problem to spatially separated bounding boxes and associated class probabilities”, which is perfect for our application and dataset. Its pipeline is also simple, enabling real-time image processing at up to 45 frames per second.

YOLO Architecture Diagram from original paper. Source.

Further Detail on Model Architecture

The YOLO detection system includes: resizing input images to 416x416; running a convolutional network over the entire image; returning a 13x13x5x14 tensor that defines the proposed bounding boxes over the image grid, along with confidence and per-class probabilities; and finally discarding proposed detections whose confidence from the model is too low.

As the name suggests, the model requires only a single “look” at an image to predict multiple objects and classes. This single pass greatly increases speed while only slightly reducing the localization accuracy of the model’s predicted bounding boxes. Given our goal of detecting the presence and species of pictured birds, however, the greater inference speed and generalizability are worth the slight sacrifice in localization accuracy. If we can successfully detect the type and quantity of flying objects in the monitored sky, wind farms can cost-effectively slow turbines and record species passing through the vicinity in a timely manner without needing to know each bird’s exact location.

Model Illustration from original paper. Source.

The YOLO architecture achieves its results by effectively dividing the image into a grid of 13x13 cells, where each cell is responsible for proposing up to 5 bounding boxes initialized at a variety of anchor aspect ratios chosen to capture typical object shapes. The network uses the Darknet feature extractor (similar in spirit to ResNet), and each cell learns to identify objects in its vicinity.

Each box predicts a confidence score that it has in fact enclosed an object, together with an estimated probability for each class that the enclosed object belongs to it. The bounding box confidence score and class probabilities are combined into an overall score that the bounding box contains a specific type of object. We thus get 13x13x5 simultaneously calculated proposed boxes, each with 14 values (four box parameters, box confidence, and nine class probabilities).

The combined scores are thresholded to those above 0.3, leaving us with the final predictions of arbitrarily many object detections in the input image — each of which is a bounding box with a confidence and class probabilities.
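A simplified NumPy sketch of this scoring and thresholding step, assuming the 13x13x5x14 raw output described above; a real implementation would also decode the box offsets against the anchor shapes and apply non-max suppression.

```python
import numpy as np

GRID, ANCHORS, NUM_CLASSES = 13, 5, 9
SCORE_THRESHOLD = 0.3

def filter_detections(raw):
    """raw: (13, 13, 5, 14) = 4 box parameters, 1 objectness, 9 class scores per anchor."""
    boxes = raw[..., :4]                             # x, y, w, h (grid-relative)
    objectness = 1.0 / (1.0 + np.exp(-raw[..., 4]))  # sigmoid of box confidence
    class_probs = np.exp(raw[..., 5:])
    class_probs /= class_probs.sum(axis=-1, keepdims=True)  # softmax over classes

    scores = objectness[..., None] * class_probs     # combined box-and-class score
    best_class = scores.argmax(axis=-1)
    best_score = scores.max(axis=-1)

    keep = best_score > SCORE_THRESHOLD              # arbitrarily many detections survive
    return boxes[keep], best_class[keep], best_score[keep]

# Example: decode one image's raw network output (random values for illustration).
raw = np.random.randn(GRID, GRID, ANCHORS, 4 + 1 + NUM_CLASSES).astype(np.float32)
boxes, classes, scores = filter_detections(raw)
```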

Results & Interpretation

We deployed a pre-trained YOLO model using transfer learning and fine-tuned the final layers for our specific classes and large-scale dataset. After 8 epochs of training (4 hours on a Tesla T4 GPU), we achieved a mean Average Precision (mAP) value of 0.61 — though this varied by class in relation to its representation in the dataset.

YOLO performance across classes
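For context, mAP averages the area under each class’s precision-recall curve, where a detection counts as correct only if its box overlaps a ground-truth box above an Intersection over Union (IoU) threshold. A minimal IoU helper, assuming boxes given as (x_min, y_min, x_max, y_max) on the normalized image scale:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0.1, 0.1, 0.5, 0.5), (0.3, 0.3, 0.7, 0.7)))  # ~0.14
```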

Our two worst performing classes were “Magpie” and “Rocket”, which correspond to the labels with the fewest observations in our training dataset. We did not implement class balancing to ensure equal proportions of examples, choosing to stay as close to the original general dataset as possible so that our results remain representative for non-specializing practitioners. This clearly impacted the model’s final performance on the rarer classes.

Our YOLO implementation performed well at identifying multiple objects of different classes in a single image, including partial objects and, critically for our use case, objects at varying depths of field.

For example, one of the multi object images below has a helicopter in the foreground, a partial airplane departing on the right, and an airborne airplane in the background — all correctly identified. This is very promising for the types of object detection, classification, and localization we would need to monitor bird species in areas surrounding wind turbines.

Successfully detected multi-class airborne objects at varying depths of field

One type of image our implementation struggled with was objects overlapping a non-classed item, such as a bird sitting on barbed wire or power lines. Our pipeline was trained primarily on whole objects, but included overlapping objects with associated class labels and some occluded or truncated objects. The model learned to try to identify the overlapping pair as a single classed object instead of separating classed objects from non-classed objects.

Examples of poor classifications

Takeaways

State-of-the-art object detection models can successfully be deployed to detect, classify and localize flying objects in the skies around wind turbines.

Our implementation uses the “You Only Look Once” architecture to achieve between 60% and 80% mean Average Precision on one of the largest publicly available, general object detection datasets, with satisfactory mAP scores reliant upon sufficient quantities and variety of observations for each class of interest. We performed only minimal fine-tuning of our transfer models, suggesting that further accuracy improvements are possible, whether targeting these general categories or native bird populations with sufficient image observations.

Examples of successful species identification

The model architecture lends itself well to our use case by leveraging transfer learning on a general dataset not specialized to our particular prediction problem. Its lightweight approach greatly reduces image processing time, which is necessary for serving accurate predictions in a timely and actionable manner, at the cost of a small reduction in localization accuracy that is acceptable given the relative priorities of our use case.

With the critical need to transition domestic energy production to sustainable and cheaper technologies like wind generators, machine learning can play an impactful role in accelerating these investments while preserving local bird populations.
