Viewing the world through a straw

How lessons from computer vision applications in geo will impact bio image analysis (Part 1)

Nick Weir
The DownLinQ
Jan 7, 2020 · 9 min read


By Nick Weir (Senior Data Scientist, In-Q-Tel CosmiQ Works), JJ BenJoseph (Member of Technical Staff, In-Q-Tel B.Next), and Dylan George (VP, Technical Staff, In-Q-Tel B.Next). This is part 1 of a collaboration between CosmiQ Works and B.Next, and is cross-posted at both blogs.

Introduction

Marc Andreessen described software companies as “eating traditional business”. Similarly, computer vision has begun to eat the world of manual image analysis. However, when we step beyond standard photographs and enter niche domains, like biological imaging (medical imaging, microscopy data) and satellite imagery, this has proven less true. Computer vision is eating these data like a Charleston Chew that got left outside in a New England winter: struggling to bite into it, breaking off small pieces, then chewing ponderously before finally swallowing (or spitting it back out in frustration). IQT CosmiQ Works has closely tracked the maturation of AI and computer vision applications to satellite imagery, learning key lessons about the difficulties inherent to transitioning AI tools between domains. At the same time, CosmiQ Works and the IQT B.Next team have noticed that AI product development for medicine has lagged a few years behind related geospatial applications, and is only just beginning to hit its stride. In this blog series we will explore why AI has struggled to gain traction with both satellite imagery and medicine. We’ll also dig into some of the similarities and differences between satellite imagery, microscopy, and “normal” photographs, and why researchers developing AI methods for microscopy might want to scrutinize the work done in geospatial.

What is computer vision and how is it eating the world of imagery?

Computer vision is a term encompassing the wide variety of methods that computers use to interpret images and videos. Before ~2010, this was primarily achieved through a toolbox including edge detection, watershed segmentation, and other heuristic methods.

Figure 1: An image from the COCO dataset, one of the most popular datasets for computer vision algorithm development, with classical computer vision methods applied. Left, the original image; middle, the image with a Canny edge detector applied; right, the image with a statistical region merging-based segmentation method applied. The image and derivatives generated here are provided under the CC BY-NC-SA 2.0 license.
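To make that classical toolbox concrete, here is a minimal sketch (using OpenCV, with a placeholder image path) of Canny edge detection and a simple watershed-style segmentation like those shown in Figure 1:

```python
import cv2
import numpy as np

# Load an image (placeholder path) and convert to grayscale.
image = cv2.imread("coco_example.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Canny edge detection: strong gradients seed edges, and connected
# pixels above the lower threshold extend them.
edges = cv2.Canny(gray, threshold1=100, threshold2=200)

# A simple watershed segmentation: threshold with Otsu's method,
# pick "sure foreground" peaks via a distance transform, then flood
# outward from those markers.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
dist = cv2.distanceTransform(binary, cv2.DIST_L2, 5)
_, sure_fg = cv2.threshold(dist, 0.5 * dist.max(), 255, cv2.THRESH_BINARY)
num_labels, markers = cv2.connectedComponents(sure_fg.astype(np.uint8))
segmented = cv2.watershed(image, markers + 1)
```

Note how many hand-tuned thresholds this requires; heuristic pipelines like this tend to break when lighting, scale, or scene content changes.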

The explosion of deep artificial neural networks that began in ~2010 has upended the computer vision domain. Deep neural networks apply simple mathematical operations repeatedly to represent increasingly abstract features from the original input image. Deep learning can be thought of as an automatic way to select the right steps to sequentially process data and generate a prediction: for example, to predict where an object is in an image. Deep neural networks reached into image processing with the advent of Convolutional Neural Networks (CNNs), which were first described in the 1980s and gained traction in the new millennium as expanded computational capacity made complex networks feasible. CNNs enable computer vision researchers to achieve a variety of goals, from classifying images to identifying objects of interest to specifically tracing the boundaries of each individual object, among others.

Figure 2: Examples of common CNN outputs for computer vision applications. Left, Image Classification, where the categories of objects present are associated with the image; middle, Object Detection, where bounding boxes with category assignments outline each object; right, Instance Segmentation, where each object’s pixels are specifically identified and associated with the unique object. The derivatives generated here are provided under the CC BY-NC-SA 2.0 license.
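For a sense of how accessible these outputs have become, here is a hedged sketch that loads a Mask R-CNN model pre-trained on COCO via torchvision and produces the kind of instance segmentation shown above; the image filename is a placeholder:

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Load a Mask R-CNN model pre-trained on the COCO dataset.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

# Placeholder image path; any RGB photograph works.
image = to_tensor(Image.open("dog_frisbee.jpg").convert("RGB"))

with torch.no_grad():
    predictions = model([image])[0]

# Each prediction carries bounding boxes, class labels, confidence
# scores, and per-instance pixel masks: the outputs shown in Figure 2.
print(predictions["boxes"].shape, predictions["labels"], predictions["scores"])
print(predictions["masks"].shape)  # (num_instances, 1, H, W)
```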

As companies like Google, Amazon, and Apple began building products around these methods, many have begun expanding these applications beyond common photographs (think camera phone photographs, security cameras, dashboard cameras on cars) into new domains: medical imaging, geospatial imagery, 3D data like light detection and ranging (LIDAR), and even image projections of audio and genetic data, among others. The differences inherent to those “uncommon” types of images pose unique challenges that must be resolved before the latest and greatest computer vision methodologies can be applied. Here, we’ll explore these challenges in detail in the geospatial domain, where IQT CosmiQ Works has spent several years learning how to apply state-of-the-art computer vision methods to satellite imagery. Many of the challenges CosmiQ has faced have analogs in the analysis of medical imagery, particularly microscopy data, which will be explored more fully in subsequent blogs.

Expanding the reach of computer vision & AI: the trials and tribulations of satellite imagery

What is satellite imagery?

Let’s start off by asking the most fundamental question: what is satellite imagery and what makes it different from natural scenes? Let’s look at an example of each side by side:

Figure 3: A comparison of a “natural scene” photograph (left) to a satellite image (right).

Let’s imagine we were asked to develop algorithms to identify and trace the boundaries of objects of interest in these images (an “instance segmentation” problem, to the CV experts). In the natural scene photograph, we’re tasked with finding all of the foreground objects (dog, frisbee, finger); in the satellite image, we’re trying to find all of the buildings and roads. What’s similar and different about these challenges?

Figure 4: A comparison of instance segmentation results for a natural scene image and a satellite image.

Let’s start with similarities between these tasks. In both cases, we’re trying to outline all the individual target objects, so the same algorithm type should work for both types of images. They’re both photographs, and therefore convolutional neural networks should help. However, this is where the similarities end, and where satellite-specific challenges appear.

Object size and abundance

Unsurprisingly, target objects usually appear much smaller in satellite images, making them harder to find. In the natural scene photograph above, there is only one dog, and it comprises over 48,000 pixels, about a quarter of the image. By contrast, the satellite image contains many buildings, but they are very small: the buildings in the SpaceNet Atlanta dataset average about 1,200 pixels, or 2.5% the size of the dog on the left. Because pixel arrangement and color provide the information that tells us what’s a dog versus a fire hydrant versus a building, the number of pixels in each object dictates how much information the model can use to identify it. Research has shown that CNNs use texture to identify objects in images, and smaller objects contain less texture information. The end result: smaller objects are harder to find, making satellite imagery analysis more difficult for current state-of-the-art computer vision models.
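This size gap is easy to quantify if you have labeled instance masks. Below is a minimal sketch, assuming a mask array in which each object carries a unique integer ID and 0 marks background:

```python
import numpy as np

def object_sizes(instance_mask: np.ndarray) -> dict:
    """Count pixels per object in a mask where each object has a
    unique integer ID and 0 denotes background."""
    ids, counts = np.unique(instance_mask, return_counts=True)
    return {int(i): int(c) for i, c in zip(ids, counts) if i != 0}

# Toy example: a 4x4 image containing two small objects.
mask = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 2, 2],
    [0, 0, 2, 0],
])
print(object_sizes(mask))  # {1: 3, 2: 3}

# For scale: the dog above spans ~48,000 pixels; a typical SpaceNet
# Atlanta building spans ~1,200.
```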

Just as object size is inconsistent between normal photographs and satellite imagery, so is object abundance. Let’s compare the distribution of the number of buildings per image in the SpaceNet Atlanta dataset versus the number of objects per image in the COCO training dataset:

Figure 5: Datasets split by number of objects per image. Green, the COCO dataset stratified by the number of images with 1–25 unique objects, 26–50 objects, etc. Blue, the same dataset stratified by the number of images containing one specific class of object (dogs). Red, the SpaceNet Atlanta dataset stratified by how many buildings are present in each image. Note how much greater the heterogeneity in object count is for SpaceNet.

The difference is striking: there are more images with many objects in the SpaceNet satellite imagery dataset than in COCO. By contrast, the number of objects of interest in natural scene photographs is much more consistent. Even in cases of rare objects in natural scene imagery, it’s unusual for object-containing images to have more than a few instances of the target: see the COCO “dogs” category breakdown above. The end result is that natural scene images have much more homogeneous object counts. This is particularly relevant for object detection algorithms, where the algorithm must learn how many objects to identify (or have that value provided by a data scientist).
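For readers who want to reproduce this kind of comparison, here is a hedged sketch that bins object counts per image from a COCO-format annotation file (a JSON with an "annotations" list keyed by "image_id"); the filename is a placeholder:

```python
import json
from collections import Counter

# Placeholder path to a COCO-format annotation file.
with open("instances_train.json") as f:
    coco = json.load(f)

# Count annotated objects per image.
per_image = Counter(ann["image_id"] for ann in coco["annotations"])

# Stratify images into the bins used in Figure 5: 1-25 objects, 26-50, etc.
bins = Counter((count - 1) // 25 for count in per_image.values())
for b in sorted(bins):
    print(f"{b * 25 + 1}-{(b + 1) * 25} objects: {bins[b]} images")
```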

Imaging bands

Unlike everyday photographs, satellite images are rarely composed of a “standard” Red-Green-Blue (RGB) 3-channel combination. Many satellites collect across a wide variety of light wavelengths, including an “extra-blue” coastal band, yellow, near-infrared (NIR), short-wave infrared (SWIR), and others (Figure 6). Extensive research has shown that these extra bands help identify features like vegetation, urban areas, and bodies of water, making them important to include in many satellite computer vision tasks. Eight or more bands are often provided in commercial satellite imagery products, such as the multispectral imagery products from Maxar Technologies.

Figure 6: Comparing satellite imagery bands with common RGB digital camera channels. Most digital cameras only collect three channels to create photographs with visible spectrum information. By contrast, satellites often collect at a variety of wavelengths, many of which are invisible to the human eye. MAXAR WorldView-3 Band information provided by the Satellite Imaging Corporation. These bands vary across sensors, much as wavelengths collected by microscope lenses and filters vary dramatically.

However, because most computer vision models expect 3-channel RGB inputs, they cannot accommodate these additional data. Some geospatial data scientists work around this by starting from models pre-trained on RGB imagery and training weights for the additional bands from scratch, but doing so can be difficult and inefficient. The dearth of models designed for and pre-trained on so-called “multi-spectral” satellite imagery makes it very difficult to use those data effectively.
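One common version of that workaround, sketched below for an assumed 8-band input and a torchvision ResNet-50 backbone, is to swap out the first convolution, reuse the pre-trained RGB filters for three of the channels, and initialize the remaining bands from scratch:

```python
import torch
import torch.nn as nn
import torchvision

# Start from a ResNet-50 pre-trained on RGB imagery.
model = torchvision.models.resnet50(pretrained=True)

# Replace the 3-channel input convolution with an 8-band one.
old_conv = model.conv1
new_conv = nn.Conv2d(8, old_conv.out_channels,
                     kernel_size=old_conv.kernel_size,
                     stride=old_conv.stride,
                     padding=old_conv.padding,
                     bias=False)

with torch.no_grad():
    # Reuse the pre-trained RGB filters for the first three bands...
    new_conv.weight[:, :3] = old_conv.weight
    # ...and initialize the remaining five bands from scratch.
    nn.init.kaiming_normal_(new_conv.weight[:, 3:], mode="fan_out",
                            nonlinearity="relu")
model.conv1 = new_conv

# The network now accepts (batch, 8, H, W) multispectral tensors.
x = torch.randn(1, 8, 224, 224)
features = model(x)
```

Only the first layer benefits from this trick; every deeper layer still has to learn multispectral features during fine-tuning, which is part of why these models remain costly to train.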

Image size

The sheer size of the average satellite image poses another challenge. SpaceNet’s datasets comprise about 35 satellite images covering 10 cities around the world. Their sizes vary, but most of these images are approximately 150K by 50K pixels, or about 900 times the size of a 4K UHD image. Few computer vision models can accommodate images larger than 1024 pixels on a side, meaning that to analyze one satellite image a data scientist would minimally need to split it into over 7,000 pieces and run each one separately. And that one image corresponds to only about 0.1% of the daily acquisition capacity of MAXAR’s WorldView-3, the satellite that collected most of the SpaceNet dataset! Tiling imagery in this fashion also raises additional challenges: edge effects are common in machine learning algorithms, and reconciling objects at the boundaries between two tiles can be difficult. As we will explore in more detail in our next blog, this is also true for microscopy imagery, where whole-slide scans are very large.
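A minimal tiling sketch, with illustrative (not prescriptive) tile size and overlap, looks something like this; overlapping the tiles helps ensure an object cut by one tile boundary falls wholly inside a neighbor:

```python
import numpy as np

def tile_image(image: np.ndarray, tile_size: int = 1024, overlap: int = 128):
    """Yield (row, col, tile) triples covering a large image with
    overlapping square tiles."""
    stride = tile_size - overlap
    height, width = image.shape[:2]
    for row in range(0, max(height - overlap, 1), stride):
        for col in range(0, max(width - overlap, 1), stride):
            yield row, col, image[row:row + tile_size, col:col + tile_size]

# Stand-in for a real scene; a 150K x 50K image would need even more
# than the ~7,000 non-overlapping tiles noted above, and its tile-level
# predictions must then be merged back together across seams.
big_image = np.zeros((4096, 4096, 3), dtype=np.uint8)
tiles = list(tile_image(big_image))
print(len(tiles))  # 25 tiles for this 4096x4096 stand-in
```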

Data availability

The limited public availability of well-labeled satellite images has slowed research and development. Though companies are beginning to provide labeling services for everyday imagery, there are very few that label overhead imagery given the additional challenges it poses. As a result, there aren’t many well-labeled public satellite imagery datasets for benchmarking AI algorithms against, meaning it is difficult to train models and reliably assess performance. This slows research and makes it hard to demonstrate product value. Furthermore, most existing open source datasets are either geographically constrained, relatively small, or inconsistently labeled. Companies that wish to apply machine learning to overhead imagery will likely need to commission labeling of a custom data set suited to their use case, which is a difficult and often prohibitively expensive task in the present market.

Domain expertise

A final barrier worth noting is the amount of domain expertise required to explore computer vision applications in the geospatial realm. Beyond all of the data science concepts one must understand to develop computer vision algorithms — linear algebra, convolutional neural networks, and statistics, to name a few — a geospatial data scientist must also understand geographic coordinate reference systems, satellite imagery bands (as mentioned earlier), geospatial data formats (e.g. GeoJSONs) and a number of other geospatial-specific concepts. Add to this the additional software tools geospatial data scientists must be familiar with, such as GDAL, QGIS or ArcGIS, and it’s little wonder that trying to hire well-equipped geospatial data scientists is like trying to hire unicorns. As we will discuss in the next post in this series, this barrier is only amplified in the medical domain.
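As a small taste of that geospatial-specific layer, the hedged sketch below uses rasterio (one of several possible tools) to read a GeoTIFF’s band count and coordinate reference system, then maps a pixel location to geographic coordinates, a step with no analog in everyday photography; the filename is a placeholder:

```python
import rasterio

# Placeholder path to a GeoTIFF satellite image.
with rasterio.open("scene.tif") as src:
    print(src.count, "bands; CRS:", src.crs)  # e.g. 8 bands, EPSG:4326

    # The affine transform maps pixel (col, row) to geographic coordinates.
    lon, lat = src.transform * (512, 512)

# A pixel-space detection becomes a GeoJSON point in the image's CRS.
detection = {"type": "Point", "coordinates": [lon, lat]}
print(detection)
```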

The state of geospatial machine learning today

Because of the above differences and other challenges, geospatial applications of AI have lagged behind applications to everyday photographs and videos. This has created a chicken-and-egg problem: few AI models have been developed for geospatial analysis because there are few common commercial applications for geospatial AI (perhaps because there are few demonstrably worthwhile models available). Geospatial analytics companies are relatively rare, and geospatial analytics research is a tiny fraction of the work presented at computer vision conferences. Between the limited number of experts, the limited commercial market, and the absence of well-labeled data for model development, it’s little wonder that geospatial AI applications have yet to prosper.

Conclusion

In this blog we’ve presented a case study in how challenges can arise going from everyday photographs to unusual imagery. Many of these difficulties have analogs in other domains, including medical imagery, an area that computer vision and AI practitioners are just beginning to explore in a product-relevant fashion. Stay tuned for the second part of the series, where we’ll dig deeper into AI on medical imagery and what lessons that field can learn from geospatial AI product development.
