Cloud-Spotting at a Million Pixels an Hour

How I learned to draw clouds in satellite imagery during a data labeling competition

Jon Engelsman
Radiant Earth Insights
Sep 28, 2020 · 14 min read


Jon won the Best Quality Labeler award for our recent Data Labeling Contest. We asked him to detail his approach and workflow.

An Outreach Day and a Contest

I recently attended the Cloud Native Geospatial Outreach Day, a virtual event designed to “introduce STAC, COG, and other emerging cloud-native geospatial formats and tools to new audiences.” As part of the outreach day, co-sponsors Planet, Microsoft, Azavea, and Radiant Earth teamed up to host a week-long data labeling contest. This friendly competition had contestants race to manually label the shapes of clouds across a large selection of satellite images from around the world.

The contest’s ultimate goal was to generate a crowd-sourced collection of high-quality labeled images, data that can be used to train accurate cloud detection models. These models are used in various land surface monitoring efforts to detect and mask the clouds that are commonly found in satellite imagery.

As extra motivation, the top prize for the highest-scoring labeler was a $2,000 award, plus an open-licensed 50 cm SkySat image to be tasked by the winner. As a challenge to myself, and because of my love for all things SkySat, I decided to try to win the competition.

Day 1

The labeling competition kicked off on the same morning as the Outreach Day, and I got to work right after the project images were released for labeling. I spent my first day learning the ins and outs of the annotation tool we used for the labeling contest.

GroundWork

Launched in April 2020, Azavea’s GroundWork is a web-based segmentation labeling tool designed to make it easy and efficient to create training data sets for machine learning models. It lets users set up and define their labeling projects, from uploading raster imagery source data to defining the segmentation categories to be labeled. This flexibility can be used to label everything from natural objects like water, land, and clouds to human-made objects like cars, buildings, or solar panels.

GroundWork by Azavea.

I had actually signed up for an account when GroundWork first launched, but this competition gave me an opportunity to really take it for a spin. It’s a rather impressive tool, both in terms of its user experience and how it handles geospatial data behind-the-scenes, which I cover in more detail here.

Even though it was my first time using this tool, its streamlined workflow made getting into the data labeling process a breeze.

Workflow

When I first signed in to the GroundWork application, there was a list of projects that had been created for the data labeling competition, and more were added throughout the contest as tasks were completed. For this competition, each project in the queue consisted of Sentinel-2 satellite imagery scenes from various locations around the world.

Project List

Next, selecting a project to work on was as easy as clicking “Start Labeling” to get a random task, or picking a project that still had tasks to be completed, as indicated by a “% labeled” indicator. Clicking on a specific project card brought up a project page with a more detailed map of all of that project’s tasks.

Project Page

The tasks are shown as a tile grid overlaid on the satellite imagery to be labeled. Each task in the grid is color-coded to indicate its current status (e.g., unlabeled, labeled) and whether it is available to be labeled.

To get started labeling, you simply click on an unlabeled task to bring up the task page.

Task Page

This task page shows the actual satellite imagery to be labeled, along with the selectors for the segmentation labels available for that task. For the contest, the purpose of each task was to find any pixels on the imagery that appeared to be clouds, then draw around those areas and label them as one of the two segmentation categories, either Clouds (in pink) or Background (in green). When you were finished labeling the entire image, clicking on the Confirm button would submit your labels for automated scoring.

Drawing Clouds from Above

Since these were manual labeling tasks, there were a few options for how to label pixels in the image: a drawing tool, a “magic” wand tool, a replace tool, and an erase tool.

After a bit of practice, I landed on a labeling technique using the basic drawing tool, which I felt struck the right balance in terms of speed vs. accuracy. Here’s a clip of that labeling process, sped up 10x below.

Although my aim was to win the competition, I still wanted to focus on the quality of my labeling, striving for pixel-perfect labels to the extent I felt was possible. That being said, these are clouds we’re talking about, so things can get really philosophical, really quickly, when you’re pondering whether or not a specific pixel is part of a cloud. But with the clock ticking on the contest, I put those questions aside and did what I thought was best for each task.

Using this drawing method, it took me about 15 minutes to label a large, complicated task (roughly four per hour). Each submitted task added about 10–50 points on a leaderboard that tracked competitors’ scores. I spent a good chunk of time that first day just working my way through tasks, and I soon found myself at the top of the leaderboard.

At the end of the day, though, my hand was killing me. The drawing method requires a constant mouse down hold, which is brutal on your hand muscles after prolonged periods of time. On top of that, I started to see other labelers quickly catching up to my score at a pace I knew I couldn’t match with the technique that I was using. I went to bed that night with a lead that was slowly dwindling, knowing that I might have to rethink my strategy if I wanted to remain competitive in the contest.

Day 2

The next day, I stuck with the drawing method for a bit and tried to keep a strong labeling pace to maintain the lead. This worked for a little while, but towards the end of the day, with my hand aching and having dropped down to 2nd place on the leaderboard, I took a break from the contest. I told myself that if I was behind by more than 1,000 points later that evening, then I’d stop trying to compete and just continue labeling at a more leisurely pace. You know, just take some time to really stare at the clouds.

But when I checked later that evening, I was surprised to see I was down from 1st place by only 600 points! That was a gap I could manage to come back from. I decided to give myself a little breathing room and rethink my strategy a bit. I started by looking at the scoring metric in more detail to see if there was something I was missing that might explain why I was scoring points so slowly.

Here’s the scoring metric as defined by the data labeling contest organizers:

Scoring Metric, Data Labeling Contest

This scoring metric consisted of three terms. The first term in the equation was fairly straightforward: a simple low weighting (0.1) on the number of individual tasks you had completed.

The third term is a bit more interesting at first glance. It’s a heavy weighting (10) on the summation of a fractional product, specifically the fraction of cloud pixels times the fraction of background pixels for each completed task. For a task that is half cloud pixels and half background pixels, this fractional product reaches its maximum value of 0.25 (0.5*0.5=0.25). The term is intended to reward the relative complexity of a given task. If an easy task is either all clouds or all background (which wouldn’t take much time to complete), the term zeroes out (e.g., 1.0*0.0=0.0, or vice versa). For any task that is anything other than half cloud / half background, the fractional product falls somewhere below 0.25 (e.g., 0.75*0.25=0.1875).

At first, I thought that this scoring term was the reason I was scoring so slowly. I was worried that it was low from doing too many easy, all-background tasks as breaks between more challenging tasks. As an experiment, I tried to balance it out by doing more all-cloud tasks for a bit, but nothing seemed to make much of a difference, and my score was still only slowly improving.

Taking another look at the scoring metric, I realized there was more to the second term than I had initially thought. It looked pretty simple at first: a medium weighting (0.5) on the total number of polygons completed, including both cloud and background polygons. Knowing that I tended to “close” or connect a lot of the labeled areas as I was drawing, I worried that I was minimizing the number of polygons in each task, which could be the reason my scoring was so low.
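Putting the three terms together, here’s a rough sketch in Python of how I understood the scoring to work. The function and field names are my own, and the organizers’ actual implementation may well differ; treat this as an illustration of the weights described above, not the official formula.

def contest_score(tasks):
    """Rough sketch of the contest scoring metric as described above.

    `tasks` is a list of dicts, one per completed task, each with
    'n_polygons', 'cloud_fraction', and 'background_fraction' keys.
    (These names are my own, not the organizers'.)
    """
    # Term 1: low weight (0.1) per completed task.
    task_term = 0.1 * len(tasks)
    # Term 2: medium weight (0.5) per polygon, cloud or background.
    polygon_term = 0.5 * sum(t["n_polygons"] for t in tasks)
    # Term 3: heavy weight (10) on the cloud/background fraction product,
    # which peaks at 0.25 for a half-cloud, half-background task.
    complexity_term = 10 * sum(
        t["cloud_fraction"] * t["background_fraction"] for t in tasks
    )
    return task_term + polygon_term + complexity_term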

With this thought in mind and after taking a break from labeling, I decided to give the competition one last shot. Knowing that the drawing method wasn’t working for either my score or my hands, I turned to the other labeling tool available: the magic wand.

The Magic Wand

Azavea launched the magic wand as a new feature for GroundWork in late August 2020, only a few weeks before the labeling competition started. The wand provides an easy way to select a bunch of pixels of a similar color, with a varying degree of selection sensitivity depending on how far you move the mouse cursor while labeling.

When I first started using the wand in GroundWork, it took a bit of practice. But once I got the hang of it, the accuracy of pixel selection that it achieved, particularly for cloud labeling, was quite impressive. Here’s a clip of the wand technique I used, sped up 10x below.

The “magic” behind the wand tool is the flood fill algorithm, the same magic found in paint bucket tools from various image editing programs. The wand tool can pick up on subtle hints of pixel color that the human eye can’t readily see, which becomes particularly noticeable at the faint, wispy edges of clouds and can sometimes result in pixel labels with an almost fractal-like quality.

With extensive use of the wand, I started to understand how the pixel selection would react in different situations, allowing me to predict which areas or bands of pixels would be selected based on where I started a click (setting the initial color) and how far I dragged the mouse (setting the sensitivity of the selection). Understanding this behavior made a significant difference in my ability to label cloud pixels accurately.
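To make the mechanics concrete, here’s a minimal sketch of a tolerance-based flood fill, the general idea behind paint-bucket-style selection. This is my own toy illustration in Python, not Azavea’s implementation: the seed pixel plays the role of the initial click, and the tolerance parameter stands in for the drag-to-adjust sensitivity.

from collections import deque

def flood_fill_select(image, seed, tolerance):
    """Return the set of (row, col) pixels connected to `seed` whose color
    stays within `tolerance` of the seed pixel's color.

    `image` is a list of rows, where each pixel is an (r, g, b) tuple.
    """
    rows, cols = len(image), len(image[0])
    seed_color = image[seed[0]][seed[1]]

    def close_enough(color):
        # Simple per-channel comparison against the seed color.
        return all(abs(c - s) <= tolerance for c, s in zip(color, seed_color))

    selected = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        # Spread to the four direct neighbors that are similar enough.
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in selected:
                if close_enough(image[nr][nc]):
                    selected.add((nr, nc))
                    queue.append((nr, nc))
    return selected

Because the fill only spreads through connected pixels that pass the tolerance test, a low sensitivity clings to the bright core of a cloud, while a higher one bleeds outward into the darker bands at the edges, which matches the band-by-band behavior I describe above.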

The drawing and wand methods each resulted in a different style of labeling, depending on how you used them. With the drawing method, you might find yourself doing either a continuous click-and-hold motion (resulting in smooth, curved areas) or the somewhat less accurate, but much faster, method of clicking to build polygon areas by connecting line segments (resulting in straight, angular areas). With the wand method, you might end up with either rough, angular areas or more refined pixelated areas, depending on how far you zoom in and how small and precise you make your wand selections. You can see how these different methods compare, at least visually, when looking at neighboring tasks that have been labeled.

In terms of the competition, an important side-effect I discovered while using the wand method was that it suddenly felt much easier on my hands and fingers as compared to the drawing method. Since the wand selection movements mostly consisted of brief mouse click holds with less precise mouse movement (as compared to the drawing method), hand fatigue while labeling was greatly reduced to the point of being barely noticeable.

That being said, I noticed that my time per task had gone up a bit, to around 20–25 minutes for a large, complicated task. Because of how readily the wand can select pixel-perfect features, I wanted to make my labels as accurate (to my eye) as possible, which ended up taking a bit more time.

More surprisingly, though, after getting the hang of the wand method, I started noticing that my average score for a reasonably complex task had skyrocketed to anywhere from 50 to 250+ points per task. At first, this jump in points was a little confusing since I was going slower than before, and yet the overall effect was that my scoring pace had picked up significantly.

After a bit of thought, I quickly realized what was going on: polygons, or more specifically, holes.

Peeking Through the Clouds

Remember the second term in that scoring equation, 0.5 * N_polygons? Well, it turns out that an interesting side-effect of using the wand method was that it created a significant number of holes within the interior of a labeled area, much more so than the drawing method I had previously used.

Take a second look at that task comparison image in the previous section. Both the center task and the middle-left task were labeled using the wand method. They look pretty similar, right? Well, you’ll start noticing an important difference if you zoom in on that middle-left task. Enhance.

See those dark, purplish bands and the green splotches in the middle of the pink cloud labels? It’s hard to tell from here, but those are actually holes. If you zoom in on a particular cloud and turn off the dark green background, you can see them better as the bright white of the cloud showing through from the base image, in contrast to the surrounding pink label.

These holes were mainly a result of the wand technique I was using, where I started at the center of the cloud to fill in the brighter pixels and then worked my way outward, filling in progressively darker bands of cloud pixels toward the edge. This technique, likely amplified by how the wand’s flood fill algorithm interacts with the color gradient of a cloud, is what created so many unintentional holes in these cloud label polygons.

The result of this wand side-effect was a massive leap in scoring efficiency, since each and every hole (i.e., a polygon) now contributed to my score via that second term in the scoring metric. In terms of the competition, this boost in scoring was actually a disincentive to close these interior holes. In the task with holes that I highlighted above, I was intentionally sloppy about closing some of those interior areas in order to demonstrate the effect; from a macro view, it’s almost impossible to tell the difference from the more judiciously labeled task to its right. But with a focus on maintaining labeling accuracy as much as possible, I still spent the extra time to close the holes where it made sense to do so, although some were quite difficult to spot with just a cursory glance.
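To see roughly how big the effect can be, here’s the earlier contest_score sketch applied to the same hypothetical half-cloud task labeled two ways. The polygon counts are invented purely for illustration, but they show why a hole-riddled wand label could out-score a cleanly drawn one several times over.

# Purely illustrative numbers: the same half-cloud task labeled two ways.
drawn = {"n_polygons": 6, "cloud_fraction": 0.5, "background_fraction": 0.5}
wand = {"n_polygons": 80, "cloud_fraction": 0.5, "background_fraction": 0.5}

print(contest_score([drawn]))  # 0.1 + 3.0 + 2.5  = 5.6 points
print(contest_score([wand]))   # 0.1 + 40.0 + 2.5 = 42.6 points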

Days 3–4, also known as Cloud Nine

By now, I had hit a stride using the wand method, striking the right balance between label accuracy and scoring pace. I spent a fair bit of time labeling over the next few days, which significantly increased my overall score. By the middle of Day 4, I had built a lead of about 7,000 points over the 2nd-place labeler, and I was sitting comfortably (or so I thought) at the top of the leaderboard under the name “Narrowband”.

I thought it was smooth sailing from then on to the end of the competition, and I started day-dreaming about where I was going to task that SkySat image.

End of the Line

Well, it turned out that my time at the top wasn’t to last. The labeling technique I had settled on was no match for the diligent pace that others in the competition could maintain. I tried to keep the race close for as long as possible, but eventually the other labelers proved too formidable. As my lead dwindled, I knew my time at the top would soon be over, and with it, my dream of tasking that SkySat image.

Knowing I was out of the race, I decided to shift gears a bit and reflect on the brief but exciting journey into the world of imagery labeling that had brought me to this point.

Looking Back

Overall, this contest was a great introduction to the world of satellite imagery annotation. Most of my previous experience with labeling or classifying satellite imagery had come from editing features in OpenStreetMap, but this contest took that experience to a whole other level. In both cases, it’s important to step back from the task sometimes and just appreciate the fact that you’re helping to contribute crowd-sourced data that will support many downstream efforts in the future.

The contest also offered a unique opportunity to learn and extensively use the GroundWork annotation tool, an opportunity that has given me a real appreciation for both the tool and the team that built it. It made the overall labeling process incredibly intuitive and provided a great platform for this type of contest. I don’t know that I’ll ever be able to think about clouds or the flood fill algorithm the same way again!

Last, I’m excited to see how the labeling data that was generated by the competition will be used by Azavea, Radiant Earth, and others in training future machine learning models to better handle clouds in satellite imagery. While labeling particularly challenging clouds, I often found myself asking the impossible question, “what would the model think about this?” I’m eager to see if I can learn the answer to that question from their future work.

As for the final leaderboard scores? Take a look at the impressive scores of the top 10 labelers. As for me, I’m more than happy to take home 7̵t̵h̵ ̵p̵l̵a̵c̵e̵ a Best Quality Labeler award!

Thanks to the Organizers

Many thanks to the teams at Azavea and Radiant Earth for hosting the competition, with a particular shout out to Joe Morrison and Chris Brown of Azavea and Hamed Alemohammad of Radiant Earth for their support and for taking the time to answer everyone’s questions about the contest on Slack. And a huge shout-out and thanks to the other labelers for a great competition!
