S2cloudness or how we processed 130 Bn km² of cloud masks in less than a day
Spoiler alert: in reality it took us about two years to get here, but hey, the last step, the actual processing of the data, took much less than a day. Less than ten hours actually, thanks to AWS.
Clouds are one of the main annoyances for remote sensing experts working with optical data, and not just because they often make it impossible to see the area of interest. A major problem with clouds lies in the fact that we cannot rely on the multi-spectral imaging data without first checking whether clouds are distorting the signal. With a five-day revisit time of Sentinel-2, one can typically get 70, in some places even 100, observations per year for any place in the world. This is simply too much to check manually. Not to mention that there are various types of cloudiness, from thick clouds (white, with nothing penetrating through them) to a bit of haze, which seems OK but influences the reflectance just enough to corrupt the observation.
When we established a data science group within Sinergise (EO Research), the first thing they noticed was that clouds are really in the way of any proper analysis. Sure, you can filter the scenes by cloud cover meta-data, e.g. setting the maximum at 20%, but this does not help much if the filter is applied per scene of 100 x 100 km and the 20% just happens to cover your area. Or you are throwing useful data away. There are somewhat more detailed cloud masks stored as GML with each scene, but these are not too accurate, and working with GMLs at this scale is clumsy. The scene classification layer that comes with Sen2Cor was only available for a limited area at the time (not anymore, check this info). Therefore we were looking into other options, and we expected plenty; it seems that cloud detection is just about the first thing any remote sensing expert would do when venturing into the machine learning field. There was Hollstein's adaptation of a Landsat algorithm, which was pretty good. And the Braaten-Cohen-Yang cloud detector, which was less accurate but much faster due to using fewer bands. None of these were good enough though. Many more were talked about, but none was available for practical work, often due to experts keeping things for themselves (a common issue, which is not helping our field).

Because of this and, typically, because researchers always believe they can do better themselves, our team created their own. It was not rocket science, far from it. The approach was nevertheless quite innovative: we reused, with permission, cloud masks coming from the multi-temporal MAJA processor, which were among the best at the time, and trained a simple gradient-boosted decision trees model so that it works at the pixel level and can be run at different resolutions. The result was pretty good, and processing was simple and cheap, ICT-wise.

This and, probably most importantly, the fact that we made it openly available under one of the most permissive licenses (CC-BY-SA), resulted in the library being widely used. The code with the model was downloaded more than 80,000 times. It was used by our users and by our competitors. We found out recently that Google is using it to create cloud masks within Google Earth Engine. We were happy with this, even though it did not bring much (probably any) revenue to us. It did raise awareness of what we do and it helped the field move one step forward (we really have to stop doing each and every step again and again). Most importantly though, it made our researchers feel proud and content that the work they do is actually used and making an impact. Isn't this why we are all here?
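For those who want to try the detector themselves, here is a minimal sketch of how the open-source s2cloudless package is typically used. The band list follows the model's documented inputs, the parameter values are the common defaults, and the input array below is just placeholder data.

```python
# A minimal sketch (not our exact production setup) of running s2cloudless on a
# stack of Sentinel-2 L1C scenes. Parameter values shown are the usual defaults.
import numpy as np
from s2cloudless import S2PixelCloudDetector

# Top-of-atmosphere reflectances (scaled to 0..1) for the 10 bands the model uses:
# B01, B02, B04, B05, B08, B8A, B09, B10, B11, B12.
# Shape: (n_scenes, height, width, n_bands). Random values here as a placeholder.
bands = np.random.rand(4, 256, 256, 10)

cloud_detector = S2PixelCloudDetector(
    threshold=0.4,    # probability threshold for the binary mask
    average_over=4,   # smoothing window applied to the probability map
    dilation_size=2,  # buffer the mask to also catch cloud borders
    all_bands=False,  # the input above contains only the 10 model bands
)

cloud_probs = cloud_detector.get_cloud_probability_maps(bands)  # per-pixel probabilities
cloud_masks = cloud_detector.get_cloud_masks(bands)             # binary 0/1 masks
```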
s2cloudless was a success, but it was still too complicated to use for the vast majority of our users. They are keen on getting the relevant data via the API, using Custom scripts. s2cloudless requires them to implement a separate process, calculate the masks and then either use them directly or feed them back to Sentinel Hub using the BYOC mechanism. It was complex and it was costly: s2cloudless uses 10 bands, so one needs to fetch quite a bit of data to make it work. We therefore received numerous questions about when we would make this available as an auto-generated layer.

A couple of months ago we started to seriously consider this. There were many triggers. One was that s2cloudless performed very well within ESA's and NASA's CMIX-II cloud detection comparison exercise. Then there was an increased uptake of Sentinel Hub for machine learning purposes, where this information is essential. Some ML experts would wave this away, saying something like "deep learning will learn to detect clouds in the process anyway". But why would you complicate the process if there is no need to? It is way cheaper to include the masks as an input. Last but not least, we recently launched Batch processing, which can create time stacks for feeding into ML processes in an effortless and cost-efficient manner. There, the cloud layer makes a significant difference. So we set out to work on this.
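To give an idea of what an auto-generated layer means in practice, here is a rough sketch of requesting cloud-masked imagery with the sentinelhub Python package. The CLM (cloud mask) band name comes from the public Sentinel Hub documentation rather than from this post, the bounding box is an arbitrary example, and credentials are assumed to be configured separately.

```python
# A sketch of consuming the precomputed cloud mask (CLM band) via a Custom script,
# using the sentinelhub Python package. Bounding box and credentials are placeholders.
from sentinelhub import (
    BBox, CRS, DataCollection, MimeType, SHConfig, SentinelHubRequest, bbox_to_dimensions
)

config = SHConfig()  # assumes OAuth client id/secret are already configured

evalscript = """
//VERSION=3
function setup() {
  return { input: ["B04", "B03", "B02", "CLM"], output: { bands: 3 } };
}
function evaluatePixel(s) {
  // Blank out pixels flagged as cloudy by the precomputed s2cloudless mask
  if (s.CLM === 1) {
    return [1, 1, 1];
  }
  return [s.B04, s.B03, s.B02];
}
"""

bbox = BBox([13.35, 45.95, 13.45, 46.05], crs=CRS.WGS84)  # arbitrary example area
size = bbox_to_dimensions(bbox, resolution=160)           # matches the mask's native 160 m

request = SentinelHubRequest(
    evalscript=evalscript,
    input_data=[SentinelHubRequest.input_data(data_collection=DataCollection.SENTINEL2_L1C)],
    responses=[SentinelHubRequest.output_response("default", MimeType.TIFF)],
    bbox=bbox,
    size=size,
    config=config,
)

cloud_free_rgb = request.get_data()[0]  # numpy array with cloudy pixels blanked out
```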
There was quite a lot of thinking and discussion that went into it. Hey, if you're gonna run something on 13 million scenes, you'd better get it right in the first run. One of the most important questions we asked ourselves was which resolution to process it at. The first answer was "the highest possible, of course", so 10 meters. But does this really make sense? Many of the bands going into the algorithm are at 60 meters, which means that scientifically the accuracy of the result is not better than 60 meters. Furthermore, clouds are typically not very small chips in the sky, so will a couple of meters make a difference? Lastly, the algorithm performs buffering at the end to compensate for border areas, which are affected by clouds even though they are not really cloudy. The objective of this specific cloud detection is to remove messy pixels from the follow-up processing, and in this case it is better to leave out one too many rather than too few.

On the other hand, there are costs directly correlated with resolution, both at processing time (i.e. on our side) and when the data is used (more data means more memory and longer processing times, therefore more expensive VMs and higher costs for the user). Therefore we chose the 160 m resolution, very close to what our data science team had already been using for a couple of years within eo-learn, with good results, and a sweet spot in terms of data processing cost. We find these considerations very important. It is easy and sexy to do it at the highest resolution. And one feels powerful when spinning up hundreds of VMs. However, if we want to create a sustainable business, we need to optimize our costs (we are not a start-up favouring growth over business results). It is crucial that the added value we create is larger than the cost that went into the creation.
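The resolution trade-off is easy to quantify with some back-of-the-envelope arithmetic; the snippet below, using the 100 x 100 km scene size mentioned above, simply makes the pixel-count scaling explicit.

```python
# Back-of-the-envelope pixel counts per 100 x 100 km scene at a few candidate
# resolutions. The 256x factor quoted below for the 10 m case follows directly from this.
SCENE_SIZE_M = 100_000  # one Sentinel-2 scene covers roughly 100 x 100 km

for resolution in (10, 60, 160):
    pixels = (SCENE_SIZE_M // resolution) ** 2
    factor = (160 / resolution) ** 2
    print(f"{resolution:>3} m -> {pixels:>12,} pixels per scene ({factor:.0f}x the 160 m count)")
```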
It took us quite a while to set up the (efficient) mass-scale processing, to test it, to validate the results with the data scientists, to update it, re-validate it… Altogether we spent several weeks on tweaking and testing. Once we had everything ready and sample territories done, we wrote a blog post and planned to process the globe in a couple of weeks. Then, however, we got a surprise message that Google was doing the same in GEE, using our own library. This pushed us a bit, so we sped up the process. The AWS Open Data team kindly granted us credits for this task, so we did not have to worry about every single €/$. We parallelized the process over two auto-scaling groups, each occupying up to 800 Spot instances (depending on availability), and they churned through 13 million scenes at a peak processing rate of 780 scenes per second. It took about 9.5 hours. On top of a couple of weeks of work. On top of two years of practical experience with cloud cover. But hey, we made it in less than a day!
And now to the costs… Processing 6 PB of data does cost something… We try to make use of Spot instances whenever we can to keep the costs down. The total cost for VMs was $613.65, plus about $56 for S3 GET requests and $680 for a year of storage. If we had processed the cloud masks at 10-meter resolution, it would have cost at least 256 times as much, probably more. Without much added value…
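For anyone who wants to sanity-check these numbers, the arithmetic is simple; the split between one-off processing and recurring storage is taken from the figures above, and the 10 m what-if just applies the 256x pixel-count factor from the resolution discussion.

```python
# Rough sanity check of the quoted costs and the 10 m what-if estimate.
vm_cost = 613.65        # Spot instance compute for the whole run (USD)
s3_get_cost = 56.0      # S3 GET requests during processing (USD, approximate)
storage_per_year = 680  # storage of the produced masks for one year (USD, approximate)

one_off = vm_cost + s3_get_cost
print(f"one-off processing cost: ${one_off:,.2f}")
print(f"first-year total:        ${one_off + storage_per_year:,.2f}")

# At 10 m instead of 160 m there are (160 / 10) ** 2 = 256 times as many pixels,
# so compute and storage would scale at least by that factor.
factor = (160 / 10) ** 2
print(f"10 m what-if, compute only: ${vm_cost * factor:,.0f} or more")
```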
So this is it for the clouds. There are improvements possible, obviously, but we believe that the produced masks are good enough for most purposes. We might introduce the multi-temporal option in the future, once we see it works well enough. There are, however, still the shadows. We do not have a good solution for these yet, so we are hoping someone else will step in.
We would like to express our gratitude to the Amazon Web Services Open Data team, who made all of this possible: they made the S3 storage available for hosting the Sentinel data (now holding more than 14 PB), believed in us to manage this archive, and supported us in establishing Sentinel Hub from the very beginning, with some credits, lots of promotion and other benefits. Thanks especially to Jed and Joe.
Further reading about this topic: