Digital Twin Sandbox Sentinel-2 collection available to everyone

Have you ever wanted to perform rapid prototyping of your machine learning models on a global scale? Now you can.

Grega Milcinski
Sentinel Hub Blog


Written by: Matic Lubej and Grega Milcinski

Sometimes things neatly fall into place. Over the New Year “holidays”, which were not really festive this year, one of our data scientists, Matic Lubej, was toying with the idea of creating a Sentinel-2 time-lapse of the whole world over a full year. That is quite a lot of data to process (around 3 PB, or hundreds of trillions of pixels), perhaps a bit too much just to get an animated GIF a couple of seconds long. Then the idea struck: let’s store the intermediate results and make them available to anyone who wants to use them in their work.

“Such a collection is an essential input for any kind of EO ML process. To make it oneself, one would have to process several PB of data. We’ve done this for you.”

When running ML experiments with EO imagery, one typically needs a multi-temporal and multi-spectral stack of data, ideally clear of clouds and, even better, without gaps due to clouds or orbit borders. A while ago we implemented the concept of an interpolated patch: a harmonized dataset where the best pixel values are taken over the specified period and then interpolated to fit a uniform temporal spacing (similar to what Pangeo and xcube are doing). There are easy ways to build these on the fly, as well as cost-efficient ways to do it at large scale. However, at a really large scale it becomes expensive and time-consuming both to produce the harmonized dataset and to use it. This gave us the idea to create a 120-meter resolution harmonized yearly stack, which is still feasible, in terms of cost, both to produce and to consume: it is just 5 TB of data for a whole year, something one can process in a reasonable time. It is therefore easy to create various ML-based workflows and try them out on a global scale: land cover, crop classification, yield prediction and a ton of other things that we cannot even imagine.
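To make the idea of an interpolated patch more concrete, here is a minimal sketch for a single pixel and a single band. It assumes plain linear interpolation of cloud-free observations onto a uniform 10-day grid; the function name, the sample values and the cloud mask are hypothetical, and the actual processing script behind the collection may select and interpolate pixels differently.

```python
import numpy as np

def interpolate_to_uniform_grid(days, values, valid, step=10, start=0, end=365):
    """Linearly interpolate the valid observations of one pixel onto a uniform time grid.

    days   : acquisition times as day offsets from the start of the period
    values : band values observed at those times
    valid  : True where the observation is usable (e.g. cloud-free)
    """
    days = np.asarray(days, dtype=float)
    values = np.asarray(values, dtype=float)
    valid = np.asarray(valid, dtype=bool)
    grid = np.arange(start, end, step, dtype=float)

    if not valid.any():
        # No usable observation at all: return a gap
        return grid, np.full(grid.shape, np.nan)

    # np.interp expects increasing x, so sort the valid observations by time
    order = np.argsort(days[valid])
    x = days[valid][order]
    y = values[valid][order]
    # Outside the observed range np.interp holds the first/last valid value
    return grid, np.interp(grid, x, y)

# Hypothetical series: two observations are flagged as cloudy and ignored
days = [3, 8, 13, 23, 28, 43]
values = [0.10, 0.90, 0.12, 0.16, 0.95, 0.20]
valid = [True, False, True, True, False, True]

grid, series = interpolate_to_uniform_grid(days, values, valid, step=10, start=0, end=50)
print(grid)
print(series)
```

The cloudy outliers (0.90 and 0.95) never reach the output, and every grid step gets a value even where no acquisition exists, which is what makes the stack uniform in time.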

About the collection

SOURCE: Sentinel-2 L2A
TEMPORAL AVAILABILITY: 10-daily periods in 2019 (2020 to be added in April, previous years as well if we find the data useful) *
GEOGRAPHICAL AREA: Land surface area between 58 degrees South and 72 degrees North **
BAND INFORMATION: B02 (blue), B03 (green), B04 (red), B08 (NIR), B11 (SWIR), B12 (SWIR)
PROCESSING SCRIPT: Interpolated time-series
COGs tiled by UTM zones ***
* To produce one year of data, we processed 18 months, covering three months before and after, to ensure continuity over a longer period.
** We are aware of some technical glitches resulting in about 0.1% of the data missing (you will notice black gaps, they are obvious). These will be added shortly.
*** Technically we use Sentinel Hub's 100km grid for Batch processing, then merge it by UTM zones.
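As a small illustration of the temporal layout, the sketch below lists 10-day periods for 2019. It assumes the periods start on 1 January and that the last one is cut off at the end of the year; the exact interval boundaries used in the collection are not stated here, so treat the dates as hypothetical.

```python
from datetime import date, timedelta

def ten_day_intervals(year):
    """Return (start, end) date pairs of consecutive 10-day periods covering one year.

    The end date is exclusive; the final period is shortened to stop at 1 January
    of the following year (an assumption, not a documented property of the collection).
    """
    start = date(year, 1, 1)
    year_end = date(year + 1, 1, 1)
    intervals = []
    while start < year_end:
        end = min(start + timedelta(days=10), year_end)
        intervals.append((start, end))
        start = end
    return intervals

intervals = ten_day_intervals(2019)
print(len(intervals))
print(intervals[0], intervals[-1])
```

Under these assumptions a 365-day year splits into 36 full periods plus one shorter one, 37 in total.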

The data is available via several means:

The Alps, which would be full of skiers and mountaineers during this period of the year. Slovenia, where our company is located, is in the eastern part of the image.

What is this good for?

First and foremost, we envision this collection being used in various machine learning exercises. Just about any model that works on Sentinel-2 data should work with this data as well, hopefully even better, as the data is cloudless (wherever possible). We have removed most of the complexity of remote sensing, making the collection easy to use even for non-experts, e.g. computer vision data scientists. As a starting point, one can perform land cover or crop-type classification, perhaps bare soil or mowing detection. Sentinel Hub custom scripts work here as well, e.g. monthly snow report or snow cover change detection (make sure to check whether the script is using DN or reflectance and adjust accordingly).
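The DN versus reflectance remark matters mostly for scripts with absolute thresholds. The sketch below assumes the usual Sentinel-2 convention that DN equals reflectance multiplied by 10000; check the collection's documentation for the actual scaling before relying on it. The helper name and the sample values are hypothetical.

```python
import numpy as np

# Assumption: digital numbers are reflectance scaled by 10000
DN_SCALE = 10000.0

def dn_to_reflectance(dn, scale=DN_SCALE):
    """Convert digital numbers to reflectance under the assumed scaling."""
    return np.asarray(dn, dtype=np.float64) / scale

# Hypothetical green-band (B03) values stored as DN
b03_dn = np.array([0, 2500, 4200, 10000])
b03 = dn_to_reflectance(b03_dn)

# A threshold written for reflectance (here an illustrative 0.3) only makes sense
# after conversion; applied directly to DN it would flag nearly every pixel
bright = b03 > 0.3
print(b03)
print(bright)
```

Band ratios such as normalized difference indices are unaffected by a pure scale factor, so only scripts that compare band values against fixed numbers need this adjustment.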

Having the data easily accessible at such a large scale also makes it convenient to observe various global phenomena, for research as well as for educational purposes. An example that is often shown is a variation of NDVI during the year. Now you can perform these analyses yourself, in EO Browser, Jupyter Notebook or elsewhere.

The mesmerizing breathing cycle of our Planet Earth
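The NDVI variation mentioned above can be computed from the red (B04) and near-infrared (B08) bands as (B08 - B04) / (B08 + B04). Here is a minimal sketch on a tiny hypothetical stack shaped (time, height, width); loading the real data is left out, and the array values are made up for illustration.

```python
import numpy as np

def ndvi(nir, red):
    """Normalized difference vegetation index; NaN where both bands are zero."""
    nir = np.asarray(nir, dtype=np.float64)
    red = np.asarray(red, dtype=np.float64)
    total = nir + red
    out = np.full(total.shape, np.nan)
    # Divide only where the denominator is non-zero; other cells keep NaN
    np.divide(nir - red, total, out=out, where=total != 0)
    return out

# Hypothetical stack: 2 time steps, 1 x 2 pixels; the last pixel is a data gap
b08 = np.array([[[0.5, 0.4]], [[0.3, 0.0]]])
b04 = np.array([[[0.1, 0.4]], [[0.1, 0.0]]])

ndvi_stack = ndvi(b08, b04)
# One mean NDVI value per time step, ignoring gaps
mean_per_step = np.nanmean(ndvi_stack, axis=(1, 2))
print(mean_per_step)
```

Plotting such a per-interval mean over the year, for a region or for the whole globe, gives the seasonal curve.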

One could also use the data as a nice background map: the resolution is twice that of the Blue Marble, and the fact that it is available in 37 intervals over the year might make it very interactive. With the COGs being openly accessible, it should be straightforward to integrate them using Sentinel Hub OGC services or rio-tiler-mosaic.

Last but not least, there is a ton of material for artistic expressions of our planet, either for still imagery or by creating beautiful time-lapses.

What will you do with it?

We are eager to see how the community will use it!

Share your experiences with us, as well as ideas for future evolutions of the data.

Interested in other parts? Spin it here!

This activity was performed within the scope of the Horizon 2020 Global Earth Monitor project and the European Space Agency’s Digital Twin activities within Phi Lab. Thanks to both for the support and, importantly, the great Sentinel-2 dataset, as well as to CreoDIAS and AWS for their willingness to share the data publicly.

The project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement 101004112, Global Earth Monitor project.