Large-scale data preparation — introducing Batch Processing

Grega Milcinski
Published in Sentinel Hub Blog · Jan 7, 2020

A few years ago, when designing the Sentinel Hub Cloud API as an option to access petabyte-scale EO archives in the cloud, our assumption was that people access the data sporadically — each consuming a different small part. Very rarely, or almost never, would they download a full scene, e.g. 100x100 km, so there was no point in focusing on this part. It looks like our guess was right, albeit with a bit of a twist. Indeed, the vast majority of users consume small parts at once — often going to the extreme, e.g. just a few dozen pixels (a typical agricultural field of 1 ha is composed of 100 pixels). There are, however, a few users, less than 1% of the total, who consume a bit more. Quite a bit, one could say, as they generate almost 80% of the volume processed. They typically operate a machine learning process. At. Large. Scale. Noticing these patterns, we started thinking about how we could make their workflows more efficient. Batch Processing is our answer to this: managing large-scale data processing in an affordable way.

The basic Sentinel Hub API is a perfect option for anyone developing applications relying on frequently updated satellite data, e.g. Sentinel-2. There is a single end-point, where one simply provides the area of interest (e.g. field boundaries), the acquisition time, a processing script and some other optional parameters, and gets results almost immediately — often in less than 0.5 seconds. No unnecessary data download, no decoding of various file formats, no bothering with scene stitching, etc. As long as the data was taken by the satellite, it simply is there. And it costs next to nothing — 1,000 EUR per year allows one to consume 1 million sq. km of Sentinel-2 data each month. A developer working on a precision farming application can serve data for tens of millions of “typical” fields every 5 days.
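To make this concrete, below is a minimal sketch of such a request made with plain Python and the requests library. The endpoint and payload follow the general process API structure described in the documentation, but the token, bounding box, time range and evalscript are placeholders to be adapted; treat it as an illustration rather than a copy-paste recipe.

```python
# A minimal process API sketch (Python + requests): a single POST returns an NDVI GeoTiff
# for a small bounding box. TOKEN, the bbox, the time range and the evalscript are
# placeholders; consult the process API documentation for the authoritative schema.
import requests

TOKEN = "<oauth-access-token>"   # obtained via your Sentinel Hub OAuth client
URL = "https://services.sentinel-hub.com/api/v1/process"

EVALSCRIPT = """
//VERSION=3
function setup() {
  return { input: ["B04", "B08"], output: { bands: 1, sampleType: "FLOAT32" } };
}
function evaluatePixel(sample) {
  return [(sample.B08 - sample.B04) / (sample.B08 + sample.B04)];
}
"""

payload = {
    "input": {
        "bounds": {"bbox": [14.45, 46.02, 14.55, 46.08]},   # a small AOI in lon/lat
        "data": [{
            "type": "S2L2A",
            "dataFilter": {"timeRange": {"from": "2019-07-01T00:00:00Z",
                                         "to": "2019-07-15T23:59:59Z"}}
        }]
    },
    "output": {
        "width": 512, "height": 512,
        "responses": [{"identifier": "default", "format": {"type": "image/tiff"}}]
    },
    "evalscript": EVALSCRIPT
}

resp = requests.post(URL, json=payload, headers={"Authorization": f"Bearer {TOKEN}"})
with open("ndvi.tif", "wb") as f:
    f.write(resp.content)   # the result arrives directly, typically in well under a second
```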

Three-month Sentinel-2 mosaic of Australia (August — October 2019) using Pierre Markuse’s mosaicking script. 10 million sq. km and more than 30 TB of data processed in about 15 minutes to produce a cloudless mosaic at 120 m spatial resolution (GeoTiff, 2 GB, other resolutions)

Data scientists, however, “abused” (we are super happy about this kind of abuse!) the convenience of the API and integrated it into a “for loop” that splits the area into 10x10 km chunks, downloads various indices and raw bands for each available date, then creates a harmonized time-series feature by filtering out cloudy data and interpolating values to get uniform temporal periods. The process is pretty straightforward but also prone to errors. With millions of such requests, some will fail and one has to retry them. It might also take quite a while: days or even weeks. Last but not least, this no longer “costs nothing”.
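The skeleton of such a loop looks roughly like the sketch below. The actual per-chunk request is abstracted into a hypothetical fetch_chunk() placeholder (in practice a process API call like the one sketched earlier); the point here is the chunking and retry bookkeeping that the user has to own themselves.

```python
# A sketch of the "for loop" pattern: split the area into ~10x10 km chunks, request each
# chunk/date combination with retries, and collect the results for later cloud filtering
# and temporal interpolation. fetch_chunk() is a hypothetical placeholder, not a real API.
import time

def fetch_chunk(bbox, date):
    """Placeholder: wrap a process API call (see the earlier sketch) returning pixel data."""
    return None   # replace with the actual request

def fetch_with_retries(bbox, date, attempts=3, backoff=5.0):
    """Retry transient failures; with millions of requests, some are bound to fail."""
    for attempt in range(attempts):
        try:
            return fetch_chunk(bbox, date)
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff * (attempt + 1))

def split_bbox(min_x, min_y, max_x, max_y, step=10_000):
    """Split a UTM bounding box (metres) into roughly 10x10 km chunks."""
    chunks = []
    x = min_x
    while x < max_x:
        y = min_y
        while y < max_y:
            chunks.append((x, y, min(x + step, max_x), min(y + step, max_y)))
            y += step
        x += step
    return chunks

# results[(chunk, date)] -> raw bands / indices; cloud masking and interpolation to a
# uniform temporal grid would follow to build the time-series feature shown below.
results = {
    (chunk, date): fetch_with_retries(chunk, date)
    for chunk in split_bbox(500_000, 5_000_000, 520_000, 5_020_000)   # made-up UTM extent
    for date in ["2019-05-01", "2019-05-06", "2019-05-11"]            # made-up dates
}
```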

Typical time-series feature used for machine learning

We realized that for such a use case we can optimize our internal processing flow and at the same time make the workflow simpler for the user — we take care of the loops, scaling and retrying, and simply deliver results when they are ready.

The Batch Processing workflow is straightforward (a rough API sketch follows the list):

  1. Prerequisites are a Sentinel Hub account and a bucket on object storage on one of the clouds supported by Batch (currently the AWS eu-central-1 region, with CreoDIAS and Mundi coming soon).
  2. Configure the request using the process API pattern. It is probably best to try the request first with the process API (on a smaller area or at a coarser scale), as you get results immediately and can iteratively adjust the request until you get the result you are looking for.
  3. Adjust the request parameters so that they fit the Batch API and execute the request over the full area — e.g. a country or a continent. This starts preparatory work but does not yet start the processing itself. The request identifier is included in the response for later reference. Another very important piece of information returned is the estimate of the processing units required to process everything (this depends on the area size, resolution, number of scenes in the region, temporal period, etc.).
  4. Run the analysis on the request to move to the next step (the processing unit estimate might be revised at this point).
  5. Start the process. We will now split the area into smaller chunks and parallelize processing to hundreds of nodes.
  6. There is an API function (/tiles) that lets you know which parts of the results are already available.
  7. There is an API function to check the status of the request; the processing itself will take from 5 minutes to a couple of hours, depending on its scale.
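The sketch below walks through steps 3–7 with plain HTTP calls. The endpoint paths and field names follow the Batch API pattern from the documentation, but treat the exact schema (tiling grid id, tile path placeholders, status values) as assumptions to verify against the docs; the token, evalscript, area of interest and bucket path are placeholders.

```python
# A rough sketch of the Batch API lifecycle: create -> analyse -> start -> poll.
# TOKEN, EVALSCRIPT, AREA_OF_INTEREST and the bucket path are placeholders; field names
# and endpoints are based on the Batch API documentation and should be double-checked there.
import time
import requests

TOKEN = "<oauth-access-token>"
EVALSCRIPT = "<the evalscript already tested with the process API>"
AREA_OF_INTEREST = {   # e.g. a country boundary; a tiny square here for illustration
    "type": "Polygon",
    "coordinates": [[[12.4, 41.8], [12.6, 41.8], [12.6, 42.0], [12.4, 42.0], [12.4, 41.8]]]
}

API = "https://services.sentinel-hub.com/api/v1/batch/process"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

batch_request = {
    "processRequest": {                       # the same structure as a process API request
        "input": {
            "bounds": {"geometry": AREA_OF_INTEREST},
            "data": [{"type": "S2L2A",
                      "dataFilter": {"timeRange": {"from": "2019-08-01T00:00:00Z",
                                                   "to": "2019-10-31T23:59:59Z"}}}]
        },
        "output": {"responses": [{"identifier": "default",
                                  "format": {"type": "image/tiff"}}]},
        "evalscript": EVALSCRIPT
    },
    "tilingGrid": {"id": 0, "resolution": 10.0},   # one of the predefined UTM grids
    "output": {"defaultTilePath": "s3://my-bucket/my-run/<tileName>/<outputId>.tif"},
    "description": "example batch run"
}

# step 3: create the request (preparatory work only; nothing is processed yet)
created = requests.post(API, json=batch_request, headers=HEADERS).json()
request_id = created["id"]   # keep for later reference; the response also includes
                             # an estimate of the required processing units

# step 4: run the analysis (in practice, wait for it to complete before starting)
requests.post(f"{API}/{request_id}/analyse", headers=HEADERS)

# step 5: start the actual processing
requests.post(f"{API}/{request_id}/start", headers=HEADERS)

# steps 6-7: poll the status and the list of tiles that are already finished
while True:
    status = requests.get(f"{API}/{request_id}", headers=HEADERS).json().get("status")
    tiles = requests.get(f"{API}/{request_id}/tiles", headers=HEADERS).json()
    print("status:", status, "| tiles response keys:", list(tiles))   # inspect progress
    if status in ("DONE", "FAILED"):   # terminal state names per the docs
        break
    time.sleep(60)
```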

In the end, results will be nicely packed as GeoTiffs (COG will be supported soon as well) in the user’s bucket, ready to be used for whatever follows next.

There are several advantages to this approach:

  • no need to manage the pre-processing flow yourself,
  • much faster results (the rate limits from the basic account settings are not applied here),
  • less costly.

While building the Batch Processor we assumed that areas might be very large, e.g. whole-world large. It therefore does not make sense to package everything into a single GeoTiff — it would simply be too large. When thinking about which grid would work best, we realized that the choice is not as straightforward as one would expect. The existing Sentinel-2 MGRS grid is certainly a candidate, but it contains many (too many) overlaps, which would result in unnecessary processing and wasted disk storage. So we took that grid and cleaned it up quite a bit. It is also important that the tile size fits the various resolutions, as one does not want half a pixel at the tile border. And for different resolutions it makes sense to have different tile sizes. We currently support 10, 20, 60, 120, 240 and 360 meter resolution grids based on UTM and will extend this to WGS84 and other CRSs in the near future.
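The half-pixel constraint is easy to check numerically: a tile size works at a given resolution only if it is an integer multiple of the pixel size. The sketch below uses a made-up tile size, not the actual Batch grid values.

```python
# A toy check of the "no half pixels at the border" constraint: a tile edge length (metres)
# is usable at a given resolution only if it divides into a whole number of pixels.
# The tile size below is a made-up example, not an actual Batch grid value.
TILE_SIZE_M = 120_000                      # hypothetical tile edge length
RESOLUTIONS_M = [10, 20, 60, 120, 240, 360]

for res in RESOLUTIONS_M:
    pixels = TILE_SIZE_M / res
    ok = pixels.is_integer()
    print(f"{res:>3} m -> {pixels:g} pixels per edge: {'OK' if ok else 'half-pixel at the border'}")
```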

The default grid, based on Sentinel-2 MGRS, but optimized to avoid “orbit” overlaps

The beauty of the process is that data scientists can tap into it, monitor which parts (grid cells) have already been processed and access those immediately, continuing the workflow (e.g. machine learning modeling). And, if it makes sense, they can also delete them immediately so that disk storage is used optimally (we do see people processing petabytes of data with this, so it makes sense to avoid unnecessary bytes).
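One way to implement that consume-and-delete pattern is to watch the output bucket (and/or the /tiles endpoint mentioned above) and hand each finished GeoTiff straight to the downstream pipeline. The sketch below uses boto3; the bucket name, prefix and process_tile() function are hypothetical placeholders.

```python
# A sketch of the "tap in as tiles finish" pattern: list the output bucket periodically,
# pull each finished GeoTiff into the downstream pipeline, then delete it so storage does
# not accumulate. BUCKET, PREFIX and process_tile() are placeholders.
import time
import boto3

BUCKET = "my-batch-output-bucket"   # placeholder
PREFIX = "my-run/"                  # placeholder

def process_tile(local_path):
    """Placeholder for the downstream step, e.g. building ML features from the tile."""
    pass

s3 = boto3.client("s3")
seen = set()

while True:   # in practice, stop once the batch request reports a terminal status
    listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    for obj in listing.get("Contents", []):
        key = obj["Key"]
        if key in seen or not key.endswith(".tif"):
            continue
        local_path = "/tmp/" + key.replace("/", "_")
        s3.download_file(BUCKET, key, local_path)   # fetch the finished tile
        process_tile(local_path)                    # consume it immediately...
        s3.delete_object(Bucket=BUCKET, Key=key)    # ...and free the storage right away
        seen.add(key)
    time.sleep(60)
```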

The Batch Processor is not useful only for machine learning tasks. One can also create cloudless mosaics of just about any part of the world using their favorite algorithm (perhaps an interesting tidbit: we designed Batch Processing based on the experience with the Sentinel-2 Global Mosaic, which we have been operating for two years now), or create regional-scale phenology maps and the like.

There are also some short-term future plans for further development:

  • Exploitation within eo-learn to support large-scale generation of EOPatches and parallel consumption of these.
  • Integration with xcube, the Python toolkit of one of our Euro Data Cube project partners for typical data cube operations based on xarray. We have recently successfully integrated it with the basic process API, resulting in cool on-the-fly analysis, which will only be extended further. In a very similar fashion, integration could be done with Open Data Cube, exploiting the Batch Processor to automatically, and at scale, pre-process the needed input data.
  • Support of “Bring your own COG” functionality so that generated datasets will immediately become available through standard Sentinel Hub API.

The basic Batch Processor functionality is now stable and available as a staged roll-out in order to test various cases. If you would like to try it out and build on top of it, make sure to contact us. We are eager to see what trickery our users will come up with!

For technical information, check the documentation.

What will the exploitation of EO data bring us?
