Do you really want people using your data ?

Vincent Sarago
Nov 19, 2018

Note: This post was originally called: `The Ultimate data format`

In this post we will focus on Cloud Optimized GeoTIFF (COG) and other formats used by public datasets (AWS PDS, DigitalGlobe Open Data, …). This post is mostly a brain dump of some thoughts and knowledge I have needed to share since RemotePixel's huge AWS bill last August. I hope this will give some clues, or at least some ideas, to people who want to open/share raster datasets.

First, can you guess the difference between both images 👇

Both are the same file: on the left is the raw data from the DigitalGlobe Open Data Program, and on the right is the same file transformed to COG using rio-cogeo.

PS: I ❤️ DigitalGlobe and the goal of this whole introduction is not to blame them for the format. We can't blame them for giving us free data 😃, especially for disaster response (DigitalGlobe Open Data Program).

Well, the files are almost the same, except that COGs have internal overviews and internal tiling. The biggest difference is the storage size: 1.5 GB vs 69 MB 😱

So how do you produce a file which is 22x lighter ? Well, the answer is compression! I won't go too deep into compression itself but you should check this awesome article by Koko Alberti: https://kokoalberti.com/articles/geotiff-compression-optimization-guide/.

For the file above we used WEBP compression, which has just been added to GDAL's libtiff by Norman Barker and Even Rouault in #704. "WebP is a modern image format that provides superior lossless and lossy compression for images on the web" (source), developed by Google. This compression scheme claims to be better than JPEG (lossy) and PNG (lossless):

WebP lossless images are 26% smaller in size compared to PNGs. WebP lossy images are 25–34% smaller than comparable JPEG images at equivalent SSIM quality index.

The WEBP format is supported by most browsers (except Safari) and image software… and now inside GeoTIFF 🎉 (supported in QGIS if built against GDAL 2.4.0 or HEAD).

Can you spot the difference 👇 ?

WebP vs Raw (it's a GIF)

👆Look closely, this is a GIF that shows the difference between Raw and WEBP. It's really hard to spot but WEBP compression introduces artifacts (when using default parameters) which should be acceptable at least for visualisation.

Alright, enough with WebP. (Note: JPEG compression would have saved a lot of space too).

AWS Public Dataset: PDS

Let's look at the formats used by three major AWS Public Datasets: CBERS-4, Landsat-8 and Sentinel-2.

Note: Most of the following numbers come from https://github.com/vincentsarago/awspds-benchmark

CBERS-4, Landsat-8 and Sentinel-2 data formats.

Those three datasets each have their own 👍/👎 (e.g. Landsat and CBERS are both GeoTIFF but Landsat uses external overviews). The biggest difference is for Sentinel-2, which uses JPEG2000 compression.

Format matters

For years I've heard that users shouldn't need to download the data but should use/create services to access it via the cloud. While this is a good idea (who wants to store gigabytes of data on their own laptop), the data format has a huge impact on processing/access cost, which can result in a bill of thousands of dollars.

RemotePixel use case

If you are reading this post it might be because you also know my side project RemotePixel.ca, and maybe you remember my last post.

Let's see how my August AWS bill is related to the Sentinel-2 data format. The bill was mostly due to GET/LIST requests, which are billed to the AWS user making them since the sentinel-2-l1c bucket is in `requester-pays` mode.

Remotepixel's AWS cost in August 2018.

The LIST requests (2 600$) were due to RemotePixel's simple search API, which now seems not simple but totally dumb.

The other part of the bill was due to GET requests (nearly 1 billion GET requests 😱).

I believe most of the Sentinel-2 data requests came from the RemotePixel viewer, which was at the time a really simple AWS PDS viewer (now only Landsat and CBERS-4 data are available on the viewer). So basically, users were able to visualize Sentinel-2 data using a tile server based on AWS Lambda. The idea behind the tile server is 1 tile = 1 Lambda call, but when checking the number of AWS Lambda calls there was something odd. There were only 1 million calls … responsible for 1 billion GET calls 🤔. How is this possible ? Let's check how many GET requests GDAL makes when reading a file over the internet 👇

AWS PDS benchmark: https://github.com/vincentsarago/awspds-benchmark

We have our answer 🎉 😱 😢: getting a mercator tile for Sentinel-2 data needs more than 100 HTTP calls (GET) per band… (😱 again) while for Landsat it's around 5 calls.

A better data format ?

Well again, there is no ultimate data format, but let's see how those three PDS would behave if we translated them to proper COGs (512x512 internal tiling, internal overviews, high-level Deflate compression) using rio-cogeo.
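For readers who prefer plain GDAL, the recipe above (internal overviews, 512x512 internal tiles, Deflate at a high level) can be sketched as two commands. This is only a rough sketch and not what rio-cogeo does internally: the file names and overview levels are placeholders I made up, and the commands are built and printed here, not executed.

```python
import shlex


def cog_commands(src, dst, blocksize=512, overview_levels=(2, 4, 8, 16)):
    """Build GDAL commands for a tiled, Deflate-compressed GeoTIFF
    with internal overviews. Paths and levels are illustrative only."""
    # Step 1: build overviews on the source file.
    gdaladdo = ["gdaladdo", "-r", "average", src]
    gdaladdo += [str(level) for level in overview_levels]
    # Step 2: rewrite the file with internal tiling and compression,
    # copying the overviews created in step 1.
    gdal_translate = [
        "gdal_translate", src, dst,
        "-co", "TILED=YES",
        "-co", f"BLOCKXSIZE={blocksize}",
        "-co", f"BLOCKYSIZE={blocksize}",
        "-co", "COMPRESS=DEFLATE",
        "-co", "ZLEVEL=9",
        "-co", "COPY_SRC_OVERVIEWS=YES",
    ]
    return [gdaladdo, gdal_translate]


for cmd in cog_commands("input.tif", "output_cog.tif"):
    print(shlex.join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where GDAL is installed.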

Fewer HTTP calls and less data transfer 🎉 (the Landsat and CBERS datasets are also lighter).

Size / computing time / access cost

At the end of the day, people mention size as the key point when choosing a data format. This is (I think) why we have a Sentinel-2 archive in JPEG2000 format, and when I see my August AWS bill, this makes me sad. JPEG2000 is not a cloud-friendly format: even with the most advanced driver (KDU), partial reading over the internet transfers about 1.6 times more data (1.3 MB vs 800 KB for the COG) and needs almost 25 times more GET requests (74 vs 3). But yes, the JPEG2000 file weighs only 95 MB while the proper COG version is around 180 MB.

What about processing time ?

COGs are made to be accessed partially over the internet, so you don't need to download the whole data (just get what you need). Basically you download less data so your process is faster.
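Under the hood, "partial access" means HTTP range requests: the reader asks the server for a slice of bytes instead of the whole file. Here is a minimal sketch of what one such request looks like. The URL and the byte offsets are hypothetical; a real reader such as GDAL finds the offsets of the tiles it needs in the TIFF header.

```python
import urllib.request


def range_request(url, offset, length):
    """Build a GET request for `length` bytes starting at `offset`."""
    last_byte = offset + length - 1  # HTTP byte ranges are inclusive
    return urllib.request.Request(
        url, headers={"Range": f"bytes={offset}-{last_byte}"}
    )


# Hypothetical example: ask only for the first 16 KB of a remote file.
req = range_request("https://example.com/my_cog.tif", 0, 16384)
print(req.get_header("Range"))  # bytes=0-16383
# Sending it would be: urllib.request.urlopen(req).read()
```

Every one of those requests is a billed GET, which is why the number of requests needed per tile matters as much as the amount of data transferred.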

On the other hand, JPEG2000 files are lighter, so you can download the whole file and process it locally… fortunately we now have OpenJPEG (a free and open source driver to read JPEG2000, shipped in GDAL by default) which is performant enough to extract the data locally, so the processing time should be acceptable, but again you'll need to download the whole file.

If you choose to read the JPEG2000 over the internet (as we saw earlier) this will result in a lot of GET calls and a lot of useless data transfer.

$ facts

AWS S3 pricing: https://aws.amazon.com/s3/pricing/

Based on ☝️ let's write up a scenario for a web viewer using AWS Lambda.

JPEG2000

  • Size: 25 TB
  • Storage: 25 TB * 1000 * 0.023 = 575 $ / month
  • 1M tile requests / month
  • Data access: (1M * 110 (GET requests) / 1000) * 0.004 = 440 $ *
  • Processing time (1536 MB AWS Lambda): (3 seconds * 1M * 1536 / 1000) * 0.00001667 $ = 76.81 $ **

*Using the Kakadu driver you might reduce this by half (~60 GET requests), but you have to pay a couple thousand $ to get the license.

**AWS Lambda costs 0.00001667 $ per GB-second | assuming 3 seconds per tile is quite optimistic

COST: 575 + 440 + 76.81 = 1091.81 $ (440 + 76.81 = 516.81 $ for processing)

COG (deflate)

  • Size: 50 TB
  • Storage: 50 TB * 1000 * 0.023 = 1150 $ / month
  • Data access: (1M * 5 (GET requests) / 1000) * 0.004 = 20 $
  • Processing time (1536 MB AWS Lambda): (1 second * 1M * 1536 / 1000) * 0.00001667 $ = 25.60 $ *

*Reading a tile from a COG is at least 3 times faster than for JPEG2000

COST: 1150 + 20 + 25.60 = 1195.60 $ (20 + 25.60 = 45.60 $ for processing)
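If you want to play with your own numbers, the two scenarios above can be written as a small function. This is just a sketch of the same arithmetic: the unit prices are the ones used in this scenario (check the S3 and Lambda pricing pages for current rates), and the function and parameter names are mine. Totals can differ from the hand-computed ones by a cent because of rounding.

```python
def monthly_cost(size_tb, gets_per_tile, seconds_per_tile,
                 tiles=1_000_000, lambda_mb=1536,
                 storage_per_gb=0.023, get_per_1000=0.004,
                 lambda_per_gb_second=0.00001667):
    """Return (storage, access, processing) in $ per month,
    using the same formulas as the scenario above."""
    storage = size_tb * 1000 * storage_per_gb
    access = tiles * gets_per_tile / 1000 * get_per_1000
    processing = (seconds_per_tile * tiles * lambda_mb / 1000
                  * lambda_per_gb_second)
    return storage, access, processing


# (size in TB, GET requests per tile, seconds per tile)
scenarios = {"JPEG2000": (25, 110, 3), "COG (deflate)": (50, 5, 1)}
for name, args in scenarios.items():
    storage, access, processing = monthly_cost(*args)
    print(f"{name}: total {storage + access + processing:.2f} $, "
          f"access + processing {access + processing:.2f} $")
```

Changing `tiles` is the interesting experiment: storage is a fixed cost, while access and processing grow with every tile served.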

Those numbers are based on assumptions, but I believe they are close to what's going on in the real world between JPEG2000 and COG. Basically, if you just care about storage cost, JPEG2000 is your best option, but in the end someone will have to pay $$$ to access/process the data. I believe that if you store the data and provide services around it, COG should be a better long-term solution.

The Ultimate data format ?

As we saw in the intro, image formats (compression) can have a huge impact on data accessibility and thus usage (it's easier to download a 70 MB file than a 1.5 GB one).

Short answer to the question: there is no such thing as an Ultimate data format; in the real world there are plenty of good data formats. At the end of the day it depends on what you want the user to do.

Here are the questions you should answer before choosing a format.

  • Do you want users to visualise the data online ?
  • Do you want users to download the data to run processes ?
  • Do you want users to create services on the cloud ?
  • Do you care about compression artefacts ?
  • What is your data type (Byte|Float|Int) ?
  • Do you provide processing services ?

Unsolicited 2 cents advice:

  • Use WEBP compression for RGB or RGBA datasets (there is a lossless option). This is the best option if you are looking for space savings, but sadly it is only compatible with GDAL 2.4.0 or later. JPEG compression might be a safer choice.
  • Use Deflate compression with PREDICTOR=2 and ZLEVEL=9 options for non-Byte or non-RGB datasets.
  • Use internal overviews any time.
  • Use 256 or 512 internal block size (256 for deflate and 512 for WEBP/JPEG compressed datasets ?)
  • Prioritize internal bitmask instead of nodata value. And maybe give $ to someone to fix the small `bug` in GDAL which puts bitmask at the end of COGs.
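To make the advice above concrete, here is a rough sketch that picks GeoTIFF creation options from the data type and band count. The function name and its arguments are hypothetical, the block sizes follow my tentative 256/512 suggestion, and it only covers compression and tiling (overviews and masks are separate steps).

```python
def gtiff_creation_options(dtype, band_count, lossy_ok=True,
                           webp_available=True):
    """Return GDAL GTiff creation options following the advice above.
    `webp_available` should be False for GDAL older than 2.4.0."""
    byte_data = dtype == "uint8"
    options = {"TILED": "YES"}
    if byte_data and band_count in (3, 4) and webp_available:
        # RGB/RGBA: WEBP, lossy by default, lossless on request.
        options.update(COMPRESS="WEBP", BLOCKXSIZE="512", BLOCKYSIZE="512")
        if not lossy_ok:
            options["WEBP_LOSSLESS"] = "TRUE"
    elif byte_data and band_count == 3 and lossy_ok:
        # Safer fallback for RGB when WEBP is not available.
        options.update(COMPRESS="JPEG", BLOCKXSIZE="512", BLOCKYSIZE="512")
    else:
        # Everything else: lossless Deflate with a predictor.
        options.update(COMPRESS="DEFLATE", PREDICTOR="2", ZLEVEL="9",
                       BLOCKXSIZE="256", BLOCKYSIZE="256")
    return options


# Turn the options into gdal_translate "-co" arguments.
opts = gtiff_creation_options("uint16", 1)
print(" ".join(f"-co {key}={value}" for key, value in opts.items()))
```

The same dictionary could also be passed as keyword arguments when writing a file with rasterio, but treat the exact values as a starting point and benchmark on your own data.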

Vincent Sarago

Making COG at @DevelopmentSeed & Creator of @RemotePixel