COG Talk — Part 4: Enabling Spatio-temporal data processing at scale

Published in

Development Seed

8 min readFeb 10, 2020

This blog is the fourth in a series called `COG Talk`, which looks at ways to use Cloud Optimized GeoTIFFs to efficiently render and analyze planetary data at massive scale.

After a refresh of what COGs are in Part 1, the introduction of mosaics in Part 2, and a fun experiment in Part 3, today we are going to see how COGs can be useful for large scale spatio-temporal dataset.

Introduction slide from a talk given at GéoMTL conference (slides).

Cloud Optimized GeoTIFFs (COGs)

First, the basics. As of today, the Cloud Optimized GeoTIFF specification can be summarized as a tiny list of requirements:

the data has to be tiled (internally split into chunks of regular size)
the file has a header with the location of each tile
the file can have internal overview

Basically, you take a well known open format (created in the 80's), enforce good usage and internal architecture, and then have a binary file optimized for remote access. Because the header has a map to the internal tiles and the geographic information, libraries like GDAL can easily understand which tiles to fetch (using GET Range-Requests) for a given area of interest (AOI), minimizing data transfer and HTTP requests.

Like this cute raccoon, GDAL is able to take just what it needs from a COG and runs really fast.

Web map tile is a common format for distributed processing (see chip-n-scale) or for simple raster dataset visualization. Because we can read partial parts of the COG in an optimized way, and, if present, obtain a preview of the high resolution data from internal overviews, we can dynamically generate the tiles from COGs at request time.

Web maps are often based on static raster tiles stored as jpeg or png. A full set of tiles is created for each zoom level and stored in a tree-based file structure that allows users to zoom and pan a map. This approach requires you to pre-generate a tile tree consisting of millions of files. With a COG you can use internal overviews to create multiple zooms and internal tiles to stand in for map tiles, and thus, only have one file to manage for a large area. We call this process "dynamic tiling" because we access the raw data (e.g. surface reflectance or elevation) and then apply algorithms on it before creating the tile to display in the browser.

How to create valid COGs

Common Geographic Information System (GIS) software like QGIS supports exporting raster data to COG natively but if you want to do it programmatically you can use GDAL commands

# First add internal overviews 
$ gdaladdo my-file.tif# Then translate the geotiff to a COG (`TILED=YES`) and keep overviews 
$ gdal_translate -of GTiff -co TILED=YES -co COPY_SRC_OVERVIEWS=YES -co COMPRESS=DEFLATE my-file.tif my-cog.tif$ rio cogeo my-file.tif my-cog.tif --cog-profile deflate

or use rio-cogeo (see COG Talk — Part 1)

$ rio cogeo my-file.tif my-cog.tif --cog-profile deflate

COGs everywhere

More organizations are storing their data as a COG ( Landsat level 2, USGS DEM, MODIS on AWS), and while it’s not the most storage-efficient format (in comparison to JPEG2000), a COG is a more user-friendly format that enables fast and cheap access to the data (see this blog for a comparison).

Having access to more data creates another kind of problem (a good one): due to the increase in data availability, we need to implement easier ways to access/process/share them at scale. Development Seed regularly advocates for open datasets, but what we love even more is to enable people to access and use the data. Take for instance, the example of opening up Landsat data. It was a big win for the open data community, but it posed several challenges using it with Earth Explorer. Libra was born as a result of needing a better tool and process for using this data, and still to this day has 3000 visitors per month 😱 !

One scene or a million!

The combination of distributed cloud services and an increase in datasets being stored as COGs, means users are now pivoting from single scene workflows to large scale processing (e.g. state, country wide). To support this, we created the mosaic-json specification, an open standard for representing metadata about sets of Cloud-Optimized GeoTIFF (see COG Talk - Part-2). With simplicity and performance in mind, the specification uses a simple quadkey based spatial index and allows overlapping scenes to create spatio-temporal mosaics.

In a dynamic tiling workflow, the mosaicJSON is a simple JSON file that acts as a proxy between Web Map tile requests (using Z-X-Y Slippy map tilenames) and the list of files intersecting with this tile.

Real World example

ABoVE: Landsat-derived Annual Dominant Land Cover Across ABoVE Core Domain, 1984–2014

Ref: https://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=1691
Format: GeoTIFF + internal overview

The ABoVE dataset is comprised of 175 different GeoTIFFs at 30m resolution derived from Landsat (5 and 7) surface reflectance values. While it covers a pretty large area, the other interesting part of this dataset is the temporal aspect, because each file has 31 different bands, one for each year between 1984 to 2014.

The dataset is distributed as a GeoTIFF with internal overview. Sadly, they are not aligned with the Cloud Optimized GeoTIFF specification because they are not internally tiled and the overviews are located at the end of the files.

GeoTIFF → COG → mosaicJSON

To create user interfaces (UI) and tools that are as responsive as possible, we need the files stored as proper Cloud Optimized GeoTIFFs.

Here are the steps we took to convert the files:

1. Download the whole dataset and upload the GeoTIFFs to S3.

# head over https://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=1691 and login
$ wget {url of the zip}
$ unzip Annual_Landcover_ABoVE_1691.zip
$ aws s3 sync . s3://my-bucket/raw/ABoVe/


2. Deploy cogeo-watchbot-light (a serverless stack based on AWS Lambda to run rio-cogeo at scale).

$ git clone https://github.com/developmentseed/cogeo-watchbot-light
$ cd cogeo-watchbot-light
$ sls deploy --stage production --bucket my-bucket --region us-east-1

3. Create COGs.

# create list of files to translate
$ aws s3 ls s3://my-bucket/raw/ABoVE/ | grep "Simplified" | awk '{print "s3://my-bucket/cogs/ABoVE/"$NF}' > list_raw_files.txt

# Send jobs to the stack
$ pip install rio-cogeo rio-tiler
$ cd scripts/
$ cat list_raw_files.txt | python -m create_jobs - \
   -p webp \
   --co blockxsize=256 \
   --co blockysize=256 \
   --op overview_level=6 \
   --op overview_resampling=bilinear \
   --prefix cogs/ABoVE \
   --topic arn:aws:sns:us-east-1:{AWS_ACCOUNT_ID}:cogeo-watchbot-light-production-WatchbotTopic

4. Create a MosaicJSON (using cogeo-mosaic).

$ aws s3 ls s3://my-bucket/cogs/ABoVE/ | awk '{print "s3://my-bucket/cogs/ABoVE/"$NF}' | cogeo-mosaic create - -o mosaic.json

Explore

When the COGs and the mosaicJSON are formatted correctly we can use the cogeo-mosaic-tiler stack to create web map tiles dynamically and visualize the data in a web map.

See it live: https://cogeo.xyz/projects/ABoVE/index.html

The temporal side

With the introduction of the mosaicJSON specification (see COG Talk — Part 2), we released an open source python module rio-tiler-mosaic to handle the creation of tiles using multiple files. One important feature of this plugin is its ability to do pixel selection dynamically, meaning the user can choose how each pixel value is created (e.g. take the pixel value from the first image or from the last one). It also enables custom pixel selection methods:

"""Custom stddev pixel selection method."""import numpy
from rio_tiler_mosaic.methods.base import MosaicMethodBaseclass bidx_stddev(MosaicMethodBase):
    """Return bands stddev."""def __init__(self):
        """Overwrite base and init bands stddev method."""
        super(bidx_stddev, self).__init__()
        self.exit_when_filled = Truedef feed(self, tile):
        """Add data to tile."""
        tile = numpy.ma.std(tile, axis=0, keepdims=True)
        if self.tile is None:
            self.tile = tile        pidex = self.tile.mask & ~tile.mask
        mask = numpy.where(pidex, tile.mask, self.tile.mask)
        self.tile = numpy.ma.where(pidex, tile, self.tile)
        self.tile.mask = mask

The pixel selection method above was specifically design for the ABoVE dataset. On each tile request the dynamic tiler using this method will return the standard deviation value for the stack of bands, enabling us to find where the land cover classification values have changed over the 31 year span.

See it live: https://cogeo.xyz/projects/ABoVE/stddev.html

Standard deviation values for the full stack of 31 values (years).

❤️ STAC + COG

DevelopmentSeed is advancing the adoption of the new STAC metadata specification (checkout our latest sat-api-pg project). Standardized and well-formatted metadata is an important step toward the democratization of remote sensed data and it also help us to create nice visualization tools.

Introducing landsatlive.live

By combining both STAC and mosaicJSONs we built a small demo called landsatlive.live (shoutout to the great landsat.live by our friends at Mapbox). This demo lets you visualize mosaics created dynamically using sat-api (our STAC search api) for the Landsat 8 dataset hosted on AWS.

A mosaic of Landsat 8 scenes. Web map tiles created dynamically using sat-api + mosaic-json + rio-tiler-mosaic.

COG for the best

Cloud Optimized GeoTIFF is one of the go-to data formats for when organizations are looking to store their data in the cloud. Utilizing this approach breaks down the barrier of accessing data. We’ve put together an open and user-friendly suite of tools that allows users to process data by combining STAC, mosaicJSON, and the cogeo-* tools, allowing anyone to access and analyze planetary data at scale.

For more information on how COGs can solve many of your data-related problems feel free to ping me on Twitter or LinkedIn! If you are interested in joining Development Seed to help us build further and faster take a look at our open positions!