Predicting Deforestation: Acquiring Satellite Imagery

Zhenya Warshavsky
Published in Project Canopy
Oct 20, 2020 · 17 min read
Sentinel-2 image of Nkoteng, Cameroon, 2020, showing industrial agriculture planted over savanna

This is the second post in a series documenting the design, development and implementation of a machine learning pipeline leveraging satellite imagery to predict drivers of deforestation in the Congo Basin rainforest. For more context, read the first article, Machine Learning for Predicting Deforestation: Research and Planning.

(co-authored by David Nagy)

Project Canopy is a new nonprofit NGO aiming to bolster evidence-based conservation policymaking by aggregating, analyzing and communicating all data relevant to the Congo Basin rainforest. As data scientists at this organization, we had to develop an appropriate example of such data and analysis for the organization’s MVP. We decided to train a model to detect logging road locations (see our previous article). To prepare for this analysis, we first had to download Sentinel satellite imagery of the Congo Basin region. We needed to determine the best way to (a) query for specific satellite products, and (b) download those products locally. In this article, we will describe the four methods we found for accomplishing those two tasks, as well as their respective strengths and weaknesses. The best method depends on your use case; for us, we decided on Sentinel Hub due to its more extensive documentation and straightforward API architecture.

First, we will discuss various in-browser websites that let you view and download Sentinel images. These are very helpful for exploration and downloading individual products, but are not set up to easily allow for bulk downloads. The second method is Sentinelsat, an API that directly connects to the Copernicus Access Hub, which is the official portal for accessing Sentinel content. Unfortunately, this API has many issues, including a relative lack of documentation and constant timeout errors when querying. Third is Google Earth Engine (using Google Cloud Platform), which is a very useful open-access depository for Sentinel images, but lacks a good querying system. The final, preferred method is Sentinel Hub, which uses Amazon Web Services. Sentinel Hub offers the best combination of reliable querying and convenient downloading, though you do potentially need to pay for the service.

In-Browser Web Interfaces

If you are interested in acquiring a single recent product of a Military Grid Reference System (MGRS) tile covering a relatively small area (<100 km²) and not planning to perform machine learning, time series analysis, or any sort of complex statistical analysis, it may be easiest to just use a web interface without delving into API access. With a web interface, it is relatively easy to create an area of interest (AOI), query results based on criteria like cloud cover and date of imagery acquisition, and finally download the products directly from a web browser.

Copernicus Access Hub

The Copernicus Open Access Hub is provided directly by the European Space Agency (ESA), which is responsible for the Sentinel 2A/2B mission satellites. It is widely regarded as the default approach for acquiring Sentinel products.

Unfortunately, the lag time for getting query results and the relatively counter-intuitive interface make this a poor choice for on-the-fly product lookups and downloads. We also found the API associated with this service to be limited: slow query responses, poor documentation, and products older than a year unavailable without a special request. (Again, see the Sentinelsat section below for more details.)

EO Browser By Sentinel Hub

Sentinel Hub’s EO Browser stands out as the best choice that we found for easily accessing Sentinel 2A/2B images. The AOI can be drawn as a polygon (not just a bounding box), you can select between different composite images (True Color, False Color, NDVI, etc.), product results load instantly, and the interface is intuitive. Notably, the API provided by the Sentinel Hub service is better documented and does not suffer the latency issues of the Copernicus endpoints that Sentinelsat relies on.

However, as you might expect, a web interface is highly inefficient for bulk downloads of image products, as you would have to manually download each one individually. Since we needed to do bulk downloads, we had to turn to APIs.

Sentinelsat API

Sentinelsat is an API that connects directly to the Copernicus Access Hub. As such, you need to register for a free Copernicus account to use it. The API itself we found to be relatively useful, despite some important weaknesses; it’s certainly good enough to use for most purposes. Unfortunately, the Copernicus Access Hub itself is severely limited in terms of the images you can download, which ended up being this API’s fatal flaw, as we’ll discuss in more detail later.

Basics

The documentation, while not elaborate, does a decent job of describing the basics. Start with your imports:

from sentinelsat import SentinelAPI, read_geojson, geojson_to_wkt
from datetime import date
from env_vars import sentinel_username, sentinel_password
import pandas as pd

The date object is used for limiting your queries (more info on this later). sentinel_username and sentinel_password are your Copernicus username and password; to call the API, plug them into a SentinelAPI object like so:

api = SentinelAPI(sentinel_username, sentinel_password, 'https://scihub.copernicus.eu/dhus')
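The env_vars module imported above is just a local Python file holding your Copernicus credentials; a minimal, hypothetical version looks like this:

# env_vars.py -- a hypothetical local credentials module (keep it out of version control)
sentinel_username = 'your_copernicus_username'
sentinel_password = 'your_copernicus_password'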

You can use this api object to search Copernicus for certain products by using api.query(). We recommend calling help(api.query) to see the docstring directly, as it is much more elaborate than the website linked above (though still somewhat sparse). The upshot is that you can limit your search results with various arguments. The main arguments we used were:

footprint: A well-known text (WKT) string that describes a shape somewhere on the globe. You combine this with the area_relation argument to geographically limit your search (more information on this below). Note that, if your footprint is in geojson format, you can import read_geojson and geojson_to_wkt from sentinelsat and get it in WKT format that way. For example:

footprint = geojson_to_wkt(read_geojson(geojson_location))

where geojson_location is the path on your computer where your geojson lies.

area_relation: As noted above, you can geographically limit your search using a WKT “footprint”. The area_relation argument tells the API the relation your target area, or “area of interest” (AOI), has to your footprint. There are three options, and unfortunately, the docstring appears to mix up two of them. From our testing, here is what the values actually appear to do:

  • Intersects: The default option, this returns products that intersect with — i.e., have some part of them overlap with — your footprint.
  • Contains: Contrary to the docstring, this appears to return any product that fully contains the footprint.
  • IsWithin: Again, contrary to the docstring, this appears to return products that are inside the footprint, essentially the reverse of “contains.”

We ended up using a combination of Contains and IsWithin to get all products inside a certain area we drew. Note that we found the best results when we used a simple bounding box as our footprint; the API had some trouble when we tried a more complicated polygon.

  • date: This allows you to enter a starting date and an ending date, only returning products within that range. There are multiple date formats, but we found it most straightforward to just use a date object from the built-in datetime module. Make sure to put your start and end dates into a tuple before entering them into this argument.
  • raw: This allows you to attach additional arguments that aren’t found in the Python API, but can be found on this page.
  • platformname: As you might expect, this limits your search to particular platforms; we used Sentinel-2.
  • cloudcoverpercentage: This lets you choose a minimum and maximum amount of cloud cover you want your products to have. We used either (0,5) or (0,10). Note that, in our experience, products with “0%” cloud cover often had large no-data areas, so you may want to use (1,5) or (1,10) if you want to avoid downloading products that mostly lack data.

The result of using api.query is an ordered dictionary:

products = api.query(…)

For example, here’s a sample query you can try:

products = api.query(footprint,
                     date=('20150601', date(2015, 12, 31)),
                     area_relation='IsWithin',
                     platformname='Sentinel-2',
                     cloudcoverpercentage=(0, 10))

You can easily put this ordered dictionary into a dataframe:

products_df = api.to_dataframe(products)

This allows you to perform regular pandas operations; for instance, we dropped duplicate “tileid” rows so we only had one product per tile. A minimal sketch of that step, assuming the column names to_dataframe returned for us (sorting by cloud cover first is our own suggestion):
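# Keep one product per MGRS tile; sorting by cloud cover first means the
# clearest scene survives the dedup.
products_df = products_df.sort_values('cloudcoverpercentage')
products_df = products_df.drop_duplicates(subset='tileid')

You then need to translate the new dataframe back into an ordered dictionary before you can download the products: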

from collections import OrderedDict

unordered_products_dict = products_df.T.to_dict()
ordered_dict = OrderedDict((k, unordered_products_dict.get(k)) for k in products_df.index)

Then, finally, you can download those products:

api.download_all(ordered_dict, directory_path=X)

Strengths and Weaknesses

The main strength of Sentinelsat is that it provides a direct connection to Copernicus, without any intermediaries or additional fluff. The search and download process itself is also relatively straightforward, as can be seen from the summary above.

It has two weaknesses, though. One is relatively minor but the other is major enough to sink Sentinelsat as a reliable API. First, the connection to Copernicus is not very solid and the API often times out, especially when you run a query for a large number of products. We ended up only downloading one month of products at a time, refreshing the API connection (i.e., rerunning the cell that defines the api object) before each new query. But this issue was mostly just annoying.

The far more serious issue is that Copernicus itself only has a select number of products easily available online. It only keeps products online for twelve months. After that, you have to request an old product to be placed back online, and you can only make a limited number of such requests each day (once every 30 minutes, according to the Sentinelsat documentation). Thus, if you’re only interested in pictures taken in the last year, Sentinelsat is a very good option. But if, like us, you need older pictures too, you will have to look elsewhere.
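That said, if you only need to restore the occasional archived product, you can check its status through the same API. A sketch, with the caveat that in the sentinelsat version we used, calling download on an offline product is what files the retrieval request:

# uuid is a product's key from the api.query() results
if not api.is_online(uuid):
    # For archived products, this triggers the long-term-archive retrieval
    # request; the file itself only becomes downloadable some time later.
    api.download(uuid)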

Google Earth Engine

Google Earth Engine API utilizing JS code via their native web interface

Introduction

Google Earth Engine (Google EE) is many things, most of which are irrelevant to our project. What is relevant, though, is that it has all the Sentinel-2 satellite images in its cloud server. And unlike Copernicus, which only keeps its images online for a year, it appears that Google EE keeps most or all the products for an indefinite period of time. There’s both a command line tool and a Python module that interact with this cloud. Unfortunately, there’s a downside — unlike Sentinelsat, there does not appear to be any sort of robust search or query function. How, then, can we find the images we want to find?

While the Google EE API lacks a search function, there is a very large CSV file containing data on all its Sentinel-2 images, accessible here, which we will refer to as the “EE Index.” Note that the EE Index is in GZ format, so to open it in pandas you will have to unzip it and copy it into a CSV:

import gzip
import shutil

with gzip.open('./google_earth_engine_index.gz') as f_in:
    with open('./earth-engine-index.csv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
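With the CSV extracted, you can load it into pandas. The index covers every Sentinel-2 product, so the file is large and this can take a while:

import pandas as pd

ee_index = pd.read_csv('./earth-engine-index.csv')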

This index contains the information you need to individually download the images you want. After examining it, we found that the PRODUCT_ID column corresponded to the title column in the products_df obtained from querying the Sentinelsat API. In other words, it’s possible to combine the search abilities of Sentinelsat and the greater trove of products held by Google EE in the following way:

1. Query the Sentinelsat API and obtain a products_df using the method described in the section above.

2. Load the Google EE index CSV into a pandas dataframe (we called it ee_index).

3. Do an inner merge on the dataframes obtained in steps 1 and 2, joining title in products_df to PRODUCT_ID in ee_index.

This allowed us to effectively do a search through the Google EE using Sentinelsat. Now the problem was how to use the information in the EE Index to actually download the products from the cloud.
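A sketch of that merge, assuming (as was true for us) that the title column in products_df holds the full product ID:

merged = ee_index.merge(
    products_df,
    how='inner',
    left_on='PRODUCT_ID',  # full product ID in the EE Index
    right_on='title',      # matching column from the Sentinelsat query results
)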

Python API

The first method we tried was the Google EE Python API. This page contains good instructions for installing and initializing it. Do note that you will need to have a Google/Gmail account and sign up to Google Earth Engine with it; just go here and click “Sign Up.” In addition, both installing the client and getting it to line up correctly with your Google EE credentials can be very fiddly. We won’t cover every detail of what we did, but don’t be surprised if it takes a few hours (at least) to get everything to line up correctly.
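Once the client library is installed (pip install earthengine-api), the basic authentication flow looks like this:

import ee

ee.Authenticate()  # opens a browser window so you can grant access with your Google account
ee.Initialize()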

After getting the installation to work correctly and successfully running ee.Initialize(), the next step was figuring out how to get an “image id” to enter into the ee.Image object. Luckily, we found a GIS Stack Exchange post to help with this. Essentially, a Sentinel-2 product’s Image ID in the Google EE has the following format:

COPERNICUS/S2/<acquisition date>_<granule date>_<Tile id>

Those three variables — acquisition date, granule date, and tile id — can be found in the EE Index’s PRODUCT_ID column. For example, take this product ID:

S2A_MSIL1C_20160101T094412_N0201_R036_T32NNG_20160101T095712

If you split on underscores, the third string, 20160101T094412, is the acquisition date; the last string, 20160101T095712, is the granule date; and the second-to-last string, T32NNG, is the tile ID (preceded by a T). Thus, the Image ID of this product is:

COPERNICUS/S2/20160101T094412_20160101T095712_T32NNG

Here’s the full process in code:

import ee

ee.Initialize()

product_id = 'S2A_MSIL1C_20160101T094412_N0201_R036_T32NNG_20160101T095712'
acquisition_date = product_id.split('_')[2]
granule_date = product_id.split('_')[-1]
tile_id = product_id.split('_')[-2]
image_id = f'COPERNICUS/S2/{acquisition_date}_{granule_date}_{tile_id}'
image = ee.Image(image_id)

By doing this, we were able to successfully obtain the image. The next step, then, was downloading it.

Unfortunately, this is where we ran into trouble. First, the Google EE Python client has a serious lack of reliable documentation; we had to rely on docstrings and scrounging through Stack Overflow and GIS Stack Exchange posts. No matter which method we tried, we ran into problems.

For example, take the method demonstrated in the Google EE introduction linked above, image.getDownloadURL(). This did indeed get us a URL, but for whatever reason, it wasn’t a URL we could use to actually download what we wanted; in particular, we were unable to find a way to put it into a Python function and download the product, which was the goal (since we needed to batch download many products). This Stack Overflow post led us to try ee.batch.Export.image, but similarly, we couldn’t get it to work. This, combined with the lack of documentation, caused us to look for a different route.

Gsutil Command Line Interface

The main problem with the Google EE Python API was the lack of a reliable way to directly download the products we needed — specifically, the True Color Image (TCI) files. The solution we found was to use the gsutil command line interface (see this page for instructions on downloading it). You can find an introduction to gsutil on this page, but the only command we needed was cp, the copy command.

Why did this work when none of the Python API methods did? The issue is that we couldn’t find a reliable download link, and we couldn’t find a way to just get the TCI file using the Python client without such a link. However, the cp command allows you to target one particular file in the Google Cloud and copy it to your machine — which is exactly what we needed. There was just one more thing: a way to procedurally generate the precise URL of the TCI file.

The issue was this. The Google EE Index contains a BASE_URL column, which always starts with gs://gcp-public-data-sentinel-2/tiles. If you replace the gs:// at the beginning with https://console.cloud.google.com/storage/browser/, you can view the product (or more accurately, the directory containing the product and all its associated metadata) directly in your browser, and navigate to the specific item you want (in our case, the TCI file). Gsutil can then be used to directly download any individual file in that directory.
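A sketch of that substitution:

base_url = ee_index.loc[row, 'BASE_URL']  # e.g. gs://gcp-public-data-sentinel-2/tiles/...
browser_url = base_url.replace(
    'gs://', 'https://console.cloud.google.com/storage/browser/', 1
)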

After some testing, we found a way to, given any row in the EE Index, procedurally generate the URL to its TCI file:

1. Start with the Base URL from the BASE_URL column.

2. Add /GRANULE/, followed by the granule ID found in the GRANULE_ID column.

3. Add /IMG_DATA/.

4. Get the Tile ID by splitting the granule ID on _ and taking the second string.

5. Get the date by splitting the DATATAKE_IDENTIFIER on _ and taking the second string.

6. Add the tile ID, then an underscore, then the date, then an underscore, then TCI.jp2.

Here’s the function we wrote in full. Note that ee_index is the Google EE Index read as a pandas dataframe, and row is the relevant row number in the index:

def generate_tci_url(ee_index, row):
    url = ee_index.loc[row, 'BASE_URL']
    url += '/GRANULE/'
    granule_id = ee_index.loc[row, 'GRANULE_ID']
    url += granule_id
    url += '/IMG_DATA/'
    tile_id = granule_id.split('_')[1]
    date = ee_index.loc[row, 'DATATAKE_IDENTIFIER'].split('_')[1]
    url += f'{tile_id}_{date}_TCI.jp2'
    return url

Once you have the URL for the TCI, you can then download it with gsutil. While this can be done in the command line, it’s easier to run in Python using the subprocess module. Just use subprocess.run(), and enter the following command:

[cloud_env, '&&', 'gsutil', 'cp', url, dest_folder]

where cloud_env is the location of Cloud SDK’s cloud_env.bat file, url is the URL generated by the process above, and dest_folder is where you want to save the TCI.
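On macOS or Linux, where gsutil typically ends up on your PATH after installing the Cloud SDK, the .bat wrapper is unnecessary; a minimal sketch using the helper defined above:

import subprocess

url = generate_tci_url(ee_index, row)  # helper defined above
dest_folder = './tci_downloads'        # hypothetical destination directory

subprocess.run(['gsutil', 'cp', url, dest_folder], check=True)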

In the end, we didn’t actually use this method for our main data downloads, since we also needed some metadata files to sort by image quality and remove clouds. While it is possible to acquire those files from Google EE too, it would require writing custom code to pull out the specific files, similar to what we did for the TCI files, whereas, as you will see below, doing this in Sentinel Hub is much more straightforward. But the Google EE method is very effective if you know which specific files you need from the Google EE cloud.

Sentinel Hub

Sentinel Hub is one of the leading satellite data providers, offering an efficient way of archiving, processing and distributing satellite data. Sentinel Hub hosts its data offering on Amazon Web Services (AWS) and, as a service, covers the requester-pays cost that the end user would typically incur when pulling data from AWS’s Simple Storage Service (S3).

You can find a tutorial on pulling Satellite imagery from Sentinel Hub via their API here.

Sentinel Hub’s EO Browser Web Interface

Create Instance ID

In order to utilize Sentinel Hub’s API (SH API), you will need to acquire an Instance ID from your Sentinel Hub account (Sentinel Hub offers a free 30-day trial account with full API access).

Here’s how to do that:

  1. Sign up for a free account on Sentinel Hub
  2. Go to Configuration Utility, located in the Dashboard section
  3. Choose New configuration
  4. In the Add new configuration dialogue, enter your preferred name for the configuration, and under Create configuration based on, select the Simple Sentinel-2 L2A template
  5. Under Service endpoints, grab the ID listed and place it in an environment variable config file for use with the SH API
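As with the Copernicus credentials earlier, we keep this ID in a local env_vars module; a hypothetical sketch:

# env_vars.py
sentinel_hub_instance_id = 'your-instance-id-from-the-dashboard'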

Sentinel Hub wrapper for Python

The Sentinel Hub documentation for acquiring satellite imagery is fairly straightforward for a default “greedy” product pull. In this scenario we will pull all files associated with a product, including the 60 m, 20 m, and 10 m resolution spectral files, with each band in its own individual raster file.

First, do your import statements:

from sentinelhub import WebFeatureService, BBox, CRS, DataSource, SHConfig
from sentinelhub import AwsTileRequest
from collections import OrderedDict
from env_vars import sentinel_hub_instance_id
from scripts.mgrs import encode, LLtoUTM
import pandas as pd

INSTANCE_ID = sentinel_hub_instance_id

if INSTANCE_ID:
    config = SHConfig()
    config.instance_id = INSTANCE_ID
else:
    config = None

sentinel_hub_instance_id refers to the Instance ID from step 5 above.

Then you may want to create a bounding box to limit your query. Here is the one we used, visualized in QGIS:

Bounding Box covering the entire Congo Basin
# Congo Basin bounding box
search_bbox = BBox(bbox=[6.81, -5.27, 30, 5.5], crs=CRS.WGS84)
search_time_interval = ('2019-01-01T00:00:00', '2020-07-31T23:59:59')

wfs_iterator = WebFeatureService(
    search_bbox,
    search_time_interval,
    data_source=DataSource.SENTINEL2_L2A,
    maxcc=.05,
    config=config
)

The search_bbox is the bounding box visualized above. The WebFeatureService object enables us to query for products that lie within the bounding box in question. We have specified the product type L2A via the argument data_source=DataSource.SENTINEL2_L2A, max cloud cover at 5% via the argument maxcc=.05, and passed in our personal endpoint configuration via config=config.

results = wfs_iterator.get_tiles()
df = pd.DataFrame(results, columns=['Tilename', 'Date', 'AmazonID'])
df2 = df.drop_duplicates(subset='Tilename')
output = list(df2.itertuples(index=False, name=None))

In order to limit our download to one product per Military Grid Reference System (MGRS) tile, we convert the results object (which is a list of tuples) into a pandas dataframe, drop duplicates by MGRS tile name, then convert the results list back to a list of tuples.

Now we loop through our output tuple list to initiate the download.

# Additional params
bands = ['R10m/TCI']

for product in output:
    tile_name, time, aws_index = product
    data_folder = '/Volumes/Lacie/zhenyadata/Project_Canopy_Data/PC_Data/Sentinel_Data/Test/Safe_True'
    request = AwsTileRequest(
        tile=tile_name,
        time=time,
        bands=bands,
        aws_index=aws_index,
        data_folder=data_folder,
        data_source=DataSource.SENTINEL2_L2A,
        safe_format=True
    )
    request.save_data(redownload=True)

Because we are only interested in the TCI files for our project, we have specified a 10 meter per pixel resolution and only the true color version of that image with bands=['R10m/TCI'].
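If you need individual spectral bands or the product metadata instead of (or alongside) the composite, the same request style applies. A sketch, assuming the band names mirror the product’s .SAFE layout and that your sentinelhub version supports the metafiles argument of AwsTileRequest:

# Hypothetical alternative request parameters
bands = ['R10m/B04', 'R10m/B08']      # red and near-infrared at 10 m
metafiles = ['tileInfo', 'metadata']  # auxiliary files, passed via metafiles=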

You’ll notice that we have to re-specify the product type in data_source and the SAFE format with the safe_format=True argument for the product. This is because L1C and L2A products are both available for most tiles.

Note that we have passed in a redownload=True argument to ensure that any existing (previously downloaded) tiles in the destination folder are re-downloaded. You will want to set this to True in case you are interested in a new set of bands or metadata files contained in the package that you had not downloaded previously.

Sentinelsat vs Sentinel Hub

We used both the Sentinelsat and Sentinel Hub tile query services extensively before eventually settling on Sentinel Hub. As described above, the Sentinelsat API often timed out, especially when querying a time period longer than one month, and even after writing custom code that performed a separate API call for each month, the service was still unreliable. By contrast, Sentinel Hub’s tile lookup call consistently outputs results regardless of the search criteria.

The two outstanding drawbacks of Sentinel Hub are the 30-day limit on free use and the fact that the area of interest (AOI) has to be a bounding box (rectangle) rather than a polygon (complex asymmetric shape); thus, the search results will likely include more tiles than you need. One way of getting around this is to put your relevant tiles into a list and use that list to filter subsequent search results prior to initiating the download process, as in the sketch below.
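For example (the tile names below are hypothetical):

# MGRS tiles that actually intersect your AOI polygon
relevant_tiles = ['33NTG', '33NUG']
df_filtered = df2[df2['Tilename'].isin(relevant_tiles)]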

Of course, if you’re working with stakeholders who need to perform product lookups or downloads without utilizing an API, then EO Browser, the web interface companion for Sentinel Hub, has a much richer feature set and more intuitive search functionality than the Copernicus Open Access Hub, the native ESA web interface behind Sentinelsat.

As for downloading multiple products, Sentinel Hub’s straightforward API and consistent up-time made it a better choice than our custom approach to pulling products from the Google Cloud Platform (GCP) described earlier. In addition to the simplicity of utilizing Sentinel Hub, another key advantage is its usage of parallel multithreading. Essentially, it uses each core on your machine’s CPU separately, so a file download can potentially complete 4 times faster (in the case of 4 cores). While it is possible to write custom code that does this, it’s available by default in Sentinel Hub.

Sentinel Hub Pros:

  • EO Browser is a good companion for locating products prior to using the API
  • Unlimited search results
  • Straightforward interface
  • Well-documented
  • Allows for both SAFE and standard product package downloads
  • Utilizes multi-threading for parallelized downloads (can be customized)

Sentinel Hub Cons:

  • Query lookups are restricted to a bounding box, vs. the custom polygons Sentinelsat supports
  • 30-day trial for API access
