A Gentle Introduction to GDAL Part 7: Transforming Data
The ability to color-code geographic datasets is a key reason to include GDAL in a visualization workflow . It’s great for batch processing a time series or automating production of graphics in near-real-time. But GDAL isn’t limited to making pictures of data. Tools like gdal_calc.py
help transform data from a raw material into usable information. Information that forms the basis for analysis (and mapping).
End users of remote sensing data rarely want imagery — they want answers from the information contained within the imagery (and other more exotic remote sensing techniques, but this post is mainly about deriving data from imagery). Known variously as geophysical parameters, planetary variables, analytics, or insights; derived datasets form the heart of why remote sensing is so valuable for Earth science. They help us understand the past, react in the present, and plan for the future.
In this article I’ll demonstrate a few methods for using GDAL to transform data, helping make the information it contains more usable. I’ll show how to calculate spectral indices from individual bands, how to mask out invalid data (like water in a vegetation index), and how to combine datasets into multivariate maps — along with some background on remote sensing.
An Introduction to Spectral Indices
One key technique to go from numbers to useful information is a spectral index. The technique uses the relationship between the brightness of pixels in different spectral bands to derive a quantity that relates to the physical world.
The images above show the Operational Land Imager’s (the primary instrument on Landsat 8 and 9) seven multispectral bands (excluding the panchromatic band — which is a blend of visible wavelengths, and the thermal infrared bands which are collected by a different instrument and measure energy emitted from the Earth — not reflected sunlight). Each image represents the perspective from the top of the atmosphere (often abbreviated as TOA) in a different color (including three infrared “colors” that are invisible to us puny humans). I’ve scaled them to be roughly equivalent, rather than stretched for maximum contrast, which is a more typical display technique. (I cheated a bit and brightened the two shortwave infrared bands, because the Sun is much brighter in visible wavelengths.)
Looking at the bands next to each other, you might get some clues as to what features each band is sensitive to, and how comparing bands in various ways might reveal some insights about the Earth’s surface. Water is dark in near and shortwave infrared light, but bright in coastal blue. Vegetation is bright in near infrared but dark in all the other bands. Smoke is transparent in shortwave infrared. The images taken with blue light are washed out because the short wavelengths are scattered by the atmosphere.
That last fact is important, since observers are often (but not always) interested in properties of the land and water, and the atmosphere interferes in a way that’s dependent on wavelength. Therefore most algorithms require surface reflectance — rather than digital numbers or radiance — as inputs. Surface reflectance is a measure of the proportion of light reflected from the Earth’s surface, but it can be tricky to calculate. Fortunately most providers of satellite data make a surface reflectance product available so it’s not something you’ll usually need to worry about.
The images above show the same Lake Tahoe scene as before, but the surface reflectance of each band rather than top of atmosphere radiance. Each band is scaled from 0 to 50% reflectivity, so they can be directly compared with one another. Equal brightness in the different images indicates the same proportion of light is being reflected in each different bands. Surface reflection measurements also help ensure observations taken in different locations or at different times are analogous — important for time series analysis or comparisons across regions.
With surface reflectance data, surface properties — vegetation health, burn severity, water quality, soil conditions — can be derived by comparing the relative reflectance of a pixel at different wavelengths. The resulting parameters are called spectral indices, and transform satellite data from an image to a quantity linked to real world physical properties.
Notice how I’m being slightly vague in this description — many spectral indices produce a number that is related to properties of the Earth’s surface, but unitless. So the indices are not absolute measurements of a physical property, but a relative measurement. It’s a bit of a strange concept. In practice, indices are a good way to tell if there’s more or less of something, but not necessarily exactly how much more or how much less. (And yes, I’m still being precisely imprecise with my language.)
The results of indices can also vary based on small differences between sensors, specifically the wavelengths of light measured by each band, which can vary from mission to mission. So the exact same index derived from Sentinel-2, Landsat 9, or a commercial sensor may give different results. The challenge is significant enough that both NASA (Harmonized Landsat and Sentinel-2) and the ESA (Copernicus Sentinel-2 MSI Level-2H and Level-2F) have programs to make Sentinel-2 and Landsat data consistent with each other. Just something to be aware of if you’re trying to compare indices across multiple instruments.
Yet Another NDVI Tutorial (Using gdal_calc.py to Calculate a Spectral Index)
OK, I know — you’re probably already familiar with the equation to calculate Normalized Difference Vegetation Index (NDVI), and you can get pre-baked NDVI from just about any Earth observing mission under the sun, so why bother? First of all, because NDVI is among the simplest algorithms out there, so it’s a good place to introduce gdal_calc.py,
a program that enables you to use the mathematical capabilities of numpy through GDAL. (I’ll explain why that’s powerful soon.) NDVI is also ubiquitous, with a history going all the way back to the earliest Landsat missions, so it’s a good place to start. I promise things will get more interesting.
Here’s the general flow of the process to calculate a spectral index and visualize it with GDAL:
- Get data
- Determine the equation(s)
- Convert into reflectance (if necessary)
- Execute calculations with
gdal_calc.py
- Apply color table to convert data to an image with
gdaldem
(which I described in detail in Part 6: Visualizing Data)
I’m going to illustrate how to calculate NDVI with Landsat 8 surface reflectance data of Lake Tahoe and its surroundings. The same data I’ve already shown, which was collected on September 22, 2021. The data are available from the USGS Earth Explorer. If you’re unsure how to download Landsat data from the USGS, I wrote a tutorial on Earth Explorer that’s still more-or-less correct, but you’ll want to choose Landsat > Landsat Collection 2 Level 2 data > Landsat 8–9 OLI/TIRS C2/L2
under Select Your Data Set(s)
instead of Landsat Archive > L8 OLI/TIRS
. You’ll only need the surface reflectance bands — I won’t be using surface temperature data at all.
If you want to work with another area, feel free to find a location that’s interesting to you. Just pick someplace with both plants and water, since I’ll be demonstrating vegetation and water quality indices. Make sure the data are from Landsat 8 or 9 — a few things I discuss will be specific to the Operational Land Imager, so older Landsats or Sentinel 2 would be more of an extra credit project. You’ll also need both GDAL and Python installed. (Instructions in A Gentle Introduction to GDAL Part 5.)
Calculating NDVI
Here’s the equation for NDVI:
NDVI = (Near Infrared − Red)/(Near Infrared + Red)
Conceptually simple, but there’s a little bit of complexity introduced by working with real world satellite data.
Here’s the GDAL command (code blocks don’t wrap in Medium, so you’ll need to copy/paste into a text editor to see the whole thing.):
gdal_calc.py -D LC08_L2SP_043033_20210922_20210930_02_T1_SR_B4.TIF -E LC08_L2SP_043033_20210922_20210930_02_T1_SR_B5.TIF --calc="((E * 0.0000275 - 0.2) - (D * 0.0000275 - 0.2))/((E * 0.0000275 - 0.2) + (D * 0.0000275 - 0.2))" --outfile=tahoe_LC08_20210922_SR_NDVI.tif --type=Float32 --co COMPRESS=DEFLATE --co PREDICTOR=2
Unfortunately, it’s not nearly as simple as the plain NDVI equation. I’ll break down the parts, and explain why there are some extra bits.
The first part of the command assigns files to the variables used for the calculation:
-D LC08_L2SP_043033_20210922_20210930_02_T1_SR_B4.TIF -E LC08_L2SP_043033_20210922_20210930_02_T1_SR_B5.TIF
Each band of input data needs to be linked to a letter (either uppercase or lowercase) and that letter is used as shorthand in the part of the code that describes the calculation. I assigned the red band on Landsat 8, B4
to D
and the near infrared band B5
to E
. I chose “D” and “E” since they’re the 4th and 5th letters in the alphabet, corresponding to the 4th and 5th Landsat 8 bands. Easier to remember than something arbitrary, at least to me.
--calc="((E * 0.0000275 - 0.2) - (D * 0.0000275 - 0.2))/((E * 0.0000275 - 0.2) + (D * 0.0000275 - 0.2))"
Is the calculation itself, called by --calc
and enclosed in double quotes. The operations are performed on the bands defined previously, and can be any function available in NumPy. The NDVI calculation uses simple arithmetic (only addition +
subtraction -
and division /
), but having access to trigonometric functions, logarithms, and other mathematical operation unlocks a whole bunch of interesting possibilities. I won’t explore them much in this post, but maybe later.
The calculation is complicated by the fact that Landsat surface reflectance data are stored as unsigned integers (Type=UInt16
if you check with gdalinfo
) not floating point numbers. So there is a scale
( 0.0000275) and offset
(0.2) applied to each value in the file. This mathematical trick allows fractional reflectance values that range between 0 and 1 to be squeezed into a file format that only accepts whole numbers from 0 to 65535. The USGS explains this in more detail in a FAQ.
(The additive offset may look a little weird, but it allows room for negative reflectances that don’t exist in the real world, but can be generated if the surface reflectance correction overestimates the scattering effect of the atmosphere. Something that’s relatively common over dark surfaces, like vegetation and water.)
The calculation is significantly simpler if the data don’t need to be re-formatted, in which case the operation would look like this:
--calc="(E - D)/(E + D)"
Back to the straightforward NDVI equation (remember D
is the red band and E
is the near infrared band)! As with most computer languages and mathematical notation, the parentheses are needed to force addition and subtraction to be performed before division.
The rest of the command specifies the output file name--outfile=tahoe_LC08_20210922_SR_NDVI.tif
and forces the file to be written as floating point numbers--type=Float32
. (By default gdal_calc.py
will write the output file in the same format as the input files. Which is a problem, because writing floating point numbers into an unsigned integer file results in unintelligible garbage. So you need to be explicit about the data type.) Finally --co COMPRESS=DEFLATE --co PREDICTOR=2
executes lossless compression (note that there’s a double dash because this is sending arguments to a python program, rather than running GDAL directly).
Making an Image from the Index
The output file will have NDVI values stored as floating point numbers in a single-channel grayscale GeoTIFF. If you want something for display in color you’ll still need to visualize the data with gdaldem color-relief.
The general workflow for this method is to convert a single band data file into a full-color image by correlating the data values with red, green, and blue color values defined in a text file. Read A Gentle Introduction to GDAL Part 6: Visualizing Data for more info.
Here is the command using the NDVI data created by gdal_calc.py
:
gdaldem color-relief tahoe_LC08_20210922_SR_NDVI.tif ndvi_landsat_palette_beige_green.txt tahoe_LC08_20210922_SR_NDVI_color_0-.9_geo.tif -alpha -co COMPRESS=DEFLATE -co PREDICTOR=2
And here are the contents of the text file that correlate values to colors. (Note that this file will also create an alpha channel based on a “no data” value of -9999
— this ensures the area outside of the boundaries of the Landsat scene will be transparent.)
-9999,0,0,0,0
-9998,251,246,237,255
0,251,246,237,255
0.1,249,244,220,255
0.2,236,238,203,255
0.3,216,229,185,255
0.4,189,218,166,255
0.5,156,203,149,255
0.6,116,185,132,255
0.7,62,164,117,255
0.8,0,141,102,255
0.9,0,116,89,255
1,0,116,89,255
I created this beige-green gradient with HCL Wizard, my current favorite perceptual palette generator. (Which is a good reminder that the “tools” section of Subtleties of Color needs an update …)
And here’s what it looks like (with an added color key):
I think the beige to green color palette makes the landscape fairly easy to read. Bare granite of the High Sierra peaks is light beige. Sparsely vegetated desert lowlands in Nevada (upper right) and the Caldor Fire burn scar are both a little bit darker and greener. Dry grasslands in the Central Valley (lower left) are greener still. High altitude forests along the Sierra Crest (upper left to lower right) are a medium green; while wetter, lower elevation forests on the Sierra’s West Slope are dark green. Fields in Nevada and California are the darkest green, representing dense vegetation nourished by irrigation and fertilizer.
Look closely, and one discrepancy stands out — NDVI values for water range wildly from below zero to above 0.9 (outside the bounds of the color palette) seemingly at random. In fact, The variability is so high it’s best to flag water as “no data” in an NDVI map, replacing it with either a fill value or more meaningful measurement. To understand why, and to find a replacement, it’s worth delving a little more deeply into how NDVI works.
NDVI relies on two properties of vegetation — the strong absorption of visible light (particularly red light) by chlorophyll, and the strong reflection of near infrared light by the cellular structure of leaves. High NDVI values (over 0.5 or so) occur when healthy leaves cover a large fraction of any given pixel. Lower NDVI values can be due to leaves with relatively low amounts of chlorophyll (unhealthy or senescent leaves) or sparse vegetation. The algorithm can’t distinguish between lots of unhealthy leaves or a lack of leaves.
Although mathematically NDVI can range from -1 to +1, the realistic range for pixels with at least some vegetation varies from 0 to 0.8 or so (the precise minimum and maximum values depends on the attributes of the sensor).
If there’s no vegetation at all NDVI isn’t really meaningful. In general, natural vegetation-free surfaces (sand, bare rock, exposed soil) reflect similarly throughout the visible and near infrared spectrum. So NDVI will tend to be near zero, or even slightly negative — but the exact value of NDVI doesn’t have physical significance in the absence of vegetation.
Unless there’s water present. Clear water absorbs light more efficiently with longer wavelengths, so gets darker from blue to green to red to near infrared. As a result, under ideal conditions NDVI will be negative for water pixels. If the water is clear. Water with sediment in it (during a flood, or spring melt, for example) or with algae growing near the surface, can be bright in near infrared, and often exhibits positive NDVI. Worse yet, under certain conditions sunlight will reflect directly into the sensor taking measurements (a phenomenon called sunglint), effectively blinding it. All of this makes it difficult to confidently separate water from land pixels only using NDVI.
Put another way — NDVI is sensitive to the presence of water, but there’s no threshold value that separates “water” from “not water”. At least if you’re limited to measurements in the visible and near infrared wavelengths. Water absorbs shortwave infrared (SWIR) extremely efficiently — so well that mud or wet sand can be nearly black in Landsat’s SWIR bands. Algorithms that incorporate this extra information have a much better chance of accurately detecting water.
Detecting Water in Landsat Data
One of the advantages of calculating your own spectral indices is that you are not limited to pre-generated products — you can find an algorithm that works for the specific problem you’re trying to solve. Poking around a bit for algorithms that use Landsat’s shortwave infrared bands for water detection, I found the Automated Water Extraction Index (AWEI), described in this paper: An Automated Method for Extracting Rivers and Lakes from Landsat Imagery.
The formula is:
AWEIsh = BLUE + 2.5 × GREEN − 1.5 × (NIR + SWIR1) − 0.25 × SWIR2
(The lowercase “sh” after AWEI denotes that this is the variant developed for use in mountainous areas, where shadows can be misidentified as water. NIR = near infrared, SWIR1 = shortwave infrared one, and SWIR2 = shortwave infrared two.)
This algorithm was initially developed using data from the Thematic Mapper (TM) aboard Landsat 7, and fortunately the bands on Landsat 8 & 9’s Operational Land Imager (OLI) are very close, so it works with both instruments. SWIR1 is OLI band 6, sensitive to wavelengths from 1.57–1.65 µm, and SWIR2 is OLI band 7, sensitive to wavelengths from 2.11–2.29 µm.
Here is the equivalentgdal_calc.py
command:
gdal_calc.py -B tahoe_LC08_20210922_B2_SR_float.tif -C tahoe_LC08_20210922_B3_SR_float.tif -E tahoe_LC08_20210922_B5_SR_float.tif -F tahoe_LC08_20210922_B6_SR_float.tif -G tahoe_LC08_20210922_B7_SR_float.tif --calc="B + 2.5 * C - 1.5 * (E + F) - 0.25 * G" --outfile=tahoe_LC08_20210922_SR_AWEIsh.tif --type=Float32
Notice that I didn’t include the scale and offset values necessary to convert from 16-bit unsigned integers to 32-bit floating point data. I pre-calculated those values and saved the files with simpler names to make the code easier to read. As before, I linked the data file for each band to a capital letter and defined the calculation with --calc
.
The output file is 32-bit floating point data, but the only important thing is that positive values are likely water, and negative values are likely land. Scaling so zero and above is white and less than zero is black turns out like this:
Which looks pretty good. Great even, with the caveat that there’s a smoke plume from the Caldor Fire just south of Tahoe that’s being incorrectly identified as water. It’s good to be aware that the real world can be a messy place, even if you’ve chosen your data carefully and done all your calculations correctly. The Landsat surface reflectance product does include some data quality information which could help clean that up, but applying it properly is complicated enough to be worth its own tutorial.
So far I’ve calculated NDVI and AWEI (to find where there is (or is not) water), and I want to flag all the water pixels to be No Data (which will be transparent when the data is converted into an image or merged with another dataset). To do this, I have to combine the NDVI file and the AWEIsh files.
The trick is to zero out all the NDVI pixels that are water, not land, then subtract a large value from those pixels and set it to No Data. You can’t simply set zero as No Data because there are valid, non-water NDVI pixels with a value of zero or lower. Here’s the command I used:
gdal_calc.py -V tahoe_LC08_20210922_SR_NDVI.tif -W tahoe_LC08_20210922_SR_AWEIsh.tif - outfile= tahoe_LC08_20210922_SR_NDVI_masked.tif --calc="V*(W<0) + (-9999*(W>=0))" - NoDataValue=-9999 -co COMPRESS=DEFLATE -co PREDICTOR=2
I’ve assigned the NDVI data to V
and water data to W
, then used the <
(less than) operator on AWEIsh to create an array with 1
where there is land (where AWEIsh is less than zero) and 0
where there is water (AWEIsh is greater than or equal to zero). Multiplying that array with NDVI sets all water pixels to zero and leaves NDVI as-is. Then I use the >=
(greater than or equal to) operator on AWEIsh to create an array where 1
indicates water and 0
indicates land, and multiply that by -9999
. Finally I add the water mask array to the NDVI array, which sets all water pixels to -9999
, and then designate -9999
as No Data
, with the option--NoDataValue=-9999
. This results in a nice clean dataset with land pixels having the calculated NDVI values and water pixels assigned No Data
.
I again used gdaldem
to create a color image:
gdaldem color-relief tahoe_LC08_20210922_SR_NDVI_masked.tif ndvi_landsat_palette_beige_green.txt tahoe_LC08_20210922_SR_NDVI_masked_color.tif -alpha -co COMPRESS=DEFLATE -co PREDICTOR=2
In the resulting image all the land pixels are color-coded NDVI and all the water pixels are transparent. Which not only removes the distraction of all the oddly-colored water bodies, but allows for another dataset — perhaps one focused on water quality — to be featured. As with vegetation and binary water detection, there are a number of ways to estimate water quality with Landsat data. I’m not an expert, but the Water Quality Index (WQI) defined in A reflectance-based water quality index and its application to examine degradation of river water quality in a rapidly urbanising megacity looks reasonable.
This is the Water Quality Index equation:
(GREEN + (BLUE − RED) − SWIR1)/(GREEN + (BLUE − RED) + SWIR1)
And the corresponding gdal_calc command:
gdal_calc.py -B tahoe_LC08_20210922_B2_SR_float.tif -C tahoe_LC08_20210922_B3_SR_float.tif -D tahoe_LC08_20210922_B4_SR_float.tif -F tahoe_LC08_20210922_B6_SR_float.tif -W tahoe_LC08_20210922_SR_AWEIsh.tif --outfile=tahoe_LC08_20210922_SR_WQI_masked.tif --calc="(C + (B - D) - F)/(C + (B - D) + F) * (W>=0) + ((W<0) * -9999)" --NoDataValue=-9999 --co COMPRESS=DEFLATE --co PREDICTOR=2
As with my calculation of the Automated Water Extraction Index, I used pre-scaled surface reflectance data converted to floating point data, renamed to be more readable. I also converted the land pixels to No Data
with the AWEI mask at the same time I calculated WQI, rather than doing it in a separate step. Since the water pixels were valid in this case, I set all the locations where AWEI was less than zero to a no data value of -9999
.
One last execution of gdaldem color-relief
, this time using a classic Color Brewer palette:
gdaldem color-relief tahoe_LC08_20210922_SR_WQI_masked.tif color_brewer_Bl9.txt tahoe_LC08_20210922_SR_WQI_color_masked.tif -alpha -co COMPRESS=DEFLATE -co PREDICTOR=2
The Color Brewer 9-class Blues palette formatted for gdaldem color-relief:
-9999,0,0,0,0
-9998,247,251,255,255
0,247,251,255,255
0.125,222,235,247,255
0.25,198,219,239,255
0.375,158,202,225,255
0.5,107,174,214,255
0.625,66,146,198,255
0.75,33,113,181,255
0.875,8,81,156,255
1,8,48,107,255
Which creates this map of blue lakes floating in empty space …
Nice, but not quite the result I want, which is to combine the NDVI data with the water quality data. Since areas of no data are already masked out in both the NDVI and WQI images, the data can be merged with a single line using gdalwarp:
gdalwarp tahoe_LC08_20210922_SR_WQI_color_masked.tif tahoe_LC08_20210922_SR_NDVI_masked.tif tahoe_LC08_20210922_WQI_NDVI.tif -co COMPRESS=DEFLATE -co PREDICTOR=2
gdalwarp
calls the command, tahoe_LC08_20210922_SR_WQI_color_masked.tif
and tahoe_LC08_20210922_SR_NDVI_masked.tif
are the input file names, tahoe_LC08_20210922_WQI_NDVI.tif
is the output file name, and -co COMPRESS=DEFLATE -co PREDICTOR=2
tells GDAL to losslessly compress the TIFFs to save drive space (single dash this time).
The result is a combined map of vegetation and water quality in Lake Tahoe and the surrounding Sierra Nevada, based on a single Landsat scene.
To recap: I used gdal_calc.py
to calculate spectral indices and masks from multispectral data on the command line, colorized the data with gdaldem color-relief
, and combined the files with gdalwarp
.
I findgdal_calc.py
convenient because it enables access to the powerful mathematical functions of numpy without the overhead of Python or other programming languages. As such it’s great for prototyping visualizations or working with limited numbers of files. When you’re working with large numbers of files or performing more complex tasks (like generating a similar map from a time series of Landsat scenes) it’s probably better to look into full blown programming languages (which will likely use GDAL under the hood for parsing scientific data formats and reading georeferencing information). Which will be the topic of Part 8 — using the Python GDAL bindings to open and visualize HDF (and maybe NetCDF) files.
A Few More Thoughts on Spectral Indices
It’s really easy to browse through a website, crack open a textbook, or read a blog post (including this one!) and find a spectral index that purports to quantify one physical parameter or another — there are literally hundreds of them. And often times you can calculate an index, throw a palette on it, and think “that looks about right”. It’s much harder to know which indices are appropriate to solve which problems, and understand the limitations inherent in different remote sensing systems and how that might affect your results. Remote sensing is hard!
For example, look at carefully at the water quality index data for Lake Tahoe. There are features that look like ripples and eddies that I’m convinced are due to surface glint and don’t represent differences in water quality.
There are a ton of variables and uncertainties that can impact the accuracy of even the simplest and most well understood indices, like NDVI. Seemingly small variations in the angle of the sun, the amount of ozone in the atmosphere, or sensor characteristics can have a large impact on the calculated measurement. A little effort put into understanding these limitations goes a long way towards helping to realize the potential of satellite data to mitigate many social and environmental challenges.
Next up in A Gentle Introduction to GDAL Part 8: using GDAL to open sometimes-tricky scientific data formats like HDF & NetCDF, along with using GDAL to enable Python to read & write geospatial data.
Thanks to Joe Kington for being a sounding board and (as always) Frank Warmerdam for encouraging me to learn GDAL in the first place.