A Gentle Introduction to GDAL Part 7: Transforming Data

18 min readOct 30, 2023

The ability to color-code geographic datasets is a key reason to include GDAL in a visualization workflow . It’s great for batch processing a time series or automating production of graphics in near-real-time. But GDAL isn’t limited to making pictures of data. Tools like gdal_calc.py help transform data from a raw material into usable information. Information that forms the basis for analysis (and mapping).

End users of remote sensing data rarely want imagery — they want answers from the information contained within the imagery (and other more exotic remote sensing techniques, but this post is mainly about deriving data from imagery). Known variously as geophysical parameters, planetary variables, analytics, or insights; derived datasets form the heart of why remote sensing is so valuable for Earth science. They help us understand the past, react in the present, and plan for the future.

In this article I’ll demonstrate a few methods for using GDAL to transform data, helping make the information it contains more usable. I’ll show how to calculate spectral indices from individual bands, how to mask out invalid data (like water in a vegetation index), and how to combine datasets into multivariate maps — along with some background on remote sensing.

An Introduction to Spectral Indices

One key technique to go from numbers to useful information is a spectral index. The technique uses the relationship between the brightness of pixels in different spectral bands to derive a quantity that relates to the physical world.

These images show each of Landsat 8’s seven multispectral bands. Radiance data — while calibrated — varies based on the brightness of the Sun at different wavelengths and the state of the atmosphere. The shorter the wavelength the greater the effect of atmospheric scattering, from both nitrogen and oxygen molecules, trace gases (in particular water vapor and ozone) and particulates like smoke. Longer wavelengths pass more directly through the atmosphere. Notice how the smoke from the Caldor Fire (just south of Lake Tahoe) nearly disappears in the two shortwave infrared bands. (Data collected on September 22, 2021. Available on the USGS Earth Explorer.)

The images above show the Operational Land Imager’s (the primary instrument on Landsat 8 and 9) seven multispectral bands (excluding the panchromatic band — which is a blend of visible wavelengths, and the thermal infrared bands which are collected by a different instrument and measure energy emitted from the Earth — not reflected sunlight). Each image represents the perspective from the top of the atmosphere (often abbreviated as TOA) in a different color (including three infrared “colors” that are invisible to us puny humans). I’ve scaled them to be roughly equivalent, rather than stretched for maximum contrast, which is a more typical display technique. (I cheated a bit and brightened the two shortwave infrared bands, because the Sun is much brighter in visible wavelengths.)

Looking at the bands next to each other, you might get some clues as to what features each band is sensitive to, and how comparing bands in various ways might reveal some insights about the Earth’s surface. Water is dark in near and shortwave infrared light, but bright in coastal blue. Vegetation is bright in near infrared but dark in all the other bands. Smoke is transparent in shortwave infrared. The images taken with blue light are washed out because the short wavelengths are scattered by the atmosphere.

That last fact is important, since observers are often (but not always) interested in properties of the land and water, and the atmosphere interferes in a way that’s dependent on wavelength. Therefore most algorithms require surface reflectance — rather than digital numbers or radiance — as inputs. Surface reflectance is a measure of the proportion of light reflected from the Earth’s surface, but it can be tricky to calculate. Fortunately most providers of satellite data make a surface reflectance product available so it’s not something you’ll usually need to worry about.

Grayscale images of surface reflectance data from each of the seven multispectral bands on Landsat 8. Each has been processed identically, so they show how various surfaces reflect light differently in each wavelength. Notice how the landscape varies based on wavelength. These differences form the basis of the algorithms that quantify surface properties. (Data collected on September 22, 2021. Available on the USGS Earth Explorer.)

The images above show the same Lake Tahoe scene as before, but the surface reflectance of each band rather than top of atmosphere radiance. Each band is scaled from 0 to 50% reflectivity, so they can be directly compared with one another. Equal brightness in the different images indicates the same proportion of light is being reflected in each different bands. Surface reflection measurements also help ensure observations taken in different locations or at different times are analogous — important for time series analysis or comparisons across regions.

With surface reflectance data, surface properties — vegetation health, burn severity, water quality, soil conditions — can be derived by comparing the relative reflectance of a pixel at different wavelengths. The resulting parameters are called spectral indices, and transform satellite data from an image to a quantity linked to real world physical properties.

Notice how I’m being slightly vague in this description — many spectral indices produce a number that is related to properties of the Earth’s surface, but unitless. So the indices are not absolute measurements of a physical property, but a relative measurement. It’s a bit of a strange concept. In practice, indices are a good way to tell if there’s more or less of something, but not necessarily exactly how much more or how much less. (And yes, I’m still being precisely imprecise with my language.)

The results of indices can also vary based on small differences between sensors, specifically the wavelengths of light measured by each band, which can vary from mission to mission. So the exact same index derived from Sentinel-2, Landsat 9, or a commercial sensor may give different results. The challenge is significant enough that both NASA (Harmonized Landsat and Sentinel-2) and the ESA (Copernicus Sentinel-2 MSI Level-2H and Level-2F) have programs to make Sentinel-2 and Landsat data consistent with each other. Just something to be aware of if you’re trying to compare indices across multiple instruments.

Yet Another NDVI Tutorial (Using gdal_calc.py to Calculate a Spectral Index)

OK, I know — you’re probably already familiar with the equation to calculate Normalized Difference Vegetation Index (NDVI), and you can get pre-baked NDVI from just about any Earth observing mission under the sun, so why bother? First of all, because NDVI is among the simplest algorithms out there, so it’s a good place to introduce gdal_calc.py, a program that enables you to use the mathematical capabilities of numpy through GDAL. (I’ll explain why that’s powerful soon.) NDVI is also ubiquitous, with a history going all the way back to the earliest Landsat missions, so it’s a good place to start. I promise things will get more interesting.

Here’s the general flow of the process to calculate a spectral index and visualize it with GDAL:

Get data
Determine the equation(s)
Convert into reflectance (if necessary)
Execute calculations with gdal_calc.py
Apply color table to convert data to an image with gdaldem (which I described in detail in Part 6: Visualizing Data)

I’m going to illustrate how to calculate NDVI with Landsat 8 surface reflectance data of Lake Tahoe and its surroundings. The same data I’ve already shown, which was collected on September 22, 2021. The data are available from the USGS Earth Explorer. If you’re unsure how to download Landsat data from the USGS, I wrote a tutorial on Earth Explorer that’s still more-or-less correct, but you’ll want to choose Landsat > Landsat Collection 2 Level 2 data > Landsat 8–9 OLI/TIRS C2/L2 under Select Your Data Set(s) instead of Landsat Archive > L8 OLI/TIRS. You’ll only need the surface reflectance bands — I won’t be using surface temperature data at all.

If you want to work with another area, feel free to find a location that’s interesting to you. Just pick someplace with both plants and water, since I’ll be demonstrating vegetation and water quality indices. Make sure the data are from Landsat 8 or 9 — a few things I discuss will be specific to the Operational Land Imager, so older Landsats or Sentinel 2 would be more of an extra credit project. You’ll also need both GDAL and Python installed. (Instructions in A Gentle Introduction to GDAL Part 5.)

Calculating NDVI

Here’s the equation for NDVI:

NDVI = (Near Infrared − Red)/(Near Infrared + Red)

Conceptually simple, but there’s a little bit of complexity introduced by working with real world satellite data.

Here’s the GDAL command (code blocks don’t wrap in Medium, so you’ll need to copy/paste into a text editor to see the whole thing.):

gdal_calc.py -D LC08_L2SP_043033_20210922_20210930_02_T1_SR_B4.TIF -E LC08_L2SP_043033_20210922_20210930_02_T1_SR_B5.TIF --calc="((E * 0.0000275 - 0.2) - (D * 0.0000275 - 0.2))/((E * 0.0000275 - 0.2) + (D * 0.0000275 - 0.2))" --outfile=tahoe_LC08_20210922_SR_NDVI.tif --type=Float32 --co COMPRESS=DEFLATE --co PREDICTOR=2

Unfortunately, it’s not nearly as simple as the plain NDVI equation. I’ll break down the parts, and explain why there are some extra bits.

The first part of the command assigns files to the variables used for the calculation:

-D LC08_L2SP_043033_20210922_20210930_02_T1_SR_B4.TIF -E LC08_L2SP_043033_20210922_20210930_02_T1_SR_B5.TIF

Each band of input data needs to be linked to a letter (either uppercase or lowercase) and that letter is used as shorthand in the part of the code that describes the calculation. I assigned the red band on Landsat 8, B4 to D and the near infrared band B5 to E. I chose “D” and “E” since they’re the 4th and 5th letters in the alphabet, corresponding to the 4th and 5th Landsat 8 bands. Easier to remember than something arbitrary, at least to me.

--calc="((E * 0.0000275 - 0.2) - (D * 0.0000275 - 0.2))/((E * 0.0000275 - 0.2) + (D * 0.0000275 - 0.2))"

Is the calculation itself, called by --calc and enclosed in double quotes. The operations are performed on the bands defined previously, and can be any function available in NumPy. The NDVI calculation uses simple arithmetic (only addition + subtraction - and division /), but having access to trigonometric functions, logarithms, and other mathematical operation unlocks a whole bunch of interesting possibilities. I won’t explore them much in this post, but maybe later.

The calculation is complicated by the fact that Landsat surface reflectance data are stored as unsigned integers (Type=UInt16 if you check with gdalinfo) not floating point numbers. So there is a scale ( 0.0000275) and offset (0.2) applied to each value in the file. This mathematical trick allows fractional reflectance values that range between 0 and 1 to be squeezed into a file format that only accepts whole numbers from 0 to 65535. The USGS explains this in more detail in a FAQ.

(The additive offset may look a little weird, but it allows room for negative reflectances that don’t exist in the real world, but can be generated if the surface reflectance correction overestimates the scattering effect of the atmosphere. Something that’s relatively common over dark surfaces, like vegetation and water.)

The calculation is significantly simpler if the data don’t need to be re-formatted, in which case the operation would look like this:

--calc="(E - D)/(E + D)"

Back to the straightforward NDVI equation (remember D is the red band and E is the near infrared band)! As with most computer languages and mathematical notation, the parentheses are needed to force addition and subtraction to be performed before division.

The rest of the command specifies the output file name--outfile=tahoe_LC08_20210922_SR_NDVI.tif and forces the file to be written as floating point numbers--type=Float32. (By default gdal_calc.py will write the output file in the same format as the input files. Which is a problem, because writing floating point numbers into an unsigned integer file results in unintelligible garbage. So you need to be explicit about the data type.) Finally --co COMPRESS=DEFLATE --co PREDICTOR=2 executes lossless compression (note that there’s a double dash because this is sending arguments to a python program, rather than running GDAL directly).

Making an Image from the Index

The output file will have NDVI values stored as floating point numbers in a single-channel grayscale GeoTIFF. If you want something for display in color you’ll still need to visualize the data with gdaldem color-relief.

The general workflow for this method is to convert a single band data file into a full-color image by correlating the data values with red, green, and blue color values defined in a text file. Read A Gentle Introduction to GDAL Part 6: Visualizing Data for more info.

Here is the command using the NDVI data created by gdal_calc.py:

gdaldem color-relief tahoe_LC08_20210922_SR_NDVI.tif ndvi_landsat_palette_beige_green.txt tahoe_LC08_20210922_SR_NDVI_color_0-.9_geo.tif -alpha -co COMPRESS=DEFLATE -co PREDICTOR=2

And here are the contents of the text file that correlate values to colors. (Note that this file will also create an alpha channel based on a “no data” value of -9999 — this ensures the area outside of the boundaries of the Landsat scene will be transparent.)

-9999,0,0,0,0
-9998,251,246,237,255
0,251,246,237,255
0.1,249,244,220,255
0.2,236,238,203,255
0.3,216,229,185,255
0.4,189,218,166,255
0.5,156,203,149,255
0.6,116,185,132,255
0.7,62,164,117,255
0.8,0,141,102,255
0.9,0,116,89,255
1,0,116,89,255

I created this beige-green gradient with HCL Wizard, my current favorite perceptual palette generator. (Which is a good reminder that the “tools” section of Subtleties of Color needs an update …)

And here’s what it looks like (with an added color key):

Normalized Difference Vegetation Index (NDVI) map of the Lake Tahoe region. Beige and light yellow indicates low NDVI, while dark green indicates high NDVI. — Normalized Difference Vegetation Index (NDVI) calculated from red and near infrared surface reflectance data with gdal_calc.py. Dark green represents thriving vegetation. Light green represents sparse vegetation, while beige is bare land or fresh burn scar. Water has not been masked. (Landsat 8 data collected on September 22, 2021. Available on the USGS Earth Explorer.)

I think the beige to green color palette makes the landscape fairly easy to read. Bare granite of the High Sierra peaks is light beige. Sparsely vegetated desert lowlands in Nevada (upper right) and the Caldor Fire burn scar are both a little bit darker and greener. Dry grasslands in the Central Valley (lower left) are greener still. High altitude forests along the Sierra Crest (upper left to lower right) are a medium green; while wetter, lower elevation forests on the Sierra’s West Slope are dark green. Fields in Nevada and California are the darkest green, representing dense vegetation nourished by irrigation and fertilizer.

Look closely, and one discrepancy stands out — NDVI values for water range wildly from below zero to above 0.9 (outside the bounds of the color palette) seemingly at random. In fact, The variability is so high it’s best to flag water as “no data” in an NDVI map, replacing it with either a fill value or more meaningful measurement. To understand why, and to find a replacement, it’s worth delving a little more deeply into how NDVI works.

Unmasked NDVI (left) compared with a true color image (right) of Lake Tahoe. NDVI measurements of water bodies can vary in unpredictable ways. (Landsat 8 data collected on September 22, 2021. Available on the USGS Earth Explorer.)

NDVI relies on two properties of vegetation — the strong absorption of visible light (particularly red light) by chlorophyll, and the strong reflection of near infrared light by the cellular structure of leaves. High NDVI values (over 0.5 or so) occur when healthy leaves cover a large fraction of any given pixel. Lower NDVI values can be due to leaves with relatively low amounts of chlorophyll (unhealthy or senescent leaves) or sparse vegetation. The algorithm can’t distinguish between lots of unhealthy leaves or a lack of leaves.

Although mathematically NDVI can range from -1 to +1, the realistic range for pixels with at least some vegetation varies from 0 to 0.8 or so (the precise minimum and maximum values depends on the attributes of the sensor).

If there’s no vegetation at all NDVI isn’t really meaningful. In general, natural vegetation-free surfaces (sand, bare rock, exposed soil) reflect similarly throughout the visible and near infrared spectrum. So NDVI will tend to be near zero, or even slightly negative — but the exact value of NDVI doesn’t have physical significance in the absence of vegetation.

Unless there’s water present. Clear water absorbs light more efficiently with longer wavelengths, so gets darker from blue to green to red to near infrared. As a result, under ideal conditions NDVI will be negative for water pixels. If the water is clear. Water with sediment in it (during a flood, or spring melt, for example) or with algae growing near the surface, can be bright in near infrared, and often exhibits positive NDVI. Worse yet, under certain conditions sunlight will reflect directly into the sensor taking measurements (a phenomenon called sunglint), effectively blinding it. All of this makes it difficult to confidently separate water from land pixels only using NDVI.

These four images compare radiance (top row) and surface reflectance (bottom row) in the red (left) and near infrared (right) bands of Landsat 8. Contrast between land and water is higher in near infrared than red, so NDVI (which uses red and near infrared light) yields some information about where there’s water. But it’s far from perfect, and particularly susceptible to mis-identifying water when there’s a lot of sediment present or there is sunglint on the surface. In the radiance images the lower-right portion of Lake Tahoe is lighter than the upper-left, which suggests some sunlight is reflecting off the surface into the sensor. The surface reflectance algorithm isn’t adequately compensating for the change in brightness, which results in the relative lightness flipping (areas in sunglint are darker) in the surface reflectance data. This is responsible for the extreme variablilty in NDVI apparent in Tahoe. (Landsat 8 data collected on September 22, 2021. Available on the USGS Earth Explorer.)

Put another way — NDVI is sensitive to the presence of water, but there’s no threshold value that separates “water” from “not water”. At least if you’re limited to measurements in the visible and near infrared wavelengths. Water absorbs shortwave infrared (SWIR) extremely efficiently — so well that mud or wet sand can be nearly black in Landsat’s SWIR bands. Algorithms that incorporate this extra information have a much better chance of accurately detecting water.

Detecting Water in Landsat Data

One of the advantages of calculating your own spectral indices is that you are not limited to pre-generated products — you can find an algorithm that works for the specific problem you’re trying to solve. Poking around a bit for algorithms that use Landsat’s shortwave infrared bands for water detection, I found the Automated Water Extraction Index (AWEI), described in this paper: An Automated Method for Extracting Rivers and Lakes from Landsat Imagery.

The formula is:

AWEIsh = BLUE + 2.5 × GREEN − 1.5 × (NIR + SWIR1) − 0.25 × SWIR2

(The lowercase “sh” after AWEI denotes that this is the variant developed for use in mountainous areas, where shadows can be misidentified as water. NIR = near infrared, SWIR1 = shortwave infrared one, and SWIR2 = shortwave infrared two.)

This algorithm was initially developed using data from the Thematic Mapper (TM) aboard Landsat 7, and fortunately the bands on Landsat 8 & 9’s Operational Land Imager (OLI) are very close, so it works with both instruments. SWIR1 is OLI band 6, sensitive to wavelengths from 1.57–1.65 µm, and SWIR2 is OLI band 7, sensitive to wavelengths from 2.11–2.29 µm.

Here is the equivalentgdal_calc.py command:

gdal_calc.py -B tahoe_LC08_20210922_B2_SR_float.tif -C tahoe_LC08_20210922_B3_SR_float.tif -E tahoe_LC08_20210922_B5_SR_float.tif -F tahoe_LC08_20210922_B6_SR_float.tif -G tahoe_LC08_20210922_B7_SR_float.tif --calc="B  + 2.5 * C - 1.5 * (E + F) - 0.25 * G" --outfile=tahoe_LC08_20210922_SR_AWEIsh.tif --type=Float32

Notice that I didn’t include the scale and offset values necessary to convert from 16-bit unsigned integers to 32-bit floating point data. I pre-calculated those values and saved the files with simpler names to make the code easier to read. As before, I linked the data file for each band to a capital letter and defined the calculation with --calc.

The output file is 32-bit floating point data, but the only important thing is that positive values are likely water, and negative values are likely land. Scaling so zero and above is white and less than zero is black turns out like this:

Water mask from the Automated Water Extraction Index optimized for use in areas likely to have shadows. Spectral indices that include shortwave infrared data are generally more accurate for detecting water than indices that only include visible and near infrared bands. (Landsat 8 data collected on September 22, 2021. Available on the USGS Earth Explorer.)

Which looks pretty good. Great even, with the caveat that there’s a smoke plume from the Caldor Fire just south of Tahoe that’s being incorrectly identified as water. It’s good to be aware that the real world can be a messy place, even if you’ve chosen your data carefully and done all your calculations correctly. The Landsat surface reflectance product does include some data quality information which could help clean that up, but applying it properly is complicated enough to be worth its own tutorial.

So far I’ve calculated NDVI and AWEI (to find where there is (or is not) water), and I want to flag all the water pixels to be No Data (which will be transparent when the data is converted into an image or merged with another dataset). To do this, I have to combine the NDVI file and the AWEIsh files.

Idealized versions of the NDVI & AWEIsh arrays. Positive values in AWEIsh indicate water pixels.

The trick is to zero out all the NDVI pixels that are water, not land, then subtract a large value from those pixels and set it to No Data. You can’t simply set zero as No Data because there are valid, non-water NDVI pixels with a value of zero or lower. Here’s the command I used:

gdal_calc.py -V tahoe_LC08_20210922_SR_NDVI.tif -W tahoe_LC08_20210922_SR_AWEIsh.tif - outfile= tahoe_LC08_20210922_SR_NDVI_masked.tif --calc="V*(W<0) + (-9999*(W>=0))" - NoDataValue=-9999 -co COMPRESS=DEFLATE -co PREDICTOR=2

I’ve assigned the NDVI data to V and water data to W, then used the < (less than) operator on AWEIsh to create an array with 1 where there is land (where AWEIsh is less than zero) and 0 where there is water (AWEIsh is greater than or equal to zero). Multiplying that array with NDVI sets all water pixels to zero and leaves NDVI as-is. Then I use the >= (greater than or equal to) operator on AWEIsh to create an array where 1 indicates water and 0 indicates land, and multiply that by -9999. Finally I add the water mask array to the NDVI array, which sets all water pixels to -9999, and then designate -9999 as No Data, with the option--NoDataValue=-9999. This results in a nice clean dataset with land pixels having the calculated NDVI values and water pixels assigned No Data.

Graphical version of the water mask calculation. Numpy’s less than (<) and greater than (>) operators allow creation of a mask based on a threshold value. This allows invalid data to be removed from a map, or different types of data to be combined seamlessly.

I again used gdaldem to create a color image:

gdaldem color-relief tahoe_LC08_20210922_SR_NDVI_masked.tif ndvi_landsat_palette_beige_green.txt tahoe_LC08_20210922_SR_NDVI_masked_color.tif -alpha -co COMPRESS=DEFLATE -co PREDICTOR=2

A map of masked NDVI data in the Lake Tahoe region. In most cases vegetation indices aren’t meaningful for water pixels, and it’s best to remove them. (Landsat 8 data collected on September 22, 2021. Available on the USGS Earth Explorer.)

In the resulting image all the land pixels are color-coded NDVI and all the water pixels are transparent. Which not only removes the distraction of all the oddly-colored water bodies, but allows for another dataset — perhaps one focused on water quality — to be featured. As with vegetation and binary water detection, there are a number of ways to estimate water quality with Landsat data. I’m not an expert, but the Water Quality Index (WQI) defined in A reflectance-based water quality index and its application to examine degradation of river water quality in a rapidly urbanising megacity looks reasonable.

This is the Water Quality Index equation:

(GREEN + (BLUE − RED) − SWIR1)/(GREEN + (BLUE − RED) + SWIR1)

And the corresponding gdal_calc command:

gdal_calc.py -B tahoe_LC08_20210922_B2_SR_float.tif -C tahoe_LC08_20210922_B3_SR_float.tif -D tahoe_LC08_20210922_B4_SR_float.tif -F tahoe_LC08_20210922_B6_SR_float.tif -W tahoe_LC08_20210922_SR_AWEIsh.tif --outfile=tahoe_LC08_20210922_SR_WQI_masked.tif --calc="(C + (B - D) - F)/(C + (B - D) + F) * (W>=0) + ((W<0) * -9999)" --NoDataValue=-9999 --co COMPRESS=DEFLATE --co PREDICTOR=2

As with my calculation of the Automated Water Extraction Index, I used pre-scaled surface reflectance data converted to floating point data, renamed to be more readable. I also converted the land pixels to No Data with the AWEI mask at the same time I calculated WQI, rather than doing it in a separate step. Since the water pixels were valid in this case, I set all the locations where AWEI was less than zero to a no data value of -9999.

One last execution of gdaldem color-relief, this time using a classic Color Brewer palette:

gdaldem color-relief tahoe_LC08_20210922_SR_WQI_masked.tif color_brewer_Bl9.txt tahoe_LC08_20210922_SR_WQI_color_masked.tif -alpha -co COMPRESS=DEFLATE -co PREDICTOR=2

The Color Brewer 9-class Blues palette formatted for gdaldem color-relief:

-9999,0,0,0,0
-9998,247,251,255,255
0,247,251,255,255
0.125,222,235,247,255
0.25,198,219,239,255
0.375,158,202,225,255
0.5,107,174,214,255
0.625,66,146,198,255
0.75,33,113,181,255
0.875,8,81,156,255
1,8,48,107,255

Which creates this map of blue lakes floating in empty space …

Map of Water Quality Index (WQI) for the Lake Tahoe region with land areas removed. Dark blue indicates high water quality, while lighter blue represents lower quality water. (Landsat 8 data collected on September 22, 2021. Available on the USGS Earth Explorer.)

Nice, but not quite the result I want, which is to combine the NDVI data with the water quality data. Since areas of no data are already masked out in both the NDVI and WQI images, the data can be merged with a single line using gdalwarp:

gdalwarp tahoe_LC08_20210922_SR_WQI_color_masked.tif tahoe_LC08_20210922_SR_NDVI_masked.tif tahoe_LC08_20210922_WQI_NDVI.tif -co COMPRESS=DEFLATE -co PREDICTOR=2

gdalwarp calls the command, tahoe_LC08_20210922_SR_WQI_color_masked.tif and tahoe_LC08_20210922_SR_NDVI_masked.tif are the input file names, tahoe_LC08_20210922_WQI_NDVI.tif is the output file name, and -co COMPRESS=DEFLATE -co PREDICTOR=2 tells GDAL to losslessly compress the TIFFs to save drive space (single dash this time).

The result is a combined map of vegetation and water quality in Lake Tahoe and the surrounding Sierra Nevada, based on a single Landsat scene.

Combined NDVI and WQI map, generated entirely with GDAL. (Landsat 8 data collected on September 22, 2021. Available on the USGS Earth Explorer.)

To recap: I used gdal_calc.py to calculate spectral indices and masks from multispectral data on the command line, colorized the data with gdaldem color-relief, and combined the files with gdalwarp.

I findgdal_calc.py convenient because it enables access to the powerful mathematical functions of numpy without the overhead of Python or other programming languages. As such it’s great for prototyping visualizations or working with limited numbers of files. When you’re working with large numbers of files or performing more complex tasks (like generating a similar map from a time series of Landsat scenes) it’s probably better to look into full blown programming languages (which will likely use GDAL under the hood for parsing scientific data formats and reading georeferencing information). Which will be the topic of Part 8 — using the Python GDAL bindings to open and visualize HDF (and maybe NetCDF) files.

A Few More Thoughts on Spectral Indices

It’s really easy to browse through a website, crack open a textbook, or read a blog post (including this one!) and find a spectral index that purports to quantify one physical parameter or another — there are literally hundreds of them. And often times you can calculate an index, throw a palette on it, and think “that looks about right”. It’s much harder to know which indices are appropriate to solve which problems, and understand the limitations inherent in different remote sensing systems and how that might affect your results. Remote sensing is hard!

For example, look at carefully at the water quality index data for Lake Tahoe. There are features that look like ripples and eddies that I’m convinced are due to surface glint and don’t represent differences in water quality.

Water Quality Index map of Lake Tahoe. The features in the lake that look like ripples or eddies are much more likely to be due to sunglint than real differences in water quality. (Landsat 8 data collected on September 22, 2021. Available on the USGS Earth Explorer.)

There are a ton of variables and uncertainties that can impact the accuracy of even the simplest and most well understood indices, like NDVI. Seemingly small variations in the angle of the sun, the amount of ozone in the atmosphere, or sensor characteristics can have a large impact on the calculated measurement. A little effort put into understanding these limitations goes a long way towards helping to realize the potential of satellite data to mitigate many social and environmental challenges.

Next up in A Gentle Introduction to GDAL Part 8: using GDAL to open sometimes-tricky scientific data formats like HDF & NetCDF, along with using GDAL to enable Python to read & write geospatial data.

Thanks to Joe Kington for being a sounding board and (as always) Frank Warmerdam for encouraging me to learn GDAL in the first place.