A Gentle Introduction to GDAL Part 6.1: Visualizing Data

Robert Simmon
16 min read · Oct 6, 2023


You may be wondering what I mean by “data visualization with GDAL”? After all, aren’t satellite images and maps — prominently featured in previous installments — both data? Well, yes. But there are differences between types of data that are worth expanding on before diving into examples of using GDAL to create thematic maps (maps “used to emphasize the spatial pattern of one or more geographic attributes”, from Thematic Cartography and Geographic Visualization). If you want, you can just skip to the tutorials, but if you’re curious, hang around for a bit.

Images vs. Data

The first Earth observations returned from space — everything from Corona spy satellites to the first pictures from weather satellites to photographs of the Earth from Apollo astronauts — were more on the “image” side of the data/image divide than the “data” side. All of these examples were recorded with, broadcast by, and stored on analog equipment (seriously, the early TIROS data is stored in a series of oversized books), with the variations in brightness representing relative, not absolute, differences in intensity.

Half Dome and Glacier Point, Yosemite National Park, imaged by a Corona satellite on October 21, 1964. Image by Robert Simmon based on data from the USGS Earth Explorer. CC-by-SA-4.0.

The information these pictures contained was more qualitative than quantitative, with the types of analysis that could be performed more suited to words than equations. These distinctions aren’t absolute. It’s certainly possible to make qualitative imagery into quantitative data with operations like counting aircraft or tracing the outline of a glacier. But for the most part, early satellite data is best thought of as a photograph (and some data were photographs, with the film returned to Earth).

Landsat Return Beam Vidicon image of Madagascar collected on June 8, 1981 (left), and Earth from Apollo 16, taken on April 16, 1972 (right). Early satellite data was, in general, more qualitative than quantitative. Images from the USGS (left) and NASA (right).

This changed in the 1970s as satellite remote sensing moved into the digital age, spearheaded by Landsat’s Multispectral Scanner (MSS). (Keep in mind there’s nothing inherently superior about digital vs. analog — an analog HDTV signal is higher quality than a digital DVD. “Digital” in this usage just means measurements broken up into discrete values. Digital data can be more easily stored and replicated than analog data, and computers — which require digital inputs — enable revolutionary types of analysis.) The MSS was originally intended to be complementary to Landsat 1’s principal instrument — the TV-like Return Beam Vidicon (RBV). However, users soon realized the MSS was superior to the RBV. Scenes from the MSS were more uniform, more precisely aligned with features on the surface of the Earth, were more consistent from image to image, and contained two near infrared bands (which measured light in wavelengths slightly longer than what human eyes can see) that proved critical for mapping vegetation.

False-color image of Half Dome, Yosemite National Park, California taken by the Multispectral Scanner engineering model, which ended up being flown on Landsat-1. Photo courtesy NASA/USGS Landsat Program.

Crucially, data from the MSS were calibrated. Rather than being a relative measure, each pixel represented the amount of energy detected by the sensor in four separate wavelengths of light. In other words, the instrument collected photons over an 80 by 80 meter area on the Earth’s surface and converted those photons into an electric charge, which was then measured and stored as a digital number (sometimes referred to as a “DN”) — a process very similar to that of a modern digital camera. The resulting digital number can then be re-converted into real-world units of radiance — watts per square meter per steradian.

That sounds scary, and maybe it is a little bit, but it accounts for the complexity of precisely measuring the Earth’s features from space. Radiance is usually converted into reflectance, which is conceptually a bit easier to understand. Also known as albedo (a term I first heard listening to Vangelis, believe it or not), reflectance is the proportion (or percentage) of the sunlight incident on the area being measured that is reflected back to the sensor. For example, asphalt reflects less light than fresh snow. Unlike radiance, which varies based on ambient conditions, reflectance is a property of a surface, so observations collected at different times of day, in different places, or under different conditions can all be compared with each other. There is a lot of complexity here, and this is a digression within a digression, so if you want to know more I found this explanation very helpful.

Landsat 1 Multispectral Scanner (MSS) false-color image of the Sierra Nevada, including Yosemite National Park and Lake Tahoe, collected on September 6, 1974. Precisely calibrated data from the MSS allowed scientists to quantitatively map the surface of the Earth. Image by Robert Simmon based on data from the USGS Earth Explorer. CC-by-SA-4.0.

In my mind, calibration is what distinguished images from data, although there’s no sharp boundary between the two. It’s possible to make data from images, and images from data. Sometimes the conversion of data to image is reversible (i.e. one can derive the original calibrated data from the image), but more often the process is lossy, so the image can’t be used to perfectly re-create the data it was derived from.

There are a few reasons for this. Human perception is — to put it mildly — imprecise, and the changes needed to make data interpretable to our eyes and brain inherently distort the underlying data. Plus, most image formats contain less information than scientific data formats. Sharpening, color-correction, bit-depth reduction, compression, limitations on dynamic range, and the small number of bands in some types of files can all result in the irretrievable loss of information. So imagery is usually a subset of the data it’s based on.
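A tiny sketch of that loss (hypothetical reflectance numbers, not any particular sensor): quantizing 32-bit values down to 8 bits for display collapses nearby values together, and the original data can’t be recovered from the image.

```python
import numpy as np

# Hypothetical calibrated reflectance values (0.0-1.0) stored as 32-bit floats
reflectance = np.array([0.0512, 0.0517, 0.8349], dtype=np.float32)

# Quantize to 8 bits for display -- the step most image formats force on you
display = np.round(reflectance * 255).astype(np.uint8)

# Try to recover the original data from the image
recovered = display.astype(np.float32) / 255

# The two nearly identical dark values collapse into the same 8-bit level,
# so the difference between them is irretrievably lost
print(display.tolist())              # [13, 13, 213]
print(recovered[0] == recovered[1])  # True -- originally distinct values
```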

Why is the collection of calibrated data from space so important? With calibrated measurements, it’s possible to relate the amount of light (more accurately “electromagnetic radiation”, although there are a handful of spaceborne remote sensing technologies that don’t rely on light at all!) detected by a sensor to real-world “geophysical parameters”.

These three visualizations of the same scene from Landsat 8 — true-color (red, green, blue), false-color (shortwave infrared, near infrared, green), and a map of vegetation and water quality — demonstrate the spectrum from “image” to “data”. The true-color scene was rendered from radiances to approximate what the human eye would see. The false-color image was derived from surface reflectance data, and uses two bands invisible to human eyes. The map combines normalized difference vegetation index (NDVI) and water quality index (WQI), both of which are numeric quantities that relate to properties of the Earth’s surface. Images & map by Robert Simmon, based on data acquired on September 22, 2021, downloaded from the USGS Earth Explorer. CC-by-SA-4.0.

In other words, it’s possible to go from the detection of varying amounts of light at different wavelengths (many of them invisible to humans!), to mapping things that help us better understand the world around us. Some of these parameters are straightforward: vegetation health, cloud cover, air pollution, or surface temperature. Others are more exotic: energy balance, total ozone, turbidity, latent heat, evapotranspiration … there are literally thousands of different quantities being measured by NASA, other national and international organizations, and private companies all day every day.

Visualizing GeoTIFFs with GDAL

Which (finally!) brings me back to the topic at hand. How can GDAL — a code library mostly known for converting map projections and translating between file formats — help turn these massive archives of geospatial data into useful (and maybe beautiful) maps? If you read my previous installment in this series, A Gentle Introduction to GDAL Part 5: Shaded Relief, you hopefully noticed that GDAL’s gdaldem program is capable not only of creating shaded relief from elevation data, but also of mapping elevation to color. And not just elevation — just about any type of georeferenced numeric data. In addition, GDAL has robust capabilities to read metadata and interpret a wide variety of scientific data formats. This unlocks some data sources that are, shall we say, tricky to access. (And yes, I am talking about HDF. You’ve been warned.)

I’ll also confess that I stole this entire technique from Josh Stevens. (See Commanding Cartography: Take Control of Faster, More Elegant Workflows from the Command Line.) He was one of several people who dragged me (kicking and screaming) into modern cartography and data visualization workflows.

The basic pattern to use gdaldem to render data is:

gdaldem color-relief <input_dem> <color_text_file> <output_color_relief_map>

<input_dem> is the full name of the input data file (reminder — this doesn’t need to be a DEM!)

<color_text_file> is a text file relating data values to colors. The format is data value, red value (from 0–255), green value (from 0–255), blue value (from 0–255), with an optional value for alpha (also from 0–255, with 0 indicating fully transparent and 255 indicating fully opaque — for this to work you need to add the -alpha flag.)

<output_color_relief_map> is the full name of the output image file.

I’ll work through an example using Daymet data — 2022 annual maximum and minimum temperatures for North America — downloaded from Oak Ridge National Laboratory. It’s a nice dataset because it’s derived “using inputs from multiple instrumented sites and weights for each site that reflect the spatial and temporal relationships of the estimation location to the instrumental observations”. Which means minimal artifacts and no missing data — perfect for visualization.

The two file names are daymet_v4_tmax_annavg_na_2022.tif and daymet_v4_tmin_annavg_na_2022.tif. It’s usually a good idea to get a sense of what’s in a file with gdalinfo before trying to render it:

gdalinfo daymet_v4_tmax_annavg_na_2022.tif

There’s a ton of output text, mostly focused on projection information, but I want to call attention to some of the information about the data itself (if you’ve heard the term “metadata” before, but didn’t know what it means, it’s data about data!):

Band 1 Block=256x256 Type=Float32, ColorInterp=Gray
Min=-25.118 Max=37.871
Minimum=-25.118, Maximum=37.871, Mean=9.832, StdDev=12.086
NoData Value=-9999

Breaking it down:

Band 1 indicates this information is about the first dataset in the file (in this case there’s only one, but a true-color image would have at least three (red, green, and blue), a multispectral dataset could have four to a few dozen, and a hyperspectral dataset would have hundreds). And you may have noticed that in this case Band 1 is temperature — it’s not even a band at all! Another reminder that images and data are related, and often can be treated similarly, but aren’t always the same thing.

Type=Float32 denotes that the numbers that make up the data are formatted as 32-bit floating point values. This type of number can be positive or negative, integer or decimal, with about 7 significant figures. I’ll cover some other common options later. Also keep in mind that some of these data types (floating point and signed 16-bit integers, in particular) can’t be interpreted correctly by image processing programs (like Photoshop) and need to be translated to be displayed on a screen.
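For a concrete sense of those ranges and precisions, here’s a quick sketch using NumPy’s type-info helpers (NumPy is just a convenient way to inspect the same number formats GDAL reports; GDAL itself isn’t involved):

```python
import numpy as np

# 32-bit float: enormous range, roughly 6-7 significant decimal digits
print(np.finfo(np.float32).max)        # ~3.4e38
print(np.finfo(np.float32).precision)  # approximate decimal digits: 6

# Unsigned 16-bit integer: whole numbers from 0 to 65535 -- no negatives
print(np.iinfo(np.uint16).min, np.iinfo(np.uint16).max)

# Signed 16-bit integer: -32768 to 32767
print(np.iinfo(np.int16).min, np.iinfo(np.int16).max)
```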

ColorInterp=Gray means that the data aren’t assigned to a color channel or alpha channel (which makes sense, since there’s only one band). The data could be assigned to a color palette, like in a GIF or 8-bit PNG, but that’s a bit of a special case.

Min=-25.118 Max=37.871 are the minimum and maximum values in the dataset. These are important, since they’ll help you pick end points when you apply a color ramp. If they’re not in the displayed metadata, add the -mm flag, like this: gdalinfo -mm daymet_v4_tmax_annavg_na_2022.tif. That will compute the values, but it can take a long time, since it needs to read every value in the entire dataset!

Minimum=-25.118, Maximum=37.871, Mean=9.832, StdDev=12.086 lists more statistics. I’m honestly not sure why Minimum and Maximum are repeated. Mean (average) and StdDev (standard deviation) are both useful, but aren’t needed to make the data into a picture.

NoData Value=-9999 specifies the value for missing data. NoData represents ocean water in this particular dataset, but it could be bad data, missing data, cloudy data, or even land in an ocean dataset. It’s important to remember that NoData is not the same thing as zero! So it usually gets its own color, or is even made transparent.

Phew, even stripping out all the geographic info and basics about the file, that was a lot! But it’s necessary to define the relationships between color and value, which is what the <color_text_file> contains. Here’s what I used for this particular dataset, contained in a file named daymet_v4_temperature_palette_-20_40.txt:

-9999,0,0,0,0
-100,69,117,180,255
-20,69,117,180,255
-10,145,191,219,255
0,224,243,248,255
10,255,255,191,255
20,254,224,144,255
30,252,141,89,255
40,215,48,39,255

It follows the same format as I described in A Gentle Introduction to GDAL Part 5: Shaded Relief, except in this case the value is average maximum temperature instead of elevation or aspect (which is itself derived from elevation). The same pattern works regardless of dataset, as long as the data is a range of numbers:

value,red,green,blue,alpha (alpha is optional)

In the case of the elevation data in my previous post, the file was a nice perfect square with no missing data, so figuring out how to represent NoData or assigning a value for alpha wasn’t important. This example, however, has plenty of missing data points. I could assign those missing data to a color (preferably a neutral color that’s distinct from any of the colors on the color ramp), but it’s more elegant to make them transparent. That’s indicated by the 0 in the fifth column of the first row. All the other values have an alpha of 255 — fully opaque (like the red, green, and blue channels, the alpha channel is represented by an 8-bit number, which has a possible range from 0 to 255).

A few other subtleties I should explain. There’s a line for temperatures well below the minimum value in the file: -100, which has the same color assigned to it as -20. (Both are red = 69, green = 117, blue = 180, alpha = 255; an opaque dark blue.) If that additional, lower value weren’t there, GDAL would try to interpolate from that dark blue to the black assigned to -9999, since the algorithm doesn’t take into account that -9999 really means “nothing to see here”. There’s a small tail of temperature data below the -20˚C that I chose as the minimum of the displayed range, and that extra line makes sure it appears correctly.
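That interpolation is the key behavior to understand: gdaldem builds colors for in-between values by interpolating linearly between adjacent palette rows, which is exactly why the -9999 sentinel needs to be fenced off with an extra entry. The idea looks something like this (a simplified sketch, not GDAL’s actual code):

```python
def lerp_color(value, low, high):
    """Linear interpolation between two palette rows,
    each formatted as (value, red, green, blue, alpha)."""
    t = (value - low[0]) / (high[0] - low[0])
    return tuple(round(low[i] + t * (high[i] - low[i])) for i in range(1, 5))

# Two adjacent entries from the palette file above: 0 C and 10 C
row_0 = (0, 224, 243, 248, 255)
row_10 = (10, 255, 255, 191, 255)

# A pixel value of 5 gets a color halfway between the two entries
print(lerp_color(5, row_0, row_10))  # (240, 249, 220, 255)
```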

Here’s what that color ramp looks like applied to two DAYMET datasets: average minimum daily temperature (left) & average maximum daily temperature (right) for 2022 over North America.

2022 average minimum (left) and maximum (right) temperatures. Maps by Robert Simmon based on data from DAYMET, Oak Ridge National Laboratory. CC-by-SA-4.0.

Yet another aside: these maps use a palette (slightly modified from the ColorBrewer RdYlBu scheme) usually reserved for divergent datasets — datasets that are positive and negative (electric charge), or otherwise have a departure from a central value (the difference of one year’s rainfall from average). Temperature has a continuous range from low to high, so could be suitably displayed by a sequential palette, which goes from light/pale to dark/saturated (or vice-versa) instead of having a light/neutral color in the middle with dark/saturated colors on either extreme. The trouble with using a sequential palette here is that the lighter values are usually de-emphasized, and viewers are typically going to be interested in both cold and hot extremes. So the divergent palette works. Plus people often have a strong association with “cool” (green and blue) and “warm” (yellow, orange, and red) colors, so palettes that break those expectations can be confusing for viewers.

And y’know what? Rules are meant to be broken. Just do so with a clear idea of why you’re breaking them.

The code to generate both images is almost identical, with the only difference being the input and output filenames. (And make sure you specify different output file names! Because gdaldem will happily overwrite your files with no warning. Batch processing a large stack of humungous files and realizing they were all named you_forgot_to_change_the_output_file_name_dummy.tif is … unpleasant.)

gdaldem color-relief -alpha daymet_v4_tmax_annavg_na_2022.tif  daymet_v4_temperature_palette_-20_40.txt daymet_v4_tmax_annavg_na_2022_color.tif -co COMPRESS=DEFLATE -co PREDICTOR=2

gdaldem color-relief -alpha daymet_v4_tmin_annavg_na_2022.tif daymet_v4_temperature_palette_-20_40.txt daymet_v4_tmin_annavg_na_2022_color.tif -co COMPRESS=DEFLATE -co PREDICTOR=2

(A few notes: code blocks in Medium don’t wrap, so copy/paste into a text editor or terminal or scroll right to see the full command. I also include these potentially cryptic “creation options” in almost all of my GDAL examples: -co COMPRESS=DEFLATE -co PREDICTOR=2. Creation options are specific to writing particular file types in GDAL, in this case TIFFs. These are telling GDAL to write the TIFFs with an efficient lossless compression algorithm to save drive space, at the expense of processing time. (Koko Alberti wrote a very thorough explanation of GeoTIFF compression options, if you’re curious.))

Another confession — I didn’t figure out the colors and ranges for these maps by looking at the minimum and maximum values spit out by gdalinfo and editing a text file. I did that part of the work in QGIS, which is more interactive than the command line (there are plenty of other options, too, like Photoshop with Avenza Geographic Imager, ENVI, or ArcGIS). However, once I decided on the color ramp and range, I used GDAL. If you know what you want, it’s faster, less error prone, and more repeatable to execute a bit of code than it is to manually punch in numbers and edit file names.

Screen shot of QGIS showing assignment of colors to data values in the symbology sub-menu of the layer properties dialog box (there’s *a lot* of options available in QGIS…). It’s often more efficient to prototype color palettes in a GUI than to switch between writing code and previewing results. GDAL (or another programmatic approach), on the other hand, is better for repeatability.

That’s the basic workflow for visualizing data with GDAL:

  • Read the metadata with gdalinfo to determine maximum and minimum values
  • Write a text file that maps values to colors: value, red, green, blue, alpha
  • Run gdaldem with color-relief
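As a sketch, the second step can be scripted rather than typed by hand — handy once palettes get long. (The output filename here is just illustrative; the breakpoints are a subset of the Daymet palette above.)

```python
# (value, red, green, blue, alpha) -- NoData first, then the color ramp
palette = [
    (-9999, 0, 0, 0, 0),       # NoData: fully transparent
    (-20, 69, 117, 180, 255),  # dark blue
    (0, 224, 243, 248, 255),   # pale blue
    (20, 254, 224, 144, 255),  # pale orange
    (40, 215, 48, 39, 255),    # dark red
]

# Write one comma-separated row per line -- the format gdaldem expects
with open("example_palette.txt", "w") as f:
    for row in palette:
        f.write(",".join(str(v) for v in row) + "\n")
```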

When Things Get Weird: Working with Scale and Offset

With GeoTIFFs, that’s usually straightforward, but there can be curveballs.

Here’s an example from another dataset: global September air temperature for 1981–2010 from Climatologies at High Resolution for the Earth’s Land Surface Areas (CHELSA). (Download here.) First, the metadata from gdalinfo:

Band 1 Block=43200x1 Type=UInt16, ColorInterp=Gray
Min=2100.000 Max=3103.000 Computed Min/Max=2099.000,3104.000
Minimum=2100.000, Maximum=3103.000, Mean=2801.645, StdDev=212.587
NoData Value=-2147483647
Offset: -273.15, Scale:0.1

Which seems … fine, but the values don’t look quite right for temperature (thousands of degrees?) and there’s a new line: Offset: -273.15, Scale:0.1. What’s that all about? There’s another clue, too: Type=UInt16.

Any ideas?

There are several reasons for the odd scaling of the data. First, the data are in UInt16, an integer format (no decimal places), so to get a precision of a tenth of a degree the data are stored as 10 times the original value. That means they need to be multiplied by 0.1 when the data are read back in. The second reason is that UInt16 is an unsigned integer format, which means no negative values. Since there are lots of places in the world where it’s below 0˚C, the data need to be offset so that negative values can be stored correctly in the chosen format.

Which dovetails nicely with the fact that satellite temperature data is generally measured in kelvins (degrees above absolute zero). (It’s a long story, but satellite temperature estimates are made indirectly, by measuring the brightness temperature of a surface. Which itself relates to the blackbody radiation emitted by the surface, (mostly) determined by its temperature above absolute zero.) That’s what the Offset parameter is for: 0˚C = 273.15 kelvins. It’s necessary to subtract 273.15 from the data to convert from kelvins to Celsius (after dividing by 10, of course).
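Putting the two together (a sketch; `dn` is the raw stored value, with the scale and offset taken from the gdalinfo output above — GDAL’s convention is physical value = raw value × scale + offset):

```python
SCALE = 0.1       # from the metadata: Scale:0.1
OFFSET = -273.15  # from the metadata: Offset: -273.15

def dn_to_celsius(dn):
    """Decode a raw stored value into degrees Celsius."""
    return dn * SCALE + OFFSET

def celsius_to_dn(temp_c):
    """Encode degrees Celsius back into the raw stored range --
    the same conversion applied to the palette breakpoints."""
    return (temp_c - OFFSET) / SCALE

print(dn_to_celsius(2451.5))  # ~ -28.0, the coldest entry in the palette
print(celsius_to_dn(38))      # ~ 3111.5, the warmest
```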

I hope that all made sense, because if you want to display the data for an audience that’s more comfortable in ˚C (I’m not even going to get into ˚F) than K, either the palette or the data needs to be scaled first. Further complicating things, some software (like QGIS) automatically applies scale and offset when displaying datasets with that information in the metadata. Which is great if your workflow ends in QGIS, but not so great if you’ve used QGIS as a preview tool and are going back to GDAL to visualize the data.

Doing math on every pixel of a large dataset (43,200 by 21,600 pixels in this case) can take a while, so it’s generally more efficient to translate the units in the palette file rather than modifying the underlying data. For a small number of value-color pairs it’s not too hard to do it by hand or in a spreadsheet, but Python’s a good tool for reading a text file, converting that to numbers, doing some calculations, converting back to text, and writing a new file. I’m not going to explain this in detail (although I might in the future … ) but here’s what my program looks like:

# open the original palette, in ˚C
palette_c = open("ast_palette_RdYlBl.txt", "r")

# create an empty list to store the data
palette_k = []

# read the original palette into the list as floats & convert the value
# column from ˚C to the stored units (kelvins multiplied by 10)
for line in palette_c:
    line_clean = line.strip()
    line_list = line_clean.split(",")
    line_list[0] = str((float(line_list[0]) + 273.15) * 10)
    palette_k.append(line_list)

# close the data object pointing to the original file
palette_c.close()

# create a destination file
outfile = open("palette_k_-22_38_gdal.txt", "w")

# write each line of the modified palette into the destination file,
# with newlines
for palette_line in palette_k:
    row_string = '{},{},{},{}'.format(palette_line[0], palette_line[1], palette_line[2], palette_line[3])
    outfile.write(row_string)
    outfile.write("\n")

# close the outfile data object
outfile.close()

Hopefully the comments are sufficient to explain what’s going on (familiarity with Python will help).

That program will convert this:

-28,69,117,180
-22,111,159,205
-16,153,196,225
-10,189,222,236
-4,219,237,237
2,243,246,212
8,255,243,182
14,254,227,149
20,253,195,116
26,250,153,87
32,241,104,65
38,215,48,39

To this (the only difference is the first column, converted from degrees Celsius to the stored units — kelvins multiplied by ten):

2451.5,69,117,180
2511.5,111,159,205
2571.5,153,196,225
2631.5,189,222,236
2691.5,219,237,237
2751.5,243,246,212
2811.5,255,243,182
2871.5,254,227,149
2931.5,253,195,116
2991.5,250,153,87
3051.5,241,104,65
3111.5,215,48,39

All that’s left is to run the GDAL command (remember to copy/paste to see the whole line):

gdaldem color-relief CHELSA_tas_09_1981-2010_V.2.1.tiff palette_k_-22_38_gdal.txt CHELSA_tas_09_1981-2010_V.2.1_color_-22_38.tiff -co COMPRESS=DEFLATE -co PREDICTOR=2

Here’s the result (with the color key added later):

CHELSA global average temperature, September 1981–2010 in equirectangular projection. Lots of global datasets are stored in this projection (it creates a nice rectangular array of evenly spaced latitudes and longitudes), so consider reprojecting the data if appropriate for your needs.

Oh, and one more thing — the data are stored in an equirectangular projection (a regular grid that spans ±90˚ latitude and ±180˚ longitude). Great for importing into a 3D program like Blender or Maya, not so great for displaying as a map. High latitudes (near the North and South Poles) are stretched horizontally, making them appear to cover more area than mid and equatorial latitudes. For global thematic data an equal area projection like Equal Earth is a better choice. You can convert a map into Equal Earth by using -t_srs '+proj=eqearth' with the gdalwarp program. (If you’re not familiar with gdalwarp I explain the basics in part 2 of this series: Map Projections & gdalwarp.)

gdalwarp -t_srs '+proj=eqearth' -r bilinear -dstalpha CHELSA_tas_09_1981-2010_V.2.1_color.tiff CHELSA_tas_09_1981-2010_V.2.1_color_eqEarth.tiff -wo SOURCE_EXTRA=1000 -co COMPRESS=DEFLATE -co PREDICTOR=2
Equal Earth variation of the CHELSA September global average temperature dataset. A side effect of switching to an equal-area projection is to reduce the visual impact of Antarctica, almost all of which has average temperatures lower than the minimum shown with this color scaling.

To reiterate, my workflow with this data was:

  1. Read the metadata with gdalinfo
  2. Preview and decide on the scaling of minimum and maximum values in QGIS (which automagically converted from kelvins to degrees Celsius based on the scale and offset values in the metadata)
  3. Create a text file mapping Celsius values to colors
  4. Convert temperature in Celsius to temperature in kelvins with Python
  5. Finally, generate the image with gdaldem color-relief

Somehow this got long, and I’m only about halfway through the examples I wanted to show — so I’m going to call this “A Gentle Introduction to GDAL Part 6.1” and continue with Part 6.2 after the 2023 NACIS meeting. That will cover some of the ways to convert imagery into data with gdal_calc.py, dealing with HDF and NetCDF, a very brief introduction to using GDAL with Python, and maybe a few surprises.

  1. A Gentle Introduction to GDAL
  2. Map Projections & gdalwarp
  3. Geodesy & Local Map Projections
  4. Working with Satellite Data
  5. Shaded Relief
  6. Visualizing Data (you are here)
  7. Transforming Data
  8. Reading Scientific Data Formats

Thanks (again, still) to Frank Warmerdam, who got me over my fear of GDAL, Joshua Stevens for describing how to use gdaldem on non-elevation data, and Joe Kington for being a sounding board.


Written by Robert Simmon

Data Visualization, Ex-Planet Labs, Ex-NASA. Blue Marble, Earth at Night, color.