A Gentle Introduction to GDAL Part 8: Reading Scientific Data Formats

Robert Simmon

Among its many well-known capabilities, GDAL has a hidden superpower — the ability to read scientific data formats like Hierarchical Data Format (HDF), Network Common Data Form (NetCDF), and Gridded Binary (GRIB). Many essential climate and satellite datasets created by the likes of NASA, NOAA, the World Meteorological Organization (WMO), and the European Space Agency (ESA) are stored and distributed in one of these formats. They contain records of everything from global temperatures to land cover to ocean salinity. Unfortunately, many people who’d be interested in using these data don’t even know they exist, in part because they can be a %@#&$ to read.

A wealth of data is stored & distributed in scientific data formats that can be difficult to work with. However, tools like GDAL can help unlock these measurements and enable the creation of many unique maps. These maps of the global biosphere combine ocean color, vegetation, and land cover classification data from the Visible Infrared Imaging Radiometer Suite (VIIRS) with bathymetry and elevation data from GEBCO. Map by Robert Simmon.

It’s not that HDF, NetCDF, or GRIB are inherently bad formats. They were created to handle complex, multi-dimensional datasets, which is a tough problem that requires flexibility. That includes raster data, vector data, point data, time series data, text … pretty much anything you can think of. Unfortunately, that flexibility means every dataset is packaged in a unique way, which makes opening the data a multi-step process that sometimes feels like solving a puzzle:

  1. Read the metadata to find out the file’s structure.
  2. Parse the data.
  3. Reformat the data for analysis and visualization.

Fortunately, these datasets are self-describing: information stored in the file, in a standardized way, indicates how each file can be read. Which is where GDAL comes in. If you’re unfamiliar with GDAL (a command line library designed to manipulate geospatial data), you can start with part 1 of this series. It’s best to be familiar with how to use gdal_translate to convert between file and data types, gdalwarp to change map projections, and gdaldem to make scientific data into imagery.

Why GDAL? As a purely command line tool GDAL can be difficult to learn, but it’s self-contained and well supported. Alternatives like Python rely on multiple libraries that are difficult to set up if you’re not familiar with all the dependencies. Data archives occasionally provide applications with graphical user interfaces written to read HDF and NetCDF, but they are often limited to a single operating system and/or become unsupported after a few years.

Generalized mapping software like ArcGIS & QGIS can read both NetCDF & HDF — in fact I often use QGIS along with GDAL to preview files and create polished maps. But they have a learning curve of their own, and sometimes offer more limited functionality than running GDAL directly.

Once you get over the initial learning curve of GDAL, it becomes an efficient way to convert scientific data into formats that are more widely accessible. GDAL can even be paired with scripting or programming languages (Bash, Python, R) to automate repetitive tasks.

To follow along, I recommend installing the latest version of GDAL with Conda (3.9.1 as of late August, 2024) along with the HDF, NetCDF, and GRIB drivers (these are essential):

conda install -c conda-forge libgdal-netcdf
conda install -c conda-forge libgdal-hdf5
conda install -c conda-forge libgdal-grib
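To confirm GDAL picked up the drivers, you can list its supported formats and filter for the three you need (a quick check, not a required step):

gdalinfo --formats | grep -iE 'netcdf|hdf5|grib'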

Then download this collection of sample data I put together to help demonstrate some of the common challenges to working with these formats.

Reading Global Data

I’ll start with an example of a relatively simple NetCDF file: Copernicus Global Ocean OSTIA Sea Surface Temperature and Sea Ice Analysis data. It’s a global dataset in a regular latitude-longitude grid, with only four separate variables in each file, each with the same structure (things can get much, much more complicated).

Step 1: Read the metadata to find out the file’s structure with gdalinfo.

This will output all the metadata GDAL can find — I’ll try to explain the most significant elements. Perhaps the most important is that this single file contains four different subdatasets, each of which is a separate parameter or measurement.

Here’s the command:

gdalinfo METOFFICE-GLO-SST-L4-NRT-OBS-SST-V2_1722627257748.nc

And this is the output (which I’ll break up to better explain the important parts):

Driver: netCDF/Network Common Data Format
Files: METOFFICE-GLO-SST-L4-NRT-OBS-SST-V2_1722627257748.nc
Size is 512, 512

These lines are essentially a preamble: the driver GDAL is using (NetCDF), the file name, and a size. The 512 by 512 size is a placeholder GDAL reports when a file contains only subdatasets; it describes the container, not the data itself. This is followed by some background information (follow along on the command line, or copy/paste into a text editor to see the full output; Medium doesn’t wrap code blocks):

Metadata:
NC_GLOBAL#comment=WARNING Some applications are unable to properly handle signed byte values. If values are encountered > 127, please subtract 256 from this reported value
NC_GLOBAL#Conventions=CF-1.11
NC_GLOBAL#history=Created from sst.nc; obs_anal.nc; seaice.nc
NC_GLOBAL#institution=UKMO
NC_GLOBAL#references=Good S, Fiedler E, Mao C, Martin MJ, Maycock A, Reid R, Roberts-Jones J, Searle T, Waters J, While J, Worsfold M. The Current Configuration of the OSTIA System for Operational Production of Foundation Sea Surface Temperature and Ice Concentration Analyses. Remote Sensing. 2020; 12(4):720 https://doi.org/10.3390/rs12040720
NC_GLOBAL#source=AVHRR18_G-NAVO-L2P-V1.0, AVHRR19_G-NAVO-L2P-V1.0, AVHRR_SST_METOP_B-OSISAF-L2P-V1.0, VIIRS_NPP-OSPO-L2P-V2.3, AMSR2-REMSS-L2P-V07.2, GOES13-OSISAF-L3C-V1.0, SEVIRI_SST-OSISAF-L3C-V1.0, OSISAF_ICE, NCEP_ICE
NC_GLOBAL#subset:datasetId=METOFFICE-GLO-SST-L4-NRT-OBS-SST-V2
NC_GLOBAL#subset:date=2024-08-02T19:34:17.749Z
NC_GLOBAL#subset:productId=SST_GLO_SST_L4_NRT_OBSERVATIONS_010_001
NC_GLOBAL#subset:source=ARCO data downloaded from the Marine Data Store using the MyOcean Data Portal
NC_GLOBAL#title=Global SST & Sea Ice Analysis, L4 OSTIA, 0.05 deg daily (METOFFICE-GLO-SST-L4-NRT-OBS-SST-V2)

This background info includes the fact that the data adheres to version 1.11 of the NetCDF Climate and Forecast (CF) Metadata Conventions (a standardized way of formatting data in a NetCDF file), data sources, a title for the dataset, and even a reference to the open access paper describing the dataset!

This block is followed by a list of the subdatasets, each of which holds a different variable:

Subdatasets:
SUBDATASET_1_NAME=NETCDF:"METOFFICE-GLO-SST-L4-NRT-OBS-SST-V2_1722627257748.nc":analysed_sst
SUBDATASET_1_DESC=[1x3600x7200] analysed_sst (32-bit floating-point)
SUBDATASET_2_NAME=NETCDF:"METOFFICE-GLO-SST-L4-NRT-OBS-SST-V2_1722627257748.nc":analysis_error
SUBDATASET_2_DESC=[1x3600x7200] analysis_error (32-bit floating-point)
SUBDATASET_3_NAME=NETCDF:"METOFFICE-GLO-SST-L4-NRT-OBS-SST-V2_1722627257748.nc":mask
SUBDATASET_3_DESC=[1x3600x7200] mask (32-bit floating-point)
SUBDATASET_4_NAME=NETCDF:"METOFFICE-GLO-SST-L4-NRT-OBS-SST-V2_1722627257748.nc":sea_ice_fraction
SUBDATASET_4_DESC=[1x3600x7200] sea_ice_fraction (32-bit floating-point)

The information for each subdataset contains the dataset dimensions: [1x3600x7200] (by the CF conventions, each subdataset has an explicit time dimension — even if there’s only a single timestep — along with height and width); its name: analysed_sst, analysis_error, mask, sea_ice_fraction; and the data type: (32-bit floating-point).

Finally, a list of corner coordinates. Like the size denoted above, these refer to the 512 by 512 placeholder, not the data (and are therefore not particularly meaningful):

Corner Coordinates:
Upper Left ( 0.0, 0.0)
Lower Left ( 0.0, 512.0)
Upper Right ( 512.0, 0.0)
Lower Right ( 512.0, 512.0)
Center ( 256.0, 256.0)
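On files with dozens of subdatasets it can be handy to filter the gdalinfo dump down to just their names by piping it through grep (a convenience, not a requirement):

gdalinfo METOFFICE-GLO-SST-L4-NRT-OBS-SST-V2_1722627257748.nc | grep SUBDATASET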

To get the information associated with each subdataset, you need to run gdalinfo on that individual variable, with the name specified in the metadata:

SUBDATASET_1_NAME=NETCDF:"METOFFICE-GLO-SST-L4-NRT-OBS-SST-V2_1722627257748.nc":analysed_sst

This starts with the driver followed by a colon (NETCDF:), the name of the file in double quotes followed by another colon ("METOFFICE-GLO-SST-L4-NRT-OBS-SST-V2_1722627257748.nc":), and finally the variable name analysed_sst.

This command looks like:

gdalinfo NETCDF:"METOFFICE-GLO-SST-L4-NRT-OBS-SST-V2_1722627257748.nc":analysed_sst

and returns another long block of metadata, describing the characteristics of the sea surface temperature data. Some of the information is redundant, so I’ll just include the most important parts:

Driver: netCDF/Network Common Data Format
Files: METOFFICE-GLO-SST-L4-NRT-OBS-SST-V2_1722627257748.nc
Size is 7200, 3600
Origin = (-180.000006104363450,89.999998473697119)
Pixel Size = (0.050000001695657,-0.049999999152054)

The metadata starts with an overview. The dataset is 7,200 by 3,600 pixels, the first data point is located at -180˚ longitude and 90˚ latitude, and the pixel size is 0.05 by 0.05 degrees. (The values in the metadata are slightly shifted from these round numbers, which I suspect is due to confusion between coordinates referring to the corner of each pixel vs. the center of each pixel; different systems have different conventions, which are not always accounted for when data is converted between formats. The slight mismatch between the x and y pixel sizes is likely also an error: a global dataset with a 2:1 aspect ratio should have identical resolution in width (x) and height (y).)

Metadata:
analysed_sst#long_name=Analysed sea surface temperature
analysed_sst#standard_name=sea_surface_foundation_temperature
analysed_sst#units=kelvin
analysed_sst#_FillValue=nan

Further down there’s information about the data itself: Analysed sea surface temperature in units of kelvin with a fill value of nan. It’s always good to know the official name of a dataset, as well as the fact that the units are kelvins rather than the more familiar degrees Celsius. It’s also important to note that invalid or missing data is indicated with the floating point “not a number” value. (Flagging fill values is a bit more complicated with data stored as integers, which can’t contain “not a number”.)

NC_GLOBAL#title=Global SST & Sea Ice Analysis, L4 OSTIA, 0.05 deg daily (METOFFICE-GLO-SST-L4-NRT-OBS-SST-V2)
NETCDF_DIM_EXTRA={time}
NETCDF_DIM_time_DEF={1,6}
NETCDF_DIM_time_VALUES=1722297600
time#axis=T
time#calendar=gregorian
time#long_name=Time
time#standard_name=time
time#units=seconds since 1970-01-01 00:00:00

Next is a section that indicates the time period the data covers. This uses the UNIX time convention, which I’ll describe in more detail later, with a file that contains multiple timesteps. For now, just note that 1722297600 is July 30, 2024.

Band 1 Block=7200x1 Type=Float32, ColorInterp=Undefined
NoData Value=nan
Unit Type: kelvin
Metadata:
NETCDF_VARNAME=analysed_sst
units=kelvin
_FillValue=nan
standard_name=sea_surface_foundation_temperature
long_name=Analysed sea surface temperature
NETCDF_DIM_time=1722297600

The final block of metadata may look like it’s just the most important elements from the rest of the file, but it’s confirmation that the subdataset is only a single layer. While this file has four subdatasets with one layer each, NetCDF and other scientific file formats can hold multidimensional data. Some files will contain additional dimensions, like multiple timesteps, layers in the atmosphere, or both. In these cases GDAL interprets each layer just like a color band in a multispectral image file. I’ll demonstrate working with time series data later.

That may have seemed like an overly long explanation, but reading and understanding metadata is the most challenging part of working with this type of file. Standards are loose, so many datasets have quirks that require a bit of thought to resolve.

Step 2: Parse the data.

Once you’re familiar with the metadata, it’s time to extract one or more subdatasets. You can read the file and convert it to a GeoTIFF (or one of the many, many other formats GDAL supports) with gdal_translate. (Again, if you’re new to GDAL, learn about the basics in part 1 of this series.) The generic command to perform the format conversion with no other changes is:

gdal_translate subdataset output_file.tif

GDAL will recognize the .tif extension in the output file name and write a GeoTIFF. For this specific file the command will be (note that I’ve included backslashes to break a long command into multiple lines):

gdal_translate \
NETCDF:"METOFFICE-GLO-SST-L4-NRT-OBS-SST-V2_1722627257748.nc":analysed_sst \
analysed_sst_20240730.tif

I’ve simplified the output name and added the date: analysed_sst_20240730.tif. That’s (almost) it! GDAL converts the sea surface temperature subdataset into a GeoTIFF, with the floating point values (representing temperature in Kelvins) intact.

That creates an output file suitable for analysis, but since it remains floating-point it can’t be displayed natively as an image. To make an image for the web, scale to the minimum and maximum values (run gdalinfo -stats on the output TIFF to calculate them: Min=271.310 Max=307.800) with the -scale option in gdal_translate, use -ot to set the type of the output data to Byte (8 bits of data, ranging from 0–255), and write a single-band grayscale PNG (GDAL will parse the extension of the output file and write a PNG instead of a TIFF):

gdal_translate \
-scale 271.310 307.800 0 255 -ot Byte \
analysed_sst_20240730.tif analysed_sst_20240730.png
Global grayscale map of sea surface temperature data from July 30, 2024, scaled from 271.310 to 307.800 kelvins. Missing data (including land and sea ice) is transparent (white).
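Since the GeoTIFF stores real kelvins, you can also do simple math on it before visualizing. As a sketch (assuming gdal_calc.py, one of the Python utilities distributed with GDAL, is available in your environment; the output name is my own), converting to the more familiar degrees Celsius is a single expression:

# subtract the 273.15 offset; NaN fill values pass through the calculation unchanged
gdal_calc.py -A analysed_sst_20240730.tif \
--calc="A - 273.15" \
--outfile=analysed_sst_20240730_celsius.tif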

Step 3: Reformat the data for analysis and visualization.

There is one problem, and it’s a significant one. Despite the fact that there’s some geolocation info, if you try to reproject the map, it won’t work. GDAL won’t return an error — the file just won’t be reprojected.

Run gdalinfo on the GeoTIFF to look for clues. There is information on the size of the data and the location of the upper left pixel:

Size is 7200, 3600
Origin = (-180.000006104363450,89.999998473697119)
Pixel Size = (0.050000001695657,-0.049999999152054)

But there is no coordinate reference system, despite the hints in the original NetCDF file. The corner coordinates suggest the data is in WGS84, but it’s best to be sure. Since the metadata in the NetCDF file is incomplete, check the dataset’s product user manual:

lat:standard_name = "latitude" ;
lat:units = "degrees_north" ;
lat:valid_min = -90.f ;
lat:valid_max = 90.f ;
lat:axis = "Y" ;
lat:reference_datum = "geographical coordinates, WGS84 projection" ;
float lon(lon) ;
lon:standard_name = "longitude" ;
lon:units = "degrees_east" ;
lon:valid_min = -180.f ;
lon:valid_max = 180.f ;
lon:axis = "X" ;
lon:reference_datum = "geographical coordinates, WGS84 projection" ;

This confirms that the data are in WGS84, and that the extent should run from -180˚ to 180˚ longitude and from 90˚ to -90˚ latitude. There are (at least) two ways to fix this: one with gdal_translate and one with gdalwarp.

Using gdal_translate you can assign the correct reference system when you convert the NetCDF into a GeoTIFF:

gdal_translate -a_srs EPSG:4326 -a_ullr -180 90 180 -90 \
NETCDF:"METOFFICE-GLO-SST-L4-NRT-OBS-SST-V2_1722627257748.nc":analysed_sst \
OSTIA_SST_20240730_wgs84.tif

-a_srs EPSG:4326 assigns the reference system to WGS84 while -a_ullr -180 90 180 -90 dictates the correct corner coordinates.

Or with gdalwarp you can specify the source reference system as WGS84 when reprojecting (if you’re not familiar with gdalwarp read part 2 of this series: Map Projections & gdalwarp):

gdalwarp -s_srs EPSG:4326 -t_srs EPSG:8857 -r bilinear \
NETCDF:"METOFFICE-GLO-SST-L4-NRT-OBS-SST-V2_1722627257748.nc":analysed_sst \
OSTIA_SST_20240730_eqearth.tif

-s_srs is the source spatial reference system (WGS84) while -t_srs is the target spatial reference system (the wonderful Equal Earth in this case).

Once the data is assigned a projection (or reprojected into an appropriate projection), you can analyze or visualize it with whatever software you want. You can even assign a color palette on the command line (see A Gentle Introduction to GDAL Part 6: Visualizing Data). I made this map with QGIS, the magma color scale, and the Natural Earth vector dataset:

Global map of sea surface temperature for July 30, 2024, in the Equal Earth projection. The coldest temperatures (-2˚ C) are colored black and hug the Antarctic coastline. The warmest water (36˚ C) is in the Persian Gulf, represented with pale yellow.
Once the data in a NetCDF file or other scientific format is accessed and georeferencing is applied, it can be combined with other data and made into a map with any of the familiar cartographic tools: QGIS, ArcGIS, or even GDAL itself. Map by Robert Simmon using data from the Copernicus Marine Service.
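If you’d rather stay on the command line, gdaldem color-relief can apply a palette directly (details in Part 6). Here’s a minimal sketch with a hypothetical three-stop color ramp: the breakpoints are the dataset’s minimum, midpoint, and maximum, and the RGB values are placeholders to swap for your palette of choice.

cat > sst_palette.txt << EOF
271.31 13 8 135
289.55 204 71 120
307.80 240 249 33
EOF

gdaldem color-relief OSTIA_SST_20240730_eqearth.tif sst_palette.txt OSTIA_SST_20240730_color.tif

gdaldem interpolates between the breakpoints, producing a smooth gradient.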

Swap out the subdataset name to access the other variables. (Remember to change the names of the output files, too: gdal_translate will overwrite without warning!)

gdal_translate -a_srs EPSG:4326 -a_ullr -180 90 180 -90 \
NETCDF:"METOFFICE-GLO-SST-L4-NRT-OBS-SST-V2_1722627257748.nc":analysis_error \
OSTIA_analysis_error_20240730.tif

gdal_translate -a_srs EPSG:4326 -a_ullr -180 90 180 -90 \
NETCDF:"METOFFICE-GLO-SST-L4-NRT-OBS-SST-V2_1722627257748.nc":mask \
OSTIA_mask_20240730.tif

gdal_translate -a_srs EPSG:4326 -a_ullr -180 90 180 -90 \
NETCDF:"METOFFICE-GLO-SST-L4-NRT-OBS-SST-V2_1722627257748.nc":sea_ice_fraction \
OSTIA_sea_ice_fraction_20240730.tif
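Repetition like this is exactly what the scripting languages I mentioned earlier are good at eliminating. The same three conversions as a minimal Bash loop:

for var in analysis_error mask sea_ice_fraction; do
  gdal_translate -a_srs EPSG:4326 -a_ullr -180 90 180 -90 \
    NETCDF:"METOFFICE-GLO-SST-L4-NRT-OBS-SST-V2_1722627257748.nc":"$var" \
    "OSTIA_${var}_20240730.tif"
done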

Here’s what each of those other variables looks like, scaled to their minimum and maximum values:

Grayscale maps of land and ice mask, sea ice fraction, and sea surface temperature error extracted from the OSTIA sea surface temperature NetCDF file.

Regional Data

Although global satellite datasets are often stored in some variation of a lat — lon projection (rectangular grid), some use a more specialized, often regional, coordinate system. For example, many sea ice, snow cover, and other cryosphere datasets use a projection optimized for the poles. Fortunately, when the projection information is properly encoded in the NetCDF file, GDAL can read it with no problem.

Sea ice thickness is an important measure of Arctic warming, measured by CryoSat-2 since late 2010. In the set of sample files I’ve included data covering March 9–15, 2024, the period of maximum ice extent, downloaded from ESA’s Earth Online platform: W_XX-ESA,SMOS_CS2,NH_25KM_EASE2_20240309_20240315_r_v206_01_l4sit.nc. Run gdalinfo on it to explore the metadata:

gdalinfo W_XX-ESA,SMOS_CS2,NH_25KM_EASE2_20240309_20240315_r_v206_01_l4sit.nc

As with the previous sea surface temperature dataset, there’s a lot of information packed in there, including a long list of subdatasets. (Some scientists like to pack all of their inputs and complementary data along with the final measurements.) Three look particularly interesting: sea ice thickness, sea ice concentration, and sea ice type:

SUBDATASET_4_NAME=NETCDF:"W_XX-ESA,SMOS_CS2,NH_25KM_EASE2_20240309_20240315_r_v206_01_l4sit.nc":analysis_sea_ice_thickness

SUBDATASET_8_NAME=NETCDF:"W_XX-ESA,SMOS_CS2,NH_25KM_EASE2_20240309_20240315_r_v206_01_l4sit.nc":sea_ice_concentration

SUBDATASET_9_NAME=NETCDF:"W_XX-ESA,SMOS_CS2,NH_25KM_EASE2_20240309_20240315_r_v206_01_l4sit.nc":sea_ice_type

Use gdalinfo to inspect the metadata associated with the sea ice thickness variable:

gdalinfo NETCDF:"W_XX-ESA,SMOS_CS2,NH_25KM_EASE2_20240309_20240315_r_v206_01_l4sit.nc":analysis_sea_ice_thickness

As with the sea surface temperature data, there is a lot of information, but notice how it’s not all formatted in the same way, and there are different fields between the two files. As I mentioned before, even though HDF and NetCDF are ostensibly standards, the standards are geared towards specifying how the structure of the data is described, rather than how the data is stored. Fortunately, this particular dataset is formatted according to the “NetCDF Climate and Forecast (CF) Metadata Conventions” which are a set of guidelines that help make NetCDF more easily readable.

Look through the metadata, focusing on the end of the output. There is a series of entries under Band 1 that contains some of the most important information describing the subdataset, which will help to interpret it:

Band 1 Block=432x432 Type=Int32, ColorInterp=Undefined

This indicates that the dataset is an array (grid) of 432 by 432 numbers, stored as the Integer 32 data type (±2,147,483,647). ColorInterp=Undefined means there’s no recommended color palette. (Some scientists are particular about this. I have thoughts.)

Min=4.000 Max=4607.000 
Minimum=4.000, Maximum=4607.000, Mean=1220.340, StdDev=733.463

Since this is an integer data type there’s no option to use “not a number” (NaN), so missing or invalid data needs to be flagged either by setting aside a specific value or with a mask. A negative value for data that’s always positive is a natural choice.

Unit Type: m
Offset: 0, Scale:0.001

Information about how to interpret the values encoded in the data — sea ice thickness in meters, except the raw thickness values have been multiplied by 1,000, so you need to divide by 1,000 when reading the data. For example, the minimum sea ice thickness in the data is 0.004 meters, and the maximum value is 4.607 meters, but the raw data is scaled from 4.000 to 4,607.000. This allows decimal data to be stored in an integer format. Fortunately there’s no offset (value that needs to be added to the data), which you might find when there’s a need to encode negative values in an unsigned data type.

Offset and scale may seem arbitrary, but they’re useful to encode data efficiently. Say you have integer data that ranges from -1 to 1, with a precision of .01. That’s only 201 possible values (or 202 if you need to include a value for missing data). You could use the 32-bit floating point data type, which requires four bytes per data point. Or you could use single byte integers (which can contain 256 possible values) and include offset and scale parameters. The resulting dataset would only need ¼ the space to store, and consume ¼ the bandwidth to transfer from one computer to another.

Software like QGIS will read and apply offset and scale automatically, but if you’re manipulating the data or applying a color palette on your own you’ll need to use the -unscale flag with gdal_translate, or make sure to include them in any calculations.
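For example, here’s a minimal sketch that bakes the scale and offset into the output, so the GeoTIFF stores thickness in meters directly. Note the -ot Float32: once unscaled, the values are decimals and no longer fit an integer type (the output name is my own choice).

gdal_translate -unscale -ot Float32 \
NETCDF:"W_XX-ESA,SMOS_CS2,NH_25KM_EASE2_20240309_20240315_r_v206_01_l4sit.nc":analysis_sea_ice_thickness \
analysis_sea_ice_thickness_meters.tif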

Under Metadata there are additional parameters, most of them redundant, but one is unique & important:

grid_mapping=Lambert_Azimuthal_Grid

This points to the geolocation information, which is itself stored in a different section of the metadata. If grid_mapping is present in a NetCDF file, this is the preferred way for GDAL to access geolocation info. Scroll up in the gdalinfo output until you find a series of entries beginning with Lambert_Azimuthal_Grid. This defines the data’s full coordinate system:

Lambert_Azimuthal_Grid#false_easting=0
Lambert_Azimuthal_Grid#false_northing=0
Lambert_Azimuthal_Grid#grid_mapping_name=lambert_azimuthal_equal_area
Lambert_Azimuthal_Grid#inverse_flattening=298.2572326660156
Lambert_Azimuthal_Grid#latitude_of_projection_origin=90
Lambert_Azimuthal_Grid#longitude_of_projection_origin=0
Lambert_Azimuthal_Grid#proj4_string=+proj=laea +lon_0=0 +datum=WGS84 +ellps=WGS84 +lat_0=90.0
Lambert_Azimuthal_Grid#semi_major_axis=6378137

This projection is a variation of EASE-Grid 2, which is used extensively by the cryosphere community (standard EASE-Grid 2 goes all the way to the equator). Since all the necessary geolocation information is included in the file, you can convert analysis_sea_ice_thickness into a fully georeferenced GeoTIFF by running gdal_translate on the subdataset:

gdal_translate \
NETCDF:"W_XX-ESA,SMOS_CS2,NH_25KM_EASE2_20240309_20240315_r_v206_01_l4sit.nc":analysis_sea_ice_thickness \
analysis_sea_ice_thickness.tif

Here’s what each of the three interesting variables I mentioned (sea ice thickness, sea ice concentration, and sea ice type, first year vs. multi-year) looks like, with some light cartography to help show off the projection & distinguish land from ice-free ocean:

Grayscale maps of sea ice thickness (left), sea ice concentration (center), and ice type (single vs. multi-year, right) in a modified North Pole EASE-Grid 2 projection. The thickest ice is concentrated along the north coast of Greenland and Arctic Canada. Coastlines are shown in yellow while 15˚ graticules are shown in blue.

You can even run gdalwarp directly on the NetCDF file to convert from the polar perspective of EASE-Grid 2 to something more intuitive, like an orthographic projection with the feel of a globe (sacrificing equal area properties, alas). [Note that I’ve included the -dstalpha flag, which makes areas outside the map transparent.]

gdalwarp -t_srs '+proj=ortho +lon_0=0 +lat_0=75' -r bilinear -dstalpha \
NETCDF:"W_XX-ESA,SMOS_CS2,NH_25KM_EASE2_20240309_20240315_r_v206_01_l4sit.nc":analysis_sea_ice_thickness \
analysis_sea_ice_thickness_ortho75_alpha.tif
Maps of sea ice thickness centered on 75˚N and the Prime Meridian for the weeks of March 12–18, 2011 and March 9–15, 2024, which represent the sea ice maxima for the first and most recent years of the CryoSat-2 mission. The thickest ice, colored light green, is concentrated along the north coast of Greenland and Arctic Canada; thinner ice, colored dark blue, fills the rest of the Arctic Ocean but does not extend outside it. Maps by Robert Simmon based on CryoSat-SMOS Merged Sea Ice Thickness data from the Alfred Wegener Institute and ESA.

Time Series Data

Next let’s look at data that packs another dimension — time — into a subdataset. Energy balance is a key indicator of climate change. It’s the difference between the amount of solar energy absorbed by the Earth & the heat energy radiated back out to space. Energy balance has been measured by the Clouds and the Earth’s Radiant Energy System (CERES) instruments since the early 2000s. I included energy balance climatology data (the monthly long-term average from July 2005 to June 2015), archived by the CERES project, with the other sample data files: CERES_EBAF-TOA_Ed4.2_Subset_CLIM01-CLIM12.nc. As always, run gdalinfo to take a look:

gdalinfo CERES_EBAF-TOA_Ed4.2_Subset_CLIM01-CLIM12.nc

This metadata is relatively concise, mostly featuring the listing of subdatasets. The energy balance data itself is stored in toa_net_all_clim (top of atmosphere net all-sky climatology):

SUBDATASET_4_NAME=NETCDF:"CERES_EBAF-TOA_Ed4.2_Subset_CLIM01-CLIM12.nc":toa_net_all_clim
SUBDATASET_4_DESC=[12x180x360] TOA Net Flux - All-Sky (32-bit floating-point)

This subdataset is similar to those we’ve looked at before, with one significant difference. Instead of consisting of a single two-dimensional array, it has a series of arrays storing data over time. One slice for each of the 12 months of the year. This is indicated in brackets in the subdataset description: SUBDATASET_4_DESC=[12x180x360]. Following the CF conventions, the dimensions are time, latitude, and longitude, with the specifics in the metadata, accessible in the usual way:

gdalinfo NETCDF:"CERES_EBAF-TOA_Ed4.2_Subset_CLIM01-CLIM12.nc":toa_net_all_clim

The subdataset includes more metadata than the top-level file, including the explanation of the time step in ctime: each step represents the decade-long average for that month from July 2005 to June 2015, so time step 1 = January 2006 through January 2015, time step 2 = February 2006 through February 2015, … time step 12 = December 2005 through December 2014.

ctime#climatology=climatology_bounds
ctime#delta_t=0000-01-00 00:00:00
ctime#long_name=Climatological Monthly Means Based on 07/2005 to 06/2015
ctime#standard_name=climatology time
ctime#units=months of a climatology year

Along with a description of the data in each of the 12 steps, defined (and treated by GDAL) as individual bands (to learn the specifics, read the GDAL NetCDF driver documentation), with data types and statistics similar to those contained in the previous examples:

Band 1 Block=360x1 Type=Float32, ColorInterp=Undefined
Min=-199.900 Max=160.200
Minimum=-199.900, Maximum=160.200, Mean=-20.878, StdDev=107.190
NoData Value=-999
Unit Type: W m-2
Metadata:
NETCDF_VARNAME=toa_net_all_clim
long_name=Top of The Atmosphere Net Flux, All-Sky conditions, Climatological Means based on 7/2005 to 6/2015
standard_name=TOA Net Flux - All-Sky
CF_name=toa_net_downward_flux
comment=none
units=W m-2
valid_min= -400.000
valid_max= 400.000
_FillValue=-999
STATISTICS_MAXIMUM=160.19999694824
STATISTICS_MEAN=-20.878459184997
STATISTICS_MINIMUM=-199.89999389648
STATISTICS_STDDEV=107.18976161762
NETCDF_DIM_ctime=1

It’s possible to run a minimal gdal_translate command…

gdal_translate \
NETCDF:"CERES_EBAF-TOA_Ed4.2_Subset_CLIM01-CLIM12.nc":toa_net_all_clim \
toa_net_all_clim.tif

… but the results are a little weird:

Extracting CERES data from the NetCDF file without specifying bands will create a single 12-band file, one band for each month. Visualizing the first three bands yields an RGB image with data from January in the red channel, February in the green channel, and March in the Blue channel.

Instead of representing a single numerical value (net flux) the data is being treated as an image, with the first three bands used as color channels. January = red, February = green, and March = blue. Interesting, but hard to interpret. To extract a single month of data specify the band in the gdal_translate command with -b:

gdal_translate -b 3 \
NETCDF:"CERES_EBAF-TOA_Ed4.2_Subset_CLIM01-CLIM12.nc":toa_net_all_clim \
toa_net_all_clim_03.tif

The -b 3 option pulls a single band (March) from the toa_net_all_clim NetCDF subdataset, which is then saved as a floating point GeoTIFF. This leaves one more oddity — the map, despite being global, doesn’t extend from 180˚ West to 180˚ East, but from 0˚ to 360˚! Apply a coastline in mapping software like QGIS, and the problem becomes more pronounced. The Eastern Hemisphere is mapped as it should be, but Western Hemisphere land masses are printed on empty space while the corresponding area of the data remains blank:

A coastline applied to the CERES data shows how global data mapped from 0˚ to 360˚ longitude confuses geospatial software. Additional steps beyond simply reading the subdataset are required to convert to the widely used WGS84 projection, which runs from -180˚ to 180˚ longitude.

To see why, refer back to the metadata:

Corner Coordinates:
Upper Left ( 0.0000000, 90.0000000)
Lower Left ( 0.0000000, -90.0000000)
Upper Right ( 360.000, 90.000)
Lower Right ( 360.000, -90.000)
Center ( 180.0000000, 0.0000000)

Which confirms what you can see in the mapped image — the data start at 0˚ longitude (the Prime Meridian, which runs through Greenwich, England) and extend to 360˚. This is fine from a mathematical perspective, but not from a geographical one. It breaks many geospatial and mapping workflows, which are restricted to longitude measurements from -180˚ to +180˚ and/or have trouble moving across the International Date Line.

To fix it, you need to use gdalwarp, since shifting coordinates is a form of reprojection:

gdalwarp -s_srs "+proj=longlat +ellps=WGS84" -t_srs EPSG:4326 \
--config CENTER_LONG 0 -b 3 \
NETCDF:"CERES_EBAF-TOA_Ed4.2_Subset_CLIM01-CLIM12.nc":toa_net_all_clim \
toa_net_all_clim_wgs84_03.tif

Here are the significant parts of the command:

-s_srs "+proj=longlat +ellps=WGS84" sets the initial projection (which is incompletely defined in the original file).

-t_srs EPSG:4326 sets the output projection (WGS84).

--config CENTER_LONG 0 forces the center longitude to be 0˚.

Combined, these commands will remap the data to extend to the more conventional -180˚ to 180˚.

The global coastline appears in the correct location relative to the underlying net flux data after recentering with gdalwarp.

Once the data are re-centered they can be made into maps, like this small multiple series using the Mollweide projection. Mollweide is a good choice for this dataset because it is equal area and preserves straight lines of latitude, which is important for a parameter that shifts north and south with the seasons.
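To generate the twelve re-centered inputs for a layout like this, wrap the gdalwarp command from above in a Bash loop over the bands, swapping in Mollweide as the target projection (a sketch; the output naming is my own):

for b in $(seq 1 12); do
  gdalwarp -s_srs "+proj=longlat +ellps=WGS84" -t_srs "+proj=moll" \
    --config CENTER_LONG 0 -b "$b" \
    NETCDF:"CERES_EBAF-TOA_Ed4.2_Subset_CLIM01-CLIM12.nc":toa_net_all_clim \
    "toa_net_all_clim_moll_$(printf '%02d' "$b").tif"
done

The printf '%02d' zero-pads the band number so the files sort in month order.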

Maps of monthly long-term climatology net flux data from CERES. In general, the Earth accumulates heat energy near the equator (red), and loses heat near the poles (blue). The latitudes receiving the most energy shift as the sun moves north and south with the seasons. The effect of thick clouds, which reflect solar energy, can be seen near the equator and during the summer off the west coasts of North America, South America, and Africa. Maps by Robert Simmon using data from the CERES Project.

Troubleshooting

Believe it or not, being offset by 180˚ is relatively straightforward as far as geolocation issues in scientific data files are concerned. NetCDF files, especially those that adhere (even if loosely) to the CF conventions, are fairly well structured, and GDAL will usually be able to figure out how to read them. HDF, on the other hand, is more flexible and often more challenging to use. Fortunately NetCDF is becoming more common, but there are still legacy HDF datasets around, so it’s good to look at them, too.

An example is the August 2005 global precipitation data from IMERG contained in the sample file 3B-MO.MS.MRG.3IMERG.20050801-S000000-E235959.08.V07B.HDF5.

In what should be an increasingly familiar step, use gdalinfo to examine the contents of the file, then scroll through the metadata until you find the precipitation subdataset:

SUBDATASET_4_NAME=HDF5:"3B-MO.MS.MRG.3IMERG.20050801-S000000-E235959.08.V07B.HDF5"://Grid/precipitation

This is very similar to what you’d find in a NetCDF file, but the name begins with HDF5 instead of NETCDF. This will tell GDAL which driver to use to open the file, but fortunately is otherwise transparent to the user. Just like with a NetCDF file, export the precipitation data into a GeoTIFF with gdal_translate:

gdal_translate \
HDF5:"3B-MO.MS.MRG.3IMERG.20050801-S000000-E235959.08.V07B.HDF5"://Grid/precipitation \
IMERG_20050801_precipitation.tif
Naïvely reading precipitation from the IMERG HDF file results in a map rotated 90˚ clockwise.

Huh. That doesn’t look right — the Earth is the right shape (look closely and you can spot the Andes running along the western coastline of South America) but it’s been rotated 90˚ clockwise.

It should look like this browse image (absent the styling, of course):

Global map of precipitation from IMERG for August 2005, showing the intended horizontal orientation. The discrete palette ranges from light blue to green to yellow to orange to red.

As usual, check the subdataset’s metadata for clues:

gdalinfo \
HDF5:"3B-MO.MS.MRG.3IMERG.20050801-S000000-E235959.08.V07B.HDF5"://Grid/precipitation

The Grid_GridHeader section holds the key:
Grid_GridHeader=BinMethod=ARITHMETIC_MEAN;
Registration=CENTER;
LatitudeResolution=0.1;
LongitudeResolution=0.1;
NorthBoundingCoordinate=90;
SouthBoundingCoordinate=-90;
EastBoundingCoordinate=180;
WestBoundingCoordinate=-180;
Origin=SOUTHWEST;

The data is global with a resolution of 0.1˚, but the first point (Origin) is in the southwest corner, not the more typical northwest corner.

Image Structure Metadata:
COMPRESSION=DEFLATE
Corner Coordinates:
Upper Left ( 0.0, 0.0)
Lower Left ( 0.0, 3600.0)
Upper Right ( 1800.0, 0.0)
Lower Right ( 1800.0, 3600.0)
Center ( 900.0, 1800.0)
Band 1 Block=1800x145 Type=Float32, ColorInterp=Undefined
NoData Value=-9999.9
Metadata:
CodeMissingValue=-9999.9
coordinates=time lon lat
DimensionNames=time,lon,lat
LongName=Merged microwave-infrared-gauge precipitation estimate
units=mm/hr
_FillValue=-9999.9004

The next chunk of metadata shows that the data is 1,800 pixels wide and 3,600 pixels tall, which reveals two things that explain why the data appears rotated. First, the Corner Coordinates field indicates that the array is transposed: each row of data runs north to south along a single longitude, while it’s more common for rows to run west to east along a line of latitude. Second, the coordinates and DimensionNames fields indicate that the three dimensions are ordered time, longitude, and latitude. GDAL assumes the dimensions are ordered time, latitude, and longitude, resulting in a rotated image.
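You can also verify the dimension order without trial and error using gdalmdiminfo, the multidimensional counterpart to gdalinfo, which reports each array’s dimensions and attributes as JSON:

gdalmdiminfo 3B-MO.MS.MRG.3IMERG.20050801-S000000-E235959.08.V07B.HDF5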

A relatively new tool in GDAL — gdalmdimtranslate (added in version 3.1) — provides the means to reformat the data into the proper orientation. The command is:

gdalmdimtranslate \
3B-MO.MS.MRG.3IMERG.20050801-S000000-E235959.08.V07B.HDF5 \
IMERG_20050801_precipitation_flipped.tif \
-array "name=/Grid/precipitation,transpose=[0,2,1]"

The structure of this command is a little bit different from how other GDAL functions work. Rather than specifying the driver (HDF5), followed by the file name and subdataset, with gdalmdimtranslate you specify an input file followed by an output file, then add an -array flag with options enclosed in quotes. The first option is the name of the subdataset, and the second is transpose, followed by the new order of the axes. For this data, leave the time axis (0) as is, and swap latitude (2) and longitude (1). With the coordinate order now matching what GDAL expects, the georeferencing can be applied correctly. To confirm this, reproject the dataset with gdalwarp:

gdalwarp -s_srs EPSG:4326 \
-t_srs '+proj=hammer +lat_0=0 +lon_0=0 +datum=WGS84 +units=m +no_defs' \
-r bilinear \
IMERG_20050801_precipitation_flipped.tif \
IMERG_20050801_precipitation_flipped_hammer.tif

Apply a color scale, continental outlines, and graticules to make a map:

Color map of average global precipitation for the month of August 2005. Pale yellow represents the lowest precipitation rate (0 to 0.04 millimeters per hour), green and blue moderate precipitation (0.4 to 0.7 millimeters per hour), and purple heavy precipitation (2 to 2.4 millimeters per hour). The heaviest precipitation falls just north of the Equator in the Pacific and Atlantic Oceans, and from the Arabian Sea to Southern Japan.
Map of average global precipitation for the month of August 2005. The heaviest precipitation falls just north of the Equator in the Pacific and Atlantic Oceans, and from the Arabian Sea to New Guinea. Map by Robert Simmon based on data from Integrated Multi-satellite Retrievals for GPM (IMERG).

Weather Data

One additional data format you might run into — especially if you’re working with weather data — is Gridded Binary, or maybe General Regularly distributed Information in Binary form. I’ve seen both expansions of the acronym GRIB. I included some temperature forecast data ds.temp_20240729.bin for the contiguous United States (originally from NOAA’s Meteorological Development Laboratory) as an example to work with. (Be aware that the file extension is .bin, which can confuse some operating systems that assume it’s an executable file.)

Like NetCDF & HDF, GRIB is designed for multidimensional data, and packaging isn’t completely standardized. Fortunately the authors of GDAL have written drivers to interpret some of the most common types of GRIB files. Which means the first step is (again, always) to run gdalinfo to get a sense of the data:

gdalinfo ds.temp_20240729.bin

The first section of metadata is information about the projection (Lambert Conic Conformal), which GDAL will parse properly! That’s followed by forecast data at regular intervals. Instead of subdatasets, data in a GRIB file is packaged as bands: 42 of them, each representing predicted air temperature for a particular timestep. Taking a closer look:

Band 1 Block=2145x1 Type=Float64, ColorInterp=Undefined
Description = 2[m] HTGL="Specified height level above ground"
Min=6.150 Max=43.350
NoData Value=9999
Metadata:
GRIB_UNIT=[C]
GRIB_COMMENT=Temperature [C]
GRIB_ELEMENT=T
GRIB_SHORT_NAME=2-HTGL
GRIB_REF_TIME=1722285000 sec UTC
GRIB_VALID_TIME=1722286800 sec UTC
GRIB_FORECAST_SECONDS=1800 sec
GRIB_DISCIPLINE=0(Meteorological)
GRIB_IDS=CENTER=8(US-NWSTG) MASTER_TABLE=1 LOCAL_TABLE=0 SIGNF_REF_TIME=1(Start_of_Forecast) REF_TIME=2024-07-29T20:30:00Z PROD_STATUS=0(Operational) TYPE=1(Forecast)
GRIB_PDS_PDTN=0
GRIB_PDS_TEMPLATE_NUMBERS=0 0 2 0 0 0 255 255 0 0 0 0 30 103 0 0 0 0 2 255 129 255 255 255 255
GRIB_PDS_TEMPLATE_ASSEMBLED_VALUES=0 0 2 0 0 255 255 0 30 103 0 2 255 -1 -2147483647
STATISTICS_APPROXIMATE=YES
STATISTICS_MAXIMUM=43.35
STATISTICS_MEAN=29.375971319454
STATISTICS_MINIMUM=6.1500183105469
STATISTICS_STDDEV=5.4709529695505
STATISTICS_VALID_PERCENT=48.91

Information for each band includes details about the data’s units (temperature in ˚C), statistics, and timestamps for both the time the forecast was generated (GRIB_REF_TIME) and when the forecast is valid (GRIB_VALID_TIME). If the time format looks unfamiliar, that’s because the values are encoded in UNIX time (seconds since January 1, 1970). For a non-programmer this type of timestamp can be challenging, but there are several websites that will convert from UNIX to more conventional formats. For example, band 1 of the sample file is the forecast for 1722286800 sec UTC, which is July 29, 2024, at 21:00 UTC (2:00 pm PDT).
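If you’d rather not leave the terminal, the standard date utility can do the conversion. The first form works with GNU date (Linux), the second with BSD date (macOS):

date -u -d @1722286800
date -u -r 1722286800

Both report July 29, 2024, 21:00:00 UTC.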

To convert a single timestep to another format, run gdal_translate on a band with the -b flag (similar to the command to export CERES data, but you won’t need to specify a subdataset):

gdal_translate -b 1 ds.temp_20240729.bin \
conus_temperature_forecast_20240729_2100UTC.tif

Which looks like this (after converting to grayscale):

NOAA temperature forecast for the contiguous United States, scaled and converted to grayscale. The data covers the Great Lakes and some offshore areas, but none of Mexico or Canada. Note that the raw data uses the Lambert Conic Conformal projection, which preserves shapes rather than areas and is a standard choice for mid-latitude weather data.

GDAL writes all the georeferencing data into the GeoTIFF, which makes it possible to make a map from the data without having to add or modify any of the georeferencing:

Temperature forecast for the contiguous United States valid for July 29, 2024, at 2:00 pm PDT. The coolest areas (mostly along the West Coast or in mountain ranges) were predicted to be about 10˚C (light yellow), while the hottest (in the desert Southwest) were over 40˚ C (dark red). Contours complement the color palette and highlight small (1˚ C) variations in temperature. Map by Robert Simmon based on data from NOAA’s Meteorological Development Laboratory.

Global Weather Data

One more piece of advice about GRIB files specifically. The convention for global datasets is for longitude to go from 0˚ to 360˚ (just like the CERES data). Which, as mentioned, may cause problems. GDAL’s GRIB driver will automatically convert from 0˚ — 360˚ to -180˚ — 180˚, saving a little work. Here’s global temperature data from the ECMWF Reanalysis v5 (ERA5) model opened and exported with QGIS (version 3.36):

By default QGIS will plot a standard global GRIB file from 0˚ to 360˚ longitude.

Versus converted to a PNG with gdal_translate:

gdal_translate -b 1 -scale -6 45 0 255 -ot Byte \
ds.temp_20240729.bin ds.temp_20240729_gray.png
Fortunately, gdal_translate will automatically remap standard global GRIB files to WGS84.

Isn’t it nice when things just work?

Incomplete Metadata

Another challenge you may find, especially with local and regional data, is a file with missing or unparseable geolocation information. One example is precipitation data from NOAA’s Global Ensemble Forecast System (GEFSv12). It’s distributed in NetCDF, but the geolocation isn’t summarized in the metadata. Rather, it’s stored as separate lons and lats subdatasets.

SUBDATASET_1_NAME=NETCDF:"Aug2024_lead006_member_forecasts.nc":lons
SUBDATASET_1_DESC=[1597x2345] lons (32-bit floating-point)
SUBDATASET_2_NAME=NETCDF:"Aug2024_lead006_member_forecasts.nc":lats
SUBDATASET_2_DESC=[1597x2345] lats (32-bit floating-point)

In some cases the GDAL NetCDF (or HDF) driver can read and apply latitude and longitude arrays (For an example of this, read Joshua Stevens’ post on visualizing Aurora with the VIIRS Day Night Band.), but not for these specific files. So naïvely converting from NetCDF to GeoTIFF results in a file with no georeferencing. All the information is relative to the dimensions of the image, and the file metadata is lacking. Finding and adding map info requires a bit of sleuthing. There is a description of the NDFD Grids on the NOAA website with partial information, but they’re missing the corner points necessary to correctly place the data on the globe.

It turns out, however, that another NOAA data product, the National Blend of Models, shares the same dimensions (1,597 by 2,345 pixels) and map projection (Lambert Conformal Conic). Chances are the rest of the georeferencing is going to match, too. In which case you can copy the geolocation from one dataset to the other.

Assign the georeferencing using gdal_translate, in the same way I fixed the sea surface temperature dataset, with -a_srs and -a_ullr. But instead of using an EPSG code, use the coordinate system information from the metadata stored in the National Blend of Models GRIB file. It’s in the form of Well Known Text (WKT), which you’ll need to grab (copy/paste) from the metadata and save in its own file with the extension .prj.

Display the metadata with gdalinfo, then copy the text block after “Coordinate System is:” including PROJCRS and everything in the following square brackets:

PROJCRS["unnamed",
BASEGEOGCRS["Coordinate System imported from GRIB file",
DATUM["unnamed",
ELLIPSOID["Sphere",6371200,0,
LENGTHUNIT["metre",1,
ID["EPSG",9001]]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433,
ID["EPSG",9122]]]],
CONVERSION["Lambert Conic Conformal (2SP)",
METHOD["Lambert Conic Conformal (2SP)",
ID["EPSG",9802]],
PARAMETER["Latitude of false origin",25,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8821]],
PARAMETER["Longitude of false origin",-95,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8822]],
PARAMETER["Latitude of 1st standard parallel",25,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8823]],
PARAMETER["Latitude of 2nd standard parallel",25,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8824]],
PARAMETER["Easting at false origin",0,
LENGTHUNIT["metre",1],
ID["EPSG",8826]],
PARAMETER["Northing at false origin",0,
LENGTHUNIT["metre",1],
ID["EPSG",8827]]],
CS[Cartesian,2],
AXIS["(E)",east,
ORDER[1],
LENGTHUNIT["Metre",1]],
AXIS["(N)",north,
ORDER[2],
LENGTHUNIT["Metre",1]]]

Then paste it into an empty text file and save it as GEFSv12.wkt.prj in the same folder as the NetCDF file. Rather than including all that text in the gdal_translate command, point to the file with the information: -a_srs GEFSv12.wkt.prj.
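Alternatively, gdalsrsinfo can write the coordinate system straight to a file and skip the copy/paste. A sketch: the National Blend of Models file name here is hypothetical, so substitute whichever file you’re borrowing the projection from.

gdalsrsinfo -o wkt2 blend.t12z.core.f006.co.grib2 > GEFSv12.wkt.prj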

Next, grab the upper left and lower right coordinates from the Corner Coordinates metadata:

Data axis to CRS axis mapping: 1,2
Origin = (-3272421.457337169442326,3790842.106035436037928)
Pixel Size = (2539.702999999999975,-2539.702999999999975)
Corner Coordinates:
Upper Left (-3272421.457, 3790842.106) (138d23'39.07"W, 53d 3'24.76"N)
Lower Left (-3272421.457, -265063.585) (126d17' 8.03"W, 19d12'55.22"N)
Upper Right ( 2683182.078, 3790842.106) ( 59d 1'17.54"W, 54d22'46.84"N)
Lower Right ( 2683182.078, -265063.585) ( 69d11'54.98"W, 20d19' 6.30"N)
Center ( -294619.690, 1762889.261) ( 98d21'20.81"W, 40d37' 0.08"N)

Strip out the parentheses and commas and use these four numbers to assign the upper and lower left corner points: -a_ullr -3272421.457 3790842.106 2683182.078 -265063.585.

The full command to pull the first timestep from the data, add geolocation information, and convert to a GeoTIFF is:

gdal_translate -b 1 -a_srs GEFSv12.wkt.prj \
-a_ullr -3272421.457 3790842.106 2683182.078 -265063.585 \
NETCDF:"Aug2024_lead006_member_forecasts.nc":precip_ens_qmapped \
Aug2024_lead006_precip_ens_qmapped.tif

Add some styling to make a map of precipitation in and around the contiguous United States:

Precipitation forecast for the contiguous United States and neighboring areas of Canada and Mexico. Map by Robert Simmon, based on data from NOAA’s Global Ensemble Forecast System.

Although scientific formats for remote sensing data can be frustrating to work with, GDAL provides an array of tools to preview, read, and extract the information they contain. Because the data is so varied, there’s no single method to read it — it’s more a process of looking at each unique dataset and mixing and matching techniques. When approaching a new dataset my recommendation is to first preview the data with gdalinfo, identify the specific subdataset(s) you’re interested in, use gdal_translate or gdalwarp to extract those datasets, and (if necessary) modify the geolocation and/or data structure to fit your needs.

This workflow accommodates a wide range of data structures used by the scientific community, opening up a vast and underused source of information about Earth and the environment.

If you have any questions, or are stumped by a particularly opaque file, please let me know in the comments.

Thanks to Frank Warmerdam, Joe Kington, Carl Churchill, and Joshua Stevens for inspiration and feedback.
