Petabytes of Ocean Data, Part 1: NASA ECCO Data Portal
Five years ago, a new ocean model simulation was run by researchers from MIT and JPL on NASA’s Pleaides Supercomputer using MITgcm. One goal of this simulation was to help oceanographers prepare for the upcoming SWOT mission, which will observe the ocean surface with unprecedented resolution. The simulation was groundbreaking in several ways, particularly its high spatial resolution (between 1 and 2 km with global coverage), comprehensive tidal forcing, and high frequency (hourly) outputs. The resulting dataset holds great potential for scientific discovery, since it explores regimes of ocean dynamics that were previously inaccessible to global-scale simulation.
However, the dataset poses huge challenges for analysis.
- It is huge, many Petabytes in total.
- The dataset resides on a highly secure NASA supercomputer, only accessible to NASA-funded researchers.
- The model grid itself is a complex lat-lon-cap (LLC) curvilinear grid which is difficult to visualize in a regular map projections.
- It is stored in a custom binary data format called MDS which is unique to MITgcm.
- Finally, the data were compressed using a bespoke compression algorithm.
These challenges conspire to make the model data extremely inaccessible to non-specialists and hamper our ability to make new discoveries.
This problem directly motivated me to get more serious about software and infrastructure in my scientific work, since I saw the need for fundamentally new tools to meet the challenges. It led to my involvement in xarray and eventually to the genesis of Pangeo. It changed the course of my career completely, causing me to focus on building tools rather than just churning out papers.
Five years out, we are finally reaching a point where these tools are mature enough to tackle the original problem. In this series of blog posts, I will go into more details about these tools and the various tradeoffs between different analysis environments (cloud, HPC, etc.). The purpose of this first post is simply to announce the release of a few new ways to access the data
The ECCO Data Portal
A major step forward in making the data more accessible was undertaken by specialists at NASA Ames research center: the creation of a web based portal, through which anyone can access the data over the internet.
On this website you can point an click your way to downloading 40 GB binary files. However, these files are useless unless you know how to decode what’s inside.
To make the data useful, I developed a new module in the Xmitgcm python package called llcreader. This module makes it a snap to access the data from the ECCO data portal using xarray and dask, the central packages of the Pangeo software ecosystem. For example, initializing a dataset to the LLC sea-surface height variable is as simple as:
from xmitgcm import llcreader
model = llcreader.ECCOPortalLLC2160Model()
ds = model.get_dataset(varnames=['Eta'], type='latlon')
The resulting Xarray dataset is “lazy”: it only loads data when needed for computation or visualization. It does this by constructing HTTP requests to the ECCO data portal that just fetch the bytes that are needed, and then applying transformations to decompress and align the different pieces. All of this is wrapped up into Dask Arrays and placed in an Xarray Dataset.
The full documentation for the llcreader module is available at the Xmitgcm website, where you can find detailed usage examples.
Interactive Binder Demo
To demonstrate how this works, I have prepared an interactive demo using Pangeo Binder. To try the demo right now, just follow this link:
If you try out the interactive Jupyter notebook, you’ll be able to explore an interactive, Google-maps style visualization of the LLC2160 sea surface temperature. If you’re too lazy to try it yourself, here’s a screenshot of what it looks like.
This example lives in the following GitHub repository, which will be updated with more examples going forward:
In addition to llcreader, these capabilities depend on huge number of other packages and a large community of open source developers. Some of the most important ones in this example are:
- filesystem-spec: remote filesystems and file-system-like abstractions
- Xarray: the basic data structures and computational library for the datasets
- Dask: the parallel computing library which enables lazy representations of huge arrays
- Holoviews: interactive visualizations
- Jupyter: open-source software, open-standards, and services for interactive computing across dozens of programming languages
Scaling Up: HPC and Cloud
While the ECCO Data Portal is a breakthrough for making the data accessible, it has limited bandwidth. It’s great for an interactive exploration like the one shown above, but if I want to actually process Petabytes of data, it probably can’t provide the necessary throughput.
llcreader also works on Pleaides, where it can access the data over a performance Lustre filesystem. It works almost exactly the same as the previous example:
from xmitgcm import llcreader
model = llcreader.PleiadesLLC2160Model()
ds = model.get_dataset(varnames=['Eta'], type='latlon')
(The llcreader docs describe this in more detail.) An obvious downside is that we can’t share this with the world via Binder, because Pleiades is a private, secure supercomputer. Nevertheless, Pleiades users can begin to use this new capability today to accelerate their science. The datasets produced by llcreader are fully compatible with Dask’s distributed scheduler, meaning that many supercomputer nodes can be deployed to process the data in parallel.
Commercial cloud storage (e.g. Amazon S3 or Google Cloud Storage) may offer the best of both worlds. It’s both publicly accessible and extremely high throughput. A future blog post will compare the performance characteristics of these different platforms.
What’s Next
For now, I am very happy that we have reached this milestone of making the data publicly available. But there is still lots of work to do. llcreader needs users to try it out, identify bugs, and request new features. For now, you need the latest GitHub master of xmitgcm (a proper release is coming soon), plus a few other packages. Assuming you already have a Python 3 environment with Xarray and Dask installed, you can get llcreader as follows
pip install fsspec zarr git+https://github.com/xgcm/xmitgcm.git
Please share feedback and ask questions at:
My hope is that these tools enable us to start doing exciting new science with this amazing dataset.
Acknowledgements
This work depended on a great many people. Some who deserve particular mention here are:
- Dimitris Menemenlis (JPL): PI for the LLC simulations and one of the most generous collaborators I have ever met.
- Chris Hill (MIT): Co-creator of the LLC simulations.
- Shubha Ranjan and Ryan Spaulding (NASA Advanced Supercomputing): creators of the ECCO Data Portal
- Martin Durant (Anaconda): creator of fsspec, which was essential to this work
My work on this project was supported by NASA Award NNX16AJ35G (SWOT Science Team) and NSF Awards OCE-1740648 (Pangeo EarthCube) and OAC-1835778 (CSSI).