Pangeo applications for NASA Earth Observing Data
Today, a new project begins that expands Pangeo’s capability to utilize remote sensing datasets for geoscientific research. Our focus will be on cloud-native data analysis tools and approaches, with the lofty goal of changing the way everyday scientists interact with satellite observations of Earth.
A team of researchers from the University of Washington eScience Institute, in collaboration with scientists and engineers from the National Center for Atmospheric Research, Anaconda, and Element84, has been awarded a $1.5 million grant from the National Aeronautics and Space Administration (NASA) through the Advancing Collaborative Connections for Earth System Science (ACCESS) program.
Data-intensive scientific workflows are at a pivotal time. Traditional local computing resources are no longer able to meet the storage or computing demands of scientists. In the Earth System Sciences (ESS) community, data volumes are exploding: new datasets sourced from models, in-situ observations, and remote sensing platforms are too large to store at even medium to large High Performance Computing (HPC) centers. NASA has estimated that by 2025, it will be storing upwards of 250 Petabytes (PB) of data using commercial cloud services (e.g. Amazon Web Services [AWS]). The availability of these data in cloud environments, co-located with a wide range of computing resources, could revolutionize how scientists use these datasets and provide opportunities for important scientific advancements.
To realize these opportunities, new approaches in how the ESS community handles data access, processing and analysis are required. This project will help facilitate the ESS community’s transition into cloud computing by developing technologies that build on existing open-source tools (e.g. Python, Jupyter, dask, xarray, rasterio) and by further integrating them within the Pangeo ecosystem.
Our project has three main components:
- Data discovery. Moving large amounts of data to the cloud is easy. Making use of these large datasets, however, requires advanced tools for discovering and accessing data collections. NASA has developed a number of data cataloging systems (e.g. CMR). We will work on integrating those systems into the Pangeo stack, to make data discovery and cloud-based data processing much easier.
- Cloud deployments. Over the last year, Pangeo developers have built a substantial amount of infrastructure around JupyterHub and dask. We will be continuing this work with an explicit focus on AWS.
- Demonstration and Outreach. A significant portion of our project will focus on sharing Pangeo’s tools and approaches with a broad audience. We’ll be developing scientific demonstrations, integrating Pangeo into hackweeks, and writing extensive documentation on how Pangeo can be used (think Zero-to-Pangeo instead of Zero-to-JupyterHub).
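To give a rough sense of what the data discovery component has to work with, here is a minimal sketch of how a granule search against NASA's CMR could be expressed from Python. It only builds the query URL for the public CMR search endpoint and does not reflect how the eventual Pangeo integration will look. The helper name `build_cmr_granule_query`, the dataset short name, and the date range and bounding box are all assumptions chosen purely for illustration.

```python
from urllib.parse import urlencode

# Public CMR granule search endpoint (JSON response format).
CMR_GRANULE_URL = "https://cmr.earthdata.nasa.gov/search/granules.json"


def build_cmr_granule_query(short_name, start, end, bbox, page_size=10):
    """Return a CMR granule search URL.

    Hypothetical helper for illustration only; it is not part of Pangeo.
    `bbox` is (west, south, east, north) in decimal degrees.
    """
    west, south, east, north = bbox
    params = {
        "short_name": short_name,          # dataset identifier in CMR
        "temporal": f"{start},{end}",      # ISO 8601 start,end
        "bounding_box": f"{west},{south},{east},{north}",
        "page_size": page_size,            # max results per page
    }
    return f"{CMR_GRANULE_URL}?{urlencode(params)}"


# Illustrative values: a MODIS snow cover product over part of the
# Pacific Northwest for January 2018.
url = build_cmr_granule_query(
    "MOD10A1",
    "2018-01-01T00:00:00Z",
    "2018-01-31T23:59:59Z",
    (-125.0, 45.0, -116.0, 49.0),
)
print(url)
```

Fetching that URL with any HTTP client would return matching granule metadata, which could then be handed to tools such as xarray or rasterio for analysis.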
Our project team includes both scientists and engineers from academia and private industry with expertise in remote sensing data processing, cloud infrastructure, and software development.
The University of Washington
- Anthony Arendt (PI): Senior Data Science Fellow (eScience Institute) and Senior Research Scientist (Applied Physics Laboratory)
- Rob Fatland (Co-I): Director of Cloud and Data Solutions
- Scott Henderson (Co-I): Postdoctoral Fellow (Department of Earth and Space Sciences)
National Center for Atmospheric Research
- Ethan Gutmann (PI): Project Scientist (Research Applications Laboratory)
- Joe Hamman (Co-I): Project Scientist (Research Applications Laboratory)
Anaconda
- Matthew Rocklin (Co-I): Computational Scientist
Element84
- Dan Pilone (PI): Chief Technology Officer
Our vision for this project is to continue to work closely with the Pangeo community. If there are elements of our proposed work that align with your current work, we’d love to talk with you. Please get in touch by opening a GitHub Issue: https://github.com/pangeo-data/pangeo/issues.