Open Ocean Science at OSM2022
The Pangeo community is coming together at the upcoming Ocean Sciences 2022 virtual meeting around an exciting session called Open Ocean Science.
This blog post gives a rundown of the session. For a quick overview, the following links will take you to the program for each of the five parts of the session:
- Thursday, 3/3/2022: 2:30–3:30pm ET: Sub-session #1: Advancing open ocean science panel
- Thursday, 3/3/2022: 3:30–4:30pm ET: Sub-session #2: Pangeo update and workshop: accelerate your science using Python
- Friday, 3/4/2022: 9:00–10:00am ET: Sub-session #3: GitHub for science: why and how to use GitHub
- Friday, 3/4/2022: 10:00–11:00am ET: Sub-session #4: Open science in action: example workflows, publishing, and sharing your code
- Friday, 3/4/2022: 11:30am–1:30pm ET: Sub-session #5: Pangeo Forge mini-Hackathon
Sub-session #1: Advancing open ocean science panel
4x10-minute talks + 20-minute panel discussion
Thursday, 3/3/2022: 2:30–3:30pm ET
2:30 PM: Working Towards Open Source Science in NASA’s Earth Science Division
Katie Baynes, NASA HQ
With the release of the Science Mission Directorate’s Strategy for Data and Computing for Groundbreaking Science in 2019, NASA deepened its commitment to enabling transformational open science. We are working towards this goal in a variety of ways: adopting supportive data policies, expanding pathways to participation in scientific research, training our current and next-generation scientists in open science practices, and providing improved access to our data archives via cloud-based technologies and fully open tools and publications. This talk will cover NASA’s Earth Science Data Systems program’s recent activities, demonstrate how our priorities are aligned with the overall NASA science strategies for data and computing, and establish a path forward over the next several years as we continue to evolve and improve our open science practices.
2:40 PM: EYES OPEN FOR SCIENCE ON OUR OCEAN AND COASTS: THE US INTEGRATED OCEAN OBSERVING SYSTEM (US IOOS) AND OPEN SCIENCE
Tiffany Vance, US IOOS Program
The US Integrated Ocean Observing System (IOOS) supports open science via observing systems, model development and use, data access and analysis, software development, and the use of standards. IOOS-funded research programs seek community development and sharing of models. The IOOS CodeLab contains tutorials on how to access and use IOOS data sources. IOOS participates in standards-setting bodies and requires the internal use of standard data formats. IOOS supports aligning marine life observations to global standards and best practices. IOOS’ technical documentation and recommendations are published for open access on GitHub. IOOS also has a portfolio of open source software packages to support many of the above activities at a central GitHub location. IOOS’ core requirements are:
- Open Data Sharing: IOOS subscribes to the GEOSS data sharing principles. Near real time access to observations is a core part of this. IOOS regions maintain data access following the FAIR (Findable, Accessible, Interoperable, Reusable) principles.
- Data Access Services: IOOS Data Providers serve data and products through recommended services including ERDDAP and THREDDS.
- Metadata: Data providers are expected to ensure relevant metadata is produced, accessible and compliant with IOOS conventions, and to participate in the development of such conventions.
- Catalog Registration: The IOOS Catalog is the master inventory of IOOS datasets and services. Providers register their datasets in the Catalog.
- Provision of Data to the GTS: U.S. IOOS is committed to ensuring that all relevant IOOS observations will be contributed in near real time to the World Meteorological Organization.
- Storage and Archiving: Providers are expected to provide for long term and archival storage of data, metadata and other supporting documentation and algorithm descriptions.
- Sustained Operations: The IOOS observing, data management, and modeling core capabilities must be sustained for long-term, continuous operations.
02:50 PM: OPEN SCIENCE IN CLIMATE TECH
Julie Pullen, Jupiter Intelligence
There has been a breathtaking surge in climate tech companies and non-profit organizations developing and utilizing open source science and associated tools. The perspectives and resources that these entities bring to the open source community represent a transformative opportunity for growth. Julie Pullen is the Climate Strategist at Jupiter Intelligence. Dr. Pullen’s expertise spans climate, weather and hydroscience with a particular focus on high-resolution coastal urban prediction for flooding, heatwaves and other perils. She was previously Associate Professor in Ocean Engineering at Stevens Institute of Technology, where she coordinated field and modeling studies globally to improve our understanding and prediction of the Earth system. Before that, she was Director of the DHS-funded National Center for Maritime Security. As a scientist at the Naval Research Laboratory, she pioneered the coupling of models of the ocean and atmosphere for operational prediction globally. Jupiter is a climate risk analytics start-up. Jupiter is a foundational contributor to open source software initiatives led by the Linux Foundation, and collaboration with the open source community has been at its core. Jupiter employs cloud native tools like Kubernetes in its engineering workflows, and utilizes community Earth system models in its science. Many Jupiter employees have been contributors to these models and supporters of these communities over their careers.
03:00 PM: OpenOceanCloud: A New Approach to Ocean Data and Computing
Ryan Abernathey, Columbia University
For decades, oceanography was severely constrained by data availability, but now we are drowning in a flood of data. Oceanography has been revolutionized through the development of new observing technologies (e.g., satellites, autonomous floats, and gliders) that deliver vast amounts of data every day. Alongside these observations, a new class of numerical models has emerged which simulate the ocean with ever increasing detail and realism. While new measurements are undoubtedly important, it is also true that most existing data remains severely unexploited due to the challenges of finding, acquiring, processing, and visualizing the large, complex, heterogeneous datasets. To leverage our past investments in ocean observations and modeling, and to fully exploit new observations, we must transform our infrastructure and tools for working with ocean data. To meet this challenge, we need an international collaboration to accelerate the development of cloud-based data infrastructure for oceanography: OpenOceanCloud. This partnership between universities, research institutes, and industry will leverage open data, open-source software, and cloud computing to build a collaboration platform that can be used by students and researchers across the world. Currently, data intensive ocean research is only accessible to privileged institutions with the resources for high performance computing and data storage. OpenOceanCloud will break down this barrier, providing a research platform to the thousands of potential oceanographers who lack such resources. Access to vast data sets and powerful computing environments can help remove the barriers related to low-bandwidth internet, intermittent power, and limited cyberinfrastructure. With this infrastructure, anyone can do science, anywhere, and this empowers communities that have been historically excluded from full participation in oceanography. This talk will summarize recent activity and next steps around realizing this vision.
03:10 PM: Q&A Session
Sub-session #2: Pangeo update and workshop: accelerate your science using Python
Thursday, 3/3/2022: 3:30–4:30pm ET
03:30 PM Open Sesame: Open your science together with the Pangeo community
Deepak Cherian, National Center for Atmospheric Research
This talk will introduce “Pangeo”: an open, inclusive community working to enable collaborative and open scientific workflows. Datasets keep growing, requiring ever more computational skill to extract insight and making traditional “download and analyze” workflows obsolete. The Pangeo community works to democratize these computational skills and to create both a human and a computational commons for scalable science. We introduce an opinionated view of the core principles of the Pangeo open-science community as viewed through the lens of open source software communities. These principles include:
1. Work to solve the most minimal aspect of a problem to maximize code sharing.
2. Build clear computational interfaces to foster interoperability.
3. Build clean human interfaces to foster community, cooperation and collaboration to solve ever more complex problems.
This talk will specifically focus on Principle 3 by showcasing specific examples where community collaboration has democratized complex computational code and demonstrating entry points for new community members. It will set the stage for the following talk and workshop, which will illustrate examples of open scientific workflows in the cloud.
03:45 PM No Supercomputer, no problem! — Analyzing Petabyte scale climate data in your browser with Pangeo.
Julius Busecke, Columbia University
One of the most challenging and complex problems for the present and future generations will be fighting and grappling with the consequences of manmade climate change. To inform our actions, the globe needs to supercharge the creation, dissemination, and application of scientific knowledge. We need ways to analyze faster, engage a larger and more diverse group of people, and deliver results in a fast, reproducible, and yes *fun* way. One of the cornerstones of understanding the complex earth system and informing policy around the globe is the Coupled Model Intercomparison Project (CMIP), which approaches the petabyte scale. Datasets produced by the current and future generation of global earth system models render the common ‘download and analyze’ workflow inefficient, blocking innovative analysis and fast scientific discoveries needed to deal with the dramatic changes happening to the planet. They also effectively exclude people who do not own a fancy supercomputer from participating in the scientific discovery process. That is, until now! We present a workflow to analyze CMIP6 data fully in the cloud. Python tools and cloud infrastructure developed within the Pangeo community enable near-instantaneous access and fast data processing to much of the CMIP6 archive with only a web browser. To demonstrate the wide range of interoperable, efficient, community-based solutions developed in the open-source community, we show an interactive demo of how to compute global sea surface temperature time series for a variety of models and their members.
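For readers who have not computed a global mean before, the core arithmetic behind a "global sea surface temperature time series" can be sketched in a few lines. This is not the demo from the talk: it is a minimal illustration using only the Python standard library, assuming a regular latitude-longitude grid where each cell is weighted by the cosine of its latitude and land cells are marked as NaN. The function name `global_mean` and the tiny grid are made up for illustration. In a Pangeo workflow the same operation would more likely be expressed with xarray's weighted reductions on cloud-hosted data.

```python
import math

def global_mean(field, lats):
    """Area-weighted mean of field[lat][lon] on a regular lat-lon grid.

    Each row is weighted by cos(latitude); NaN cells (e.g. land) are skipped.
    """
    total = 0.0
    weight_sum = 0.0
    for row, lat in zip(field, lats):
        weight = math.cos(math.radians(lat))
        for value in row:
            if math.isnan(value):
                continue
            total += weight * value
            weight_sum += weight
    return total / weight_sum

# Illustrative 3 x 2 grid (degrees Celsius) at two time steps; values are made up.
lats = [-60.0, 0.0, 60.0]
sst_t0 = [[2.0, 2.0], [28.0, math.nan], [4.0, 6.0]]
sst_t1 = [[value + 1.0 for value in row] for row in sst_t0]  # uniform 1 degree warming

# One global mean per time step gives the time series.
series = [global_mean(field, lats) for field in (sst_t0, sst_t1)]
```

The weighting matters because grid cells near the poles cover far less area than cells at the equator; an unweighted average would over-represent high latitudes.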
04:00 PM A chance for participants to do other Pangeo Gallery examples on their own
Julius Busecke and Deepak Cherian + organizers and helpers
Sub-session #3: GitHub for science: why and how to use GitHub
Friday, 3/4/2022: 9:00–10:00am ET
09:00 AM Using Git and GitHub to Enable Open Ocean Science
James Munroe, Memorial University of Newfoundland
Git is widely used software for version control management. This talk will introduce the key concepts of Git using the GitHub Desktop application and why it is an essential tool for collaboration without overwriting another’s work. Fundamental operations of Git and GitHub to be covered include cloning, forking, committing, pulling, merging, and pushing. Examples of how ocean researchers from diverse institutions are using Git and GitHub to build collaborative, open, ocean science communities will be highlighted.
09:10 AM GitHub for science: Why and How to use GitHub
Aimee Barciauskas, Development Seed
Source control is the foundation of open source software and core to being able to contribute to open science. In the modern era, open source software has been maintained through shared GitHub repositories. The introductory talks will present links to tutorials that will be used later in the 30-minute mini-workshop for hands-on practice. Content for the 10-minute tutorials:
1. Best practices for collaboration, focusing on pull requests: Pull requests allow contributors to suggest changes to other branches in a repository. GitHub lets users suggest changes, and reviewers can view the differences between the source and target branches. Reviewers may suggest changes or approve the pull request.
2. How is GitHub being leveraged for open science? Tools like BinderHub have enabled scientists to easily share their Jupyter notebooks so users can run the exact same code without any setup required (because BinderHub sets up the Jupyter notebook environment for you). Additionally, GitHub is used to package containers to run workflows on remote systems. If you package your code appropriately (e.g. using containers), it can run on many remote cloud systems.
Resources which may be re-used to support this tutorial:
09:20 AM Citing and Improving the Discoverability of Your Research Software with AGU
Chris Erdmann, American Geophysical Union (AGU)
For many of the articles published in American Geophysical Union (AGU) journals, software plays an integral role in the underlying research. Often, the software is actively developed using platforms such as GitHub, where it can also be released, preserved, and cited via repository integrations such as Zenodo. AGU asks authors to share their software and provides guidance to them on how they can document and cite the software in their publications. This talk will discuss the steps involved in properly including software in your publications and receiving credit. It will also provide a brief overview of some of the other ways AGU is using GitHub, including open science training, streamlining workflows, and collaborating with researchers.
Sub-session #4: Open science in action: example workflows, publishing, and sharing your code
Friday, 3/4/2022: 10:00–11:00am ET
10:00 AM Climate Observatory — Analyzing and Visualizing NOAA Climate Data Records on the Cloud
Yuhan (Douglas) Rao, North Carolina State University
NOAA’s Climate Data Records (CDRs) are robust, sustainable, and scientifically sound climate records that provide trustworthy information, mostly derived from long-term satellite data archives, on how, where, and to what extent the land, oceans, atmosphere, and cryosphere are changing. NOAA CDRs were developed in response to the recommendations made by the National Research Council to embrace NOAA’s mandate in understanding climate variability and change through national leadership in generating and managing satellite-based CDRs and ensuring long-term consistency and continuity for the satellite CDR program. Although CDR datasets have been freely available from NOAA NCEI for nearly a decade, they were made freely accessible via cloud platforms through NOAA’s Big Data Program in 2021. Historically, users of these datasets were required to have advanced technical skills and large computing capabilities to access and analyze NOAA CDRs. However, the migration to the cloud platform alleviates some of these restrictions and makes it possible to reach a larger user base who need reliable climate information for various purposes. Easy access to more than 40 NOAA CDR datasets has great potential to enable different stakeholders to develop climate services and products to address the pressing climate change challenges. I will demonstrate an open science workflow to convert NOAA CDR data to cloud-optimized data formats for easy access and analysis. This open science workflow is part of a pilot project of cloud-based data analysis and visualization for NOAA CDRs. The resulting cloud-optimized CDR data will be made publicly available together with Jupyter notebooks that will help users access and analyze NOAA CDR data.
10:10 AM Revisiting five decades of 234Th data: a comprehensive global oceanic compilation.
Elena Ceballos-Romero, Woods Hole Oceanographic Institution
The depth distribution of Thorium-234 (234Th) relative to its parent Uranium-238 (238U) is widely used for determining the downward flux of carbon in the ocean, following the premise that particulate organic carbon (POC) flux can be calculated if the ratio of POC to 234Th measured on sinking particles (POC:234Th) at the desired depth is known. Many 234Th depth profiles have been collected with a variety of sampling instruments and with sampling and analytical strategies that have been evolving since it was first sampled in 1967. We present a global oceanic compilation of 234Th measurements that collects results over a period exceeding 50 years, gathered in a single open-access, long-term repository that can be updated. The compilation comprises a total of 223 datasets. It includes data compiled from over 5000 locations spanning all the oceans for total, dissolved and particulate 234Th concentrations, and POC:234Th ratios. 379 oceanographic expeditions and more than 56000 234Th and 18000 238U data points have been compiled. Appropriate metadata have also been included along with other relevant parameters such as temperature and salinity when available. Data sources and methods information are also detailed along with valuable information for future analysis. The data are archived in PANGAEA under DOI doi.pangaea.de/10.1594/PANGAEA.918125, and a dedicated web page, Sea of Thorium, is under development to: i) grant free access to the compilation, allowing researchers to search, filter, visualize and freely use existing 234Th data, and ii) encourage researchers to share their 234Th measurements and contribute to the compilation. This effort provides a valuable resource for better understanding and quantifying how contemporary oceanic carbon uptake functions and how it will change in the future. It also seeks to act as a coordinating umbrella and a focal point for the 234Th community under the principles of openness and reproducibility.
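The premise in the first sentence of this abstract can be made concrete with a small worked example. The sketch below is not taken from the compilation or the talk; it assumes the commonly used one-dimensional steady-state model, in which the 234Th flux through a depth horizon equals the decay constant times the depth-integrated 238U minus 234Th activity deficit, with advection and diffusion neglected. The 24.1-day half-life of 234Th is a known constant; the profile values, the POC:234Th ratio, and the function names are illustrative only.

```python
import math

TH234_HALF_LIFE_DAYS = 24.1
DECAY_CONSTANT = math.log(2) / TH234_HALF_LIFE_DAYS  # per day

def th234_export_flux(depths_m, th_dpm_per_l, u_dpm_per_l):
    """Steady-state 234Th flux (dpm m^-2 d^-1) through the deepest sampled depth.

    Integrates the 238U - 234Th activity deficit over depth with the
    trapezoidal rule, then multiplies by the 234Th decay constant.
    """
    deficit = [u - th for u, th in zip(u_dpm_per_l, th_dpm_per_l)]
    integral = 0.0  # units: dpm L^-1 * m
    for i in range(1, len(depths_m)):
        dz = depths_m[i] - depths_m[i - 1]
        integral += 0.5 * (deficit[i] + deficit[i - 1]) * dz
    return DECAY_CONSTANT * integral * 1000.0  # 1000 L per m^3

def poc_flux(th_flux, poc_to_th_ratio):
    """POC flux = 234Th flux times POC:234Th on sinking particles at that depth."""
    return th_flux * poc_to_th_ratio

# Made-up profile: 234Th is depleted near the surface and reaches
# equilibrium with 238U at 100 m.
depths = [0.0, 25.0, 50.0, 75.0, 100.0]  # m
th234 = [1.6, 1.8, 2.0, 2.2, 2.4]        # dpm/L
u238 = [2.4, 2.4, 2.4, 2.4, 2.4]         # dpm/L

th_flux = th234_export_flux(depths, th234, u238)
carbon_flux = poc_flux(th_flux, 5.0)     # assumed ratio of 5 umol C per dpm
```

With these illustrative numbers the 234Th flux comes out at roughly 1150 dpm per square metre per day and the POC flux at a little under 6 mmol C per square metre per day. Real studies often relax the steady-state assumption, which is one reason the compilation's metadata on sampling and methods matters.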
10:20 AM Open-source, reproducible workflow in physical oceanography and geophysical fluid dynamics
Navid Constantinou, Australian National University
I value open-source reproducible science. It’s one thing to be reading a scientific paper in which the authors carefully described in detail what they did to get their results. But it’s a whole next level when along with the paper the authors provide, e.g., a repository with the data they used and a Jupyter notebook with which all their analyses can be reproduced. As a physical oceanographer and geophysical fluid dynamicist, I often either write code to solve particular equations or use/modify already-written code to set up and run a model configuration and analyze its output. Other times I analyze output from comprehensive Earth system models. I’ll touch on some aspects of my open-source workflow and show how I use GitHub, Jupyter notebooks, and Zenodo to provide easy and straightforward ways to reproduce my science. I’ll also discuss how embracing open-source and reproducible principles for my science has actually made my workflow easier.
10:30 AM Ocetrac: A Python package to track the spatiotemporal evolution of marine heatwaves
Hillary Scannell, Jupiter Intelligence
Dangerous hot-water events, called marine heatwaves (MHWs), cause prolonged periods of thermal stress in the marine environment that can lead to widespread coral bleaching, harmful algal blooms, unproductive fisheries, and even economic loss. Anticipating the paths of destructive MHWs remains a challenge owing to the complex spatiotemporal evolution of these events. Many traditional approaches use fixed time and space analyses, which fail to capture the full evolution of MHW events. To overcome these challenges and limitations, we present a new open source software package called Ocetrac. The main goals of Ocetrac are to label, track, and characterize the evolution of unique geospatial anomalies. Ocetrac leverages many well-established open-source packages. The algorithm uses morphological operations from SciPy’s multidimensional image processing package to identify distinct MHW objects based on their spatial information. After this step, the algorithm applies the Scikit-image measure module to label and track the identified MHW objects in time and space. The result is a new labeled image dataset of MHW trajectories that we can inspect to characterize the complex spatiotemporal patterns of historic events. This tool has potential applications to other oceanographic hazards such as extreme sea level, hypoxia, oil spills, and low pH events. Ocetrac is a community-driven package for the detection of extreme events in gridded datasets. It is hosted on GitHub, can be installed from source code, PyPI, or Conda-Forge, and is documented using Sphinx. A suite of tests is triggered using GitHub Actions for each pull request and new release, giving Ocetrac a solid foundation for collaborative development. By providing a flexible, interoperable, and open-source package to carry out these specialized calculations, we hope to empower other researchers to adopt, reuse, and remix our feature-tracking methodology as part of their own workflows.
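The idea of labeling objects "in time and space" can be illustrated without any of the packages named above. The sketch below is not Ocetrac's code and does not call SciPy or Scikit-image: it is a plain-Python stand-in that labels connected regions in a small boolean mask indexed as [time][y][x], so that an anomaly which overlaps itself from one time step to the next keeps a single label. It assumes face-sharing (6-neighbour) connectivity, which is a simplification, and omits the morphological clean-up step entirely. The function name and the toy mask are made up for illustration.

```python
from collections import deque

def label_objects(mask):
    """Label connected True cells in a nested-list mask indexed [time][y][x].

    Returns (labels, count), where labels has the same shape as mask, with 0
    for background and 1..count for each connected object.
    """
    nt, ny, nx = len(mask), len(mask[0]), len(mask[0][0])
    labels = [[[0] * nx for _ in range(ny)] for _ in range(nt)]
    # Face-sharing neighbours along time, y, and x (6-connectivity).
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    count = 0
    for t in range(nt):
        for y in range(ny):
            for x in range(nx):
                if not mask[t][y][x] or labels[t][y][x]:
                    continue
                # Unlabeled anomaly cell: start a new object and flood-fill it.
                count += 1
                labels[t][y][x] = count
                queue = deque([(t, y, x)])
                while queue:
                    ct, cy, cx = queue.popleft()
                    for dt, dy, dx in steps:
                        t2, y2, x2 = ct + dt, cy + dy, cx + dx
                        inside = 0 <= t2 < nt and 0 <= y2 < ny and 0 <= x2 < nx
                        if inside and mask[t2][y2][x2] and not labels[t2][y2][x2]:
                            labels[t2][y2][x2] = count
                            queue.append((t2, y2, x2))
    return labels, count

# Toy anomaly mask (1 = above threshold): three time steps on a 3 x 4 grid.
# One feature drifts eastward along the top row; two others are short-lived.
mask = [
    [[1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1]],
    [[0, 1, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
    [[0, 0, 1, 1], [0, 0, 0, 0], [1, 0, 0, 0]],
]
labels, count = label_objects(mask)
```

Here the drifting feature receives the same label at every time step because consecutive footprints overlap, while the two isolated cells each get their own label. That shared label across time is what makes it possible to follow an event's trajectory.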
10:40 AM 20-minute Q&A/discussion
Sub-session #5: Pangeo-Forge workshop: putting ocean data onto the cloud
Friday, 3/4/2022: 11:30am–1:30pm ET
11:30 AM Pangeo Forge mini-Hackathon: Transforming Archival Ocean Data into Cloud-Native Formats
Rachel Wegener, University of Maryland, College Park
Charles Stern, Lamont-Doherty Earth Observatory, Columbia University
The Pangeo Forge platform provides open-source, scalable data transformation algorithms and freely accessible compute infrastructure for the conversion of archival datasets to analysis-ready, cloud-optimized (ARCO) data formats. Anyone can use Pangeo Forge’s premade algorithms to define a data transformation, called a “recipe,” for their dataset of choice. The platform architecture separates the recipe contribution process from the automated cloud infrastructure which performs the scaled conversion of datasets, thereby making recipe contributions accessible to anyone with introductory Python knowledge. Pangeo Forge is a community driven project and is supported by the community that uses it. In this interactive session we will introduce Pangeo Forge in more depth and explain how it fits into the other open science workflows and principles discussed in the session. Participants will then work together to create recipes, the core unit of contribution for converting datasets, for oceanographic datasets of interest. We will engage with key technical concepts for making a recipe and contextualize the recipe within the other infrastructure elements of the Pangeo Forge platform. Participation in this session will involve hands-on coding and group work.
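To give a feel for what defining a data transformation involves, the first thing a contributor usually has to describe is where the source files live and how they line up along a dimension such as time. The sketch below deliberately does not use the Pangeo Forge API. It is a standard-library illustration of that one idea, with a placeholder URL on example.org and a hypothetical one-file-per-day archive layout; an actual recipe would express the same mapping with the platform's own objects.

```python
from datetime import date, timedelta

# Hypothetical archive layout: one NetCDF file per day (placeholder URL).
URL_TEMPLATE = "https://example.org/archive/sst/{year}/sst_{day:%Y%m%d}.nc"

def source_urls(start, end):
    """Return one URL per day from start to end inclusive, in time order."""
    n_days = (end - start).days + 1
    days = (start + timedelta(days=i) for i in range(n_days))
    return [URL_TEMPLATE.format(year=d.year, day=d) for d in days]

# Three consecutive days of (hypothetical) source files.
urls = source_urls(date(2022, 3, 3), date(2022, 3, 5))
```

Keeping this description separate from the machinery that downloads and converts the files mirrors the separation the abstract describes between contributing a recipe and running it at scale.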