The University of Washington eScience Institute recently hosted a week-long hackweek on technologies for accessing and processing ICESat-2 data. ICESat-2 is a satellite laser altimeter of unprecedented quality, launched on September 15, 2018 and designed to precisely measure Earth elevations over time. Our event focused on applying ICESat-2 data to monitoring the cryosphere: the Earth's frozen water systems, including sea ice, glaciers, and ice sheets. This post will briefly recap the event, discuss the role of Pangeo during the week, and explain why we think these hackweeks should be a standard follow-on to future satellite launches!
Data from ICESat-2 was made public via NASA’s National Snow and Ice Data Center (NSIDC) on May 28, 2019, and is growing at a rate of ~1 terabyte per day. These data volumes are increasingly common for new satellite sensors, and are both a boon and a source of frustration for researchers. Scientists want to test their hypotheses against this new information quickly by making interactive plots, comparing with other datasets, and running custom code. Most scientists do not want to learn the syntax of new APIs to search for data, skim through static documents for metadata details, or wait for hours as hundreds of gigabytes fill up their laptop disk. So understandably, it can be frustrating to get up to speed with new datasets.
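To put those waiting times in perspective, here is a rough back-of-the-envelope sketch. Only the ~1 terabyte per day growth rate comes from the figures above; the 100 Mbit/s connection and the 300 GB subset are hypothetical values chosen purely for illustration, and real transfers are usually slower than the nominal line rate.

```python
def download_hours(size_gb: float, bandwidth_mbps: float) -> float:
    """Hours needed to move size_gb (decimal gigabytes) at bandwidth_mbps (megabits/s)."""
    size_megabits = size_gb * 1000 * 8  # GB -> MB -> megabits
    return size_megabits / bandwidth_mbps / 3600


ARCHIVE_GROWTH_GB_PER_DAY = 1000.0  # ~1 TB/day, as quoted in the post
ASSUMED_BANDWIDTH_MBPS = 100.0      # hypothetical sustained connection speed
ASSUMED_SUBSET_GB = 300.0           # hypothetical "hundreds of gigabytes" request

# Time to pull a single regional subset onto a laptop.
hours_for_subset = download_hours(ASSUMED_SUBSET_GB, ASSUMED_BANDWIDTH_MBPS)

# Time to download just one day's worth of new archive data.
hours_per_archive_day = download_hours(ARCHIVE_GROWTH_GB_PER_DAY, ASSUMED_BANDWIDTH_MBPS)

print(f"Subset download: {hours_for_subset:.1f} h")
print(f"One day of archive growth: {hours_per_archive_day:.1f} h")
```

Under these assumptions, a single connection would spend most of each day just keeping pace with new data, which is one reason to bring the computation to the data instead.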
A major goal of the hackweek was to bypass such frustrations by acquainting participants with cohesive tutorials that use a common suite of open-source Python libraries. We were fortunate to kick off the hackweek with an insightful introduction to Jupyter and GitHub by Fernando Perez. Amy Steiker followed up by presenting a comprehensive interactive notebook from the NSIDC that has everything needed for programmatic access to ICESat-2 data. With the foundation in place, we moved on to libraries such as h5py, geopandas, and geoviews. We won’t describe all the tutorials in this post, but they are all available on YouTube and linked from the hackweek schedule!
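To give a flavor of what working with these libraries looks like, here is a minimal h5py sketch. It is not taken from the hackweek tutorials. It assumes an ATL06-style layout (one group per beam, each with a land_ice_segments subgroup holding latitude, longitude, and h_li arrays) and writes a tiny synthetic file in that layout so it runs without downloading anything. The file name, coordinates, heights, and fill-value threshold are all made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

FILL_THRESHOLD = 1e38  # assumption: heights at or above this mean "no data"


def write_synthetic_granule(path, n_segments=5):
    """Write a tiny stand-in file that mimics the assumed ATL06-style layout."""
    rng = np.random.default_rng(0)
    with h5py.File(path, "w") as f:
        for beam in ("gt1l", "gt1r"):  # two beams are enough for a demo
            seg = f.create_group(beam).create_group("land_ice_segments")
            seg.create_dataset("latitude", data=np.linspace(69.0, 69.1, n_segments))
            seg.create_dataset("longitude", data=np.linspace(-49.5, -49.4, n_segments))
            heights = rng.uniform(100.0, 200.0, n_segments)
            heights[0] = 3.4028235e38  # mimic one missing measurement
            seg.create_dataset("h_li", data=heights)


def read_beam(path, beam):
    """Return (lat, lon, height) arrays for one beam, dropping fill values."""
    with h5py.File(path, "r") as f:
        seg = f[beam]["land_ice_segments"]
        lat = seg["latitude"][:]
        lon = seg["longitude"][:]
        h = seg["h_li"][:]
    valid = h < FILL_THRESHOLD
    return lat[valid], lon[valid], h[valid]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "synthetic_ATL06.h5")  # hypothetical file name
    write_synthetic_granule(path)
    lat, lon, h = read_beam(path, "gt1l")
    print(f"{h.size} valid segments, mean height {h.mean():.1f} m")
```

From here, the arrays could be handed to geopandas or geoviews for mapping. With a real granule you would skip the synthetic writer and check the product's data dictionary for the exact variable names and fill values.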
Open-source tools and tutorials are amazing resources, but scientists still face hurdles in quickly adopting them into their workflows. Here’s where the hackweek can shine. Hackweeks provide opportunities for community building, peer learning, networking, and collaborative project work within a welcoming and inclusive environment. We blend tutorials with open project work as a way to unleash team creativity, while enabling individuals to make rapid progress on their own data science challenges. The key to a successful hackweek is to give everyone access to a common set of data science tools so that we can maximize learning and minimize time spent installing software and libraries. It was clear to us that this was an ideal opportunity to test a Pangeo JupyterHub deployment with a group of 70 motivated scientists.
The ICESat-2 Pangeo JupyterHub
We stood up https://icesat2.pangeo.io three months before the hackweek. It was our first attempt to mimic existing Google Cloud deployments using the recently released AWS Elastic Kubernetes Service. We borrowed heavily from the existing UK Met Office Informatics Lab Kubernetes and Pangeo JupyterHub configurations on GitHub, and made a few modifications to take advantage of the ever-improving software tools in this space. In particular, we think the eksctl utility for deploying Kubernetes clusters has greatly simplified the deployment process on AWS. Also, yuvipanda has been instrumental in creating a continuous deployment system for multiple hubs, the “pangeo cloud federation”, which has greatly streamlined the process of deploying and updating hubs on Google, Azure, and AWS!
We’re happy to report that this deployment worked fantastically for an event of this size! While many of the blog posts on Pangeo focus on scalable computing with big data, many aspects of these cloud deployments are also useful for analyzing smaller datasets. In the scientific community there is an increasing need for computing tools that sit somewhere between a personal computer and an HPC cluster. During the hackweek, there was limited need to launch distributed dask-kubernetes clusters, but there was a great need for a reliable JupyterHub with authentication, a tested and pre-configured environment, reasonable computing resources per participant that autoscale, and a common place for fast and reliable data sharing. The total cost of running these services for one week was less than $1000, a very small sum compared to the costs of building and launching satellites!
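For a rough sense of scale, the quoted figures can be turned into a per-person estimate. This is only a sketch: $1000 is the upper bound given above, 70 is the group size, and treating "one week" as 7 days of running services is our assumption.

```python
TOTAL_COST_USD = 1000.0  # upper bound quoted in the post
PARTICIPANTS = 70        # group size quoted in the post
DAYS = 7                 # assumption: "one week" taken as 7 days of uptime


def cost_breakdown(total_usd, participants, days):
    """Return (cost per participant, cost per participant per day) in USD."""
    per_participant = total_usd / participants
    per_participant_day = per_participant / days
    return per_participant, per_participant_day


per_person, per_person_day = cost_breakdown(TOTAL_COST_USD, PARTICIPANTS, DAYS)
print(f"At most ${per_person:.2f} per participant for the week")
print(f"At most ${per_person_day:.2f} per participant per day")
```

Since the total was under $1000, these numbers are upper bounds under the stated assumptions.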
Summing it up and looking ahead
ICESat-2 Hackweek was a successful endeavor because of a proven hackweek educational model, motivated instructors who volunteered their time to develop tutorial content, and motivated participants who spent a full week immersing themselves in new data science skills to apply this data to important research questions. Each project has its own GitHub repository with a readme file explaining goals and accomplishments during the week. For example, check out topohack, a project to “Compare and evaluate ICESat-2 data with high resolution DEMs (airborne lidar/satellite stereo) collected at lower latitudes over bare ground.” We encourage you to explore all the amazing tutorials and team projects in the ICESat-2 GitHub Organization. Curious how the event went from the perspective of a participant? Read PhD student Robbie Mallet’s excellent blog post, or check out Twitter #CSIhackweek!
The Pangeo JupyterHub infrastructure worked very well for ICESat-2 Hackweek, and there are many features we are excited to make more use of in the future. We are particularly excited about the ability to replicate the computational environment with Binder to provide dynamic tutorial content to a broad audience. Looking ahead, as space agencies move satellite archives to cloud storage in new cloud-optimized formats (such as Cloud-Optimized GeoTIFF or Zarr-backed NetCDF), the Pangeo JupyterHub approach of deploying computational environments alongside data archives can revolutionize the way scientists use entire satellite archives in their research (see these examples using Landsat 8 data or ocean circulation models).
In closing, we think this event was a major success that will accelerate the use of ICESat-2 data. Hackweeks aid the scientific community in our quest for new discoveries and we hope that they become standard practice for the launch of any new satellite!
This article was co-written by Scott Henderson and Anthony Arendt. The ICESat-2 Pangeo JupyterHub was implemented with support from AWS Research credits and NASA Grant 17-ACCESS17-0003. We’d like to thank all the instructors and participants who volunteered their time to produce great tutorials and projects that will aid and inspire many years of scientific discovery with ICESat-2.