Supporting New Xarray Contributors

Ryan Abernathey
Published in pangeo
5 min read · Apr 10, 2019

TLDR: Xarray is looking for new contributors of all skill levels!

Xarray is in many ways the foundation of the Pangeo software ecosystem. It provides the core data structures, file I/O, and computational algorithms that underlie climate data science, big and small. (Xarray also has a large community of users in other fields, such as plasma and nuclear physics.) Xarray is growing rapidly, with many powerful new features added in the past year.

Xarray Status Report

The Xarray core developers recently began holding a monthly telecon to coordinate the project’s activities. (Notes from the latest meeting.) From this meeting emerged the following narrative.

  • We all love Xarray and think it’s a great tool. It should probably have more users than it does. (For example, many pandas workflows with multidimensional data are much simpler in Xarray.)
  • Xarray is also a great foundational data structure for other projects to build on. (Satpy is a great example.)
  • But there are a few things that are holding it back from reaching its full potential. New users struggle with its documentation. Tool developers aren’t sure about the best way to extend Xarray and adapt it to domain-specific needs (e.g. geographic lookups).
  • Xarray’s roadmap has some exciting changes in the pipeline that will enable it to become more powerful and extensible, such as more flexible indices, more flexible array support (e.g. sparse or GPU arrays, in addition to existing NumPy and Dask support), and more flexible storage backends.
  • The best person to implement these changes, which involve some internal refactoring, is Stephan Hoyer, the creator of Xarray.
  • To free Stephan up for this deep work, the other devs can help by doing more maintenance, issue triage, and work on other areas of the project (including docs).
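
To make the point about multidimensional data concrete, here is a minimal sketch of what working with named dimensions looks like in Xarray. The toy array, its coordinate values, and the variable names are invented for illustration; only the `DataArray` constructor, `mean`, and `sel` come from Xarray itself.

```python
import numpy as np
import xarray as xr

# Toy 3-D field with named dimensions and labeled coordinates
# (all values are made up for illustration).
temperature = xr.DataArray(
    np.arange(24, dtype=float).reshape(2, 3, 4),
    dims=("time", "lat", "lon"),
    coords={
        "time": [2000, 2001],
        "lat": [-30.0, 0.0, 30.0],
        "lon": [0.0, 90.0, 180.0, 270.0],
    },
    name="temperature",
)

# Reduce over a dimension by name rather than by axis number.
zonal_mean = temperature.mean("lon")

# Select by coordinate label rather than by integer position.
equator_2001 = temperature.sel(time=2001, lat=0.0)

print(zonal_mean.dims)
print(equator_2001.values)
```

Because every operation refers to dimensions and coordinates by name, there is no need to remember axis order or to flatten the data into a table with a multi-level index first.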

Growing the Community

At this point, a limiting factor on Xarray’s growth and improvement is the total amount of available developer time. This is probably a widespread problem in community open source projects. There are two ways to increase developer time: increase the contribution of existing developers or enlist new developers. One interesting aspect of Xarray is that nearly all of its core developers are full-time mainstream scientists (postdocs, research scientists, or professors), rather than professional software engineers. One awesome consequence of this is that it ensures that the project remains relevant and useful to real science use cases. On the other hand, we have limited time; no one has an extra 10 hours a week to work more on Xarray.

So the solution for Xarray sustainability is to grow the community of Xarray developers.

The purpose of this post is to make it clear that we are looking for help and to explain how new contributors, even those new to open source development, can get involved and make an impact. The Xarray core developers are eager and willing to mentor new contributors. At this point it’s also important to mention that Xarray has a clear and progressive code of conduct.

Below is a list of some specific projects that interested contributors could tackle.

Novice-Level Projects

Even if you’re totally new to Python, there are some ways you can help out with Xarray development.

  • Help improve the documentation! We often hear users complain that Xarray’s documentation is opaque. The experienced devs don’t always see it, because it “makes sense to us.” You can start by opening issues to report things that you find confusing or to make suggestions. And you can propose changes or make additions to the docs via pull requests. (List of documentation-related issues)
  • Help promote Xarray to your colleagues! If you’re new to Xarray and have found it helps your scientific workflow, consider sharing your experience via a tweet, blog post, or simply a conversation with colleagues. This will help grow the community, broadening the pool of potential developers. Here’s a great example!

Intermediate Projects

If you already know your way around a pull request and have some solid experience using Xarray, these projects are within your grasp:

  • Help manage Xarray’s issue tracker. We currently have 435 open issues. Help respond to new issues, identify old issues that can be closed, flag duplicates, etc.
  • Add a function to make it easier to compute a weighted average #422
  • Add an option to warn rather than fail when CF decoding raises an error #2848
  • Add the use_cftime option, recently added to xarray.open_dataset, to control time decoding in xarray.open_zarr (no issue as of yet)
  • Add the calendar type to the repr for CFTimeIndex #2416
  • Fix datetime accessor support for PeriodIndex #1565
  • keepdims=True for xarray reductions #2170
  • Creation of empty DataArrays #277
  • Implement a “pad” method #2605
  • Custom fill value for align, reindex and reindex_like #2876
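
To give a feel for the scope of one of these items, here is a rough sketch of what a weighted-average helper (the request in #422) might compute. The function name `weighted_mean`, its signature, and the toy data are hypothetical and are not part of the Xarray API; the sketch only assumes that the weights should be renormalized wherever the data are missing.

```python
import numpy as np
import xarray as xr


def weighted_mean(da, weights, dim):
    """Weighted mean of `da` along `dim`, ignoring NaN values in `da`.

    Hypothetical helper for illustration; not part of the Xarray API.
    """
    # Zero out the weights wherever the data are missing, so that
    # missing points do not contribute to the normalization.
    valid_weights = weights.where(da.notnull(), 0.0)
    # DataArray.sum skips NaN by default for floating-point data.
    total = (da * weights).sum(dim)
    return total / valid_weights.sum(dim)


# Toy data: two latitudes by three times, with one missing value.
da = xr.DataArray(
    [[1.0, 2.0, 3.0], [4.0, np.nan, 6.0]],
    dims=("lat", "time"),
    coords={"lat": [0.0, 60.0]},
)
# Cosine-of-latitude weights, a common choice for area weighting.
weights = np.cos(np.deg2rad(da.lat))

result = weighted_mean(da, weights, "lat")
print(result.values)
```

A real contribution would also need to settle questions this sketch ignores, such as how to handle missing or negative weights and what to return when all weights along a dimension are zero.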

Advanced Projects

If you’re a skilled Python dev, we could really use your help on these:

  • Unit Support: This has been a long-standing feature request. Recent changes to NumPy now mean it’s feasible to do it “right.” Some advice can be found on the Dask blog. (#525)
  • TileDB Backend: TileDB is an exciting new array storage library. Building an Xarray backend for TileDB would enable lots of new possibilities. (https://github.com/pangeo-data/pangeo/issues/120)
  • NCML Format Support: NCML is an XML file spec that allows many netCDF files to be aggregated into a single virtual dataset. We would love to be able to read NCML files and generate these aggregate datasets in Xarray. (#2697)
  • Allow DataArray to hold cell boundaries as coordinate variables #1475
  • Enhanced HTML repr of Xarray objects: This one has a PR started (#1820) but requires help from an HTML / CSS / Jupyter guru.
Screenshot of possible Xarray HTML repr (PR #1820)

Next Steps

If any of this speaks to you and you’re motivated to become an Xarray contributor, that’s fantastic! To get involved, you have several options:

  • Comment on the GitHub issues linked above. Identify yourself as a new contributor and state your willingness to help.
  • Comment on this blog post below.
  • Email me personally at rpa@ldeo.columbia.edu.

In any case, we can pair you with an experienced Xarray developer to mentor you and help guide your contribution.

There is also a growing ecosystem of “Xarray related” projects, which use or build additional functionality on top of Xarray. A future blog post will address some ideas for how to organize these efforts.

Welcome to all new Xarray contributors!

Ryan Abernathey is an editor for pangeo and Associate Professor, Earth & Environmental Sciences, Columbia University. https://rabernat.github.io/