Spotlight: Parsl – Productive Parallel Programming in Python

Tim Bonnemann
Open-Source Science (OSSci)
Apr 26, 2024
Photo credit: Ben Clifford, Daniel S. Katz. No rights reserved (CC0).

Welcome to our new Spotlight Series, where we get to know science-focused open-source software projects in order to better understand how Open-Source Science (OSSci) can add value and help accelerate scientific research through better open source.

Q: Let’s start with a round of introductions. Who are you? What do you do? And what is your relation with open-source software in science?

Kyle: I’m Kyle Chard, a Research Associate Professor at the University of Chicago, and I hold a joint appointment at Argonne National Laboratory. I’m interested in open-source software as a method of enabling the community to work together to build high-quality and open software for science.

Dan: I’m Daniel S. (Dan) Katz. I’m the chief scientist at the National Center for Supercomputing Applications (NCSA) and a Research Associate Professor at the University of Illinois Urbana-Champaign. I’m interested in sustainable research software and policy issues such as citation and credit mechanisms and practices associated with software and data, organization and community practices for collaboration, and career paths for computing researchers. I co-founded the Journal of Open Source Software (JOSS), the US Research Software Engineer Association (US-RSE), and the Research Software Alliance (ReSA).

Q: Tell us about Parsl. What does it do? And how did it come about?

Parsl allows a Python programmer to make calls to Python functions and external executables asynchronous, so that they can run in parallel when they don’t have data dependencies. When there are data dependencies, Parsl’s runtime controls execution so that each call runs as soon as its inputs are available. The Parsl runtime also works with many types of supercomputers, clouds, and local systems, so that a Parsl program can express large amounts of potential parallelism, with the runtime making best use of the available compute resources.

Parsl’s webpage includes links to community information such as our Code of Conduct and governance guide, along with user and developer information such as documentation and interactive tutorials. Parsl uses GitHub for code and issue management, where you can access the (Apache-2.0 licensed) code and create issues and pull requests. The GitHub project includes tutorials, documentation, and associated research projects. Parsl can be installed from PyPI, conda-forge, and Spack. We also use Slack for community discussions and offer a Zoom call every two weeks for developers and users to chat. Parsl will soon be a NumFOCUS fiscally sponsored project.
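
The asynchronous call model described above can be sketched with Python’s standard library alone. The following is a minimal illustration and not Parsl’s API: it uses `concurrent.futures` as a stand-in, and the function names `square` and `add` are hypothetical. Parsl itself provides decorators such as `python_app` and `bash_app` for this purpose, and its runtime tracks dependencies for you, whereas in this sketch the dependent task simply blocks until its inputs are ready.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def square(x: int) -> int:
    """An independent task: has no data dependencies."""
    return x * x


def add(a: Future, b: Future) -> int:
    """A dependent task: needs the results of two earlier tasks.

    Assumption for this sketch: the task waits on its inputs itself.
    A dataflow runtime would instead delay launching it until both
    inputs are available.
    """
    return a.result() + b.result()


with ThreadPoolExecutor(max_workers=4) as pool:
    # These two calls share no data, so they may run in parallel.
    fut_a = pool.submit(square, 3)
    fut_b = pool.submit(square, 4)
    # This call consumes both futures, so it completes only after them.
    fut_sum = pool.submit(add, fut_a, fut_b)
    total = fut_sum.result()

print(total)  # 25
```

Each `submit` call returns immediately with a future, which is what lets the program describe more parallelism than is running at any one moment.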

Q: Could you give a concrete example of how Parsl is being used in practice?

Parsl has had significant impact in many domains. For example, it was used to produce the most interconnected simulated sky survey in preparation for the Rubin Observatory [1], to conduct one of the largest single-batch imputations ever performed, on 474k subjects in the Million Veteran Program [2], to search for potential COVID-19 therapeutics in a search space of billions of molecules [3], and to scale large language models to identify worrying COVID variants (2022 Gordon Bell COVID-19) [4]. Parsl is also a crucial building block for large-scale parallel and distributed systems, such as Globus Compute [5], Colmena [6], DLHub [7], and Garden [8].

A detailed example of Parsl usage is found in the DESC project, which uses Parsl for various workflows. In a recent paper, the team outlined how they scaled an image simulation (imSim) workflow that combines Python code to steer the workflow, Parsl to manage the large-scale distributed execution of workflow components, and software containers (Singularity and Shifter) to carry out the image simulation campaign across two DOE supercomputers, ALCF Theta and NERSC Cori, scaling up to 4000 nodes [9]. Briefly, the workflow uses containers with the imSim software to simulate each of the LSST camera’s 189 CCD sensors. The workflow takes as input an instance catalog that specifies cosmological objects and their positions, and uses it to determine which objects cast light on each sensor. The workflow parallelizes analysis across sensors and across the sky, and then patches the results together into a single dataset. The team used this workflow to simulate five years of observations for 300 square degrees of sky area and has released the dataset for further science use. In addition, the DESC team has built its own workflow framework on Parsl, created plugins for the LSST bulk production workflow, and developed new tooling, for example to visualize Parsl monitoring information.
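
The fan-out and fan-in shape of this workflow (simulate each sensor independently, then merge the results) can be sketched as follows. This is an illustration only, using Python’s standard `concurrent.futures` rather than Parsl or the imSim software. The names `simulate_sensor` and `combine` and the three-object toy catalog are hypothetical stand-ins; only the sensor count of 189 comes from the text.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SENSORS = 189  # number of CCD sensors, as stated in the text


def simulate_sensor(sensor_id: int, catalog: list) -> dict:
    """Hypothetical stand-in for one per-sensor simulation task.

    It only selects the catalog objects assigned to this sensor;
    the real workflow runs an image simulation here.
    """
    hits = [obj["name"] for obj in catalog if obj["sensor"] == sensor_id]
    return {"sensor": sensor_id, "objects": hits}


def combine(results: list) -> dict:
    """Fan-in step: merge per-sensor results into a single dataset."""
    return {r["sensor"]: r["objects"] for r in results}


# Toy instance catalog (made up for illustration).
catalog = [
    {"name": "galaxy-a", "sensor": 0},
    {"name": "star-b", "sensor": 0},
    {"name": "galaxy-c", "sensor": 188},
]

with ThreadPoolExecutor(max_workers=8) as pool:
    # Fan-out: one independent task per sensor.
    futures = [
        pool.submit(simulate_sensor, sensor_id, catalog)
        for sensor_id in range(NUM_SENSORS)
    ]
    # Fan-in: gather every result, then patch them together.
    dataset = combine([f.result() for f in futures])

print(len(dataset))  # 189
```

Because the per-sensor tasks share no data, the number that run at once is limited only by the workers available, which is the property that lets a workflow of this shape scale across many nodes.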

  • [1] The LSST Dark Energy Science Collaboration (LSST DESC), The LSST DESC DC2 Simulated Sky Survey, The Astrophysical Journal Supplement Series, 253(1), 2021. doi: 10.3847/1538-4365/abd62c
  • [2] Haley Hunter-Zinck, Yunling Shi, Man Li, Bryan R. Gorman, et al. Genotyping array design and data quality control in the million veteran program. The American Journal of Human Genetics, 106(4):535–548, April 2020. doi: 10.1016/j.ajhg.2020.03.004
  • [3] Y. Babuji, et al. Targeting SARS-CoV-2 with AI- and HPC-enabled lead generation: A first data release, 2020. doi: 10.48550/arXiv.2006.02431
  • [4] M. Zvyagin, et al. GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics, The International Journal of High Performance Computing Applications, 37(6), 2023. doi: 10.1177/10943420231201
  • [5] R. Chard et al. “FuncX: A Federated Function Serving Fabric for Science.” 29th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC). 2020. doi: 10.1145/3369583.3392683
  • [6] L. Ward, et al., “Colmena: Scalable Machine-Learning-Based Steering of Ensemble Simulations for High Performance Computing,” in 2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC), St. Louis, MO, USA, 2021, pp. 9–20. doi: 10.1109/MLHPC54614.2021.00007
  • [7] R. Chard et al., “DLHub: Model and Data Serving for Science,” 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Rio de Janeiro, Brazil, 2019, pp. 283–292, doi: 10.1109/IPDPS.2019.00038.
  • [8] https://thegardens.ai/
  • [9] A. S. Villarreal, Y. Babuji, T. Uram, D. S. Katz, K. Chard and K. Heitmann, “Extreme Scale Survey Simulation with Python Workflows,” 2021 IEEE 17th International Conference on eScience (eScience), Innsbruck, Austria, 2021, pp. 206–214, doi: 10.1109/eScience51609.2021.00031.

Q: How is your project funded? And how do you manage to sustain it?

Parsl has been funded through a variety of mechanisms. It was originally funded by a set of collaborative 5-year NSF SI2 awards, and now is supported by collaborative 2-year NSF CSSI awards (in the transition-to-sustainability track) and a 2-year Chan Zuckerberg Initiative Essential Open Source Software for Science (EOSS) award. In addition, we’ve had support from the DOE ECP ExaWorks project and Argonne National Laboratory via a Laboratory Directed Research and Development (LDRD) award. We are currently receiving some funding from LSST DESC to support bug fixes and new features for their use.

We receive unfunded support from many people and projects, from direct users of Parsl to developers of platforms that rely on Parsl to developers of collaborative projects that align with Parsl to administrators who support Parsl users on their systems.

Our NSF project is aimed at making Parsl sustainable, and we are working to transition Parsl to a community-governed and community-supported open-source project, with future income to be handled and distributed by a 501(c)(3) organization (we’ve been accepted by NumFOCUS and are currently working to finalize this) under the direction of an elected Parsl Coordination Committee. The project will deliver a sustainable Parsl community through a) targeted technical activities that reduce costs and barriers for contribution; b) outreach, engagement, and education programs that build the Parsl community; and c) pathways and incentives to convert some users to contributors and some contributors to leaders.

Q: What are the key challenges that keep you up at night, both in your day-to-day work on Parsl and longer term?

The main challenges we think about are:

  • Supporting the core team who work on Parsl.
  • Growing the contributor base to ensure that we are not dependent on a small number of funding streams.
  • Ensuring the robustness/reliability/scalability/performance of the software on a range of different platforms.
  • Doing work that some find less exciting than software development (e.g., docs, tutorials).
  • Finding a way to incorporate new research into a production platform, with minimum criteria for acceptance and an idea about how such new features will be supported in production for some future period.

Q: Looking ahead at the next couple of years or so, where do you see your project headed? What are your aspirations?

We hope the project will continue to grow and that membership in NumFOCUS along with our new governance model will lead to a more open and truly community-led project. We are hopeful that we can establish a thriving community of contributors, users, administrators, and supporters, with self-organizing groups focusing on different aspects (from champions in specific domains to groups focused on documentation and testing).

Q: Last but not least, who are your ideal contributors? And how can people get involved?

We would like to encourage contributions from a broad range of people: scientists who use Parsl, developers who are building on it, administrators who support its use on their systems, and others who find the project interesting. We welcome a range of contributions, from updates to documentation and our website, to tutorials, testing code, CI, and scalability analyses, through to core code contributions, new research additions, and integrations with other software.

Excellent! Thank you very much, Kyle and Dan!
