Spotlight: Distributed-Something–Run Encapsulated Docker Containers in AWS

Tim Bonnemann
Open-Source Science (OSSci)
Sep 28, 2023
Distributed-Something schematic

Welcome to our new Spotlight Series, where we get to know science-focused open-source software projects in order to better understand the opportunities for Open-Source Science (OSSci) to add value and help accelerate scientific research and discovery through better, stronger open source in science.

Welcome to OSSci Spotlight! Who are you? What do you do? And what brings you to open-source software in science?

I’m Dr. Erin Weisbart, a computational biologist and senior staff scientist in the Cimini Lab, part of the Imaging Platform (home of the Cimini Lab and Carpenter-Singh Lab) at the Broad Institute of MIT and Harvard. I’m a former wet lab biochemist turned computational biologist. In the Cimini Lab, we work with biologists to create image analysis workflows and develop open source tools and software for bioimage analysis. We are best known for the open source software CellProfiler, which allows for the creation of no-code, reproducible image analysis workflows. Open science and open source software are important to the Imaging Platform, and many of our projects, along with the tools we create as we work on them, are open from the beginning: from raw data through pipelines and software to data analysis and interpretation.

Personally, almost all of the work I have done since becoming a computationalist has been open source, which is important to me because it fits with my ethos as an obsessive documenter and lover of education and collaboration.

Please tell us about Distributed-Something. What does it do? And how did it get started?

Distributed-Something is an open-source software template that allows you to run any software that can be Dockerized in a distributed fashion in Amazon Web Services (AWS), so that you can scale parallelizable workflows without computing power or data storage limits. Though there are many tools that have a similar function, our tool caters to an end user base that is extremely low/no code, filling an important niche in the ecosystem. Once a Distributed- software has been set up, it requires only 2 human-readable config files and 3 single-line Python commands to run, so it is very accessible to novice computationalists. (For example, when I joined the Carpenter Lab I had effectively no computational experience, and I was running Distributed-CellProfiler on my own within my first month in the lab.)
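To make the idea of a parallelizable workflow driven by a human-readable config concrete, here is a minimal sketch. It is not the actual Distributed-Something config format or code: the field names, the bucket paths, and the `expand_jobs` helper are all hypothetical and exist only for illustration. It shows how a single config can be expanded into many independent jobs, each of which could be handed to its own container.

```python
import json
from itertools import product

# Hypothetical, human-readable job description. The field names are
# illustrative assumptions, not the actual Distributed-Something schema.
JOB_CONFIG = """
{
  "docker_image": "example/my-tool:latest",
  "input_prefix": "s3://example-bucket/project/images",
  "output_prefix": "s3://example-bucket/project/results",
  "groups": {
    "plate": ["Plate1", "Plate2"],
    "well": ["A01", "A02", "A03"]
  }
}
"""


def expand_jobs(config: dict) -> list[dict]:
    """Turn one config into one independent job per group combination."""
    keys = list(config["groups"])
    jobs = []
    for values in product(*(config["groups"][k] for k in keys)):
        group = dict(zip(keys, values))
        subpath = "/".join(group[k] for k in keys)
        jobs.append({
            "image": config["docker_image"],
            "input": f'{config["input_prefix"]}/{subpath}',
            "output": f'{config["output_prefix"]}/{subpath}',
            "group": group,
        })
    return jobs


jobs = expand_jobs(json.loads(JOB_CONFIG))
print(len(jobs), "independent jobs")  # 2 plates x 3 wells = 6 jobs
print(jobs[0]["input"])
```

Because no job depends on another job's output, the list can be processed by as many workers as are available, which is what makes this kind of workflow straightforward to scale.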

Distributed-Something is also lightweight. It requires only moderate computational comfort in Python and a handful of hours to create a new Distributed- implementation (we have a Distributed-HelloWorld showing all the necessary changes in a single commit). Importantly, both customization and use are heavily documented because making our software accessible is really important to our team (and as I mentioned, I love writing documentation). A description of Distributed-Something was recently published in Nature Methods: Distributed-Something: scripts to leverage AWS storage and computing for distributed workflows at scale.

The first of our Distributed- softwares was Distributed-CellProfiler, written frantically by Juan Caicedo and Shantanu Singh in the Carpenter Lab when they needed to process a large dataset with CellProfiler and their local server went down. Distributed-CellProfiler was published in PLOS Biology in 2018.

When Beth Cimini and I were developing a new, complex image analysis workflow (now in preprint), we found that we needed to distribute a particular function that wasn’t well handled by CellProfiler (stitching images together), so we created Distributed-FIJI to distribute ImageJ, the most popular open source bioimage analysis tool. Beth realized the utility of the Distributed- framework itself, so we abstracted it and turned it into a template, and Distributed-Something was born.

We often work at scale, so we have already created implementations for converting images to the next-generation file format .ome.zarr using the bioformats2raw library (Distributed-OMEZarrCreator) and for collating .csv files using pycytominer (Distributed-Collate).

Are there any concrete examples of how Distributed-Something is being used in practice?

Distributed-Something implementations are regularly used by our team and collaborators and are integral tools in many of our biology-focused publications. Some examples follow.

Distributed-CellProfiler has been used to generate most of the Carpenter-Singh and Cimini Labs’ image-based profiling datasets, most of which are publicly available in the Cell Painting Gallery. It was recently used by many of the partners of the Joint Undertaking for Morphological Profiling (JUMP) Cell Painting Consortium, a partnership led by the Carpenter-Singh Lab between 10 pharma partners, 2 non-profit partners, and 6 supporting partners, to generate the largest public Cell Painting dataset, containing more than 136,000 genetic and chemical perturbations in ~115 TB of data. A preprint describing the dataset is available.

Distributed-OMEZarrCreator was used to convert a 20 TB dataset, described in this publication, into the .ome.zarr file format. The utility of the conversion is described in this publication, and many more datasets in the Cell Painting Gallery are planned for conversion using this tool soon.

Distributed-FIJI (along with Distributed-CellProfiler) was used in a complex image processing pipeline to process 3 genome-wide CRISPR screens, combining Cell Painting and pooled optical profiling to create a genome-wide atlas of human cell morphology described in this preprint.

Distributed-Collate was used to collate thousands of .csv files that held the data analyzed in a publication exploring the robustness of the Cell Painting assay across a number of different imaging systems, described in this preprint.

What about funding? How do you manage to sustain your project?

Distributed-Something, unlike our lab’s larger softwares (CellProfiler, CellProfiler Analyst, and Piximi), has primarily been developed and funded piece by piece, through grants for projects that needed the individual pieces. As I mentioned, one of the strengths of Distributed-Something is that a moderately skilled Python developer can create a new implementation in a handful of hours, which makes it relatively simple to grow the DistributedScience ecosystem piece by piece.

What about challenges? What keeps you up at night?

I’m relatively new to computational biology and computer programming myself, and our lab often does education and outreach to beginning computationalists, so making our software as accessible to novice users as possible while balancing that with customizability is always on my mind.

I also worry about folks accidentally provisioning infrastructure that costs them money without understanding why or how to clean it up. I’ve extensively documented what parts of our software can trigger costs and I’m always expanding our “Kill Suite”, a set of scripts that live in Lambda functions in AWS and look for leftover/accidental infrastructure and “kill” it.
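As a rough illustration of the kind of check such a cleanup script might perform, here is a minimal sketch. It is not the actual Kill Suite code: the `Resource` record, the resource kinds, the `find_leftovers` helper, and the 24-hour threshold are all hypothetical, and a real Lambda function would query AWS for its inventory rather than use a hard-coded list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Resource:
    # Simplified stand-in for a record a real script would obtain from AWS.
    resource_id: str
    kind: str
    launched_at: datetime
    has_pending_jobs: bool


def find_leftovers(resources, now, max_idle=timedelta(hours=24)):
    """Return resources that are old and no longer have any work queued."""
    return [
        r for r in resources
        if not r.has_pending_jobs and now - r.launched_at > max_idle
    ]


now = datetime(2023, 9, 28, 12, 0, tzinfo=timezone.utc)
inventory = [
    Resource("fleet-1", "compute-fleet", now - timedelta(days=3), False),
    Resource("fleet-2", "compute-fleet", now - timedelta(hours=2), False),
    Resource("queue-1", "queue", now - timedelta(days=5), True),
]

# Only report candidates here; an actual cleanup step would delete them.
for r in find_leftovers(inventory, now):
    print(f"would clean up {r.kind} {r.resource_id}")
```

In this sketch only `fleet-1` is flagged: `fleet-2` is too recent, and `queue-1` still has pending work. Keeping the decision logic separate from the deletion step makes it easy to run in a report-only mode first.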

Looking ahead, where do you see your project is headed? What are your goals and aspirations?

There is booming interest in deep learning in bioimage analysis (as in just about every other scientific field right now), so our short-term goals are to create Distributed-Something implementations of some of the most popular deep learning tools, like Cellpose and StarDist.

A bigger picture project I’m working on is a workflow that automates an entire assay at scale, including multiple steps that use Distributed-CellProfiler and other Distributed- softwares. Our lab frequently runs the Cell Painting assay (a widely used, high-content image-based assay for generating cell morphological profiles) and a novel pooled version of the Cell Painting assay, both of which require a stereotyped set of perfectly parallel computational steps. I hope that by automating these assays in workflow systems we can further simplify use of these popular assays at scale and provide examples for how Distributed- softwares can fit into workflow systems.
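To illustrate what a set of perfectly parallel steps looks like inside a workflow, here is a minimal sketch. It is not our actual workflow: the step functions `segment` and `measure` are hypothetical placeholders, and a local thread pool stands in for distributed workers. Stages run one after another, while the items within each stage are processed independently of each other.

```python
from concurrent.futures import ThreadPoolExecutor


def segment(item: str) -> str:
    # Placeholder for a real processing step.
    return f"{item}|segmented"


def measure(item: str) -> str:
    # Placeholder for a second step that consumes the first step's output.
    return f"{item}|measured"


def run_stage(step, items, max_workers=4):
    """Apply one step to every item; items do not depend on each other."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(step, items))  # map preserves input order


def run_workflow(steps, items):
    """Run stages sequentially, with each stage parallel across items."""
    for step in steps:
        items = run_stage(step, items)
    return items


wells = ["A01", "A02", "B01"]
results = run_workflow([segment, measure], wells)
print(results)
# ['A01|segmented|measured', 'A02|segmented|measured', 'B01|segmented|measured']
```

The only synchronization point is the boundary between stages, which is why each stage maps naturally onto a separate distributed run.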

Last but not least, who would be your ideal contributor? And how can people get involved?

We would love to see Distributed-Something spread across scientific disciplines. There is nothing bioimage-specific about Distributed-Something; that’s just the field we are in. We would love to have researchers outside of the bioimaging sphere use our open-source template to make (and share!) Distributed-Something implementations for software in their field.

Folks are also welcome to contribute suggestions by opening an issue in our GitHub repository, or to contribute code by filing a PR! Our Distributed-Something repository is on GitHub.

Thank you very much!

Thanks for reading! Are you involved in an OSS project in science and would like to share your experience? Let us know!
