Open Reproducible Science: Workflow Structure
One key to open reproducible science is the rigorous organization of all workflow code. This matters not just when you send your project to someone else; a future version of yourself will also benefit when you return to an organized workflow after some time away.
I created a GitHub repository (with accompanying video tutorial) that provides a Python script and Jupyter Notebook each containing an example workflow. I use this structure for all of my computational projects. The workflow contains the following stages:
- Environment Setup
- User-Defined Variables
- Data Acquisition
- Data Preprocessing
- Data Processing
- Data Postprocessing
- Data Visualization
- Data Export
Below I have provided my definitions of each workflow stage. I based the definitions on my experience with geospatial remote sensing projects. Note that the preprocessing, processing, and postprocessing stages may overlap. Where a step falls depends on the nature of the data, the specific software platform, and/or personal preference. For example, some may consider calculating index layers, such as the Normalized Difference Vegetation Index (NDVI) in remote sensing, to be either a preprocessing or a processing step.
In addition, some of my definitions may differ from your personal preference or from what other sources indicate. The semantics of each workflow stage are less important here. Focus more on the rigorous organization of your workflows. An organized project will allow someone else to run the code without your assistance. It will also save you time when you return to the code later.
Your workflow may include only a subset of these stages. Many of my scripts do as well. With that, here are the definitions.
Environment Setup: Code to import packages and modules, set options, and set the working directory.
User-Defined Variables: Code to set file paths and other global variables used throughout the workflow.
Data Acquisition: Code to download data from an Application Programming Interface (API) or other website that hosts data (e.g., Figshare). If you manually created data or downloaded through a Graphical User Interface (GUI), then this step is unnecessary.
Data Preprocessing: Code to prepare raw data for processing. Could be synonymous with data preparation, cleaning, or wrangling. Examples include, but are not limited to:
- Loading data into various data structures
- Renaming elements of the data
- Converting units
- Masking data
- Replacing Not a Number (NaN) or NoData values
- Creating satellite image index layers
- Applying kernels/filters to satellite imagery
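As an illustration, a few of the preprocessing steps above might look like the following in Python with NumPy. The band arrays and the -9999 NoData sentinel are made up for this sketch:

```python
import numpy as np

# Hypothetical red and near-infrared reflectance bands; -9999 marks NoData.
red = np.array([[0.1, 0.2], [-9999.0, 0.3]])
nir = np.array([[0.5, 0.6], [0.7, -9999.0]])

# Replace the NoData sentinel with NaN so it propagates through calculations.
red[red == -9999.0] = np.nan
nir[nir == -9999.0] = np.nan

# Create a satellite image index layer: NDVI = (NIR - Red) / (NIR + Red).
ndvi = (nir - red) / (nir + red)
```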
Data Processing: Code to take preprocessed data as input and produce new data outputs via computational routines. Examples include, but are not limited to:
- Land cover classification
- Change detection
- Spatial analyses
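To make the distinction from preprocessing concrete, here is a toy sketch of a processing step: a land cover classification that thresholds an NDVI layer. The threshold and class codes are illustrative only, not a recommended scheme:

```python
import numpy as np

# Toy NDVI layer, as might be produced by a preprocessing stage.
ndvi = np.array([[0.65, 0.10], [0.45, -0.05]])

# Classify each pixel: 1 = vegetation, 0 = non-vegetation (illustrative threshold).
land_cover = np.where(ndvi > 0.3, 1, 0)
```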
Data Postprocessing: Code to take processed data and create a final data output, via cleanup or corrective routines. Examples include, but are not limited to:
- Morphological routines for land cover classification
- Applying kernels/filters
- Assigning attributes
Data Visualization: Code to create graphics, figures, plots, and/or charts from the preprocessed, processed, or postprocessed data.
Data Export: Code to save preprocessed, processed, and/or postprocessed data as well as data visualization products to tangible files.
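For instance, a Data Export stage might save a processed array to a file. The file name here is hypothetical:

```python
import numpy as np

# Toy processed output, e.g., a land cover classification.
result = np.array([[1, 0], [1, 0]])

# Save the output to a tangible file (hypothetical name).
np.savetxt("land_cover.csv", result, fmt="%d", delimiter=",")
```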
The snippet below shows how I structure base Python scripts.
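This is a minimal sketch of that structure; the section headers mirror the stages defined above, and the file paths are placeholders:

```python
# =============================================================================
# Environment Setup
# =============================================================================
import os

# =============================================================================
# User-Defined Variables
# =============================================================================
input_file = "data/input.csv"      # placeholder path
output_file = "output/result.csv"  # placeholder path

# =============================================================================
# Data Acquisition
# =============================================================================
# (download code would go here; skip if data were acquired manually)

# =============================================================================
# Data Preprocessing
# =============================================================================
# (cleaning, unit conversion, masking, index layers, etc.)

# =============================================================================
# Data Processing
# =============================================================================
# (classification, change detection, spatial analyses, etc.)

# =============================================================================
# Data Postprocessing
# =============================================================================
# (cleanup and corrective routines)

# =============================================================================
# Data Visualization
# =============================================================================
# (graphics, figures, plots, charts)

# =============================================================================
# Data Export
# =============================================================================
# (save outputs to tangible files)
```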
The example-workflow.py file in the GitHub repository also contains a section header for Script Completion. This code provides an unambiguous indication that the script has completed.
Documenting a script provides assistance to anyone who uses your code (including yourself). The snippet below outlines the docstring format I use at the beginning of scripts.
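The exact format in the repository may differ; the template below is a sketch assembled from the items listed afterward, with placeholder text throughout:

```python
"""
example-workflow.py

PURPOSE
    Brief description of what the script does.

DATA REFERENCES
    Links or citations for the input data.

REQUIRED INPUTS
    Data files the script expects.

USER-DEFINED VARIABLES
    Variables that need to be set before running.

OUTPUTS
    Data files the script creates.

NOTES
    Other information or settings relevant to the script.

AUTHOR
    Your Name (your.email@example.com)

LICENSE
    License information for the script.
"""
```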
The docstring contains information required to understand and run the script. This includes, but is not limited to:
- Purpose of the script
- Data references or links
- Required data file inputs
- User-defined variables that need to be set
- Output data files created
- Other information or settings relevant to the script
I add my name and email in case anyone wants to reach out with questions. I also include the script license information.
There is no definitive way to structure your code. It depends on the nature of the workflow as well as personal preference. Organizational or group guidelines can also play a role if you are creating workflows as part of a team.
One method to structure your workflows is the example I provided above, which includes all stages within the same script. Another method is to create separate scripts for various parts of the project. The snippet below provides an example of how this could look.
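The layout below is a sketch of this multi-script approach; the file names are hypothetical:

```
project/
├── 01-data-acquisition.py
├── 02-data-preprocessing.py
├── 03-data-processing.py
├── 04-data-postprocessing.py
├── 05-data-visualization.py
└── 06-data-export.py
```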
Each script would likely contain the Environment Setup, User-Defined Variables, and Data Export stages, as well as the stage relevant to the task name. One main difference with this structure is that each script creates an output for the next sequential script to use as input.
I have worked on projects using both methods. The nature of the data and the required tasks influence the workflow structure.
Hopefully this will help you to create organized workflows. Here are the additional resources I created to complement this article:
The repository provides all code referenced in the article. The tutorial walks through the repository and workflow structure.
See you in the next article.