Open Reproducible Science: Workflow Structure

Learn to create a rigorous workflow structure with Python

Cale Kochenour
Geospatial Talent Stack
4 min read · Mar 29, 2021


Looking across a field toward mountains in Colorado, United States
Credit: Image by Galyna_Andrushko via Envato Elements

Overview

One key to open reproducible science is rigorous organization of all workflow code, and not just for when you send your project to someone else. A future version of yourself will also benefit when you return to an organized workflow after some time away.

I created a GitHub repository (with accompanying video tutorial) that provides a Python script and Jupyter Notebook each containing an example workflow. I use this structure for all of my computational projects. The workflow contains the following stages:

  • Environment Setup
  • User-Defined Variables
  • Data Acquisition
  • Data Preprocessing
  • Data Processing
  • Data Postprocessing
  • Data Visualization
  • Data Export

Workflow Stages

Below I have provided my definitions of each workflow stage. I based the definitions on my experience with geospatial remote sensing projects. Note that the preprocessing, processing, and postprocessing stages may overlap, depending on the nature of the data, the specific software platform, and/or personal preference. For example, some may consider the calculation of index layers like the Normalized Difference Vegetation Index (NDVI) in remote sensing to be either a preprocessing or a processing step.

In addition, some of my definitions may differ from your personal preference or from what other sources indicate. The semantics of each workflow stage are less important here; focus more on the rigorous organization of your workflows. An organized project will allow someone else to run the code independently, without your assistance. It will also save you time when you return to the code later.

Your workflow may include only a subset of these stages. Many of my scripts do as well. With that, here are the definitions.

Environment Setup: Code to import packages and modules, set options, and set the working directory.
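
As a minimal sketch, an environment setup section might look like the following (the packages, options, and working directory path are illustrative assumptions, not requirements):

# Environment setup: import packages, set options, set the working directory
import os

import matplotlib.pyplot as plt
import pandas as pd

# Set display and plotting options
pd.set_option("display.max_columns", None)
plt.rcParams["figure.figsize"] = (12, 8)

# Set the working directory (this path is illustrative)
os.chdir(os.path.expanduser(os.path.join("~", "projects", "example-workflow")))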

User-Defined Variables: Code to set file paths and other global variables used throughout the workflow.
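
A user-defined variables section might then set paths and thresholds near the top of the script, where they are easy to find and change (all names and paths below are hypothetical):

# User-defined variables (all names and paths are hypothetical)
import os  # normally imported once in the environment setup section

INPUT_RASTER_PATH = os.path.join("data", "raw", "landsat-scene.tif")
OUTPUT_DIRECTORY = os.path.join("data", "processed")
CLOUD_COVER_THRESHOLD = 20  # maximum acceptable cloud cover, in percent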

Data Acquisition: Code to download data from an Application Programming Interface (API) or other website that hosts data (e.g., Figshare). If you manually created data or downloaded through a Graphical User Interface (GUI), then this step is unnecessary.
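
For example, a data acquisition section might download a hosted file with the requests package (the URL and file paths below are placeholders):

# Data acquisition: download a hosted file (URL and paths are placeholders)
import os

import requests

url = "https://example.com/path/to/dataset.zip"  # placeholder URL
response = requests.get(url, timeout=60)
response.raise_for_status()

# Write the downloaded bytes to a raw data folder
os.makedirs(os.path.join("data", "raw"), exist_ok=True)
with open(os.path.join("data", "raw", "dataset.zip"), "wb") as file:
    file.write(response.content)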

Data Preprocessing: Code to prepare raw data for processing. This stage could be synonymous with data preparation, cleaning, or wrangling. Examples include, but are not limited to (a short sketch follows the list):

  • Loading data into various data structures
  • Renaming elements of the data
  • Converting units
  • Masking data
  • Replacing Not a Number (NaN) or NoData values
  • Creating satellite image index layers
  • Applying kernels/filters to satellite imagery

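As a small sketch of two of these tasks, the snippet below replaces NoData values with NaN and creates an NDVI index layer (the arrays and the sentinel value are illustrative):

# Data preprocessing: replace NoData values and create an NDVI index layer
import numpy as np

# Small illustrative arrays standing in for red and near-infrared bands
red = np.array([[0.2, 0.3], [-9999.0, 0.25]])
nir = np.array([[0.6, 0.5], [0.55, -9999.0]])

# Replace the NoData sentinel value (-9999 here) with NaN
red[red == -9999.0] = np.nan
nir[nir == -9999.0] = np.nan

# NDVI = (NIR - Red) / (NIR + Red)
ndvi = (nir - red) / (nir + red)
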
Data Processing: Code to take preprocessed data as input and produce new data outputs, via computational routines. Examples include, but are not limited to (see the sketch after this list):

  • Land cover classification
  • Change detection
  • Spatial analyses

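As one hedged illustration of the processing idea (not the classification method from the repository), the snippet below assigns simple land cover classes by thresholding NDVI values:

# Data processing: classify NDVI values into simple land cover classes
import numpy as np

ndvi = np.array([[0.1, 0.45], [0.7, 0.2]])  # illustrative NDVI values

# Hypothetical class scheme: 1 = bare, 2 = sparse, 3 = dense vegetation
classified = np.digitize(ndvi, bins=[0.3, 0.6]) + 1
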
Data Postprocessing: Code to take processed data and create a final data output, via cleanup or corrective routines. Examples include, but are not limited to (a sketch follows the list):

  • Morphological routines for land cover classification
  • Applying kernels/filters
  • Assigning attributes

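For instance, a postprocessing section might smooth a classified raster to remove isolated misclassified pixels (the filter choice and the array below are illustrative):

# Data postprocessing: smooth a classified raster with a focal filter
import numpy as np
from scipy import ndimage

# Classified raster with one isolated (likely misclassified) pixel
classified = np.array([[1, 1, 1],
                       [1, 3, 1],
                       [1, 1, 1]])

# A 3x3 median filter removes the isolated pixel
cleaned = ndimage.median_filter(classified, size=3)
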
Data Visualization: Code to create graphics, figures, plots, and/or charts from the preprocessed, processed, or postprocessed data.
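
A minimal visualization sketch, assuming an NDVI array from the earlier stages (random data stands in here):

# Data visualization: plot an NDVI layer with a colorbar
import matplotlib.pyplot as plt
import numpy as np

ndvi = np.random.uniform(-1, 1, size=(50, 50))  # stand-in NDVI values

fig, ax = plt.subplots(figsize=(8, 6))
image = ax.imshow(ndvi, cmap="RdYlGn", vmin=-1, vmax=1)
ax.set_title("Normalized Difference Vegetation Index (NDVI)")
fig.colorbar(image, ax=ax, label="NDVI")
plt.show()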

Data Export: Code to save preprocessed, processed, and/or postprocessed data as well as data visualization products to tangible files.
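
A short export sketch might save both a data product and a figure (the file names are illustrative):

# Data export: save data and visualization products to files
import matplotlib.pyplot as plt
import numpy as np

ndvi = np.random.uniform(-1, 1, size=(50, 50))  # stand-in data

# Save the array to disk (file names are illustrative)
np.save("ndvi.npy", ndvi)

# Save a visualization to an image file
fig, ax = plt.subplots()
ax.imshow(ndvi, cmap="RdYlGn")
fig.savefig("ndvi-plot.png", dpi=300, bbox_inches="tight")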

The snippet below shows how I structure base Python scripts.
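
The header comments below are a stand-in for the exact formatting in the repository; the stage order matches the list above.

# ENVIRONMENT SETUP
# Import packages and modules, set options, set the working directory

# USER-DEFINED VARIABLES
# Set file paths and other global variables

# DATA ACQUISITION
# Download data from an API or other hosting site

# DATA PREPROCESSING
# Prepare the raw data for processing

# DATA PROCESSING
# Produce new data outputs from the preprocessed data

# DATA POSTPROCESSING
# Clean up and correct the processed data

# DATA VISUALIZATION
# Create graphics, figures, plots, and charts

# DATA EXPORT
# Save data and visualization products to files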

Note the example-workflow.py file in the GitHub repository also contains a section header for Script Completion. This code provides an unambiguous indication that the script has completed.
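
A minimal version of that indicator could be a final print statement (the exact message is illustrative):

# SCRIPT COMPLETION
print("Script completed.")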

Workflow Documentation

Documenting a script provides assistance to anyone who uses your code (including yourself). The snippet below outlines the docstring format I use at the beginning of scripts.
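
The sketch below follows that outline with placeholder content (every field value here is hypothetical):

"""Create NDVI layers from example satellite imagery (placeholder purpose).

Data: https://example.com/dataset (placeholder link)

Required inputs:
    data/raw/landsat-scene.tif (illustrative file name)

User-defined variables to set:
    INPUT_RASTER_PATH, OUTPUT_DIRECTORY

Outputs created:
    data/processed/ndvi.tif (illustrative file name)

Author: Your Name (your.email@example.com)
License: MIT (illustrative)
"""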

The docstring contains information required to understand and run the script. This includes, but is not limited to:

  • Purpose of the script
  • Data references or links
  • Required data file inputs
  • User-defined variables that need to be set
  • Output data files created
  • Other information or settings relevant to the script

I add my name and email in case anyone wants to reach out with questions. I also include the script license information.

Workflow Philosophies

There is no definitive way to structure your code. It depends on the nature of the workflow as well as personal preference. Organizational or group guidelines can also play a factor, if you are creating workflows as part of a team.

One method to structure your workflows is the example I provided above. This method includes all stages within the same script. Another method to organize your code is to create separate scripts for various parts of the project. The snippet below provides an example of how this could look.
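
The file names below are a hypothetical stand-in for the example in the repository:

project/
    01-acquire-data.py
    02-preprocess-data.py
    03-process-data.py
    04-postprocess-data.py
    05-visualize-data.py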

Each script would likely contain the Environment Setup, User-Defined Variables, and Data Export stages, as well as the stage relevant to the task name. One main difference for this structure is that each script would create an output for the next sequential script to use as input.

I have worked on projects using both methods. The nature of the data and the required tasks influence the workflow structure.

Wrap Up

Hopefully this helps you create organized workflows. Two additional resources complement this article: the GitHub repository and the video tutorial. The repository provides all code referenced in the article, and the tutorial walks through the repository and workflow structure.

Feedback is always welcome and much appreciated. You can contact me via Twitter or email.

See you in the next article.
