Designing a Collaborative Computing System between Data Science and Engineering for Indigo Carbon

Damien Sulla-Menashe
Published in IndigoAg.Digital · 12 min read · Feb 3, 2021

IndigoAg has embarked on an ambitious project that seeks to quantify the amount of carbon stored in agricultural soils and then turn those credits into financial capital. We will pay farmers per acre to dedicate their land to sequestering carbon if we can prove they have increased the amount of carbon in their soils. Once a farmer joins Indigo Carbon, they subscribe to our agronomists’ packages of recommendations, which include using regenerative agricultural practices and fewer chemical inputs. Behind the scenes, we use a whole suite of models tuned on satellite remote sensing and weather data, along with field campaigns to sample soils, verify our models, and assess the amount of carbon that has been sequestered over a period of many years. Once the full set of evidence, including user-provided input and our model outputs, has been compiled, we can issue carbon credits that can be sold like any other financial commodity, because we have proven that the land stored more carbon over this period than it would have otherwise. We believe these changes will be profitable for the farmer in the long term and that they will stay in the program once they start seeing more productive yields and more money in their pockets.

At the core of the IndigoAg Carbon data collection campaign is a joint effort between small teams of Data Scientists and Engineers. To do the carbon accounting at scale, we need to validate that farmers did what they promised to do on all their lands without visiting every acre of their properties. First, we have to actually measure and model the soil carbon; this is the work of a partnership between the Soil Carbon Data Science and Engineering groups. Second, we need to automatically detect the suite of regenerative agricultural practices applied to the farm and monitor the health and productivity of the cash crop during the growing season. That work is done by the analogous Remote Sensing Data Science and Engineering partnership and is the main subject of this post. We have developed a framework to collaboratively build and deploy algorithms at scale, which in this case means the set of all crop fields in the United States over the past 10 years.

The need for a collaborative computing system arose in February 2020, when we were asked to analyze additional benefits of regenerative agricultural practices beyond gaining more soil carbon in those fields. We needed to understand how events like flooding impacted fields planted with cover crops versus fields that were not. We could already create time series from over six years of satellite and weather data to serve as inputs to an algorithm that would detect the planting of cover crops, for example, but we needed to run such an algorithm at scale across all the crop fields in over 1,200 US counties. To enable this work, we developed a new framework in which the Data Science group could implement the algorithms they had been experimenting with, while ensuring that the inputs and outputs of the different models would be consistent.

As this system matured, we became able to apply an arbitrary algorithm to an arbitrary set of farm field boundaries, identify management practices for a customer who has already committed to Indigo Carbon, and quickly produce the same quality and diversity of insights. These outputs will be crucial components of our validation and verification protocols, helping us gap-fill and quality-control user-provided inputs, provide visualizations of the data through web interfaces, and enable constant incremental improvements to each algorithm. In the near future, we will see new input datasets and models come online as we build on these foundations.

History of the project

The idea of the regen-carbon-pipeline and its sister package shared-carbon was born from an ambitious project that started in February 2020. The GeoInnovation Department of IndigoAg wanted to run an experiment to showcase its analytical abilities and to relate the adoption of regenerative agricultural practices to the benefits that farmers achieved from that decision, such as improved drought or pest resiliency. The end result was the 2020 Regenerative Agriculture Resiliency Study, which analyzed all the farms in over 1,200 US counties for the six-year period from 2014 to 2020 by detecting management practices such as tillage and the planting of cover crops while also looking at crop health, flooding events, and extreme weather.

Some large-scale results from IndigoAg’s Regenerative Agriculture Resiliency Study, including rates of no-till practices and cover crop planting across many of the most heavily farmed counties in the United States.

Early in the planning process for the Regen-Resiliency study, we proposed to do all the analysis on zonal summaries of auto-delineated fields (drawn by a computer instead of a human) rather than on individual raster pixels. This greatly reduced the data volume and the scale at which we needed to operate (i.e., from trillions to millions of data points). We had to design a framework that could run a variety of model formats, including classifications and simple statistics, but that could also connect to our data assets on AWS S3 or in database tables. We also needed the ability to deploy the code in Docker containers to run through AWS Batch or Airflow. We decided to create a modular set of Python packages deployed from a repository called shared-carbon and a second regen-carbon-pipeline repository that is Docker-ized and has connections to different databases. The regen-carbon-pipeline portion defines the inputs and outputs for each algorithm and then applies the algorithm to those inputs using the core model functions imported from the shared-carbon Python package.
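As a toy illustration of that reduction (the column names and values are made up, not our real schema), a zonal summary collapses many raster pixels into a single tabular row per field:

```python
# Minimal sketch of the zonal-summary idea: collapse pixel-level values to one
# row per field per date, trading raster volume for a compact table.
import numpy as np
import pandas as pd

# Pretend these are pixels already tagged with the auto-delineated field they fall in.
pixels = pd.DataFrame({
    "field_id": np.repeat(["f001", "f002"], 4),
    "date": ["2020-06-01"] * 8,
    "ndvi": [0.61, 0.58, 0.63, 0.60, 0.42, 0.40, 0.45, 0.41],
})

# One summary row per field per date -- the unit the pipeline algorithms consume.
zonal = pixels.groupby(["field_id", "date"], as_index=False)["ndvi"].mean()
print(zonal)
```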

The core concept of the Regen-Carbon Pipeline is the definition of an algorithm as a statistical or machine learning model that can be applied to a time series of data values to produce one output per growing season. The format and spatial extent of the input/output data do not matter in this context because the algorithm is applied to the most granular spatial unit provided, whether that is a raster pixel or summarized (tabular) data at the field or county level. Some algorithms produce simple statistical summaries of time series data, while more complex algorithms have a training or fitting step and may call machine-learning methods to produce a classification. The algorithms can also be inter-dependent: for example, the outputs of an algorithm that detects the phases of a crop during the growing season can be fed into another algorithm that predicts the yield of that crop for that growing season.
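To make that contract concrete, here is a minimal, hypothetical sketch of what such an algorithm interface might look like; the class names, fields, and the "yield proxy" logic are all illustrative, not the actual shared-carbon code:

```python
from typing import Optional, Protocol
import numpy as np

class SeasonAlgorithm(Protocol):
    """One time series in, one set of values out per growing season."""
    version: str
    def apply(self, dates: np.ndarray, values: np.ndarray) -> dict:
        ...

class PeakNDVI:
    """A simple statistical summary: peak greenness for the season."""
    version = "0.1.0"
    def apply(self, dates: np.ndarray, values: np.ndarray) -> dict:
        return {"peak_ndvi": float(np.nanmax(values))}

class YieldFromPhenology:
    """A dependent algorithm: consumes another algorithm's phenology output."""
    version = "0.1.0"
    def apply(self, dates: np.ndarray, values: np.ndarray,
              phenology: Optional[dict] = None) -> dict:
        season_days = phenology["season_length_days"] if phenology else len(dates)
        return {"yield_proxy": float(np.nansum(values)) / max(season_days, 1)}
```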

Flexible model structure

Although working with two repositories in tandem adds some complications, the flexibility of the structure and the built-in dependencies provide a lot of extra validation and enable more standardization and templating. We made several other design decisions for the smaller shared-carbon package. Each algorithm should be self-contained (modular), with an explicit version number that is updated manually. Each algorithm generally reads inputs as numpy arrays. Some of the repeated patterns within the modules include objects named Inputs, Features, Model, and Params. Where possible we pass numpy arrays instead of using pandas. Outputs from the model are also provided as numpy arrays where possible, or otherwise as pandas.DataFrame objects. Model validation is assumed to be part of the module, with some sort of spatially explicit measure of uncertainty provided as output.
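A hedged sketch of how a shared-carbon-style module could follow those patterns (the object names match the conventions above, but the feature logic, threshold, and version number are invented for illustration):

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

VERSION = "1.2.0"  # updated manually whenever the algorithm changes

@dataclass
class Params:
    ndvi_threshold: float = 0.3

@dataclass
class Inputs:
    dates: np.ndarray   # observation dates for one field-season
    ndvi: np.ndarray    # matching NDVI values

@dataclass
class Features:
    offseason_mean_ndvi: float

def build_features(inputs: Inputs) -> Features:
    # A real module would do the actual feature engineering here.
    return Features(offseason_mean_ndvi=float(np.nanmean(inputs.ndvi)))

class Model:
    def __init__(self, params: Optional[Params] = None):
        self.params = params or Params()

    def apply(self, features: Features) -> dict:
        detected = features.offseason_mean_ndvi > self.params.ndvi_threshold
        # A real module would also return a spatially explicit uncertainty measure.
        return {"cover_crop": bool(detected), "confidence": 0.5, "version": VERSION}
```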

The regen-carbon-pipeline repository is built around the concept that each algorithm can be accessed through generic train/apply methods, similar to scikit-learn’s fit/predict pattern but influenced by my background in R. Each algorithm’s name and version are read directly from the appropriate shared-carbon module. The algorithm can be initialized with custom parameters, and the inputs are handled by Dataset classes that can retrieve data from CSVs on S3, database tables, or raster files. A standard apply method will look up the appropriate input Dataset and then loop through each item (observation) in that Dataset, feeding these into the shared-carbon Model class to produce an output. Generally, these inputs/outputs are handled as pandas.DataFrame objects, but the approach can also be adapted to a stack of raster files. The published Docker image packages a version of the regen-carbon-pipeline that contains specific versions of each algorithm, so that work can be distributed across many workers asynchronously while the final outputs are all versioned correctly. In the future, we may deploy each algorithm in its own Docker container to reduce versioning challenges.
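A rough sketch of the pipeline side, reusing the hypothetical Inputs/Model/build_features objects from the previous sketch; the CsvDataset class, column names, and S3 path are assumptions, not the real regen-carbon-pipeline code:

```python
import pandas as pd

class CsvDataset:
    """Yields (field_id, per-field time series) pairs from a CSV, e.g. one on S3."""
    def __init__(self, uri: str):
        self.uri = uri
    def __iter__(self):
        df = pd.read_csv(self.uri)  # expected columns: field_id, date, ndvi
        for field_id, group in df.groupby("field_id"):
            yield field_id, group

def apply_algorithm(dataset, model, build_features, Inputs) -> pd.DataFrame:
    """Generic apply loop: feed each observation into the shared-carbon Model."""
    rows = []
    for field_id, ts in dataset:
        inputs = Inputs(dates=ts["date"].to_numpy(), ndvi=ts["ndvi"].to_numpy())
        rows.append({"field_id": field_id, **model.apply(build_features(inputs))})
    return pd.DataFrame(rows)

# Usage (with the hypothetical objects from the previous sketch):
# outputs = apply_algorithm(CsvDataset("s3://bucket/zonal_ndvi.csv"),
#                           Model(), build_features, Inputs)
```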

The infrastructure for running the regen-carbon-pipeline (RCP) algorithms within our production system.

Our production system (shown above) starts with real field boundaries from our customers, which are saved in a database table. We generate time series data for these boundaries using our zonal summary machinery, and those series are saved in another set of database tables. The RCP algorithms read the time series data for those boundaries directly as their major inputs. Other inputs include ancillary datasets such as crop reports provided by the US Department of Agriculture (USDA) or training/uncertainty datasets at different spatial scales. The algorithm outputs can be stored in a database table or provided as tabular data on S3 in other formats. Either way, they are accessible to any internal Indigo user through consistent, shared boundary IDs.
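For flavor, a hedged sketch of what reading zonal time series for a boundary from a database table might look like; the connection string, table, and column names are placeholders, not our real schema:

```python
import pandas as pd
import sqlalchemy as sa

# Placeholder connection string and table/column names.
engine = sa.create_engine("postgresql://user:password@host:5432/indigo")

query = sa.text(
    "SELECT boundary_id, date, ndvi "
    "FROM zonal_timeseries "
    "WHERE boundary_id = :bid"
)

with engine.connect() as conn:
    timeseries = pd.read_sql(query, conn, params={"bid": "b-0001"})
```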

Collaboration with Data Science

So what does all this architecture talk have to do with collaboration? The long-term plan for this system is to serve insights based on these algorithms into the Indigo Carbon ecosystem. These data will be incorporated into the body of evidence we will provide to the carbon auditors to verify that we are accurately measuring the carbon for which we are issuing credits. Data scientists working to develop, improve, and validate the specific models are a key component of this production-level system. Each algorithm has its own characteristics and uncertainties, and some are much further developed or validated than others. While we continue to build out the infrastructure to automatically generate the results, the process to put a new algorithm into production is now well-rehearsed and has been extremely successful.

If I am a data scientist and I want to contribute to this system, it will take two Pull Requests in GitHub: one in shared-carbon to implement the model structure and a second in regen-carbon-pipeline that registers the algorithm and links it to a set of inputs/outputs. Each of these two steps has well-commented code templates, including examples of tests that can be easily adapted to the new model. We recommend first scheduling a meeting with one of our engineers to talk through any bits that might be tough to implement.

For a new algorithm, we require that the initial shared-carbon submission have a test of the algorithm objects, a small sample input CSV file with real data, and an example output file that can be kept on S3. In the regen-carbon-pipeline work to add a new algorithm, we require that there be at least an apply method that works from a pandas DataFrame. The input DataFrame can be created from either a database table or from a CSV file on S3. The algorithm can optionally have a separate training step, a method to evaluate, or a method to apply the algorithm to a raster. Ideally there will be a test for the full implementation of the algorithm on real input data.
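As a rough template, here is a hedged sketch of what such a test could look like, reusing the hypothetical cover-crop module from the earlier sketch; the shared_carbon.cover_crop import, file paths, and column names are all illustrative:

```python
# Hypothetical test for a new shared-carbon algorithm: load the small sample CSV
# checked in with the PR, run the model, and compare against the expected output.
import pandas as pd
from shared_carbon.cover_crop import Inputs, Model, Params, build_features

def test_cover_crop_on_sample_data():
    sample = pd.read_csv("tests/data/cover_crop_sample_input.csv")
    expected = pd.read_csv("tests/data/cover_crop_expected_output.csv")

    model = Model(Params())
    results = []
    for field_id, ts in sample.groupby("field_id"):
        inputs = Inputs(dates=ts["date"].to_numpy(), ndvi=ts["ndvi"].to_numpy())
        results.append({"field_id": field_id, **model.apply(build_features(inputs))})

    got = pd.DataFrame(results).sort_values("field_id").reset_index(drop=True)
    assert list(got["cover_crop"]) == list(expected["cover_crop"])
```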

For modifications to existing models, we keep track of the model version inside shared-carbon; these versions are read in by regen-carbon-pipeline and used to version all the output records. Any change to a particular algorithm requires the developer to manually update the version number. A minor fix gets a patch version bump. A change that updates the output schema or the set of inputs gets a minor version bump. A switch to an entirely different modeling approach is a breaking change and warrants a new major version.
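Concretely, the convention looks something like the following; the version numbers and the example are illustrative only:

```python
# Illustrative versioning convention inside a shared-carbon module; each module
# carries its own manually updated VERSION constant.
VERSION = "2.3.1"
# 2.3.0 -> 2.3.1 : small bug fix, same inputs and outputs        (patch bump)
# 2.3.1 -> 2.4.0 : new input dataset or changed output schema    (minor bump)
# 2.4.0 -> 3.0.0 : entirely different modeling approach          (major bump)

# regen-carbon-pipeline reads this constant and stamps it on every output record,
# e.g. results_df["algorithm_version"] = VERSION
```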

Where do we go from here?

The great part about this project is that we get to work with real scientists inventing cool new methods, and at the end of the day those same algorithms get put into a production system that will be used to validate carbon credit generation. For an algorithm to be in production means that it is hooked up to the infrastructure that generates time series data for boundaries from our customers. As new field boundaries come into our system, we automatically generate time series data for those boundaries and then run the set of RCP algorithms directly on those data. As algorithms are refined and new versions come out, we can quickly regenerate the results for all the fields, and because results are mapped to unique output locations we can always revert or reproduce them.

Below we show examples of Directed Acyclic Graphs (DAGs) illustrating how we see this pipeline evolving. Currently, each algorithm has a completely self-contained apply step and may also have a train step to build the model. The inputs and outputs of the different algorithms do not interact in any way, so there is quite a lot of redundant processing to get the correct features into each model. We envision a more complex DAG in which we break the pre-processing step into its own RCP module and allow some algorithms to read the outputs of other algorithms as inputs. An example is shown below: we pre-process all the time series data that the different algorithms need and put these outputs on S3 or in a database table. The phenology algorithm then runs and breaks up the time series into crop growing seasons; these outputs, along with some of the other pre-processed data, are used to train and classify in-season crop type. Then, based on the predicted crop type, we can train and classify a tillage model.
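As a rough illustration of that future graph, here is a minimal Airflow-style sketch; the DAG name, task IDs, and no-op callables are placeholders, and the real wiring, operators, and scheduling are assumptions:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def _noop(**_):
    pass  # stand-in for the real pre-processing and model steps

with DAG("rcp_future_dag", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    preprocess = PythonOperator(task_id="preprocess_timeseries", python_callable=_noop)
    phenology = PythonOperator(task_id="phenology", python_callable=_noop)
    crop_type = PythonOperator(task_id="crop_type", python_callable=_noop)
    tillage = PythonOperator(task_id="tillage", python_callable=_noop)

    # Pre-processed time series feed every algorithm; crop type also consumes the
    # phenology output, and tillage consumes the predicted crop type.
    preprocess >> [phenology, crop_type, tillage]
    phenology >> crop_type >> tillage
```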

Current and future approaches to linking different algorithms together into a single processing graph.

The end user of our work needs to be able to understand at a high level what each model is doing internally and also be informed about its accuracy and uncertainty. These sorts of tools and documentation will be a big push in the coming year. Many of the models do not currently measure uncertainty in their predictions, and we need to think about how to make that information accessible and understandable to a more general audience. One extra complication is that there are multiple scales of uncertainty, so we imagine a system with more regional analyses of model performance along with a per-boundary/year measurement of quality based on the quality of the inputs.

A large project in the works is to improve the traceability and reproducibility of our data pipelines. When we run a Crop Type classification model, we want the ability to rerun that same version of the model at a later date and get the same results. This starts with downloading satellite and weather datasets, continues through archiving and retrieving carefully versioned data, and ends with accounting for all the inputs to that model run. The goal of this data provenance effort is to make sure we can stand up to an audit of our carbon accounting system. Once we have successfully implemented this system, we can also build more tools to observe the process ourselves.
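One simple way to think about that provenance step is a per-run manifest; this is a hedged sketch under assumed field names and a file-hashing scheme, not our actual implementation:

```python
# Record exactly which inputs and code version produced a model run so the run
# can be reproduced (or audited) later.
import hashlib
import json
from datetime import datetime, timezone

def input_fingerprint(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def write_run_manifest(algorithm: str, version: str, input_paths: list, out_path: str):
    manifest = {
        "algorithm": algorithm,
        "algorithm_version": version,
        "run_at": datetime.now(timezone.utc).isoformat(),
        "inputs": {p: input_fingerprint(p) for p in input_paths},
    }
    with open(out_path, "w") as f:
        json.dump(manifest, f, indent=2)

# Usage: write_run_manifest("crop_type", "2.3.1", ["inputs/zonal_ndvi.csv"], "run_manifest.json")
```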

Now that the framework has matured a bit, we will see a lot of new algorithms come online even as the existing ones are further refined. For example, in the next several months we expect to have new models for planting and harvest dates, soil temperature, and crop yields. We also expect big updates to the crop phenology, cover crop, crop type, and tillage models. Many of the algorithm performance improvements will be made possible by the introduction of new datasets that might give us new scientific insights. In the near future, we hope to make several new datasets available to these models, such as data from PRISM (weather), Sentinel-1 (Synthetic Aperture Radar), and POLARIS (soil characteristics at different depths). These new datasets will enable the development of better models and fusion approaches.

Another ongoing path of active research is to start unifying the streams of optical remote sensing data through advanced data fusion. Data fusion, especially across optical satellite data sources, is very appealing because it would greatly increase the density of the time record and the confidence in the outputs. We mentioned before the idea that the pre-processing of our data inputs could become a stand-alone module. An example of pre-processing is to fit a spline through the data to gap-fill missing observations and smooth out anomalous values caused by clouds and shadows. If the data are first fused through time so that they are calibrated to the same reference, then the splined output will be more accurate. Following these steps, the data would be truly analysis-ready and could be plugged into any of our other modules. When the system has matured, our task in the partnership as engineers will be mostly janitorial: we clean the stage so that the real actors (the scientists) can get to work!
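A minimal sketch of that spline pre-processing step, assuming a sparse NDVI series with cloud-affected gaps; the observation dates, values, and smoothing factor are made up:

```python
# Fit a smoothing spline to irregular, gappy observations and evaluate it on a
# regular daily grid to produce a gap-filled, smoothed series.
import numpy as np
from scipy.interpolate import UnivariateSpline

doy = np.array([5, 21, 37, 69, 85, 117, 133, 149])       # days with clear observations
ndvi = np.array([0.21, 0.24, 0.30, 0.55, 0.62, 0.71, 0.68, 0.60])

spline = UnivariateSpline(doy, ndvi, k=3, s=0.01)         # s controls smoothing strength
daily = np.arange(1, 151)
ndvi_smoothed = spline(daily)                             # gap-filled, smoothed series
```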
