Machine Learning for Predicting Deforestation: Research and Planning

David Nagy
Published in Project Canopy
Oct 15, 2020


(co-authored by Zhenya Warshavsky)

Project Canopy is a new nonprofit non-governmental organization (NGO) whose mission is:

…to gather, transform and communicate the data organizations need to end defaunation, deforestation and associated carbon emissions in the Congo Basin rainforest. Our data platform, analytics and newswire provide our audiences the building blocks they need to make the most impactful programmatic and policy decisions.

The task ahead of us is no small matter. To build even a prototype of a data platform of this size and scope requires:

  • Close collaboration with the co-founders, who have deep domain knowledge of conservation practices in the Congo Basin rainforest
  • Organizing vast amounts of heterogeneous data that would be useful for the various actors engaging in conservation efforts in and for the region
  • A web development team to create the site framework and a CMS for managing the content
  • A data science team to aggregate and analyze existing data in a meaningful way and provide further insights

We are Zhenya Warshavsky and David Nagy, and we are that data science team. We were selected right after graduating from the Lambda School’s data science track because we believe in their mission. Our first goal was to develop a proof of concept or minimum viable product (MVP) demonstrating Project Canopy’s value proposition to potential donors. In fact, the MVP had two aims: it would (a) contribute to conservation efforts in and of itself, while (b) proving that we could accomplish much more with additional time, manpower, and funding. To explain how we decided on the scope of the MVP, we will first briefly summarize the current state of the Congo Basin rainforest.

Conservation within the Congo Basin

Africa’s Congo Basin rainforest, the second-largest rainforest in the world, occupies 2 million square kilometers of humid tropical forest, with another 1 million square kilometers of secondary and savanna forest. But it is increasingly under threat. According to the Center for International Forestry Research, the Congo Basin rainforest lost about 20% of its tree cover between 2001 and 2017, causing 1.7 Gt of CO₂ emissions. This deforestation, combined with other pressures such as geopolitical turmoil and overpopulation, also threatens Africa’s greatest concentration of mammals, primates, birds, amphibians, and fish. Some estimates suggest the rainforest will largely disappear by 2100:

Recent United Nations’ population projections for DRC estimate 197 million people by 2050 and 379 million by 2100, when DRC is expected to be the fifth most populous country in the world. Under the assumption that population growth continues to correlate with the increase in annual primary forest loss area, all of DRC’s primary forests will have been cleared by 2100.

The threats to the Congo Basin rainforest are complex. Small-scale threats such as slash-and-burn agriculture and charcoal manufacturing were once the chief causes of deforestation, but they are slowly yielding to large-scale human activities, such as industrial-scale logging and agriculture, which already operate in over 25% of the forest’s area. However, these threats do not exist in isolation. Project Canopy seeks to promote a systemic view of the rainforest: the cumulative effects of all these human actions not only threaten the multitude of its unique fauna and flora, but also the rainforest’s value as a major global carbon sink.

Evidence-based policymaking is essential to arresting these potentially catastrophic trends. However, the data needed to craft effective policies is scattered across multiple sources, and making sense of it takes a great deal of time, labor, and technical knowledge. Moreover, these resources are rarely available to the relevant actors: development bodies, regional governments, campaigners and local civil society organizations (CSOs). As a result, those organizations are making funding and programmatic decisions without critical, actionable information.

Detecting the Drivers of Deforestation

After exploring several possible project ideas, we settled on developing a machine learning model that could detect commercial logging roads using satellite imagery of the Congo Basin. Logging roads are temporary, narrow tracks resembling a network of roads, whose purpose is to provide access to selected, high-value timber species. (This is in contrast to the Amazon, where clear-cutting is the preferred method for large-scale land-use conversion of primary forest, usually for soy and cattle farming.)

A logging road network, where the eastern branch shows the development of slash-and-burn agriculture. Source: Sentinel satellite product

By selectively logging only specific tree species using a pattern of narrow tracks, forest impact should theoretically be minimized, as the canopy should naturally regenerate once the concession is exploited for its relatively small quantity of timber. However, in a non-trivial number of cases (for example, the road shown above), these roads are repurposed by local populations for slash-and-burn agriculture, which contributes to permanent forest loss. Even though small-scale slash-and-burn is the key driver of the Congo Basin’s tree cover loss, the impact of industrial selective logging (ISL) as a precursor to slash-and-burn agriculture is often discounted by conservationists. Therefore, not only can commercial logging itself contribute to deforestation, but the interplay between commercial logging and other drivers of deforestation must also be understood.

Of particular note is determining whether a particular logging road falls within a logging concession. Logging concessions are arrangements in which a country grants harvesting or management rights over publicly owned forests to a private entity (generally a logging company) for a set period of time in exchange for funds. A 2019 study stated that, outside of concession boundaries, only 12% of logging roads were abandoned after commercial logging activity ended. In contrast, within concessions, the percentage goes up to 44%. Beyond the ostensible illegality of timber harvesting outside of official concessions, this implies that improperly decommissioned logging roads directly lead to deforestation, especially if those roads are outside of concessions. (Decommissioning usually means that the responsible company blocks the road’s access points.) Therefore, a secondary goal of the project is to detect the proportion of logging roads that exist outside of these concession areas. A successful result would bolster the argument that current forest management plans undertaken by municipal governments do not work as intended.
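To make that secondary goal concrete, here is a minimal sketch of how the proportion could be computed once road locations and concession boundaries are known. It is an illustration under simplifying assumptions, not a description of our pipeline: each road is reduced to a single point, concessions are simple polygons, and the coordinates, `point_in_polygon`, and `share_outside` are all hypothetical. A real analysis would more likely use line geometries and a GIS library.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: flip 'inside' each time a rightward ray crosses an edge."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def share_outside(roads: List[Point], concessions: List[List[Point]]) -> float:
    """Fraction of road points that fall inside none of the concession polygons."""
    outside = sum(
        1 for road in roads
        if not any(point_in_polygon(road, c) for c in concessions)
    )
    return outside / len(roads)


# Hypothetical data: one square concession and four detected road locations.
concessions = [[(18.0, 1.0), (19.0, 1.0), (19.0, 2.0), (18.0, 2.0)]]
roads = [(18.5, 1.5), (18.2, 1.9), (19.5, 1.5), (17.0, 0.5)]

print(f"{share_outside(roads, concessions):.0%} of road points lie outside concessions")
```

With these made-up inputs, two of the four road points fall outside the concession, so the script reports 50%.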

It would clearly be valuable to have a comprehensive dataset containing the locations of logging roads in the Congo Basin (inside and outside concession areas), so that we can track their changes over time and how those changes correspond to the rate of deforestation. But no such dataset currently exists. According to Fritz Kleinschroth et al.:

Logging roads and public roads are used and managed in different ways, and the associated impacts on surrounding forests vary substantially. Whereas some regions of the world have access to reliable road maps, both digitally and on paper, a complete map of all roads in Central Africa is still not available. . . . [N]o road map of the Congo Basin is available that differentiates these various types of roads.

Furthermore, due to the sheer size of the forest, labelling each logging road by hand would take a massive amount of time and labor. Hence, we decided to create a machine learning model that could predict, with high accuracy, whether or not a given satellite image contains a logging road. Assuming reliable and reproducible results, we could then expand our model from simply detecting logging roads to also recognizing additional drivers of deforestation, including slash-and-burn agriculture and palm oil plantations.

Satellite Imagery in GIS

In order to accomplish our goal, we needed to process and conduct geographic information system (GIS) analysis on satellite images of the Congo Basin, so that a mathematical model could interpret those images. The basic building block of GIS satellite data is the raster, which is rendered on a map as a grid of pixels. Each pixel represents an area on the Earth’s surface. More generally, all digital images are rasters: two-dimensional arrays of pixel values whose storage size depends on the number of pixels and on the range of values available per pixel.

A geospatial raster is only different from a digital photo in that it is accompanied by spatial information that connects the data to a particular location. This includes the raster’s extent and cell size, the number of rows and columns, and its coordinate reference system (CRS).

In this example of a raster, each pixel covers a 1m x 1m area.
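The relationship between a raster’s pixel grid and its spatial metadata can be sketched in a few lines. This is only an illustration: the `GeoRaster` class, the coordinates, and the CRS code are hypothetical values chosen for the example, and in practice a GIS library would read this metadata from the image file.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GeoRaster:
    """A 2D grid of pixel values plus the spatial metadata that locates it."""
    values: List[List[int]]  # rows of pixel values; row 0 is the northern edge
    x_min: float             # x-coordinate of the raster's western edge
    y_max: float             # y-coordinate of the raster's northern edge
    cell_size: float         # width and height of one pixel, in CRS units
    crs: str                 # coordinate reference system identifier

    @property
    def shape(self) -> Tuple[int, int]:
        """Number of rows and columns."""
        return len(self.values), len(self.values[0])

    @property
    def extent(self) -> Tuple[float, float, float, float]:
        """Bounding box as (x_min, y_min, x_max, y_max)."""
        rows, cols = self.shape
        return (
            self.x_min,
            self.y_max - rows * self.cell_size,
            self.x_min + cols * self.cell_size,
            self.y_max,
        )

    def pixel_center(self, row: int, col: int) -> Tuple[float, float]:
        """Map a (row, col) index to the coordinates of that pixel's center."""
        x = self.x_min + (col + 0.5) * self.cell_size
        y = self.y_max - (row + 0.5) * self.cell_size
        return x, y


# Hypothetical 3 x 4 raster with 1 m x 1 m pixels in a UTM coordinate system.
raster = GeoRaster(
    values=[[12, 15, 14, 13], [11, 90, 88, 12], [10, 12, 91, 13]],
    x_min=500000.0,
    y_max=100000.0,
    cell_size=1.0,
    crs="EPSG:32633",
)

print(raster.shape)
print(raster.extent)
print(raster.pixel_center(1, 1))
```

Without the last four fields, `values` would just be an image; with them, every pixel corresponds to a specific patch of ground.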

To learn more about GIS raster processing, check out the introductory material on raster data developed by Earth Lab.

Raster Data Selection Process

The first step in our machine learning classification project was gathering quality data to feed to the model so it can “know” what a logging road looks like, i.e., gathering training data. One of the most prominent and reliable sources of data on the subject of global deforestation is the Global Forest Watch map and data portal, which provides data, technology, and tools for conservation efforts. Powered by Google Earth Engine and Hansen’s Global Forest Change Dataset, the portal enables conservationists to view a forest loss overlay over satellite imagery from Landsat and Sentinel, the two leading open-data satellite imagery providers.

We first assessed the feasibility of using Hansen’s dataset, because its raster data is available via Hansen’s Google Earth Engine Dataset Portal, organized with direct download links on a world map. By clicking on a specific granule, you can acquire a cloud-free True Color image (RGB) processed specifically for canopy cover by UMD’s proprietary algorithm. However, Hansen’s dataset proved difficult to use for our purposes: because the imagery is processed to emphasize canopy cover, other land features (including logging roads) were often undetectable. It became clear that we had to download Landsat 8 imagery directly from a satellite imagery provider. This meant additional steps in our pipeline, including:

  1. Sorting for cloud cover in our query
  2. Identifying an approach to masking cloud cover if a cloud-less raster was unavailable for the time period we were querying
  3. Identifying haziness and other layers of noise found in most raw spectral rasters
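The first two steps can be sketched as a simple selection rule over product metadata. This is a hedged illustration only: the records, the field names (`id`, `date`, `cloud_pct`), the 10% threshold, and the `pick_clearest` helper are all hypothetical, and real imagery providers expose cloud cover through their own query parameters.

```python
from typing import Dict, List, Optional


def pick_clearest(products: List[Dict], max_cloud_pct: float = 10.0) -> Optional[Dict]:
    """Sort products by cloud cover and return the clearest one.

    Returns None when even the clearest product exceeds the threshold,
    which signals that cloud masking is needed for this time period.
    """
    ranked = sorted(products, key=lambda p: p["cloud_pct"])
    if ranked and ranked[0]["cloud_pct"] <= max_cloud_pct:
        return ranked[0]
    return None


# Hypothetical query results for one location and time period.
products = [
    {"id": "scene_a", "date": "2019-01-14", "cloud_pct": 62.5},
    {"id": "scene_b", "date": "2019-02-03", "cloud_pct": 4.1},
    {"id": "scene_c", "date": "2019-02-18", "cloud_pct": 23.0},
]

best = pick_clearest(products)
if best is None:
    print("No clear scene in this period: fall back to cloud masking")
else:
    print(f"Use {best['id']} ({best['cloud_pct']}% cloud cover)")
```

The threshold is a judgment call: a stricter value yields cleaner images but leaves more time periods without any usable scene.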

Hansen (Landsat 8)

Logging road image from 2018

Landsat features 30 meters per pixel for the visible bands of the electromagnetic spectrum.


While the logging road in question is visible, its upper portions are nearly invisible in the image and could easily be missed entirely by a machine learning model trained to detect these features. Some of this can be attributed to vegetative regrowth: after logging roads are abandoned, the forest often overtakes the road within a relatively short period of time, even as quickly as one to two years (Kleinschroth, personal communication). This means that a single-year snapshot taken in December 2019 could be unreliable for locating a logging road abandoned by January of the same year.

In addition, the pixelation of the logging road presents issues for the training set labelling process: we first need a pair of human eyes to locate the road in a number of images before we feed those images into the model as training data. This road is relatively clear, but others may not be.

In contrast, here is a view of the same logging road via Sentinel-2, acquired within a year of the Hansen image:

Sentinel-2

Logging road

Sentinel features 10 meters per pixel for the visible bands of the electromagnetic spectrum.

The difference is stark! What accounts for such a vast difference in image quality?

Landsat vs Sentinel

Among the many features that distinguish the leading satellite imagery providers, the most apparent (and most valuable) is pixel resolution. Satellite imagery resolutions have increased substantially in recent years. From the early 1980s until 2015, Landsat was the most consistent publicly available imagery source for most of the globe, at a resolution of 30 meters per pixel. But in 2015, the European Space Agency (ESA) launched its Sentinel-2 mission, which has become the gold standard in publicly available high-resolution satellite imagery due to its 10-meter-per-pixel resolution, availability via various web services, plugins within popular GIS desktop software, and direct access via cloud providers. With a higher resolution, more pixels are used to represent the same area, allowing Sentinel-2 to capture far more detail than the lower-resolution Landsat. Of course, what is sacrificed is the ability to conduct significant time series analysis, since Sentinel-2 imagery only goes back to 2015 while Landsat’s archive spans decades.
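A quick back-of-the-envelope calculation shows how much the two resolutions differ. The sketch below uses only the 30 m and 10 m figures quoted above; the helper names are made up for the example.

```python
def pixels_per_km2(resolution_m: float) -> float:
    """Number of square pixels needed to cover one square kilometer."""
    pixels_per_side = 1000 / resolution_m
    return pixels_per_side ** 2


def detail_ratio(coarse_m: float, fine_m: float) -> float:
    """How many fine pixels fit inside the footprint of one coarse pixel."""
    return (coarse_m / fine_m) ** 2


print(f"Landsat (30 m):  {pixels_per_km2(30):,.0f} pixels per square km")
print(f"Sentinel (10 m): {pixels_per_km2(10):,.0f} pixels per square km")
print(f"One Landsat pixel covers {detail_ratio(30, 10):.0f} Sentinel pixels")
```

Because resolution applies in both dimensions, a threefold improvement in meters per pixel means nine times as many pixels over the same ground, which matters a great deal for features as narrow as logging roads.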

Resolution of spectral satellite imagery plays an even more crucial role in the age of computer vision applications for machine learning models, where machines are fed images to “learn” what a specific feature looks like. If it is difficult for the human eye to detect a logging road, then it would be far more challenging for the machine to mathematically represent that same feature.

Due to this increased resolution, we decided to use the ESA’s Sentinel-2 as our source for Congo Basin images, or “products.”

Sentinel-2 L1C vs L2A products

After settling on Sentinel-2 data for ML training, we had to choose between two different product offerings, which differ in processing level rather than in the satellite that captured them: Level-1C (L1C) or Level-2A (L2A).

Here is an example product from each:

A raw, unprocessed Sentinel-2 image (“L1C” product)
A Sentinel-2 image with native atmospheric correction (“L2A” product)

Comparing the TCI (True Color Image) versions, the L1C product (the first one) contains atmospheric noise and/or shadows of clouds, making the logging road far less perceptible than in the L2A product (the second one).

There is one major drawback to using L2A products: L1C products were introduced in June of 2015, while L2A products did not become available worldwide (outside Europe) until December of 2018. In other words, L2A products only cover images taken in the last year and a half. Still, the quality difference was enough that we decided to focus mainly on L2A.

We later learned that we can actually download L1C products and then convert them to L2A-level quality by running the Sen2Cor processing tool ourselves. Sen2Cor is in fact the exact same processing tool that ESA uses to generate the L2A versions in the first place. Thus, for the images not available as L2A by default (which span from 2015 to 2018), we can run Sen2Cor on our end and have three more years of data!
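As a rough sketch of what that conversion step could look like, the snippet below picks out L1C products from a list of downloaded folders and builds a Sen2Cor command for each one. It rests on several assumptions: that Sen2Cor is installed and exposes its `L2A_Process` command-line entry point, that product folders carry `MSIL1C` or `MSIL2A` in their names, and that the example paths (which are made up) resemble real ones. The `sen2cor_commands` helper is hypothetical, and the commands are only printed, not executed.

```python
from typing import List


def sen2cor_commands(product_dirs: List[str]) -> List[List[str]]:
    """Build one Sen2Cor invocation per L1C product; skip everything else."""
    commands = []
    for path in product_dirs:
        name = path.rstrip("/").split("/")[-1]
        # Only L1C products need conversion; L2A products are already corrected.
        if "MSIL1C" in name and name.endswith(".SAFE"):
            commands.append(["L2A_Process", path])
    return commands


# Hypothetical download folder contents (illustrative names, not real products).
downloads = [
    "data/S2A_MSIL1C_20170105T091341_N0204_R050_T33NYB_20170105T092433.SAFE",
    "data/S2B_MSIL2A_20190610T090559_N0212_R050_T33NYB_20190610T121913.SAFE",
]

for command in sen2cor_commands(downloads):
    print(" ".join(command))
```

Each command list could then be handed to `subprocess.run` on a machine where Sen2Cor is available; keeping command construction separate from execution makes the selection logic easy to check first.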


After being recruited as Project Canopy’s first data scientists, we had to both develop our overall proof of concept and decide on the specific steps needed to make that project a reality, all the way down to choosing between L1C and L2A products. From start to finish this process took about two months, without a single line of code being written. Especially for a data scientist, research, planning, and just acquiring domain knowledge are often just as important as writing custom algorithms.

We learned many lessons from this experience, but possibly the most important was the necessity of flexibility and constant re-evaluation. With as broad a goal as Project Canopy has, we cannot afford to hamstring ourselves by committing to a specific approach at the beginning. This is especially true since we had little domain knowledge of GIS or the Congo Basin before we started; when doing initial research into a new field, you need to be willing to go where the research takes you, and not impose upon it your own expectations and biases. By staying flexible and continually questioning our prior decisions, we were able to overcome what initially seemed like insurmountable barriers and end up with a feasible, actionable plan.

At this point, our next step was actually downloading the data we needed to train and test our model. The immediate difficulty was the sheer size of the GIS data needed to capture the entire Congo Basin visually: terabytes of it. Our next articles will cover the process of figuring out how to do this.