Conserving Forests with Decision Trees

US Wildfires in 2013 (Short, 2015)

Each year in the United States an average of 70,000 wildfires burn an average of 5.7 million acres of wildland. Wildfires that occur naturally are allowed to burn because they contribute to the local ecology (Aplet, 2006). There are, however, nearly six wildfires caused by human activity for every one that occurs naturally. The high number of wildfires caused by human activity can harm the local ecology, so steps should be taken to reduce the number of man-made wildfires. As part of my Master of Computer Science coursework, I examined the possibility of using Machine Learning to direct outreach efforts that are targeted at preventing these fires.

The database of wildfire occurrences created by the US Forest Service shows trends in the cause of wildfires when they are compared by time of year and location. In most states debris burning is the most frequent cause of wildfires; in many others campfires are the most frequent. These trends provide an initial indication that Machine Learning can be used to direct wildfire prevention outreach efforts.

Leading cause of wildfires per state when ranked by total count (Short, 2015)

The most effective form of outreach comes from regional advocates who can form programs suited to the region (Monroe, et al., 2013). For this reason, guidance must identify the regions that have the highest risk of wildfires. It must also change as the seasons influence the most prevalent causes of wildfires. My proposed method of directing wildfire prevention is to create a regionally targeted calendar that identifies the most likely causes of wildfires. This will enable local advocates to plan their outreach efforts so they have the greatest impact. To form this calendar, observed data is compiled into a dataset, which is used to train a model that can predict future risk.

The dataset must record, for each observation, a decision on whether that observation is suitable for advertising based on the available data. A decision is made for each cause within every region during each month, so cause, month, and region are included as features. It has been observed that the climatic cycle has a strong impact on the occurrence of wildfires (Kitberger, 2001), so both the climatic season (El Niño or La Niña) and the intensity of the season are also included as features.
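One way to picture the dataset described above is as one row per combination of region, month, cause, and climatic condition. The sketch below is only an illustration under assumptions: the class name, field names, and the label values for the climatic season and its intensity are hypothetical and do not come from the original project.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Observation:
    region: str           # for example a state code
    month: int            # 1 through 12
    cause: str            # wildfire cause category
    enso_phase: str       # assumed labels: "el_nino" or "la_nina"
    enso_intensity: str   # assumed labels such as "weak" or "strong"
    target: bool = False  # decision label, filled in by a later step


def enumerate_observations(regions, causes, phases, intensities):
    """Build one observation per region, month, cause, and climatic condition."""
    return [
        Observation(region, month, cause, phase, intensity)
        for region, month, cause, phase, intensity
        in product(regions, range(1, 13), causes, phases, intensities)
    ]


rows = enumerate_observations(
    regions=["CA", "OR"],
    causes=["Debris Burning", "Campfire"],
    phases=["el_nino", "la_nina"],
    intensities=["weak", "strong"],
)
print(len(rows))  # 2 regions * 12 months * 2 causes * 2 phases * 2 intensities = 192
```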

It is infeasible for a human to decide whether each observation is suitable for targeting, so a computational approach is used, based on multiple criteria that balance the regional and seasonal risk. The criteria are normalized and combined using rank reciprocal weights (Malczewski, 1999). Observations with a risk that exceeds a selected threshold are considered suitable for targeting. There is no intrinsically correct threshold. A human must select a value that does not result in missed opportunities due to insufficient engagement and, at the other extreme, does not produce excessive advertising, which may result in alert fatigue and confusion.
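The scoring step can be sketched as follows. In the rank reciprocal method, each criterion's weight is the reciprocal of its rank divided by the sum of the reciprocals of all ranks. Everything else here is an assumption made for illustration: the two criteria, their values, the use of min-max normalization, and the threshold of 0.5 are all hypothetical.

```python
def rank_reciprocal_weights(ranks):
    """Weight for each criterion from its rank (1 = most important)."""
    reciprocals = [1.0 / r for r in ranks]
    total = sum(reciprocals)
    return [x / total for x in reciprocals]


def min_max_normalize(values):
    """Rescale raw criterion values to the range [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def risk_scores(criteria, ranks):
    """Weighted sum of normalized criteria; each criterion is a list of values."""
    weights = rank_reciprocal_weights(ranks)
    normalized = [min_max_normalize(column) for column in criteria]
    n = len(criteria[0])
    return [
        sum(w * column[i] for w, column in zip(weights, normalized))
        for i in range(n)
    ]


# Made-up values for four observations and two hypothetical criteria.
fire_count = [120, 40, 5, 80]
acres_burned = [900.0, 100.0, 50.0, 2000.0]

scores = risk_scores([fire_count, acres_burned], ranks=[1, 2])
THRESHOLD = 0.5  # hypothetical; the text notes a human must choose this value
targeted = [score > THRESHOLD for score in scores]
print(targeted)  # [True, False, False, True]
```

Raising the threshold targets fewer observations and lowering it targets more, which is the trade-off between missed opportunities and alert fatigue described above.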

The impact of threshold values on targeted observations vs. untargeted fires

I selected the C5.0 algorithm to predict the risk of new observations. This is a decision tree algorithm that recursively splits the data to maximize information gain, selecting at each step the split with the greatest reduction in entropy. Multiple trees can be generated that then vote on the outcome, which results in greater accuracy. Further improvements can be made by giving more weight to some outcomes. It also creates a white box model, which can be analyzed by auditors or subject matter experts.
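The split criterion described above can be illustrated with a few lines of code. This is a minimal sketch of entropy and information gain on categorical features, not the C5.0 implementation, which adds refinements that are not shown here. The feature names and the four example rows are made up.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Reduction in entropy from splitting the rows on one categorical feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_split(rows, labels, features):
    """Pick the feature whose split gives the greatest information gain."""
    return max(features, key=lambda f: information_gain(rows, labels, f))


# Made-up observations: the label says whether the observation is targeted.
rows = [
    {"cause": "Debris Burning", "season": "winter"},
    {"cause": "Debris Burning", "season": "summer"},
    {"cause": "Campfire", "season": "winter"},
    {"cause": "Campfire", "season": "summer"},
]
labels = [True, True, False, False]

print(best_split(rows, labels, ["cause", "season"]))  # cause
```

In this toy data, splitting on the cause separates the two classes completely, while splitting on the season leaves each group as mixed as before, so the cause is chosen.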

To create an accurate model, the learning algorithm must minimize the generalization error, which is the error when the algorithm processes new data. The dataset is split into training data, used to form the model, and test data, used for evaluation. The algorithm is said to underfit if it cannot form a model of the training data and to overfit if it can model the training data but cannot generalize. Reducing the training error will also reduce the generalization error if the datasets are identically distributed (Bengio, 2016). I accomplished this by randomizing the order of the observations before splitting them into training and test sets.
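The shuffle-then-split step can be sketched as below. The 25% test fraction and the fixed seed are assumptions chosen for the example; the text does not state which values were used.

```python
import random


def shuffle_and_split(observations, test_fraction=0.25, seed=0):
    """Shuffle a copy of the observations, then return (train_set, test_set)."""
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train_set, test_set = shuffle_and_split(range(100), test_fraction=0.25, seed=42)
print(len(train_set), len(test_set))  # 75 25
```

Shuffling before the split keeps any ordering in the source data, such as by year or by state, from ending up concentrated in only one of the two sets.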

The C5.0 algorithm performed well, correctly classifying 89.7% of decisions to engage and 83.8% of decisions not to engage. The algorithm selected splits that someone unfamiliar with the algorithm could understand. For example, it would ask whether the cause being considered was debris burning or whether the month being considered was between December and February.
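Figures like the two percentages above are per-class accuracies: the share of actual "engage" decisions that were predicted correctly, and likewise for "do not engage". A minimal sketch of that calculation follows. The labels and predictions are made up for the example and are not the results reported in the text.

```python
def per_class_accuracy(actual, predicted):
    """Fraction of correctly classified observations for each actual class."""
    correct, total = {}, {}
    for a, p in zip(actual, predicted):
        total[a] = total.get(a, 0) + 1
        correct[a] = correct.get(a, 0) + (a == p)
    return {label: correct[label] / total[label] for label in total}


# Made-up labels: four actual "engage" decisions and five actual "skip" decisions.
actual = ["engage", "engage", "engage", "engage",
          "skip", "skip", "skip", "skip", "skip"]
predicted = ["engage", "engage", "engage", "skip",
             "skip", "skip", "skip", "skip", "engage"]

rates = per_class_accuracy(actual, predicted)
print(rates)  # {'engage': 0.75, 'skip': 0.8}
```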

I used the generated model to predict the most likely causes of wildfires throughout the year for each state, with alternate calendars generated for El Niño and La Niña cycles. The standard calendar for California is included below.

California Calendar of Likely Wildfire Causes
January-March: Debris Burning, Equipment Use
April-October: Campfires, Children, Debris Burning, Equipment Use, Smoking
November: Campfires, Debris Burning, Equipment Use
December: Debris Burning, Equipment Use

El Niño Modifications
April: Exclude Smoking
November: Exclude Campfires

La Niña Modifications
June, July: Include Powerline
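The calendar and its modifications can be represented as a small lookup structure. The sketch below is one possible encoding, not the format produced by the original project. The function and variable names are hypothetical, and the first row of the calendar, which carries no month label, is assumed to cover January through March.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

STANDARD = {}
for month in MONTHS[0:3]:
    # Assumption: the unlabeled first row applies to January through March.
    STANDARD[month] = {"Debris Burning", "Equipment Use"}
for month in MONTHS[3:10]:
    # April through October
    STANDARD[month] = {"Campfires", "Children", "Debris Burning",
                       "Equipment Use", "Smoking"}
STANDARD["November"] = {"Campfires", "Debris Burning", "Equipment Use"}
STANDARD["December"] = {"Debris Burning", "Equipment Use"}

MODIFICATIONS = {
    "el_nino": {"April": ("exclude", "Smoking"),
                "November": ("exclude", "Campfires")},
    "la_nina": {"June": ("include", "Powerline"),
                "July": ("include", "Powerline")},
}


def likely_causes(month, phase=None):
    """Causes to target in a month, adjusted for an optional climatic phase."""
    causes = set(STANDARD[month])
    change = MODIFICATIONS.get(phase, {}).get(month)
    if change is not None:
        action, cause = change
        if action == "exclude":
            causes.discard(cause)
        else:
            causes.add(cause)
    return causes


print(sorted(likely_causes("November", "el_nino")))
# ['Debris Burning', 'Equipment Use']
```

Keeping the modifications separate from the standard calendar mirrors the layout above, so a change to the standard calendar carries over to both climatic variants.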

To evaluate the effectiveness of the final product, I compared several wildfires with known causes that occurred in California during 2016 against the static California schedule that would have been used during that year. Investigating the cause of a wildfire can take a long time, so not every wildfire in 2016 has a known cause, but comparing known causes against the schedule for a recent year is useful for determining the schedule’s effectiveness. June 2016 was the final month of an El Niño phase that started in October of 2014. A brief, low-intensity La Niña cycle started in July and lasted through the end of the year. California was also in the sixth year of a drought, which was somewhat lessened by the rain brought by El Niño.

The Soberanes Fire started in July of 2016 and burned over 132,000 acres. At a cost of $236 million, it was the most expensive wildfire in the US to date. It was found to have been caused by a campfire (InciWeb, 2016), an activity that would have been targeted from April through October that year. The Erskine Fire was started at the end of June by a powerline (BakersfieldNow staff, 2016). Although powerlines were not targeted until July, outreach efforts would have been underway at this point. This suggests that generating a calendar with a finer resolution may improve the results. The Cedar (Rocha & Hamilton, 2016), Chimney (Lambert, 2017) and Marshes (Cowan, 2016) wildfires were all caused by equipment use, specifically by motorized vehicles that were idling or being used off-road, in August through September of 2016, when equipment use would have been targeted for advertisement.

The accuracy of the schedule at directing efforts to wildfires that occurred in California during 2016 demonstrates the effectiveness of the static outreach schedule. While it is not possible to say whether these wildfires would have been prevented, there would have been ongoing outreach targeting the activities that caused these fires at the time they occurred. When used to direct outreach efforts, this schedule can increase their effectiveness and reduce both the frequency of wildfires and the damage they cause.


Aplet, G. H. (2006). Evolution of Wilderness Fire Policy. International Journal of Wilderness, 9–13.

BakersfieldNow staff. (2016, December 22). Worn electric line in tree caused deadly, destructive Erskine Fire. Retrieved from BakersfieldNow:

Bengio, Y. C. (2016). Deep Learning. Cambridge: Massachusetts Institute of Technology.

Cowan, J. (2016, September 28). Cal Fire releases cause of Marshes Fire. Retrieved from The Union Democrat:

InciWeb. (2016, October 28). Soberanes Fire. Retrieved from InciWeb:

Kitberger, T., Swetnam, T. W., & Veblin, T. T. (2001). Inter-hemispheric synchrony of forest fires and the El Niño-Southern Oscillation. Global Ecology and Biogeography, 315–326.

Lambert, H. (2017, March 29). Investigators identify cause of Chimney Fire. Retrieved from KSBY:

Malczewski, J. (1999). GIS and Multicriteria Decision Analysis. Hoboken: Wiley.

Rocha, V., & Hamilton, M. (2016, September 29). Man gets prison and must pay $61 million for starting huge fire in the Sequoia National Forest. Retrieved from Los Angeles Times:

Short, K. C. (2015). Spatial wildfire occurrence data for the United States, 1992–2013.–0009.3