Finding the Forest Cover Type

Nora Elattar
Published in CUNY CSI MTH513
6 min read · May 13, 2019

Introduction

“Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task.” Furthermore, ML is a subset of artificial intelligence that “includes abstruse statistical techniques” which allow machines to improve at tasks with experience. In ML, there are several different classifiers used for different functions and types of data. We learned about these classifiers in our ML class this semester. As the semester progressed, my team (Nicholas Borghese, Nora Abualam, and I), otherwise known as N³, entered and competed in several competitions. One of those competitions was Forest Cover Type Prediction.

The Competition

Out of the three competitions we entered during the semester, the one I will be discussing is the Forest Cover Type Prediction competition. In this competition, data scientists attempt to predict the actual forest cover type from strictly cartographic variables. The prediction is made for each 30 x 30 meter cell in US Forest Service (USFS) Region 2 Resource Information System data. The variables were derived from data gathered by the USFS and the US Geological Survey.

Furthermore, this competition is scored on accuracy, so the higher the score, the better.

The possible cover types are described below:

1 — Spruce/Fir
2 — Lodgepole Pine
3 — Ponderosa Pine
4 — Cottonwood/Willow
5 — Aspen
6 — Douglas-fir
7 — Krummholz

There are seven cover types, which means seven possible outcomes. The training set contains both the features and the Cover_Type, while the test set contains only the features. The goal of the competition is to predict the Cover_Type for each row in the test set.
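As a minimal sketch of that setup (the file names are the Kaggle defaults; the columns are described in the next section):

```python
# Minimal setup sketch: load the competition files and separate
# the features from the label. File names are the Kaggle defaults.
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

X = train.drop(columns=["Id", "Cover_Type"])  # features only
y = train["Cover_Type"]                       # the label to predict
X_test = test.drop(columns=["Id"])            # test set has no label
```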

The Data

The data offered has the following fields:

Elevation — Elevation in meters
Aspect — Aspect in degrees azimuth
Slope — Slope in degrees
Horizontal_Distance_To_Hydrology — Horizontal distance to nearest surface water features
Vertical_Distance_To_Hydrology — Vertical distance to nearest surface water features
Horizontal_Distance_To_Roadways — Horizontal distance to nearest roadway
Hillshade_9am (0 to 255 index) — Hillshade index at 9am, summer solstice
Hillshade_Noon (0 to 255 index) — Hillshade index at noon, summer solstice
Hillshade_3pm (0 to 255 index) — Hillshade index at 3pm, summer solstice
Horizontal_Distance_To_Fire_Points — Horizontal distance to nearest wildfire ignition points
Wilderness_Area (4 binary columns, 0 = absence or 1 = presence) — Wilderness area designation
Soil_Type (40 binary columns, 0 = absence or 1 = presence) — Soil type designation
Cover_Type (7 types, integers 1 to 7) — Forest cover type designation

The wilderness areas include:

1 — Rawah Wilderness Area
2 — Neota Wilderness Area
3 — Comanche Peak Wilderness Area
4 — Cache la Poudre Wilderness Area

Exploratory Plots

Before even attempting to come up with a solution, my team and I decided to visualize how the data was distributed. We figured the best way to do that was to plot it.

General heat map of all categories offered

After careful consideration, we decided to plot all the categories on a heat map to see how the data was distributed and how the features relate to one another.
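A heat map like this can be produced with a few lines of seaborn; this is a sketch assuming the train DataFrame loaded earlier, not the exact cell from our notebook:

```python
# Correlation heat map of all columns against one another.
import matplotlib.pyplot as plt
import seaborn as sns

corr = train.drop(columns=["Id"]).corr()  # pairwise correlations
plt.figure(figsize=(20, 16))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation between all categories")
plt.show()
```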

After seeing the seaborn plot, we decided to create another graph to narrow down the results.

Organized and more accurate heat map

Then, we excluded the columns that carried no information, dropping Soil_Type7 and Soil_Type15, since they contained no data and appeared as plain white lines in the heat map.
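The drop itself is a pair of one-liners, using the column names from the data description:

```python
# Soil_Type7 and Soil_Type15 contain only zeros in the training
# set, so they add nothing to the model.
train = train.drop(columns=["Soil_Type7", "Soil_Type15"])
test = test.drop(columns=["Soil_Type7", "Soil_Type15"])
# (X and X_test would then be rebuilt from these frames.)
```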

From this graph, we can see that no single feature correlates strongly with the Cover_Type, so we decided to use all of the remaining categories to determine the cover type for each given row.

From the graphs, we can also see that the data is well balanced and there is an even distribution of classes.
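This is easy to verify directly; each of the seven cover types appears the same number of times in the training set:

```python
# Check the class balance: this should show an even count for
# every class (2,160 rows each in the 15,120-row training set).
print(train["Cover_Type"].value_counts())
```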

Trial & Error

It took my team 14 different commits to finally reach what we felt was the best score we could achieve in the time given.

XGBoost Classifier

On our first go-around, we decided to use the XGBClassifier. XGBoost (the name comes from eXtreme Gradient Boosting) is a library of gradient-boosted decision trees designed for performance and speed. What XGBoost does is push the computational resource limit for boosted-tree algorithms.

For this first go-around, the XGBClassifier was used together with a OneHotEncoder, giving us a result of 0.58489. It should also be noted that the Id and Cover_Type columns were dropped from the training set to form X, while y was the Cover_Type column itself. To further explain, one-hot encoding is a procedure where categorical variables are converted into a form that can be provided to, and easily read by, ML algorithms, so that they can output more accurate predictions.

An example of One Hot Encoding
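Since our original notebook appears here only as screenshots, the following is just a sketch of that first attempt; the hyperparameters are assumptions, and the explicit encoding step is omitted because the Wilderness_Area and Soil_Type columns already arrive one-hot encoded:

```python
# Sketch of the first attempt: fit an XGBoost classifier on the
# features, with hyperparameters left at their defaults.
from xgboost import XGBClassifier

xgb = XGBClassifier()
# Recent XGBoost versions expect class labels 0..n-1, so the 1..7
# cover types are shifted down for fitting and back up afterwards.
xgb.fit(X, y - 1)
pred = xgb.predict(X_test) + 1
```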

My team and I knew that we could improve on this score given how low it was; in this competition, the closer we are to 1.0, the better. Therefore, we decided to try Logistic Regression instead.

To describe logistic regression: it is a statistical method for analyzing a dataset in which one or more independent variables determine an outcome, and that outcome is measured with a dichotomous variable, meaning there are only two possible outcomes. Since this competition has seven possible cover types rather than two, the method has to be extended to the multiclass case, which scikit-learn's implementation handles automatically.

Difference between linear regression and logistic regression
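As a sketch of this attempt (max_iter is an assumption, raised so the solver converges on this data):

```python
# Logistic regression sketch: scikit-learn extends the dichotomous
# model to the seven-class case with a multinomial (softmax) model.
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X, y)
pred = logreg.predict(X_test)
```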

When we submitted the Logistic Regression, we ended up with a score of 0.55516, which was lower than our previous one. Clearly, logistic regression wasn't the right move, and we had to figure out a better solution to top both it and the one-hot-encoded XGBoost.

Final Solution

Random Forest Classifier

After trying the XGBoost with OneHotEncoder and the Logistic Regression, we moved on to the Random Forest Classifier.

A random forest is a “meta estimator that fits” multiple decision tree classifiers on several sub-samples of the provided dataset and uses averaging to improve predictive accuracy and control over-fitting. Overfitting is “the production of an analysis that corresponds too closely or exactly to a particular set of data, and may, therefore, fail to fit additional data or predict future observations reliably”.

After importing scikit-learn's ensemble module, we used it to run the RandomForestClassifier. Printing the score on the training set showed 0.9952, which looked like a great outcome.
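A sketch of that final model (n_estimators and random_state are assumptions, since the original cells are screenshots):

```python
# Final model, sketched: a random forest of decision trees.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# Score on the data the forest was trained on: roughly 0.99 here,
# versus 0.7207 on the hidden test set, so the forest fits the
# training data far more closely than unseen data.
print(rf.score(X, y))
pred = rf.predict(X_test)
```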

Once we submitted the notebook, the score we received back was 0.7207. After that, we were unable to improve on this score, so we settled and decided this was the highest score we could achieve.

Cessation

Although the concept of Random Forest is quite complex with its usage of multiple decision trees, the code for this competition happened to be the shortest.

It was a simple couple of function calls mixed with the submission code: a total of five cells in the kernel to reach the score we got at the end.
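The submission step itself amounts to writing one prediction per test Id, assuming the usual Kaggle submission format:

```python
# Build the submission file: one Cover_Type prediction per test Id.
import pandas as pd

submission = pd.DataFrame({"Id": test["Id"], "Cover_Type": pred})
submission.to_csv("submission.csv", index=False)
```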

After using the different classifiers, I ended this competition feeling that I gained more knowledge in Machine Learning than before. I also was able to understand Random Forest better with the concept of multiple decision trees (ergo, forming a forest). The clever use of the word did indeed help in remembering just how the classifier worked.

Ironically, the best classifier we used for the Forest Cover Type was the Random Forest, showing that the one-hot-encoded XGBoost and the Logistic Regression didn't give outcomes like the Random Forest did. Clearly, this solution was our best one.
