Land Cover Classification

Python code to categorise satellite images into different land cover classes.

Shraddha Anala
Analytics Vidhya
4 min read · May 22, 2020


I’m expanding with more posts on ML concepts + tutorials over at my blog!

Welcome to another tutorial in my ongoing series where I build Machine Learning models on random datasets from the UCI Machine Learning Repository.

Photo by Subtle Cinematics on Unsplash

This tutorial shows you how to analyze crowdsourced data from OpenStreetMap (OSM) to determine the land cover class: forest, grass, water, and so on. You can download the dataset here if you’d like to follow along.

The Crowdsourced Mapping Dataset is considerably different from my previous challenges in that it contains a lot of noise. Preprocessing the data was a bit of a challenge this time.

According to the source, the training dataset contains attribute noise due to varying degrees of cloud cover at the time the satellite images were captured, and class noise due to labelling errors. It is therefore advised not to compute classification accuracy on the training dataset.

About the Dataset:

Two datasets are provided, for training and testing purposes. Both contain 29 columns: 1 class label (the target), 1 column with the maximum Normalized Difference Vegetation Index (NDVI) value, and 27 columns of NDVI values captured on different days, in reverse chronological order.

Based on the NDVI values, we can classify each observation as belonging to one of 6 land cover classes: Forest, Impervious, Water, Grass, Orchard, or Farm, since NDVI is a measure of vegetation cover.

Tutorial:

1) Data Preprocessing -

As usual, we will import the required libraries and the datasets into 2 pandas DataFrames.

Here, we are reshuffling the rows in the dataset and considering only the bimonthly NDVI variables.
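
For reference, a minimal sketch of this step might look like the following. The file names (training.csv, testing.csv), the target column name 'class', and the "keep every second NDVI column" selection are assumptions for illustration; adjust them to match your copy of the dataset.

import pandas as pd

# Load the training and testing datasets into two DataFrames
# (file names are assumptions).
train_df = pd.read_csv('training.csv')
test_df = pd.read_csv('testing.csv')

# Reshuffle the rows of the training data.
train_df = train_df.sample(frac=1, random_state=42).reset_index(drop=True)

# Keep the class label plus a bimonthly subset of the NDVI columns
# (approximated here as every second NDVI column).
ndvi_cols = [c for c in train_df.columns if c != 'class']
bimonthly_cols = ndvi_cols[::2]
train_df = train_df[['class'] + bimonthly_cols]
test_df = test_df[['class'] + bimonthly_cols]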

2) Synthetic Minority Over-sampling Technique -

One of the other challenges encountered in this dataset is that the class distribution is highly imbalanced, meaning that one class has far more observations than the others. Below is the initial frequency distribution of the land cover classes.

Frequency Distribution of Land Cover Classes

The Forest class has the highest number of observations (7,441) while the Orchard class has the fewest (53), out of 10,545 total observations across all classes.
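
A quick way to inspect these counts yourself, continuing from the snippet above (assuming seaborn and matplotlib are installed, and the target column is named 'class'):

import seaborn as sns
import matplotlib.pyplot as plt

# Count the observations per land cover class.
print(train_df['class'].value_counts())

# Plot the frequency distribution of the classes.
sns.countplot(x='class', data=train_df)
plt.title('Frequency Distribution of Land Cover Classes')
plt.show()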

This skewed distribution leads to improper training, with the model overfitting to the majority class in the training subset. If you were to evaluate such a model on the test set, you would see very poor results, showing that it has not learned the correlations in the data.

Density Distribution of Classes against NDVI (before Oversampling)

To overcome this challenge, we will oversample the minority classes using the Synthetic Minority Over-sampling Technique (SMOTE).

SMOTE generates new samples that lie close, in feature space, to existing minority-class observations. This way the dataset is augmented with new observations that balance the class distribution, but no new information is gained. To reiterate: we are synthesizing new examples to level out the class imbalance, not adding any information that is alien to the testing dataset.

The code below takes care of encoding the target variable and applying SMOTE.
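
Here is a minimal sketch of that step, using Scikit-learn's LabelEncoder and imbalanced-learn's SMOTE with default parameters (the exact settings are assumptions; it continues from the DataFrames built earlier):

from collections import Counter
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE

# Separate features and target.
X_train = train_df.drop('class', axis=1).values
y_train = train_df['class'].values
X_test = test_df.drop('class', axis=1).values
y_test = test_df['class'].values

# Encode the six land cover labels as integers.
encoder = LabelEncoder()
y_train = encoder.fit_transform(y_train)
y_test = encoder.transform(y_test)

# Oversample the minority classes so every class matches the majority class.
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)

print(Counter(y_train))  # every class now has the same count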

Let’s take a look at the distribution plots and count plots after oversampling to see how the dataset has been modified. All the classes have an equal number of observations after SMOTE.
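
A plotting sketch for these checks (rebuilding a DataFrame from the resampled arrays; the column names are reused from the bimonthly subset above):

# Rebuild a DataFrame from the resampled arrays for plotting.
resampled = pd.DataFrame(X_train, columns=bimonthly_cols)
resampled['class'] = encoder.inverse_transform(y_train)

# Count plot: every class should now have the same frequency.
sns.countplot(x='class', data=resampled)
plt.title('Frequency Distribution after SMOTE')
plt.show()

# Density plot of one NDVI feature per class (first retained column as an example).
sns.kdeplot(data=resampled, x=bimonthly_cols[0], hue='class')
plt.title('Density Distribution of Classes after Oversampling')
plt.show()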

Frequency Distribution after SMOTE
Density Distribution of Classes after Oversampling

3) Naive Bayes Classifier & Evaluation -

In the next step, we normalize the independent features with Scikit-learn’s MinMaxScaler, which rescales each feature to the [0, 1] range.
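
A sketch of the scaling step (the scaler is fitted on the resampled training features and then reused, unchanged, on the test features):

from sklearn.preprocessing import MinMaxScaler

# Rescale every feature to the [0, 1] range.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)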

Then we build the classification model using the Gaussian Naive Bayes classifier. Out of all the classifiers I tried, this algorithm resulted in the highest accuracy.

Finally, as usual, we calculate the accuracy score and perform K-Fold Cross-Validation to establish our model’s performance metrics.
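
Putting the last two steps into code, a sketch could look like this (GaussianNB with default settings and 10-fold cross validation on the resampled, scaled training data; the exact arguments are assumptions):

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

# Fit the Gaussian Naive Bayes classifier.
model = GaussianNB()
model.fit(X_train, y_train)

# Accuracy on the held-out testing dataset.
y_pred = model.predict(X_test)
print('Test accuracy:', accuracy_score(y_test, y_pred))

# 10-fold cross validation for the mean and spread of accuracies.
cv_scores = cross_val_score(model, X_train, y_train, cv=10)
print('Mean CV accuracy:', cv_scores.mean())
print('Std of CV accuracies:', cv_scores.std())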

I was able to achieve an accuracy of 74% on the testing set.

Accuracy Score
Standard Deviation in the Accuracies
Mean of the 10 accuracies

Despite the two different types of noise present in the training dataset, both attribute and labelling noise, our model achieved a good accuracy of 74%. Today, I learned (and you did too!) about a new technique for handling imbalance in the class distribution, called the Synthetic Minority Over-sampling Technique.

This is an important reminder to analyze and preprocess the raw data carefully before proceeding to the final model construction.

I hope you found this new technique useful. If you enjoyed this article, you can take a look at the others in this series, as well as my GitHub. Please leave suggestions, thoughts, and requests for further clarification below.

Thank you very much for reading and I’ll see you next week!
