Using Machine Learning for Land Cover Classification: A Python Approach

Niken Andika Putri
Published in Age of Awareness
6 min read · Dec 4, 2024
Sentinel-2A images (RGB composite) captured on August 9th, 2020

Land use and land cover (LULC) classification plays a pivotal role in the forestry and agriculture sectors, whether for plantation management, ecosystem restoration, carbon market initiatives, or other applications. Monitoring land cover and land use changes is a mandatory task for concession owners, requiring consistent and accurate analysis of their concession areas.

As a GIS analyst, I initially relied on manual processes or pre-built software tools to perform these classifications. While these methods worked, they often lacked the flexibility to accommodate specific needs or adapt to unique challenges. Manual workflows also become impractical when managing large areas, especially in scenarios that demand frequent or real-time updates.

LULC classification typically relies on satellite imagery, aerial photographs, or drone imagery. Satellite images, available as free or high-resolution commercial options, offer various sensors and resolutions tailored to diverse objectives. Optical and Synthetic Aperture Radar (SAR) data are particularly valuable for capturing diverse landscape features. The temporal nature of these datasets allows near-real-time monitoring, making it easier to track changes and manage forest assets efficiently.

This article provides a step-by-step guide to land cover classification using Python, covering data acquisition and preprocessing, feature extraction and preparation of training/testing datasets, model training and accuracy assessment, and prediction and exporting of results. By leveraging Python’s powerful tools and libraries for machine learning and geospatial data processing, we can create customized, efficient solutions tailored to land cover classification. This journey represents my first step into combining Python, machine learning, and spatial data — let’s dive in together!

Because of its length and complexity, the full script is not reproduced here; the complete version can be accessed through the link provided at the end of the article.

1. Data Acquisition and Preprocessing

For this project, I will use optical data from Sentinel-2A, part of the Copernicus program, which is freely available for download. Out of its 13 spectral bands, I will focus on 7 bands for the analysis, specifically those capturing RGB (Red, Green, Blue) and NIR (Near-Infrared) wavelengths. Additionally, the Normalized Difference Vegetation Index (NDVI) will be incorporated to enhance the prediction accuracy.

The preprocessing step ensures the data is ready for analysis. Since my Area of Interest (AoI) is cloud-free, I skipped cloud masking. Instead, I clipped the data to the AoI, stacked the selected bands, and scaled the reflectance values by dividing them by 10,000. This scaling converts the raw values into standardized reflectance units, making them easier to work with.
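The preprocessing steps above (clipping to the AoI, scaling by 10,000, and deriving NDVI) can be sketched as small helpers. This is a simplified illustration, not the author's actual script; the band file paths and AoI geometries are placeholders you would substitute with your own data.

```python
import numpy as np


def scale_reflectance(dn, scale=10_000.0):
    """Convert Sentinel-2 digital numbers to surface reflectance in [0, 1]."""
    return np.asarray(dn, dtype="float32") / scale


def compute_ndvi(nir, red, eps=1e-10):
    """NDVI = (NIR - Red) / (NIR + Red); eps avoids division by zero."""
    nir = np.asarray(nir, dtype="float32")
    red = np.asarray(red, dtype="float32")
    return (nir - red) / (nir + red + eps)


def load_band(path, aoi_geoms):
    """Read one band, clip it to the AoI polygons, and scale it.

    Requires rasterio; aoi_geoms must already be in the raster's CRS
    (e.g. from a GeoPandas GeoDataFrame reprojected with to_crs).
    """
    import rasterio
    from rasterio.mask import mask

    with rasterio.open(path) as src:
        clipped, _ = mask(src, aoi_geoms, crop=True)
    return scale_reflectance(clipped[0])
```

After loading the blue, green, red, and NIR bands this way, the feature stack would be assembled with something like `np.stack([blue, green, red, nir, compute_ndvi(nir, red)])`, giving an array of shape `(n_features, rows, cols)`.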

To obtain labeled sample data, I referenced datasets provided by the Ministry of Environment and Forestry based on the Regulation of the Director General of Forestry Planning Number: P.1/VII-IPSDH/2015 concerning Guidelines for Monitoring Land Cover. Using GIS software, I created a shapefile containing point locations, each annotated with its respective land use/land cover (LULC) class as an attribute. For this tutorial, the dataset includes ten distinct LULC classes.

From this step onward, I will utilize Python libraries such as Geopandas, Rasterio, and NumPy for handling spatial data, and Scikit-learn for building and evaluating the machine learning model.

2. Feature Extraction and Training/Test Data Preparation

Feature extraction involves retrieving pixel values from the selected spectral bands and organizing them into a format suitable for machine learning analysis.

Labeled data from the shapefile, containing the ten LULC classes, was overlaid with the raster data to assign class labels to the corresponding pixels. These labeled pixels were then split into training and testing datasets, stratifying the split so that every class is represented in both subsets.
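A minimal sketch of this step, assuming the point coordinates have already been converted to raster row/column indices (in rasterio this is typically done with `src.index(x, y)`); the function names here are illustrative, not from the original script:

```python
import numpy as np
from sklearn.model_selection import train_test_split


def extract_features(stack, rows, cols):
    """Pull per-pixel feature vectors from a (n_bands, H, W) stack at the
    row/column indices of the labeled sample points."""
    return stack[:, rows, cols].T  # shape: (n_points, n_bands)


def split_samples(X, y, test_size=0.3, seed=42):
    """Stratified split: class proportions are preserved in both subsets."""
    return train_test_split(X, y, test_size=test_size, stratify=y,
                            random_state=seed)
```

Stratifying on the labels is what keeps rare classes from disappearing entirely from either split, although (as the results below show) it cannot create samples that were never collected.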

Visualization of image composites for the input images (Left: False Color; Right: Natural Color)
The labeled dataset

3. Model Training and Accuracy Assessment

Once the training and testing datasets were prepared, I used the Random Forest (RF) algorithm to classify the land cover types. Random Forest is a robust machine learning algorithm widely used in remote sensing applications due to its ability to handle large datasets, manage complex relationships between features, and minimize overfitting.

Using Scikit-learn, I then trained the RF model on the training dataset, which involved fitting multiple decision trees and combining their outputs for classification. After training, the model was validated using the test dataset. I assessed its accuracy by comparing the predicted labels with the true labels and calculated metrics such as precision, recall, and F1-score. A confusion matrix was also generated to evaluate the model’s performance across all ten land cover classes.
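The train-and-evaluate loop follows the standard Scikit-learn pattern. The sketch below uses synthetic features in place of the real spectral samples (an assumption for the sake of a self-contained example); in the actual workflow, `X` and `y` would be the sampled band values and the ten LULC labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the sampled spectral features.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_classes=3, n_clusters_per_class=1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Fit an ensemble of decision trees and aggregate their votes.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

print(classification_report(y_test, y_pred))  # precision/recall/F1 per class
print(confusion_matrix(y_test, y_pred))       # rows: true, cols: predicted
```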

The classification report and confusion matrix of the RF model

The classification report indicates an overall accuracy of 62.90%, with strong performance for the open water and secondary mangrove classes, achieving high precision, recall, and F1-scores. However, the model struggles to predict the settlement, mixed dry agriculture, and paddy field classes, where precision, recall, and F1-scores are near zero, signaling poor model performance for these categories. The mining class also poses challenges due to its limited support, with only one sample available. The lower macro average scores compared to the weighted average further highlight class imbalance, where larger classes disproportionately influence the overall performance metrics.

The confusion matrix reveals that the model performs well for open water and secondary mangrove classes, with most predictions correctly classified (e.g., 14/14 for secondary mangrove). However, there are notable misclassifications for classes like settlement and mixed dry agriculture, which are often confused with other categories, such as secondary mangrove and paddy fields. Sparse classes, including paddy field and mining, have low prediction counts or no predictions at all, indicating difficulty in distinguishing these categories. Overall, the matrix highlights significant misclassifications for underrepresented classes and strong performance for dominant, well-represented ones.

4. Prediction and Exporting Results

With the RF model trained and validated, the next step was to use it for predicting land cover classes over the entire study area. Using the trained model, I applied predictions to the raster data by feeding the pixel values from the selected bands as input features. The output was a classified raster, where each pixel was assigned a land cover label corresponding to one of the ten LULC classes.
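The whole-scene prediction boils down to a reshape: flatten the `(n_bands, H, W)` stack into a table of pixels, predict, then fold the labels back into the raster grid. This helper is a sketch of that idea, not the author's code:

```python
import numpy as np


def predict_raster(model, stack):
    """Classify every pixel of a (n_bands, H, W) feature stack.

    Flattens the stack to (H*W, n_bands), runs model.predict, and
    reshapes the label vector back into an (H, W) class map.
    """
    n_bands, h, w = stack.shape
    flat = stack.reshape(n_bands, -1).T
    labels = model.predict(flat)
    return labels.reshape(h, w)
```

Any fitted Scikit-learn classifier exposing `predict` works here; for large scenes, predicting in chunks (e.g. row blocks) keeps memory usage bounded.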

After generating the classified raster, I exported the results as a GeoTIFF file using Rasterio, ensuring compatibility with GIS software such as ArcGIS Pro or QGIS for further visualization and analysis. Once opened, the classified map provides a clear spatial representation of land cover types within the study area, serving as a valuable tool for decision-making and monitoring. With additional configuration, such as setting colors, labels, and map legends, the output can be customized for better interpretation and presentation.
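Exporting with Rasterio mostly means reusing the source raster's georeferencing profile while switching to a single uint8 band. A sketch, assuming the profile dict comes from the opened source raster (`src.profile`); the nodata value of 255 is an illustrative choice:

```python
def classified_profile(src_profile, nodata=255):
    """Adapt the source raster's profile for a single-band uint8 class map."""
    profile = dict(src_profile)
    profile.update(driver="GTiff", count=1, dtype="uint8", nodata=nodata)
    return profile


def write_classified(path, classes, src_profile):
    """Write an (H, W) class array as a GeoTIFF (requires rasterio)."""
    import rasterio

    with rasterio.open(path, "w", **classified_profile(src_profile)) as dst:
        dst.write(classes.astype("uint8"), 1)
```

Because the CRS and transform are carried over from the input imagery, the output aligns pixel-for-pixel with the source scene when opened in ArcGIS Pro or QGIS.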

The final result of the LULC classification

Overall, the model demonstrates promising performance for certain classes but struggles with sparse or imbalanced ones, highlighting the need to address class imbalance, enhance feature representation, and increase training data diversity. Incorporating additional data sources, such as combining optical with SAR data, or introducing new features like texture metrics or alternative vegetation indices, could further improve the model’s ability to distinguish between challenging classes and enhance overall performance.
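One low-effort mitigation for the class imbalance noted above (not used in the original script, so this is a suggestion rather than the author's method) is Scikit-learn's `class_weight="balanced"` option, which reweights each class inversely to its frequency:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced labels: class 0 heavily outnumbers class 1.
y = np.array([0] * 90 + [1] * 10)

# 'balanced' weights each class by n_samples / (n_classes * class_count),
# so samples from the rare class count more during tree construction.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)

rf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                            random_state=0)
```

Reweighting helps most when the rare classes are at least spectrally separable; for classes with only a handful of samples (like the single mining point here), collecting more labeled data remains the more reliable fix.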

Conclusion

This project highlights the power of combining machine learning with Python for land cover classification, showcasing an end-to-end workflow from data preprocessing to exporting results. By using tools like Geopandas, Rasterio, NumPy, and Scikit-learn, we achieved a flexible, efficient process tailored for spatial data analysis. The RF model produced reliable classifications for the dominant, well-represented classes, demonstrating the approach’s potential for large-scale land cover mapping. This approach not only simplifies workflows but also opens new possibilities for integrating machine learning into geospatial applications, offering valuable insights for forestry and agriculture sectors.

The code associated with this article is freely available on my GitHub.

Acknowledgment

This article is made possible thanks to the incredible support of open-source contributions, including Geopandas, Rasterio, NumPy, and Scikit-learn.
