Data Training for LULC Classification of Chachoengsao, Thailand

raditiasatriawan

The journey begins on 11/08/2024, with 16 straight hours of performing LULC classification in Chachoengsao, Thailand. The aim of this activity is to map land use and land cover within the Area of Interest (AOI) of Chachoengsao.

This is an early LULC classification, which includes only four classes:
1. Buildings
2. Water
3. Cropland
4. Woodland
The ground truth data has already been obtained, and because it was collected in 2020, the classification will be performed on imagery from January to December 2020.

The first thing that needs to be done is data collection. Import the provided ground-truth CSV files into Google Earth Engine. Then, with the help of s2cloudless cloud masking, verify the ground-truth points and turn them into feature collections. In this case, 200 points per class are needed for the training data.
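Below is a minimal sketch of this step using the Earth Engine Python API. The asset path and the 40% cloud-probability threshold are illustrative assumptions, not the exact values used in the activity:

```python
import ee

ee.Initialize()

# Hypothetical asset path for the imported ground-truth table.
gt = ee.FeatureCollection('users/your_username/chachoengsao_ground_truth')

s2 = (ee.ImageCollection('COPERNICUS/S2_SR')
      .filterDate('2020-01-01', '2021-01-01')
      .filterBounds(gt))

clouds = (ee.ImageCollection('COPERNICUS/S2_CLOUD_PROBABILITY')
          .filterDate('2020-01-01', '2021-01-01')
          .filterBounds(gt))

# Attach each image's s2cloudless probability layer via a join on system:index.
joined = ee.Join.saveFirst('cloud_prob').apply(
    primary=s2,
    secondary=clouds,
    condition=ee.Filter.equals(leftField='system:index',
                               rightField='system:index'))

def mask_clouds(img):
    img = ee.Image(img)
    prob = ee.Image(img.get('cloud_prob')).select('probability')
    return img.updateMask(prob.lt(40))  # 40% threshold is an assumption

masked = ee.ImageCollection(joined).map(mask_clouds)
```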

Once all of those feature collection points are obtained, export the point data to CSV files. Exports are needed for each month from January to December, for all four classes.
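A sketch of one such export, assuming each feature carries a 'class' property and a monthly median composite is sampled at the points (property names, dates, and scale are illustrative):

```python
# Sample the January 2020 composite at the Water points (10 m Sentinel-2 scale).
sample = (masked.filterDate('2020-01-01', '2020-02-01').median()
          .sampleRegions(collection=gt.filter(ee.Filter.eq('class', 'Water')),
                         scale=10,
                         geometries=True))

task = ee.batch.Export.table.toDrive(collection=sample,
                                     description='water_2020_01',
                                     fileFormat='CSV')
task.start()  # repeat for each class and each month
```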

Once all of the CSV data has been exported, download it to the local machine for local data processing. Visual Studio Code, a Miniconda environment, and Python are needed for this step. The required libraries are the standard machine learning and data processing stack (e.g., pandas, scikit-learn, NumPy, matplotlib), plus the Earth Engine API and rasterio for opening GeoTIFF files.

After all of the needed data and tools are prepared, the next step is merging the CSV files within each class. This can be done easily with the pandas library in Python, and it yields a large amount of data per class.
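For example, assuming the monthly exports follow a naming pattern like water_2020_01.csv through water_2020_12.csv (the paths are hypothetical), the merge could look like this:

```python
import glob
import pandas as pd

# Collect the twelve monthly exports for one class and stack them into one table.
files = sorted(glob.glob('data/water_2020_*.csv'))
water = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
water.to_csv('data/water_merged.csv', index=False)
```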

Next, all of the parameters need to be calculated. In this case, the parameters are only NDVI, NDWI, and NDBI. Compute each index from the corresponding Sentinel-2 bands using its standard formula.
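The standard index formulas for Sentinel-2 use B3 (green), B4 (red), B8 (NIR), and B11 (SWIR1). A sketch, assuming the exported CSVs kept the raw band values under those column names:

```python
import pandas as pd

df = pd.read_csv('data/water_merged.csv')

# Sentinel-2 spectral indices: B3 = green, B4 = red, B8 = NIR, B11 = SWIR1.
df['NDVI'] = (df['B8'] - df['B4']) / (df['B8'] + df['B4'])
df['NDWI'] = (df['B3'] - df['B8']) / (df['B3'] + df['B8'])
df['NDBI'] = (df['B11'] - df['B8']) / (df['B11'] + df['B8'])
```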

Next, visualize the data. If outlier removal is needed, it should be done to clean the data of bias. Each class has different criteria for which data points should be removed and which should not. In this case, outlier removal is necessary, with the side effect of reducing the amount of data available for training.
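One common way to do this is interquartile-range (IQR) filtering; the exact rule used here is not specified, so the sketch below is one reasonable choice, applied per class:

```python
import pandas as pd

def remove_outliers_iqr(df, columns, k=1.5):
    """Keep rows within [Q1 - k*IQR, Q3 + k*IQR] for every given column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

water_clean = remove_outliers_iqr(df, ['NDVI', 'NDWI', 'NDBI'])
```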

After removing the unnecessary data points, either downsampling or upsampling is performed. In this case, downsampling is used, in order to correct the class imbalance and thereby improve model performance.
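A minimal downsampling sketch, assuming one cleaned DataFrame per class (the variable names are hypothetical): each class is randomly reduced to the size of the smallest one.

```python
from sklearn.utils import resample

# Hypothetical per-class tables after outlier removal.
classes = {'buildings': buildings_clean, 'water': water_clean,
           'cropland': cropland_clean, 'woodland': woodland_clean}

n_min = min(len(df) for df in classes.values())  # size of the smallest class

balanced = {name: resample(df, replace=False, n_samples=n_min, random_state=42)
            for name, df in classes.items()}
```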

Then merge all of the per-class CSV files into a single CSV file for kNN training. Hyperparameter tuning needs to be done first to find the best parameters for the model.
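A sketch of the merge and the tuning step with scikit-learn's GridSearchCV; the feature columns and the search grid are assumptions, chosen to match the parameters reported below:

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Merge the balanced per-class tables into a single labeled training set.
full = pd.concat([df.assign(label=name) for name, df in balanced.items()],
                 ignore_index=True)
full.to_csv('training_data.csv', index=False)

X = full[['NDVI', 'NDWI', 'NDBI']]
y = full['label']

param_grid = {'metric': ['minkowski'],       # assumed search grid
              'n_neighbors': [3, 5, 7, 9, 11],
              'p': [1, 2]}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print('Best Parameters:', search.best_params_)
```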

After searching, the result was Best Parameters: {'metric': 'minkowski', 'n_neighbors': 5, 'p': 2}. Then simply train the kNN with these parameters.
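With those parameters, the final fit is short:

```python
# Train the final kNN on the full training set with the best parameters found.
best_knn = KNeighborsClassifier(metric='minkowski', n_neighbors=5, p=2)
best_knn.fit(X, y)
```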

Now, K-Fold Cross-Validation needs to be done. Cross-validation is a statistical technique for assessing the performance of machine learning models. It is widely used in practical machine learning to evaluate and select the most suitable model for a given predictive task, thanks to its straightforward nature, ease of implementation, and ability to produce skill estimates with typically lower bias than other methods.
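A sketch using stratified 5-fold cross-validation (the fold count is an assumption):

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(best_knn, X, y, cv=cv)
print('Fold accuracies:', scores)
print('Mean accuracy:', scores.mean())
```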

The average confusion matrix across the folds also needs to be computed, to assess the performance of the trained model.
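One way to compute it is to collect the confusion matrix of each fold and average them element-wise, as in this sketch:

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

labels = sorted(y.unique())
matrices = []
for train_idx, test_idx in cv.split(X, y):
    model = KNeighborsClassifier(metric='minkowski', n_neighbors=5, p=2)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    pred = model.predict(X.iloc[test_idx])
    matrices.append(confusion_matrix(y.iloc[test_idx], pred, labels=labels))

avg_cm = np.mean(matrices, axis=0)  # average confusion matrix across folds
print(avg_cm)
```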

If all of the scores look fine, simply save the model for later use.
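Saving the fitted estimator with joblib is the usual approach for scikit-learn models (the filename is illustrative):

```python
import joblib

joblib.dump(best_knn, 'knn_lulc_chachoengsao.joblib')
# Reload later with: best_knn = joblib.load('knn_lulc_chachoengsao.joblib')
```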
