In our research team, we usually write blog posts about machine learning (ML) experiments, application prototypes or Python libraries. This time we are proud to present a fully functional product — a Land Cover Monitoring System (LCMS). In this blog post, we will present how it works and what are its main components.
In short, the purpose of LCMS is to process huge amounts of satellite data for a large geographical area, extract relevant information and provide them in a format that is easy to understand and take action on. To be more precise, we developed LCMS to perform automatic land cover classification, crop type classification and detection of changes in land cover through time. All that is done using ML and it works on a level of an entire country, in a continuous manner!
Overall we can divide LCMS into 4 parts:
- Collection and preparation of reference data
- Data-processing pipelines to run ML process and produce results
- Storage and serving of results
- Application for operators
It looks simple enough, but LCMS has actually been our main R&D project for more than a year. It brings together a lot of remote sensing knowledge, our Python ML framework eo-learn, in-house products Sentinel Hub services and Geopedia, and front-end know-how.
So far we have implemented a production-ready version of LCMS for the Republic of Azerbaijan. It is being used by governmental officials to automatically detect and track how the land is changing through time. We have also utilized LCMS for numerous experiments in Slovenia and recently we have been developing it for Turkey.
There are 2 kinds of data LCMS requires as an input. One is satellite data, which is accessible through Sentinel Hub services in a simple manner, and the other one is ground-truth reference data.
The quality and quantity of ground-truth polygons are extremely important for LCMS since it employs supervised ML algorithms to solve classification problems. There is a general rule of ML that results can only be as accurate as the data used for training. This clearly applies to our use-case as well.
No matter what kind of land cover classification is performed, all types of surface from the area of interest have to be well represented in the training data. For land cover classification at the country level, we need ground-truth polygons that are evenly distributed over the entire country. In the end it doesn’t matter that much whether they cover 1%, 10% or 100% of the country, as long as the surface of the covered area is as diverse as the surface of the entire country.
Good sources of land cover data are cadastral databases and LPIS systems. Nowadays, many countries already have such systems and they spend a lot of manual work to keep the data reasonably accurate and up-to-date. However, in some cases such data doesn’t exist or is not accessible. Then the alternatives are to manually label data yourself or to use various open data sources and fuse them together. One such example is Open Street Map, which is maintained by a large community of people.
Similarly, for crop type classification, ground-truth data has to evenly cover all arable land. It turns out that ML models trained on one year don’t perform well for other years. Therefore, accurate data is required for each year separately, and for each crop rotation if applicable. One way to obtain large quantities of such data is to use farmer declarations.
The most important part of LCMS are data-processing pipelines. Their job is to perform the entire ML process, which spans from collecting data to producing results. To learn how exactly we do this process, take a look at our previous blog posts about Land cover classification with eo-learn (part 1, part 2 and part 3). Pipelines are simply built on top of eo-learn tasks and workflows, providing additional functionalities for upscaling the area of interest, delegating workflow executions, storage management, logging, and various other useful utilities.
One important aspect of pipelines is that they split up the country into smaller chunks and for each one we create an EOPatch, which stores all the data from that location. This way pipelines can execute operations on multiple EOPatches in parallel and at the same time keep in memory only as much data as there are available resources. Furthermore, pipelines are able to divide the processing of EOPatches over multiple EC2 instances on Amazon Web Services (AWS). This makes the execution part of the pipeline much faster and more convenient.
Having the whole process split into multiple pipelines also helps to manage the complexity of development and enables us to easily experiment on only a subset of data. We have thus split our data-processing into the following pipelines:
- Download: Loads satellite imagery using Sentinel Hub, computes cloud and valid data masks and stores them into EOPatches.
- Features: Computes features such as NDVI, NDWI, NDBI, tasselated-cap indices, and performs temporal harmonization (i.e. interpolation or mosaicking). This part will be in the future offloaded to Sinergise’s next frontier — Euro Data Cube.
- Add reference: Transforms ground-truth reference polygons into raster masks.
- Sampling: Spatially samples features based on reference data. Only the subset of sampled locations will be used for ML training.
- Merge features: Merges sampled features together by concatenating them into
numpyarrays. The result are multiple ML-ready datasets.
- Model training: Uses the datasets as inputs to train and test a ML classifier and saves a trained model.
- Prediction: Runs a country-wide prediction based on the model. The result of this pipeline is classification maps.
- Post-processing: Performs various operations on classification maps, such as super-pixel segmentation, filtering, erosion, etc.
- Change detection: Compares the resulting maps for ‘before’ and ‘after’ and finds areas where it is most likely that a change has happened.
- Reporting: Collects various statistical information from resulting maps and creates reports for operators.
For Azerbaijan, we processed a region of 86.600 km² at 10m pixel resolution, which resulted in 2.3TB of data per year. We split the region into 1224 EOPatches.
Storage and services
Execution of data-processing pipelines requires a lot of computational power and it can take up to a few days. In the end the results are raster classification maps and vector polygons of detected changes. These results have to be stored somewhere and we need a service to provide them to operators whenever requested.
A solution for raster maps is the “Bring Your Own Data” feature of Sentinel Hub. The maps can be stored on an AWS S3 bucket. By following the instructions, we can configure the bucket permissions and a Sentinel Hub layer according to the structure of data in the bucket. Once that is done, Sentinel Hub services are able to read and provide any part of a map an operator would require.
Regarding vector data, a solution is to upload the polygons into a geospatial database and use a service for querying them. All this can be done using Geopedia. We can simply create a new Geopedia layer, import the data and then use Geopedia REST API to query the data.
Application for operators
Depending on the use case, classification results can be shown either in a simple viewer (e.g. Sentinel Playground) or they can be a part of a full GIS application.
For Azerbaijan, we developed a custom application that provides access to the entire archive of Sentinel-2 and Landsat 8 data.
It enables viewing classification maps, comparing them to each other and comparing them to satellite data.
The Validation Tool is one of the most relevant features of the application, particularly for land change detection. It provides a list of all detected changes in land cover. An operator can select a detected change, check the corresponding satellite imagery for that location and decide if the change has been correctly detected or if it is a false positive. In the latter case, the operator can decide to correct the change and submit it back to the system. In the end, operators get a list of land cover changes, which have been confirmed to be correct, and the system gets new reference data for the next iteration of the ML process.
At the moment LCMS is fully scalable and can be transferred to any area of interest. The amount of time and manual work needed to run the pipelines has decreased significantly. This is excellent news for data scientists, who can now run more experiments in a shorter amount of time and therefore focus more on research rather than engineering and data processing. So we’ll continue to improve classification and change detection algorithms and execute LCMS on numerous other regions around the world.