Utility of AI4Boundaries and EuroCrops as training datasets for field delineation

Leveraging GSAA datasets for field delineation

Published in Planet Stories · 14 min read · Aug 2, 2023

Written by Sara Verbič. Work performed by Devis Peressutti, Nejc Vesel, Matej Batič, Žiga Lukšič, Jan Geršak, Matic Lubej, Nika Oman Kadunc and Sara Verbič.

Automatic field boundary delineation in agriculture requires high-quality and diverse input data to train accurate and robust machine learning models. In this blog post we will describe our analysis of the AI4Boundaries and EuroCrops datasets. We will delve into the data preparation steps taken to optimize these datasets for training a field delineation model, highlighting their strengths and weaknesses.

A crop field serves as the fundamental management unit in agriculture, and accurately delineating field boundaries makes it possible to capture crucial information about field size, shape, and spatial distribution. Efficient and precise delineation of field boundaries has significant implications for a range of remote sensing applications and supplies crucial information to farmers and governmental organizations. Defining field boundaries, in turn, requires accurate digital records.

Globally, there are over 1.5 billion hectares of cropland, comprising more than 1.2 billion active field boundaries that change constantly with the agricultural seasons. Defining precise boundaries manually is time-consuming, which presents substantial challenges and disincentives [1]. Harnessing the potential of automated field boundary extraction across vast areas would greatly benefit various applications. Not only would it streamline the onboarding process for farmers and encourage wider adoption of digital agricultural services, but it would also enhance the quality of products and services delivered through remote sensing technologies. Automating the extraction process makes it possible to fill gaps where such data does not exist and to update field boundaries repeatedly. This gives a comprehensive view of how agricultural landscapes evolve over time due to anthropogenic activities and agricultural practices. Moreover, it helps address pressing issues such as climate change, food production and food security.

Field delineation

Building upon these considerations, we developed an automatic agricultural field delineation model based on Sentinel-2 imagery. The aim of the field delineation process is to automatically determine the boundaries of agricultural fields from satellite imagery, either to update existing but outdated records of fields, or to create new records altogether. Automatic delineation is based on spatial, spectral, and temporal properties of pixels belonging to the same field. Using a U-net based deep neural network, the tool predicts three image variables: the field’s segmentation, its boundary, and the distance of the segmented image points to the boundary. From these, a probability image of the field boundaries is constructed, either from a single image or from a time series of images. During the post-processing phase, the image prediction is converted into vector format, where each polygon represents the extent of a homogeneous agricultural parcel.

Fig 1. Polygon vectors defining agricultural parcels based on Sentinel-2 imagery.

To effectively train a field delineation model capable of generalizing to large areas, particularly on a global scale, a high-quality and diverse training dataset is crucial. This dataset should encompass a broad range of field types, climates, soil types, cropping systems, and management practices that are prevalent in different regions worldwide. However, there is one limitation — the scarcity of publicly available datasets of crop field boundaries, especially in regions with small-scale farming.

GSAA data

Within Europe, the GeoSpatial Aid Application (GSAA) data represents a highly valuable resource. It refers to the annual crop declarations made by European farmers for Common Agricultural Policy (CAP) area-based support measures. A GSAA element is always a polygon of an agricultural parcel with one crop (or a single crop group with the same payment eligibility). The GSAA is operated at the region or country level in the European Union’s (EU) 28 Member States (MS), resulting in about 65 different designs and implementation schemes over the EU [2].

The use of the GSAA data within Europe offers several advantages. The data is typically derived from high-resolution aerial imagery, ensuring high-quality and detailed information on individual fields. The majority of the GSAA parcels correctly match the agricultural land cover, except for a minority of incorrect parcels resulting from outdated information or incorrect applications. The data usually undergoes verification and quality control processes to ensure accuracy and reliability. These checks may involve on-site inspections, data validation, and cross-referencing with other agricultural data sources. Because it is collected on an annual basis, it provides consistent, high-quality references and annotations over the years. However, there are limitations to consider. Not all GSAA data is publicly available. Datasets from different regions are not compatible, nor are attributes semantically harmonized for the whole of Europe, which makes comparisons between countries more challenging. GSAA data is also sparse: it tends to be partial, since it is based on declarations and not all farmers declare their land. Finally, although in-situ controls act as a means of validating these self-declarations, they are sparse samples and cannot cover the entire area comprehensively [3]. Despite these drawbacks, the GSAA datasets remain a valuable source for field delineation and agricultural analysis in Europe.

Two notable GSAA based datasets covering Europe are AI4Boundaries and EuroCrops. These datasets were utilized individually in our research to train a new field delineation model.

AI4Boundaries

AI4Boundaries is a dataset of images and labels that are readily usable for training and comparing models focused on field boundary detection. It includes two AI-ready datasets, each consisting of pairs of images and labels, which facilitate model development and comparison. The first dataset is a multi-date compilation of Sentinel-2 monthly composites, suitable for large-scale retrospective analyses. The second dataset is a single-date collection based on orthophoto imagery. All the labels in these datasets are sourced from publicly available GSAA data, which are openly accessible for Austria, Catalonia, France, Luxembourg, the Netherlands, Slovenia, and Sweden in 2019.

To construct the dataset, data was selected using a stratified random sampling method that considered two landscape characteristics. The average parcel perimeter/area ratio (PAR) was computed for each grid cell and distributed into five percentile bins. Furthermore, the coverage percentage of parcels within each cell was divided into ten classes. These indicators provide a combined description of the prevalence of agriculture (i.e., the proportion of land covered by agriculture) and landscape fragmentation (i.e., the perimeter/area ratio) within each grid cell.

Fig 3. The stratification of the sampling is done based on perimeter area ratio (a) and proportion of parcels (b) in 4 km × 4 km grid cells.
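The stratification described above can be sketched in a few lines of Python. This is a minimal illustration, not the AI4Boundaries implementation: the field names `par` and `coverage`, the equal-width coverage classes, and the `per_stratum` parameter are assumptions made for the example.

```python
import random
from collections import defaultdict

N_PAR_BINS = 5        # percentile bins of the perimeter/area ratio (from the text)
N_COVER_CLASSES = 10  # classes of parcel coverage (from the text)


def assign_strata(cells):
    """Return one (par_bin, coverage_class) tuple per grid cell.

    Each cell is a dict with hypothetical keys:
      'par'      - average parcel perimeter/area ratio in the cell
      'coverage' - fraction of the cell covered by parcels (0..1)
    """
    # Rank cells by PAR and split the ranks into equally populated bins.
    order = sorted(range(len(cells)), key=lambda i: cells[i]["par"])
    par_bin = [0] * len(cells)
    for rank, i in enumerate(order):
        par_bin[i] = rank * N_PAR_BINS // len(cells)

    strata = []
    for i, cell in enumerate(cells):
        # Assumption: coverage classes are equal-width intervals of 10%.
        cover = min(int(cell["coverage"] * N_COVER_CLASSES), N_COVER_CLASSES - 1)
        strata.append((par_bin[i], cover))
    return strata


def stratified_sample(cells, per_stratum, seed=0):
    """Draw up to `per_stratum` cell indices at random from every stratum."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, stratum in enumerate(assign_strata(cells)):
        groups[stratum].append(idx)

    picked = []
    for stratum in sorted(groups):
        members = groups[stratum]
        picked.extend(rng.sample(members, min(per_stratum, len(members))))
    return sorted(picked)
```

Sampling per stratum rather than uniformly keeps rare combinations, such as highly fragmented landscapes with little agriculture, represented in the final set.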

The resulting AI4Boundaries dataset comprises 7,831 samples of 256 by 256 pixels for the 10-meter Sentinel-2 dataset and 512 by 512 pixels for the 1-meter aerial orthophoto dataset. Both datasets are accompanied by corresponding vector ground-truth parcel delineations, covering 2.5 million parcels and an area of 47,105 km². Additionally, pre-processed raster versions of the datasets are provided and ready for immediate use. These resources contribute to the convenience and efficiency of utilizing the AI4Boundaries dataset.

If you would like to learn more about the AI4Boundaries dataset, you can refer to the article published in the Earth System Science Data journal [2], which provides detailed information.

Before training a new model on the AI4Boundaries dataset, we manually reviewed some of the samples to assess the overall quality of the reference labels. For each sample, we scored the quality of the polygon labels on a scale from 1 to 5 according to their fullness (i.e. whether the polygons cover all the visible agricultural parcels) and correctness (i.e. whether the contours of the polygons match what is seen in the image), where 1 represents terrible (<20% are correct) and 5 represents great (>95% are correct). Spain, Luxembourg, the Netherlands, and Slovenia exhibited higher scores. Austria and France, on the other hand, scored poorly, primarily because of low fullness: only a small percentage of the visible agricultural parcels were covered by polygons.

When examining the best-performing countries on a monthly basis, the distribution of scores remains relatively consistent throughout the year, although certain months exhibit a lower prevalence of high scores. These months include January, May, September, October, November, and December. This suggests that the boundaries are less visible, if at all, in some months.

Fig 4. Graphs illustrating the performance of the best-performing countries (ES, NL, LU, SI) over the course of a year.

Generally, patches with a low number of fields tend to have lower scores, regardless of field size. To prepare the dataset for training, we implemented specific rules. Firstly, we calculated the area and count of polygons, and then applied a filtering criterion of count > 50 and area > 0.1. This filtering step allowed us to effectively distinguish the majority of low scores (1) from the remaining data. Notably, this differentiation was particularly evident in France, where a limited presence of polygons was observed, leading us to exclude samples from France entirely.
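The count and area rule can be expressed as a simple predicate. The sketch below is illustrative only: the keys `polygon_count` and `polygon_area` are hypothetical names, and the post does not state the unit of the area threshold, so it is kept unitless here.

```python
MIN_POLYGON_COUNT = 50  # threshold from the text
MIN_POLYGON_AREA = 0.1  # threshold from the text; unit not stated in the post


def keep_sample(sample):
    """Return True if a sample passes both filtering criteria.

    `sample` is a dict with hypothetical keys 'polygon_count' and
    'polygon_area'; both comparisons are strict, as in the text.
    """
    return (
        sample["polygon_count"] > MIN_POLYGON_COUNT
        and sample["polygon_area"] > MIN_POLYGON_AREA
    )


# Made-up samples for illustration.
samples = [
    {"id": "a", "polygon_count": 120, "polygon_area": 0.45},
    {"id": "b", "polygon_count": 12, "polygon_area": 0.30},
    {"id": "c", "polygon_count": 75, "polygon_area": 0.05},
]
kept = [s["id"] for s in samples if keep_sample(s)]
```

Only sample `a` passes: `b` has too few polygons and `c` too little polygon area.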

Additionally, we removed samples that received a score below 3, where a score of 3 indicates mediocre quality with approximately 60% correct polygons. With these rules in place, the dataset was refined and ready to be used for training our field delineation model, which should lead to a more accurate and reliable outcome. In addition, to better mitigate the effect of missing polygons, we trained our model while masking out the parts of the image where there are no GSAA polygons.
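To illustrate the idea of masking during training, here is a minimal sketch of a masked loss. The post does not say which loss function was used, so binary cross-entropy is an assumption chosen for simplicity, and the flat lists stand in for image tensors.

```python
import math


def masked_bce(pred, target, mask, eps=1e-7):
    """Mean binary cross-entropy over pixels where mask is 1.

    pred   - predicted probabilities in [0, 1]
    target - reference labels (0 or 1)
    mask   - 1 where GSAA polygons provide a label, 0 elsewhere
    Pixels with mask 0 do not contribute to the loss at all.
    """
    total, n_valid = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if not m:
            continue  # no reference label here, so skip the pixel
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
        n_valid += 1
    return total / n_valid if n_valid else 0.0
```

Because masked pixels are skipped, a model that correctly predicts a field in an undeclared area is not penalized for disagreeing with the incomplete reference.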

EuroCrops

EuroCrops is a dataset collection that brings together publicly available self-declared crop datasets from European Union countries. Participating countries are Austria, Belgium, Germany, Denmark, Estonia, Spain, France, Croatia, Lithuania, Latvia, Netherlands, Portugal, Romania, Sweden, Slovenia and Slovakia.

The dataset focuses on data collected within the subsidy control framework, resulting in one type of crop per parcel per year. However, some countries (such as Austria and Portugal) submit multiple crops per field, which can be extracted by examining the attribute tables or the country mappings provided [3].

The raw data acquired from these countries lacks a standardized and machine-readable taxonomy, primarily because crop names are usually provided in each nation’s language without standardized codes. To address this issue, a new Hierarchical Crop and Agriculture Taxonomy (HCAT) was developed to harmonize crop information across the European Union. The HCAT is incorporated as additional attributes in the shapefiles provided by EuroCrops. For each polygon there is a machine-readable HCAT name of the crop and an HCAT code. The code represents the class and its corresponding level. While level 3 only distinguishes between general Land Cover types like Arable Crops and Meadow, level 6 introduces, for instance, seasonal changes. Oats, which is a level 5 class like Rye, can therefore be divided into Winter Oats and Summer Oats. It is easy to cut the hierarchy at the desired level of granularity on the fly [4].

Fig 6. Additional attributes added to the shapefiles.
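Cutting the taxonomy at a chosen level can be sketched as a string operation on the code. The layout assumed below (a 10-digit code whose first digit pair is a shared prefix and whose following pairs encode levels 3 to 6) is an assumption for illustration and should be checked against the EuroCrops documentation; the example code value is made up.

```python
CODE_LENGTH = 10   # assumed number of digits in an HCAT code
PREFIX_DIGITS = 2  # assumed shared prefix covering the upper levels


def truncate_hcat(code, level):
    """Zero out everything below `level` (3..6) in an HCAT-style code.

    Assumption: after the prefix, each pair of digits encodes one level,
    starting with level 3.
    """
    if len(code) != CODE_LENGTH or not code.isdigit():
        raise ValueError(f"expected a {CODE_LENGTH}-digit code, got {code!r}")
    if not 3 <= level <= 6:
        raise ValueError("level must be between 3 and 6")
    keep = PREFIX_DIGITS + 2 * (level - 2)
    return code[:keep].ljust(CODE_LENGTH, "0")


# Hypothetical level-6 code, used only to show the mechanics.
example = "3301010602"
level_5 = truncate_hcat(example, 5)  # seasonal distinction removed
level_3 = truncate_hcat(example, 3)  # general land cover type only
```

Grouping polygons by the truncated code then yields classes at the chosen granularity without touching the original attribute.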

To curate a training dataset for field delineation we utilized Version 6 of the EuroCrops dataset. However, it should be noted that a subsequent release, Version 7, has since become available. In Version 7, an updated framework known as HCAT3 was introduced, which is not backward compatible with the previous version. The main improvements include the merging of spring and summer wheat categories, rectification of typographical errors, and the introduction of additional dimensions, specifically the “z” dimension [5].

During the inspection of the dataset for field delineation purposes we identified a few challenges. These challenges can be attributed to the participation of multiple countries in the EuroCrops project, each characterized by unique administrative structures, varying data availability, and distinct classification systems.

The available data spans from 2015 to 2022; however, not all countries have data available for each individual year. Furthermore, the data from different countries has been harmonized for specific years, which vary from country to country. Most countries have data harmonized for the year 2021, but France’s harmonized data, for instance, is from 2018.

Fig 7. List of countries and the corresponding years for which their data has been harmonized.

For certain countries, data is only partially available. In the case of Belgium, due to the federal structure, the data is split into two sets covering the regions of Flanders and Wallonia. So far only the Flemish data for the year 2021 has been harmonized within the EuroCrops project. Similarly, in Germany, datasets are not published on a national level, but by each federal state individually. Two separate datasets covering Lower Saxony and North Rhine-Westphalia have been acquired, depicting the crop situation of 2021. It should be noted that unlike other countries, where datasets are provided as a single file, these two datasets are provided separately. In Spain, data is distributed separately for each autonomous community, and as of now, data has been acquired and harmonized for the community of Navarra for the year 2021 [3]. We also noticed that harmonized data for Portugal is not available to download from Zenodo, but only from another provider, Sync&Share, where only Version 1 is available [4].

The number of classes included in each country’s dataset varies. The crop dataset for Croatia consists of 15 classes and does not include distinct classes for different crops. Instead, it uses a general value of the EC_hcat_n attribute called “arable_crops”. As a result, the level of detail is relatively limited. Similar situations can be observed in Spain and Lithuania, where the crop datasets also lack distinct classes for different crops. In contrast, the Netherlands’ crop dataset is characterized by a more extensive and detailed classification system. It comprises a total of 326 classes, allowing for a comprehensive representation of the various crops cultivated in the country [4].

After an initial analysis, we began curating the dataset by acquiring the available datasets from Zenodo. To enhance usability and facilitate analysis, we merged the individual datasets into a single comprehensive dataset, eliminating the need for separate files for each country or year. To speed up analysis and data handling, we introduced indexing and created two distinct datasets: one dedicated to geometries, ensuring faster processing and manipulation, and another containing the additional attributes and information needed for comprehensive analysis. Furthermore, we computed the area of each polygon and calculated the circumference-to-area (CA) ratio, which describes the shape characteristics of the agricultural fields. Finally, we examined the content and structure of the data, paying close attention to any potential anomalies or irregularities.
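The area and circumference-to-area computation can be illustrated without any geospatial library. The sketch below assumes simple polygons without holes, given as lists of (x, y) vertices in a projected coordinate system with metre units; in practice a library such as GeoPandas would typically handle this.

```python
import math


def ring_area(ring):
    """Area of a simple polygon via the shoelace formula."""
    n = len(ring)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]  # wrap around to close the ring
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def ring_perimeter(ring):
    """Sum of edge lengths, including the closing edge."""
    n = len(ring)
    return sum(math.dist(ring[i], ring[(i + 1) % n]) for i in range(n))


def ca_ratio(ring):
    """Circumference-to-area ratio; larger values mean less compact shapes."""
    area = ring_area(ring)
    if area == 0:
        raise ValueError("degenerate polygon with zero area")
    return ring_perimeter(ring) / area


# Two made-up fields with the same area (1 ha) but different shapes.
square = [(0, 0), (100, 0), (100, 100), (0, 100)]
strip = [(0, 0), (1000, 0), (1000, 10), (0, 10)]
```

Both fields cover 10,000 m², but the elongated strip has a much higher CA ratio than the square, which is why the ratio is useful for spotting slivers and other unusual shapes.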

We quickly noticed a variation in the quality of polygons between the datasets. Germany emerged as one of the countries with high data quality. In contrast, Croatia stood out for comparatively lower quality, mostly in terms of completeness. A considerable number of polygons are absent in certain locations, despite the visible presence of arable land in the satellite imagery.

During our inspection of the dataset, we also encountered 23 polygons in France with irregular shapes that extend beyond the borders, overlapping and covering other polygons within the dataset. These spatial discrepancies can introduce challenges and potentially lead to mistakes or unrepresentative results when working with the data.

Fig 9. Polygons in France with irregular shapes that extend beyond the borders.
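One simple way to flag such polygons is to compare each polygon's bounding box with a bounding box of the country. The post does not describe how the irregular polygons were found, so this is only a sketch of one possible check; the rough extent used for mainland France is an illustrative assumption in longitude and latitude degrees.

```python
# Rough, illustrative bounding box (min_lon, min_lat, max_lon, max_lat).
FRANCE_BBOX = (-5.5, 41.0, 10.0, 51.5)


def bounds(ring):
    """Bounding box of a polygon ring given as (x, y) vertices."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return min(xs), min(ys), max(xs), max(ys)


def exceeds_region(ring, region=FRANCE_BBOX):
    """True if any part of the ring's bounding box lies outside `region`."""
    min_x, min_y, max_x, max_y = bounds(ring)
    r_min_x, r_min_y, r_max_x, r_max_y = region
    return (
        min_x < r_min_x or min_y < r_min_y
        or max_x > r_max_x or max_y > r_max_y
    )


# Made-up examples: a small field and a polygon with one stray vertex.
field = [(2.30, 48.80), (2.31, 48.80), (2.31, 48.81), (2.30, 48.81)]
stray = [(2.30, 48.80), (20.00, 48.80), (2.31, 48.81)]
```

A bounding-box test is coarse but cheap, so it works as a first pass before a more precise comparison against the actual country border.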

It is important to take into consideration that certain classes within the EuroCrops dataset do not represent crops. To gain further insight, let’s examine three of these classes that have a significant overall polygon count. One notable class with a high overall polygon count is “not_known_and_other”, which exhibits high diversity between countries. In some cases, it includes categories related to agricultural land use, such as in Slovenia, where it includes unknown crops, a mixture of honey plants, a mixture of honey plants with other agricultural plants and others. On the other hand, in some countries, it can include classes not related to agricultural land use. For instance, in Spain, it encompasses urban zones, buildings, water currents, and unproductive areas. Additionally, there is a class labelled “other_tree_wood_forest” present in France and the Netherlands, while the class “tree_wood_forest” appears in the majority of the harmonized datasets [6].

Fig 10. Top 25 EC_hcat_n categories by area in hectares.
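Ranking categories by area, with or without the non-crop classes, takes only a short aggregation. In the sketch below the three class names come from the text, while the record layout with an `area_ha` key and the idea of excluding these classes are assumptions for illustration.

```python
from collections import defaultdict

# Class names mentioned in the text as not representing crops.
NON_CROP_CLASSES = frozenset(
    {"not_known_and_other", "other_tree_wood_forest", "tree_wood_forest"}
)


def area_by_class(records, exclude=frozenset()):
    """Total area per EC_hcat_n value, sorted from largest to smallest.

    `records` are dicts with 'EC_hcat_n' and a hypothetical 'area_ha' key.
    """
    totals = defaultdict(float)
    for rec in records:
        if rec["EC_hcat_n"] in exclude:
            continue
        totals[rec["EC_hcat_n"]] += rec["area_ha"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up records for illustration.
records = [
    {"EC_hcat_n": "arable_crops", "area_ha": 2.0},
    {"EC_hcat_n": "tree_wood_forest", "area_ha": 5.0},
    {"EC_hcat_n": "arable_crops", "area_ha": 1.5},
    {"EC_hcat_n": "not_known_and_other", "area_ha": 4.0},
]
crops_only = area_by_class(records, exclude=NON_CROP_CLASSES)
```

Calling `area_by_class(records)` without `exclude` gives the full ranking, which makes it easy to see how much of the total area the non-crop classes account for.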

Another aspect that warrants attention is the accuracy of class assignments. We found that a few crops in specific countries have been assigned incorrect EC_hcat_c codes. In most cases, the wrong code was assigned to flowers. For example, lilies in the Netherlands have been assigned the EC_hcat_c code for isatis_tinctoria_woad, which is another type of flower. While the count of wrongly labelled polygons is relatively low in many cases, there are exceptions that may demand closer examination. For example, there are 23,657 polygons of walnuts in France with incorrect EC_hcat_c codes, along with 3,283 tulips and 1,215 lilies in the Netherlands. These errors in code assignment can potentially lead to misinterpretations and inaccurate analyses of crop distribution.

Fig 11. List of crops with incorrectly assigned EC_hcat_c codes.
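A consistency check of this kind can be sketched by comparing each polygon's code with the code a reference taxonomy expects for its name. Everything specific below is a placeholder: the reference table, the class names used as keys, and the code values are invented for illustration and are not real HCAT entries.

```python
from collections import Counter

# Hypothetical reference table: class name -> expected code (placeholders).
REFERENCE_CODES = {
    "lilies": "0000000001",
    "tulips": "0000000002",
    "isatis_tinctoria_woad": "0000000003",
}


def find_code_mismatches(records, reference=REFERENCE_CODES):
    """Count polygons whose EC_hcat_c differs from the reference code.

    Returns a Counter keyed by (country, EC_hcat_n). Names missing from
    the reference table are skipped rather than reported.
    """
    mismatches = Counter()
    for rec in records:
        expected = reference.get(rec["EC_hcat_n"])
        if expected is not None and rec["EC_hcat_c"] != expected:
            mismatches[(rec["country"], rec["EC_hcat_n"])] += 1
    return mismatches


# Made-up records: two lily polygons carry the placeholder code for woad.
records = [
    {"country": "NL", "EC_hcat_n": "lilies", "EC_hcat_c": "0000000003"},
    {"country": "NL", "EC_hcat_n": "lilies", "EC_hcat_c": "0000000003"},
    {"country": "NL", "EC_hcat_n": "tulips", "EC_hcat_c": "0000000002"},
]
report = find_code_mismatches(records)
```

Counting per country and class gives a quick overview of where mismatches concentrate before deciding whether to correct or drop the affected polygons.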

The presence of incorrect EC_hcat_c codes assigned to certain crops in specific countries can be attributed to the challenges involved in the harmonization process. Automating the translation process proved challenging due to country-specific agricultural terms causing mistranslations. As a result, manual correction of translations was necessary to ensure accuracy. Mapping the translations to the HCAT taxonomy required manual checks due to the diverse crop classes declared by the member states [3].

To enable locational grouping, facilitate visualization through heatmaps, and improve our understanding of the spatial distribution of crops, we incorporated a hierarchical geospatial indexing system [7]. Additionally, we augmented the dataset with ecoregion information, enabling us to assess coverage across different biomes within the European Union. We split the whole of Europe into a regular grid, where each cell of the grid is an EOPatch, as it will be processed in our eo-learn library. These preprocessing steps laid the foundation for further analysis and model development, ensuring that the dataset was ready for training the field delineation model.

Fig 12. A heatmap of winter rapeseed distribution.
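The idea of grouping polygons by location and counting them per cell can be shown with a plain square grid. This is a simplified stand-in, not the hierarchical index referenced in [7] and not the eo-learn grid itself; the cell size, the origin, and the use of projected coordinates in metres are assumptions for illustration.

```python
import math
from collections import Counter

CELL_SIZE = 10_000  # hypothetical cell size in metres


def cell_index(x, y, origin=(0.0, 0.0), cell_size=CELL_SIZE):
    """Return the (column, row) of the grid cell containing point (x, y)."""
    col = math.floor((x - origin[0]) / cell_size)
    row = math.floor((y - origin[1]) / cell_size)
    return col, row


def heatmap_counts(points, cell_size=CELL_SIZE):
    """Count points (e.g. polygon centroids of one crop) per grid cell."""
    return Counter(cell_index(x, y, cell_size=cell_size) for x, y in points)


# Made-up centroids of polygons belonging to a single crop class.
centroids = [(1_200, 3_400), (8_900, 9_999), (25_000, 9_999), (-1, 0)]
counts = heatmap_counts(centroids)
```

The resulting counts per cell can be passed to any plotting tool to draw a heatmap, and the same cell index serves as a grouping key for per-region statistics.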

Conclusion

Overall, the utility of the AI4Boundaries and EuroCrops datasets as training datasets for field delineation is significant. Both datasets have undergone preprocessing steps to refine the data and ensure better training outcomes. The AI4Boundaries dataset went through a filtering process based on polygon quality, excluding low-scoring and underrepresented data. The EuroCrops dataset was consolidated into a comprehensive dataset, and various attributes were computed for enhanced analysis. Although the dataset has challenges due to varying administrative structures, it provides valuable information. The introduction of the Hierarchical Crop and Agriculture Taxonomy (HCAT) helps harmonize crop information across the European Union, ensuring standardized and machine-readable crop data and enabling comprehensive analyses of crop distribution on a larger scale. Compared to AI4Boundaries, EuroCrops stands out due to its broader coverage of countries, making it more representative. However, to further enhance the applicability of the models, the need for harmonized datasets covering more countries, especially outside Europe, is evident.

References

[1] https://www.linuxfoundation.org/press/agstack-first-dataset-field-boundaries

[2] https://doi.org/10.5194/essd-15-317-2023

[3] http://dx.doi.org/10.48550/arXiv.2302.10202

[4] https://github.com/maja601/EuroCrops

[5] https://zenodo.org/record/6868143#.ZHb_eHZBxD8

[6] https://github.com/maja601/EuroCrops/tree/main/csvs/country_mappings

[7] https://www.uber.com/blog/h3/

Presented work has been done within the Open-Earth-Monitor Cyberinfrastructure project that has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement №101059548.