10 Hard Lessons Learned For Creating a Dataset in Our Crops Identification Challenge to Fight Hunger
If you want to use satellite imagery to predict specific crops grown in various seasons, this article will give you a head-start.
Article authored by Jayasudan Munsamy and Łukasz Murawski. Other project team members were James Tan, Thomas Chambon, Alexander Epifanov, Erick Galinkin, Shefalika Gautam, Radhika Menon, Javier Perez Tobia, Saqib Shamsi, Sai Praveen.
Hi there! This article is the result of working on Omdena’s AI challenge to estimate crop yields with the UN World Food Programme in Nepal. The problem was tough, the challenges were huge, and resources were scarce. Still, a community of 36 collaborators managed to build a solution with 89% accuracy.
This article details the problems we had to solve and lessons learned in creating an appropriate dataset.
Happy reading :)
The problem to be solved — hunger
An estimated 821 million people — one in nine — go to bed on an empty stomach each night. Even more — one in three — suffer from some form of malnutrition. People in conflict-affected countries are the most exposed: they are three times more likely to be undernourished than those living in peace. Climate change and ever more frequent and severe natural disasters also affect food security around the world, which is especially noticeable in less wealthy countries.
Nepal is among the world’s poorest countries, ranking 149th out of 189. Challenging geography, civil unrest and a lack of infrastructure complicate efforts to improve livelihoods, establish functioning markets and transport food. One-quarter of Nepal’s population lives below the national poverty line, on less than US$0.50 per day. Approximately 36 percent of Nepali children under 5 are stunted, while 27 percent are underweight, and 10 percent suffer from wasting due to acute malnutrition.
The impact of climate change is further expected to result in more frequent and intense disasters that threaten to undermine the country’s progress to date. Located in one of the most seismically active zones in the world, Nepal is also subject to forceful earthquakes. The 2015 earthquakes wiped out 25% of that year’s GDP, refocusing attention on this ever-present threat. (Source)
The World Food Programme (WFP) is the food-assistance branch of the United Nations and the world’s largest humanitarian organization addressing hunger and promoting food security. Its mission is to save lives and change lives by delivering food assistance in emergencies and working with communities to improve nutrition and build resilience. Operating in 83 countries, WFP assists around 86 million people each year. It’s at the forefront of the fight against hunger, which the international community has committed to end by 2030.
Together with WFP in Nepal, Omdena organized a crops yield prediction challenge. The goal was to use satellite imagery to predict specific crops grown in a season, to estimate crop yields. The solution will help WFP to better manage the supply and distribution of food aid and ultimately, to reduce hunger and malnutrition in the region.
A community of around 36 collaborators from 15 countries worked together to build the solution: people with different skill sets and backgrounds — students and experienced professionals, some just starting their AI adventure, and a few others leading and steering the project in the right direction. All of them volunteered for the project in their spare time, devoting evenings, quite often nights, and weekends for over 8 weeks to help the people of Nepal.
The one thing we all had in common was the desire to help and to make a positive impact on the world.
10 Powerful Lessons Learned
The challenges mentioned here are in no particular order. None of them should be taken lightly, as any of them can derail the project or impact the outcome of the ML models.
1. Insufficient ground truth data for crop types
Ground truth (GT) data was provided by the Central Bureau of Statistics (CBS) in Nepal, based on crop cultivation surveys from 2016. Unfortunately, the data had these shortcomings — a) it covered very few crop fields (which limited the volume of satellite images the team could gather); b) it had incomplete information on the dimensions of the fields (which impacted labeling and masking tasks); c) it had significantly different numbers of data points for each crop type, which created class imbalance in the dataset; d) it had incomplete information about the land use cycle/pattern — the same field being used for different crop types in different seasons (which impacted dataset creation and the generalization ability of the ML model).
Lesson(s) learned — a) make sure to gather absolutely all the ground truth details necessary for dataset creation (for crop type identification, GT should specify the crop field ID, dimensions, GPS location, shape, size, crop cultivated, seasonal crops for the field, crop cycle details, land use patterns, etc.); if that is not possible, improve it iteratively with the help of subject matter experts in the domain and local farmers; b) the information in ground truth data is necessary not only for dataset creation but also for labeling/masking and for building generalized ML models able to identify crops irrespective of seasonal cultivation & different land-use patterns.
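One way to soften the class imbalance described above is to weight each crop type by its inverse frequency before training. A minimal sketch, using made-up field counts (the real CBS survey data had a similar skew across crop types):

```python
from collections import Counter

# Hypothetical ground-truth labels per surveyed field, skewed toward rice.
labels = ["rice"] * 120 + ["wheat"] * 40 + ["maize"] * 15 + ["potato"] * 5

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Inverse-frequency class weights (the same scheme as scikit-learn's
# "balanced" option): rare classes get proportionally larger weights.
class_weights = {crop: n_samples / (n_classes * c) for crop, c in counts.items()}

for crop, w in sorted(class_weights.items()):
    print(f"{crop}: {w:.3f}")
```

These weights can then be passed to the loss function of most classification frameworks, so that under-represented crops contribute as much to the gradient as rice does.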
2. Lack of knowledge of satellite imagery & processing techniques
Though the AI enthusiasts from various countries who volunteered for the project had different levels of experience with ML model building, none had knowledge or experience of working with satellite images. This meant that the team first had to ramp up on the details of satellite imagery: the different formats available, the various sources of imagery, the possibilities of the various data bands in images, the imagery suitable for crop identification, the processing techniques to apply before using the images, and many more fine-grained details required to build a dataset. Though the team managed to ramp up and build decent datasets to experiment with a few ML model architectures for classification & segmentation, we had to go through painful iterations to figure out suitable images. Key questions for which we could not find answers quickly were — a) which satellite imagery source to use; b) which readily available dataset to use; c) can we use raw satellite images; d) what pre-processing should we do if we use raw satellite images; e) what pre-processing should we do if we use a readily available dataset; f) what spatial resolution should the images have; g) should we scale the image to make crop fields bigger for better visibility and, if yes, at what scale level; h) how to deal with clouds in images; i) which spectral bands in satellite images should we use; j) which vegetation indices should we use.
Lesson(s) learned — a) get the basics of satellite imagery right as early as possible in the project cycle, to be able to decide what goes into the dataset. This is absolutely crucial to avoid a ‘wild goose chase’ through the multitude of datasets available out there; b) explore both paid and free satellite imagery to understand the pros & cons.
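Part of “getting the basics right” is knowing how pixel values are stored. Sentinel-2 Level-1C, for instance, ships top-of-atmosphere reflectance as integer digital numbers scaled by 10,000. A minimal sketch of rescaling and stacking two bands, using tiny synthetic arrays in place of the rasters you would read with a library such as rasterio:

```python
import numpy as np

# Sentinel-2 L1C stores reflectance as integers scaled by 10 000; a common
# first step is converting back to [0, 1] reflectance. These 2x2 arrays are
# synthetic stand-ins for real band rasters.
QUANTIFICATION = 10_000

red_dn = np.array([[2500, 3000], [1200, 800]], dtype=np.uint16)   # band B4
nir_dn = np.array([[5200, 6100], [4300, 3900]], dtype=np.uint16)  # band B8

red = red_dn.astype(np.float32) / QUANTIFICATION
nir = nir_dn.astype(np.float32) / QUANTIFICATION

# Stack bands into an (H, W, C) array, the layout most ML pipelines expect.
stack = np.dstack([red, nir])
print(stack.shape)  # (2, 2, 2)
```

The same pattern extends to however many of the 13 Sentinel-2 bands the task needs.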
3. Lack of understanding of dataset requirements
Though this challenge appears linked to the ‘lack of knowledge on satellite images’ point, it is an entirely different one. Sometimes, even with high-quality images, the task may not be achievable if the appropriate data (spectral bands) in the images is not used. For example, the RGB bands in satellite images are not always sufficient — while they may enable ML models to pick up visual cues to identify different types of vegetation (bush/trees/grass/etc.), they may not be enough to produce highly accurate, generalized ML models that can identify specific crop types, the growth stage of a crop, or the pest infestation level of crops, etc. across seasons and land use patterns. Such tasks mandate the use of data in other spectral bands, like near-infrared or shortwave infrared. So, we were searching for answers to questions like — a) should we get satellite images for the whole crop cycle or just the key growth stages (this question may appear trivial, but in our case rice fields were a challenge, as they are filled with stagnant water for the first few weeks after sowing and can appear like ponds in satellite images); b) can we substitute images from other years if a year’s images are unavailable or unusable; c) how should we treat barren fields (cleaned up after harvesting and before sowing the next crop) and fields that have just been sown; d) what kind of data augmentation should be done to increase dataset size; e) in case we decide to use spectral bands other than RGB, what kind of processing is needed; f) how do we decide on the acceptable percentage of distractions in images (for example, cloud cover, haze or glare).
Lesson(s) learned — based on the objectives of the project, define the satellite imagery dataset requirements: source of the dataset, image quality, spatial resolution, temporal resolution, spectral bands to be used, distraction levels accepted in the images (e.g. % of cloud cover, haze, etc.), use of images from different time periods to compensate for bad-quality images, pre-processing techniques to enhance quality or remove distractions in images, etc. (read all the challenges to understand everything that is required of the dataset).
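Several of these requirements (maximum cloud cover, acceptable acquisition dates) can be enforced as a simple metadata filter before any pixels are downloaded. A sketch with hypothetical scene records; real Sentinel-2 catalog metadata exposes an analogous CLOUDY_PIXEL_PERCENTAGE property:

```python
# Hypothetical scene metadata; in practice this comes from the provider's
# catalog API rather than a hard-coded list.
scenes = [
    {"id": "S2_20160115", "month": 1,  "cloud_pct": 12.0},
    {"id": "S2_20160712", "month": 7,  "cloud_pct": 85.0},  # monsoon scene
    {"id": "S2_20161003", "month": 10, "cloud_pct": 25.0},
    {"id": "S2_20161120", "month": 11, "cloud_pct": 8.0},
]

# Acceptable distraction level, decided upfront as a dataset requirement.
MAX_CLOUD_PCT = 30.0

usable = [s for s in scenes if s["cloud_pct"] <= MAX_CLOUD_PCT]
print([s["id"] for s in usable])  # the monsoon scene is rejected
```

Writing the requirement down as a threshold like this also makes it easy to revisit when the first iteration of the dataset turns out to be too small.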
4. Lack of high-quality satellite images
The quality of satellite images depends directly on the effort spent on them by government/private organizations, and hence on cost. While freely available datasets have lower quality (coarse spatial resolution), commercially available datasets are far better but expensive. Though the team concluded that the freely available datasets were not good enough for the job (after ramping up on satellite imagery knowledge), they were left with no other option, as there was very little time left to connect with commercial dataset providers, evaluate their datasets, and create our own dataset. So, the team had to settle for freely available low-quality images from the Sentinel-2 Level-1C dataset, which were not even good enough for labeling & masking.
Lesson(s) learned — a) different tasks need different levels of image quality; b) the image quality required for the task in hand needs to be identified as early as possible (this may be a bit of a challenge in the beginning and may take a few iterations); c) the quality of images is important not only for ML model performance but also for labeling & masking and image transformation tasks; d) image quality is not defined by the RGB bands alone; data in the other spectral bands must also be considered (a few datasets may have processed/artificially scaled data in some spectral bands, which may impact ML model performance).
5. Terrain challenges
In our area of interest, crop fields were spread across different terrains: plains, hilly regions, forest regions, river banks and, in some cases, even river beds. So, the satellite images of crop fields from these different terrains posed unique challenges. Fields in the plains did not have clear boundaries between them, had trees and buildings in the middle of the fields, and in some cases roads/pathways passed through them. Fields in the hilly regions had no proper boundaries; parts of a single field were at a different altitude than the rest of the field, and some fields were covered by the shadows of hills. Fields on river banks and river beds had no proper boundaries and no seasonal crops in them.
Lesson(s) learned — a) look at satellite images from different regions within the area of interest, and across time periods/seasons, to understand the terrain; b) include images from various terrains in the dataset; c) consider the challenges of images from different terrains while labeling/masking and address them as much as possible with appropriate labeling; d) while it’s important to have images from different terrains in the dataset, it’s equally important to exclude the ones that are real outliers or extremely rare.
6. Lack of images with temporal data
As June through September is the monsoon season in Nepal, we could not get satellite images for these months due to heavy cloud cover (a minimum of 60% and as high as 90% in all the images). This meant we could not see what paddy fields looked like in the first 3 months of their growth cycle (the main paddy crop cycle runs from Jun through Nov). Since the monsoon months are the same every year, we could not get June-September images even from other years, and hence the dataset was incomplete. Another unique challenge was that the growth cycle of a crop differed across terrains — the growth cycle for rice in the plains ran from Jun through Oct/Nov, but the cycle for a different variety of rice fell in the early months of the year (exact months unknown), and we got to know this too late in the dataset creation cycle, as it was not in the ground truth data.
Lesson(s) learned — a) factor in such data loss (time periods for which images cannot be gathered/used) and compensate with images from previous/following years for that time period, if possible; b) if it’s absolutely not possible to gather images for a certain time period (as in our case), define the capabilities of the ML model accordingly and set expectations for inference appropriately; c) variations of the same object of interest (in our case, different varieties of rice) have to be treated as different objects/classes, with data for them included in the dataset; d) our satellite imagery dataset is at the mercy of Mother Nature :)
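Lesson (a) above, falling back to another year’s image for a missing time period, can be sketched as a small scene-selection helper. The scene records and cloud threshold are hypothetical:

```python
# Hypothetical per-scene metadata across two years.
scenes = [
    {"year": 2016, "month": 10, "cloud_pct": 70.0},
    {"year": 2015, "month": 10, "cloud_pct": 15.0},
    {"year": 2016, "month": 7,  "cloud_pct": 90.0},  # monsoon: unusable
    {"year": 2015, "month": 7,  "cloud_pct": 85.0},  # monsoon: unusable
]

MAX_CLOUD_PCT = 30.0

def best_scene_for_month(month, scenes):
    """Least-cloudy usable scene for a month, searching across all years."""
    candidates = [s for s in scenes
                  if s["month"] == month and s["cloud_pct"] <= MAX_CLOUD_PCT]
    return min(candidates, key=lambda s: s["cloud_pct"], default=None)

print(best_scene_for_month(10, scenes))  # falls back to the 2015 October scene
print(best_scene_for_month(7, scenes))   # None: every monsoon year is clouded out
```

The `None` case is exactly the monsoon-gap situation described above, where no year can fill the hole and the model’s capabilities have to be scoped down instead.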
7. Land use pattern for cultivation
In our case, though there was a defined crop cycle for the crops we had to identify using satellite images, we learned late in the dataset creation stage (as the ground truth data did not cover this critical information) that the same fields are used for different crops in different seasons. That is, field A can have rice as the main crop between Jun and Nov, then be used for wheat cultivation between Dec and Feb, and for other short-lifecycle vegetable crops between Mar and May. This forced us to revisit all the temporal data we had created and relabel it appropriately. Without ground truth data covering such inputs, satellite images from different months showing short-lifecycle seasonal crops were wrongly understood and labeled as the main crop, leaving our dataset incomplete and incorrect, and resulting in inaccurate ML models.
Lesson(s) learned — a) understand the domain (in this case agriculture); talk to subject matter experts and farmers in the area of interest to gather as much knowledge as possible about the crop cycle, land-use patterns, pest infestation and its effect on crops, any natural calamity the area is prone to (like floods or fire), and any other information that could help; b) filter out, or consider adding, special cases of satellite images in the dataset; c) include temporal data based on the land use pattern, crop cycle, seasonal crops cultivated, and any other practices followed in the area that need to be considered.
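Lesson (c) above amounts to making the label a function of both the field and the acquisition month. A sketch with an illustrative crop calendar mirroring the field-A example from the text (rice Jun-Nov, wheat Dec-Feb, vegetables Mar-May); the field ID and calendar are hypothetical:

```python
# Illustrative land-use calendar: (months, crop) entries per field.
CROP_CALENDAR = {
    "field_A": [
        (range(6, 12), "rice"),       # Jun-Nov: main crop
        ([12, 1, 2], "wheat"),        # Dec-Feb
        (range(3, 6), "vegetables"),  # Mar-May: short-lifecycle crops
    ],
}

def label_for(field_id, month):
    """Label for a satellite image of a field, given its acquisition month."""
    for months, crop in CROP_CALENDAR[field_id]:
        if month in months:
            return crop
    return "unknown"

print(label_for("field_A", 7))   # rice
print(label_for("field_A", 1))   # wheat
print(label_for("field_A", 4))   # vegetables
```

Had such a calendar existed from the start, the relabeling pass described above would have been a lookup rather than a rework of the whole temporal dataset.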
8. Insufficient labeling & masking of the dataset
The team decided to work on two approaches with respect to ML model architecture — classification & segmentation models. The level of labeling required for these models is entirely different: while classification only requires labeling images with the corresponding class, segmentation requires detailed marking of land parcels in satellite images. But, due to low image quality and incomplete ground truth information on field dimensions and exact locations, the team was unable to mark the land parcels clearly in the dataset. Also, only after a few iterations with segmentation models did the team realize that labeling almost all the objects in an image is necessary to improve model accuracy (for example, in addition to marking the crop field, it is necessary to mark nearby trees, pathways, barren land, rivers, etc. to help models learn); again the team was limited by image quality and incomplete ground truth info, and eventually ran out of time before completing labeling for the entire dataset.
Lesson(s) learned — a) define labeling requirements for the various ML model architectures well in advance; b) use every bit of information available to mark/label objects of interest in images (ground truth info, satellite images of the same object from different months/years/seasons, etc.); c) label/mark all or most of the objects visible in an image along with the object of interest; d) make sure the masks created from marking/labeling are as precise as possible; e) take extra care in marking/labeling similar-looking objects in an image (e.g. a grass field and the initial growth stages of a paddy field may look alike in a satellite image); f) labeling can be improved iteratively using feedback & inputs from the dataset creation stage, model building, and model performance.
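The segmentation labeling described above produces a dense mask in which every pixel, including “background” objects like trees and pathways, carries a class index. A toy sketch with a hypothetical 4x4 tile and class map:

```python
import numpy as np

# Hypothetical class map; a real project would also include rivers,
# buildings, barren land, etc.
CLASSES = {"background": 0, "rice": 1, "tree": 2, "pathway": 3}

# Toy 4x4 tile standing in for a satellite image chip.
mask = np.zeros((4, 4), dtype=np.uint8)
mask[0:2, :] = CLASSES["rice"]     # top half: rice field
mask[2, :] = CLASSES["pathway"]    # a path crossing the tile
mask[3, 3] = CLASSES["tree"]       # a lone tree in the corner

# Per-class pixel counts are a quick sanity check on label coverage.
counts = {name: int((mask == idx).sum()) for name, idx in CLASSES.items()}
print(counts)  # {'background': 3, 'rice': 8, 'tree': 1, 'pathway': 4}
```

In practice the mask is rasterized from labeled polygons (with a tool such as rasterio or QGIS) rather than written by hand, but the per-class coverage check applies either way.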
9. Lack of standard techniques for crop type identification
While there are research papers citing multiple techniques for crop type identification (combinations of satellite image pre-processing, calculating Vegetation Indices (VIs), and various ML algorithms and deep learning model architectures), there is no standard implementation that guarantees a foolproof, high-accuracy solution. This is because of factors unique to each region: climatic conditions, soil differences, vegetation index variation across regions, the quality of the satellite images used, the test areas/land parcels chosen, the preparation of these land parcels for experiments, etc.
Lesson(s) learned — for specialized tasks like differentiating vegetation types, data from the RGB bands (visual cues in the satellite image) is not enough to get high-accuracy results. It is necessary to analyze the data contained in the other spectral bands of satellite images (anywhere from 3 to 16 bands), beyond just RGB. This is where Vegetation Indices (VIs) play a crucial role. Vegetation Indices are combinations of surface reflectance at two or more wavelengths, designed to highlight a particular property of vegetation. Since each VI accentuates a particular vegetation property, we need to find and choose the ones suitable for our task.
a) experiment with various combinations of VIs & pre-processing techniques to identify the ones that suit the dataset in hand; b) prepare a few datasets using various VIs & pre-processing techniques and try them out with multiple ML algorithms and model architectures.
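As a concrete example of lesson (a), two of the most common VIs, NDVI and EVI, reduce to a few lines of array arithmetic once reflectance bands are available. The reflectance values below are synthetic; on Sentinel-2, NIR is band B8, red is B4, and blue is B2 (after scaling digital numbers to [0, 1]):

```python
import numpy as np

# Synthetic surface reflectance for three pixels:
# dense vegetation, moderate vegetation, bare soil.
nir  = np.array([0.55, 0.42, 0.10])
red  = np.array([0.08, 0.12, 0.09])
blue = np.array([0.05, 0.07, 0.08])

# NDVI = (NIR - Red) / (NIR + Red), in [-1, 1]; dense vegetation scores high.
ndvi = (nir - red) / (nir + red)

# EVI with the standard MODIS coefficients; less prone to saturation
# over dense canopies than NDVI.
evi = 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)

print(np.round(ndvi, 3))
print(np.round(evi, 3))
```

Running several such indices over the same pixels, and seeing which ones separate the crop classes best, is exactly the kind of experiment lesson (a) calls for.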
10. Lack of knowledge on satellite imagery platforms & tools
It may seem trivial to choose the platform or tool offered by a satellite imagery dataset provider for dataset creation, pre-processing, and export of images for ML model training. But it is not. As a team, we did not know the platforms provided by the various satellite imagery providers. This meant we first had to learn, then play with, each provider’s free sandbox environment or trial license, which exposed only limited capabilities, from area-of-interest selection to image processing. So, the team had to ramp up not only on satellite images but also on the various platforms, to understand their capabilities and choose the right one. Luckily, the team had a member with experience in the Google Earth Engine (GEE) platform who helped others ramp up. As GEE provides datasets from other satellite imagery providers as well, the team could analyze all applicable datasets from within the GEE platform itself, rather than using different platforms/tools for different satellite imagery datasets from various providers.
Lesson(s) learned — a) satellite imagery cannot be treated like ordinary images (not quite obvious until you understand satellite imagery!) and requires a tool or platform to access readily available datasets, visualize the images, experiment with processing, and export to the format required for ML models; b) ‘all that glitters is not gold’… the platform offered by a satellite imagery provider may have hundreds of capabilities, but it’s of no use if it can’t do what you need for your dataset. While many platforms were good at highlighting their high-quality images, they lacked ease of use and easy imagery access through an SDK; c) one platform that enables access to multiple satellite imagery providers’ datasets is ideal for trying out and analyzing different datasets without having to learn multiple platforms.
Platforms / tools used
Google Earth Engine platform combines a multi-petabyte catalog of satellite imagery and geospatial datasets with planetary-scale analysis capabilities and makes it available for scientists, researchers, and developers to detect changes, map trends, and quantify differences on the Earth’s surface. While there are a lot of capabilities in the platform, we found it very useful for our purpose to do the following:
- to analyze all the applicable datasets for our crop identification task
- to visually analyze our area of interest — different terrains, temporal images, the effect of various VIs
- to download processed images from the Sentinel-2 Level-1C dataset as GeoTIFF files
- to choose satellite images based on the area of interest, date range, and cloud cover percentage
- to do minimal pre-processing of images to remove cloud cover
- to try out segmentation algorithms like SNIC implemented in the SDK
- to calculate various Vegetation Indices and do a detailed analysis to choose the best VIs for our purpose
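The cloud-removal step listed above can be sketched with Sentinel-2’s QA60 bitmask band, where bit 10 flags opaque clouds and bit 11 flags cirrus; in GEE the same test is typically written with `ee.Image.bitwiseAnd`. The QA60 values below are synthetic, for illustration:

```python
import numpy as np

# Sentinel-2 QA60 band: bit 10 = opaque clouds, bit 11 = cirrus.
CLOUD_BIT = 1 << 10
CIRRUS_BIT = 1 << 11

# Synthetic 2x2 QA60 tile: one cloudy pixel, one cirrus pixel, two clear.
qa60 = np.array([[0, CLOUD_BIT], [CIRRUS_BIT, 0]], dtype=np.uint16)

# A pixel is clear when neither flag bit is set.
clear = (qa60 & (CLOUD_BIT | CIRRUS_BIT)) == 0
print(clear)  # [[ True False] [False  True]]
```

Masking an image to `clear` pixels is the minimal version of the pre-processing we did in GEE; heavier monsoon cloud cover, of course, leaves nothing usable behind the mask.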
The Sentinel Application Platform (SNAP), part of the Sentinel Toolboxes, was used to understand and try out various Vegetation Index calculations and pre-processing tools for satellite images.
The SNAP architecture is ideal for Earth Observation processing and analysis due to the following technological innovations: Extensibility, Portability, Modular Rich Client Platform, Generic EO Data Abstraction, Tiled Memory Management, and a Graph Processing Framework.
What should you do to avoid the challenges we faced?
- Ensure that the ground truth data is complete, with all the details required for dataset creation & labeling; talk to SMEs & farmers to gather more inputs
- Ensure the team gets the basics of satellite imagery right as soon as possible in order to prepare the dataset
- Consider using an appropriate platform for analyzing different datasets and dataset creation
- Define the dataset requirements upfront & clearly to avoid unnecessary iterations of dataset creation
- Be clear on the quality requirements of satellite images for your project to achieve objectives
- Include a variety of images covering crop fields in all kinds of terrain in your area of interest
- Include images from different time periods (crop growth cycles/seasons/months/years); understand the land use pattern of your area of interest and include images from all the crop cycles of a given field
- Ensure precise labeling and masking are done not only for the main classes/objects of interest but also for the related classes/objects
- Finally, be open to experimenting with applicable VIs, techniques, ML algorithms & models to find the best combination for your dataset, enabling high performance of the ML models.
Most importantly, we need to understand that dataset creation is an iterative process — feedback & inputs from different stages, like ground truth data, data analysis & engineering, labeling, and model building, should all be considered to improve the dataset iteratively.
We hope that the presented challenges & lessons learned will help you to better approach your own dataset preparation task.
Thanks for reading.
You may find the following article interesting if you are looking for a structured guide to get your Agricultural AI project started.