Data Challenge Winner: Q&A with Raphael Kiminya
A conversation with the First Place winners of the Radiant Earth Spot the Crop Data Challenge
We recently announced the winners of the Radiant Earth Spot the Crop Data Challenge, a competition to predict crop types in the Western Cape, South Africa, using satellite image time series. The competition was organized in two parallel tracks: in track 1, participants used time series of Sentinel-2 multispectral data as input to their models; in track 2, both Sentinel-2 and Sentinel-1 (radar) data were required as input.
The competition was hosted on Zindi and organized in partnership with the Western Cape Department of Agriculture in South Africa, with support from the convening sponsor, the GIZ FAIR Forward program; the platinum sponsor, Computer Vision for Global Challenges (CV4GC); and the gold sponsor, Descartes Labs.
Eight hundred thirty-one participants competed to build machine learning models that identify crop type classes across both tracks. Radiant Earth generated the training data based on ground reference data collected and provided by the Western Cape Department of Agriculture.
In this Q&A, we sat down with Raphael Kiminya from Kenya to talk about his journey to becoming a data scientist and his approach to tackling the problem. He won the track of the Spot the Crop Data Challenge that used Sentinel-2 multispectral data as input to the model.
“This was my first time working with this type of data.”
For our conversation with the winners of the Spot the Crop XL Challenge, which used both Sentinel-1 and Sentinel-2 data as input, click here.
Congratulations on winning the Radiant Earth Spot the Crop Data Competition! What inspired you to get involved in this field? How did you become interested in machine learning? Tell us about your machine learning journey.
Thank you so much. It’s an honor.
At my previous job, I was part of a data analytics team tasked with implementing a new business intelligence solution. I worked in various roles to support the solution:
- Modeling ETL processes
- Designing the data warehouse
- Developing reports and dashboards
- Maintaining the databases and operating systems
This exposed me to the end-to-end data engineering process. There was always something new to learn, a bug to patch, a feature to implement. I got comfortable with the idea that learning is a lifelong process — a mindset that has proved invaluable on this journey.
I often came across the terms predictive analytics, machine learning, AI — and wondered about the next step of data analysis. So when I left my old job a couple of years ago, I figured I might see what the AI hype was all about.
The world of machine learning was intimidating at first glance. The sheer scope of it all — ML applications, algorithms, research papers, blogs and tutorials, frameworks, and libraries — was overwhelming. Within all the chaos, I came across data competitions. They are well-defined projects with a fixed scope and timeline. Competitions helped me narrow my focus and learn one thing at a time.
The primary appeal of AI is its potential to solve a broad range of challenges. Over the past two years, I have completed projects across domains that I had never given a second thought before — weather forecasting, manufacturing, construction, medicine, particle physics, space exploration, and more. Machine learning is the lens through which I glimpse how the world works. I know that this is just the beginning. There is still a long way to go, and I can’t wait to see what wonderful things await.
Where did you learn about the Spot the Crop Data Competition, and what made you decide to participate?
I’m a member of the Zindi community and a regular competitor, so I found out about the competition once it was published.
Using a satellite orbiting Earth from hundreds of kilometers in space to identify the type of crop growing on a farm sounded like a clever idea. I was excited to challenge myself to build such a solution.
Your winning algorithm outperformed 2045 solutions submitted by 509 participants from 65 countries. How did you approach the problem, and what do you think set you apart?
Once I understood the problem and the data structure, sequential image classification seemed like the way to go. Since the fields were of different sizes, the images needed to be resized, padded, or cropped to a uniform size. However, field sizes varied too widely — from a single pixel to tens of thousands of pixels — so the approach felt wasteful and inefficient to implement. After experimenting with CNN-LSTM models, I quickly realized that the limits of my environment wouldn’t let me comfortably explore the idea.
The data could easily be represented in a tabular format, so I tried that next. I summarized the images by taking the mean of the pixels belonging to a particular field at each time-step and reframed the problem as time-series signal classification. I came across the fastai-based tsai library, which implements state-of-the-art time-series modeling techniques, and the tool was well suited to the task. This approach was far lighter computationally and yielded promising results from the start.
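The pixel-averaging step described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the author's code; the array shapes and the `field_to_series` name are assumptions.

```python
import numpy as np

def field_to_series(images, field_mask):
    """Reduce a stack of multispectral images to one time series per band.

    images: array of shape (T, B, H, W) -- T time-steps, B spectral bands.
    field_mask: boolean array of shape (H, W) marking the field's pixels.
    Returns an array of shape (T, B): the mean pixel value of the field
    for each band at each time-step.
    """
    pixels = images[:, :, field_mask]   # (T, B, n_field_pixels)
    return pixels.mean(axis=-1)         # (T, B)

# Toy example: 4 time-steps, 3 bands, a 5x5 tile with a 6-pixel field.
rng = np.random.default_rng(0)
images = rng.random((4, 3, 5, 5))
mask = np.zeros((5, 5), dtype=bool)
mask[1:3, 1:4] = True
series = field_to_series(images, mask)
print(series.shape)  # (4, 3)
```

Each field, whatever its size in pixels, collapses to a fixed-width `(T, B)` array, which is what makes the tabular reframing possible.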
The ability to prototype ideas quickly was an enormous advantage. Initially, I focused my experiments on feature engineering, testing out the various band combinations (indices) useful for crop monitoring tasks. There are many vegetation indices in the literature, and some are more useful than others in differentiating crops. I ended up using 34 indices in the final solution.
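The interview doesn't list which 34 indices were used, but NDVI is the canonical example of a band-combination feature. The sketch below computes it from a per-field band series; the band positions (B4 at column 3, B8 at column 7) are assumptions that depend on how the bands were stacked.

```python
import numpy as np

def ndvi(nir, red, eps=1e-9):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red).

    For Sentinel-2, NIR is band B8 and Red is band B4.
    eps guards against division by zero on dark pixels.
    """
    return (nir - red) / (nir + red + eps)

# series has shape (T, B): one row per time-step, one column per band.
# Assume band B4 (red) sits at index 3 and B8 (NIR) at index 7.
series = np.random.default_rng(1).random((40, 13))
ndvi_t = ndvi(series[:, 7], series[:, 3])  # one NDVI value per time-step
print(ndvi_t.shape)  # (40,)
```

Other indices (EVI, NDWI, and so on) follow the same pattern: a formula over two or more band columns, appended as extra feature channels for the time-series classifier.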
Data augmentation was important to regularize and improve the robustness of the model. I tried out the time-series augmentation techniques implemented in tsai, but most didn’t significantly improve the results. In the final solution, I only used the CutMix augmentation. Another effective augmentation technique was to divide large fields into smaller subsets, thereby increasing the total number of samples. I was careful to group the subsets into the same fold during cross-validation to prevent data leakage.
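Grouping the sub-field samples into the same fold is exactly what scikit-learn's `GroupKFold` does when given the parent field ID as the group key. A minimal sketch, with made-up sample counts and feature dimensions:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical setup: each sample is a sub-field time series, and
# field_ids records which original field each subset came from.
rng = np.random.default_rng(2)
n_samples = 100
X = rng.random((n_samples, 34))           # e.g. 34 index features per sample
y = rng.integers(0, 9, size=n_samples)    # crop-type labels
field_ids = rng.integers(0, 25, size=n_samples)  # parent field of each sample

# Splitting on field_ids keeps every subset of a field in the same fold,
# so no field contributes to both the train and validation sides.
gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=field_ids):
    assert set(field_ids[train_idx]).isdisjoint(field_ids[val_idx])
```

Without the grouping, near-identical crops from the same field would sit on both sides of the split and inflate the validation score.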
Finally, I tested the various architectures implemented in tsai. I ensembled XceptionTime and InceptionTime models in the final solution. Most of the models produced similar results, and it was really a matter of balancing speed versus accuracy.
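A common way to ensemble two classifiers, and one consistent with the description above, is to average their predicted class probabilities. This is an illustrative sketch, not the author's exact method; `probs_a` and `probs_b` stand in for the outputs of, say, an XceptionTime and an InceptionTime model.

```python
import numpy as np

rng = np.random.default_rng(3)

def fake_probs(n, c):
    """Stand-in for a model's softmax output: rows sum to 1."""
    p = rng.random((n, c))
    return p / p.sum(axis=1, keepdims=True)

probs_a = fake_probs(200, 9)   # e.g. XceptionTime predictions
probs_b = fake_probs(200, 9)   # e.g. InceptionTime predictions

ensemble = (probs_a + probs_b) / 2   # average is still a valid distribution
pred = ensemble.argmax(axis=1)       # final crop-type prediction per sample
print(pred.shape)  # (200,)
```

Averaging probabilities tends to smooth out disagreements between models that individually score similarly, which matches the observation that most architectures produced comparable results.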
Were you familiar with using machine learning on satellite imagery before this competition? How does this differ from common problems in computer vision?
No, this was my first time working with this type of data.
The complexity of this data sets it apart from common computer vision tasks. Normal color images have only three channels: red, green, and blue. In contrast, this dataset has 13 spectral bands of sequential data. Hundreds of additional features (indices) can be derived by combining these bands with various formulae.
The unique structure of the data allows for multiple approaches to the problem. Solutions based on sequential image classification or segmentation algorithms may be the most powerful since they can take advantage of the spatial and temporal features. If speed is more important than accuracy, the data can be compressed across the space and time dimensions into a tabular format suitable for classical machine learning algorithms.
What unexpected insights into the data have you discovered?
I suspected that there was some noise in the dataset. While preprocessing the data, I noticed a few fields labeled with multiple crop types. Additionally, labels such as planted pastures, fallow, weeds, and small grain grazing sound vague and might encompass multiple crop types.
It’s also expected that some farmers may plant more than one type of crop in one season. Some of the crops with long growth cycles may have been planted alongside short-term crops with faster returns. Intercropping is a popular practice used to maximize resource utilization and reduce the risk of crop failure.
These observations may suggest why label mixing regularization techniques like MixUp and CutMix worked really well.
Any challenges you would like to share?
I struggled with the CNN-LSTM approach, mostly because of the memory and processing constraints of my environment.
Machine learning is a fast-growing field. How do you stay up-to-date with the latest technological developments?
I mostly rely on competitions for hands-on experience. Subscriptions to relevant blogs and news feeds (e.g., Towards Data Science, MIT News) keep me immersed in the AI world. Following popular ML repositories on GitHub and tracking new research through sites like Papers With Code helps me stay updated with state-of-the-art techniques.
Any words of advice for beginner data scientists who would like to participate in data competitions?
Competitions are easily the best way to jumpstart your data science journey. You have the chance to solve a diverse range of real-world challenges. Don’t be overwhelmed. Find a project you care about and jump in. Aim to complete it, not to win. Break complex problems down into single things you can accomplish in a day.
You will make mistakes along the way. Don’t beat yourself up. All hell won’t break loose if you fail. Celebrate your small victories and try again tomorrow. Accept that your journey to mastery is never-ending, and learn to enjoy the process.
See you on the leaderboard!