Data Science for Climate Change
Predict vegetation index on the MENA Region
Interested in the climate space? Want to make an impact?
Put your data science skills to the test by participating in this challenge!
Problem Statement
(The following information is taken from the bitgrit competition page)
In the critical arena of climate research, understanding and mitigating the effects of climate change in the Middle East and North Africa (MENA) region has never been more crucial. With the MENA region facing unique environmental challenges, exacerbated by its vulnerability to climate variability, the quest for innovative solutions is paramount.
This backdrop sets the stage for a groundbreaking initiative by bitgrit. As part of its commitment to fostering cutting-edge research and community engagement, bitgrit is proud to collaborate with the Japan Aerospace Exploration Agency (JAXA) to launch this challenge, aimed at leveraging satellite data to predict the Vegetation Index. This key indicator of vegetation health stands at the core of our competition, reflecting the broader implications of climate change on the region’s ecosystems.
This challenge emerges at a pivotal moment, inviting participants from around the globe to delve into a rich dataset that includes variables like land temperatures, ground moisture, precipitation, and more, spanning up to the year 2024. With the dual goals of advancing scientific understanding and promoting practical applications, this competition is a clarion call to data scientists, climate researchers, and environmental advocates. Whether you’re deeply entrenched in the field of climate science or a data enthusiast eager to apply your skills to a cause of global importance, the JAXA MENA Region Climate Change Impact Challenge offers a unique platform to contribute to meaningful climate action.
The data 💾
Get the data by registering for the competition.
The dataset consists of satellite data over a rectangular bounding box of geographical coordinates covering the MENA Region (Middle East and North Africa) as well as a portion of Southern Europe. It is organized into different climate-related measurements taken from space over the same bounding box, each sampled monthly from 2000 to 2023 (with some variations).
These measurements are:
- Land Temperature
- Aerosol Depth
- Ground Moisture
- Precipitation (Rain)
- Shortwave Radiation
- Vegetation Index
📂
├── measurements
│   ├── aerosol_depth
│   │   ├── aerosol_depth_2002-01-01.csv
│   │   ├── aerosol_depth_2002-02-01.csv
│   │   └── ...
│   ├── ground_moisture
│   ├── land_temperature
│   ├── precipitation
│   ├── shortwave_radiation
│   └── vegetation_index
└── submission_format.csv
In each measurement file, the first column (index) holds the LATITUDE and the first row (header) holds the LONGITUDE in the geographical coordinate system.
Null/NaN values indicate invalid measurements (like the Ocean). Note that when predicting that a certain geographical point is null, it should be set to -1.
Due to differences in the resolution of the satellite images, some measurements have smaller resolution over the same coordinate space. Keep this in mind when working with this data. The resolution for the measurements (including Null/NaN values) is described as:
- Vegetation Index, Aerosol Depth, Land Temperature, Shortwave Radiation: 1659 x 610
- Precipitation, Ground Moisture: 829 x 305
The goal 🥅
The last measurement, Vegetation Index, corresponds to the target variable of this data science problem. The values for which this variable has to be predicted correspond to the same MENA Region bounding box over the following dates:
- 2023-09-01
- 2023-12-01
- 2024-03-01
The code is on Deepnote
Data preparation
Concatenate Data
Since we have monthly data for each measurement in individual CSV files, let’s concatenate them, using glob to match the file patterns and pd.concat to join all of them into one big file per measurement, with each row tagged with its date.
We use tqdm to get a progress bar to visualize the process.
Note: this is memory-intensive so you might get out of memory errors.
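As a rough sketch of this step (not the exact notebook code), the loader below assumes the folder layout shown earlier and that each file name ends in its date. The function name load_measurement and the date column name are illustrative choices.

```python
import glob
from pathlib import Path

import pandas as pd

try:
    from tqdm import tqdm
except ImportError:  # the progress bar is optional
    def tqdm(iterable, **kwargs):
        return iterable


def load_measurement(root, name):
    """Stack every monthly CSV of one measurement into a single frame."""
    # Assumed layout: <root>/<name>/<name>_YYYY-MM-DD.csv
    pattern = str(Path(root) / name / f"{name}_*.csv")
    frames = []
    for path in tqdm(sorted(glob.glob(pattern)), desc=name):
        # First column is the latitude index, header row is the longitude.
        frame = pd.read_csv(path, index_col=0)
        # Recover the date from the end of the file name.
        frame["date"] = Path(path).stem.rsplit("_", 1)[-1]
        frames.append(frame)
    return pd.concat(frames)
```

Calling load_measurement("measurements", "vegetation_index") would return one tall frame with latitude as the index, one column per longitude, and the date as the last column.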
Now that it’s done processing, let’s look at their shapes.
Notice that ground moisture and precipitation have half as many columns as the other measurements. That’s because, as mentioned in the competition description, they have a lower resolution.
The measurements also have different numbers of rows because some of them are missing a few months. As with anything in the real world, the data isn’t perfect.
Note: Because the resolutions differ, using all the measurements requires some form of interpolation so that the lower-resolution files match the resolution of the others. That way, all the measurements share the same grid and you can use them together to train a model.
In this tutorial, I won’t be performing the interpolation; I’ll leave that as an exercise for you, the reader!
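For readers who want a starting point for that exercise, here is one possible sketch. It is not part of this tutorial's pipeline. It uses nearest-neighbour lookup, which is the simplest option, and assumes the latitude index and longitude columns are numeric. The helper names are hypothetical.

```python
import numpy as np
import pandas as pd


def nearest_positions(source, target):
    """For each target coordinate, return the position of the closest source coordinate."""
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    return np.abs(target[:, None] - source[None, :]).argmin(axis=1)


def upsample_nearest(coarse, fine_lat, fine_lon):
    """Resample a coarse lat x lon grid onto a finer grid by nearest neighbour."""
    rows = nearest_positions(coarse.index, fine_lat)
    cols = nearest_positions(coarse.columns, fine_lon)
    # np.ix_ picks the full cross product of the chosen rows and columns.
    values = coarse.to_numpy()[np.ix_(rows, cols)]
    return pd.DataFrame(values, index=fine_lat, columns=fine_lon)
```

Here fine_lat and fine_lon would come from one of the higher-resolution frames, for example its index and its longitude columns. Smoother schemes such as bilinear interpolation may work better for continuous fields like precipitation.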
Let’s take a look at the Vegetation index data frame.
If you scroll all the way to the right, you see the dates column.
The index is the LAT value and the columns are our LONG.
We have lots of -1 values here, which represent missing values.
Melting into one
Now, the shape of this data is not suitable for training a model. We want to be able to join all the measurements together.
The solution is to melt all the measurement values into one column and build an index that matches the submission file: YYMMDD-(LAT:LONG)
Here’s what the end result looks like.
Note: the original dataset has data from 2000, but processing all of it requires more time and compute, so I filtered it down to the data after 2019.
Let’s dig into the code.
Since there’s a lot of data here, I split the dataset into chunks of size 10000 and used joblib Parallel to process the chunks in parallel.
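A minimal sketch of the melt-and-chunk idea follows, assuming each wide frame has latitude as its index, one column per longitude, and a date column. The key format reads the YYMMDD-(LAT:LONG) pattern literally and should be checked against submission_format.csv. The function names and the id index name are illustrative.

```python
import pandas as pd
from joblib import Parallel, delayed


def melt_measurement(wide, value_name):
    """Turn a lat x lon frame (plus a date column) into one value column keyed by id."""
    long = (
        wide.rename_axis("lat")
        .reset_index()
        .melt(id_vars=["lat", "date"], var_name="lon", value_name=value_name)
    )
    # Assumed key format: YYMMDD-(LAT:LONG)
    key = (
        pd.to_datetime(long["date"]).dt.strftime("%y%m%d")
        + "-(" + long["lat"].astype(str) + ":" + long["lon"].astype(str) + ")"
    )
    return long.set_index(key.rename("id"))[[value_name]]


def melt_in_chunks(wide, value_name, chunk_size=10_000, n_jobs=-1):
    """Melt row chunks in parallel and stack the results."""
    chunks = [wide.iloc[i:i + chunk_size] for i in range(0, len(wide), chunk_size)]
    parts = Parallel(n_jobs=n_jobs)(
        delayed(melt_measurement)(chunk, value_name) for chunk in chunks
    )
    return pd.concat(parts)
```

Chunking by rows keeps each worker's memory footprint small, since a melted chunk has one row per latitude, longitude and date combination.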
After processing all the measurements this way, let’s merge them into one dataset.
We use a simple left join on the vegetation index.
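The join could look roughly like this, assuming every melted frame is indexed by the same key and holds a single, uniquely named value column. merge_on_target is a hypothetical helper name.

```python
import pandas as pd


def merge_on_target(target, others):
    """Left-join each measurement onto the target frame by index."""
    merged = target.copy()
    for other in others:
        # how="left" keeps every target row; keys missing from `other` become NaN.
        merged = merged.join(other, how="left")
    return merged
```

Because the join is keyed on the vegetation index rows, any grid point or month that a measurement lacks simply shows up as NaN in that measurement's column.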
Here’s what it looks like!
Shift-ing the data
Now remember the goal: we want to build a model to predict 3, 6 and 9 months ahead.
This means we have to shift the vegetation index values. We use the measurement values of any given month, say January 2022, as features and the future value of the vegetation index (April 2022 for the 3-month horizon) as the target.
To do the shifting, we need to extract the date, lat and long values from our index.
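One way to pull those parts back out is a regular expression over the index. This sketch assumes keys shaped like 230601-(10.0:20.5), i.e. the YYMMDD-(LAT:LONG) pattern read literally. The pattern and the function name split_key are illustrative.

```python
import pandas as pd

# Assumed key shape: YYMMDD-(LAT:LONG), with optional minus signs and decimals.
KEY_PATTERN = (
    r"^(?P<date>\d{6})-\("
    r"(?P<lat>-?\d+(?:\.\d+)?):(?P<lon>-?\d+(?:\.\d+)?)\)$"
)


def split_key(df):
    """Add date, lat and lon columns parsed from the string index."""
    parts = df.index.to_series().str.extract(KEY_PATTERN)
    out = df.copy()
    # to_numpy() assigns by position, so index alignment is not involved.
    out["date"] = pd.to_datetime(parts["date"], format="%y%m%d").to_numpy()
    out["lat"] = parts["lat"].astype(float).to_numpy()
    out["lon"] = parts["lon"].astype(float).to_numpy()
    return out
```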
Since we’re also making predictions, we need our latest data, June 2023, to predict the 3 specific dates given in the competition overview.
Let’s prepare that dataset first.
Note that we replace -1 values with NaN so that the model doesn’t learn any false patterns from the data.
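A hedged sketch of this preparation step, assuming the frame already has a datetime date column and that value_cols lists only the measurement columns. The replacement is restricted to those columns so that a real coordinate of -1 is left alone. Both function names are illustrative.

```python
import numpy as np
import pandas as pd


def mask_missing(df, value_cols):
    """Replace the -1 placeholder with NaN in the measurement columns only."""
    out = df.copy()
    out[value_cols] = out[value_cols].replace(-1, np.nan)
    return out


def latest_snapshot(df, value_cols, latest="2023-06-01"):
    """Select the rows of the most recent month and mask their missing values."""
    snapshot = df[df["date"] == pd.Timestamp(latest)]
    return mask_missing(snapshot, value_cols)
```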
Now we do the shifting using .shift() and create three separate data frames for training, one per forecast horizon.
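The shifting could be sketched as below. It assumes one row per location and month with lat, lon and date columns, and that each location's rows are contiguous in time. If months are missing, shifting by row count no longer equals shifting by months. The target_3m style column names are illustrative.

```python
import pandas as pd


def add_future_targets(df, target="vegetation_index", horizons=(3, 6, 9)):
    """Attach the target value h months ahead as a new column per horizon."""
    out = df.sort_values(["lat", "lon", "date"]).copy()
    per_location = out.groupby(["lat", "lon"])[target]
    for h in horizons:
        # A negative shift pulls future rows back onto the current row.
        out[f"target_{h}m"] = per_location.shift(-h)
    return out


def training_frames(df, target="vegetation_index", horizons=(3, 6, 9)):
    """Build one training frame per horizon, dropping rows with no future value."""
    shifted = add_future_targets(df, target, horizons)
    return {h: shifted.dropna(subset=[f"target_{h}m"]) for h in horizons}
```

Grouping by location before shifting keeps one grid point's future values from leaking into its neighbour's rows.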
Machine Learning
Here we use LightGBM to train a simple regression model.
We also impute missing values with the mean.
Predict
Let’s use the models we trained to make the predictions on our June 2023 dataset.
Then we concatenate them all into one giant list to create our submission file.
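Putting the predictions together might look like this. The sketch assumes models maps each horizon to a fitted model with a predict method, means holds the training feature means, and the latest frame has lat, lon and vegetation_index columns. Setting points that are null in the latest month to -1 is an assumption based on the competition note about null points. The id column name is also assumed.

```python
import numpy as np
import pandas as pd


def build_submission(latest, models, means, feature_cols, target_dates,
                     target="vegetation_index"):
    """Predict every horizon from the latest month and stack the results."""
    features = latest[feature_cols].fillna(means)
    # Assumption: points that are null now (e.g. ocean) stay null in the future.
    is_null = latest[target].isna().to_numpy()
    location = "-(" + latest["lat"].astype(str) + ":" + latest["lon"].astype(str) + ")"
    parts = []
    for horizon, model in models.items():
        preds = np.where(is_null, -1.0, model.predict(features))
        date_key = pd.Timestamp(target_dates[horizon]).strftime("%y%m%d")
        parts.append(pd.DataFrame({"id": (date_key + location).to_numpy(),
                                   target: preds}))
    return pd.concat(parts, ignore_index=True)
```

For this competition, target_dates would be {3: "2023-09-01", 6: "2023-12-01", 9: "2024-03-01"}.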
Here’s what the final submission file will look like.
That’s all for this starter solution!
Next steps
This was a simple baseline solution. There’s lots of room for improvement here.
- include more data for training
- interpolate the other measurement values
- create date & temporal features using feature engineering
- experiment with hyperparameter tuning and use different models
Thanks for reading
Be sure to follow the bitgrit Data Science Publication to keep updated!
Want to discuss the latest developments in Data Science and AI with other data scientists? Join our discord server!