Predicting Australian Wildfires with Weather Forecast Data

Margriet Groenendijk
IBM Data Science in Practice
Dec 14, 2020

Join this Call for Code Spot Challenge and build a model to predict the area of wildfires in Australia for February 2021

We are very excited to run this challenge, and for me it is a great opportunity to use weather data and my knowledge of land surface processes with machine learning models. You might wonder what land surface processes are, so I will come back to that as these may help you in deciding what features to use when building your prediction model. But first some more details about the challenge and how you can participate.

Photo by Daniel Morton on Unsplash

The challenge

The goal of the challenge is to predict the area of wildfires in 7 regions in Australia for February 2021. The final submission deadline is 31 January 2021, and the winner will receive $5000.

This challenge is a Call for Code Spot Challenge. Find out more about Call for Code and the various Spot Challenges in these blogs.

Go here to read more about the wildfires challenge, then create a free account to sign up and head over to the leaderboard to submit your predictions. All data and example notebooks are available in this GitHub repository. You can ask any questions you have in the Slack channel (#cfcsc-wildfires). There are even more resources, including videos, at the end of this post.

Land surface processes

One of the most important things to do before you start building a model is to learn about the subject area. For this challenge it would be good to learn a little about wildfires, but there is no need to become an expert!

During my Physical Geography degree one of the courses I had to take was modelling. There was no machine learning involved, not even writing code. It was all about drawing boxes, lines and arrows to build conceptual models of land surface processes: anything that happens at the land surface, from landslides and floods to droughts, wildfires and more. It all came down to sketches like the example below, which I drew for water moving through a forest. Start by thinking of the states of the system (in blue), then add the processes that connect them and the variables that you think might be important (in red). Keeping it simple is fine; there is no need to add every detail. In the sketch below, for instance, I am not adding how every single leaf contributes to the transpiration by the trees.

As you might have noticed already, this is the hydrological cycle, which you can read more about here. But the point I want to make is that I started with the components that I thought were important in grey, then added where water is stored and how it moves between these components, and finally added some of the variables and parameters that could be driving how all these processes change over time.

The next step could then be defining equations for all processes, which you can eventually write into code. This could then form a small part of a physical land surface model (LSM), which in turn can be used as one of the components of an even more complex climate or weather model.

But for the challenge this is not needed at all, as you will probably be using machine learning models. Still, I think this exercise is a very useful way to figure out which features to use. For the wildfires it could look something like the sketch below, which is also not complete; I am leaving that up to you.

From the above it becomes clear that the weather will have a large impact on wildfires.

Now that you have some knowledge of the processes involved, let’s have a look at the datasets that are available for the challenge.

The data

Several datasets are provided to help you build and test a prediction model. They are extracted from the Weather Operations Center: Geospatial Analytics component (PAIRS Geoscope).

The historical wildfire area is available as a daily time series for each of the 7 regions. Use these to train and test your model, as this is the variable that you will predict for February 2021. The plot below shows an example of the daily data for 2 regions. Note that the y-axis has a log scale.

Estimated fire area (km2) for 2 regions as created in this notebook

The wildfire area is an estimate, not an exact observation. In short, for each of the 7 regions all pixels in satellite images that contain a fire are added up. But because a fire can be larger or smaller than a pixel, this introduces some uncertainty in the data. This is described in more detail in this video replay and the data documentation.
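The pixel-summing idea above can be sketched in a few lines of NumPy. This is a toy illustration, not the actual satellite product: both the fire mask and the 0.01 km² pixel size are made-up values.

```python
import numpy as np

# Toy 5x5 satellite scene: 1 marks a pixel that contains fire, 0 does not.
fire_mask = np.array([
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
])

# Illustrative pixel size, not the real sensor resolution.
pixel_area_km2 = 0.01

# Estimated fire area: number of fire pixels times the area of one pixel.
# A fire smaller than one pixel still counts as a full pixel, which is
# one source of the uncertainty mentioned above.
estimated_area = fire_mask.sum() * pixel_area_km2
print(estimated_area)  # 7 fire pixels, 0.01 km^2 each
```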

Historical weather and historical weather forecast data are provided as daily time series for each of the 7 regions. Use the first to figure out which features to use, and the second to build your model. The forecast data is the important one: on 31 January you will predict the wildfires for every day in February, and the only data that will be available is the weather forecast. The big question will be how to make the best use of all the data!
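One common way to line up weather features with the fire-area target is a join on region and date. The frames and column names below (Region, Date, Estimated_fire_area, temp, precip) are toy stand-ins for illustration, not necessarily the exact headers in the provided CSV files.

```python
import pandas as pd

# Toy target series: daily estimated fire area per region.
fires = pd.DataFrame({
    "Region": ["NSW", "NSW", "NSW"],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    "Estimated_fire_area": [12.5, 30.1, 8.0],
})

# Toy weather forecast variables for the same region and dates.
forecast = pd.DataFrame({
    "Region": ["NSW", "NSW", "NSW"],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    "temp": [38.0, 41.5, 33.2],
    "precip": [0.0, 0.0, 4.2],
})

# One row per (Region, Date) with the forecast variables as features
# and the fire area as the target.
train = fires.merge(forecast, on=["Region", "Date"], how="inner")
X = train[["temp", "precip"]]
y = train["Estimated_fire_area"]
print(train.shape)  # (3, 5)
```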

Additional data about land use is also available as the percentage of each land use class in each region. To get an idea of the seasonal variation of the vegetation you can use the Normalized Difference Vegetation Index (NDVI), which is derived from satellite images and is a measure of the greenness of the vegetation.
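For the challenge the NDVI is already provided, but it helps to know what the number means. It is computed from the near-infrared (NIR) and red reflectance bands as (NIR − Red) / (NIR + Red); the reflectance values below are made up for illustration.

```python
import numpy as np

# Illustrative reflectance values for three pixels, from dense green
# vegetation (high NIR, low red) to sparser cover.
nir = np.array([0.50, 0.40, 0.30])
red = np.array([0.10, 0.20, 0.25])

# NDVI is close to 1 for dense green vegetation, near 0 for bare soil,
# and negative for water or clouds.
ndvi = (nir - red) / (nir + red)
print(ndvi.round(3))
```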

All data is compressed into one zip file that can be downloaded from the repo directly:

import zipfile
!wget -N https://raw.githubusercontent.com/Call-for-Code/Spot-Challenge-Wildfires/main/data/Nov_10.zip
# extract all files; a context manager closes the archive afterwards
with zipfile.ZipFile("Nov_10.zip") as zf:
    zf.extractall()

This is the file for the first phase of the challenge. For the second and final phase a new file will be added that will contain more recent data. Make sure to use the right one for each phase! Once you have downloaded the data, each csv file can be loaded into a Pandas DataFrame from where you can start exploring the data:

import pandas as pd

# load the historical wildfire data and parse the Date column
file_wildfires = "Nov_10/Historical_Wildfires.csv"
wildfires_df = pd.read_csv(file_wildfires)
wildfires_df['Date'] = pd.to_datetime(wildfires_df['Date'])
wildfires_df.head()
First 5 rows of the historical wildfires data
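A quick first look at a frame like this could be a per-region summary. The toy frame below stands in for the loaded wildfires_df; the column names are illustrative and may differ from the exact CSV headers.

```python
import pandas as pd

# Toy stand-in for the loaded historical wildfires DataFrame.
wildfires_df = pd.DataFrame({
    "Region": ["NSW", "NSW", "QL", "QL"],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02",
                            "2020-01-01", "2020-01-02"]),
    "Estimated_fire_area": [12.5, 30.1, 3.2, 0.8],
})

# Mean and maximum daily fire area per region: a cheap way to spot
# which regions burn most and where the outliers are.
summary = wildfires_df.groupby("Region")["Estimated_fire_area"].agg(["mean", "max"])
print(summary)
```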

Have a look at the example notebooks in the repo as well.

In addition to the data provided you are free to use any other open dataset, as long as the data is free and available for everyone.

The predictions

What kind of model and what data to use is up to you; it is a competition, after all! Below is a summary of the challenge timeline:

  1. Explore the data. Make sure you really understand it: create plots and look at the distributions, outliers and correlations.
  2. Use the historical data to build and test your model. Predict the wildfires for February 2020 with data available on 31 January 2020. This is phase 1, which will run until 9 January 2021.
  3. The data will then be updated every week, which you can use to improve your model by making predictions for the 3rd and 4th week of January 2021 in phases 2 and 3.
  4. In phase 4 you will make predictions for all days in February 2021 on 31 January. Only this submission will be used to determine the winner.
  5. More details about the four phases are on the leaderboard.
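The phase 1 backtest in step 2 amounts to a simple time-based split: train on everything known on 31 January 2020, then score the predictions against February 2020. A minimal sketch with a toy daily series:

```python
import pandas as pd

# Toy daily series spanning January-February 2020, standing in for the
# real per-region fire-area data.
dates = pd.date_range("2020-01-01", "2020-02-29", freq="D")
df = pd.DataFrame({"Date": dates, "Estimated_fire_area": range(len(dates))})

# Everything known on 31 January goes into training; February 2020 is
# held out to score the model, mirroring the phase 1 setup.
cutoff = pd.Timestamp("2020-01-31")
train = df[df["Date"] <= cutoff]
test = df[df["Date"] > cutoff]

print(len(train), len(test))  # 31 training days, 29 February days
```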

Answer the Call

To find out more about the challenge, all the information you need is in the resources below. Let me know how you are getting on, find a team, and ask questions in the Slack channel (#cfcsc-wildfires). Hope to see you there!

Watch the replays:
