AIN311 Project - Climate Change and Forest Fires - Blog 2

Data Preprocessing and Evaluating Models

Hüseyin Eren Doğan
AIN311 Fall 2023 Projects
Dec 18, 2023

Data Preprocessing:

First, we looked into the datasets we were planning to use and settled on the following two, which are more efficient than the alternatives and sufficient for our work:

The GlobalLandTemperaturesByCity.csv dataset, available at the following link:

A dataset of forest fires in Turkey between 2000 and 2021:

Climate Change Data:

Climate Change Dataset — Average Temperatures for Cities in Turkey (Nov. 1743 – Aug. 2013)

After some preprocessing, such as dropping redundant columns, extracting the data for Turkey from the global temperature dataset, and creating a dataframe for each city in Turkey, we used polynomial regression to predict the monthly average temperature in a city.

DataFrame for Çorlu

Here is the predicted monthly average temperature for Çorlu in 2017:

2017-01: 3.6654165786241117
2017-02: 6.546845090460433
2017-03: 9.938755585376143
2017-04: 13.487229861404801
2017-05: 16.83834971657993
2017-06: 19.63819694893491
2017-07: 21.53285335650349
2017-08: 22.16840073731899
2017-09: 21.190920889415143
2017-10: 18.246495610825367
2017-11: 12.981206699583165
2017-12: 5.041135953722115
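A minimal sketch of how such a monthly fit could look, assuming the model maps the month number (1 to 12) to temperature with a cubic polynomial. The degree, the choice of feature, and the sample values are illustrative assumptions, not the project's actual data or model.

```python
import numpy as np

# Hypothetical monthly averages for one city (illustrative values only).
months = np.arange(1, 13)
temps = np.array([4.0, 5.5, 8.0, 12.5, 17.0, 21.5, 24.0, 23.5, 20.0, 15.0, 10.0, 6.0])

# Fit a cubic polynomial temp = f(month); degree 3 is an assumption.
coeffs = np.polyfit(months, temps, deg=3)
model = np.poly1d(coeffs)

# Evaluate the fitted curve for every month of the target year.
predicted = model(months)
for m, t in zip(months, predicted):
    print(f"2017-{m:02d}: {t:.2f}")
```

A low-degree polynomial captures the single summer peak but cannot express year-to-year variation, so every year gets the same curve unless the year is added as a second feature.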

We then created new dataframes containing each city's average temperature for every year, and fit a polynomial regression model to this data as well.

Polynomial Regression model: 3.695e-07 x³ - 0.002063 x² + 3.838 x - 2367

Polynomial Regression Derivative: 1.109e-06 x² - 0.004125 x + 3.838

Prediction for Yearly Average Temperature Change in Çorlu, 2023: 0.028880063306379267

We plan to use this model to predict future temperature values by calculating its derivative.
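One way the derivative could be used for forecasting is a first-order extrapolation: take the last fitted value and add the slope times the number of elapsed years. The sketch below assumes that interpretation. The data is synthetic, and `extrapolate` and `offset` are hypothetical names introduced only for this example.

```python
import numpy as np

# Hypothetical yearly averages with a slight warming trend (illustrative only).
years = np.arange(1990, 2014)
yearly_avg = 13.0 + 0.03 * (years - 1990)

# Centre the years before fitting to keep the cubic fit well conditioned.
offset = years.mean()
model = np.poly1d(np.polyfit(years - offset, yearly_avg, deg=3))
slope = model.deriv()  # derivative polynomial: temperature change per year

def extrapolate(year, last_year=2013):
    """First-order estimate: last fitted value plus slope times elapsed years."""
    base = model(last_year - offset)
    return base + slope(last_year - offset) * (year - last_year)

print(slope(2013 - offset))  # yearly rate of change at the end of the data
print(extrapolate(2023))     # estimated yearly average for 2023
```

Evaluating the cubic itself far outside the fitted range can diverge quickly, which is one reason to extrapolate from the local slope instead.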

Forest Fires Data:

Now, we need to deal with the forest fire data. After some preprocessing, our data looks like this:

Forest Fires in Turkey (2000–2021)
Number of Total Forest Fires in Turkey by Years

To use these datasets together, we need to join them on a shared column. Since the climate data contains cities and their coordinates, coordinates are the natural key. The problem is that the coordinates in the forest fire data do not match those in the climate data exactly, so we instead assign each fire coordinate to its closest city. First, we extracted the cities and their coordinates from the climate data.

Then we used the GeoPandas library together with SciPy's cKDTree and wrote a function that finds the nearest city for each fire coordinate and adds it to the forest fire data.

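import geopandas as gpd
from scipy.spatial import cKDTree

# Build point geometries from the longitude/latitude columns of both datasets.
climate_gdf = gpd.GeoDataFrame(city_coordinates, geometry=gpd.points_from_xy(city_coordinates['Longitude'], city_coordinates['Latitude']))
forest_gdf = gpd.GeoDataFrame(fires_00_21, geometry=gpd.points_from_xy(fires_00_21['longitude'], fires_00_21['latitude']))

# Index the city coordinates in a k-d tree for fast nearest-neighbour lookup.
tree = cKDTree(climate_gdf.geometry.apply(lambda geom: (geom.x, geom.y)).tolist())

def find_nearest_city(point):
    # idx is the row position of the closest city in climate_gdf.
    _, idx = tree.query((point.x, point.y))
    # Returns the index label, so this assumes climate_gdf is indexed by city.
    nearest_city = climate_gdf.index[idx]
    return nearest_city

forest_gdf['NearestCity'] = forest_gdf.geometry.apply(find_nearest_city)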

forest_gdf
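The k-d tree above measures distance in raw degrees, which treats a degree of longitude as equal to a degree of latitude. Away from the equator a degree of longitude is shorter, so the result is an approximation. The sketch below shows a great-circle (haversine) alternative using only the standard library. The city coordinates are approximate and for illustration only, and `haversine_km` and `nearest_city` are hypothetical helper names.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Approximate city coordinates as (latitude, longitude), for illustration only.
cities = {
    "Istanbul": (41.01, 28.98),
    "Ankara": (39.93, 32.86),
    "Izmir": (38.42, 27.14),
}

def nearest_city(lat, lon):
    """Return the name of the city closest to the given point."""
    return min(cities, key=lambda name: haversine_km(lat, lon, *cities[name]))

print(nearest_city(38.50, 27.00))
```

This brute-force search compares every point against every city, so it suits small city lists. For larger inputs, a tree that supports the haversine metric, such as scikit-learn's `BallTree`, is one option.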

Now that each fire point has its nearest city, we calculated the number of forest fires per month in each city.
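The monthly counting step can be sketched with a pandas group-by. The `date` column name and the sample rows are assumptions made for this example; only `NearestCity` and `count` come from the post.

```python
import pandas as pd

# Hypothetical fire records; the 'date' column name is an assumption.
fires = pd.DataFrame({
    "date": pd.to_datetime(["2020-07-03", "2020-07-19", "2020-08-02", "2020-07-11"]),
    "NearestCity": ["Izmir", "Izmir", "Izmir", "Antalya"],
})

# One row per (city, year-month) with the number of fires observed.
monthly_counts = (
    fires.assign(month=fires["date"].dt.to_period("M"))
         .groupby(["NearestCity", "month"])
         .size()
         .reset_index(name="count")
)
print(monthly_counts)
```

Months with no fires do not appear in the result. If zero counts matter for the model, they would need to be added explicitly, for example by reindexing against the full city and month grid.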

Number of Forest Fires for Each Month in Cities of Turkey

Next, we decided to predict the ‘count’ value, so we added the average temperature to this data. The fire data spans Nov. 2000 to Mar. 2021, while the climate data ends in Aug. 2013. For months up to Aug. 2013 we took the value directly from the climate data; for later months we used the temperature predicted by our polynomial regression model. With that, the data was ready for evaluating models.
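A minimal sketch of combining the two temperature sources, assuming both are aligned on the same monthly index. All values and variable names here are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly table: observed values stop after 2013-08.
months = pd.period_range("2013-06", "2013-11", freq="M")
observed = pd.Series([21.0, 23.5, 23.0, np.nan, np.nan, np.nan], index=months)
predicted = pd.Series([20.5, 23.0, 22.8, 19.5, 15.0, 10.2], index=months)  # model output

# Prefer the observed value; fall back to the model where none exists.
avg_temp = observed.fillna(predicted)
source = np.where(observed.notna(), "observed", "predicted")

print(pd.DataFrame({"avg_temp": avg_temp, "source": source}))
```

Keeping a `source` column makes it easy to check later whether model performance differs between months with measured temperatures and months with predicted ones.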

Evaluating Models:

Now we try to predict the occurrence of forest fires using temperature fluctuations as predictors. To do that, we evaluate two models: Linear Regression and k-Nearest Neighbors (kNN) Regression.
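A comparison of this kind can be sketched with scikit-learn as follows. The data is a synthetic stand-in, and the train/test split ratio and `n_neighbors=5` are assumptions, not the settings used in the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data: fire counts loosely increasing with temperature.
rng = np.random.default_rng(0)
temperature = rng.uniform(0, 30, size=200).reshape(-1, 1)
fire_count = 2.0 * temperature.ravel() + rng.normal(0, 5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    temperature, fire_count, test_size=0.2, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "kNN Regression": KNeighborsRegressor(n_neighbors=5),  # k=5 is an assumption
}

# Fit each model on the training split and score it on the held-out split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = (mean_squared_error(y_test, y_pred), r2_score(y_test, y_pred))
    print(name, results[name])
```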

The results of training these models, measured as MSE and R-squared:

Linear Regression Model
Mean Squared Error: 7421.040521343406
R-squared: 0.2917232023289914

kNN Regression Model
Mean Squared Error: 9499.912816635162
R-squared: 0.09331207550096021

The linear regression model performs noticeably better than the kNN regressor, although an R-squared of 0.29 is still modest; the kNN regressor performs quite poorly.

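The MSE of the linear regression model is 7421.04. Because MSE is in squared units, its square root (RMSE) is easier to interpret: about 86, meaning predictions are typically off by roughly 86 fires. With an R-squared of 0.29, the model explains around 29% of the variability in the number of forest fires.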

The kNN regressor's MSE of 9499.91 (an RMSE of about 97) indicates that its predictions differ more from the actual values. The model accounts for only around 9% of the variability in the number of forest fires, as indicated by its lower R-squared score.
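Since MSE is expressed in squared units, taking its square root gives an error on the same scale as the fire counts. A short worked example using the two MSE values reported above:

```python
from math import sqrt

# MSE values reported above.
mse = {
    "Linear Regression": 7421.040521343406,
    "kNN Regression": 9499.912816635162,
}

# RMSE is in the same units as the target (fires per city per month).
rmse = {name: sqrt(value) for name, value in mse.items()}
for name, value in rmse.items():
    print(f"{name}: RMSE = {value:.2f}")
```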

Although the kNN model had clear limitations, the linear regression model provides some insight into the connection between temperature changes and forest fires. Together, the results highlight the complexity of the underlying dynamics and advance our understanding of the factors that affect the occurrence of forest fires.

Gökhan Çelik & Hüseyin Eren Doğan & Umut Şahin
