Modeling Earth Temperature Data

Dec 19, 2021

Danielle Coates, Jhordan Figueroa, Sybil Shi

Global warming and climate change have been hot topics in recent years, and the three of us wanted to analyze temperature data to see how significant the changes in temperature really are. We wanted to take a closer look to determine how much the temperature has actually increased, but we were also curious about the future trend: how much warmer will the earth get? Since we live across the United States, we also wanted to know how the temperature varies in our own regions. How will warming affect our areas?

We found Earth Surface Temperature Data to explore. It consists of five datasets: one for global average temperatures, then ones for temperatures by city, by major city, by state, and by country. There is data from 1750 through 2015, although the data from 1750–1850 is a bit more limited compared to 1850 onwards. We used AWS S3 to host our data and to make it readily available for anyone to use.

import pandas as pd
global_temps_df = pd.read_csv('s3://545finalprojectupenn/GlobalTemperatures.csv')

EDA

First, we wanted to visualize the data to see if we noticed any trends. Starting with a heatmap (shown below), we immediately saw seasonality. The winter months are colder, while the summer months are hot. We also saw a bit more variation in data from 1750–1850. The data is missing some values here, and it’s likely that temperature averaging methods weren’t as consistent. We also confirmed the variation in the data with some line plots and boxplots, and again the temperature averages seem to become less variable after 1850.

#PLOTTING
import seaborn as sns

# Pivot the monthly series into a (year x month) matrix for the heatmap
global_temps_heatmap = global_temps_df.copy()
global_temps_heatmap['dt'] = pd.to_datetime(global_temps_heatmap['dt'])
global_temps_heatmap = global_temps_heatmap.set_index('dt')
series = global_temps_heatmap['LandAverageTemperature']
groups = series.groupby(pd.Grouper(freq='A'))
years = pd.DataFrame()
for name, group in groups:
    years[name.year] = group.values
years = years.T
sns.heatmap(years)

Because there is known seasonal variation in temperature, we decided to plot a rolling average of the global temperature, using a window of 12 months (shown below). Here again we see the variation in some of the early data, but we notice an upward trend, particularly in the last few decades. This is consistent with our expectations, as there has been an increase in greenhouse gases in the atmosphere that would lead to this warming.
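A minimal sketch of how that rolling average can be computed and plotted (working on a local copy so the original DataFrame is untouched; the variable names here are ours):

#PLOTTING (sketch)
import matplotlib.pyplot as plt

# A 12-month rolling average smooths out the within-year seasonal cycle
monthly = global_temps_df.copy()
monthly['dt'] = pd.to_datetime(monthly['dt'])
rolling_avg = monthly.set_index('dt')['LandAverageTemperature'].rolling(window=12).mean()

plt.figure(figsize=(10, 5))
plt.plot(rolling_avg)
plt.title('Global Land Average Temperature (12-month rolling average)')
plt.xlabel('Year')
plt.ylabel('Temperature (Celsius)')
plt.show()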

We also created a plot to analyze the change in the minimum temperature seen each month and the maximum temperature seen each month. Again these plots were created with a 12 month rolling average and are shown below. An interesting observation with these two plots is that the minimum temperature seems to stay fairly consistent, while the maximum temperature seen has been trending upward. The increase in global temperatures and the maximum temperatures seen encouraged us to explore this data further.

#PLOTTING
global_temps_df['dt'] = pd.to_datetime(global_temps_df['dt'])
global_temps_df = global_temps_df.set_index('dt')

# 12-month rolling averages of the monthly minimum and maximum land temperatures
rolling_max = global_temps_df["LandMaxTemperature"].rolling(window=12, center=False).mean()
rolling_min = global_temps_df["LandMinTemperature"].rolling(window=12, center=False).mean()

plt.subplots(1, 2, figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.plot(rolling_min, 'b')
plt.title('Land Minimum Temperatures')
plt.xlabel('Year')
plt.ylabel('Temperature (Celsius)')
plt.subplot(1, 2, 2)
plt.plot(rolling_max, 'r')
plt.title('Land Maximum Temperatures')
plt.xlabel('Year')
plt.ylabel('Temperature (Celsius)')
plt.tight_layout(pad=4)
plt.show()

We also wanted to look specifically at the United States, since all three of us live there. We put together a choropleth map (shown below) to look at how state temperatures vary over time. Although this blog does not support video, you can see a time lapse here in our Colab notebook. We noticed as expected that the northern states experience colder winters and the southern states have hotter summers. We also noted that coastal states had slightly more moderate weather than their inland counterparts at the same latitude, likely due to the moderating effect of the ocean.
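For reference, this kind of animated choropleth can be sketched with Plotly Express. The snippet below is illustrative rather than our exact notebook code, and it assumes a DataFrame state_temps with a column of two-letter state codes, a year column, and an AverageTemperature column:

#CHOROPLETH (sketch)
import plotly.express as px

# Animated US map of average temperature by state and year
fig = px.choropleth(
    state_temps,
    locations='state_code',          # two-letter state abbreviations (assumed column)
    locationmode='USA-states',
    color='AverageTemperature',
    animation_frame='year',
    scope='usa',
    color_continuous_scale='RdBu_r',
    labels={'AverageTemperature': 'Avg Temp (C)'},
)
fig.show()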

We delved into the state differences a bit more with a query to determine how our states (New Jersey and Hawaii) compare to each other. Danielle and Sybil have been dealing with some pretty cold temperatures this winter, but Jhordan brags that he has nice temperatures all year. Hawaii averages 22℃ (~71℉) with a standard deviation of about 1.5℃, while New Jersey has an average temperature of 10℃ (~50℉) with a standard deviation of 8.7℃. That large standard deviation reflects New Jersey's pronounced seasonal swings, compared to Hawaii's relatively moderate climate.
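That comparison came from a Spark SQL query along these lines (a sketch; it assumes the by-state dataset has been registered as a temp view named state_temps):

#SPARK QUERY (sketch)
state_compare_query = """SELECT State,
       AVG(AverageTemperature)    AS MeanTemperature,
       STDDEV(AverageTemperature) AS TemperatureStdDev
FROM state_temps
WHERE State IN ('New Jersey', 'Hawaii')
GROUP BY State"""
spark.sql(state_compare_query).show()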

So, visually we have been noticing an increase in temperature, but how much has it really been rising? We decided to look at regional data to find the greatest temperature increase and the smallest temperature increase (or decrease) from the beginning of the dataset to the end. We took an average of the first 20 years and the last 20 years to compare them, since this reduces some of the variability in the data while still keeping a large span between the two periods. The cities varied across regions, but the most important thing we saw was that everywhere experienced a temperature increase; the smallest change was a 0.65℃ increase in Edmonton, Canada. Only 12 cities in the city dataset and 4 cities in the major city dataset had an increase of less than 1℃. The vast majority of cities experienced a 1–2℃ increase in temperature, with some experiencing even larger increases. The query below compares two temp views, early_average and late_average, which hold each city's average temperature over the first and last 20 years of its record.
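A sketch of how those views might be built (assuming the by-city dataset is registered as a temp view named city_temps with a year column already extracted; the exact year ranges and view construction in our notebook may differ):

#SPARK VIEWS (sketch)
early_average = spark.sql("""SELECT City, Country, AVG(AverageTemperature) AS EarlyTemperature
FROM city_temps
WHERE year BETWEEN 1750 AND 1769
GROUP BY City, Country""")
early_average.createOrReplaceTempView('early_average')

late_average = spark.sql("""SELECT City, Country, AVG(AverageTemperature) AS LateTemperature
FROM city_temps
WHERE year BETWEEN 1996 AND 2015
GROUP BY City, Country""")
late_average.createOrReplaceTempView('late_average')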

#SPARK QUERY
min_difference_query = """SELECT late_average.City, late_average.Country, (late_average.LateTemperature - early_average.EarlyTemperature) AS TemperatureDifference
FROM late_average
JOIN early_average ON late_average.City = early_average.City AND late_average.Country = early_average.Country
ORDER BY TemperatureDifference
LIMIT 20"""
min_difference = spark.sql(min_difference_query)
min_difference.createOrReplaceTempView('min_difference')
min_difference.show()

Here we see that the smallest change in temperature is still an increase of 0.65℃. It appears, then, that most cities have seen their temperature rise by at least 1℃ over the past 250 years.

MODELING

Now that we understand that increases are happening everywhere, we wanted to find out whether we could model the climate and use it to predict future temperatures. We started with linear regression, using cross-validation to find the best regularization parameter. Our linear regression model is shown below as the orange line. We found an equation of:

y = 0.0044x + 0.99, where x is the scaled year (after feature scaling) and y is the temperature.

This model predicts that we will see a 0.44℃ increase each century. While this is an okay model and somewhat follows the general trend, it doesn't capture the steeper increase in temperature we have seen over the past few decades. For example, it predicts a 2020 temperature of 8.98℃, when in actuality it was 9.38℃.
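A minimal sketch of that kind of cross-validated, regularized linear regression (using scikit-learn's RidgeCV; the alpha grid is illustrative, and X and y stand for the feature matrix and temperature targets prepared earlier):

#LINEAR REGRESSION (sketch)
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Ridge regression with the regularization strength chosen by cross-validation
linear_model = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5))
linear_model.fit(X, y)

# Slope and intercept in the scaled-feature space
ridge = linear_model.named_steps['ridgecv']
print(ridge.coef_, ridge.intercept_)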

To improve on this model, we then decided to set up a neural network. We used Rectified Linear Unit (ReLU) activations with the Adam optimizer. After training for 10 epochs, we were able to get our predictions to line up pretty well with the actual data, as seen in the graph below.

#NEURAL NETWORK MODEL
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Hold out 20% of the data for testing, then standardize the features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)

# Fully connected network: two input features, three hidden ReLU layers, one linear output
model_nn = Sequential()
model_nn.add(Dense(6500, kernel_initializer='normal', input_dim=2, activation='relu'))
model_nn.add(Dense(4000, activation='relu'))
model_nn.add(Dense(3500, activation='relu'))
model_nn.add(Dense(1))
model_nn.compile(loss='mse', optimizer='adam', metrics=['mse', 'mae'])
output_model = model_nn.fit(X_train, y_train, epochs=10, batch_size=150, verbose=1, validation_split=0.2)

Because neural networks are not limited to a linear function, they can generally predict labels more accurately. Our MSE was similar to that of the linear regression, but the neural net has the ability to predict the temperature for a given month, which the linear regression cannot do. We can also see that it predicts the temperature in 2020 more accurately than our linear model, with our net's prediction at 9.3℃ against an actual temperature of 9.36℃.
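For illustration, a single prediction from the trained network looks roughly like this (a sketch; the year-then-month feature order is our assumption about how X was built):

#PREDICTION (sketch)
import numpy as np

# Predict the land average temperature for July 2020
features_2020 = scaler.transform(np.array([[2020, 7]]))
print(model_nn.predict(features_2020))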

So we know the neural net will generally predict better than linear models, but what about a model designed for time series data specifically? We decided to use an Autoregressive Integrated Moving Average (ARIMA) model to see if we could improve our predictions even further compared to the neural network. We checked stationarity by visual inspection, summary statistics, and the Augmented Dickey–Fuller (ADF) test, all of which suggest the series is stationary. We checked seasonality by visual inspection, which suggests a yearly cycle. Since ARIMA supports data with trends but not data with seasonality, seasonality needs to be removed prior to ARIMA modeling. Alternatively, we found that the Seasonal Autoregressive Integrated Moving Average (SARIMA) model supports data with both trend and seasonality. In addition, we found an auto_arima package that can build a SARIMA model automatically. Can we build a better version than the automatic one? That sounded challenging and interesting, so we decided to build a customized SARIMA model and then compare it with the automatic SARIMA model.
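A minimal sketch of the ADF check with statsmodels (monthly_series here stands for the monthly land average temperature series; the name is ours):

#STATIONARITY CHECK (sketch)
from statsmodels.tsa.stattools import adfuller

# A p-value below 0.05 rejects the null hypothesis of a unit root,
# i.e. the series looks stationary
adf_stat, p_value, *_ = adfuller(monthly_series.dropna())
print(f'ADF statistic: {adf_stat:.3f}, p-value: {p_value:.4f}')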

The SARIMA model has seven parameters that need to be tuned: three are the same as ARIMA (p, d, q) and four cover the seasonal component (P, D, Q, m). Given that this monthly data exhibits an obvious yearly cycle, we can set the seasonal period m to 12 and tune the remaining six. We wrote our own grid search to find the parameter set that minimizes RMSE, while the automatic version performs a stepwise search for the set that minimizes the Akaike Information Criterion (AIC). The table below shows the parameters and AIC, and a sketch of our grid search follows it.

Parameters for AIC
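The custom grid search looked roughly like this (a sketch using statsmodels' SARIMAX; the function name and parameter ranges are ours, and train/test are the split monthly series):

#SARIMA GRID SEARCH (sketch)
import itertools
import numpy as np
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.statespace.sarimax import SARIMAX

def search_sarima(train, test, p_list, d_list, q_list, P_list, D_list, Q_list, m=12):
    """Fit a SARIMA model for every parameter combination and keep the one
    with the lowest RMSE on the held-out test set."""
    best_rmse, best_order = np.inf, None
    for p, d, q, P, D, Q in itertools.product(p_list, d_list, q_list, P_list, D_list, Q_list):
        try:
            fitted = SARIMAX(train, order=(p, d, q), seasonal_order=(P, D, Q, m)).fit(disp=False)
            forecast = fitted.forecast(steps=len(test))
            rmse = np.sqrt(mean_squared_error(test, forecast))
            if rmse < best_rmse:
                best_rmse, best_order = rmse, ((p, d, q), (P, D, Q, m))
        except Exception:
            continue  # skip parameter sets that fail to converge
    return best_order, best_rmse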

The AIC of the customized version is lower than that of the automatic version, suggesting slightly better performance, which can also be seen in the chart below of predictions on the test set.

AIC graph comparisons

Additionally, using the ARIMA model, we developed a function to predict the temperature of a future year in a given city. We wanted to know about our local areas and also give people the ability to look up how the climate might be changing in their specific region.
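A sketch of what such a function can look like (the function name, default orders, and column names are our assumptions; city_temps_df stands for the by-city pandas DataFrame):

#CITY FORECAST (sketch)
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

def predict_city_temperature(city_temps_df, city, year,
                             order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)):
    """Fit a seasonal ARIMA model to one city's monthly series and return
    the predicted average temperature for the requested year."""
    city_data = city_temps_df[city_temps_df['City'] == city].copy()
    city_data['dt'] = pd.to_datetime(city_data['dt'])
    series = city_data.set_index('dt')['AverageTemperature'].dropna()

    fitted = SARIMAX(series, order=order, seasonal_order=seasonal_order).fit(disp=False)

    # Forecast month by month up to the end of the requested year,
    # then average the final 12 months
    steps = (year - series.index[-1].year) * 12
    forecast = np.asarray(fitted.forecast(steps=steps))
    return forecast[-12:].mean()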

CONCLUSION

Overall, comparing the model types on the last 5 years of temperature data, shown in the table below, we see that the neural network has the best performance. Linear regression was the least accurate, since a linear model can't accommodate the steepening temperature trend seen over the past few decades. Both SARIMA models and the neural network can predict the average temperature for a year as well as the temperature for a given month, and both handle seasonality much better.

Model Comparisons

Using the neural network, our model predicts that by 2050 the average temperature will be 10.06℃. This is almost a 2℃ increase from the mid-1800s, which is consistent with many other climate change model predictions. Thus, we can reliably use our model for further exploration of the effects of climate change.

While we continually discussed improvements and ways to make this project more robust, due to time constraints we were not able to solve all of our challenges, and there are features we would be interested in improving and implementing. Our first challenge was data cleaning. Our dataset had a large amount of missing values, and our end solution was to drop them, which can affect model performance. We experimented with data imputation using KNNImputer, but we ran into issues with Spark crashing on this data, so we decided to drop the NaN values to meet the deadline. An improvement for the future would be to try working with AWS EMR to run this processing.

We also encountered a challenge while doing the grid search over the six SARIMA parameters: it is very time consuming, and the more parameter combinations, the longer it takes. We designed a process to cut it down from a couple of hours to 36 minutes (for the same set of search lists). This process could be parallelized further on a Spark cluster: we can process "mini-batches" of parameter combinations simultaneously, find the optimum within each mini-batch, and then take the optimum across all mini-batches, as sketched below.
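A rough sketch of that idea (assuming the search_sarima function from the earlier snippet, a running spark session, and the train/test split; the grid and slice count are illustrative):

#PARALLEL GRID SEARCH (sketch)
import itertools

# All candidate (p, d, q, P, D, Q) combinations
param_grid = list(itertools.product(range(3), range(2), range(3), range(2), range(2), range(2)))

def evaluate(params):
    # Each Spark task fits one SARIMA configuration and reports its test RMSE
    p, d, q, P, D, Q = params
    _, rmse = search_sarima(train, test, [p], [d], [q], [P], [D], [Q])
    return (rmse, params)

best_rmse, best_params = (spark.sparkContext
                          .parallelize(param_grid, numSlices=32)
                          .map(evaluate)
                          .min(key=lambda pair: pair[0]))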

Another challenge we faced was version control. Working on Google Colab as a team was not ideal, because we learned you can't work on the project in the same Colab notebook at the same time. With more time, we would explore software better suited for collaborative data science or potentially move to a GitHub repository.

The last challenge we faced was optimization. Every time someone started a new session to work on the project, all the cells would have to be re-run. So a lot of time was spent downloading data and getting it into the correct form. With more time, we would improve our pipelines using AWS to have the data in the correct format faster for more rapid prototyping.

Looking forward to future work, we are interested in continuing to improve our model. We want to do this by implementing a crawler to scrape data in real time. We want to update our model with daily temperatures to be able to continue to better predict temperatures in the future. As more data becomes available, our model can become more accurate.

Overall, the project turned out to be a great experience and we all enjoyed working together!
