Forecasting Climate Change: An Exploratory Time Series Data Analysis Approach

Shardul Chavan
AI Skunks
13 min read · Apr 10, 2023


What is Time Series Analysis?

Time series analysis is a statistical method for analyzing and modeling data collected over time. It is commonly used in fields such as finance, economics, and climate science to understand patterns and trends in data. By analyzing past observations, time series analysis can help forecast future values and support informed decisions. It draws on a variety of techniques, ranging from simple smoothing methods to more complex models such as the autoregressive integrated moving average (ARIMA) family.

In this article, we will delve into time series analysis on climate data. We will explore trends and seasonal patterns and examine the concept of stationarity. Through this analysis, we aim to gain deeper insights into climate change.

The dataset used in this article is FAOSTAT climate time series data; it is loaded directly from the URL shown in the code below.

When working with climate time series data, there are several steps you can follow to perform a thorough analysis:

  1. Data Preparation: The first step is to prepare the data for analysis. This involves cleaning and organizing the data, and converting it into a time series format. You may need to address issues such as missing data, outliers, and inconsistent time intervals.
  2. Exploratory Data Analysis (EDA): EDA involves visualizing the data to understand its underlying patterns and relationships. This can be done by plotting the time series data, examining summary statistics, and exploring the autocorrelation structure using ACF and PACF plots.
  3. Time Series Modeling: Once you have a good understanding of the data, you can begin to develop time series models for forecasting. Common models used in climate time series analysis include ARIMA, SARIMA, and exponential smoothing. These models can be fit to the data and used to make forecasts for future time periods.
  4. Model Evaluation: It is important to evaluate the accuracy of the time series models to determine how well they forecast future values. This can be done by comparing predicted values to actual values and calculating metrics such as mean absolute error (MAE) and root mean squared error (RMSE). A minimal sketch of steps 3 and 4 is shown below.
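To make steps 3 and 4 concrete before we turn to the climate data, here is a minimal, self-contained sketch of fitting an ARIMA model with statsmodels and scoring its forecasts with MAE and RMSE. The synthetic monthly series and the (1, 0, 1) order are placeholders for illustration only, not choices derived from the dataset used later in this article.

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Synthetic monthly series standing in for a climate variable (placeholder data)
rng = np.random.default_rng(0)
index = pd.date_range("2000-01-01", periods=120, freq="MS")
series = pd.Series(20 + 5 * np.sin(2 * np.pi * index.month / 12) + rng.normal(0, 1, len(index)), index=index)

# Hold out the last 12 months for evaluation
train, test = series.iloc[:-12], series.iloc[-12:]

# Step 3: fit a simple ARIMA model (the (1, 0, 1) order is illustrative, not tuned)
model = ARIMA(train, order=(1, 0, 1)).fit()
forecast = model.forecast(steps=len(test))

# Step 4: compare forecasts against the held-out values
mae = mean_absolute_error(test, forecast)
rmse = np.sqrt(mean_squared_error(test, forecast))
print(f"MAE: {mae:.3f}, RMSE: {rmse:.3f}")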

Load the Dataset

import pandas as pd
import numpy as np

new_climate_data = pd.read_csv("https://raw.githubusercontent.com/shardulchavan/Time_Series_Forecasting/main/Data/FAOSTAT_data_1-10-2022.csv")

Data Imputation

In time series analysis, data imputation plays a crucial role in handling missing data. It involves filling in missing values using statistical methods such as interpolation, extrapolation, or regression. Proper imputation helps preserve the accuracy and reliability of the analysis results.

We will first introduce artificial NaN values (keeping an untouched copy of the true values) and test different imputation techniques to find what works best for this dataset.

new_climate_data1 = new_climate_data.copy()  # keep an untouched copy of the true values for evaluation
num_rows = 2000
# Randomly select 2000 rows and replace their 'Value' entries with NaN
rows = np.random.choice(new_climate_data.index, size=num_rows, replace=False)
new_climate_data.loc[rows, 'Value'] = np.nan

Next, we fill the artificially created gaps using four common imputation methods:

  1. Mean imputation fills missing values with the mean of the available data.
  2. Forward fill imputes missing values with the last available value.
  3. Linear interpolation fills missing values using a linear function that connects the available data points.
  4. Moving average imputes missing values with the average of a sliding window of data points.

These methods can help to address missing values in the data and ensure that it is suitable for analysis.

new_climate_data['Mean_imputed_Value'] = new_climate_data['Value'].fillna(new_climate_data['Value'].mean())  # Mean imputation
new_climate_data['FwdFill_Imputed_Value'] = new_climate_data['Value'].fillna(method='ffill')  # Forward fill
new_climate_data['linearinterpolate_Imputed_Value'] = new_climate_data['Value'].interpolate(method='linear')  # Linear interpolation
new_climate_data['MA_imputed'] = new_climate_data['Value'].rolling(window=4, min_periods=1, center=True).mean()  # Moving average: a centered 4-point rolling mean (computed for every row, not only the missing ones)

To compare the different imputation methods, we can calculate evaluation metrics that measure how close the imputed values are to the true values. Here are a few commonly used metrics:

  1. Root Mean Squared Error (RMSE): This measures the average difference between the imputed values and the true values, with larger differences weighted more heavily. Lower RMSE values indicate better imputation accuracy.
  2. Mean Absolute Error (MAE): This measures the average absolute difference between the imputed values and the true values, without weighting. Lower MAE values indicate better imputation accuracy.
  3. Mean Absolute Percentage Error (MAPE): This measures the average percentage difference between the imputed values and the true values. Lower MAPE values indicate better imputation accuracy.
  4. Correlation Coefficient (r): This measures the strength and direction of the linear relationship between the imputed values and the true values. Higher correlation coefficients indicate better imputation accuracy.
from sklearn.metrics import mean_squared_error, mean_absolute_error
from scipy.stats import pearsonr

rmse = pd.Series({
    'ffill': np.sqrt(mean_squared_error(new_climate_data1['Value'], new_climate_data['FwdFill_Imputed_Value'])),
    'mean': np.sqrt(mean_squared_error(new_climate_data1['Value'], new_climate_data['Mean_imputed_Value'])),
    'moving_average': np.sqrt(mean_squared_error(new_climate_data1['Value'], new_climate_data['MA_imputed'])),
    'linear_interpolation': np.sqrt(mean_squared_error(new_climate_data1['Value'], new_climate_data['linearinterpolate_Imputed_Value']))
})
mae = pd.Series({
    'ffill': mean_absolute_error(new_climate_data1['Value'], new_climate_data['FwdFill_Imputed_Value']),
    'mean': mean_absolute_error(new_climate_data1['Value'], new_climate_data['Mean_imputed_Value']),
    'moving_average': mean_absolute_error(new_climate_data1['Value'], new_climate_data['MA_imputed']),
    'linear_interpolation': mean_absolute_error(new_climate_data1['Value'], new_climate_data['linearinterpolate_Imputed_Value'])
})
mape = pd.Series({
    'ffill': np.mean(np.abs((new_climate_data1['Value'] - new_climate_data['FwdFill_Imputed_Value']) / new_climate_data1['Value'])) * 100,
    'mean': np.mean(np.abs((new_climate_data1['Value'] - new_climate_data['Mean_imputed_Value']) / new_climate_data1['Value'])) * 100,
    'moving_average': np.mean(np.abs((new_climate_data1['Value'] - new_climate_data['MA_imputed']) / new_climate_data1['Value'])) * 100,
    'linear_interpolation': np.mean(np.abs((new_climate_data1['Value'] - new_climate_data['linearinterpolate_Imputed_Value']) / new_climate_data1['Value'])) * 100
})
r = pd.Series({
    'ffill': pearsonr(new_climate_data1['Value'], new_climate_data['FwdFill_Imputed_Value'])[0],
    'mean': pearsonr(new_climate_data1['Value'], new_climate_data['Mean_imputed_Value'])[0],
})

print(rmse.idxmin(),rmse.min())
print(mae.idxmin(),mae.min())
print(mape.idxmin(),mape.min())
print(r.idxmax(),r.max())

Output:
mean 0.09987648837459784
linear_interpolation 0.006357208168927806
mean 0.995345366999649

Based on this output, the linear interpolation method yields the lowest error (0.0063), compared with 0.0999 for mean imputation, indicating that it is the most accurate of the methods tested here.

Based on these results, we will impute the genuine missing values in the original dataset using linear interpolation and then check for any remaining null values.

# 'imputed_data' is assumed to be a copy of the loaded data with its genuine missing values
imputed_data = new_climate_data1.copy()
imputed_data['Value'] = imputed_data['Value'].interpolate(method='linear')
imputed_data.isnull().sum()

When dealing with time series data, it is important to ensure that the data is properly indexed based on the time of the instance in the dataset. In order to achieve this, we need to encode the months into a numerical form and create a new date column that includes the year, month, and day in a datetime format.

To accomplish this, we begin by converting the month names in the dataset into numerical values. We then create a new column named ‘date’ that combines the year and month (with the day fixed to the first of the month) into a single datetime value. This process ensures that the data is properly indexed and allows us to perform time-based analyses such as trend and seasonality analysis.

map_encoding = {'January': 1, 'February': 2, 'March': 3, 'April': 4, 'May': 5, 'June': 6, 'July': 7,
                'August': 8, 'September': 9, 'October': 10, 'November': 11, 'December': 12}
imputed_data['new_Months_Code'] = imputed_data['Months'].map(map_encoding).astype(int)
# Combine year and month (with day fixed to 1) into a new 'date' column
imputed_data['date'] = pd.to_datetime(imputed_data["Year"].astype(str) + "-" + imputed_data["new_Months_Code"].astype(str) + "-01")
# Sort chronologically; 'date' stays a regular column for now and becomes the index after column selection in the next step
climate_data = imputed_data.sort_values('date', ascending=True).reset_index(drop=True)

Visualizing Time Series Data

We proceed with only the columns needed for the rest of the analysis:

final_climate_data = climate_data.loc[:, ['Area Code (FAO)', 'Area', 'Months', 'Year', 'Value', 'date']]
final_climate_data.set_index('date', inplace=True)

import matplotlib.pyplot as plt

# Create a histogram of the temperature values
plt.hist(final_climate_data['Value'])
plt.xlabel('Temperature in Celsius')
plt.show()
import pandas as pd
import plotly.express as px

# Calculate the average temperature by year and region
avg_temp_by_year_region = final_climate_data.groupby(['Year', 'Area'])['Value'].mean().reset_index()

# Create a line chart with one line per region
fig = px.line(avg_temp_by_year_region, x='Year', y='Value', color='Area',
              labels={'Year': 'Year', 'Value': 'Average Temperature (°C)', 'Area': 'Region'},
              title='Yearly Average Temperature by Region')

# Show the chart
fig.show()

Here we calculate the average temperature for each year and region with groupby, then use plotly to draw a line chart with one line per region, showing how the yearly average temperature evolves over time.
import plotly.graph_objs as go
import pandas as pd

# Load data
data = final_climate_data

# Create choropleth map
fig = go.Figure(data=go.Choropleth(
    locations=data['Area'],   # Geographical region
    z=data['Value'],          # Temperature
    locationmode='country names',
    colorscale='Viridis',     # Color scale
    colorbar_title="Temperature (°C)",
))

# Set layout
fig.update_layout(
    title_text='Temperature by Region',
    geo=dict(
        showframe=False,
        showcoastlines=False,
        projection_type='equirectangular'
    ),
    updatemenus=[dict(
        type='buttons',
        buttons=[dict(label='Play',
                      method='animate',
                      args=[None,
                            {'frame': {'duration': 1000, 'redraw': False},
                             'fromcurrent': True, 'transition': {'duration': 0}}]),
                 dict(label='Pause',
                      method='animate',
                      args=[[None], {'frame': {'duration': 0, 'redraw': False},
                                     'mode': 'immediate',
                                     'transition': {'duration': 0}}])
                 ]
    )]
)

# Create one frame per year by filtering the data for that year
frames = [go.Frame(
    data=[go.Choropleth(
        locations=data.loc[data['Year'] == year, 'Area'],
        z=data.loc[data['Year'] == year, 'Value'],
        locationmode='country names',
        colorscale='Viridis',
        colorbar_title="Temperature (°C)",
    )],
    name=str(year)
) for year in range(2000, 2010)]

# Add frames to figure
fig.frames = frames

# Show figure
fig.show()
import plotly.graph_objs as go
import pandas as pd

# Create 3D scatter plot of temperature by area and year
fig = go.Figure(data=[go.Scatter3d(
    x=data['Year'],    # Year
    y=data['Area'],    # Area
    z=data['Value'],   # Temperature
    mode='markers',
    marker=dict(
        size=10,
        color=data['Value'],    # Color by temperature
        colorscale='Viridis',   # Color scale
        opacity=0.8
    )
)])

# Set layout
fig.update_layout(
    scene=dict(
        xaxis_title='Year',
        yaxis_title='Area',
        zaxis_title='Temperature (°C)'),
    title='Temperature by Area and Year'
)

# Show figure
fig.show()
from scipy.stats import linregress
from matplotlib import pyplot as plt
yearly_mean = final_climate_data.groupby('Year')['Value'].mean()

# Fit a linear regression model to the data
slope, intercept, r_value, p_value, std_err = linregress(yearly_mean.index, yearly_mean.values)

# Plot the yearly mean temperatures in a line chart

plt.plot(yearly_mean.index, yearly_mean.values)
plt.xlabel('Year')
plt.ylabel('Average Temperature (Celsius)')
plt.title('Yearly Mean Temperature')

# Plot the regression line over the line chart
plt.plot(yearly_mean.index, slope * yearly_mean.index + intercept, color='red', label='Trend line')

plt.legend()
plt.show()
Here we compute the yearly mean temperatures, fit a linear regression to them with linregress, and overlay the fitted trend line on the chart to highlight the long-term change in temperature.
# Compute the monthly mean temperature (assumes the datetime index set earlier)
monthly_mean = final_climate_data['Value'].resample('M').mean()

# Fit a quadratic polynomial trend line using numpy's polyfit() function
X = (monthly_mean.index - monthly_mean.index.min()).days
trend_coeffs = np.polyfit(X, monthly_mean.values, 2)
trend_line = np.polyval(trend_coeffs, X)

# Create a line chart using matplotlib
plt.plot(X, monthly_mean.values)
plt.plot(X, trend_line, 'r--')

# Add labels and title to the chart
plt.xlabel('Days since start of the series')
plt.ylabel('Temperature (Celsius)')
plt.title('Monthly Mean Temperature with Quadratic Trend Line')

# Display the chart
plt.show()
Here, we fit a quadratic polynomial trend line to the monthly mean temperature data using numpy’s polyfit() function and then plot the monthly means along with the trend line using matplotlib.
from statsmodels.tsa.seasonal import seasonal_decompose

decomposition = seasonal_decompose(final_climate_data['Value'], model='additive',period=12)
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid


plt.plot(trend)
plt.xlabel('Month')
plt.ylabel('Temperature (Celsius)')
plt.title('Trend Plot of Temperature')
plt.show()
We decompose the time series into its trend, seasonal, and residual components using the seasonal_decompose function from the statsmodels library, and then visualize the trend component with matplotlib.

plt.plot(seasonal)
plt.xlabel('Month')
plt.ylabel('Temperature (Celsius)')
plt.title('Seasonal Plot of Temperature')
plt.show()
Here we plot the seasonal component returned by seasonal_decompose() as a line chart to examine the repeating within-year pattern in temperature.

Stationarity Testing

According to the textbook Forecasting: Principles and Practice by Rob J. Hyndman and George Athanasopoulos:

A stationary time series is one whose properties do not depend on the time at which the series is observed. Thus, time series with trends, or with seasonality, are not stationary — the trend and seasonality will affect the value of the time series at different times. On the other hand, a white noise series is stationary — it does not matter when you observe it, it should look much the same at any point in time.
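To make this distinction concrete, here is a small self-contained sketch (using simulated data rather than the climate dataset) that plots a white-noise series, which is stationary, next to a random walk, whose level drifts over time and is therefore non-stationary:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n = 500

white_noise = rng.normal(0, 1, n)      # stationary: constant mean and variance
random_walk = np.cumsum(white_noise)   # non-stationary: the level drifts over time

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))
ax1.plot(white_noise)
ax1.set_title('White noise (stationary)')
ax2.plot(random_walk)
ax2.set_title('Random walk (non-stationary)')
plt.tight_layout()
plt.show()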

What is Autocorrelation?

Autocorrelation is an important statistical tool in time series analysis because it can help identify patterns and relationships between observations at different lags. Specifically, autocorrelation measures the correlation between a time series and a lagged version of itself.

It is a useful tool for understanding the properties of time series data and for building models that can be used for forecasting and other applications in time series analysis.

Autocorrelation formula:
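For a series y_1, …, y_n with sample mean ȳ, the sample autocorrelation at lag k is

r_k = [ Σ_{t=k+1}^{n} (y_t − ȳ)(y_{t−k} − ȳ) ] / [ Σ_{t=1}^{n} (y_t − ȳ)² ]

so r_0 = 1, and values near zero at a given lag indicate little linear relationship between observations that far apart. As a quick sanity check, the formula can be computed directly with numpy and compared against the acf() function from statsmodels (the series here is synthetic and purely illustrative):

import numpy as np
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(0)
y = rng.normal(0, 1, 200)   # illustrative series
y_bar = y.mean()
k = 3

# Sample autocorrelation at lag k, computed from the formula above
r_k = np.sum((y[k:] - y_bar) * (y[:-k] - y_bar)) / np.sum((y - y_bar) ** 2)

print(r_k)                  # manual computation
print(acf(y, nlags=k)[k])   # statsmodels result; should match up to floating-point error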

How to Choose Autocorrelation Technique?

The choice of autocorrelation technique to use depends on several characteristics of the time series data, including:

  1. Type of time series: Is the time series stationary or non-stationary? A stationary time series has constant statistical properties over time, while a non-stationary time series has changing statistical properties. The type of time series can influence the choice of autocorrelation technique.
  2. Seasonality: Does the time series exhibit seasonality or periodicity? If so, methods such as seasonal autoregressive integrated moving average (SARIMA) models or Fourier analysis may be appropriate.
  3. Time series length: How long is the time series? Short time series may require simpler models or techniques, while longer time series may require more complex models or techniques to capture the autocorrelation.
  4. Strength and persistence of autocorrelation: How strong and persistent is the autocorrelation in the time series? If the autocorrelation is weak, simple methods such as the autocorrelation function (ACF) may suffice. If the autocorrelation is strong, more advanced methods such as wavelet analysis or autoregressive (AR) models may be required.
  5. Level of noise: How noisy is the time series data? If the time series data contains a lot of noise or random variation, simpler methods such as ACF may be sufficient, while more complex models such as ARIMA or wavelet analysis may be needed if the noise is non-random and exhibits structure.
  6. Purpose of analysis: What is the purpose of the autocorrelation analysis? Different techniques may be appropriate depending on whether the goal is to forecast future values, identify periodicity, or detect unusual events.

Overall, the choice of autocorrelation technique should be based on a careful analysis of the specific characteristics of the time series data and the goals of the analysis. It’s often a good idea to try several different techniques and compare their results to choose the best one for the given situation.

Correlogram

A correlogram, also known as an autocorrelation plot or ACF plot, is a graphical tool used to visualize the autocorrelation function (ACF) of a time series. The correlogram displays the autocorrelation coefficients of a time series at different lags on the y-axis, with the lag values themselves plotted on the x-axis.

A correlogram is a useful tool in time series analysis because it can help identify the presence of patterns in the time series that repeat at regular intervals, known as seasonality. It can also be used to determine the appropriate order of autoregressive (AR) and moving average (MA) models, which are commonly used to model time series data.

In a correlogram, the autocorrelation coefficients are plotted as bars or dots, with error bars indicating the uncertainty in the estimates. The error bars are calculated based on the assumption that the autocorrelation coefficients are normally distributed with a mean of zero and a variance that depends on the sample size and the autocorrelation coefficient itself.
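As a rule of thumb, for a white-noise series of length n the sample autocorrelations are approximately normally distributed with mean zero and standard error 1/√n, so the commonly drawn 95% band is roughly ±1.96/√n; bars that extend beyond this band are usually taken as evidence of genuine autocorrelation at that lag. (Plotting libraries may refine these bounds, for example with Bartlett's formula, but the idea is the same.)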

The correlogram can be used to visualize the autocorrelation function of a single time series or to compare the autocorrelation functions of multiple time series. It is a widely used tool in the field of time series analysis and is supported by many software packages and programming languages, including R and Python.

import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
acf_data = sm.tsa.stattools.acf(final_climate_data['Value'], nlags=30)
pacf_data = sm.tsa.stattools.pacf(final_climate_data['Value'], nlags=30)

How Does Stationarity affect the Autocorrelation test?

  1. Stationarity can affect autocorrelation. Autocorrelation measures the correlation between a time series and a lagged version of itself. If a time series is stationary, the autocorrelation function (ACF) should decay quickly to zero as the lag increases. If the ACF decays slowly or does not decay at all, this may indicate that the time series is non-stationary, or that the model being used to fit the data is inadequate.
  2. For a stationary time series, the partial autocorrelation function (PACF) will converge to zero for all lags beyond the order of the autoregressive (AR) process. This means that the PACF can be used to estimate the order of an AR model for a stationary time series. For a non-stationary time series, however, the PACF may not converge to zero even at large lags, because the relationship between observations at different lags may be affected by changes in the mean, variance, or other properties of the series over time.

# Create correlogram
fig, (ax1, ax2) = plt.subplots(2, 1)
plot_acf(final_climate_data['Value'], lags=30, ax=ax1)
plot_pacf(final_climate_data['Value'], lags=30, ax=ax2)

plt.show()

In our case, the ACF remains close to zero across lags, which could indicate that the time series is stationary. However, it is important to note that other statistical tests may be necessary to confirm stationarity, such as unit root tests like the Augmented Dickey-Fuller (ADF) test or the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test.

Augmented Dickey-Fuller (ADF) test

from statsmodels.tsa.stattools import adfuller

# Perform ADF test
result = adfuller(final_climate_data['Value'])
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))

# Interpret the results
if result[1] <= 0.05:
    print('The time series is stationary.')
else:
    print('The time series is non-stationary.')

Output:

ADF Statistic: -11.120072
p-value: 0.000000
Critical Values:
1%: -3.430
5%: -2.862
10%: -2.567
The time series is stationary.

Based on the given results of the Augmented Dickey-Fuller (ADF) test, the ADF statistic of -11.120072 is lower than the critical values at the 1%, 5%, and 10% levels. Additionally, the p-value of 0.000000 indicates that the null hypothesis of non-stationarity can be rejected with a high level of confidence.

Therefore, we can conclude that the time series being tested is stationary in the sense the ADF test measures: the test provides strong evidence against a unit root, suggesting the series maintains a roughly constant mean and variance over time.

Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test

from statsmodels.tsa.stattools import kpss

result_kpss = kpss(final_climate_data['Value'])
print('\nKPSS Statistic: %f' % result_kpss[0])
print('p-value: %f' % result_kpss[1])
print('Critical Values:')
for key, value in result_kpss[3].items():
    print('\t%s: %.3f' % (key, value))

Output:

KPSS Statistic: 59.434158
p-value: 0.010000
Critical Values:
10%: 0.347
5%: 0.463
2.5%: 0.574
1%: 0.739

InterpolationWarning: The test statistic is outside of the range of p-values available in the look-up table. The actual p-value is smaller than the p-value returned.
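Here the KPSS statistic (59.43) is far above every critical value, and the reported p-value of 0.01 is only the smallest value in the look-up table (hence the warning), so the KPSS test rejects its null hypothesis of stationarity. Because ADF and KPSS test opposite null hypotheses, a disagreement like this one suggests the series deserves a closer look (for example, checking for a deterministic trend or applying differencing) before settling on a final conclusion.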

In conclusion, time series analysis is a powerful tool for understanding and predicting trends in climate data over time. Through the steps of data preparation, exploratory data analysis, time series modeling, and model evaluation, we can gain deeper insights into climate change patterns and make informed decisions about the future. By applying these techniques to the climate time series data, we have uncovered important trends and seasonal patterns, examined the concept of stationarity, and laid the groundwork for building models that forecast future values. As we continue to study climate change and its impact on our planet, time series analysis will undoubtedly play a crucial role in informing our understanding and guiding our actions.
The full code for this article is available in the GitHub repository referenced in the data-loading step above.

References

  1. Hyndman, R. J., & Athanasopoulos, G. (2018). Forecasting: principles and practice. OTexts.
  2. Statistical Modeling of Time Series Data Part 2: Exploratory Data Analysis by Towards AI. Available at: https://towardsai.net/p/data-visualization/statistical-modeling-of-time-series-data-part-2-exploratory-data-analysis
