Time Series Analysis

Cao Jianzhen
Published in AI Skunks
Apr 1, 2023

A time series is a sequence of values of the same statistical variable arranged in the order in which they occur. Depending on the observation frequency, the time points in a time series can be years, quarters, months, or any other time unit. The main purpose of time series analysis is to predict the future based on existing historical data.
In this article, we will walk through the whole process of handling time series data for prediction, using a weather dataset from Kaggle.

Content

  1. Visualize the Time Series
  2. Detect Missing Values and Impute
  3. Train-Test Splitting
  4. Calculate Time Series Statistics
    4.1. Covariance
    4.2. Autocovariance
    4.3. First Intuitions on (Weak) Stationarity
    4.4. Autocorrelation
  5. Moving Average Model
    5.1. MA(1)
    5.2. MA(n)
  6. ARMA Model
    6.1. Autoregressive Model
    6.2. Autoregressive Moving Average Model
  7. ARIMA Model
    7.1. Stationarity
    7.2. Differencing
    7.3. Autoregressive Integrated Moving Average Model
  8. fbProphet Model
  9. Greykite Model
  10. Neural Network Model
    10.1. Neural Network Model Structure
    10.2. Neural Network Model Training
  11. AIC & BIC

Visualize the Time Series

What would you like to do first when you get a new dataset?
List equations to calculate some statistics?
Fit models on the dataset?
Or something else along those lines?
I don’t think so.
In my opinion, it is very important to get an intuitive view of a new dataset first. That intuition gives us basic knowledge and a direction for further analysis, so we do not get lost in the mathematical maze.
So let’s get started with visualizing.

import pandas as pd 

import numpy as np

import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
url='https://raw.githubusercontent.com/rrRIOoo/data_cache/main/boston_weather_data.csv'
df = pd.read_csv(url, encoding= 'unicode_escape')
df_original = df.copy(deep=False)

df.head()
df.describe()
       tavg       tmin         tmax         prcp       wdir       wspd         pres
count 3652.000000 3653.000000 3653.000000 3653.000000 3066.000000 3653.000000 3492.000000
mean 11.549863 7.507692 15.875691 2.936819 200.876712 17.555407 1016.520790
std 9.559483 9.386869 10.221851 7.505755 100.070063 6.009118 7.767252
min -17.800000 -23.300000 -12.100000 0.000000 0.000000 2.500000 983.900000
25% 3.900000 0.600000 7.200000 0.000000 112.000000 13.300000 1011.400000
50% 11.500000 7.800000 15.600000 0.000000 226.500000 16.600000 1016.400000
75% 19.900000 15.600000 24.400000 1.500000 279.000000 20.900000 1021.600000
max 32.300000 28.300000 37.800000 78.000000 360.000000 61.200000 1042.400000

This dataset contains Boston weather data from 2013 to 2023. In table format, we can clearly see what kinds of data the dataset includes.

# Split the 'time' column (YYYY-MM-DD) into separate year, month, and day columns
year = []
month = []
day = []
for i in df['time']:
    tmp = i.split('-')
    year.append(int(tmp[0]))
    month.append(int(tmp[1]))
    day.append(int(tmp[2]))
df['year'] = year
df['month'] = month
df['day'] = day
df.drop(labels='time', axis=1, inplace=True)
print(df.head())
   tavg  tmin  tmax  prcp   wdir  wspd    pres  year  month  day
0 3.2 1.1 5.0 0.0 342.0 15.1 1002.7 2013 3 1
1 3.1 1.1 5.6 0.0 307.0 14.4 1004.2 2013 3 2
2 2.6 0.6 6.1 0.0 NaN 14.4 1002.6 2013 3 3
3 1.8 -0.6 5.0 0.0 303.0 28.1 1003.0 2013 3 4
4 2.9 0.0 6.1 0.0 NaN 16.6 1013.5 2013 3 5

We split the time column into three new columns, year, month, and day, so that we can handle the dataset more conveniently.
Now we have a simple, rough question: has the weather in Boston been cold or hot in these years?
To answer this question, we can plot a histogram. A histogram is a graphical display of data using bars of different heights. In a histogram, each bar groups numbers into ranges; taller bars show that more data falls in that range. A histogram displays the shape and spread of continuous sample data.

plt.hist(df['tavg'])
plt.xlabel('Temperature (Celsius)')
plt.ylabel('Frequency')
plt.title('Histogram of Average Temperatures (Celsius)')
plt.show()

As we can see, most values fall between 0 and 25, and the whole range is roughly -20 to 35, so we can say the weather in Boston is cool in general, neither too cold nor too hot. The temperature distribution also looks reasonable, with almost no outliers.

# Drop any extreme outliers below -33 °C (none exist in this dataset, so the histogram is unchanged)
df = df[df['tavg'] > -33]
plt.hist(df['tavg'])
plt.xlabel('Temperature (Celsius)')
plt.ylabel('Frequency')
plt.title('Histogram of Average Temperatures (Celsius)')
plt.show()

The next question is how the temperature changes over these years. To show that, we can use a line chart: we calculate the mean temperature of each year and plot it. The line rises or falls at each year, representing temperature changes.

yearly_mean = df.groupby('year')['tavg'].mean()

plt.plot(yearly_mean.index, yearly_mean.values)
plt.xlabel('year')
plt.ylabel('Average Temperature (Celsius)')
plt.title('Yearly Mean Temperature')
plt.show()

But wait! The temperature before 2014 runs very high and after 2022 suddenly drops, which looks odd. That is because our dataset only contains weather data from March 1st, 2013 to March 1st, 2023. Here we discard the 2013 and 2023 data because those years are incomplete.

# Remove the incomplete 2013 and 2023 data from the DataFrame
df = df[df['year'] < 2023]
df = df[df['year'] > 2013]

# Aggregate the data by year and calculate the mean temperature for each year
yearly_mean = df.groupby('year')['tavg'].mean()

# Plot the yearly mean temperatures in a line chart
plt.plot(yearly_mean.index, yearly_mean.values)
plt.xlabel('Year')
plt.ylabel('Average Temperature (Celsius)')
plt.title('Yearly Mean Temperature')
plt.show()
#!pip install scipy

There is already a visible trend over these years: the temperature basically keeps going up, which is consistent with the conclusion that global temperatures are rising. It is still good to fit a line or curve to indicate the trend so that we can see it more clearly; we call this a trend line. We can try a linear or a polynomial model.

from scipy.stats import linregress

# Fit a linear regression model to the data
slope, intercept, r_value, p_value, std_err = linregress(yearly_mean.index, yearly_mean.values)

# Plot the yearly mean temperatures in a line chart
plt.plot(yearly_mean.index, yearly_mean.values)
plt.xlabel('Year')
plt.ylabel('Average Temperature (Celsius)')
plt.title('Yearly Mean Temperature')

# Plot the regression line over the line chart
plt.plot(yearly_mean.index, slope * yearly_mean.index + intercept, color='red', label='Trend line')

plt.legend()
plt.show()

# Fit a quadratic polynomial trend line using numpy's polyfit() function
trend_coeffs = np.polyfit(yearly_mean.index, yearly_mean.values, 2)
trend_line = np.polyval(trend_coeffs, yearly_mean.index)

# Create a line chart using matplotlib
plt.plot(yearly_mean.index, yearly_mean.values)
plt.plot(yearly_mean.index, trend_line, color='red', label='Trend line')

# Add labels and title to the chart
plt.xlabel('Year')
plt.ylabel('Average Temperature (Celsius)')
plt.title('Yearly Mean Temperature with Quadratic Trend Line')

# Display the chart
plt.legend()
plt.show()

We can also use a bar chart with a color map to show the temperature. We can inspect this chart in two dimensions: length and color. A longer bar stands for a higher temperature. A color closer to yellow represents a higher temperature, and blue represents a lower temperature, just as the spectrum shows.

fig = px.bar(yearly_mean.reset_index(), x='year', y='tavg', color='tavg', labels={'year': 'Year', 'tavg': 'Average Temperature (Celsius)'}, title='Average Temperature by Year with Color Map')

# Show the chart
fig.show()

We can also explore the seasonality of the dataset. Seasonality is a characteristic of a time series in which the data experiences regular and predictable changes that recur every calendar year; any predictable fluctuation or pattern that repeats over a one-year period is said to be seasonal. We calculate the mean temperature for each month of each year and analyze its seasonality. In the plot below we can see a similar fluctuation recur many times.

from statsmodels.tsa.seasonal import seasonal_decompose

# Build a monthly mean temperature series labelled 'YYYY-M'
year_monthly_mean = df.loc[df['year'] == 2014].groupby('month')['tavg'].mean()
year_monthly_mean.index = ['2014-' + str(i + 1) for i in range(12)]

for year in range(2015, 2023):
    rowOfyear = df.loc[df['year'] == year]
    monthly_mean = rowOfyear.groupby('month')['tavg'].mean()
    monthly_mean.index = [str(year) + '-' + str(i + 1) for i in range(12)]
    year_monthly_mean = pd.concat([year_monthly_mean, monthly_mean])

# Decompose the time series into seasonal, trend, and residual components using the seasonal_decompose() function from statsmodels
decomposition = seasonal_decompose(year_monthly_mean, model='additive', period=12)

# Plot the seasonal component of the time series using matplotlib
plt.plot(decomposition.seasonal)
plt.xlabel('Year')
plt.ylabel('Temperature (Celsius)')
plt.title('Seasonal Plot of Temperature')
plt.show()

There are several other points to consider when visualizing the overall trend.

  • Time period: consider the time period covered by the data and choose an appropriate scale for the x-axis. Depending on the period and frequency of the data, you may want to show it by year, month, day, hour, or some other time unit (see the resampling sketch after this list).

  • Annotations: consider whether to include annotations such as labels, titles, or captions to help explain the data and highlight important features or events.

  • Data points: consider whether to include individual data points, or only the overall trend or summary statistics. Showing individual points can help reveal patterns and outliers, but it can also make the graph more cluttered and harder to read.
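
Switching between these time units is mostly a matter of resampling. Below is a minimal sketch of this idea, assuming the same CSV URL used earlier and a DatetimeIndex; the variable names (df_ts, monthly_mean, yearly_mean) are just for illustration.

import pandas as pd

# Load the data with 'time' parsed as a DatetimeIndex
url = 'https://raw.githubusercontent.com/rrRIOoo/data_cache/main/boston_weather_data.csv'
df_ts = pd.read_csv(url, parse_dates=['time'], index_col='time')

# Aggregate the same series at different time scales
monthly_mean = df_ts['tavg'].resample('M').mean()  # one value per month
yearly_mean = df_ts['tavg'].resample('Y').mean()   # one value per year

print(monthly_mean.head())
print(yearly_mean.head())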

Detect Missing Values and Impute

In the real world, datasets are often imperfect, and many of them have missing data. Missing data can occur for various reasons, such as incomplete data collection, data entry errors, or data loss during transmission. Sometimes you can simply delete rows containing missing data, but in other cases the missing data cannot be easily ignored. So here is another technique: data imputation.

Data imputation is the process of filling in missing or incomplete data values in a dataset. Imputing missing data allows for a more complete analysis of the dataset and can improve the accuracy of statistical models.

There are several methods for data imputation, including:

  • Mean imputation: Replacing missing values with the mean of the observed values for that variable.
  • Last observation carried forward (LOCF): Taking the last observed value for a variable and carrying it forward for all subsequent time points where the data is missing in a longitudinal or time series dataset.
  • Linear interpolation: Drawing a straight line between the two closest observed values and estimating the missing value based on where it falls on the line.
  • Seasonal decomposition imputation: Separating time series data into its seasonal, trend, and residual components, imputing the missing values in each component, and then combining the imputed components back together.
  • Time series forecasting imputation: Using various time series forecasting techniques such as ARIMA, exponential smoothing, or machine learning algorithms to predict the missing values based on the patterns observed in the historical data.
  • Multiple imputation: Generating several imputed datasets and combining the results to provide a more accurate estimate of the missing data.

In this dataset, the column ‘tavg’ has almost no missing values, so we use the column ‘wdir’ to practice each imputation technique.

print('Original dataset missing values:', df['wdir'].isnull().sum().sum())
Original dataset missing values: 439
df = df_original.copy(deep=False)
# Fill missing wind-direction values with the mean of the 'wdir' column
df_mean_imputed = df['wdir'].fillna(df['wdir'].mean())
print('Mean imputed dataset missing values:', df_mean_imputed.isnull().sum().sum())
Mean imputed dataset missing values: 0
df = df_original.copy(deep=False)
# Fill gaps by linear interpolation between neighbouring observations
df_linear_interpolated = df['wdir'].interpolate()
print('Linear interpolated dataset missing values:', df_linear_interpolated.isnull().sum().sum())
Linear interpolated dataset missing values: 0
df = df_original.copy(deep=False)
# Carry the last observed value forward (LOCF)
df_locf_imputed = df['wdir'].fillna(method='ffill')
print('LOCF imputed dataset missing values:', df_locf_imputed.isnull().sum().sum())
LOCF imputed dataset missing values: 0
import statsmodels.api as sm
df_copy = df_original.copy(deep=False)

df_copy['time'] = pd.to_datetime(df_copy['time'])
# Set the date column as the index
df_copy.set_index('time', inplace=True)

# Replace non-finite wind-direction values (column 4) with 0 before decomposing
for i in range(len(df_copy['wdir'])-1):
    if not np.isfinite(df_copy.iat[i,4]):
        df_copy.iat[i,4] = 0

# Perform seasonal decomposition
decomposition = sm.tsa.seasonal_decompose(df_copy['wdir'], model='additive')

# Impute missing values in the trend and seasonal components using linear interpolation
trend = decomposition.trend.copy(deep=False)
seasonal = decomposition.seasonal.copy(deep=False)
trend.interpolate(inplace=True)
seasonal.interpolate(inplace=True)


# Reconstruct the time series from the imputed components
imputed_df = trend + seasonal + decomposition.resid

# Check the number of missing values in the original and imputed datasets
print('Imputed dataset missing values:', imputed_df.isna().sum().sum())
Imputed dataset missing values: 6
df_copy = df_original.copy(deep=False)

df_copy['time'] = pd.to_datetime(df_copy['time'])
# Set the date column as the index
df_copy.set_index('time', inplace=True)

for i in range(len(df_copy['wdir'])-1):
    if not np.isfinite(df_copy.iat[i,4]):
        df_copy.iat[i,4] = 0

# Create a copy of the original dataset for imputation
imputed_df = df_copy.copy(deep=False)

# Set the number of imputations to perform
num_imputations = 10

# Perform multiple seasonal decompositions and impute missing values
for i in range(num_imputations):
    # Perform seasonal decomposition
    decomposition = sm.tsa.seasonal_decompose(imputed_df['wdir'], model='additive')

    # Impute missing values in the trend and seasonal components using linear interpolation
    trend = decomposition.trend.copy(deep=False)
    seasonal = decomposition.seasonal.copy(deep=False)
    trend.interpolate(inplace=True)
    seasonal.interpolate(inplace=True)

    # Reconstruct the time series from the imputed components
    reconstructed_df = trend + seasonal + decomposition.resid
    # Replace missing values in the imputed dataset with values from the reconstructed dataset
    imputed_df.update(reconstructed_df, overwrite=False)

# Check the number of missing values in the original and imputed datasets
print('Imputed dataset missing values:', imputed_df.isnull().sum().sum())
Imputed dataset missing values: 162
from statsmodels.tsa.arima.model import ARIMA

df_copy = df_original.copy(deep=False)

df_copy['time'] = pd.to_datetime(df_copy['time'])
# Set the date column as the index
df_copy.set_index('time', inplace=True)

# Split the data into training and testing sets
train_size = int(len(df_copy) * 0.8)
train_data, test_data = df_copy[:train_size], df_copy[train_size:]

# Fit an ARIMA model to the training data
model = ARIMA(train_data['wdir'], order=(1, 1, 1))
fitted_model = model.fit()

# Make predictions on the testing data
predictions = fitted_model.forecast(len(test_data))

# Calculate the mean absolute error (MAE) of the predictions
mae = np.mean(np.abs(predictions - test_data['wdir'].values))

# Plot the original and predicted data
plt.plot(df_copy.index, df_copy['wdir'].values, label='Original')
plt.plot(test_data['wdir'].index, predictions, label='Predicted')
plt.legend()
plt.show()

# Print the MAE of the predictions
print('MAE:', mae)

Train-Test Splitting

Before training our model, an important question is: on what?
We do have a dataset now, but should we use all of it for training? Of course not. We should set some data aside for testing after training the model. Then how do we segment the whole dataset? Here we need a train-test split.
Train-test splitting is a technique used in supervised machine learning to evaluate the performance of a predictive model. It involves dividing the available data into two separate sets: a training set and a test set. The training set is used to train the model, while the test set is used to evaluate the model’s performance on unseen data. The model is trained on the training set, and the resulting model is then used to make predictions on the test set. The performance is then evaluated using metrics such as accuracy, precision, recall, or F1 score for classification, or error metrics such as MSE and MAE for regression and forecasting. The purpose of this technique is to estimate how well the model will generalize to new data that was not used in training, and to prevent overfitting, which is when the model performs well on the training data but poorly on new data.

There are several methods for train-test splitting.

  • Fixed split: involves setting a fixed proportion, such as 70/30 or 80/20, for dividing the data into training and test sets.
    Pros: it is straightforward and easy to implement.
    Cons: it may not be appropriate for small datasets, where the test set may be too small to provide a reliable estimate of the model’s performance.
from sklearn.model_selection import train_test_split

df = df_original.copy(deep=False)

# Convert the date column to a datetime object
df['time'] = pd.to_datetime(df['time'])

# Set the date column as the index
df.set_index('time', inplace=True)

# Split the data into training and testing sets using a fixed split
train_data, test_data = train_test_split(df, test_size=0.2, shuffle=False)

# Print the shapes of the training and testing sets
print('Training set shape:', train_data.shape)
print('Testing set shape:', test_data.shape)
Training set shape: (2922, 7)
Testing set shape: (731, 7)
  • Rolling window: involves using a sliding window approach to split the data into training and test sets. The window is moved forward in time by a fixed number of observations; at each step, the observations within the window are used to train the model, and the observations after the window are used to test it.
    Pros: it allows the model to be trained on a larger number of observations, which can improve the accuracy of the model.
    Cons: it requires more computational resources as the model needs to be trained multiple times on different subsets of the data.
from sklearn.model_selection import TimeSeriesSplit

df = df_original.copy(deep=False)

# Convert the date column to a datetime object
df['time'] = pd.to_datetime(df['time'])

# Set the date column as the index
df.set_index('time', inplace=True)

# Set the size of the rolling window
window_size = 500

# Initialize a time series split with the specified window size
tscv = TimeSeriesSplit(max_train_size=window_size)

# Iterate through the splits and perform the rolling window
for train_index, test_index in tscv.split(df):
    train_data, test_data = df.iloc[train_index], df.iloc[test_index]

    # Print the shapes of the training and testing sets for each split
    print('Training set shape:', train_data.shape)
    print('Testing set shape:', test_data.shape)
Training set shape: (500, 7)
Testing set shape: (608, 7)
Training set shape: (500, 7)
Testing set shape: (608, 7)
Training set shape: (500, 7)
Testing set shape: (608, 7)
Training set shape: (500, 7)
Testing set shape: (608, 7)
Training set shape: (500, 7)
Testing set shape: (608, 7)
  • Multiple window: similar to rolling window but uses multiple windows to train and test, and aggregates the results.
    Pros: it allows for a more comprehensive evaluation of the model’s performance across different time periods, improving accuracy.
    Cons: it requires more computational resources.
df = df_original.copy(deep=False)

# Convert the date column to a datetime object
df['time'] = pd.to_datetime(df['time'])

# Set the date column as the index
df.set_index('time', inplace=True)

# Set the sizes of the multiple windows
window_sizes = [300, 500]

# Initialize a time series split for each window size
tscv_splits = []
for window_size in window_sizes:
    tscv = TimeSeriesSplit(max_train_size=window_size)
    tscv_splits.append(tscv)

# Iterate through the splits for each window size and perform the multiple window evaluation
for i, tscv in enumerate(tscv_splits):
    print('Results for window size:', window_sizes[i])
    for train_index, test_index in tscv.split(df):
        train_data, test_data = df.iloc[train_index], df.iloc[test_index]

        # Print the shapes of the training and testing sets for each split
        print('Training set shape:', train_data.shape)
        print('Testing set shape:', test_data.shape)

        # Do additional processing for each split as needed
        # For example, calculate and print the mean of the training and testing sets
        print('Training set mean:', np.mean(train_data['tavg']))
        print('Testing set mean:', np.mean(test_data['tavg']))
Results for window size: 300
Training set shape: (300, 7)
Testing set shape: (608, 7)
Training set mean: 12.075
Testing set mean: 9.55296052631579
Training set shape: (300, 7)
Testing set shape: (608, 7)
Training set mean: 9.718666666666667
Testing set mean: 11.345559210526316
Training set shape: (300, 7)
Testing set shape: (608, 7)
Training set mean: 12.111
Testing set mean: 13.497203947368419
Training set shape: (300, 7)
Testing set shape: (608, 7)
Training set mean: 13.403000000000002
Testing set mean: 10.29325657894737
Training set shape: (300, 7)
Testing set shape: (608, 7)
Training set mean: 9.549000000000001
Testing set mean: 12.152467105263158
Results for window size: 500
Training set shape: (500, 7)
Testing set shape: (608, 7)
Training set mean: 12.8088
Testing set mean: 9.55296052631579
Training set shape: (500, 7)
Testing set shape: (608, 7)
Training set mean: 11.5426
Testing set mean: 11.345559210526316
Training set shape: (500, 7)
Testing set shape: (608, 7)
Training set mean: 9.216
Testing set mean: 13.497203947368419
Training set shape: (500, 7)
Testing set shape: (608, 7)
Training set mean: 14.1566
Testing set mean: 10.29325657894737
Training set shape: (500, 7)
Testing set shape: (608, 7)
Training set mean: 11.697799999999999
Testing set mean: 12.152467105263158
  • Time series cross-validation: involves dividing the time-series data into multiple consecutive segments or folds and training the model on each segment while testing it on the subsequent segment.
    Pros: it provides more accurate performance estimates and avoids overfitting.
    Cons: it is computationally expensive and sensitive to hyperparameters.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
df = df_original.copy(deep=False)
df = df.interpolate()
# Convert the date column to a datetime object
df['time'] = pd.to_datetime(df['time'])

# Set the date column as the index
df.set_index('time', inplace=True)

# Initialize a time series split with the number of splits equal to 5
tscv = TimeSeriesSplit(n_splits=5)

# Initialize a linear regression model
model = LinearRegression()

# Iterate through the splits and perform time series cross-validation
for train_index, test_index in tscv.split(df):
    # Extract the training and testing sets
    train_data, test_data = df.iloc[train_index], df.iloc[test_index]

    # Separate the features (X) and target (y) variables
    X_train, y_train = train_data.drop(columns=['tavg']), train_data['tavg']
    X_test, y_test = test_data.drop(columns=['tavg']), test_data['tavg']

    # Fit the model on the training data
    model.fit(X_train, y_train)

    # Make predictions on the testing data
    y_pred = model.predict(X_test)

    # Calculate the mean squared error of the predictions
    mse = mean_squared_error(y_test, y_pred)

    # Print the mean squared error for each split
    print('Mean squared error:', mse)
Mean squared error: 1.4917696346816534
Mean squared error: 1.2719922777782877
Mean squared error: 1.0331032833383902
Mean squared error: 1.1509144571883518
Mean squared error: 1.340623652465109

Calculate Time Series Statistics

There are some important statistics useful in time series, and we will introduce three in this article.

Covariance

In probability theory and statistics, covariance is a measure of the joint variability of two random variables. It describes how two variables vary together, or in other words, how one variable changes when the other changes.
The formula for calculating the covariance between two variables X and Y is Cov(X, Y) = E[(X − E[X])(Y − E[Y])], where E denotes the expected value.

Specifically, covariance measures the extent to which two variables are linearly related. A positive covariance indicates that the two variables tend to increase or decrease together, while a negative covariance indicates that they tend to move in opposite directions. A covariance of zero indicates that there is no linear relationship between the two variables.

import numpy as np
import matplotlib.pyplot as plt

# Set the mean and standard deviation of the distributions
mean = [0, 0]
std_dev = [1, 1]

# Set the desired correlation between the distributions
correlation = 0.8

# Calculate the covariance matrix based on the correlation and standard deviations
covariance = [[std_dev[0]**2, correlation*std_dev[0]*std_dev[1]],
[correlation*std_dev[0]*std_dev[1], std_dev[1]**2]]

# Generate random data with the specified mean, standard deviation, and covariance
data = np.random.multivariate_normal(mean=mean, cov=covariance, size=100)

# Separate the data into two arrays representing two variables
x = data[:, 0]
y = data[:, 1]

# Calculate the correlation coefficient between x and y
correlation_coef = np.corrcoef(x, y)[0, 1]

# Create a scatter plot of the data points
fig, ax = plt.subplots()
ax.scatter(x, y)

# Set the title and labels for the plot
ax.set_title("Scatter plot with correlation coefficient")
ax.set_xlabel("X")
ax.set_ylabel("Y")

# Add a text box to the plot to display the correlation coefficient
textstr = f"Correlation coefficient: {correlation_coef:.2f}"
props = dict(boxstyle='round', facecolor='wheat', alpha=0.5)
ax.text(0.05, 0.95, textstr, transform=ax.transAxes, fontsize=12, verticalalignment='top', bbox=props)

# Show the plot
plt.show()

What we did above:

  1. Set the mean and standard deviation of the distributions
  2. Set the desired correlation between them
  3. Calculate the covariance matrix from the formula 𝐶𝑜𝑣(𝑋,𝑌) = 𝐶𝑜𝑟𝑟(𝑋,𝑌) × σ(𝑋) × σ(𝑌), where Cov is the covariance, Corr is the correlation coefficient, and σ is the standard deviation
  4. Generate random data with the specified mean, standard deviation, and covariance
  5. Split the generated data into X and Y
  6. Calculate the correlation coefficient between X and Y
  7. Plot the data points

How does the data distribution change when the covariance changes? Let’s try.

# Set the desired correlation between the distributions
correlation = 0.9

# Calculate the covariance matrix based on the correlation and standard deviations
covariance = [[std_dev[0]**2, correlation*std_dev[0]*std_dev[1]],
[correlation*std_dev[0]*std_dev[1], std_dev[1]**2]]

# Generate random data with the specified mean, standard deviation, and covariance
data = np.random.multivariate_normal(mean=mean, cov=covariance, size=100)

# Separate the data into two arrays representing two variables
x = data[:, 0]
y = data[:, 1]

# Calculate the correlation coefficient between x and y
correlation_coef = np.corrcoef(x, y)[0, 1]

# Create a scatter plot of the data points
fig, ax = plt.subplots()
ax.scatter(x, y)

# Set the title and labels for the plot
ax.set_title("Scatter plot with correlation coefficient")
ax.set_xlabel("X")
ax.set_ylabel("Y")

# Add a text box to the plot to display the correlation coefficient
textstr = f"Correlation coefficient: {correlation_coef:.2f}"
props = dict(boxstyle='round', facecolor='wheat', alpha=0.5)
ax.text(0.05, 0.95, textstr, transform=ax.transAxes, fontsize=12, verticalalignment='top', bbox=props)

# Show the plot
plt.show()

As we can see, the distribution of X and Y looks more linear as the correlation coefficient increases (and the covariance thus increases).

# Set the desired correlation between the distributions
correlation = 0.95

# Calculate the covariance matrix based on the correlation and standard deviations
covariance = [[std_dev[0]**2, correlation*std_dev[0]*std_dev[1]],
[correlation*std_dev[0]*std_dev[1], std_dev[1]**2]]

# Generate random data with the specified mean, standard deviation, and covariance
data = np.random.multivariate_normal(mean=mean, cov=covariance, size=100)

# Separate the data into two arrays representing two variables
x = data[:, 0]
y = data[:, 1]

# Calculate the correlation coefficient between x and y
correlation_coef = np.corrcoef(x, y)[0, 1]

# Create a scatter plot of the data points
fig, ax = plt.subplots()
ax.scatter(x, y)

# Set the title and labels for the plot
ax.set_title("Scatter plot with correlation coefficient")
ax.set_xlabel("X")
ax.set_ylabel("Y")

# Add a text box to the plot to display the correlation coefficient
textstr = f"Correlation coefficient: {correlation_coef:.2f}"
props = dict(boxstyle='round', facecolor='wheat', alpha=0.5)
ax.text(0.05, 0.95, textstr, transform=ax.transAxes, fontsize=12, verticalalignment='top', bbox=props)

# Show the plot
plt.show()

If we set the correlation coefficient negative (so the covariance is negative), we can see that X and Y are negatively linearly related.

# Set the desired correlation between the distributions
correlation = -0.95

# Calculate the covariance matrix based on the correlation and standard deviations
covariance = [[std_dev[0]**2, correlation*std_dev[0]*std_dev[1]],
[correlation*std_dev[0]*std_dev[1], std_dev[1]**2]]

# Generate random data with the specified mean, standard deviation, and covariance
data = np.random.multivariate_normal(mean=mean, cov=covariance, size=100)

# Separate the data into two arrays representing two variables
x = data[:, 0]
y = data[:, 1]

# Calculate the correlation coefficient between x and y
correlation_coef = np.corrcoef(x, y)[0, 1]

# Create a scatter plot of the data points
fig, ax = plt.subplots()
ax.scatter(x, y)

# Set the title and labels for the plot
ax.set_title("Scatter plot with correlation coefficient")
ax.set_xlabel("X")
ax.set_ylabel("Y")

# Add a text box to the plot to display the correlation coefficient
textstr = f"Correlation coefficient: {correlation_coef:.2f}"
props = dict(boxstyle='round', facecolor='wheat', alpha=0.5)
ax.text(0.05, 0.95, textstr, transform=ax.transAxes, fontsize=12, verticalalignment='top', bbox=props)

# Show the plot
plt.show()

Autocovariance

Autocovariance is a measure of how much two values in a time series are related to each other at different time lags. More specifically, the autocovariance function describes the covariance between a time series and a lagged version of itself. Given a time series X of length n, the autocovariance function is defined as γ(k) = Cov(X_t, X_(t+k)),

where Cov is the covariance function, t is a time index, k is the time lag, and γ(k) is the autocovariance at lag k.
Autocovariance is a useful tool in time series analysis, as it can be used to detect patterns in the data such as seasonality or trend. Additionally, the autocovariance function can be used to estimate other statistical properties of the time series, such as the autocorrelation function and the power spectrum.

import numpy as np
import matplotlib.pyplot as plt

df = df_original.copy(deep=False)

df.fillna(0,inplace=True)

# Calculate the autocovariance function
lags = range(1, 51)
autocov = [np.cov(df['tavg'][:-k], df['tavg'][k:])[0, 1] for k in lags]

# Plot the autocovariance function
fig, ax = plt.subplots()
ax.plot(lags, autocov, marker='o')

# Set the title and labels for the plot
ax.set_title("Autocovariance function")
ax.set_xlabel("Lag")
ax.set_ylabel("Covariance")

# Show the plot
plt.show()

Here we compute the autocovariance of the average temperature series and plot it for lags k from 1 to 50.

First Intuitions on (Weak) Stationarity

A concept related to autocovariance in time series is weak stationarity.

Weak stationarity is a key assumption in time series analysis. It means that the mean, variance, and autocovariance of the time series do not change over time. In other words, the statistical properties of the time series are constant over time.

Autocovariance measures the linear dependence between two observations in a time series at different time lags. Weak stationarity, on the other hand, is a property of a time series that implies its mean, variance, and autocovariance are constant over time.

If a time series is weakly stationary, its autocovariance is only dependent on the lag between two observations, and not on the actual time at which they were observed. In other words, the autocovariance function of a weakly stationary time series is time-invariant. This property makes it easier to model and analyze the time series using techniques such as Autoregressive Integrated Moving Average (ARIMA) models, which we will introduce more later.
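
A quick, informal way to build this intuition is to plot rolling statistics: if the rolling mean and rolling standard deviation stay roughly flat over time, the series is at least plausibly weakly stationary. Below is a minimal sketch using the same dataset; the one-year window is just an illustrative choice.

import pandas as pd
import matplotlib.pyplot as plt

df = df_original.copy(deep=False)
df['time'] = pd.to_datetime(df['time'])
df.set_index('time', inplace=True)

# Rolling mean and standard deviation over a one-year window
rolling_mean = df['tavg'].rolling(window=365).mean()
rolling_std = df['tavg'].rolling(window=365).std()

plt.plot(df['tavg'], alpha=0.4, label='Daily average temperature')
plt.plot(rolling_mean, label='Rolling mean (365 days)')
plt.plot(rolling_std, label='Rolling std (365 days)')
plt.xlabel('Date')
plt.ylabel('Temperature (Celsius)')
plt.title('Rolling statistics of average temperature')
plt.legend()
plt.show()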

Autocorrelation

Autocorrelation is a measure of how correlated a time series is with a lagged version of itself at different time lags. In other words, it measures the degree of similarity between observations of a time series separated by a particular time lag. The autocorrelation function (ACF) is a mathematical tool used to measure autocorrelation. It is defined as the correlation between the time series and a lagged version of itself, at different lags.
Given a time series X of length n, the autocorrelation function is defined as ρ(k) = Corr(X_t, X_(t+k)) = γ(k) / γ(0),

where Corr is the correlation function, t is a time index, k is the time lag, and ρ(k) is the autocorrelation at lag k.

The autocorrelation function can be plotted as a graph to visualize the correlation between the time series and its lagged versions at different lags. The ACF is commonly used in time series analysis to determine the presence of patterns, such as seasonality, and to identify appropriate models for forecasting future values of the time series.

Autocorrelation values range from -1 to 1, where a value of 1 indicates a perfect positive correlation, -1 indicates a perfect negative correlation, and 0 indicates no correlation between the time series and its lagged versions.

We can plot the autocorrelation for different lags k in a similar way.

import numpy as np
import matplotlib.pyplot as plt

df = df_original.copy(deep=False)

df.fillna(0,inplace=True)
# Calculate the autocovariance coefficients
lags = range(1, 51)
autocov = [np.cov(df['tavg'][:-k], df['tavg'][k:])[0, 1] for k in lags]
variance = np.var(df['tavg'])
autocorrelation = [covariance / variance for covariance in autocov]

# Plot the autocorrelation function
fig, ax = plt.subplots()
ax.plot(lags, autocorrelation, marker='o')

# Set the title and labels for the plot
ax.set_title("Autocorrelation function")
ax.set_xlabel("Lag")
ax.set_ylabel("Autocorrelation coefficient")

# Show the plot
plt.show()

After introducing some important concepts, we will show several different models and apply them on our dataset.

Moving Average Model

OK, let’s take a look at the first model, the moving average model.
The Moving Average Model is a popular time series model used for predicting future values of a time series based on its past values and a random error term. The model assumes that the current value of the time series is a function of the average of its past values, plus a random error term.
A Moving Average Model of order q, denoted as MA(q), is defined as Xt = μ + εt + θ1·ε(t−1) + θ2·ε(t−2) + … + θq·ε(t−q),

where

  • Xt is the current value of the time series
  • μ is the mean of the time series
  • εt is a random error term at time t
  • θ1, θ2, …, θq are parameters that represent the coefficients of the error terms at lags 1 to q.
  • q is the order of the moving average model.

The Moving Average Model assumes that the random error terms εt are uncorrelated, with mean 0 and constant variance σ². It is used to forecast future values of the time series by estimating the values of the parameters θ1, θ2, …, θq using statistical methods, and then using these estimates to predict future values of the time series.
Usually the Moving Average Model is not used alone, but in combination with other time series models, such as the autoregressive (AR) model, to improve the accuracy of time series forecasts.

MA(1)

MA(1) stands for Moving Average with order 1.

import numpy as np
import matplotlib.pyplot as plt
df = df_original.copy(deep=False)

df.fillna(0,inplace=True)
# Set the parameters of the time series
n = len(df)
mu = np.mean(list(df['tavg']))
sigma = 1
theta = 0.5

# Generate a random normal time series
epsilon = np.random.normal(mu, sigma, n)

# Generate the time series for a MA(1) process
ts = np.zeros(n)
ts[0] = mu + epsilon[0]
for i in range(1, n):
    ts[i] = mu + epsilon[i] + theta*epsilon[i-1]

# Calculate the autocorrelation coefficients
lags = range(1, 21)
autocorr = [np.corrcoef(ts[:-k], ts[k:])[0, 1] for k in lags]

# Plot the correlogram
fig, ax = plt.subplots()
ax.stem(lags, autocorr, use_line_collection=True)

# Set the title and labels for the plot
ax.set_title("Correlogram for MA(1) process")
ax.set_xlabel("Lag")
ax.set_ylabel("Autocorrelation coefficient")

# Show the plot
plt.show()

Correlogram

This plot is correlogram. Correlogram, also known as autocorrelation plot or ACF plot, is a graphical representation of the autocorrelation function (ACF) of a time series. It is used to visualize the correlation between a time series and its lagged values at different time lags. The correlogram displays the autocorrelation coefficients of a time series at different lags on the y-axis, with the lag values themselves plotted on the x-axis.

Interpretation of a correlogram involves looking for patterns in the autocorrelation values. A significant autocorrelation value at a particular lag indicates that the time series is correlated with its past values at that lag. For example, if there is a significant positive autocorrelation value at lag 1, this suggests that the time series is positively correlated with its value at the previous time step. Similarly, if there is a significant negative autocorrelation value at lag 2, this suggests that the time series is negatively correlated with its value two time steps ago.

The shape of the correlogram can also provide information about the nature of the time series. For example, a periodic or seasonal time series may exhibit a repeating pattern in the autocorrelation values at multiples of the seasonal lag. A random or non-seasonal time series may show no significant autocorrelation values at any lag.

The correlogram is a useful tool in time series analysis for identifying patterns and selecting appropriate models for forecasting future values of the time series.
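
In practice we rarely compute a correlogram by hand; statsmodels provides plot_acf, which also draws a confidence band, so “significant” autocorrelations (bars outside the band) are easy to spot. A minimal sketch, assuming the simulated MA(1) series ts from the code above:

from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt

# ACF of the simulated MA(1) series with a confidence band;
# bars outside the shaded band differ significantly from zero
plot_acf(ts, lags=20)
plt.show()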

MA(n)

MA(2) stands for Moving Average with order 2, in which we take more error terms into consideration.

import numpy as np
import matplotlib.pyplot as plt
df = df_original.copy(deep=False)

df.fillna(0,inplace=True)
# Set the parameters of the time series
n = len(df)
mu = np.mean(list(df['tavg']))
sigma = 1
theta1 = 0.5
theta2 = 0.3

# Generate a random normal time series
epsilon = np.random.normal(mu, sigma, n)

# Generate the time series for a MA(2) process
ts = np.zeros(n)
ts[0] = mu + epsilon[0]
ts[1] = mu + epsilon[1] + theta1*epsilon[0]
for i in range(2, n):
    ts[i] = mu + epsilon[i] + theta1*epsilon[i-1] + theta2*epsilon[i-2]

# Calculate the autocorrelation coefficients
lags = range(1, 21)
autocorr = [np.corrcoef(ts[:-k], ts[k:])[0, 1] for k in lags]

# Plot the correlogram
fig, ax = plt.subplots()
ax.stem(lags, autocorr, use_line_collection=True)

# Set the title and labels for the plot
ax.set_title("Correlogram for MA(2) process")
ax.set_xlabel("Lag")
ax.set_ylabel("Autocorrelation coefficient")

# Show the plot
plt.show()

Similarly we can get MA(3) and even MA(n) as well.

import numpy as np
import matplotlib.pyplot as plt
df = df_original.copy(deep=False)

df.fillna(0,inplace=True)
# Set the parameters of the time series
n = len(df)
mu = np.mean(list(df['tavg']))
sigma = 1
theta1 = 0.3
theta2 = 0.2
theta3 = -0.1

# Generate a random normal time series
epsilon = np.random.normal(mu, sigma, n)

# Generate the time series for a MA(3) process
ts = np.zeros(n)
ts[0] = mu + epsilon[0]
ts[1] = mu + epsilon[1] + theta1*epsilon[0]
ts[2] = mu + epsilon[2] + theta1*epsilon[1] + theta2*epsilon[0]
for i in range(3, n):
    ts[i] = mu + epsilon[i] + theta1*epsilon[i-1] + theta2*epsilon[i-2] + theta3*epsilon[i-3]

# Calculate the autocorrelation coefficients
lags = range(1, 21)
autocorr = [np.corrcoef(ts[:-k], ts[k:])[0, 1] for k in lags]

# Plot the correlogram
fig, ax = plt.subplots()
ax.stem(lags, autocorr, use_line_collection=True)

# Set the title and labels for the plot
ax.set_title("Correlogram for MA(3) process")
ax.set_xlabel("Lag")
ax.set_ylabel("Autocorrelation coefficient")

# Show the plot
plt.show()

ARMA Model

We have mentioned that MA is often used together with the autoregressive model (AR), so now we introduce it further.

Autoregressive Model

Autoregressive Models (AR) assume that the current value of a variable is a linear combination of its past values, weighted by coefficients called autoregressive parameters. Mathematically, an AR(p) model is expressed as y_t = c + φ_1·y_(t−1) + φ_2·y_(t−2) + … + φ_p·y_(t−p) + ε_t,

where y_t is the value of the variable at time t, c is a constant, φ_1, φ_2, …, φ_p are the autoregressive parameters, ε_t is a white noise error term with zero mean and constant variance, and p is the order of the model.
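
By analogy with the MA simulations above, here is a minimal sketch of simulating an AR(2) process and inspecting its correlogram. The coefficients 0.6 and 0.2 are arbitrary illustrative choices (picked so the process is stationary), not values fitted to the weather data.

import numpy as np
import matplotlib.pyplot as plt

# Simulate an AR(2) process: y_t = c + phi1*y_(t-1) + phi2*y_(t-2) + eps_t
n = 1000
c = 0.0
phi1, phi2 = 0.6, 0.2
epsilon = np.random.normal(0, 1, n)

ts_ar = np.zeros(n)
for i in range(2, n):
    ts_ar[i] = c + phi1*ts_ar[i-1] + phi2*ts_ar[i-2] + epsilon[i]

# Calculate and plot the autocorrelation coefficients (correlogram)
lags = range(1, 21)
autocorr = [np.corrcoef(ts_ar[:-k], ts_ar[k:])[0, 1] for k in lags]

fig, ax = plt.subplots()
ax.stem(lags, autocorr)
ax.set_title("Correlogram for AR(2) process")
ax.set_xlabel("Lag")
ax.set_ylabel("Autocorrelation coefficient")
plt.show()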

Autoregressive Moving Average Model

Now we have AR and MA; if we add these two models together, we get the Autoregressive Moving Average Model (ARMA).
ARMA models are a class of statistical models that combine the concepts of autoregressive models and moving average models. They are widely used in time series analysis to model and forecast data that exhibit autocorrelation and seasonality.

ARMA models combine both autoregressive and moving average terms to capture the dynamics of the data. Mathematically, an ARMA(p, q) model is expressed as y_t = c + φ_1·y_(t−1) + … + φ_p·y_(t−p) + ε_t + θ_1·ε_(t−1) + … + θ_q·ε_(t−q),

where y_t is the value of the variable at time t, c is a constant, φ_1, φ_2, …, φ_p are the autoregressive parameters, ε_t is a white noise error term with zero mean and constant variance, θ_1, θ_2, …, θ_q are the moving average parameters, and p and q are the orders of the autoregressive and moving average terms, respectively.

from statsmodels.tsa.arima.model import ARIMA
from sklearn.model_selection import train_test_split

df = df_original.copy(deep=False)
df['tavg'].interpolate(inplace=True)
# Convert the date column to a datetime object
df['time'] = pd.to_datetime(df['time'])

# Set the date column as the index
df.set_index('time', inplace=True)

# Split the data into training and testing sets using a fixed split
train_data, test_data = train_test_split(df['tavg'], test_size=0.2, shuffle=False)

model = ARIMA(train_data, order=(2, 0, 2))
results = model.fit()

# Test the model using the testing data
predictions = results.forecast(steps=len(test_data))

# Plot the results
plt.plot(train_data, label='Training Data')
plt.plot(test_data, label='Testing Data')
plt.plot(predictions, label='Predictions')
plt.legend()
plt.show()
print(results.summary())
                               SARIMAX Results                                
==============================================================================
Dep. Variable: tavg No. Observations: 2922
Model: ARIMA(2, 0, 2) Log Likelihood -7534.699
Date: Fri, 31 Mar 2023 AIC 15081.397
Time: 15:21:49 BIC 15117.277
Sample: 03-01-2013 HQIC 15094.321
- 02-28-2021
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 10.2258 3.180 3.215 0.001 3.992 16.459
ar.L1 1.2618 0.034 36.598 0.000 1.194 1.329
ar.L2 -0.2657 0.034 -7.827 0.000 -0.332 -0.199
ma.L1 -0.3990 0.033 -11.940 0.000 -0.464 -0.333
ma.L2 -0.3894 0.020 -19.202 0.000 -0.429 -0.350
sigma2 10.1593 0.228 44.505 0.000 9.712 10.607
===================================================================================
Ljung-Box (L1) (Q): 0.01 Jarque-Bera (JB): 78.17
Prob(Q): 0.91 Prob(JB): 0.00
Heteroskedasticity (H): 1.01 Skew: -0.13
Prob(H) (two-sided): 0.84 Kurtosis: 3.76
===================================================================================

We can also see the components with the help of decomposition.

decomposition = seasonal_decompose(df['tavg'], model='additive')

# Plot the trend component
plt.subplot(411)
plt.plot(df['tavg'], label='Original')
plt.legend()
plt.subplot(412)
plt.plot(decomposition.trend, label='Trend')
plt.legend()

# Plot the seasonal component
plt.subplot(413)
plt.plot(decomposition.seasonal, label='Seasonality')
plt.legend()

# Plot the residual component
plt.subplot(414)
plt.plot(decomposition.resid, label='Residuals')
plt.legend()

plt.tight_layout()
plt.show()

ARIMA Model

ARIMA, the autoregressive integrated moving average model, is very close to ARMA. But before introducing ARIMA, I want to explain another concept: stationarity.

Stationarity

In time series analysis, stationarity refers to the statistical properties of a time series that remain constant over time. Specifically, a stationary time series has constant mean, variance, and autocorrelation structure across all time points.

A time series that is not stationary is said to exhibit trends, cycles, or seasonality, which can make it difficult to model and forecast accurately. For example, a time series that exhibits a trend will have a changing mean over time, while a time series that exhibits seasonality will have changing means and variances over time at specific intervals.

In time series analysis, there are two types of stationarity: weak stationarity and strong stationarity.

  • Weak stationarity requires the mean, variance, and autocovariance to be constant over time, but it allows for a time-varying autocorrelation structure. This means that the autocorrelation between two time points can depend on the distance between them, but the overall autocorrelation structure remains stationary over time.
  • Strong stationarity, on the other hand, requires that all statistical properties of a time series remain constant over time, including the joint distribution of any subset of time points. In other words, a strongly stationary time series has the same statistical properties at any point in time, and therefore the autocorrelation structure is fixed and does not depend on the distance between time points.

In practice, strong stationarity is difficult to achieve and is usually assumed only in certain contexts, such as in theoretical discussions. Weak stationarity, however, is a more practical assumption and is commonly used in time series analysis and forecasting.

So, in order to predict more accurately, we should try to make a non-stationary time series stationary. One common approach is to apply a mathematical transformation, such as differencing, to remove the trend or seasonality.
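
Before transforming anything, it helps to check whether the series is stationary in the first place. A common choice is the augmented Dickey-Fuller (ADF) test from statsmodels: a small p-value (say, below 0.05) lets us reject the null hypothesis of a unit root, suggesting the series is stationary. A minimal sketch, assuming the interpolated ‘tavg’ series used elsewhere in this article:

from statsmodels.tsa.stattools import adfuller

df = df_original.copy(deep=False)
series = df['tavg'].interpolate()

# Augmented Dickey-Fuller test: null hypothesis = the series has a unit root (non-stationary)
adf_stat, p_value, usedlag, nobs, critical_values, icbest = adfuller(series)

print('ADF statistic:', adf_stat)
print('p-value:', p_value)
print('Critical values:', critical_values)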

Differencing

What is differencing?
In general, differencing refers to computing the difference between consecutive observations in a time series.

Differencing has a parameter called order.

  • First-order differencing involves computing the difference between each observation and the previous observation.
  • Second-order differencing involves computing the difference between each first-order difference and its previous value.
  • Similarly, higher-order differencing involves calculations for subsequent differences.
Year |  Sales | Diff1 | Diff2
-----------------------------
2015 | 1000 | NaN | NaN
2016 | 1200 | 200 | NaN
2017 | 1500 | 300 | 100
2018 | 1800 | 300 | 0
2019 | 2000 | 200 | -100

As we can see, the trend is removed in the first-order differenced series, and the second-order differences reduce the remaining structure even further.
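
In pandas, differencing is a one-liner with diff(). The sketch below reproduces the small sales table above (the sales numbers are the illustrative values from the table, not real data) and applies the same idea to the temperature series.

import pandas as pd

# First- and second-order differencing of the toy sales series from the table above
sales = pd.Series([1000, 1200, 1500, 1800, 2000],
                  index=[2015, 2016, 2017, 2018, 2019], name='Sales')
diff1 = sales.diff()   # first-order differences
diff2 = diff1.diff()   # second-order differences
print(pd.DataFrame({'Sales': sales, 'Diff1': diff1, 'Diff2': diff2}))

# The same idea applied to the daily average temperature series
df = df_original.copy(deep=False)
tavg_diff1 = df['tavg'].interpolate().diff()
print(tavg_diff1.head())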

Autoregressive Integrated Moving Average Model

Now, adding differencing to ARMA, we get ARIMA: first difference the series to make it stationary, then proceed as with ARMA.
ARIMA (Autoregressive Integrated Moving Average) is a popular time series forecasting model that combines the ideas of autoregression (AR), differencing (I), and moving average (MA).
The order of an ARIMA model is typically denoted as (p, d, q), where:

  • p is the order of the autoregressive (AR) component, which represents the dependence of the current value on the past p values.
  • d is the order of the differencing component, which represents the number of times the time series needs to be differenced to become stationary.
  • q is the order of the moving average (MA) component, which represents the dependence of the current value on the past q forecast errors.
from statsmodels.tsa.arima.model import ARIMA
from sklearn.model_selection import train_test_split

df = df_original.copy(deep=False)
df['tavg'].interpolate(inplace=True)
# Convert the date column to a datetime object
df['time'] = pd.to_datetime(df['time'])

# Set the date column as the index
df.set_index('time', inplace=True)

# Split the data into training and testing sets using a fixed split
train_data, test_data = train_test_split(df['tavg'], test_size=0.2, shuffle=False)

model = ARIMA(train_data, order=(2, 2, 2))
results = model.fit()

# Test the model using the testing data
predictions = results.forecast(steps=len(test_data))

# Plot the results
plt.plot(train_data, label='Training Data')
plt.plot(test_data, label='Testing Data')
plt.plot(predictions, label='Predictions')
plt.legend()
plt.show()

print(results.summary())
                               SARIMAX Results                                
==============================================================================
Dep. Variable: tavg No. Observations: 2922
Model: ARIMA(2, 2, 2) Log Likelihood -7541.567
Date: Fri, 31 Mar 2023 AIC 15093.135
Time: 15:21:57 BIC 15123.032
Sample: 03-01-2013 HQIC 15103.903
- 02-28-2021
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 0.7319 0.018 40.452 0.000 0.696 0.767
ar.L2 -0.2366 0.017 -13.939 0.000 -0.270 -0.203
ma.L1 -1.9066 0.010 -193.754 0.000 -1.926 -1.887
ma.L2 0.9102 0.010 92.518 0.000 0.891 0.929
sigma2 10.2282 0.230 44.463 0.000 9.777 10.679
===================================================================================
Ljung-Box (L1) (Q): 1.92 Jarque-Bera (JB): 68.04
Prob(Q): 0.17 Prob(JB): 0.00
Heteroskedasticity (H): 1.01 Skew: 0.07
Prob(H) (two-sided): 0.90 Kurtosis: 3.73
===================================================================================
decomposition = seasonal_decompose(df['tavg'], model='additive')

# Plot the trend component
plt.subplot(411)
plt.plot(df['tavg'], label='Original')
plt.legend()
plt.subplot(412)
plt.plot(decomposition.trend, label='Trend')
plt.legend()

# Plot the seasonal component
plt.subplot(413)
plt.plot(decomposition.seasonal, label='Seasonality')
plt.legend()

# Plot the residual component
plt.subplot(414)
plt.plot(decomposition.resid, label='Residuals')
plt.legend()

plt.tight_layout()
plt.show()

fbProphet Model

As you can see, ARMA and ARIMA are fairly rough and do not perform very well here. So next we introduce a more powerful model, fbProphet.
fbProphet is a powerful open-source library for time series forecasting in Python. It was developed by Facebook’s Core Data Science team and is designed to be both scalable and easy to use. fbProphet is particularly useful for business applications where you want to forecast future values of a time series based on historical data.
fbProphet uses an additive model that decomposes a time series into its trend, seasonality, and holiday components.

  • The trend component models the overall direction of the time series. fbProphet uses a piecewise linear or logistic growth curve to model the trend. This allows the model to capture changes in trend over time, rather than assuming a fixed trend throughout the entire time series.
  • The seasonal component models recurring patterns such as daily, weekly, and yearly cycles. fbProphet can detect and model multiple seasonalities in the data, which can be useful in capturing complex patterns in the time series.
  • The holiday component models the effect of one-off events such as public holidays or major events that may impact the time series. fbProphet allows you to specify holidays and their impact on the time series in a customizable way, such as by providing a list of dates and their associated impact.

fbProphet uses a Bayesian framework that incorporates prior knowledge and uncertainty into the model. This allows the model to handle missing data and outliers in a robust way.

To fit the model to the data, fbProphet by default finds a maximum a posteriori (MAP) estimate of the parameters, and it can optionally sample from the posterior distribution with a Markov Chain Monte Carlo (MCMC) algorithm to better quantify uncertainty. The resulting estimates are used to make predictions.

fbProphet has many excellent features, including:

  • Automatic detection of seasonalities in the data
  • Customizable models for handling trend, seasonality, and holidays
  • Robust handling of missing data and outliers
  • Built-in cross-validation and hyperparameter tuning
  • Interpretable model diagnostics and visualizations
  • Support for forecasting time series with multiple seasonality
import pandas as pd
from prophet import Prophet

df = df_original.copy(deep=False)
df['tavg'].interpolate(inplace=True)
df = df.rename(columns={'time': 'ds', 'tavg': 'y'})

# Prophet is fit on the full series here and then asked to forecast one year ahead

model = Prophet()

# Fit the model to the data
model.fit(df)

# Make future predictions using the model
future = model.make_future_dataframe(periods=365)
predictions = model.predict(future)

# Plot the results
model.plot(predictions)
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()

It’s easy to see that fbProphet is much more accurate!

fig2 = model.plot_components(predictions)

Greykite

Greykite is a library designed for time series forecasting with a simple modeling interface, which makes data exploration and model tuning easy. Its main forecasting algorithm, Silverkite, makes it easy to tune parameters and to capture diverse time series characteristics, such as (potentially time-varying) trends, seasonality, repeated events, and holidays. The output of Silverkite is interpretable, allowing visualizations of trend, seasonality, and other effects, along with their statistical significance.

Greykite offers several key benefits, including flexibility, providing time series regressors for trend, seasonality, holidays, changepoints, and autoregression. It’s also intuitive with exploratory plots, templates for tuning, and explainable forecasts with clear assumptions. And it is fast, allowing for quick prototyping and deployment at scale.

This is the architecture of Silverkite, in which

  • green parallelogram: model input, including input time series and any known anomalies, events, regressors, or changepoint dates
  • orange oval: model output, including forecasts, prediction intervals, and diagnostics.
  • blue rectangle: computation steps of Silverkite, which can be decomposed into two phases:
    Phase (1): the conditional mean model.
    Phase (2): the volatility/error model

Phase (1) can be also decomposed into 4 steps:

  1. Raw features are extracted from timestamps, events data, and the history of the series.
  2. Features are transformed for appropriate basis functions.
  3. A changepoint detection algorithm is applied on data to discover changes in the trend and seasonality of series.
  4. An appropriate machine learning algorithm is applied for fitting the features from above.

In Phase (2), a simple conditional variance model is fitted to the residuals.

#!pip install greykite 
#pip command may cause many dependency problems because of python version,
#so you can install greykite locally on python 3.7.7 and run the code
import numpy as np
import pandas as pd
import plotly as plotly
from collections import defaultdict
import warnings

warnings.filterwarnings("ignore")
from greykite.common.data_loader import DataLoader
from greykite.framework.templates.autogen.forecast_config import ForecastConfig
from greykite.framework.templates.autogen.forecast_config import MetadataParam
from greykite.framework.templates.forecaster import Forecaster
from greykite.framework.templates.model_templates import ModelTemplateEnum
from greykite.framework.utils.result_summary import summarize_grid_search_results
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
# Load data
url='https://raw.githubusercontent.com/rrRIOoo/data_cache/main/boston_weather_data.csv'
df = pd.read_csv(url, encoding= 'unicode_escape')
df_original = df.copy(deep=False)
df = df_original.copy(deep=False)
df['tavg'].interpolate(inplace=True)

metadata = MetadataParam(
    time_col="time",   # name of the time column in df
    value_col="tavg",  # name of the value column in df
    freq="D"           # "H" for hourly, "D" for daily, "W" for weekly, etc.
                       # (any format accepted by `pandas.date_range`)
)

forecaster = Forecaster()  # creates forecasts and stores the result
result = forecaster.run_forecast_config(  # result is also stored as `forecaster.forecast_result`
    df=df,
    config=ForecastConfig(
        model_template=ModelTemplateEnum.SILVERKITE.name,
        forecast_horizon=365,  # forecast 365 steps ahead
        coverage=0.95,         # 95% prediction intervals
        metadata_param=metadata
    )
)
ts = result.timeseries
fig = ts.plot()
plotly.io.show(fig)
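Besides the input-series plot above, the result object also exposes the forecast itself and a backtest on a holdout portion of the history. A short sketch of how these are typically inspected, following the Greykite quickstart (attribute names assumed from Greykite’s documented API):

forecast = result.forecast        # fitted and future values with prediction intervals
fig = forecast.plot()             # actuals vs. forecast
plotly.io.show(fig)

backtest = result.backtest        # evaluation on a holdout split of the history
print(backtest.test_evaluation)   # error metrics such as MAPE/RMSE on the test window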

This model also works well! The output is relatively consistent with the actual situation.

Neural Network Model

A neural network model is a type of machine learning model that is inspired by the structure and function of the human brain. It consists of a large number of interconnected processing units, known as neurons, that work together to process information and make predictions. These neurons are organized into layers, with each layer processing the output from the previous layer to produce more abstract representations of the input data. The final layer produces the output of the model, which can be a classification or regression prediction.

Neural Network Model Structure

A Neural Network Model is composed of multiple layers of interconnected nodes called neurons. There are three main types of layers in a typical neural network:

  • Input layer: This layer takes in the initial input data and passes it on to the next layer.
  • Hidden layers: These layers are sandwiched between the input and output layers and are responsible for learning and extracting the features from the input data. Each hidden layer is composed of multiple neurons, and the output of one layer serves as the input for the next layer.
  • Output layer: This layer produces the final output of the model. The number of neurons in this layer depends on the type of problem being solved. For example, in a binary classification problem, there would be one neuron in the output layer, while in a multi-class classification problem, there would be multiple neurons in the output layer.

Neurons in each layer are connected to neurons in the next layer by weights, which determine the strength and direction of the connections. Each neuron computes a weighted sum of its inputs, applies a non-linear activation function to the sum, and then passes the result to the next layer. This process is repeated until the final output is produced.
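To make the “weighted sum plus activation” idea concrete, here is a tiny NumPy sketch of a single fully connected layer with two neurons (the numbers are made up for illustration):

import numpy as np

x = np.array([0.5, -1.2, 3.0])        # inputs coming from the previous layer
W = np.array([[0.2, -0.4, 0.1],       # one row of weights per neuron
              [0.7,  0.3, -0.5]])
b = np.array([0.1, -0.2])             # one bias per neuron

z = W @ x + b                         # weighted sums
a = np.maximum(0, z)                  # ReLU activation, passed on to the next layer
print(a)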

Neural Network Model Training

Training a neural network model typically involves the following steps:

  1. Data preparation: The data is split into training and validation sets. The input features are normalized to make them comparable and to ensure that the model trains faster.
  2. Model initialization: The model is initialized with random weights and biases.
  3. Forward pass: The training data is fed through the network, and the output is compared to the actual target values.
  4. Loss calculation: The difference between the predicted and actual values is computed using a loss function, such as mean squared error.
  5. Backpropagation: The error is backpropagated through the network to update the weights and biases using an optimization algorithm such as gradient descent. The optimization algorithm determines the direction and magnitude of weight and bias updates that minimize the loss function.
  6. Validation: The trained model is tested on a validation set to check its performance on data that it has not seen before.
  7. Hyperparameter tuning: The hyperparameters of the network, such as learning rate, batch size, and number of hidden layers, are adjusted to improve the model’s performance on the validation set.
  8. Prediction: Once the model is trained and validated, it can be used to make predictions on new data.

These steps are typically repeated for multiple epochs until the model converges to a minimum loss or the performance on the validation set stops improving.
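Step 1 mentions normalizing the input features. The LSTM example below skips normalization for brevity; a minimal sketch of that step with scikit-learn’s MinMaxScaler (an illustration assuming the train_data/test_data split created in the code below, not part of the article’s original code) would look like this:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Fit the scaler on the training data only, to avoid leaking test statistics
train_scaled = scaler.fit_transform(train_data.values.reshape(-1, 1))
test_scaled = scaler.transform(test_data.values.reshape(-1, 1))
# After predicting, map outputs back to the original scale with scaler.inverse_transform(y_pred)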

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.models import Sequential
from sklearn.model_selection import train_test_split

df = df_original.copy(deep=False)
df['tavg'].interpolate(inplace=True)
# Convert the date column to a datetime object
df['time'] = pd.to_datetime(df['time'])

# Set the date column as the index
df.set_index('time', inplace=True)

# Split the data into training and testing sets using a fixed split
train_data, test_data = train_test_split(df['tavg'], test_size=0.2, shuffle=False)

# Define window size and step for creating input-output pairs
window_size = 10
step = 1

# Create input-output pairs for training data
X_train, y_train = [], []
for i in range(0, len(train_data) - window_size, step):
    X_train.append(train_data.iloc[i:i+window_size].values)
    y_train.append(train_data.iloc[i+window_size])
X_train, y_train = np.array(X_train), np.array(y_train)

# Create input-output pairs for test data
X_test, y_test = [], []
for i in range(0, len(test_data) - window_size, step):
    X_test.append(test_data.iloc[i:i+window_size].values)
    y_test.append(test_data.iloc[i+window_size])
X_test, y_test = np.array(X_test), np.array(y_test)

# Reshape to (samples, timesteps, features) as expected by the LSTM layer
X_train = X_train.reshape(-1, window_size, 1)
X_test = X_test.reshape(-1, window_size, 1)

# Define the neural network model
model = Sequential([
    LSTM(64, input_shape=(window_size, 1)),
    Dense(1)
])

# Compile the model
model.compile(optimizer='adam', loss='mse')

# Train the model
model.fit(X_train, y_train, epochs=50, batch_size=32)

# Evaluate the model on test data
test_loss = model.evaluate(X_test, y_test)

# Make predictions on test data
y_pred = model.predict(X_test)

# Plot the actual and predicted values
import matplotlib.pyplot as plt
plt.plot(y_test, label='actual')
plt.plot(y_pred, label='predicted')
plt.legend()
plt.show()
Epoch 1/50
91/91 [==============================] - 5s 16ms/step - loss: 101.5297
Epoch 2/50
91/91 [==============================] - 1s 14ms/step - loss: 31.5259
Epoch 3/50
91/91 [==============================] - 1s 12ms/step - loss: 18.9718
...
Epoch 50/50
91/91 [==============================] - 1s 6ms/step - loss: 9.4500
23/23 [==============================] - 1s 4ms/step - loss: 11.3113
23/23 [==============================] - 1s 5ms/step

As we can see, the neural network model also does a good job: the final training loss is about 9.45 (MSE) and the test loss is about 11.31, so the predictions track the actual temperatures reasonably well.

AIC & BIC

Now that we have tried several models, a natural question arises: which one is better?
Of course we want more accurate predictions, but is accuracy the only criterion for a good model? Imagine a model with 1,000 parameters that reaches 95% accuracy, and another model that reaches 97% accuracy with 1,000,000,000 parameters. Can we simply say the second one is better?
Not necessarily. Maybe we don’t need such high accuracy and 95% is enough; the cost of 1,000,000,000 parameters is not worth it, and too many parameters may cause overfitting. We need more comprehensive criteria for model assessment, and two common ones are AIC and BIC.

AIC and BIC are used for model selection in the context of regression analysis and other parametric models.

AIC stands for Akaike Information Criterion, and it is defined as:

AIC = 2p − 2 ln(L)

where L is the (maximized) likelihood of the model, and p is the number of parameters in the model. AIC penalizes models with more parameters to prevent overfitting, but still takes into account the goodness of fit of the model.

  • The likelihood function L(θ | D) is defined as the probability of the observed data D given the parameters θ of the model

BIC stands for Bayesian Information Criterion, and it is defined as:

BIC = p ln(n) − 2 ln(L)

where L is the likelihood function of the model, p is the number of parameters in the model, and n is the sample size. BIC also penalizes models with more parameters, but its penalty is more severe than AIC’s, and it grows with the sample size through the ln(n) factor.
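To make the two formulas concrete, we can plug in the numbers from the ARIMA fit shown below (log likelihood ≈ -7540.902, p = 5 estimated parameters, about 2922 observations; statsmodels may use a slightly smaller effective sample size after differencing, which changes BIC only marginally):

import numpy as np

log_L = -7540.902   # log-likelihood of the fitted model
p = 5               # number of estimated parameters
n = 2922            # (approximate) sample size

aic = 2 * p - 2 * log_L
bic = p * np.log(n) - 2 * log_L
print(aic, bic)     # roughly 15091.8 and 15121.7, matching the summary below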

It is convenient to get the AIC and BIC of a model with the help of statsmodels or other frameworks and tools. For example, we can fit an ARIMA model with statsmodels (using sklearn only for the train/test split) and read its AIC and BIC from the fit summary.

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from sklearn.model_selection import train_test_split

df = df_original.copy(deep=False)
df['tavg'].interpolate(inplace=True)
# Convert the date column to a datetime object
df['time'] = pd.to_datetime(df['time'])

# Set the date column as the index
df.set_index('time', inplace=True)

# Split the data into training and testing sets using a fixed split
train_data, test_data = train_test_split(df['tavg'], test_size=0.2, shuffle=False)

model = ARIMA(train_data, order=(2, 2, 2))
results = model.fit()

# Test the model using the testing data
predictions = results.forecast(steps=len(test_data))

# Plot the actual values against the forecast
plt.plot(test_data.values, label='actual')
plt.plot(predictions.values, label='forecast')
plt.legend()
plt.show()

print(results.summary())
                               SARIMAX Results                                
==============================================================================
Dep. Variable: tavg No. Observations: 2922
Model: ARIMA(2, 2, 2) Log Likelihood -7540.902
Date: Fri, 31 Mar 2023 AIC 15091.804
Time: 20:40:06 BIC 15121.701
Sample: 03-01-2013 HQIC 15102.573
- 02-28-2021
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 0.7322 0.018 40.480 0.000 0.697 0.768
ar.L2 -0.2365 0.017 -13.939 0.000 -0.270 -0.203
ma.L1 -1.9066 0.010 -193.891 0.000 -1.926 -1.887
ma.L2 0.9102 0.010 92.589 0.000 0.891 0.930
sigma2 10.2244 0.230 44.481 0.000 9.774 10.675
===================================================================================
Ljung-Box (L1) (Q): 1.92 Jarque-Bera (JB): 68.47
Prob(Q): 0.17 Prob(JB): 0.00
Heteroskedasticity (H): 1.01 Skew: 0.07
Prob(H) (two-sided): 0.88 Kurtosis: 3.74
===================================================================================

Both AIC and BIC are used to compare different models and select the one that fits the data the best, while avoiding overfitting. In general, the lower the AIC or BIC value, the better the model. However, the two criteria may lead to different choices of the best model, especially when the sample size is small or the number of parameters is large.
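As a sketch of how this comparison works in practice, we can fit a few candidate ARIMA orders on the same training data and pick the one with the smallest AIC (or BIC); the candidate orders here are illustrative, not a recommendation:

from statsmodels.tsa.arima.model import ARIMA

for order in [(1, 1, 1), (2, 1, 2), (2, 2, 2)]:
    res = ARIMA(train_data, order=order).fit()
    print(order, round(res.aic, 1), round(res.bic, 1))
# The order with the lowest AIC/BIC is the preferred candidate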

AIC and BIC are widely used in a variety of fields, including statistics, econometrics, and machine learning, to select the most appropriate model from a set of candidate models.

Quiz

  1. Can you give an example of a time series?
    Stock prices: the daily closing prices of a particular stock over a period of time.
  2. What are some methods for data imputation?
    Mean imputation, Last observation carried forward (LOCF), Linear interpolation, Seasonal decomposition imputation, Time series forecasting imputation, etc.
  3. What is autocovariance?
    Autocovariance is a measure of how much two values in a time series are related to each other at different time lags. More specifically, autocovariance measures the covariance between a time series and a lagged version of itself.
  4. What is autocorrelation?
    Autocorrelation is a measure of how correlated a time series is with a lagged version of itself at different time lags. In other words, it measures the degree of similarity between observations of a time series separated by a particular time lag.
  5. What is Moving Average Model?
    The Moving Average (MA) model is a popular time series model used for predicting future values of a series. It expresses the current value of the series as a linear combination of the current and past random error (white-noise) terms, rather than of the past observed values themselves.
  6. What are Autoregressive Models?
    Autoregressive Models (AR) assume that the current value of a variable is a linear combination of its past values, weighted by coefficients called autoregressive parameters.
  7. What is Autoregressive Moving Average Model?
    The ARMA model combines the AR and MA models and uses both past values and past errors to predict future values.
  8. What is stationarity?
    In time series analysis, stationarity refers to the statistical properties of a time series that remain constant over time. Specifically, a stationary time series has constant mean, variance, and autocorrelation structure across all time points.
  9. Difference between strong and weak stationarity?
    Weak stationarity requires that the mean and variance are constant over time and that the autocovariance between two time points depends only on the lag between them, not on their absolute position in time.
    Strong (strict) stationarity requires that all statistical properties of the series remain unchanged under time shifts; that is, the joint distribution of any subset of time points is the same after shifting them all by the same amount. When the second moments exist, strong stationarity implies weak stationarity.
  10. What are AIC & BIC?
    AIC and BIC are criteria for model selection in regression analysis and other parametric models. AIC (Akaike Information Criterion) is defined as AIC = 2p − 2 ln(L), and BIC (Bayesian Information Criterion) as BIC = p ln(n) − 2 ln(L), where L is the likelihood of the model, p is the number of parameters, and n is the sample size. Both reward goodness of fit while penalizing the number of parameters; BIC’s penalty is stronger because it grows with ln(n).
