Olympic Medal Numbers Predictions with Time Series, Part 3: Time Series Forecasting
Fbprophet, Darts, AutoTS, Arima, Sarimax, and Monte Carlo Simulation
In Part 1, we cleaned the data: imputing missing values, dropping constant columns, and matching misspelled words.
In Part 2, we analyzed the data: finding trends, checking distributions, calculating p-values, and checking predictability. That analysis revealed important missing values for the 1980 Olympic Games. You can read the details in Part 2.
In this part, we apply several time series forecasting models to the dataset and compare their scores. The models are Fbprophet, Darts, AutoTS, Arima, Sarimax, and Monte Carlo Simulation.
Before starting, we import the libraries used for forecasting. They are listed below.
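import numpy as np
import pandas as pd
from prophet import Prophet  # older releases: from fbprophet import Prophet
from darts import TimeSeries
import darts.models  # makes darts.models.<Model> available
from AutoTS.AutoTS import AutoTS
from pmdarima import auto_arima  # auto_arima is assumed to come from pmdarima
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from scipy.stats import norm
# squared=False (used below) returns RMSE; recent scikit-learn releases offer root_mean_squared_error instead.
from sklearn.metrics import mean_squared_error

# Number of most recent Games held out for testing.
test_size = 2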
1. Fbprophet
Fbprophet is an open-source library developed by Facebook for univariate time series forecasting. It supports seasonality and holidays. It expects fixed column names: to work with fbprophet, rename the time column to ‘ds’ and the value column to ‘y’. You can find more info at https://facebook.github.io/prophet/docs/quick_start.html.
Because Fbprophet is a univariate model, we forecast medals for each country separately and score the forecasts with root mean squared error (RMSE), averaged over all countries.
total_rmse = 0
for team in df_athletics['Team'].unique():
    # Prophet expects the columns 'ds' (time) and 'y' (value).
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['y'] = df_athletics_timeseries[team].values
    # Hold out the last test_size rows for testing.
    train = train_test[:-test_size]
    test = train_test[-test_size:]
    model = Prophet(growth='linear')
    model.fit(train)
    future = model.make_future_dataframe(periods=test_size)
    forecast = model.predict(future)
    rmse = mean_squared_error(test['y'], forecast['yhat'][-test_size:], squared=False)
    total_rmse += rmse
print('Prophet RMSE: ', total_rmse/len(df_athletics['Team'].unique()))
2. Darts
Darts is an open-source library developed by Unit8 for time series forecasting. It includes many different models for time series. In this work, we used FFT (Fast Fourier Transform), ExponentialSmoothing, RegressionModel, RandomForest, LightGBMModel, and a baseline model. Every model has its own features; for example, FFT does not support multivariate time series forecasting but RandomForest does. You can find more info at https://unit8co.github.io/darts/.
We forecast medals for each country and score the forecasts with root mean squared error (RMSE).
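Every model in this article is scored the same way: hold out the last `test_size` observations of each country's series, forecast them, and average the per-country RMSE. The sketch below shows that loop in plain Python. The medal counts and the helpers `rmse`, `last_value_forecast`, and `evaluate` are made up for illustration and are not part of the original notebook.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def last_value_forecast(train, horizon):
    """Hypothetical stand-in model: repeat the last observed value."""
    return [train[-1]] * horizon


def evaluate(series_by_team, forecaster, test_size=2):
    """Average hold-out RMSE of `forecaster` over all teams."""
    total_rmse = 0.0
    for values in series_by_team.values():
        # The last test_size points are held out, the rest is training data.
        train, test = values[:-test_size], values[-test_size:]
        total_rmse += rmse(test, forecaster(train, test_size))
    return total_rmse / len(series_by_team)


# Toy medal counts per Games (illustrative numbers, not the real dataset).
toy_series = {
    "Team A": [3, 4, 4, 5, 6, 5],
    "Team B": [0, 1, 0, 2, 1, 1],
}
average_rmse = evaluate(toy_series, last_value_forecast, test_size=2)
print(f"Last-value RMSE: {average_rmse:.4f}")
```

Any model can be plugged in as `forecaster` as long as it takes the training values and a horizon and returns that many predictions.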
2.1 Darts — FFT
total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    # An integer range serves as the time axis for Darts.
    train_test['range'] = np.arange(len(df_athletics_timeseries))
    train_test['y'] = df_athletics_timeseries[team].values
    series = TimeSeries.from_dataframe(train_test, 'range', 'y')
    # Integer split point: the last test_size points become the validation set.
    train, val = series.split_before(len(series) - test_size)
    model = darts.models.FFT()
    model.fit(train)
    prediction = model.predict(len(val))
    rmse = mean_squared_error(val.values(), prediction.values(), squared=False)
    total_rmse += rmse
print('Darts FFT RMSE: ', total_rmse/len(df_athletics['Team'].unique()))
2.2 Darts — ExponentialSmoothing
total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['range'] = pd.date_range(start='2021-04-01', end='2021-04-29', periods=len(df_athletics_timeseries))
    train_test['y'] = df_athletics_timeseries[team].values
    Series = TimeSeries.from_dataframe(train_test, 'range', 'y')
    train, val = Series.split_before(pd.Timestamp('2021-04-28'))
    model = darts.models.ExponentialSmoothing()
    model.fit(train)
    prediction = model.predict(len(val))
    rmse = mean_squared_error(val.values(), prediction.values(), squared=False)
    total_rmse += rmse
print('Darts ExponentialSmoothing RMSE: ', total_rmse/len(df_athletics['Team'].unique()))
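To give a feel for what exponential smoothing does, here is a plain-Python sketch of simple exponential smoothing, the most basic member of the family. It is not the Darts implementation, which can also model trend and seasonality; the function names, the `alpha` value, and the sample counts are assumptions chosen for illustration.

```python
def simple_exponential_smoothing(values, alpha=0.5):
    """Return the smoothed level after processing every observation."""
    level = values[0]
    for value in values[1:]:
        # New level: weighted mix of the latest observation and the old level.
        level = alpha * value + (1 - alpha) * level
    return level


def ses_forecast(values, horizon, alpha=0.5):
    """Simple exponential smoothing forecasts a flat line at the last level."""
    return [simple_exponential_smoothing(values, alpha)] * horizon


train = [2, 4, 4, 6]  # illustrative medal counts
forecast = ses_forecast(train, horizon=2, alpha=0.5)
print(forecast)
```

A larger `alpha` makes the forecast follow recent Games more closely, while a smaller one averages over a longer history.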
2.3 Darts — RegressionModel
total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['range'] = np.arange(len(df_athletics_timeseries))
    train_test['y'] = df_athletics_timeseries[team].values
    series = TimeSeries.from_dataframe(train_test, 'range', 'y')
    # Integer split point: the last test_size points become the validation set.
    train, val = series.split_before(len(series) - test_size)
    # lags=1: each value is predicted from the previous one.
    model = darts.models.RegressionModel(lags=1)
    model.fit(train)
    prediction = model.predict(len(val))
    rmse = mean_squared_error(val.values(), prediction.values(), squared=False)
    total_rmse += rmse
print('Darts RegressionModel RMSE: ', total_rmse/len(df_athletics['Team'].unique()))
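With `lags=1`, the regression model predicts each value from the one before it. The sketch below reimplements that idea in plain Python with an ordinary least-squares line, which we assume is close to what the default RegressionModel does. The function names and the sample series are illustrative only.

```python
def fit_lag1_regression(values):
    """Least-squares fit of y[t] = intercept + slope * y[t-1]."""
    x, y = values[:-1], values[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx else 0.0  # a constant series has no usable slope
    intercept = mean_y - slope * mean_x
    return intercept, slope


def forecast_lag1(values, horizon):
    """Forecast step by step, feeding each prediction back in as the next lag."""
    intercept, slope = fit_lag1_regression(values)
    last_value = values[-1]
    predictions = []
    for _ in range(horizon):
        last_value = intercept + slope * last_value
        predictions.append(last_value)
    return predictions


train = [1, 2, 3, 4, 5]  # illustrative medal counts
predictions = forecast_lag1(train, horizon=2)
print(predictions)
```

Because the second forecast is built on the first one, errors can accumulate over longer horizons; with a horizon of two Games the effect stays small.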
2.4 Darts — RandomForest
total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['range'] = np.arange(len(df_athletics_timeseries))
    train_test['y'] = df_athletics_timeseries[team].values
    series = TimeSeries.from_dataframe(train_test, 'range', 'y')
    # Integer split point: the last test_size points become the validation set.
    train, val = series.split_before(len(series) - test_size)
    model = darts.models.RandomForest(lags=1)
    model.fit(train)
    prediction = model.predict(len(val))
    rmse = mean_squared_error(val.values(), prediction.values(), squared=False)
    total_rmse += rmse
print('Darts RandomForest RMSE: ', total_rmse/len(df_athletics['Team'].unique()))
2.5 Darts — LightGBMModel
total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['range'] = np.arange(len(df_athletics_timeseries))
    train_test['y'] = df_athletics_timeseries[team].values
    series = TimeSeries.from_dataframe(train_test, 'range', 'y')
    # Integer split point: the last test_size points become the validation set.
    train, val = series.split_before(len(series) - test_size)
    model = darts.models.LightGBMModel(lags=1)
    model.fit(train)
    prediction = model.predict(len(val))
    rmse = mean_squared_error(val.values(), prediction.values(), squared=False)
    total_rmse += rmse
print('Darts LightGBMModel RMSE: ', total_rmse/len(df_athletics['Team'].unique()))
2.6 Darts — Baseline Models
total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['range'] = np.arange(len(df_athletics_timeseries))
    train_test['y'] = df_athletics_timeseries[team].values
    series = TimeSeries.from_dataframe(train_test, 'range', 'y')
    # Integer split point: the last test_size points become the validation set.
    train, val = series.split_before(len(series) - test_size)
    # NaiveDrift extends the line between the first and last training points.
    model = darts.models.NaiveDrift()
    model.fit(train)
    prediction = model.predict(len(val))
    rmse = mean_squared_error(val.values(), prediction.values(), squared=False)
    total_rmse += rmse
print('Darts NaiveDrift RMSE: ', total_rmse/len(df_athletics['Team'].unique()))
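The NaiveDrift baseline used above draws a straight line through the first and last training points and extends it into the future. A minimal plain-Python sketch of that idea follows; the function name and the sample counts are illustrative, and this is not the Darts code itself.

```python
def naive_drift_forecast(train, horizon):
    """Extend the line through the first and last training points."""
    slope = (train[-1] - train[0]) / (len(train) - 1)
    return [train[-1] + slope * step for step in range(1, horizon + 1)]


train = [2, 3, 3, 5, 6]  # illustrative medal counts
forecast = naive_drift_forecast(train, horizon=2)
print(forecast)
```

A baseline like this costs almost nothing to compute, which makes it a useful yardstick for the heavier models.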
3. AutoTS
AutoTS is an open-source library that automates time series forecasting. It supports multivariate time series forecasting. You can find more info at https://winedarksea.github.io/AutoTS/build/html/source/tutorial.html.
df_athletics_timeseries.index = pd.date_range(start='2021-04-01', end='2021-04-29', periods=len(df_athletics_timeseries))
total_rmse = 0
for team in df_athletics['Team'].unique():
    train = df_athletics_timeseries[:-test_size]
    test = df_athletics_timeseries[-test_size:]
    model = AutoTS()
    model.fit(train, series_column_name=team)
    preds = model.predict(start=pd.to_datetime('2021-04-28 00:00:00'), end=pd.to_datetime('2021-04-29 00:00:00'))
    rmse = mean_squared_error(test[team], preds, squared=False)
    total_rmse += rmse
print('AutoTS RMSE: ', total_rmse/len(df_athletics['Team'].unique()))
4. Arima
Arima (Autoregressive Integrated Moving Average) is a statistical model for time series. It can be used to better understand a data set or to predict future trends. You can find more info at https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima.model.ARIMA.html
total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['y'] = df_athletics_timeseries[team].values
    train = train_test[:-test_size]
    test = train_test[-test_size:]
    # Let auto_arima search for the (p, d, q) order, then refit with statsmodels.
    stepwise_fit = auto_arima(train['y'], trace=False, suppress_warnings=True)
    model = ARIMA(train['y'], order=stepwise_fit.order)
    model_fit = model.fit()
    preds = model_fit.forecast(test_size)
    rmse = mean_squared_error(test['y'], preds, squared=False)
    total_rmse += rmse
print('Arima RMSE: ', total_rmse/len(df_athletics['Team'].unique()))
5. Sarimax
Sarimax (Seasonal ARIMA with eXogenous regressors) is a statistical model. The difference between arima and sarimax is that sarimax can also handle seasonality. We also used sarimax for data understanding in Part 2. You can find more info about sarimax at https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html
total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['y'] = df_athletics_timeseries[team].values
    train = train_test[:-test_size]
    test = train_test[-test_size:]
    stepwise_fit = auto_arima(train['y'], trace=False, suppress_warnings=True)
    # Only the non-seasonal order is passed; seasonal_order keeps its default of (0, 0, 0, 0).
    model = SARIMAX(train['y'], order=stepwise_fit.order)
    model_fit = model.fit()
    preds = model_fit.forecast(test_size)
    rmse = mean_squared_error(test['y'], preds, squared=False)
    total_rmse += rmse
print('Sarimax RMSE: ', total_rmse/len(df_athletics['Team'].unique()))
6. Monte Carlo Simulation
Monte Carlo Simulation is a forecasting approach for processes that cannot easily be predicted because of the influence of random variables. Basically, it simulates many random paths (the simulation number) based on properties of the data such as standard deviation and variance. Then it selects the path that best fits the training data and uses it for forecasting.
simulation_num = 500
days_to_test = 27    # number of training points each simulated path is compared with
days_to_predict = 2  # forecast horizon (equals test_size)
total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['y'] = df_athletics_timeseries[team].values
    train = train_test[:-test_size]
    test = train_test[-test_size:]
    ########### Monte Carlo
    # Log returns between consecutive Games; undefined values (division by zero) become 0.
    daily_return = np.log(1 + train['y'].pct_change())
    daily_return.replace([np.inf, -np.inf], 0, inplace=True)
    daily_return.replace(np.nan, 0, inplace=True)
    average_daily_return = daily_return.mean()
    variance = daily_return.var()
    drift = average_daily_return - (variance/2)
    standard_deviation = daily_return.std()
    predictions = np.zeros(days_to_test+days_to_predict)
    predictions[0] = train['y'].iloc[0]
    pred_collection = np.empty((simulation_num, days_to_test+days_to_predict))
    for j in range(simulation_num):
        for i in range(1, days_to_test+days_to_predict):
            random_value = standard_deviation * norm.ppf(np.random.rand())
            predictions[i] = predictions[i-1] * np.exp(drift + random_value)
        pred_collection[j] = predictions
    # Pick the simulated path closest to the training data. The first
    # days_to_test simulated points line up with the training points.
    differences = np.array([])
    for k in range(simulation_num):
        difference_arrays = np.subtract(train['y'].values[-days_to_test:], pred_collection[k][:days_to_test])
        difference_values = np.sum(np.abs(difference_arrays))
        differences = np.append(differences, difference_values)
    best_fit = np.argmin(differences)
    best_pred = pred_collection[best_fit]
    ###########
    rmse = mean_squared_error(test['y'], best_pred[-days_to_predict:], squared=False)
    total_rmse += rmse
print('Monte Carlo Simulation RMSE: ', total_rmse/len(df_athletics['Team'].unique()))
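The simulation above can be condensed into a few small functions. The sketch below uses only the standard library and follows the same recipe: estimate drift and volatility from log returns, simulate many random paths from the first training value, and keep the path that is closest to the training data. The function names, the fixed seed, and the sample counts are assumptions for illustration. It draws the random shock with `random.gauss`, which has the same distribution as `norm.ppf(np.random.rand())`, and it compares each path with the training data point by point from the start of the series.

```python
import math
import random
import statistics


def log_returns(values):
    """Log returns between consecutive points; undefined ratios count as 0."""
    returns = []
    for prev, curr in zip(values, values[1:]):
        returns.append(math.log(curr / prev) if prev > 0 and curr > 0 else 0.0)
    return returns


def simulate_paths(train, horizon, n_paths, rng):
    """Simulate random walks in log space, all starting at train[0]."""
    returns = log_returns(train)
    sigma = statistics.stdev(returns)
    drift = statistics.mean(returns) - statistics.variance(returns) / 2
    paths = []
    for _ in range(n_paths):
        path = [float(train[0])]
        for _step in range(len(train) + horizon - 1):
            shock = sigma * rng.gauss(0, 1)
            path.append(path[-1] * math.exp(drift + shock))
        paths.append(path)
    return paths


def best_path_forecast(train, horizon, n_paths=500, seed=0):
    """Return the last `horizon` points of the path closest to the training data."""
    rng = random.Random(seed)
    paths = simulate_paths(train, horizon, n_paths, rng)

    def training_error(path):
        # zip stops at len(train), so only the overlapping points are compared.
        return sum(abs(a - p) for a, p in zip(train, path))

    best = min(paths, key=training_error)
    return best[-horizon:]


train = [3, 4, 4, 5, 6, 5, 7]  # illustrative medal counts
forecast = best_path_forecast(train, horizon=2)
print(forecast)
```

Fixing the seed makes the result reproducible; with a different seed the selected path, and therefore the forecast, will change.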
7. Mean Prediction
Mean prediction always predicts the mean of the training data. It is useful as a baseline model. If a machine learning model scores better (has a lower RMSE) than the mean prediction, we can say that its predictions are not random.
total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['y'] = df_athletics_timeseries[team].values
    train = train_test[:-test_size]
    test = train_test[-test_size:]
    # Predict the training mean for every test point.
    pred = train['y'].mean()
    rmse = mean_squared_error(test['y'], [pred] * test_size, squared=False)
    total_rmse += rmse
print('Mean Prediction RMSE: ', total_rmse/len(df_athletics['Team'].unique()))
8. Results
You can see the scores below. Apart from choosing the arima and sarimax orders with auto_arima, we did no hyperparameter tuning and used the default parameters of the models. With hyperparameter tuning, the scores could improve.
Fbprophet RMSE: 3.180356233094025
Darts FFT RMSE: 1.7806018079168815
Darts ExponentialSmoothing RMSE: 2.125171341043955
Darts RegressionModel RMSE: 1.530246244439605
Darts RandomForest RMSE: 1.5932785912162637
Darts LightGBMModel RMSE: 1.6763816820500097
AutoTS RMSE: 1.4205065327167645
Arima RMSE: 1.5376332124117644
Sarimax RMSE: 1.8889240559227485
Monte Carlo Simulation RMSE: 1.769909711992457
Mean Prediction RMSE: 1.6757768602609089
Discussion
According to the results, four models score better (have a lower RMSE) than the mean prediction: Darts RegressionModel, Darts RandomForest, Arima, and AutoTS. AutoTS has the best score.
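The comparison with the baseline can be checked in a few lines. The sketch below copies the RMSE values reported in the results and lists the models that beat the mean prediction, best first. The variable names are our own.

```python
# RMSE values copied from the results above (lower is better).
scores = {
    "Fbprophet": 3.180356233094025,
    "Darts FFT": 1.7806018079168815,
    "Darts ExponentialSmoothing": 2.125171341043955,
    "Darts RegressionModel": 1.530246244439605,
    "Darts RandomForest": 1.5932785912162637,
    "Darts LightGBMModel": 1.6763816820500097,
    "AutoTS": 1.4205065327167645,
    "Arima": 1.5376332124117644,
    "Sarimax": 1.8889240559227485,
    "Monte Carlo Simulation": 1.769909711992457,
}
baseline = 1.6757768602609089  # Mean Prediction RMSE

# Keep the models with a lower RMSE than the baseline, sorted best first.
better_than_baseline = sorted(
    (name for name, score in scores.items() if score < baseline),
    key=scores.get,
)
print(better_than_baseline)
```

Darts LightGBMModel misses the baseline only narrowly (1.6764 against 1.6758), so it may well cross the line with a little tuning.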
First, when we compare the arima and sarimax scores, arima scores better than sarimax. The reason is that there is no seasonality effect on medal numbers. For the same reason, fbprophet scores badly: the seasonality effect is not disabled in its default parameters. After disabling the seasonality effect, the fbprophet score is 1.67.
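The exact settings used to switch off seasonality are not shown in this article. One plausible way, sketched below, is to set Prophet's three seasonality arguments to `False` when creating the model. The `prophet_kwargs` name is our own, and the import is guarded so the snippet also runs where Prophet is not installed.

```python
# Keyword arguments that switch off Prophet's automatic seasonal components.
prophet_kwargs = {
    "growth": "linear",
    "yearly_seasonality": False,
    "weekly_seasonality": False,
    "daily_seasonality": False,
}

try:
    from prophet import Prophet  # older releases: from fbprophet import Prophet

    model = Prophet(**prophet_kwargs)
except ImportError:
    model = None  # Prophet is not installed; the settings above still show the idea

print(prophet_kwargs)
```

With these arguments the model is left with its trend component only, which fits the observation that medal counts show no seasonal pattern.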
The number of medals a country wins at the Olympics depends on the number of medals won by other countries. So models that support multiple input variables have an advantage in this forecasting task. That is why AutoTS has the best score for this problem. Keep in mind that with another dataset that has a seasonality effect and just one input variable, the results could change.
References:
[1]: https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results
[2]: https://www.kaggle.com/hasanbasriakcay/which-country-is-good-at-which-sports-in-olympics
[3]: https://2001-2009.state.gov/r/pa/ho/time/qfp/104481.htm