Olympic Medal Numbers Predictions with Time Series, Part 3: Time Series Forecasting

Hasan Basri Akçay
Published in DataBulls
8 min read · Oct 12, 2021

Fbprophet, Darts, AutoTS, Arima, Sarimax, and Monte Carlo Simulation

In Part 1, we cleaned the data: imputing missing values, dropping constant columns, and matching misspelled words.

In Part 2, we analyzed the data: finding trends, checking the data distribution, calculating p-values, and checking predictability. That analysis revealed important missing values for the 1980 Olympic Games. You can read the details in Part 2.

In this part, we apply several time series forecasting models to the dataset and compare their scores. The models are Fbprophet, Darts, AutoTS, Arima, Sarimax, and Monte Carlo Simulation.

Before starting, we import the libraries used for forecasting. They are listed below.

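import numpy as np
import pandas as pd

import darts.models  # also binds the name "darts"
from darts import TimeSeries
# Import path as used in this work; the package documented at the AutoTS link
# in section 3 is imported with `from autots import AutoTS`.
from AutoTS.AutoTS import AutoTS
from prophet import Prophet  # the package was named `fbprophet` before version 1.0
from pmdarima import auto_arima
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from scipy.stats import norm

from sklearn.metrics import mean_squared_error

# Number of most recent observations held out for testing
test_size = 2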

1. Fbprophet

Fbprophet is an open-source library developed by Facebook for univariate time series forecasting. It supports seasonality and holidays. It expects fixed column names: to work with fbprophet, rename the time column to ‘ds’ and the value column to ‘y’. You can find more info at https://facebook.github.io/prophet/docs/quick_start.html.

Fbprophet is a univariate model. For this reason, we forecast medals for each country separately and score the forecasts with root mean squared error (RMSE), averaged over all countries.

total_rmse = 0
for team in df_athletics['Team'].unique():
    # Prophet expects the columns 'ds' (time) and 'y' (value)
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['y'] = df_athletics_timeseries[team].values

    # Hold out the last test_size observations for testing
    train = train_test[:-test_size]
    test = train_test[-test_size:]

    model = Prophet(growth='linear')
    model.fit(train)
    future = model.make_future_dataframe(periods=test_size)
    forecast = model.predict(future)

    rmse = mean_squared_error(test['y'], forecast['yhat'][-test_size:], squared=False)
    total_rmse += rmse
print('Prophet RMSE: ', total_rmse/len(df_athletics['Team'].unique()))
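
The same evaluation scheme is reused for every model in this article: hold out the last test_size points of each country's series, forecast them, and average the per-country RMSE. The following is a minimal, library-free sketch of that scheme. The names `rmse`, `average_rmse`, `last_value_forecast`, and the toy medal counts are made up for illustration and are not part of the original work.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def average_rmse(series_by_team, forecast_fn, test_size=2):
    """Hold out the last `test_size` points of every series and average the RMSE."""
    scores = []
    for values in series_by_team.values():
        train, test = values[:-test_size], values[-test_size:]
        scores.append(rmse(test, forecast_fn(train, test_size)))
    return sum(scores) / len(scores)


def last_value_forecast(train, horizon):
    """Placeholder model: repeat the last observed value."""
    return [train[-1]] * horizon


# Made-up medal counts, for illustration only.
toy_medals = {"A": [1, 2, 3, 4, 5], "B": [2, 2, 2, 2, 2]}

score = average_rmse(toy_medals, last_value_forecast)
print(round(score, 4))
```

Any of the models below could be plugged in through `forecast_fn`, as long as it takes the training values and a horizon and returns that many predictions.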

2. Darts

Darts is an open-source library developed by Unit8 for time series forecasting. It includes many different models for time series. In this work, we used FFT (Fast Fourier Transform), ExponentialSmoothing, RegressionModel, RandomForest, LightGBMModel, and the baseline models. Each model has its own features; for example, FFT does not support multivariate time series forecasting, but RandomForest does. You can find more info at https://unit8co.github.io/darts/.

As before, we forecast medals for each country and score the forecasts with root mean squared error (RMSE).

2.1 Darts — FFT

total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['range'] = np.arange(len(df_athletics_timeseries))
    train_test['y'] = df_athletics_timeseries[team].values

    # Build a Darts series on the integer index 'range'
    series = TimeSeries.from_dataframe(train_test, 'range', 'y')

    # Integer split point: the last test_size points become the validation set
    train, val = series.split_before(len(series) - test_size)

    model = darts.models.FFT()
    model.fit(train)
    prediction = model.predict(len(val))
    rmse = mean_squared_error(val.values(), prediction.values(), squared=False)
    total_rmse += rmse
print('Darts FFT RMSE: ', total_rmse/len(df_athletics['Team'].unique()))

2.2 Darts — ExponentialSmoothing

total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    # Placeholder daily dates, used here in place of the integer index
    train_test['range'] = pd.date_range(start='2021-04-01', end='2021-04-29', periods=len(df_athletics_timeseries))
    train_test['y'] = df_athletics_timeseries[team].values

    series = TimeSeries.from_dataframe(train_test, 'range', 'y')

    # The last two placeholder dates (April 28 and 29) form the validation set
    train, val = series.split_before(pd.Timestamp('2021-04-28'))

    model = darts.models.ExponentialSmoothing()
    model.fit(train)
    prediction = model.predict(len(val))
    rmse = mean_squared_error(val.values(), prediction.values(), squared=False)
    total_rmse += rmse
print('Darts ExponentialSmoothing RMSE: ', total_rmse/len(df_athletics['Team'].unique()))

2.3 Darts — RegressionModel

total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['range'] = np.arange(len(df_athletics_timeseries))
    train_test['y'] = df_athletics_timeseries[team].values

    series = TimeSeries.from_dataframe(train_test, 'range', 'y')

    # Integer split point: the last test_size points become the validation set
    train, val = series.split_before(len(series) - test_size)

    # lags=1: predict each value from the previous one
    model = darts.models.RegressionModel(lags=1)
    model.fit(train)
    prediction = model.predict(len(val))
    rmse = mean_squared_error(val.values(), prediction.values(), squared=False)
    total_rmse += rmse
print('Darts RegressionModel RMSE: ', total_rmse/len(df_athletics['Team'].unique()))
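
To make the idea behind lags=1 concrete: with its default settings, RegressionModel fits a linear regression, so each value is predicted from the one before it, and multi-step forecasts feed predictions back in as inputs. Below is a minimal, library-free sketch of that idea. It is not the Darts implementation, and `fit_lag1`, `forecast_lag1`, and the toy series are hypothetical names and data for illustration.

```python
def fit_lag1(values):
    """Ordinary least squares for y[t] = intercept + slope * y[t-1]."""
    x, y = values[:-1], values[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    if var_x == 0:
        # Constant input: fall back to predicting the mean of the targets
        return mean_y, 0.0
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = cov_xy / var_x
    return mean_y - slope * mean_x, slope


def forecast_lag1(values, horizon):
    """Forecast recursively: each prediction becomes the next input."""
    intercept, slope = fit_lag1(values)
    forecasts, previous = [], values[-1]
    for _ in range(horizon):
        previous = intercept + slope * previous
        forecasts.append(previous)
    return forecasts


# Made-up series that follows y[t] = 1 + 2 * y[t-1] exactly.
toy = [1.0, 3.0, 7.0, 15.0]
intercept, slope = fit_lag1(toy)
forecast = forecast_lag1(toy, 2)
print(intercept, slope, forecast)
```

Because the toy series follows the lag-1 rule exactly, the fitted line recovers it and the forecast continues the pattern; real medal counts are far noisier.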

2.4 Darts — RandomForest

total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['range'] = np.arange(len(df_athletics_timeseries))
    train_test['y'] = df_athletics_timeseries[team].values

    series = TimeSeries.from_dataframe(train_test, 'range', 'y')

    # Integer split point: the last test_size points become the validation set
    train, val = series.split_before(len(series) - test_size)

    model = darts.models.RandomForest(lags=1)
    model.fit(train)
    prediction = model.predict(len(val))
    rmse = mean_squared_error(val.values(), prediction.values(), squared=False)
    total_rmse += rmse
print('Darts RandomForest RMSE: ', total_rmse/len(df_athletics['Team'].unique()))

2.5 Darts — LightGBMModel

total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['range'] = np.arange(len(df_athletics_timeseries))
    train_test['y'] = df_athletics_timeseries[team].values

    series = TimeSeries.from_dataframe(train_test, 'range', 'y')

    # Integer split point: the last test_size points become the validation set
    train, val = series.split_before(len(series) - test_size)

    model = darts.models.LightGBMModel(lags=1)
    model.fit(train)
    prediction = model.predict(len(val))
    rmse = mean_squared_error(val.values(), prediction.values(), squared=False)
    total_rmse += rmse
print('Darts LightGBMModel RMSE: ', total_rmse/len(df_athletics['Team'].unique()))

2.6 Darts — Baseline Models

total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['range'] = np.arange(len(df_athletics_timeseries))
    train_test['y'] = df_athletics_timeseries[team].values

    series = TimeSeries.from_dataframe(train_test, 'range', 'y')

    # Integer split point: the last test_size points become the validation set
    train, val = series.split_before(len(series) - test_size)

    # NaiveDrift extends the line through the first and last training points
    model = darts.models.NaiveDrift()
    model.fit(train)
    prediction = model.predict(len(val))
    rmse = mean_squared_error(val.values(), prediction.values(), squared=False)
    total_rmse += rmse
print('Darts NaiveDrift RMSE: ', total_rmse/len(df_athletics['Team'].unique()))
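
For intuition, the NaiveDrift baseline draws a straight line through the first and last training points and extends it into the future. A minimal sketch of that rule in plain Python follows; `naive_drift_forecast` and the toy counts are hypothetical and only illustrate the formula, they are not the Darts code.

```python
def naive_drift_forecast(train, horizon):
    """Extend the straight line through the first and last training points."""
    if len(train) < 2:
        raise ValueError("need at least two training points")
    slope = (train[-1] - train[0]) / (len(train) - 1)
    return [train[-1] + step * slope for step in range(1, horizon + 1)]


# Made-up medal counts, for illustration only.
toy_train = [2, 3, 5, 8]
forecast = naive_drift_forecast(toy_train, 2)
print(forecast)  # [10.0, 12.0]
```

Here the series rises by 6 over 3 steps, so the drift is 2 per step and the two forecasts continue from the last value, 8.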

3. AutoTS

AutoTS is an open-source library that automates time series forecasting. It supports multivariate time series forecasting. You can find more info at https://winedarksea.github.io/AutoTS/build/html/source/tutorial.html.

df_athletics_timeseries.index = pd.date_range(start='2021-04-01', end='2021-04-29', periods=len(df_athletics_timeseries))
total_rmse = 0
for team in df_athletics['Team'].unique():
    train = df_athletics_timeseries[:-test_size]
    test = df_athletics_timeseries[-test_size:]

    model = AutoTS()
    model.fit(train, series_column_name=team)
    preds = model.predict(start=pd.to_datetime('2021-04-28 00:00:00'), end=pd.to_datetime('2021-04-29 00:00:00'))
    rmse = mean_squared_error(test[team], preds, squared=False)
    total_rmse += rmse
print('AutoTS RMSE: ', total_rmse/len(df_athletics['Team'].unique()))

4. Arima

Arima (Autoregressive Integrated Moving Average) is a statistical analysis model. It can be used to better understand the data set or to predict future trends. You can find more info at https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima.model.ARIMA.html

total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['y'] = df_athletics_timeseries[team].values

    train = train_test[:-test_size]
    test = train_test[-test_size:]

    # auto_arima searches for a suitable (p, d, q) order on the training data
    stepwise_fit = auto_arima(train['y'], trace=False, suppress_warnings=True)
    model = ARIMA(train['y'], order=stepwise_fit.order)
    model_fit = model.fit()
    preds = model_fit.forecast(test_size)
    rmse = mean_squared_error(test['y'], preds, squared=False)
    total_rmse += rmse
print('Arima RMSE: ', total_rmse/len(df_athletics['Team'].unique()))

5. Sarimax

Sarimax (Seasonal ARIMA with exogenous regressors) is a statistical analysis model. The difference between arima and sarima is that sarima supports seasonality. We also used sarimax to understand the data in Part 2. You can find more info about sarimax at https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html

total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['y'] = df_athletics_timeseries[team].values

    train = train_test[:-test_size]
    test = train_test[-test_size:]

    # auto_arima searches for a suitable (p, d, q) order on the training data
    stepwise_fit = auto_arima(train['y'], trace=False, suppress_warnings=True)
    model = SARIMAX(train['y'], order=stepwise_fit.order)
    model_fit = model.fit()
    preds = model_fit.forecast(test_size)
    rmse = mean_squared_error(test['y'], preds, squared=False)
    total_rmse += rmse
print('Sarimax RMSE: ', total_rmse/len(df_athletics['Team'].unique()))

6. Monte Carlo Simulation

Monte Carlo Simulation is a forecasting approach for quantities that cannot easily be predicted because of the intervention of random variables. Basically, it generates many random paths (the simulation number) according to properties of the data such as the standard deviation and variance. Then it selects the path that best fits the data as the forecast.

simulation_num = 500
days_to_test = 27
days_to_predict = 2
total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['y'] = df_athletics_timeseries[team].values

    train = train_test[:-test_size]
    test = train_test[-test_size:]

    ########### Monte Carlo

    # Log returns between consecutive observations; undefined values become 0
    daily_return = np.log(1 + train['y'].pct_change())
    daily_return.replace([np.inf, -np.inf], 0, inplace=True)
    daily_return.replace(np.nan, 0, inplace=True)
    average_daily_return = daily_return.mean()
    variance = daily_return.var()
    drift = average_daily_return - (variance/2)
    standard_deviation = daily_return.std()

    predictions = np.zeros(days_to_test+days_to_predict)
    predictions[0] = train['y'][0]
    pred_collection = np.ndarray(shape=(simulation_num, days_to_test+days_to_predict))

    # Simulate simulation_num random paths starting from the first training value
    for j in range(0, simulation_num):
        for i in range(1, days_to_test+days_to_predict):
            random_value = standard_deviation * norm.ppf(np.random.rand())
            predictions[i] = predictions[i-1] * np.exp(drift + random_value)
        pred_collection[j] = predictions

    # Score every path by its total absolute difference from the training data
    differences = np.array([])
    for k in range(0, simulation_num):
        difference_arrays = np.subtract(train['y'].values[-days_to_test:], pred_collection[k][-days_to_test:])
        difference_values = np.sum(np.abs(difference_arrays))
        differences = np.append(differences, difference_values)

    # Keep the path with the smallest difference
    best_fit = np.argmin(differences)
    best_pred = pred_collection[best_fit]

    ###########

    rmse = mean_squared_error(test['y'], best_pred[-days_to_predict:], squared=False)
    total_rmse += rmse
print('Monte Carlo Simulation RMSE: ', total_rmse/len(df_athletics['Team'].unique()))
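
The procedure can also be sketched without NumPy, which makes the individual steps easier to follow: compute log returns, derive drift and volatility, simulate random paths, and keep the path closest to the training data. This is a simplified sketch, not the code used for the scores in this article. The function names, the seed, and the toy data are hypothetical. Unlike the listing above, which compares the last days_to_test simulated points with the training data, this sketch compares the first len(train) simulated points; that alignment is an assumption of the sketch.

```python
import math
import random
import statistics


def log_returns(values):
    """Log returns between consecutive points; undefined ones are set to 0."""
    out = [0.0]  # mirrors the NaN that pct_change() yields for the first point
    for prev, cur in zip(values[:-1], values[1:]):
        out.append(math.log(cur / prev) if prev > 0 and cur > 0 else 0.0)
    return out


def simulate_paths(train, horizon, n_sims, rng):
    """Simulate random multiplicative paths that start at train[0]."""
    returns = log_returns(train)
    drift = statistics.mean(returns) - statistics.variance(returns) / 2
    sigma = statistics.stdev(returns)
    length = len(train) + horizon
    paths = []
    for _ in range(n_sims):
        path = [float(train[0])]
        for _ in range(1, length):
            shock = sigma * rng.gauss(0.0, 1.0)
            path.append(path[-1] * math.exp(drift + shock))
        paths.append(path)
    return paths


def best_fit_forecast(train, horizon, n_sims=500, seed=0):
    """Return the last `horizon` points of the path closest to the training data."""
    rng = random.Random(seed)
    paths = simulate_paths(train, horizon, n_sims, rng)

    def in_sample_error(path):
        # zip stops at len(train), so only the training window is compared
        return sum(abs(a - p) for a, p in zip(train, path))

    return min(paths, key=in_sample_error)[-horizon:]


# Made-up medal counts, for illustration only.
toy_train = [3, 4, 4, 6, 5, 7]
forecast = best_fit_forecast(toy_train, horizon=2, n_sims=50, seed=1)
print(forecast)
```

Fixing the seed makes the sketch reproducible; with a different seed or simulation count, a different path can win, which is one reason the score of this method varies between runs.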

7. Mean Prediction

Mean prediction always predicts the mean of the training data. It is useful as a baseline model. If a machine learning model scores better (lower RMSE) than mean prediction, we can say that its predictions are not random.

total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['y'] = df_athletics_timeseries[team].values

    train = train_test[:-test_size]
    test = train_test[-test_size:]

    # Predict the training mean for every test point
    pred = train['y'].mean()

    rmse = mean_squared_error(test['y'], [pred] * test_size, squared=False)
    total_rmse += rmse
print('Mean Prediction RMSE: ', total_rmse/len(df_athletics['Team'].unique()))

8. Results

You can see the scores below. Except for arima and sarimax, we did no hyperparameter tuning and used the default parameters of the models. With hyperparameter tuning, the scores could improve.

Model                              RMSE
Fbprophet                          3.180356233094025
Darts FFT                          1.7806018079168815
Darts ExponentialSmoothing         2.125171341043955
Darts RegressionModel              1.530246244439605
Darts RandomForest                 1.5932785912162637
Darts LightGBMModel                1.6763816820500097
AutoTS                             1.4205065327167645
Arima                              1.5376332124117644
Sarimax                            1.8889240559227485
Monte Carlo Simulation             1.769909711992457
Mean Prediction                    1.6757768602609089

Discussion

According to the results, four models score better (lower RMSE) than the mean prediction: Darts RegressionModel, Darts RandomForest, Arima, and AutoTS, which has the best score.
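
This comparison is easy to check with a few lines of Python. The RMSE values below are copied from the results listing; the variable names are only for illustration.

```python
# RMSE values copied from the results listing (lower is better).
scores = {
    "Fbprophet": 3.180356233094025,
    "Darts FFT": 1.7806018079168815,
    "Darts ExponentialSmoothing": 2.125171341043955,
    "Darts RegressionModel": 1.530246244439605,
    "Darts RandomForest": 1.5932785912162637,
    "Darts LightGBMModel": 1.6763816820500097,
    "AutoTS": 1.4205065327167645,
    "Arima": 1.5376332124117644,
    "Sarimax": 1.8889240559227485,
    "Monte Carlo Simulation": 1.769909711992457,
    "Mean Prediction": 1.6757768602609089,
}

baseline = scores["Mean Prediction"]

# Sort from best (lowest RMSE) to worst
ranked = sorted(scores.items(), key=lambda item: item[1])

# Models that beat the mean-prediction baseline
better_than_baseline = [name for name, value in ranked if value < baseline]

for name, value in ranked:
    marker = "*" if value < baseline else " "
    print(f"{marker} {name:<28}{value:.4f}")
```

Rows marked with an asterisk beat the baseline. Darts LightGBMModel misses it only narrowly, by less than 0.001.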

First, when we compare the arima and sarimax scores, arima scores better than sarimax. The reason is that there is no seasonality effect on medal numbers. For the same reason, fbprophet scores badly: seasonality is not disabled in its default parameters. After disabling the seasonality effect, the fbprophet score is 1.67.
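
The exact settings used for that run are not shown here. One plausible way to switch off the built-in seasonal components, stated as an assumption, is to pass False for Prophet's three seasonality arguments:

```python
# Assumed settings: disable Prophet's yearly, weekly, and daily seasonality.
no_seasonality = dict(
    growth="linear",
    yearly_seasonality=False,
    weekly_seasonality=False,
    daily_seasonality=False,
)

try:
    from prophet import Prophet  # named `fbprophet` before version 1.0
    model = Prophet(**no_seasonality)
except ImportError:
    # Prophet is not installed; the keyword arguments above still show the idea
    model = None
```

The resulting model would be fitted and scored exactly like the default one in section 1.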

The number of medals a country wins at the Olympics depends on the number of medals won by other countries. So models that support multiple input variables have an advantage in this forecasting task. That is why AutoTS has the best score for this problem. Keep in mind that with another dataset, one that has a seasonality effect and only one input variable, the results could change.


