Olympic Medal Numbers Predictions with Time Series, Part 1: Data Cleaning

Hasan Basri Akçay
Published in DataBulls
3 min read · Oct 10, 2021


fbprophet, Darts, AutoTS, ARIMA, SARIMAX, and Monte Carlo simulation

This is a historical dataset on the modern Olympic Games, covering all the Games from Athens 1896 to Rio 2016. The data was scraped from www.sports-reference.com in May 2018. We found the dataset on Kaggle: https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results.
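The snippets below assume the CSV has been loaded into a pandas DataFrame named data. A minimal sketch; athlete_events.csv is the main file in the Kaggle download:

import pandas as pd

# Load the Kaggle dump and keep a working copy for cleaning.
data = pd.read_csv('athlete_events.csv')
data_clean = data.copy()
print(data.shape)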

Data Cleaning

First, the dataset should be cleaned before forecasting. Common data-cleaning techniques include handling missing values, outliers, constant columns, and incorrectly spelled words.

Impute Missing Values

data.isnull().sum()

ID             0
Name           0
Sex            0
Age         9474
Height     60171
Weight     62875
Team           0
NOC            0
Games          0
Year           0
Season         0
City           0
Sport          0
Event          0
Medal     231333
dtype: int64

There are missing values in four columns: Age, Height, Weight, and Medal. The Medal column is allowed to have missing values, because not every participant wins a medal. In this work, however, we are only interested in rows that have a medal, so we drop the rows whose Medal value is missing.

data_clean = data_clean.dropna(subset=['Medal'])
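As a quick sanity check, the Medal column should now have no missing values; a minimal sketch:

# Every remaining row should carry a medal.
print(data_clean['Medal'].isnull().sum())  # expect 0
print(data_clean.shape)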

We could now impute the missing values in Age, Height, and Weight with the overall mean, but every sport has its own age, height, and weight distribution; compare, for example, the height distributions of weightlifting and basketball. So we impute each missing value with the mean of its own sport.

# Impute with the per-sport mean; the means are computed on the full
# dataset, and pandas aligns the result with data_clean by index.
data_clean['Age'] = data.groupby(['Sport'])['Age'].transform(lambda x: x.fillna(round(x.mean(), 2)))
data_clean['Height'] = data.groupby(['Sport'])['Height'].transform(lambda x: x.fillna(round(x.mean(), 2)))
data_clean['Weight'] = data.groupby(['Sport'])['Weight'].transform(lambda x: x.fillna(round(x.mean(), 2)))
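One caveat: if every athlete in a sport is missing a value, the per-sport mean is itself NaN and the gap survives the transform. A hedged fallback to the overall column mean covers that edge case:

# Fall back to the overall mean for sports whose group mean was NaN.
for col in ['Age', 'Height', 'Weight']:
    data_clean[col] = data_clean[col].fillna(round(data_clean[col].mean(), 2))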

Drop Constant Columns

Constant columns are dead weight for a forecasting model: they have no effect on the forecast. So if there is a constant column, we should drop it. As you can see below, we do not have any.

data_clean.columns[data_clean.nunique() <= 1]

Index([], dtype='object')
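There is nothing to drop here, but for a dataset that does have constant columns, a sketch along these lines would remove them:

# Drop every column with at most one distinct value.
const_cols = data_clean.columns[data_clean.nunique() <= 1]
data_clean = data_clean.drop(columns=const_cols)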

Matching Incorrectly Spelled Words

A dataset created by humans can always contain some incorrectly spelled words. They are dangerous for forecasting models, because the model cannot tell that two variants are the same word. We use the fuzzywuzzy library for this task, looking for word pairs with a match score higher than 90.

from fuzzywuzzy import fuzz, process

team_unique = data_clean['Team'].unique()

# For each team name, look at its closest other match; a score above 90
# usually signals a near-duplicate such as 'Switzerland-1'.
for team in team_unique:
    matches = process.extract(team, team_unique, limit=2, scorer=fuzz.token_sort_ratio)
    if matches[1][1] > 90:
        print(team, matches[1])
Netherlands ('Netherlands-1', 92)
United States ('United States-1', 93)
Soviet Union ('Soviet Union-2', 92)
Unified Team ('Unified Team-2', 92)
Switzerland ('Switzerland-1', 92)
East Germany ('East Germany-1', 92)
Great Britain ('Great Britain-1', 93)
Switzerland-1 ('Switzerland', 92)
Switzerland-2 ('Switzerland', 92)

After finding the incorrectly spelled words, we fix them by deleting the numbers and punctuation marks.

data_clean["Team"] = data_clean['Team'].str.replace('[^\w\s]','')
data_clean["Team"] = data_clean['Team'].str.replace('\d+', '')
data_clean['Team'] = data_clean['Team'].str.strip()
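Re-running the fuzzy search from above is a cheap way to confirm the near-duplicates are gone; ideally the loop now prints nothing:

# 'Switzerland-1' and friends have collapsed into their base names,
# so no pair should score above 90 anymore.
team_unique = data_clean['Team'].unique()
for team in team_unique:
    matches = process.extract(team, team_unique, limit=2, scorer=fuzz.token_sort_ratio)
    if matches[1][1] > 90:
        print(team, matches[1])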

👋 Thanks for reading. If you enjoy my work, don't forget to like and follow me on Medium and on LinkedIn. It will motivate me to offer more content to the Medium community! 😊

The data cleaning part ends here. You can keep reading about the data analysis in Part 2.

Reference

[1]: https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results
[2]: https://www.kaggle.com/hasanbasriakcay/which-country-is-good-at-which-sports-in-olympics
