Exploratory Data Analysis of Olympic Games’ Results for 120 years.

Kazumi Higashino
6 min readOct 12, 2021

What is this

In this article, we’re going to analyze the Olympic dataset which contains the results of each Olympic competition. The dataset has data on athletes who have participated in the Olympics for 120 years(their name, nationality, result, etc.). Through the analysis, we aim to:

  1. know the results of Olympics for 120 years.
  2. understands the trends of the Olympic games’ results.
  3. answer some of the common questions about the dataset.

Dataset

Tools and Libraries used in this project

  • Pandas
  • Numpy
  • matplotlib.pyplot
  • matplotlib.patches
  • seaborn
  • Jupyter Notebook
  • Jovian

Steps

We progress the analysis following these steps:

  1. Download the dataset
  2. Perform data preparation & cleaning
  3. Perform exploratory analysis & visualization
  4. Ask & answer questions about the data

✴︎✴︎✴︎

1. Download the Dataset

Import libraries used in this project.

Get the dataset from local.

2. Data Preparation and Cleaning

As the first step of EDA, we have to understand what the whole dataframe is like. Generally in this step, we’re going to do following things:

  • Explore the number of rows and columns, range of values etc.
  • Handle missing, incorrect, invalid, and duplicated data.

Let’s look at the olympic_df first.

.nunique() function shows the number of unique values.

You can get the list of columns by .columns method.

You can know the number of duplicates rows by .duplicated().sum() functions.

Now, we know the following things about olympic_df.

  • It contains 271116 rows and 15 columns.
  • There are two season types “Summer” and “Winter”.
  • It contains 135571 athletes’ data.
  • 15 columns are ‘ID’, ‘Name’, ‘Sex’, ‘Age’, ‘Height’, ‘Weight’, ‘Team’, ‘NOC’, ‘Games’, ‘Year’, ‘Season’, ‘City’, ‘Sport’, ‘Event’, ‘Medal’.
  • “Sport” column contains 66 competitions.
  • It contains 1385 duplicates values.

It seems there are 1385 duplicated values in the dataframe. Let’s create the new dataframe olympic_df_2 that doesn't contain duplicates values. If multiple rows share same values in ID, Name, Games, Sport, Event, and Medal, we could say those rows are duplicated.

Also, in this project, we’re going to see only about the data of summer olympics. So remove the winter data from dataframe.

Check whether we could create a new dataframe properly.

Duplicates rows are zero, and only containing Summer values in Season column, means we could create a new dataframe properly.

Next, let’s look at regions_df.

Now we will merge the athletes dataframe with the region dataframes on NOCcolumn.

For later analytics, add information of continents to merged_df.

First, we’re going to import those two datasets.

Change the column name for later dataframe combining.
We can change the column name using .rename() function.

Let’s take a look at first 5 rows of new dataframes.

Combine those two dataframes on alpha3 (ISO codes)

Now, we finally can add continents’ information to merged_df. Let's merge two dataframes on NOC.

Check the NaN values in dataframe using .isna().sum() functions.

We should be careful when we treat columns that contain NaN values.

3. Exploratory Analysis and Visualization

We’re going to dig into the dataframe by doing some aggregation and visualization.

First, we’ll make a few changes to the configuration for visualization.

Look at how many countries have participated in Olympics.

Look at the number of athletes from each countries for 120 years.

It seems USA has produced the most athletes to olympics.

Let’s take a look at the same number of each continent.

Europe looks produced two times athletes more than second placed continent.

Now, let’s compare the number of medals of each country.

We want compare the number of each medals, but current “Medal” column only represents categorical data of each medal.

Therefore, we want to create new columns which represent the number of each medal(like Gold, Silver and Bronze). We're going to use get_dummis method to get the number of each medal, and concatenate them to original dataframe using concat() function.

We can create a new dataframe which contains countries, the number of total medals, and the number of each medal they’ve gotten.

Let’s make a stacked bar chart.

It seems USA has gotten the most medals.

Now, we’re going to look at the time trends.

Let’s start with the trends of the number of athletes participated in olympics.

Next, look at the number of countries participated over years.

How about the result of individual athletes?

Let’s take a look at the top 20 athletes who got the most medals for 120 years.

Show this in a bar chart.

4. Asking and Answering Questions

Now, we have gotten deeper perspectives about this dataset.

Let’s ask and answer some common questions using those insights.

Q1: How many countries/athletes have participated over year?

As we can see, both numbers have kept increasing constantly for 120 years.

If you look at Chart2, you may notice a sudden decrease around 1970 to the early 1980s. We could guess this was caused by Cold War. Some countries boycotted the participation of the Olympic because of the political problem.

Q2: Which country produces the most athletes to the Olympics?

We can see USA has produced the most athletes to olympics for 120 years, and it’s almost 8000 athletes.

However, we could say countries have huge population could produce more successful athletes. So, let’s look at the percentage of athletes to total population.

We want to keep the list of top 10 countries have produced the most athletes, and add a new column contains the percentage of athletes to total population.

Now we get new dataframe containing the percentage of the number of athletes to population.

Let’s plot the bar chart of same countries with Charts1, but ordered by percent of athletes to total population.

Compare Charts4 to Charts3, we can see USA moved to the lowest place in Charts4 even though it got the first place in Chart3. This means USA doesn’t really produce many successful athletes comparing to its population. On the other hand, Sweden produces relatively many successful athletes even though it has less population.

Q3: Which country acquires the most medals?

It seems USA has acquired the most medals for recent 120 years.

Now we want to compare Chart5 to Chart3. Let’s merge two tables top_10_countries_medals_df and athletes_each_country.

It seems Japan and Canada are now out from the list top 10 countries produced the most athletes.

This means Japan and Canada has produced many athletes than most countries, but they haven’t been able to acquire not that many medals compare to other top 10 countries.

Q4: Who acquires the most medals in history? and which sport?

Michael Fred Phelps, II has gotten 28 medals and been top in the ranking of athletes have most medals.

But as the Chart6 represents, 7 athletes out of top 10 athletes are from Russia. It seems Russian athletes performed well on average.

Also, if you look at which sport top athletes participated in, it seems most of them are from Gymnastics. Usually, one athlete participates several competitions in Gymnastics or Swimming, so maybe that’s why athletes from those sports can get more medals than athletes from other sports.

✴︎✴︎✴︎

Thank you for reading!!

I apologize if there are any typos or mistakes. This is my very first EDA project, so please feel free to leave any comment of advise or correction.

And thank you for Jovian to let me publish this article. If you want to start learning Python / Pandas for data analysis, I highly recommend Jovians’s free course. This is the perfect course for beginners.

--

--