Exploratory Data Analysis in R: Tokyo 2020 Olympics
--
I have a Sports Science background, and ever since I was a kid, the Olympic Games and the FIFA World Cup have been the two competitions I always looked forward to. Fast forward a few years and I am now learning Data Analytics, so I decided to practice my R programming skills by analyzing data from the Tokyo 2020 (2021) Olympic Games.
I wanted to use R to find out which countries and which sports had the most athletes participating, which countries won the most medals, which sports had the most male and female athletes, and which ones had the largest gender gap.
To do this, I downloaded a dataset I found on Kaggle (here) and started playing with it. If you want to see my analysis in detail, along with the running code, you can check it out here.
Data Preparation
After importing the Athletes, Gender, and Medals .csv files into R, I used the clean_names function from the janitor package to convert all column names to lowercase and replace spaces with underscores:
library(janitor)  # provides clean_names()
# Standardize the column names of each imported table
athletes <- clean_names(Athletes)
gender <- clean_names(Gender)
medals <- clean_names(Medals)
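To make the effect of clean_names concrete, here is a small sketch on a toy data frame. The column names "Team Name" and "Rank by Total" are made up for illustration and are not necessarily the ones in the Kaggle files.

```r
library(janitor)

# Toy data frame with messy column names (hypothetical, for illustration only)
raw <- data.frame(
  "Team Name" = c("Team A", "Team B"),
  "Rank by Total" = c(1, 2),
  check.names = FALSE  # keep the spaces so clean_names() has something to fix
)

# Lowercase the names and replace spaces with underscores
cleaned <- clean_names(raw)
names(cleaned)
```

Having consistent snake_case names up front means later dplyr code can refer to columns without backticks.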
The Analysis
I used the ggplot2 and dplyr packages for this analysis. To begin with, let’s see how many countries, sports, and athletes took part in these Games:
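A minimal sketch of how these counts could be computed with dplyr. It assumes the cleaned athletes table has one row per athlete with columns named noc (country) and discipline (sport); both names, and the toy data, are assumptions for illustration.

```r
library(dplyr)

# Toy stand-in for the cleaned athletes table (column names are assumptions)
athletes <- tibble(
  name = c("Athlete 1", "Athlete 2", "Athlete 3", "Athlete 4", "Athlete 5"),
  noc = c("Japan", "Japan", "Brazil", "Kenya", "Kenya"),
  discipline = c("Judo", "Athletics", "Football", "Athletics", "Athletics")
)

# One-row summary: distinct countries, distinct sports, and number of rows
overview <- athletes %>%
  summarise(
    countries = n_distinct(noc),
    sports = n_distinct(discipline),
    athletes = n()
  )

overview
```

Note that n() counts rows, so an athlete entered in two sports would be counted twice if the table has one row per entry.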
Let’s see the top 15 countries by number of athletes participating:
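One way to build such a ranking is to count rows per country, sort, keep the first 15, and draw a horizontal bar chart. This is only a sketch on toy data: the noc column name is an assumption, and swapping it for a discipline column would give the equivalent ranking by sport.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the cleaned athletes table (column names are assumptions)
athletes <- tibble(
  name = paste("Athlete", 1:6),
  noc = c("Japan", "Japan", "Japan", "Brazil", "Brazil", "Kenya"),
  discipline = c("Judo", "Judo", "Athletics", "Football", "Football", "Athletics")
)

# Count athletes per country, largest first, and keep at most 15 rows
top_countries <- athletes %>%
  count(noc, sort = TRUE, name = "n_athletes") %>%
  head(15)

# Horizontal bar chart, with bars ordered by athlete count
p <- ggplot(top_countries, aes(x = reorder(noc, n_athletes), y = n_athletes)) +
  geom_col() +
  coord_flip() +
  labs(x = "Country", y = "Number of athletes")
```

Printing p renders the chart; reorder() is what keeps the bars sorted by count instead of alphabetically.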
Then, let’s see the top 15 sports by number of athletes participating:
Let’s talk medals now… How many medals were won in total at the Olympics?
First, I needed to convert the medal counts to numbers, sum them, compute percentages, round the results, and reorder the columns:
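These steps could look roughly like the following. The sketch assumes the medal counts were imported as text and that the cleaned table has columns team_noc, gold, silver, and bronze; the names and the toy values are assumptions, not the real data.

```r
library(dplyr)

# Toy stand-in for the cleaned medals table, with counts stored as text
medals <- tibble(
  team_noc = c("Team A", "Team B", "Team C"),
  gold = c("3", "1", "0"),
  silver = c("2", "2", "1"),
  bronze = c("1", "0", "0")
)

medals_summary <- medals %>%
  mutate(
    # Convert the three medal columns from character to numeric
    across(c(gold, silver, bronze), as.numeric),
    # Total medals per team
    total = gold + silver + bronze,
    # Share of all medals, as a percentage rounded to one decimal
    pct_total = round(100 * total / sum(total), 1)
  ) %>%
  # Put the identifying and summary columns first
  select(team_noc, total, pct_total, gold, silver, bronze)

medals_summary
```

From here, arrange(desc(total)) followed by head(15) would give a top 15 by total medals.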
Now, let’s see the top 15 countries by number of total medals won:
And when it comes to percent of total medals won:
And now, let’s check the champions! Take a look at bronze, silver, and gold medals won:
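As a sketch of how per-medal rankings could be produced, the toy example below sorts teams by each medal type. The column names (team_noc, gold, silver, bronze) and the values are assumptions for illustration; breaking gold ties by silver and then bronze is my choice here, not necessarily what the original analysis did.

```r
library(dplyr)

# Toy stand-in for the cleaned medals table with numeric counts
medals <- tibble(
  team_noc = c("Team A", "Team B", "Team C"),
  gold = c(3, 3, 1),
  silver = c(2, 4, 5),
  bronze = c(1, 0, 5)
)

# Rank by gold, breaking ties with silver and then bronze
by_gold <- medals %>%
  arrange(desc(gold), desc(silver), desc(bronze)) %>%
  head(15)

# Rank by bronze only
by_bronze <- medals %>%
  arrange(desc(bronze)) %>%
  head(15)
```

Each of these tables can then be passed to ggplot2 in the same way as the athlete rankings.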
Now let’s take a look at the gender differences between male and female athletes (note: there are other genders, but the dataset is divided only into male and female athletes, which is why I analyzed it this way).
First, I had to convert the data to numeric, create a gender-difference count, and compute percentages:
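A sketch of this preparation step, assuming the cleaned gender table has one row per sport with columns discipline, female, male, and total imported as text. The column names and toy numbers are assumptions.

```r
library(dplyr)

# Toy stand-in for the cleaned gender table, with counts stored as text
gender <- tibble(
  discipline = c("Sport A", "Sport B", "Sport C"),
  female = c("40", "50", "90"),
  male = c("60", "50", "10"),
  total = c("100", "100", "100")
)

gender_summary <- gender %>%
  mutate(
    # Convert counts from character to numeric
    across(c(female, male, total), as.numeric),
    # Positive values mean more men than women in that sport
    diff_male_female = male - female,
    # Share of each gender within the sport, in percent
    pct_female = round(100 * female / total, 1),
    pct_male = round(100 * male / total, 1)
  )

gender_summary
```

Keeping the sign of the difference makes it easy to spot sports where women outnumber men (negative values).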
Which sports had the most and the fewest male athletes participating in Tokyo 2020?
Now, let’s check which sports had the most and the fewest female athletes participating:
Let’s examine the gender gap between sports, first in numbers of athletes, then in percentage:
And to finish, let’s take a look at the number of male athletes per female athlete in the top 15 sports with the biggest gender gaps:
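The ratio itself is a simple division of male by female counts per sport. The sketch below uses toy data with assumed column names (discipline, female, male); the row with zero female athletes is a hypothetical case added only to show why a guard against division by zero may be useful.

```r
library(dplyr)

# Toy stand-in for the gender table with numeric counts
gender <- tibble(
  discipline = c("Sport A", "Sport B", "Sport C", "Sport D"),
  female = c(30, 50, 20, 0),
  male = c(60, 50, 0, 10)
)

ratio <- gender %>%
  # Drop sports without female athletes to avoid dividing by zero
  filter(female > 0) %>%
  # Number of male athletes for each female athlete
  mutate(male_per_female = round(male / female, 2)) %>%
  # Largest ratios first, keep at most 15 sports
  arrange(desc(male_per_female)) %>%
  head(15)

ratio
```

A value of 2 means two men for every woman, while values below 1 indicate sports where women are the majority.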
Key Takeaways
- The USA was the most successful country at the Olympics, winning more gold, silver, bronze, and total medals than any other;
- China was the second-best team, placing second in gold, silver, bronze, and total medals won;
- About a quarter of all medals were won by the USA, China, and Russia;
- Athletics was the group of sports with the most athletes (both male and female), followed by Swimming and Football;
- Athletics is also the sport with the largest absolute difference between men and women athletes (103 more men);
- In relative terms, Wrestling is the most unequal sport, with more than 2 men per woman athlete at the Olympics;
- Only four sports had more women athletes than men athletes (Rhythmic Gymnastics and Artistic Swimming are sports where only female athletes compete at the Olympics);
- The top 5 most unequal sports in Tokyo 2020 were Wrestling, Cycling Road, Boxing, Equestrian, and Baseball/Softball. All of these had more than 1.5 men for each woman athlete competing.
Final Acknowledgements
Competing in a high-performance sport is no easy task. Athletes face tough challenges on a daily basis and have their results measured in one competition, one moment, one fraction of a second. Shout out to all athletes training and competing worldwide, and thank you for being an inspiration!
This is my second analysis using R and there are lots of improvements to be made. If you have suggestions or feedback, please reach out to me.