2018 Wasa Triathlon Results: Exploratory Data Analysis (EDA)

EDA using Python, Pandas and Matplotlib applied to the results of an Olympic-distance triathlon at Wasa Lake

Alejandro Coy
Alejandro-DataScience Journey
Nov 27, 2018


Hi, welcome to my first post about my new passion: Data Science! I come from a chemical engineering background and a lot has happened before my data science days…if you want to know more about me, click here!

Triathlon is my favorite hobby, and for the last two years I have spent most of my free time swimming, biking and running. Eventually, I became interested in coding and data science. So naturally, my first data science project had to be about a triathlon.

Project Description:

In its promotional campaign, the race is described as:

“Gerick Sports Wasa Lake Triathlon features elite prize money and attracts some of the fastest triathletes in the West. Age groupers and elites compete together on a scenic and fast course.”

Wasa Lake is located in the beautiful province of British Columbia, just west of the Rocky Mountains and 4 hours away from Calgary.

In this exploratory data analysis project, we will explore the 2018 results for the Olympic race distance (1.5 km swim, 40 km bike, 10 km run) and compare them with the previous five years. I retrieved the data from the official timing company’s website, and a copy of the file is available in the GitHub folder of the project.

The objective of this work? To practice exploratory data analysis using Python, NumPy, Pandas, and Matplotlib in Jupyter notebooks.

The 2018 race was my first time competing in this amazing location. I was disappointed with my swim, but the conditions were rough. The bike course was also challenging because of strong winds, which showed in my slower-than-expected bike time, and the run was just OK. My results didn’t meet my expectations, but overall it was a fantastic experience and I hope to repeat it in 2019.

Here are a few questions I had after the race that this project will help me answer:

1. Did the bad weather conditions affect everyone?

2. How do the results from this race compare with those of other years?

3. Did the weather conditions have a different effect on different demographics (age group categories)?

4. How can I compare my performance with the average timing of all the race participants?

5. What time would I need next year to land a podium spot?

Now we get to the fun part: this is where exploratory data analysis becomes a powerful tool. I will use univariate and bivariate visualization tools to answer all the questions above, working with Python, Pandas, and Matplotlib.

Gathering, Importing and Data Cleaning

Gathering the data for the project was straightforward, since all the results were available on Startline Timing’s web page. I arranged the data in a master Excel spreadsheet with an individual sheet for each year. I imported the data using Pandas’ built-in function and created a DataFrame for each year.
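As a rough sketch of that import step (not the exact notebook code), `pd.read_excel` with `sheet_name=None` returns one DataFrame per sheet. The function names and the assumption that each sheet is named after its year (for example "2018") are mine, for illustration only.

```python
import pandas as pd


def load_results(path):
    """Read every sheet of the workbook into a dict of DataFrames.

    Assumes one sheet per year, named after the year (e.g. "2018").
    """
    # sheet_name=None asks pandas for all sheets, keyed by sheet name
    return pd.read_excel(path, sheet_name=None)


def combine_years(frames):
    """Stack the per-year frames into one table with a 'Year' column."""
    tagged = [df.assign(Year=int(year)) for year, df in frames.items()]
    return pd.concat(tagged, ignore_index=True)
```

Keeping the year as a column in one combined table makes the later group-by comparisons across years easier than juggling six separate frames.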

Data cleaning is where the real process of data science begins. The tasks involved in the wrangling process were:

· All the columns in the different years needed to have the same names.

· I dropped athletes without a Place by Category. This meant that the athlete did not finish the race.

· For the project, I dropped the results of relay teams.

· All the categories across the years needed to be standardized in the format GenderInitialyear-Finalyear, e.g. F18–24.

· To analyze the finish position, the Division/Place column needed to be transformed from string to number.
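The wrangling steps above can be sketched roughly as follows. The raw column names (`Cat`, `Div/Place`), the `"3/25"` style of the place string, and the way relay teams are labelled are all assumptions for illustration; the real spreadsheet may differ.

```python
import pandas as pd


def clean_results(df):
    """Apply the cleaning steps to one year's results (hypothetical columns)."""
    # 1. Give the columns the same names in every year
    df = df.rename(columns={"Cat": "Category", "Div/Place": "Division/Place"})

    # 2. No place by category means the athlete did not finish
    df = df.dropna(subset=["Division/Place"])

    # 3. Drop relay teams (assumed to carry "relay" in the category label)
    is_relay = df["Category"].str.contains("relay", case=False, na=False)
    df = df[~is_relay].copy()

    # 4. Standardize categories: "f 18 - 24" -> "F18-24"
    df["Category"] = (
        df["Category"].str.upper().str.replace(r"\s+", "", regex=True)
    )

    # 5. Place as a number: "3/25" -> 3 (assumed "place/total" format)
    place = df["Division/Place"].astype(str).str.split("/").str[0]
    df["Place"] = pd.to_numeric(place, errors="coerce")
    return df
```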

After all the data is organized, visualization is the next step. As a new data scientist, I find the Matplotlib gallery a great source of inspiration, and for this project in particular I used ideas from Russel Cox’s training page, where he analyzed almost every Ironman race in the calendar.

Participation over the years

During the preliminary exploration of the data, the first observation that caught my attention was the difference in the number of athletes in the different years.

Athlete participation from 2013 to 2018

Over the last two years participation declined, reaching its lowest point in 2017 with 200 participants, a decrease of 35% with respect to 2016. Participation rose 14% in 2018, but it remains 23% lower than in 2016 and 32% lower than the highest participation of the last six years, which was 2013 with 339 athletes.

Participation has been dominated by men, with the gap steadily increasing over the last six years to reach a distribution of 61.4% men and 38.6% women in 2018.
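The year-over-year percentages above come down to simple arithmetic on the number of finishers per year. A minimal sketch, assuming a combined table with one row per athlete and a `Year` column (the function and column names are illustrative, not taken from the project):

```python
import pandas as pd


def participation_summary(results):
    """Athletes per year, with change vs. the previous year and vs. the peak."""
    counts = results.groupby("Year").size().rename("Athletes")
    summary = counts.to_frame()
    # Percent change relative to the previous year in the table
    summary["Change vs previous year (%)"] = counts.pct_change().mul(100).round(1)
    # Percent difference relative to the best-attended year
    summary["Change vs peak (%)"] = ((counts / counts.max() - 1) * 100).round(1)
    return summary
```

The same pattern with `groupby(["Year", "Gender"])` would give the men/women split, provided the data has a gender column.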

To study the decrease in more detail, I plotted the participants per category for the last six years to see whether the decrease could be attributed to a particular category.

As shown, the big decrease in 2017 is shared among all the categories, with M40–44 having the biggest drop. The 2018 recovery among the men was due to significant increases in the M30–34 and M50–54 categories.

On the women's side, the biggest drops in participation were in F35–39 and F40–44; however, these categories still had the most participants in 2017. In 2018 the F30–34 category had the most participants, while F35–39 and F40–44 continued decreasing.

In conclusion, the decrease in participants occurred because the categories that historically had the most participants shrank by over 40% from 2016 to 2017, for both men and women.
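The per-category breakdown can be sketched with a cross-tabulation of category against year. This is only an illustration under the assumption of a cleaned table with `Category` and `Year` columns; the helper names are hypothetical.

```python
import pandas as pd


def participants_by_category(results):
    """Count athletes for every category (rows) and year (columns)."""
    return pd.crosstab(results["Category"], results["Year"])


def pct_change_between(table, start, end):
    """Percent change in participants per category between two years."""
    return ((table[end] - table[start]) / table[start] * 100).round(1)
```

Calling `.plot(kind="bar")` on the resulting table is one quick way to get a grouped bar chart per category.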

2018 Winners by Category:

A summary of the 2018 winning times for each category:

2018 Times Comparison

To answer my first two questions, I used a histogram for each discipline to show the distribution of times across all the participants and compare it with the historical distribution for the previous five years.

This year the overall times were slower than in the previous five years. The main contributors to the increase were the swim and bike times, which can be explained by the choppy water and the high winds on the bike course. In contrast, the average run time was similar to past years. However, the run histogram shows a narrower distribution, which suggests a more even field than in the previous five years, where the data is skewed to the right. In conclusion, slower times were associated with the bad conditions. With that, the first two questions are answered.

To answer my third question, we need to analyze the finishing times for each category in more detail. I compared the times for the first 10 positions per category over the past six years. Where data is missing for a specific year, the category had fewer than 10 participants, or only one athlete or none at all, in that year.

In the case of the women, 2018 times are slower than in past years. The exceptions are the F25–29 and F50–54 categories, where the winner and second place were faster, but from third position onward the times increased significantly in comparison with previous years.

Similarly, for the men, 2018 times per position are slower than in previous years. The only category with a faster finishing time was M50–54. This confirms that the bad conditions affected all the participants regardless of category, which answers my third question.

One interesting insight from this plot is that it shows which categories were more competitive. In this case, the flat line in the M40–44 category suggests that the times between second and sixth place were close, and indeed a closer look at the data shows that the difference among these positions was less than a minute.
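The histogram comparison described above could be sketched like this. It assumes the times are stored as "H:MM:SS" strings and uses shared bin edges so the 2018 and historical distributions are directly comparable; the function names and the 5-minute bin width are my own choices for illustration.

```python
import numpy as np
import pandas as pd


def to_minutes(times):
    """Convert a Series of "H:MM:SS" strings to minutes as floats."""
    return pd.to_timedelta(times).dt.total_seconds() / 60


def compare_distributions(current, historical, bin_width=5):
    """Normalized histograms of two sets of times on the same bins."""
    both = pd.concat([current, historical])
    edges = np.arange(np.floor(both.min()), both.max() + bin_width, bin_width)
    # density=True makes fields of different sizes comparable
    cur, _ = np.histogram(current, bins=edges, density=True)
    hist, _ = np.histogram(historical, bins=edges, density=True)
    return edges, cur, hist
```

Passing the same `edges` as the `bins` argument of Matplotlib's `hist` for both data sets draws the two distributions on one axis.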
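A possible way to build the table behind the position plots, and to check how tight a category was, is sketched below. It assumes a cleaned table with `Category`, `Year`, a numeric `Place`, and the finish time in a `Minutes` column; these names are assumptions, not the project's actual ones.

```python
import pandas as pd


def top_positions(results, n=10):
    """Finish time by place (rows) for each category and year (columns)."""
    top = results[results["Place"] <= n]
    # Category/year combinations with fewer finishers show up as NaN
    return top.pivot_table(
        index="Place", columns=["Category", "Year"], values="Minutes"
    )


def spread(times_by_place, first, last):
    """Time gap in minutes between two finishing places in one column."""
    return times_by_place.loc[last] - times_by_place.loc[first]
```

For example, `spread(table[("M40-44", 2018)], 2, 6)` would give the gap between second and sixth place in that category.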

2018 Average Times Per Category

Next, I compared the average time for each discipline per category for men and women. This gives an idea of the strength of the field in each category and shows me where my performance ranks against the average athlete in the race.

In 2018, the strength of the men's field was very even across all the categories. The three participants in M18–24 were very strong swimmers, the M60–64 athletes were on average the strongest cyclists in the field, and M25–29 had the fastest runners.

On the women's side, the only participant in the F18–24 category was faster than the average time of the rest of the categories. Another interesting data point is the F60–64 average time: this category had just one athlete, who was faster than the average time of most categories. Way to go!
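The per-category averages, and the comparison of one athlete against them, can be sketched as follows. The `Swim`, `Bike` and `Run` columns (in minutes) and the function names are assumptions for illustration.

```python
import pandas as pd


def average_by_category(results, disciplines=("Swim", "Bike", "Run")):
    """Mean time per discipline for every category."""
    return results.groupby("Category")[list(disciplines)].mean()


def gap_to_average(averages, category, my_times):
    """My time minus the category average; positive means slower."""
    return pd.Series(my_times) - averages.loc[category]
```

Keep in mind that a category with a single athlete, as with F18–24 and F60–64 here, has an "average" that is really just one result.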

Podium Time

Finally, my last question: what would it take to finish on the podium next year?

In my case, for M30–34, the average time of 2:19:31 seems fast for my current fitness, but in racing you never know. In this category especially, there is a big time gap between this year and the previous five years.

An interesting insight from the plot is how similar the average times are for men between 30 and 49 years old, while the fastest women are between 35 and 44.
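One way to estimate a podium target is to average the finish times of the top three athletes in a category for each year. That definition of "podium time" is my assumption here, as are the column names (`Category`, `Year`, `Place`, `Minutes`) and the helper functions.

```python
import pandas as pd


def podium_average(results, category, places=3):
    """Mean finish time (minutes) of the top places, per year."""
    on_podium = (results["Category"] == category) & (results["Place"] <= places)
    return results[on_podium].groupby("Year")["Minutes"].mean()


def format_minutes(minutes):
    """Render minutes as H:MM:SS for easier reading."""
    total = int(round(minutes * 60))
    hours, rest = divmod(total, 3600)
    mins, secs = divmod(rest, 60)
    return f"{hours}:{mins:02d}:{secs:02d}"
```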

Club Competition

Not much analysis here, just to mention that the club competition was won by Team Trilife with a total time of 13:03:54. Great result for a great team! Very proud to share the experience with everybody.

Overall club winner: Team Trilife

With that, all my questions were answered using the EDA methodology. The value of the analysis is that it gives quick insights, guides you toward associations between variables, and even lets you start thinking about predictions, which I consider a powerful tool.

Thanks for reading.

  • All the code of the project can be found here
  • If you have any questions or are interested in any other analysis, leave a comment or send me an email: acoydata@gmail.com
