Ever since I learned about the art of web scraping, I had looked forward to getting my hands dirty. It was not as easy as I thought, although looking at HTML code brought me back to the days when I spent hours personalizing my Blogspot page. This time around, I took the time to learn the basics of HTML and used my new knowledge to scrape the CrossFit Open leaderboard.
Every year, the finest athletes compete in the CrossFit Games to win the title of “Fittest on Earth”. To kickstart this competition, the CrossFit community (or cult) participates in the CrossFit Open, where the fittest are selected. CrossFit is typically perceived as a sport of the young because of the intensity of the workouts and the heavy lifting involved. Despite that, its emphasis on functional fitness and community has attracted people of all ages, from kids to the elderly. As a CrossFitter who is frequently gasping for air at the end of a workout alongside an older buddy, I was interested to find out whether my situation was the norm.
A quick check of the top 50 athletes showed that their ages ranged from 26 to 30, with the average rank decreasing as age gets closer to 30 (the lower the rank, the better the standing on the leaderboard). This certainly did not represent the community well, which consists of athletes from 16 to 54 years old on the leaderboard.
To find out how I will fare in the coming years, I scraped the statistics of 30,000 athletes from the 2019 CrossFit Open leaderboard. I fitted a linear regression model to the data to determine whether a younger athlete performs better at CrossFit events, using the ranks from previous events as features.
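For readers new to scraping, here is a minimal sketch of the general idea using only Python's standard library. The markup in SAMPLE_HTML, the class names (name, age, rank) and the LeaderboardParser class are all invented for illustration; the real leaderboard page is structured differently and this is not the code used for the project.

```python
from html.parser import HTMLParser

# Hypothetical markup: an invented table, NOT the real leaderboard page.
SAMPLE_HTML = """
<table>
  <tr><td class="name">Athlete A</td><td class="age">27</td><td class="rank">3</td></tr>
  <tr><td class="name">Athlete B</td><td class="age">41</td><td class="rank">1520</td></tr>
</table>
"""


class LeaderboardParser(HTMLParser):
    """Collect one dict per <tr>, keyed by the class attribute of each <td>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row currently being filled
        self._field = None  # class of the <td> we are inside, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {}
        elif tag == "td" and self._row is not None:
            self._field = dict(attrs).get("class")

    def handle_data(self, data):
        # Only keep text that sits inside a labelled <td>.
        if self._field and data.strip():
            self._row[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None


parser = LeaderboardParser()
parser.feed(SAMPLE_HTML)

# Convert the numeric columns from text to integers.
athletes = [
    {"name": r["name"], "age": int(r["age"]), "rank": int(r["rank"])}
    for r in parser.rows
]
print(athletes)
```

In practice the page would be downloaded first and the parsing rules adapted to whatever tags the site actually uses.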
Preliminary results showed that the features were right-skewed and explained only 8% of the variance in age (the R-squared value). I log-transformed these features and added more features, such as nationality, to the dataset to improve the model.
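To make the R-squared comparison concrete, here is a small sketch that fits a one-feature least-squares line before and after a log transform. It is a simplification: the project used several features, while this uses one, and the numbers are synthetic stand-ins generated on the spot, not the scraped data. The helper names fit_line and r_squared are my own for this example.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y):
    """Share of the variance in y explained by the fitted line."""
    intercept, slope = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Synthetic stand-in data, NOT the scraped leaderboard.
random.seed(0)
ranks = [random.lognormvariate(8, 1) for _ in range(500)]  # right-skewed feature
ages = [random.uniform(16, 54) for _ in ranks]             # target, unrelated to rank here

r2_raw = r_squared(ranks, ages)
r2_log = r_squared([math.log(r) for r in ranks], ages)
print(f"R-squared, raw ranks: {r2_raw:.3f}")
print(f"R-squared, log ranks: {r2_log:.3f}")
```

A log transform pulls in the long right tail of a skewed feature, but it cannot create a relationship that is not in the data, which is consistent with a small improvement.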
Unfortunately, explainability only improved by 0.4%. I then tested the assumptions of linear regression to determine whether it was even the right model for this dataset. The content below gets slightly technical; skip to the last paragraph for the conclusion.
Assumption #1 Regression is linear in parameters and correctly specified
The points clearly did not follow a straight line. They were scattered throughout the plot, with the majority of them on the right (see figure 1).
Assumption #1: X
Assumption #2 Residuals should be normally distributed with zero mean
Many of the residual points deviated from the straight line on the Q-Q plot.
Assumption #2: X
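A Q-Q plot pairs each sorted residual with the value a normal distribution would put at the same position. The sketch below computes those pairs with the standard library (statistics.NormalDist, Python 3.8+) and reports the largest gap from the ideal line. The function names and the two synthetic samples are assumptions for illustration, not the project's residuals.

```python
import random
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Pair each standardized residual with its theoretical normal quantile."""
    n = len(residuals)
    mu, sigma = mean(residuals), stdev(residuals)
    observed = sorted((r - mu) / sigma for r in residuals)
    # Plotting positions (i + 0.5) / n avoid the infinite quantiles at 0 and 1.
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, observed))


def max_qq_gap(residuals):
    """Largest vertical distance between the Q-Q points and the line y = x."""
    return max(abs(obs - theo) for theo, obs in qq_points(residuals))


# Synthetic examples, NOT the project's residuals.
random.seed(1)
normal_like = [random.gauss(0, 1) for _ in range(1000)]
skewed = [random.expovariate(1.0) for _ in range(1000)]  # right-skewed

print(f"gap for normal-like residuals: {max_qq_gap(normal_like):.2f}")
print(f"gap for skewed residuals:      {max_qq_gap(skewed):.2f}")
```

Residuals that hug the line give a small gap; skewed residuals peel away at the tails, which is the kind of deviation described above.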
Assumption #3 Error terms must have constant variance
Across all five events, the residual plots showed a similar pattern: the points were clustered in one corner instead of being spread evenly.
Assumption #3: X
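One rough numeric companion to the residual plot is to sort the residuals by fitted value and compare their variance in the lower and upper halves. This is a simplified check in the spirit of a Goldfeld-Quandt comparison, not a formal test, and variance_ratio plus the synthetic inputs are assumptions made for this example.

```python
import random
from statistics import pvariance


def variance_ratio(fitted, residuals):
    """Residual variance in the upper half of fitted values over the lower half."""
    ordered = [res for _, res in sorted(zip(fitted, residuals))]
    half = len(ordered) // 2
    lower, upper = ordered[:half], ordered[-half:]
    return pvariance(upper) / pvariance(lower)


# Synthetic examples, NOT the project's residuals.
random.seed(2)
fitted = [float(i) for i in range(1, 201)]
even = [random.gauss(0, 1) for _ in fitted]               # constant spread
fanning = [random.gauss(0, 1) * f / 100 for f in fitted]  # spread grows with fitted value

print(f"ratio with constant variance: {variance_ratio(fitted, even):.2f}")
print(f"ratio with growing variance:  {variance_ratio(fitted, fanning):.2f}")
```

A ratio near 1 is what constant variance looks like; a ratio far from 1 means the spread of the errors depends on the fitted value.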
Assumption #4 Errors are uncorrelated across observations
Residuals followed a clear pattern.
Assumption #4: X
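A common summary of correlation between neighbouring residuals is the Durbin-Watson statistic: the sum of squared differences between consecutive residuals divided by the sum of squared residuals. Values near 2 suggest no first-order correlation, values toward 0 suggest positive correlation, and values toward 4 suggest negative correlation. The sketch below assumes the residuals are already in a meaningful order (for example, sorted by rank), which is my assumption rather than something the project specifies.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals taken in their given order."""
    diffs = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    total = sum(r * r for r in residuals)
    return diffs / total


# Toy residuals, NOT the project's data.
trending = [-2, -1, 0, 1, 2]   # neighbours move together
alternating = [1, -1, 1, -1]   # neighbours flip sign every step

print(durbin_watson(trending))     # 0.4
print(durbin_watson(alternating))  # 3.0
```

Residuals that follow a clear pattern, like the trending list here, land well below 2.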
Testing the assumptions of linear regression is essential for explaining the expected value of the target and for producing accurate predictions from the model. The linear model is clearly not a good fit for this dataset. Factors that were not recorded, such as years of training and athletes’ body fat percentage, could also alter the relationships in the dataset. As a true fan of the sport, I would be keen to test this dataset on other models after acquiring more knowledge and skills.
At the end of the project, I generated a hex map to help myself better understand the relationships in the dataset and to be certain that the linear model isn’t fitting the data. If nothing else, at least it is safe to say that CrossFit is a sport for all ages!
Timely enough, a new season of the CrossFit Games begins in two days!
(All code and graphs for this post can be found on my GitHub)