A Brief Look at Le Tour de France Through the Lens of Data
The history of the Tour de France is an amazingly complex and dramatic story with more than its fair share of heroes and villains. Originally conceived as a means for a failing French cycling magazine to compete against its local rivals, the tour has grown into one of the most watched sporting events in the world. In the following post I wish to give the reader a sense of how enormous the task of analysing the tour really is.
The dataset I will be using shows several attributes of each tour from 1903 to 2014. The dataset uses the following fields:
- Number of Stages
- Number of Starting Cyclists
- Number of Cyclists Who Finished
- Distance Cycled
- Time of the Winner (in hours)
- Average Speed of the Winner (km/h)
- Name of the Winner
- Country of the Winner
The data has three groups of years with missing values. The first and second groups are a result of the tour being cancelled during the First and Second World Wars respectively. The third group is a result of Lance Armstrong’s admission of guilt to doping. Thanks, Lance!
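The gaps described above can also be located programmatically. Here is a minimal sketch, assuming the data has been loaded into two parallel lists: one of years and one of values, with None marking a missing entry. The function name missing_year_groups and the sample values are my own illustration, not part of the dataset.

```python
def missing_year_groups(years, values):
    """Return (first_year, last_year) for each run of consecutive missing years."""
    groups = []
    start = prev = None
    # Sort by year only, so the values themselves are never compared.
    for year, value in sorted(zip(years, values), key=lambda pair: pair[0]):
        if value is None:
            if start is None:
                start = year            # a new run of missing years begins
            elif year != prev + 1:
                groups.append((start, prev))  # a jump in years ends the old run
                start = year
            prev = year
        elif start is not None:
            groups.append((start, prev))      # a recorded year ends the run
            start = None
    if start is not None:
        groups.append((start, prev))
    return groups


# Illustrative input only: "recorded" is a placeholder, not a real field value.
sample_years = list(range(1913, 1921))
sample_values = ["recorded", "recorded", None, None, None, None, "recorded", "recorded"]
print(missing_year_groups(sample_years, sample_values))
```

Running the same function over each column in turn would show whether every field is missing for the same years or only some of them.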
How do these variables vary over time?
Firstly, I wanted to see how the tour has varied over the course of its history. To show this I have plotted some of the more interesting variables against the year of the tour.
Okay, first things first. The tour used to be nearly 6000 km long! What the actual hell!? After only two years they decided to double the distance! Thankfully, the organisers have been slowly reducing this distance for nearly a hundred years, choosing instead to set ever more elaborate and technical routes.
So not only were the pioneering Tour de France cyclists cycling over 2000 km more than today’s competitors, they were also doing it in much longer stages. The number of stages has stayed in the low 20s for the majority of the tour’s history. Perhaps this represents a naturally found optimum between stage difficulty and traversing France?
As you can see, there is no intuitive relationship between the year and the age of the winner, and why would there be? As long as you’re in the right age range of around 18 to 36, you just need to be in the right form to win. One thing that I did notice was that from around 1940 there are small periods of the age slowly increasing and then dropping again. Initially puzzled, I came up with a theory that might explain this. I believe that the periods of increasing age are years where a generation of tour competitors competing together are winning the tour. Each generation of competitors will have a select number of cyclists capable of winning the tour. These cyclists have a number of years where they can perform at their peak fitness before the next generation of cyclists starts to take over. I believe that the periods of increasing age represent these select few cyclists ageing, and the point where a period stops is where the next generation starts to outperform the older one. Of course this is just a theory. Feel free to offer an alternative explanation.
How does the distance affect the winning time?
I wanted to see how the distance of the tour has affected the winning time. Firstly, I plotted the distance against the time and fitted a polynomial to visualise the relationship.
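The strength of this relationship can also be put into a single number with a Pearson correlation coefficient. The sketch below is only an illustration: the function name pearson_r is my own, and the sample hours and distances are invented placeholders rather than real tour figures.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Invented placeholder values, NOT real tour data.
sample_hours = [100, 120, 140, 160, 180, 200]
sample_distances = [2500, 3000, 3400, 4000, 4300, 4900]
print(pearson_r(sample_hours, sample_distances))
```

A value close to 1 indicates a strong positive linear relationship. It says nothing about curvature, which is why a fitted curve is still worth plotting.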
As you hopefully predicted, there is a strong positive correlation between the distance cycled and the number of hours it took to cycle. You’ll notice it curves slightly, which suggests that the time taken grows faster than linearly as the distance increases (a polynomial curve like this is not truly exponential). To better see how the winning times have altered over the years, I decided to try to remove the influence of distance from the data. Below is a plot of the residuals of the above chart.
Explained briefly, if you were to use the polynomial line in the previous chart to predict the distance cycled from the winning time, this chart shows how far each prediction was from the true distance and whether it overshot or undershot. I have changed the x axis to the year of the tour instead of the winning time because I felt it was far more interesting. The red dashed line represents the point of zero error.
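For readers who want to reproduce the residuals, here is a rough sketch of the idea in plain Python. It is not the code behind the chart. I am assuming a degree-2 polynomial, that distance is predicted from winning time, and that a residual is the actual distance minus the predicted distance. The function names and sample numbers are invented for illustration. In practice a library routine such as numpy.polyfit would do the fitting step in one call.

```python
def fit_polynomial(xs, ys, degree=2):
    """Least-squares polynomial fit; returns a function x -> predicted y."""
    # Rescale x into [-1, 1] so the normal equations stay well conditioned.
    mean = sum(xs) / len(xs)
    scale = max(abs(x - mean) for x in xs) or 1.0
    zs = [(x - mean) / scale for x in xs]
    n = degree + 1
    # Augmented normal equations: (A^T A | A^T y) with A[i][j] = zs[i] ** j.
    rows = [[sum(z ** (r + c) for z in zs) for c in range(n)]
            + [sum(y * z ** r for z, y in zip(zs, ys))]
            for r in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for k in range(col, n + 1):
                rows[r][k] -= factor * rows[col][k]
    # Back substitution gives the coefficients, lowest power first.
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(rows[r][k] * coeffs[k] for k in range(r + 1, n))
        coeffs[r] = (rows[r][n] - tail) / rows[r][r]

    def predict(x):
        z = (x - mean) / scale
        return sum(c * z ** power for power, c in enumerate(coeffs))

    return predict


def residuals_by_year(years, hours, distances, degree=2):
    """Map each year to (actual distance - distance predicted from winning time)."""
    predict = fit_polynomial(hours, distances, degree)
    return {year: dist - predict(h)
            for year, h, dist in zip(years, hours, distances)}


# Invented placeholder values, NOT real tour data.
sample_years = [1, 2, 3, 4, 5, 6]  # placeholder labels, not real tour years
sample_hours = [100, 120, 140, 160, 180, 200]
sample_distances = [2500, 3000, 3400, 4000, 4300, 4900]
sample_residuals = residuals_by_year(sample_years, sample_hours, sample_distances)
```

With this sign convention a positive residual means more distance was covered than the winning time alone would predict, that is, a faster-than-expected tour.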
From this chart you can see a very interesting pattern. A kind of wave has formed, peaking in the 1950s and the 1990s and dipping in the 1930s and the 1970s/1980s. The intuitive way to read this chart is that the peaks are where the winners are finishing quicker than would be expected considering the distance, and the troughs are where the winners are taking longer than you would expect. Normally, seeing a pattern like this in the residuals would indicate that the model is generalising too much (underfitting), but I don’t believe that this is happening, as our polynomial follows the trend well enough. Instead I believe that variables other than distance are influencing the winning time. This is not the first time this pattern has been observed, and many people interpret it as supporting evidence of doping in the tour. I think it is important to note that this theory has some serious flaws. For example, it would seem to suggest that serious doping only took place in two periods in the history of the tour and that there were periods where doping was significantly reduced. Others have followed this line of analysis further and shown that these peaks and troughs do not align with those formed from other major tours such as the Giro d’Italia. If you wanted to take these graphs as evidence of doping you would have to believe that cyclists were doping for some tours and clean for others in the same year. That seems very unlikely.
From this you will hopefully see that there is definitely a line of investigation to be followed that could lead to interesting conclusions if given more data.
Which Countries are Winning the Tour?
Simply counting the number of victories each country has had yields some pretty interesting results.
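The counting itself takes only a few lines. The sketch below assumes each tour is stored as a dictionary with "country" and "winner" keys. Those key names, the function name count_wins and the sample records are my own placeholders, not the dataset’s actual column names or contents.

```python
from collections import Counter


def count_wins(records, field="country"):
    """Count how often each value of `field` appears, skipping missing entries."""
    values = (record.get(field) for record in records)
    return Counter(value for value in values if value is not None)


# Invented placeholder records, NOT real tour results.
sample_records = [
    {"winner": "Rider A", "country": "Country X"},
    {"winner": "Rider B", "country": "Country Y"},
    {"winner": "Rider A", "country": "Country X"},
    {"winner": None, "country": None},  # e.g. a year with no recorded winner
    {"winner": "Rider C", "country": "Country X"},
]
wins_by_country = count_wins(sample_records, "country")
wins_by_rider = count_wins(sample_records, "winner")
print(wins_by_country.most_common())
```

The same helper works for counting wins per rider by switching the field, which is why it takes the field name as a parameter.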
Clearly the French own this tour, both literally and figuratively. They have twice as many victories as the next most successful country in the tour’s history. However, I don’t think this is the whole picture. We must remember that the race was made up almost exclusively of French and Belgian cyclists for much of the early years of the tour. One interesting route of analysis would be to weight each country’s victories by the percentage of the field that its cyclists represented in that year. With the limited dataset I had, I resigned myself to showing the winning country against distance and time.
At first glance the chart would appear to be a mess of colour, and you would be right. The distance doesn’t seem to have much of an effect on which country won. Although a large number of wins were had by Belgium when the tour was running at its longest distance, this observation is made less significant when you see that Belgium has won several other races at varying distances. Despite the mess, you can also see that the diversity of the winners increases with time as more countries begin to enter teams.
One observation that I believe to be quite significant is that while the French hold the record for the most wins overall, they have not won the tour since 1985! Below you can see the last French win denoted by a white dashed line.
This long period without a winner has not gone unnoticed by the French. Every year their wishes for a French victory are left unanswered. For the French the tour has become a challenge to reclaim an event that has exemplified national pride for decades.
Next I wanted to compare the number of cyclists that entered each year with the number of cyclists that finished the tour over time.
In this plot the green line shows the number of cyclists starting the tour and the blue line shows the number of cyclists who completed the tour. Here we can see that there is a very large difference for the first 30 years of the tour. This aligns quite well with the period where the tour was over 5000 km. The survival rate appears to get steadily better until a sharp drop at the start of the millennium.
Let’s see how the survival rate has actually changed over the years.
As you can see, the tour used to have an incredibly low survival rate of around 20%. There has been a healthy increase in the share of riders who complete the tour as time has gone on. In fact, nearly everyone who enters now completes the tour. This is likely the result of a combination of rule changes and better-educated cyclists and coaches. For example, back in the day cyclists would pass around cigarettes before big climbs under the impression that it would open up their lungs. They were also only allowed to ride one bike and had to be their own mechanic. No wonder hardly any of them made it to the end!
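The survival rate used here is simply the number of finishers as a percentage of the number of starters. A minimal sketch, with a function name of my own choosing and None standing in for years with missing data:

```python
def survival_rate(starters, finishers):
    """Percentage of starting cyclists who finished, or None if it cannot be computed."""
    if starters is None or finishers is None or starters == 0:
        return None  # missing year or no starters recorded
    return 100.0 * finishers / starters


# Invented placeholder numbers, NOT real tour data.
print(survival_rate(60, 12))    # 20.0
print(survival_rate(None, 12))  # None
```

Returning None for missing years keeps the gaps visible in a plot instead of silently drawing them as zero.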
Who has won the most?
Let’s see how many times each winner has actually won.
Jacques Anquetil, Bernard Hinault, Eddy Merckx and Miguel Indurain all share the top number of wins standing at 5. This seems to be a natural maximum that any single cyclist can muster before retiring from the tour. I am confident that this number will continually be challenged by future cyclists as the training techniques and knowledge improve.
In an effort to compare these great men I have charted their wins in time against the survival rate of their respective tours. Perhaps there is something to say about the difficulty that each winner overcame.
Here we can see that Jacques Anquetil definitely overcame tours with the lowest survival rate, while the rest of the five-time winners had relatively higher survival rates. Interestingly, the survival rates of each cyclist’s wins seem to span a range of around 15 to 20 percentage points.
Having shown you just a glimpse of what can be revealed with even the most limited dataset, you can begin to see what interesting insights could be found if we had a more exhaustive one. Most of the data that we need is out on the internet somewhere already. In fact, each and every Tour de France that has ever happened has its own Wikipedia page. Clearly, the Tour de France is a data scientist’s dream project, as long as you can handle the large amount of data collection and preprocessing required!
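The spread quoted above is the gap between the highest and lowest survival rate across one rider’s winning tours. As a rough sketch, assuming the wins are available as (winner, survival rate) pairs; the function name and the sample pairs are invented placeholders:

```python
from collections import defaultdict


def survival_spread(wins):
    """Map each winner to (max - min) survival rate across their winning tours."""
    rates = defaultdict(list)
    for winner, rate in wins:
        rates[winner].append(rate)
    return {winner: max(values) - min(values) for winner, values in rates.items()}


# Invented placeholder pairs, NOT real tour data.
sample_wins = [("Rider A", 55.0), ("Rider A", 70.0), ("Rider A", 62.0), ("Rider B", 60.0)]
print(survival_spread(sample_wins))
```

The result is in percentage points, so a spread of 15.0 means the rider’s easiest and hardest winning tours differed by 15 points of survival rate.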