Predicting National Champions: The Physical Inputs to Determine a Marathon Winner

Adam Parish
Writing 340
Mar 6, 2024

Athletes aim to build training plans that maximize their output for the time and effort invested. This is especially true for runners, whose training is demanding and causes exceptional wear and tear on their bodies. The objective of this paper is to create models that predict race outcomes from a set of training and prior racing variables. Two sets of models are evaluated. The first set evaluates the effect of different inputs on marathon time, expressed as seconds over two hours. The two-hour barrier has historically been regarded as unbreakable, and the pursuit of it is the goal marathon runners chase. The second set predicts how the likelihood of becoming an American national champion changes with a given set of characteristics. Because the outcome in these models is binary, champion or not, the goal is to predict how the odds of winning change with the inputs rather than how the inputs change a continuous outcome. The paper first discusses the variables that enter the models and the athletes included in the study, then examines the construction and results of each model.

The variables included in both sets of models are considered strong indicators of future running success and are included to determine the extent to which they influence race outcomes. The time variable is expressed in seconds over two hours, so a negative coefficient is favorable: it is expected to decrease the time and thus result in a faster race. Two of the most important variables are miles and elev, which measure weekly mileage and the elevation of the athlete’s training city. The remaining variables are collected for each athlete and represent their total for each input at the time of their marathon race. The variables colchamp and hschamp are discrete, measuring the number of college and high school national championships won during the athlete’s collegiate and prep careers. The variable amnatchamp is binary, with a value of 1 if the athlete won the American marathon national championship that year and 0 if not. Similarly, olyqual and wcqual indicate whether the athlete qualified for the Olympics or the World Championships that year, with 1 for yes and 0 for no. The variables prevolyqual and prevwcqual are discrete counts of the athlete’s previous qualifications for the Olympics and World Championships respectively. Finally, the variables named after people, such as edeyestone and jackmull, are indicators for the athletes’ coaches, with 1 if that coach trained the athlete and 0 if not. In the discussion of each model’s results, the coefficients, listed in the “Coef.” column, are the values of interest and are used to draw conclusions.
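To make the variable definitions concrete, the following is a minimal sketch of one athlete's record and of the conversion from a finish time to the time variable. The field names follow the variables described above; the class name, the helper function, and the example values are hypothetical and are not taken from the study's data. Coach indicators are left out for brevity.

```python
from dataclasses import dataclass

TWO_HOURS_S = 2 * 60 * 60  # 7,200 seconds


def seconds_over_two_hours(hours: int, minutes: int, seconds: int) -> int:
    """Convert a marathon finish time into seconds past 2:00:00."""
    return hours * 3600 + minutes * 60 + seconds - TWO_HOURS_S


@dataclass
class AthleteRecord:
    time: int         # seconds over two hours
    miles: float      # weekly mileage
    elev: float       # elevation of training city, in feet
    colchamp: int     # college national championships won
    hschamp: int      # high school national championships won
    amnatchamp: int   # 1 if American marathon champion that year, else 0
    olyqual: int      # 1 if qualified for the Olympics that year, else 0
    wcqual: int       # 1 if qualified for the World Championships that year, else 0
    prevolyqual: int  # number of previous Olympic qualifications
    prevwcqual: int   # number of previous World Championships qualifications


# Hypothetical athlete: every value below is invented for illustration.
example = AthleteRecord(
    time=seconds_over_two_hours(2, 9, 30),
    miles=120, elev=7000, colchamp=1, hschamp=0,
    amnatchamp=0, olyqual=0, wcqual=1, prevolyqual=0, prevwcqual=1,
)
print(example.time)  # 570
```

A 2:09:30 marathon is 9 minutes and 30 seconds past the two-hour mark, so it is recorded as 570.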

Model 1: Full Linear Regression

The first model evaluates the effect on time when every variable is included. A linear regression is constructed with time as the dependent variable. As mentioned, negative coefficients indicate athlete success, since they decrease the expected time above two hours.
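One way to write the full model is sketched here, assuming every variable described earlier enters as a regressor and the coach indicators enter as a set of dummy variables. The symbols (the betas, the gammas, and the index c over coaches) are notation introduced for this sketch; the exact specification used in the study may differ.

```latex
\[
\begin{aligned}
\text{time}_i ={} & \beta_0 + \beta_1\,\text{miles}_i + \beta_2\,\text{elev}_i
  + \beta_3\,\text{colchamp}_i + \beta_4\,\text{hschamp}_i
  + \beta_5\,\text{amnatchamp}_i \\
 & + \beta_6\,\text{olyqual}_i + \beta_7\,\text{wcqual}_i
  + \beta_8\,\text{prevolyqual}_i + \beta_9\,\text{prevwcqual}_i
  + \sum_{c} \gamma_c\,\text{coach}_{c,i} + \varepsilon_i
\end{aligned}
\]
```

Each coefficient is read as the change in seconds over two hours for a one-unit increase in that input, holding the other inputs fixed.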

The largest coefficients are associated with championships qualified for or won. The American national championship is tied to the Olympic Trials every four years, making those races the most competitive national championships. It therefore makes sense that olyqual represents the largest drop in time, at more than 6 minutes. For professional marathon runners, any time drop of more than a minute is extremely impactful. Additionally, wcqual is associated with a race 24 seconds faster, and each collegiate championship won with a drop of more than half a second. The final negative variable of note is elev, which indicates that for every additional foot of elevation above sea level at which an athlete trains, their time drops by 0.009 seconds. An athlete training at 7,000 ft, for example in Flagstaff, is therefore predicted to run 63 seconds faster than an athlete training at sea level. All other variables have positive coefficients and are thus associated with slower times. None of the coaches is associated with faster times, which makes sense because most training plans and runs are publicly available. Finally, the coefficient on miles is positive, which disagrees with most of the literature on professional runners. Because the lowest weekly mileage in the data was 75 miles per week, it is likely that athletes in the lower tier were overtrained in an attempt to improve, driving down the apparent importance of weekly miles. Due to the overall insignificance of many of the variables, a reduced model is proposed to eliminate noise in the determination of time. The reduced model focuses on four variables that are expected to be very significant in determining marathon time.
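The elevation arithmetic can be checked with a short sketch. The two coefficients are the ones quoted in the text; the function name and constant names are hypothetical, and the calculation assumes all other inputs are held fixed.

```python
# Model 1 coefficients quoted in the text, in seconds per unit of input.
ELEV_COEF = -0.009    # per foot of training elevation
WCQUAL_COEF = -24.0   # World Championships qualifier (1 = yes)


def predicted_change(coef: float, value: float) -> float:
    """Predicted change in seconds over two hours, other inputs held fixed."""
    return coef * value


flagstaff_change = predicted_change(ELEV_COEF, 7000)
print(f"7,000 ft of elevation: {flagstaff_change:.0f} s")
print(f"World Championships qualifier: {predicted_change(WCQUAL_COEF, 1):.0f} s")
```

The first line reproduces the 63-second advantage attributed to training at 7,000 ft.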

Model 2: Reduced Linear Regression

The second model is another linear regression, with a reduced set of covariates. In the first model, the final row of the output indicates the significance of each input for the dependent variable. Each additional star indicates a higher level of significance: leetroop is significant at the 1% level, while the constant is significant only at the 10% level. As only three variables are significant, the first model does not predict the output well. The values of its coefficients are worth evaluating, but the model should not be used to draw important conclusions. A reduced model containing some of the same variables improves the outcome.

High school and college championships again predict a fall in time, with high school championships again being more important. This indicates that high-level high school runners who transition into the professional ranks display better performance. Runners who can continuously build their mileage base and push their mental boundaries from a younger age are set up to succeed as professionals. In the second model, the coefficient on miles is also negative, which is intuitive and matches professional coaches’ training plans. Each additional weekly mile is associated with a race almost 2 seconds faster, so an athlete running 140 miles per week, as some do, would see a predicted drop of 247 seconds, which is more than 4 minutes. With the elevation coefficient at only -0.002, training at 7,000 ft represents a drop of only 14 seconds under the second model. The four variables selected for the reduced linear model are more significant, and the coefficients have the expected signs; it is therefore likely the better model from which to draw conclusions.
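The Model 2 figures can be cross-checked with a few lines of arithmetic. The per-mile value below is only what the quoted 247-second drop at 140 miles implies; it is not read from the regression output. The Model 1 elevation coefficient of 0.009 seconds per foot is included for comparison. All names are hypothetical.

```python
# Figures quoted in the text for Model 2.
DROP_AT_140_MILES_S = 247
WEEKLY_MILES = 140

# Per-mile effect implied by those two numbers (an inference, not a table value).
implied_per_mile = DROP_AT_140_MILES_S / WEEKLY_MILES
print(f"Implied drop per weekly mile: {implied_per_mile:.2f} s")
print(f"Total drop at 140 miles: {DROP_AT_140_MILES_S / 60:.1f} min")

# Elevation effect at 7,000 ft under each linear model.
ELEVATION_FT = 7000
elev_drop_model1 = 0.009 * ELEVATION_FT
elev_drop_model2 = 0.002 * ELEVATION_FT
print(f"Model 1: {elev_drop_model1:.0f} s, Model 2: {elev_drop_model2:.0f} s")
```

Under these numbers the implied per-mile effect is about 1.76 seconds, and the predicted elevation advantage shrinks from 63 seconds in the full model to 14 seconds in the reduced model.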

Model 3: Full Logistic Regression

The second set of models uses logistic regression, which predicts a binary outcome. The goal of both models is to predict whether, given a set of characteristics, an American runner will be the American national champion in the marathon. The first is a full model, including previous qualifications for the Olympics and World Championships as well as college championships. Whereas the coefficients in the first set of models represent changes in time measured in seconds, the coefficients of a logistic model, reported here as odds ratios, describe changes in the odds. For example, each additional college championship multiplies the odds of becoming an American national champion by 1.569, an increase of 56.9% in the odds, because the coefficient of colchamp is 1.569. This is a change in odds, not a 56.9 percentage point increase in probability.
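A short sketch shows how a coefficient of 1.569, read as an odds ratio, translates into a change in probability. The baseline probability of 10% is a hypothetical value chosen only for illustration, and the function names are assumptions of this sketch.

```python
import math


def odds(p: float) -> float:
    """Convert a probability to odds."""
    return p / (1 - p)


def prob(o: float) -> float:
    """Convert odds back to a probability."""
    return o / (1 + o)


COLCHAMP_ODDS_RATIO = 1.569  # value quoted in the text


def updated_probability(baseline_p: float, odds_ratio: float, units: int = 1) -> float:
    """Probability after increasing an input by `units`, given its odds ratio."""
    return prob(odds(baseline_p) * odds_ratio ** units)


baseline = 0.10  # hypothetical baseline chance of winning
new_p = updated_probability(baseline, COLCHAMP_ODDS_RATIO)
print(f"{baseline:.1%} -> {new_p:.1%}")

# The same effect on the log-odds scale used internally by logistic regression.
log_odds_coef = math.log(COLCHAMP_ODDS_RATIO)
```

With a 10% baseline, one additional college championship raises the probability to roughly 14.8%. The odds grow by 56.9%, but the probability grows by less, and the size of the change depends on the baseline.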

Unlike the first set of models, previous qualifications for the Olympics and World Championships have a positive effect on the chance of becoming an American champion. This is especially true for former Olympians, whose odds of winning the American marathon championship are 122.9% higher than those of athletes who have not been Olympians. Also unlike Model 2, miles are negatively associated with winning, although only slightly. Since the coefficient is less than 1, increasing mileage is expected to decrease the odds of winning. This is not what would normally be expected; however, since the athletes in the study already run such high mileage, a further increase likely reflects overtraining. Finally, time is negatively associated with winning, which is to be expected: an increase in time means a slower race and thus lower odds of winning. It is important to note that none of the coefficients is statistically significant, so it would be difficult to draw firm conclusions from this model. As a result, a reduced model is proposed in order to capture the significance of the variables expected to have a greater effect on winning the national championship.

Model 4: Reduced Logistic Regression

The final model seeks to continue the prediction of an American national champion; however, the reduced model seeks to understand the effect of fully controllable variables on the ability to win. At any one point in time, a professional athlete has control over the three variables used in the model. Someone new to running can increase their mileage, move to another city, and run faster. However, they are incapable of returning to the past to win a college championship or qualify for either the Olympics or World Championships.
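The reduced logistic model can be sketched as follows, assuming the three controllable variables are time, miles, and elev, as the paragraph above suggests. The symbols are notation introduced for this sketch.

```latex
\[
\log\frac{p_i}{1 - p_i}
  = \beta_0 + \beta_1\,\text{time}_i + \beta_2\,\text{miles}_i + \beta_3\,\text{elev}_i,
\qquad
p_i = \Pr(\text{amnatchamp}_i = 1)
\]
```

Exponentiating a coefficient gives the corresponding odds ratio, so a value of \(e^{\beta_k}\) below 1 means the input lowers the odds of winning and a value above 1 means it raises them.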

Time is again negatively associated with the odds of winning, which is expected. Mileage is also again negatively associated with winning, likely for the same reason as explained above. In both logistic models, elevation has no effect on the chance of winning. This contradicts the findings of the first set of models, where elevation lowered predicted times, and it is a positive sign for athletes training at sea level in cities such as Boston. A final important note is that hschamp was omitted from both logistic models because it perfectly predicts amnatchamp among the athletes who have it: every high school national champion in the study went on to win the American marathon championship, although not every American national champion was a high school champion. This is expected, since high-level runners who can maintain their training for years are more likely to become the top national runners. Although the reduced logistic model does a poor job of predicting the chance of becoming an American national champion, the finding that potential overtraining has a negative association with winning may be important for future professional athletes.
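The omission of hschamp can be illustrated with a small sketch of perfect prediction. When every athlete with the predictor set to 1 also has the outcome set to 1, one cell of the two-by-two table is empty, the sample odds ratio is unbounded, and a logistic regression cannot estimate a finite coefficient. The pairs below are invented toy data, not the study's data, and the function names are hypothetical.

```python
import math
from collections import Counter

# Toy (hschamp_indicator, amnatchamp) pairs, invented for illustration only.
toy = [(1, 1), (1, 1), (0, 1), (0, 0), (0, 0), (0, 0)]


def perfectly_predicts(pairs) -> bool:
    """True if some value of the predictor always occurs with a single outcome."""
    counts = Counter(pairs)
    for x in (0, 1):
        losses, wins = counts[(x, 0)], counts[(x, 1)]
        if losses + wins > 0 and (losses == 0 or wins == 0):
            return True
    return False


def sample_odds_ratio(pairs) -> float:
    """Odds ratio from the 2x2 table; infinite when a denominator cell is empty."""
    counts = Counter(pairs)
    numerator = counts[(1, 1)] * counts[(0, 0)]
    denominator = counts[(1, 0)] * counts[(0, 1)]
    return math.inf if denominator == 0 else numerator / denominator


print(perfectly_predicts(toy))  # True
print(sample_odds_ratio(toy))   # inf
```

In the toy data no athlete with the indicator set to 1 fails to win, mirroring the pattern described for hschamp, so the odds ratio has no finite estimate.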

Because the four models focus strongly on athletes’ physical inputs, it is important to contextualize the results within the greater goal of running. Since the American national championships represent the pinnacle of American marathon running, athletes prepare specifically for that single race. Training cycles are calibrated with the goal of entering the race as fresh as possible. These models therefore represent only a snapshot of the running journey. A marathon preparation cycle lasts between 8 and 20 weeks, yet an athlete’s journey to even enter the cycle requires years. Previous research has explored the relationship between the mental grind of long-term training and the possibility of breaking mental barriers (see Parish 2024). However, more research is needed on the intersection of the mental and the physical. Future analysis should focus on understanding how the physical training during a single cycle is informed by long-term mental training and vice versa. Based on the research conducted in this study, physical racing output is determined by a variety of factors, with no single input dictating a race entirely.

Works Cited

“Athlete Biography.” World Athletics, worldathletics.org. Accessed 19 Feb. 2024.

“Boulder Track Club Athletes.” Boulder Track Club Running Team, bouldertrackclub.com/elite-athlete/clint-wells/. Accessed 19 Feb. 2024.

“Brogan Austin Biography.” Tinman Elite Athletes, tinmanelite.com/pages/brogan-austin. Accessed 19 Feb. 2024.

“Conner Mantz Biography.” BYU Athletics, byucougars.com/sports/mens-cross-country/roster/season/2021/player/conner-mantz. Accessed 19 Feb. 2024.

“Galen Rupp Biography.” Oregon T&F Athletes, goducks.com/sports/track-and-field/roster/galen-rupp/3614. Accessed 19 Feb. 2024.

“Hanson Running Athletes.” Brooks-Hanson Running Team, hansons-running.com/content/meet-the-team. Accessed 19 Feb. 2024.

“Jake Riley Athletic Biography.” USA T&F Runners, www.usatf.org/athlete-bios/jake-riley. Accessed 19 Feb. 2024.

“NAZ Elite Athletes.” HOKA ONE ONE NAZ Elite, nazelite.com/athletes/futsum-zienasellassie/. Accessed 19 Feb. 2024.

Ward, Jared. “Biography.” Jared Ward, www.jared-ward.com/. Accessed 19 Feb. 2024.

“ZAP Endurance Athletes.” ZAP Endurance Running, zapendurance.com/. Accessed 19 Feb. 2024.

Appendix

Data used: https://docs.google.com/spreadsheets/d/1-HjjTk6hUZSvif0QwN7IiL78jf7lSe4zVVbBpyG-moM/edit#gid=1152426162
