How to Win the Masters

Zakary Krumlinde
Published in The Sports Niche
9 min read · Jul 5, 2019

The Problem

Each year around the middle of April, PGA professionals gather at Augusta National Golf Club in pursuit of the elusive green jacket at the Masters. This is the first of four major championships, and the only one that is played at the same course every year. This post takes a look at which stats are most relevant to past champions and which did not have an effect. Golf, however, is not like baseball, where loads of stats are easily accessible; golf is just now getting deep into analytics with ball speed, flight, spin, etc. Stats such as Greens in Regulation, Driving Distance, Fairways in Regulation and Putting have been kept, though, and those are the stats that we will be taking a look at here. The question that I kept in mind while analyzing the data was: Is there a stat, or set of stats, that stands out among players that perform well at the Masters?

The Data

My personal agenda for this project was to learn how to web-scrape. I had no experience with this and wanted to learn the skill. The Masters has its own site with stats, https://2019.masters.com/en_US/scores/stats/gir.html. Each stat had a separate page, and the tables all looked like the one below.

I was only interested in the player's name and the final column, which was the average/total for all the rounds that player played in. To scrape this data, I used Selenium in my Jupyter Notebook. I am not sure it was the most efficient method, but I was able to get all the data that I wanted into a dataframe.

I created 6 individual dataframes and concatenated them all into one master dataframe at the end. Each stat had slightly different issues that had to be cleaned up. For example, in the above table of Greens in Regulation, the final column for Ian Poulter is ‘55/72 = 76.39%.’ I did not want all of that in my column; I just wanted the ‘76.39’ from the table. The same issue occurred with Fairways in Regulation, and Putting had the number of 3-putts in parentheses at the end of the stat (shown below).
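A minimal sketch of this kind of cleanup, not the code from my notebook. The column names are hypothetical, the GIR string is the one quoted above, and the putting string ‘1.58 (2)’ is only an assumed example of a value followed by a 3-putt count in parentheses.

```python
import pandas as pd

# Raw final-column strings. The GIR value is the one quoted in the post;
# the putting value is a made-up example of the "(3-putts)" suffix.
raw = pd.DataFrame({
    "player": ["Ian Poulter"],
    "gir_raw": ["55/72 = 76.39%"],
    "putting_raw": ["1.58 (2)"],
})

# Keep only the number that sits just before the percent sign.
raw["gir_pct"] = (
    raw["gir_raw"].str.extract(r"([\d.]+)\s*%", expand=False).astype(float)
)

# Keep only the leading number and drop the parenthesised 3-putt count.
raw["putting_avg"] = (
    raw["putting_raw"].str.extract(r"^\s*([\d.]+)", expand=False).astype(float)
)

print(raw[["player", "gir_pct", "putting_avg"]])
```

Using a regular expression rather than fixed string slicing keeps the cleanup working even if the fraction in front of the percentage has a different number of digits.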

Once I was able to get the players and final stat in the form I was looking for, I ranked where each player finished overall in each respective stat using the pandas .rank() method. I found ranking the players to be important because golf is an individual sport where you are compared to every other player in the field that year. The scores of the winners range from -18, when Jordan Spieth won in 2015, to -5, when Danny Willett won the next year. That -5 would have been tied for 17th in 2015. Winning is based on playing better than everyone else, not on an arbitrary number or score.
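As a rough illustration of ranking within a single year's field, here is a small sketch. The players, percentages, and column names are invented, and the choice of method="min" for ties is an assumption rather than something stated above.

```python
import pandas as pd

# Invented GIR percentages for two years; higher is better.
df = pd.DataFrame({
    "year": [2018, 2018, 2018, 2019, 2019, 2019],
    "player": ["A", "B", "C", "A", "B", "D"],
    "gir_pct": [76.39, 68.06, 68.06, 80.56, 65.28, 72.22],
})

# Rank inside each year so a player is only compared to that year's field.
# ascending=False gives rank 1 to the highest percentage; method="min"
# gives tied players the same (best) rank.
df["gir_rank"] = df.groupby("year")["gir_pct"].rank(ascending=False, method="min")

print(df)
```

For a stat where a lower number is better, such as putts per hole, the same call would use ascending=True.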

The final part of each dataframe was incorporating how the players finished in the tournament that year: their final position and score. The issue that arose here is that the scoreboard page on the Masters website listed only the players' last names, whereas the stats tables used full first and last names for 2017–2019 or first initial and full last name for 2014–2016. I had to do some research to find other sites that formatted the names the same way and gave the scores of all players, including the players that didn’t make the cut.

Upon further inspection, players that did not make the cut caused some issues as well. If a player did not make the cut, the position column on all the sites was labeled ‘CUT’ instead of the position that they came in. I could not just rank all the players by final score, because some players may have made the cut but then played poorly on Saturday and Sunday and finished with a worse score than some players that missed the cut. Instead, I separated out the players that got cut, ranked their scores on their own, and concatenated them back onto the players that made the cut.
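One way this step could look is sketched below. It assumes that players who missed the cut are placed behind everyone who made it, which is my reading of the step rather than a detail spelled out above; the leaderboard values and column names are invented.

```python
import pandas as pd

# Invented leaderboard: player C made the cut but finished with a worse
# score (to par) than D and E, who missed it.
board = pd.DataFrame({
    "player": ["A", "B", "C", "D", "E"],
    "position": ["1", "2", "3", "CUT", "CUT"],
    "score": [-10, -8, 9, 4, 6],
})

made_cut = board[board["position"] != "CUT"].copy()
missed_cut = board[board["position"] == "CUT"].copy()

# Rank each group by score separately (lower score is better).
made_cut["final_rank"] = made_cut["score"].rank(method="min")

# Assumption: cut players start right after the last player who made the cut.
missed_cut["final_rank"] = missed_cut["score"].rank(method="min") + len(made_cut)

final = pd.concat([made_cut, missed_cut], ignore_index=True)
print(final)
```

Ranking the two groups separately keeps player C ahead of D and E even though C's total score is worse.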

Finally, after this process was done for each year from 2014–2019, all the stats and years were concatenated into one master Masters dataframe. It consisted of 550 rows (total number of players) and 12 columns, all integers and floats.
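The stacking step can be sketched roughly as follows. The per-year frames, their contents, and the names frames_by_year and master are placeholders for illustration only.

```python
import pandas as pd

# Placeholder per-year frames; the real ones would hold every stat rank.
frames_by_year = {
    2018: pd.DataFrame({"player": ["A", "B"], "gir_rank": [1, 2]}),
    2019: pd.DataFrame({"player": ["A", "C"], "gir_rank": [2, 1]}),
}

# Tag each frame with its year before stacking so rows stay identifiable.
master = pd.concat(
    [frame.assign(year=year) for year, frame in frames_by_year.items()],
    ignore_index=True,
)

print(master.shape)
```

Keeping the year as a column is what later allows ranking and filtering within a single tournament.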

Exploratory Data Analysis

There are not many features to look at, but to see where correlations exist, I created a pairplot (a visual correlation matrix) using Seaborn. I only compared the rank for each stat because players are competing against each other: a number that was great one year may not be great the next year. If it was raining and windy one year, maybe the leader only hit 65% of Greens in Regulation, but the next year, in ideal conditions, the winner hit 80% of greens. That 65% isn’t going to look very good next to the 80%, but compared to everyone else in the rainy and windy year, it was excellent.

There are two reasons that I wanted to look at a pairplot. The first was to see which stats are most correlated with the final position a player finishes in. The second was to check for multicollinearity, which is a high correlation among the stats themselves. With the visual representation, we are looking for diagonal lines: up and to the right is a positive correlation, down and to the right is a negative correlation. Most of these scatterplots look more like filled-in squares, not showing much of a correlation at all, even among the individual stats. The one pair that appears to have some correlation is the final position a player ends up in and where they rank in Greens in Regulation.

Correlation Matrix

To investigate further, I created a correlation matrix, which is similar to the pairplot except that it shows correlation coefficients instead of scatterplots. As observed from the pairplot, there is a strong relationship between the final position and Greens in Regulation rank, with a coefficient of just below 0.66. Nothing else even reaches 0.45, showing weak or no correlation between the other stats and final position.
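A small sketch of computing such a matrix with pandas follows. The ranks are invented and the column names are hypothetical, and it uses the pandas default (Pearson) coefficient, which is an assumption about the method.

```python
import pandas as pd

# Invented ranks for six players in one tournament.
ranks = pd.DataFrame({
    "final_pos": [1, 2, 3, 4, 5, 6],
    "gir_rank": [2, 1, 3, 5, 4, 6],
    "putting_rank": [4, 6, 1, 3, 5, 2],
})

# Pairwise correlation coefficients between every pair of columns.
corr = ranks.corr()

# Pull out just the column of interest and sort the stats by strength.
corr_with_final = (
    corr["final_pos"].drop("final_pos").sort_values(ascending=False)
)

print(corr_with_final)
```

Because the inputs are already ranks, a positive coefficient here means that ranking well in a stat goes together with finishing well in the tournament.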

Where Did the Winners Finish?

Another way I wanted to investigate which stats stick out was to extract just the winners from the data, see where they ranked in each stat in their respective years, and check whether the correlations applied to the winners of this prestigious tournament.

Masters winners' stats and rankings

It is quite clear that the winner always finishes very well in Greens in Regulation, as expected from the previous analysis. Only once has a player finished outside the top 8 in GIR and gone on to win the Green Jacket, and that was Patrick Reed in 2018, who finished 21st. If we look closer, we can see that Patrick Reed also finished 2nd that year in Putting: he may have missed some greens, but he made up for it by putting better than all but one other player. The most recent Masters may be the best case for showing the importance of Greens in Regulation, as Tiger Woods finished first in GIR but did not even break the top 50 in any other category!

To see how the winners fared in the stats rankings, I averaged the stats rankings of the previous 6 Masters winners. The average ranking of the winners in Greens in Regulation was 6.83, meaning the winner averages a rank between 6th and 7th. This was by far the best (lowest) average, with Putting coming in next (18.67), then Driving Distance (28.0; Bubba Watson helped its cause by finishing first when he won), and Fairways in Regulation bringing up the rear (29.5). It is starting to become clear that a player needs to have a good iron and wedge game to be successful at the Masters.
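The averaging itself is a short filter-then-mean, sketched here on invented data. The players and ranks are made up and do not reproduce the figures above; the column names are hypothetical.

```python
import pandas as pd

# Invented two-year dataset; final_pos == 1 marks the winner of each year.
df = pd.DataFrame({
    "year": [2018, 2018, 2019, 2019],
    "player": ["A", "B", "C", "D"],
    "final_pos": [1, 2, 1, 2],
    "gir_rank": [4, 3, 8, 5],
    "putting_rank": [12, 10, 20, 8],
})

# Keep only the winners, then average each stat rank across years.
winners = df[df["final_pos"] == 1]
avg_winner_rank = winners[["gir_rank", "putting_rank"]].mean()

print(avg_winner_rank)
```

A lower average means that winners tended to rank closer to the top of the field in that stat.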

I still wanted to dig a little deeper and see if there were some better indicators, because just finishing first in Greens in Regulation does not guarantee a victory. It actually only happened once, and that was the most recent win by Woods. Since the correlation matrix showed no strong collinearity between the individual stats, we can combine their rankings, i.e. add the Greens in Regulation rank to the Putting rank, then rank that sum within each year. (Let's unfold all that: Patrick Reed finished 21st in GIR and 2nd in Putting, which sums to 23. After that was done for all players, the sums were ranked by year, and Patrick Reed ended up 3rd in 2018.) This was done with all the stats. The following shows how the combined stats correlated with the final position.
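The combined ranking can be sketched like this. Patrick Reed's 21st and 2nd are the values quoted above; Players B to D and their ranks are invented, chosen only so that the toy field reproduces his 3rd place, and the column names are hypothetical.

```python
import pandas as pd

# Reed's ranks come from the post; the other three rows are invented.
df = pd.DataFrame({
    "year": [2018, 2018, 2018, 2018],
    "player": ["Patrick Reed", "Player B", "Player C", "Player D"],
    "gir_rank": [21, 5, 8, 30],
    "putting_rank": [2, 9, 12, 15],
})

# Step 1: add the two individual ranks together (21 + 2 = 23 for Reed).
df["gir_putt_sum"] = df["gir_rank"] + df["putting_rank"]

# Step 2: rank the sums within each year; the smallest sum is rank 1.
df["gir_putt_rank"] = df.groupby("year")["gir_putt_sum"].rank(method="min")

print(df[["player", "gir_putt_sum", "gir_putt_rank"]])
```

Summing ranks treats both stats as equally important. Weighting one stat more heavily would be a different design choice that is not explored here.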

Correlations to Final Position

When combining the Greens in Regulation and Putting rankings, the correlation to the final position improves to 0.86, which is a very strong relationship and is actually higher than that of the score and of all the stats combined. Greens in Regulation and Putting combined is the best indicator of the final position that a player will finish in. The following is the scatterplot of the GIR + Putting ranks versus the final position. We can see that it is much more correlated, as it slopes up and to the right.

One last thing to check is how the winners ranked in GIR plus Putting. When we looked at the GIR stats alone, only once had the player that finished first in GIR actually won the tournament. That is not the case when we combine it with Putting. The results are much better, and they are what I was hoping to get when combining the stats rankings. From 2014 to 2017, the player that finished first in these combined stats went on to win. Patrick Reed finished 3rd, and Tiger, in Tiger fashion, defied logic and won the Green Jacket while finishing 17th overall.

Conclusion

When I started this project, my goal was to learn how to web scrape some PGA data and do some exploratory data analysis. I now feel comfortable scraping from a table and creating dataframes from it. I also gained some practice with cleaning data, lambda functions, the pandas rank function and more data visualizations.

However, I learned a lot more about how I am going to choose players in my office Masters pool next year. While researching players to pick, I will be looking at a combination of Greens in Regulation and Putting stats. The PGA has all players’ cumulative stats for the year readily available online to check. I will be looking for players that rank high in both Greens in Regulation and Putting, and if there are players close in the combined ranking, the player that ranks better in Greens in Regulation will be my tiebreaker. With that being said, players still need to be able to perform on the biggest golf stage in the world. Unfortunately, there are no stats available to show if a player can make an 8-foot putt with the prestigious Green Jacket looming, which is why we love sports!

All my code is available on my GitHub: https://github.com/zkrumlinde/PGA-Masters-Stats

I am an aspiring data scientist that posts about projects completed along the way.