Who Were The Best Players On The PGA Tour in 2018?

Graham Albers
INST414: Data Science Techniques
5 min read · Dec 11, 2023

Introduction:

Golf is one of the most-watched sports in the world, and the new LIV Golf league is becoming a major rival to the long-established PGA Tour. Since the start of the Covid-19 pandemic, the number of people playing golf has increased significantly. Looking at player stats from the 2018 PGA Tour season, you can see how players with similar performance group together. You can also see what it looks like to be in the top 10 percent of the best golfers in the world.

Data Source:

This data set of PGA Tour player statistics came from Kaggle.com. It covers 2010–2018 and includes any player who recorded a pro round during that time. The data originally contained many stats on each golfer, but since the goal was to measure success, the players were compared using three features: wins, top-10 placement ratio, and earnings per round played in 2018. K-means clustering was used to compare the players.

K-Value:

The k value was chosen in advance rather than derived from the data. The 193 players who played a round in the 2018 PGA Tour season were divided into 10 clusters. With 10 clusters there would be enough diversity among the players, and each cluster could still be fairly represented.

Cluster Analysis:

Of the 10 clusters, the leading one was the 7th, which consisted of only 4 players: Brooks Koepka, Dustin Johnson, Justin Rose, and Justin Thomas, some of the best players on the PGA Tour that season. Combined, the 4 of them had 10 wins on the year and over 30 top-10 finishes. The 7th cluster was an extreme outlier, and most of the other clusters consisted of at least 10 people. It did, however, contain the best group of players.

Koepka and Johnson playing a round together Creator: Streeter Lecka | Credit: Getty Images

Cluster 6 had 4 players: Jason Day, Patton Kizzire, Francesco Molinari, and Bubba Watson. With a combined 9 wins and 18 top-10 finishes, it was the second-best cluster. Cluster 1 consisted of 9 players, all of whom had a win on the season; as a group they had 19 top-10 finishes.

Tiger Woods on the Green at Augusta National during the Masters Creator: Kevin C. Cox | Credit: Getty Images

The 4th cluster had 10 players and included the GOAT, Tiger Woods, along with a few other notable golfers such as Rickie Fowler and Henrik Stenson, who led the PGA Tour that season in fairways hit percentage. However, no one in the 4th cluster won all season. The 10th cluster consisted of 8 players, all of whom had a win on the season, including the likes of Rory McIlroy, Phil Mickelson, and Bryson DeChambeau.

The 8th cluster had by far the most players: over 30, none of whom won on the season. The 1st cluster featured players who each had a win on the season, but those wins came from lower-profile players in much lower-profile tournaments. Cluster 3 was rather surprising, with only 3 players, all of whom performed well against expectations. The 9th cluster had a few top-10 finishes but nothing comparable to cluster 7 or 6.

Lastly, the 2nd and 5th clusters also consisted of players who did not have any wins on the season. As a result, these clusters were also rather large, with over 20 players, and did not include any especially notable players.
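The per-cluster numbers above (players, combined wins, combined top-10 finishes) are the kind of summary a pandas groupby can produce. Here is a minimal sketch of that idea. The column names (`player`, `cluster`, `wins`, `top_10`) and the six placeholder rows are assumptions made up for illustration; they are not values from the Kaggle data.

```python
import pandas as pd

# Placeholder rows for illustration only (hypothetical players and numbers).
labeled = pd.DataFrame({
    "player": ["P1", "P2", "P3", "P4", "P5", "P6"],
    "cluster": [7, 7, 6, 6, 8, 8],
    "wins": [3, 2, 1, 2, 0, 0],
    "top_10": [8, 7, 5, 4, 1, 2],
})

# One row per cluster: how many players it holds and their combined results.
summary = (
    labeled.groupby("cluster")
    .agg(
        players=("player", "count"),
        total_wins=("wins", "sum"),
        total_top_10=("top_10", "sum"),
    )
    # Rank clusters by combined wins, then by combined top-10 finishes.
    .sort_values(["total_wins", "total_top_10"], ascending=False)
)

print(summary)
```

Sorting by combined wins and then top-10 finishes is one simple way to rank clusters; it is a choice made for this sketch and may not match how the clusters were ranked in the article.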

Software Used:

For this analysis, I used the pandas library in Python to read the PGA Tour dataset into a DataFrame and visualize the data. To standardize the features and create the clusters, I used the scikit-learn (sklearn) library, specifically its StandardScaler and KMeans classes. After that, the matplotlib library was used to create a 2-D scatter plot with the clusters labeled.
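To make that workflow concrete, here is a rough sketch of how StandardScaler, KMeans, and matplotlib can fit together. It is not the original code: the data is randomly generated, the column names are assumptions, and the choice of which two features go on the plot axes is mine. Only the 193 players, the three features, and k = 10 come from the article.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 193 players (random values, not real stats).
rng = np.random.default_rng(0)
n_players = 193
players = pd.DataFrame({
    "wins": rng.poisson(0.3, n_players),
    "top_10_ratio": rng.uniform(0.0, 0.15, n_players),
    "earnings_per_round": rng.gamma(2.0, 10_000.0, n_players),
})
features = ["wins", "top_10_ratio", "earnings_per_round"]

# Standardize so that no feature dominates the distance calculation.
scaler = StandardScaler()
X = scaler.fit_transform(players[features])

# k = 10 as in the article; random_state is fixed only for repeatability.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
players["cluster"] = kmeans.fit_predict(X) + 1  # number clusters 1..10 (assumption)

# Map the cluster centers back to original units so they can be plotted.
centers = scaler.inverse_transform(kmeans.cluster_centers_)

# 2-D scatter plot colored by cluster, with each center labeled by its number.
fig, ax = plt.subplots()
ax.scatter(players["top_10_ratio"], players["earnings_per_round"],
           c=players["cluster"], cmap="tab10", s=20)
for number, center in enumerate(centers, start=1):
    ax.annotate(str(number), (center[1], center[2]), fontweight="bold")
ax.set_xlabel("Top-10 ratio")
ax.set_ylabel("Earnings per round")
ax.set_title("K-means clusters (synthetic data)")
# In an interactive session, switch backends and call plt.show() to view it.
```

Note that K-means numbering is arbitrary, so a cluster's number in a run like this would not line up with the cluster numbers discussed in the analysis.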

Data Cleaning:

To clean the data, all null values in the dataset were first replaced with 0. The dataset was then filtered to the players who had played at least one round in the 2018 season; 193 players met this qualification. Next, a top-10 ratio column was created for each player by dividing the player's number of top-10 placements by the total number of rounds played during the season. An earnings-per-round column was then created by dividing each player's total money earned during the season by the number of rounds played. This allowed me to compare the players using wins, the top-10 placement ratio, and earnings per round.
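The cleaning steps described here can be sketched in a few lines of pandas. This is a minimal illustration, not the original code: the column names (`player`, `year`, `rounds`, `wins`, `top_10`, `money`) and the four rows are made up, and the real Kaggle file may name or format its columns differently (for example, money stored as text would need converting first).

```python
import pandas as pd

# Made-up rows for illustration only; column names are assumptions.
raw = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "year": [2018, 2018, 2018, 2017],
    "rounds": [80, 60, None, 70],
    "wins": [2, None, None, 1],
    "top_10": [8, 3, None, 4],
    "money": [4_000_000, 900_000, None, 1_500_000],
})

def clean_season(df, season=2018):
    """Fill nulls with 0, keep players with at least one round, add ratios."""
    df = df.fillna(0)
    # Keep only the chosen season and players who played at least one round.
    df = df[(df["year"] == season) & (df["rounds"] >= 1)].copy()
    # Per-round ratios make players with different workloads comparable.
    df["top_10_ratio"] = df["top_10"] / df["rounds"]
    df["earnings_per_round"] = df["money"] / df["rounds"]
    return df.reset_index(drop=True)

cleaned = clean_season(raw)
print(cleaned[["player", "wins", "top_10_ratio", "earnings_per_round"]])
```

Filtering on rounds before dividing also avoids division by zero for players whose missing round counts were filled with 0.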

Limitations:

This analysis is limited because it compares players using their earnings and tournament placements, even though tournaments differ in field strength and prize money. The data was normalized per round to account for this. It is also important to note that no round statistics, such as fairways hit percentage or average putts, were used to determine the clusters; including them could further explain why certain players were performing at such a high level during the 2018 PGA Tour season. Here is the link to my code!
