Predicting the 2025 Ryder Cup Teams: A PGA Statistical Analysis

Elliott Bauer
INST414: Data Science Techniques
7 min read · May 15, 2024

For my final project for INST414: Data Science Techniques, I decided to extend my Module 1 Assignment on the Professional Golfers' Association (PGA). The motivating question is: based on a variety of statistical categories from the past two PGA seasons, which golfers are most qualified to be selected for the 2025 Ryder Cup? The stakeholders for this research question are the captains of Team USA and Team Europe, although I think golf fans of all levels could benefit from my analysis as a whole.

Before going into more depth, it is important that readers understand what the Ryder Cup is. The Ryder Cup is a team tournament held every two years that pits the United States against Europe, and essentially every eligible golfer wants to compete in it. The first two days consist of two-on-two matches in two formats. In fourball (“best ball”), each player plays their own ball and the lower score on each hole counts for the team. In foursomes, the two partners play a single ball and alternate shots. The final day is every man for himself, with twelve singles matches. Each team is made up of twelve golfers, half of whom are automatic qualifiers; the captain personally selects the remaining members. There are 28 points available during the tournament, so 14.5 are required to win outright. Team USA has the better historical record, but Team Europe has had their number in recent years, winning ten of the past 14 Ryder Cups.

For this analysis, I also decided to build on Module 3 by using nodes to visually display the golfers who are most qualified to take part in the Ryder Cup. Below is a list of the column names within my main data frame:

  • RK: Rank
  • EARNINGS: Season-long earnings in US dollars
  • CUP: FedEx Cup Points
  • EVNTS: Number of events participated in
  • RNDS: Number of rounds played
  • CUTS: Number of cuts made; a ‘cut’ trims the field after the early rounds so that only the top performers advance to the remaining rounds
  • TOP10: Top 10 finishes
  • SCORE: Average score per tournament
  • DDIS: Average drive distance off the tee, in yards
  • DACC: Driving accuracy, as a percentage
  • GIR: Greens in regulation, as a percentage
  • PUTTS: Putts per hole
  • SAND: Save percentage out of sand traps
  • BIRDS: Birdies per round; a birdie is a score of one under par on a hole

Since I am working with 2023 and 2024 data, I extracted both data sets from ESPN.com and merged the two together. Depending on the statistic, I either summed or averaged the columns that showed the same stat for different years to see how each player has performed over the past two years. For example, for the ‘Cuts Made’ column, I added up the number of cuts that a player made in 2023 and 2024 and stored the total in a new column. Then, I divided the number of cuts made by the number of events that each golfer participated in to determine the percentage of events in which they made the cut. I did a variation of this for all of the tables that I created.

In terms of models, I essentially just counted how many tables each player appeared in. For all of the tables I created, I used sort_values() to sort in ascending order and the .head() method to show only the top n names. This way, my output would contain only the golfers who were the best in that given statistical category. In most cases, I took the top 25 golfers because that was about the top fifteen percent. That felt like an exclusive enough group, especially when you consider that some golfers from outside the USA and Europe would still appear in the data frames. However, it is also important to note that in some cases, golfers towards the end of the top 25 had the same values as golfers just outside of the top 25 (because of the order in which Pandas prints the data frame). To account for this, I increased the head() value so that the tied golfers who were previously just outside of the top 25 would be included.

Below is an example of how a sample data frame I used looks. The general steps are to combine similar columns, create a unique data frame, drop outlying values, sort by the desired value in ascending order, take the top n rows, and print.
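The merge, cut-percentage, and top-n steps can be sketched in pandas as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: the golfer names and numbers are made up, the suffixed column names and the `top_n_with_ties` helper are hypothetical, and the tie handling is done with a cutoff value rather than by manually raising the head() argument. The sort direction is left as a parameter, on the assumption that higher is better for a stat like cut percentage while lower is better for a stat like SCORE.

```python
import pandas as pd

# Hypothetical miniature versions of the two season tables (made-up numbers).
season_2023 = pd.DataFrame({
    "NAME": ["Golfer A", "Golfer B", "Golfer C", "Golfer D"],
    "EVNTS": [20, 22, 18, 25],
    "CUTS": [18, 15, 14, 20],
})
season_2024 = pd.DataFrame({
    "NAME": ["Golfer A", "Golfer B", "Golfer C", "Golfer D"],
    "EVNTS": [10, 12, 12, 10],
    "CUTS": [9, 10, 10, 8],
})

# Merge on player name; the suffixes keep the two seasons' columns apart.
merged = season_2023.merge(season_2024, on="NAME", suffixes=("_23", "_24"))

# Counting stats are summed across seasons, then turned into a percentage.
merged["CUTS_TOTAL"] = merged["CUTS_23"] + merged["CUTS_24"]
merged["EVNTS_TOTAL"] = merged["EVNTS_23"] + merged["EVNTS_24"]
merged["CUT_PCT"] = merged["CUTS_TOTAL"] / merged["EVNTS_TOTAL"] * 100


def top_n_with_ties(df, column, n, ascending=False):
    """Return the top n rows by `column`, plus any rows tied with the n-th row."""
    ranked = df.sort_values(column, ascending=ascending)
    if len(ranked) <= n:
        return ranked
    cutoff = ranked[column].iloc[n - 1]  # value held by the n-th ranked row
    if ascending:
        return ranked[ranked[column] <= cutoff]
    return ranked[ranked[column] >= cutoff]


# Golfer C and Golfer D are tied at the cutoff, so asking for 2 rows yields 3.
top_cut_pct = top_n_with_ties(merged, "CUT_PCT", n=2)
print(top_cut_pct[["NAME", "CUT_PCT"]])
```

A plain `merged.sort_values("CUT_PCT", ascending=False).head(2)` would have dropped one of the two tied golfers, which is the situation the adjusted head() value is meant to avoid.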

Now, I will give the list of golfers for each team that I believe are most qualified to compete in the 2025 Ryder Cup.

Team USA:

Scottie Scheffler, 27, Ridgewood, NJ

Wyndham Clark, 30, Denver, CO

Brian Harman, 37, Savannah, GA

Russell Henley, 35, Macon, GA

Xander Schauffele, 30, La Jolla, CA

Max Homa, 33, Burbank, CA

Sahith Theegala, 26, Orange, CA

Collin Morikawa, 27, Los Angeles, CA

Sam Burns, 27, Shreveport, LA

Denny McCarthy, 31, Rockville, MD

Patrick Cantlay, 32, Long Beach, CA

Brooks Koepka*, 34, West Palm Beach, FL

Team Europe:

Ludvig Åberg, 24, Eslov, Sweden

Rory McIlroy, 35, Holywood, Northern Ireland

Viktor Hovland, 26, Oslo, Norway

Tommy Fleetwood, 33, Southport, United Kingdom

Stephan Jaeger, 34, Munich, Germany

Matt Fitzpatrick, 29, Sheffield, United Kingdom

Shane Lowry, 37, Clara, Ireland

Sepp Straka, 31, Vienna, Austria

Tyrrell Hatton, 32, High Wycombe, United Kingdom

Aaron Rai, 29, Wolverhampton, United Kingdom

Vincent Norrman, 26, Stockholm, Sweden

Jon Rahm*, 29, Barrika, Spain

Below is an image of the USA and Europe node charts, with bigger nodes representing a higher player count.
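The player count behind the node sizes can be sketched as below. This is an illustration only: the table names, golfer names, and the `BASE_SIZE` scaling constant are hypothetical, and the resulting sizes would still need to be handed to whichever graph-drawing library produces the chart.

```python
from collections import Counter

# Hypothetical top-n tables: each maps a stat to the golfers who made that table.
top_tables = {
    "CUT_PCT": ["Golfer A", "Golfer B", "Golfer C"],
    "TOP10": ["Golfer A", "Golfer C", "Golfer D"],
    "BIRDS": ["Golfer A", "Golfer B", "Golfer E"],
}

# Count how many tables each golfer appears in (at most once per table).
appearances = Counter(
    name for names in top_tables.values() for name in set(names)
)

# Scale node size linearly with the count; BASE_SIZE is an arbitrary choice.
BASE_SIZE = 300
node_sizes = {name: BASE_SIZE * count for name, count in appearances.items()}

for name, count in appearances.most_common():
    print(f"{name}: {count} tables, node size {node_sizes[name]}")
```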

The two asterisks above indicate my own personal additions to the rosters. There is a rival league to the PGA Tour, known as LIV Golf. LIV is the Roman numeral for 54, which symbolizes the 54 holes played at its events. LIV Golf aimed to grow its league by luring PGA golfers with large sums of guaranteed money. However, golfers who sign with LIV are still allowed to play in the Major Championships (The Masters, The US Open, The PGA Championship, and the British Open). As of now, there is nothing that prevents LIV golfers from joining the Ryder Cup teams either. Brooks Koepka and Jon Rahm both won major championships in 2023, and had been towards the top of the leaderboard at many PGA events prior to leaving. Their addition to these teams makes sense for those reasons.

Another feature I added to my analysis builds on Euclidean distances from Module 3. I wrote a couple of functions that allow a user to enter the name of a golfer from the database. The functions compare that golfer’s stats to everyone else’s, take the Euclidean distance between them, and list the 5 golfers most similar to the one the user entered.

This could be useful to my stakeholders, the Ryder Cup captains, for a variety of reasons. For one, if they are trying to roster a diverse group of golfers with different skillsets, this could be a resource for when certain players are unable to participate. For example, if the Europe captain entered “Tyrrell Hatton”, but for some reason Hatton was not able to play in the tournament, he would see that Shane Lowry, an Irish golfer, had the closest similarity to Hatton. He could then reach out to Lowry and ideally fill the hole created by Hatton’s unavailability, as they seem to have similar strengths. Captains could also use it for the opposite reason: ensuring that their roster is not geared towards one specific skillset. For example, if they chose one golfer and also planned on choosing several of the golfers returned by the Euclidean distance function, they might want to rethink. This way, their talent can be diversified across a number of statistical categories.
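A similarity lookup of this kind might look like the sketch below. It is not the notebook's code: the stat lines are made up, the function names `zscore_columns` and `most_similar` are hypothetical, and the z-score step is an added assumption so that a stat measured in yards does not swamp stats measured in percentages.

```python
import math

# Hypothetical stat lines (made-up numbers), ordered as [DDIS, DACC, GIR].
stats = {
    "Golfer A": [300.0, 60.0, 68.0],
    "Golfer B": [301.0, 61.0, 68.5],
    "Golfer C": [320.0, 52.0, 66.0],
    "Golfer D": [285.0, 70.0, 70.0],
    "Golfer E": [310.0, 57.0, 67.0],
}


def zscore_columns(table):
    """Rescale each stat to mean 0 and standard deviation 1 (an assumption)."""
    columns = list(zip(*table.values()))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    return {
        name: [(x - m) / s for x, m, s in zip(row, means, stds)]
        for name, row in table.items()
    }


def most_similar(name, table, k=5):
    """Return the k golfers with the smallest Euclidean distance to `name`."""
    if name not in table:
        raise KeyError(f"{name!r} is not in the table")
    target = table[name]
    distances = [
        (other, math.dist(target, row))
        for other, row in table.items()
        if other != name
    ]
    distances.sort(key=lambda pair: pair[1])
    return distances[:k]


scaled = zscore_columns(stats)
for other, distance in most_similar("Golfer A", scaled, k=3):
    print(f"{other}: {distance:.2f}")
```

A smaller distance means a more similar statistical profile, so the first name returned is the natural replacement candidate and the names near the top are the ones to avoid stacking on one roster.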

My analysis certainly has some limitations that I want to go over. For one, it is currently the middle of the 2024 PGA season, and only one major tournament has been played. A lot could change between now and 2025, when the teams are chosen. Injuries could occur, new players could break out, and much more. New rules could even be passed that prohibit LIV golfers from playing in the tournament. Another limitation of the project is that many factors in golf are very hard to represent statistically. For example, weather can affect a player’s performance, but there was no data on that. Lastly, it is important to recognize how much talent there is across the board on the PGA Tour. The skill gap between the best golfer on the tour and the worst is much smaller than one might think.

Below, I have attached a link to my GitHub repository:

https://github.com/elliottbauer99/INST414/blob/main/Final%20Project.ipynb
