Predicting a PGA Tour Winner (Part 1 — Exploration and Regression Models)

Published in Analytics Vidhya, Apr 19, 2020

When I was at university back in 2013, I placed my first bet. It was the year that Adam Scott won the Masters. He hadn’t won a major before, but I thought he was rather attractive and if I was going to lose money betting on anyone, he was as good a guess as any.

Looking back (and having since read Malcolm Gladwell’s “Blink”), I think that growing up with the sport gave me a level of intuition that must have swung my vote. I won £140, and it funded a few good nights.

Motivation

I approached my first ever application of Machine Learning in the same way that I’ve been told to approach any project or presentation: start with something you know.

Golf is that thing — so here goes.

My task was to find out whether I could be successful at predicting which PGA Golfers were having a good season, and whether this predicted their future performance.

Data Sources

My second specification for this project was to spend the maximum amount of time on machine learning techniques. Given that I had a deadline, this meant spending the minimum amount of time possible on data cleaning and preparation.

Fortunately, there is a great dataset on Kaggle covering around 10 years of PGA results. It is downloaded weekly from the PGA Tour website and details statistics for each player and each game across hundreds of variables, from average driving distance to % putts holed between 5–10ft. There are also some lovely people out there in the world who have discussed and shared the best ways to download, import and translate this data into a clean dataset, ready to apply some Machine Learning Techniques.

You can find the minimal data importing and cleaning methods I decided on within my GitHub repo here.

Exploratory Analysis

This is the point at which I nearly forgot that I was meant to be applying Machine Learning Techniques. I’m so interested in both golf and statistics that I’m not sure how I made it this far in life without spending more time looking at golf data!

I used Plotly for the majority of this preliminary analysis in order to be able to interrogate specific data points, and get all the juicy details.

One of my favourite trends was the negative correlation between average driving distance and driving accuracy. You can see below that the longest hitter on tour in 2018 was Rory McIlroy; however, this really hit his driving accuracy.

Distance vs. Driving Accuracy by Player and Season
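A negative correlation like this can be put into a single number with the Pearson correlation coefficient. The block below is only a minimal sketch: the data is synthetic (not real PGA Tour statistics), the variable names are made up for illustration, and the downward relationship is built into the fake data on purpose.

```python
import numpy as np

# Synthetic stand-in data: NOT real PGA Tour statistics.
rng = np.random.default_rng(0)
n_players = 200
driving_distance = rng.normal(290.0, 8.0, n_players)  # yards (hypothetical)

# Assumption for illustration: accuracy drops as distance rises, plus noise.
driving_accuracy = (
    62.0 - 0.3 * (driving_distance - 290.0) + rng.normal(0.0, 3.0, n_players)
)  # % fairways hit (hypothetical)

# Pearson correlation coefficient between the two columns.
r = np.corrcoef(driving_distance, driving_accuracy)[0, 1]
print(f"Pearson r = {r:.2f}")
```

A value close to -1 would mean longer hitters are almost always less accurate; a value near 0 would mean there is no linear relationship.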

Restricting this data to the top 50% by earnings, you can see that the majority of tour winners hit the fairway more than 55% of the time and average drives of over 280 yards. Seems manageable, right? But there’s more to it!

**Top Earning 50%** Distance vs. Driving Accuracy by Player and Season
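Restricting to the top half by earnings amounts to keeping every row at or above the median. This is a rough sketch of that filter, again on synthetic placeholder data with hypothetical variable names; the 55% and 280 yard thresholds are the ones quoted above.

```python
import numpy as np

# Synthetic placeholder data: NOT real PGA Tour statistics.
rng = np.random.default_rng(7)
n = 300
earnings = rng.lognormal(mean=13.5, sigma=1.0, size=n)  # season earnings, $
accuracy = rng.normal(61.0, 5.0, n)                     # % fairways hit
distance = rng.normal(292.0, 9.0, n)                    # average drive, yards

# Keep the top 50% of player-seasons by earnings.
top_half = earnings >= np.median(earnings)

# Share of the top half clearing both thresholds from the text.
meets_both = (accuracy[top_half] > 55.0) & (distance[top_half] > 280.0)
share = meets_both.mean()
print(f"{top_half.sum()} player-seasons kept, {share:.0%} clear both thresholds")
```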

The strong correlation between Average Putts per Round and % Greens hit in regulation (below) shows that you need at least one or the other to succeed on tour… Preferably both!
That said, Jordan Spieth circa 2015 (amongst the other high performers in red and yellow) seems to show that putting is slightly more important to a successful season than hitting Greens in Regulation.

Applying Machine Learning techniques

Following this exploratory analysis, I used a couple of ML Techniques on the 2010–2018 data. My aim here was to find the best-fitting equations for the past data and apply them to the 2019 data in order to predict a winner.

Multivariate Linear Regression Model

My first attempt at a Linear Regression Model had an r² score of 0.50 on the training data and 0.52 on the testing data. This shows that the model generalised well and wasn’t over-fitted; on the other hand, it explained only about half of the variance in the data, so it wasn’t a great prediction model.
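For anyone who wants to see the mechanics, here is a minimal NumPy-only sketch of fitting a multivariate linear regression and comparing r² on training and testing data. It is not the code behind the scores above: the features, coefficients and 75/25 split are assumptions, and the data is synthetic.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_linear(X, y):
    """Ordinary least squares with an intercept column."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

# Synthetic data: three hypothetical, standardised player statistics.
rng = np.random.default_rng(42)
n = 400
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -0.5, 0.8]) + rng.normal(0.0, 1.0, n)

# Random 75/25 train/test split (the split ratio is an assumption).
idx = rng.permutation(n)
train, test = idx[:300], idx[300:]

coef = fit_linear(X[train], y[train])
r2_train = r2_score(y[train], predict_linear(coef, X[train]))
r2_test = r2_score(y[test], predict_linear(coef, X[test]))
print(f"train r2 = {r2_train:.2f}, test r2 = {r2_test:.2f}")
```

Similar scores on the two splits suggest the model is not over-fitted; how close they are to 1 says how much of the variance it explains.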

Multivariate Polynomial Regression Model

I then tried a Polynomial Regression Model, which was more successful. The training data r² score was much higher at 0.71, and although the model fitted the testing data less well than the training data (which is to be expected), the testing data r² score still came out at 0.61, higher than the Linear Regression Model’s 0.52.
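Polynomial regression is still least squares, just on an expanded feature set that includes squares and cross-products of the original variables. The sketch below compares a degree 1 and a degree 2 fit on synthetic data with a deliberately curved relationship. The degree, the helper names and the data are all assumptions for illustration; the article does not state which degree or library was used.

```python
from itertools import combinations_with_replacement

import numpy as np

def poly_features(X, degree=2):
    """Bias column plus every product of input columns up to `degree`."""
    cols = [np.ones(len(X))]
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(range(X.shape[1]), d):
            cols.append(np.prod(X[:, list(combo)], axis=1))
    return np.column_stack(cols)

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data with a curved relationship built in on purpose.
rng = np.random.default_rng(3)
n = 400
X = rng.normal(size=(n, 2))
y = X[:, 0] + X[:, 1] ** 2 + rng.normal(0.0, 0.5, n)
train, test = np.arange(300), np.arange(300, n)

results = {}
for name, degree in [("linear", 1), ("polynomial", 2)]:
    A_train = poly_features(X[train], degree)
    A_test = poly_features(X[test], degree)
    coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
    results[name] = (
        r2_score(y[train], A_train @ coef),
        r2_score(y[test], A_test @ coef),
    )
    print(f"{name}: train r2 = {results[name][0]:.2f}, test r2 = {results[name][1]:.2f}")
```

The extra terms always fit the training data at least as well, so the testing score is the one to watch: a large gap between the two would point to over-fitting.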

I picked up a few more interesting insights along the way…

Have certain player characteristics become more important over time?

Calculating linear regression models year on year and plotting them by variable uncovered which variables have become more influential on earnings over time. The charts below show Birdie Conversion plotted against Season Earnings; the gradual increase in gradient shows how the importance of this variable has increased over time.

Birdie Conversion vs. Season Earnings from 2010 to 2018
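The year-on-year comparison boils down to fitting one straight line per season and tracking its gradient. Here is a small sketch of that idea using `np.polyfit`. The data is synthetic, the units are hypothetical, and the steepening slope is built into the fake data, so this shows the method rather than the result.

```python
import numpy as np

rng = np.random.default_rng(1)
seasons = np.arange(2010, 2019)  # 2010 to 2018 inclusive

slopes = {}
for i, season in enumerate(seasons):
    # Synthetic data: hypothetical birdie conversion (%) for 150 players.
    x = rng.normal(30.0, 3.0, 150)
    # Assumption for illustration: the true gradient grows each season.
    true_slope = 0.10 + 0.02 * i
    y = true_slope * (x - 30.0) + rng.normal(0.0, 0.3, 150)  # earnings, $m

    # Degree-1 polyfit returns (gradient, intercept) of the best-fit line.
    slope, intercept = np.polyfit(x, y, deg=1)
    slopes[int(season)] = slope

for season, slope in slopes.items():
    print(f"{season}: gradient = {slope:.3f}")
```

With real data, the same loop would run over each season's rows of the cleaned dataset instead of generating numbers.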

It’s a similar story with % Greens in Regulation, which has also become more important over time.

% Greens in Regulation vs. Season Earnings from 2010 to 2018

However, the importance of Average Putts per Round hasn’t changed much over time.

Avg. Putts per Round vs. Season Earnings from 2010 to 2018

Across all of these variables, the data seems to have become more skewed over time, with a few more really positive outliers each season. The implication is that being an all-rounder doesn’t seem to cut it any more: to be a PGA Tour Winner in 2019, you need to be exceptional.

Mini Conclusion

Here ends Part 1: Exploration and Regression Models. I really enjoyed this project, not least because I learned so much along the way. To avoid data fatigue, I’ve put the second part of my project in another article, focusing on K-Means Clustering, Decision Trees and one final exciting prediction, which you can find here.

Spoiler alert: I made some money. Needless to say, it’s worth a read!