Module Assignment 2: Missing or mis-shaped data

Noah Podolske
4 min read · Sep 30, 2019

For my second module assignment, I was struggling to find a dataset with missing and/or misshapen data, so I decided to pick something that was guaranteed to have gaps in what was tracked, and I arrived at the PGA Tour dataset. It is very difficult to track all the data for all the golfers across a multi-year span. Additionally, some of the statistics used require multiple performances with specific measurements that not all PGA Tour events tracked during the specified time period. Many golfers also did not participate in very many events, leading to missing data. Given all these factors, I finally landed on the following question: has the PGA Tour gotten better quality of play from its players over time?

This data was difficult to work with for two big reasons. First, this dataset is very big, something we are all going to need to learn how to deal with in the job market, but its size makes it difficult to check your work. I spent hours slaving over this code because I couldn't tell whether the answers I was getting were correct, and it was maddening to get different answers and not be able to check which one was correct. The second obstacle I ran into was golfers not participating in certain events. To fix this, I went right to cleaning and reshaping the data to fit my needs.

I was able to eliminate a lot of the missing data by throwing out the years in which a golfer did not play on the PGA Tour. Then I cut out all the golfers who did not have stats for the variables I am analyzing. This cuts the data down to just the golfers who made an impact in the year they played. At this point, the data looks like the following:
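As a rough illustration of the two cleaning steps just described (not the actual code or data from this assignment), here is a minimal pandas sketch. The toy table, the column names (`rounds`, `avg_score`, `avg_scrambling`), and the idea that an unplayed season shows up as a missing `rounds` value are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row per golfer per season; NaN means "not tracked".
raw = pd.DataFrame({
    "player": ["A", "A", "B", "B", "C", "C"],
    "year": [2010, 2011, 2010, 2011, 2010, 2011],
    "rounds": [80, np.nan, 75, 60, np.nan, 12],
    "avg_score": [70.1, np.nan, 71.3, 70.8, np.nan, 72.0],
    "avg_scrambling": [58.2, np.nan, 60.1, 57.5, np.nan, np.nan],
})

# Step 1: throw out seasons the golfer did not play (assumed: no rounds recorded).
played = raw.dropna(subset=["rounds"])

# Step 2: cut golfer-seasons that lack any of the statistics being analyzed.
stats_needed = ["avg_score", "avg_scrambling"]
cleaned = played.dropna(subset=stats_needed).reset_index(drop=True)

print(cleaned)
```

In this sketch, golfer C's 2011 season survives the first step but is removed by the second, because scrambling was never recorded for it.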

You can see some trends here, but there is still a lot of unnecessary data that makes it inefficient to look for consistent patterns. Once I clear out all the columns that we do not need to analyze, and the data is clean and filled, you get the following graph.
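For readers who want to try something similar, dropping the unneeded columns and averaging each statistic by year might look roughly like the sketch below. The data, the column names, and the choice of a simple per-year mean are assumptions for illustration, not the exact steps behind the graph.

```python
import pandas as pd

# Hypothetical cleaned data; "country" stands in for a column we do not need.
cleaned = pd.DataFrame({
    "player": ["A", "B", "A", "B"],
    "year": [2010, 2010, 2011, 2011],
    "country": ["USA", "ENG", "USA", "ENG"],
    "avg_score": [70.0, 72.0, 71.0, 71.0],
    "fairway_pct": [60.0, 64.0, 59.0, 61.0],
})

# Keep only the year and the statistics we want to follow over time.
keep = ["year", "avg_score", "fairway_pct"]

# One row per year, one column per statistic, averaged across golfers.
yearly = cleaned[keep].groupby("year").mean()

print(yearly)
```

With matplotlib installed, `yearly.plot()` would then draw one line per statistic across the years.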

As you can see from the graph, some statistics stayed the same over time. One of them is average strokes gained total, a golf stat that shows how a specific golfer compares to the rest of the field: basically, how many times they mess up versus the average PGA Tour player. Because this is a relative statistic, I should have predicted that it would stay at zero the whole time; that makes sense given how the statistic is calculated. Average putts also stayed the same, which makes sense because putting is difficult to improve once a player reaches a certain elite level.

The first interesting stat on this graph is average scrambling. For non-golfers, scrambling is basically a player's ability to make par or better after failing to reach the green in the number of shots needed for par. The fluctuation here is scary because scrambling is currently trending downward, which is a bad thing for the sport of golf. It means that miraculous shots are becoming more and more rare, making the sport less exciting. This could also be attributed to variance in the performance of top golfers in years such as 2012 and 2017, but I would attribute the change in scrambling to the courses played in recent years.

Fairway percentage seems to be going down as well, meaning that, at a very big-picture scale, woods and high wedges are not being used as accurately as they were in the early 2010s. Because of the courses they have been playing on, I do not read this as a sign of worse play, though perhaps I should. Average putts and average score stayed the same, as I predicted, and the dip in rounds in 2013 can be explained by the fact that, for some reason I could not figure out, there were only 40 money events on the tour that year. So overall, there is no cause for concern: there is not enough evidence to show that PGA Tour golfers have gotten worse.
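To see why a relative statistic like strokes gained total averages out to zero, here is a small sketch. It assumes a simplified definition (field average score minus the player's own score) and uses made-up scores, so treat it as an illustration of the arithmetic rather than the official calculation.

```python
from statistics import mean

# Hypothetical round scores for a four-player field (lower is better).
scores = {"A": 68.0, "B": 70.0, "C": 72.0, "D": 74.0}

field_average = mean(scores.values())

# Simplified definition (an assumption): strokes gained = field average - own score.
strokes_gained = {player: field_average - s for player, s in scores.items()}

print(strokes_gained)
print(mean(strokes_gained.values()))
```

Each player's gain is offset by someone else's loss, so the average over the whole field is zero. Over a filtered subset of golfers the average would not have to be exactly zero, but it should stay close to it.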
