Wake up and Golf!!

Nayana Kumari
Web Mining [IS688, Spring 2021]
Feb 11, 2021

An exploratory data analysis of the most suitable days to golf in your area.

According to my father, golf tops the list of the most popular sports in the world. I used to argue with him that it does not take much talent to ride in a cart and drink beer, and that many other sports require more strength and skill. Over the years, I watched the game and, in all honesty, had to admit that it is a challenging one. The outcome of a golf game (success or failure) rests entirely on your shoulders, and you cannot blame anyone else.

A typical field in a PGA Tour event has about 144 players but only one winner each week. Tiger Woods, arguably the best golfer of modern times, has a winning percentage of .228, compared with .500 for players in other sports. Yep, I am convinced: this would be my topic for this exploratory data analysis. I hope golf enthusiasts like my father will find it helpful and interesting :)

Golfers would be thrilled if the weather were ideal every time they stepped out to play. Unfortunately, that is not the case. That is why I decided to understand how much weather impacts golf and to analyze the major parameters to find suitable days to play outdoor golf in the next 15 days (my data was limited to that window).

Data Collection

To start the analysis, the primary requirement is data. After some research, I found a free API service (RapidAPI). I am using this service to collect the weather forecast for Jersey City, NJ, for the next 15 days through the following API URL: https://rapidapi.com/weatherbit/api/weather?endpoint=apiendpoint_6d494085-df7b-4efe-8f42-209851a6f9a5

I extracted the forecast data by providing the latitude and longitude of Jersey City, NJ, to the API call. I chose this city because I live here, so the results are more relevant to me. Any other city could be picked by using its latitude and longitude as input for the API call. The data I received is in JSON format and has a complex, nested structure.

The forecast data includes weather condition values such as temperature, humidity, visibility, and precipitation. A subset of these parameters plays a crucial role in predicting favorable conditions to play, which we will discuss later.

Data Retrieval

I used the Python http library, a generic way to get data from REST APIs over HTTPS. I also used the json library to pretty-print the data for readability and easier visual analysis.

Using the PyCharm IDE, I built a Python script to extract the raw JSON data. I used the latitude and longitude of Jersey City as input to get the weather forecast data.

Note: The API key is hidden from the image above on purpose.
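The retrieval step can be sketched roughly as follows. This is a minimal sketch, not the original script: the host name, the `/forecast/daily` path, the header names, and the `YOUR_RAPIDAPI_KEY` placeholder are assumptions based on how RapidAPI-hosted services are typically called, and the coordinates are approximate values for Jersey City.

```python
import http.client
import json
from urllib.parse import urlencode

# Assumed values: check the RapidAPI console for the actual host and path.
API_HOST = "weatherbit-v1-mashape.p.rapidapi.com"  # assumption
API_PATH = "/forecast/daily"                       # assumption
JERSEY_CITY = {"lat": 40.7178, "lon": -74.0431}    # approximate coordinates


def build_request(lat, lon, api_key):
    """Return the request path and headers for a forecast lookup."""
    query = urlencode({"lat": lat, "lon": lon})
    headers = {
        "x-rapidapi-key": api_key,   # keep the real key out of source control
        "x-rapidapi-host": API_HOST,
    }
    return f"{API_PATH}?{query}", headers


def fetch_forecast(lat, lon, api_key):
    """Call the API over HTTPS and return the parsed JSON payload."""
    path, headers = build_request(lat, lon, api_key)
    conn = http.client.HTTPSConnection(API_HOST, timeout=30)
    try:
        conn.request("GET", path, headers=headers)
        response = conn.getresponse()
        if response.status != 200:
            raise RuntimeError(f"API returned HTTP {response.status}")
        return json.loads(response.read().decode("utf-8"))
    finally:
        conn.close()


def pretty(payload):
    """Format a payload with indentation for easier reading."""
    return json.dumps(payload, indent=2, sort_keys=True)
```

With a valid key, `print(pretty(fetch_forecast(**JERSEY_CITY, api_key="YOUR_RAPIDAPI_KEY")))` would print the indented response. Keeping `build_request` separate from the network call makes it easy to swap in another city's coordinates.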

The raw JSON looks like this:

Data Cleaning

Identifying the relevant columns: Based on some research, I worked out what counts as favorable conditions for outdoor golf. Many factors could contribute to the recommendation, but the major ones I found were wind, precipitation, and visibility.

Creating a flat, two-dimensional data structure from the complex JSON structure: The next challenge was to flatten the nested JSON data into a two-dimensional dataset, which is easier to plot and more straightforward to analyze.

I used the pandas library (Python) to slice the data down to a limited set of columns as follows:
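A minimal sketch of the flattening step, assuming the forecast records sit in a `data` list inside the response. The payload below is made up for illustration, and the field names (`valid_date`, `wind_gust_spd`, `precip`, `vis`) are assumptions about the response rather than a confirmed schema.

```python
import pandas as pd

# Made-up payload that mimics the nested shape of a forecast response.
# The field names are assumptions, not a confirmed schema.
sample_payload = {
    "city_name": "Jersey City",
    "data": [
        {"valid_date": "2021-02-11", "wind_gust_spd": 5.2, "precip": 0.0,
         "vis": 12.0, "weather": {"description": "Clear sky"}},
        {"valid_date": "2021-02-12", "wind_gust_spd": 9.8, "precip": 3.5,
         "vis": 6.0, "weather": {"description": "Light rain"}},
    ],
}

# Columns kept for the golf analysis.
KEEP = ["valid_date", "wind_gust_spd", "precip", "vis"]


def flatten_forecast(payload):
    """Turn the nested 'data' list into a flat table with selected columns."""
    frame = pd.json_normalize(payload["data"])  # nested dicts become dotted columns
    return frame[KEEP].copy()


forecast = flatten_forecast(sample_payload)
print(forecast)
```

`pd.json_normalize` expands nested objects such as `weather` into columns like `weather.description`, so selecting `KEEP` afterwards is what reduces the table to the parameters of interest.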

Adding a derived column that flags whether playing is recommended: After the data was flattened, I added a derived column (recommended) to the dataset.

To create this column, I used the assumptions below. I experimented with various baseline values, but they did not yield any favorable conditions to play. So, for this exercise and for simplicity in the analysis, I chose the logic below:

So, the logic says that if the wind gust speed is less than 7 miles per hour, precipitation is less than 1 millimeter, and visibility is more than 10 miles, we call it a favorable condition. We could use other baselines, but this one shows us some recommended days to play. The selection is subjective and made purely in the interest of the exploratory data analysis.
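The rule described above can be sketched as a small function. The thresholds come straight from the text; the constant and function names are hypothetical, and the sample rows are made up.

```python
# Thresholds as stated in the article; units follow the text as written.
MAX_GUST = 7.0         # wind gust speed must be below this
MAX_PRECIP = 1.0       # precipitation must be below this
MIN_VISIBILITY = 10.0  # visibility must be above this


def recommended(gust, precip, visibility):
    """Return 1 when all three conditions hold, otherwise 0."""
    ok = gust < MAX_GUST and precip < MAX_PRECIP and visibility > MIN_VISIBILITY
    return int(ok)


# Made-up rows for illustration: (gust, precipitation, visibility).
rows = [(5.2, 0.0, 12.0), (9.8, 3.5, 6.0), (6.0, 0.5, 10.0)]
flags = [recommended(*row) for row in rows]
print(flags)
```

All three comparisons are strict, so a day with visibility of exactly 10 is not flagged. On a pandas DataFrame the same rule can be written as a combined boolean mask cast with `.astype(int)`.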

Date field conversion: To make the graph more readable, I converted the ‘Valid_date’ column (e.g., 2021-02-11) to Day 1, Day 2, ..., Day 16, counting from the current date.
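One way to do this conversion, assuming the dates arrive as ISO strings and that the current date counts as Day 1. The function name `day_label` is hypothetical.

```python
from datetime import date


def day_label(valid_date, start_date):
    """Map an ISO date string to 'Day N', where the start date is Day 1."""
    offset = date.fromisoformat(valid_date) - date.fromisoformat(start_date)
    return f"Day {offset.days + 1}"


print(day_label("2021-02-11", "2021-02-11"))
print(day_label("2021-02-26", "2021-02-11"))
```

Computing the label from the date difference, rather than from the row position, keeps the labels correct even if a day is missing from the data.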

After the processing, the final dataset looks like this:

Here, the ‘Recommended’ column holds binary values:

0: Not recommended

1: Recommended

Exploratory Analysis

Now, let us look at the above analysis in visual form; as they say, a picture is worth a thousand words.

Plotting the graph: I used the matplotlib Python library to plot the graphs.

I plotted two graphs for analysis:

1) Overall graph to depict the fluctuations of the various parameters involved in the analysis: Remember, all the parameters must fall within their acceptable ranges, as stated above, at the same time for a day to be suitable for play.

2) Final graph to show the days when playing golf is recommended: As we can see, our program found 4 days in the next 15 days when outdoor golf can be played in pleasant weather conditions.
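The two graphs could be produced along these lines. This is a sketch with made-up placeholder values, not the data or the styling from the original charts, and the function name `plot_forecast` is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt

# Made-up placeholder values, not the article's data.
days = ["Day 1", "Day 2", "Day 3", "Day 4"]
gust = [5.2, 9.8, 6.1, 12.4]
precip = [0.0, 3.5, 0.2, 5.0]
visibility = [12.0, 6.0, 11.5, 4.0]
flags = [1, 0, 1, 0]


def plot_forecast(days, gust, precip, visibility, flags):
    """Draw the parameter trends (top) and the recommendation flag (bottom)."""
    fig, (top, bottom) = plt.subplots(2, 1, figsize=(8, 6), sharex=True)
    top.plot(days, gust, marker="o", label="Wind gust speed")
    top.plot(days, precip, marker="o", label="Precipitation")
    top.plot(days, visibility, marker="o", label="Visibility")
    top.set_ylabel("Value")
    top.legend()
    bottom.bar(days, flags, color="green")
    bottom.set_yticks([0, 1])
    bottom.set_ylabel("Recommended")
    bottom.tick_params(axis="x", rotation=45)
    fig.tight_layout()
    return fig


fig = plot_forecast(days, gust, precip, visibility, flags)
```

Calling `fig.savefig("golf_days.png")` would write the figure to disk. Sharing the x-axis between the two panels makes it easy to see which parameter values line up with each recommended day.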

Summary

So, we have 4 days when outdoor golf can be played in and around Jersey City golf courses in the next 15 days. Not bad!!

In this analysis, we focused on only a set of columns for golf playing conditions, but this weather forecast data has more to offer.

Besides the obvious answer I was looking for, I found some other insights while analyzing this data. Interestingly, wind speed and precipitation tend to rise and fall together, and visibility moves inversely to both. That means when wind speed and precipitation rise, visibility drops, and the chances of playing golf outside decrease as well.
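A relationship like this can be checked numerically with a correlation coefficient. The sketch below computes Pearson's r by hand; it is an illustrative check, not part of the original analysis, and the sample values are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up values for illustration only.
wind = [5.2, 9.8, 6.1, 12.4]
visibility = [12.0, 6.0, 11.5, 4.0]
print(round(pearson(wind, visibility), 2))
```

A value near +1 means two series move together, and a value near -1 means one falls as the other rises. With only about two weeks of forecast data, any such coefficient should be read as a hint rather than a firm conclusion.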

In the future, we can analyze the data from various other angles. For example, a professional golfer practices not only in favorable conditions but also in difficult ones to master the game. So, we could find days when pro golfers could hone their skills, think creatively, and practice in challenging situations.

The same data could also be used to find safe farming conditions, plan fishermen’s trips, or even schedule freight fleets’ travel itineraries.

Limitations

Several factors limited the ability and accuracy of this analysis, and it can be improved.

Limited data set: We only had 15 days of forecast data, as this is the maximum I could find from free API services. The limited data also influenced our baseline, as we had to make assumptions based on the range of values available in the data.

Visualization tools: I was restricted to Python libraries for graphs since I have yet to learn and practice with enhanced visualization tools like Tableau. I hope to create more meaningful graphical representations once I learn those techniques and tools.
