DATA STORIES | SPORT ANALYTICS | KNIME ANALYTICS PLATFORM
Winning NFL Betting with KNIME, Wind Data and Regression Analysis
Discover the impact of wind speed on NFL games and learn how to automatically collect weather data to gain a decisive advantage along with other NFL key metrics to outsmart oddsmakers
Last time we saw that we could use logistic regression to estimate team ratings based on past results and some extra indicators such as rest days and whether it was a divisional game or not.
Furthermore, adding wind speed as a predictor improved the model by 5 percentage points.
The full article can be found here:
The downside of this additional indicator was that we could not (yet) use it for predictions, as we did not yet have any forecast data available for the weather data. Moreover, the data from nfldata lacked values that we imputed using the mean value of the wind speed.
For that reason, this time we will be looking at the automatic collection of wind speed data.
Getting the wind data from a Weather API
The website open-meteo.com seems to be made for us: it offers both historical weather data and weather forecasts for the next few days.
So let’s understand how to create the appropriate URL for the API calls, including the parameters required to load the necessary data into KNIME.
If you don’t know KNIME yet, you can find a good “Getting Started Guide” here, where you can also download the open-source software for free.
The visual programming language of KNIME is self-explanatory and therefore easy to learn.
The input mask for Open-Meteo requires two geographic parameters: longitude and latitude, along with the desired date range.
A wide range of weather data can also be called up, such as wind speed, precipitation, and more.
The output of your selection can be downloaded as an Excel or CSV file.
We only want the API URL, not the file.
To determine the required wind speed data, we will only need to adjust the longitude and latitude data and the start and end dates. The following figure shows where the corresponding values can be found in the URL.
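For readers who want to see the URL outside of KNIME, here is a minimal Python sketch of how such a request URL could be assembled. The endpoint and the variable name `wind_speed_10m` are assumptions based on Open-Meteo's public documentation, and the coordinates are approximate, illustrative values. The safest check is to compare the result with the URL that the website's input mask generates.

```python
from urllib.parse import urlencode

# Assumed endpoint of Open-Meteo's historical weather API; compare it with
# the URL produced by the website's input mask before relying on it.
ARCHIVE_ENDPOINT = "https://archive-api.open-meteo.com/v1/archive"


def build_archive_url(latitude, longitude, start_date, end_date,
                      hourly="wind_speed_10m"):
    """Return a request URL for hourly data at one location and date range."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "start_date": start_date,  # ISO format, YYYY-MM-DD
        "end_date": end_date,
        "hourly": hourly,          # assumed name of the 10 m wind speed variable
    }
    return f"{ARCHIVE_ENDPOINT}?{urlencode(params)}"


# Approximate, illustrative coordinates and dates (not taken from the article).
url = build_archive_url(33.5276, -112.2626, "2023-09-07", "2023-11-27")
print(url)
```

Only the four values passed to `build_archive_url` change from stadium to stadium, which is exactly what the String Manipulation node has to substitute.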
The first step will be to create a table with the names of the stadiums of all NFL teams and search for the corresponding geographical values on Google Maps. The second step involves constructing a KNIME workflow to automate data collection.
A table containing team names, their respective stadiums, and geographical coordinates is imported into KNIME. A separate table holds the start and end dates for data retrieval.
The Cross Joiner node merges both tables, enabling us to loop through each stadium (blue node) and retrieve the corresponding weather data.
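As a rough illustration of what the cross join does, the following Python sketch pairs every stadium row with every date-range row and then loops over the result. The two small tables, their column names, and the coordinates are hypothetical stand-ins, not the tables used in the workflow.

```python
from itertools import product

# Hypothetical stand-ins for the two input tables; coordinates are approximate.
stadiums = [
    {"team": "ARI", "latitude": 33.53, "longitude": -112.26},
    {"team": "GB", "latitude": 44.50, "longitude": -88.06},
]
date_ranges = [{"start_date": "2023-09-07", "end_date": "2023-11-27"}]

# Cross join: every stadium row is combined with every date-range row.
requests_table = [
    {**stadium, **dates} for stadium, dates in product(stadiums, date_ranges)
]

# The loop then handles one stadium (one row) per iteration.
for row in requests_table:
    print(row["team"], row["latitude"], row["longitude"],
          row["start_date"], row["end_date"])
```

With a single date-range row, the result simply has one row per stadium, each carrying the start and end dates it needs for its own request.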
The following four nodes do most of the work. The String Manipulation node creates the URL path with the geographical coordinates of the corresponding stadium.
The GET Request node takes the URL path and sends a request to the open-meteo API. The output is a JSON file with all the needed information. Now we need to convert this JSON column into a suitable form. KNIME also offers a simple and efficient solution for this.
The JSON to Table node converts the JSON column into a table column and the Ungroup node finally converts it into a table with the date values and the respective weather information in different columns.
Show me a solution that can create this more easily!
The consolidated weather data for all NFL stadiums, spanning the entire season to date, is then written to a CSV file.
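The JSON-to-table and ungroup steps can be sketched in a few lines of Python. The sample response below is hand-written in the shape we assume an Open-Meteo answer has (parallel `time` and `wind_speed_10m` lists under `hourly`); the values are made up, and no request is actually sent.

```python
import csv
import io
import json

# Hand-written sample in the assumed shape of an Open-Meteo response.
# All values are made up for illustration.
sample_response = """
{
  "latitude": 33.5,
  "longitude": -112.25,
  "hourly_units": {"time": "iso8601", "wind_speed_10m": "km/h"},
  "hourly": {
    "time": ["2023-11-26T16:00", "2023-11-26T17:00", "2023-11-26T18:00"],
    "wind_speed_10m": [9.4, 11.2, 7.9]
  }
}
"""


def hourly_rows(payload, team):
    """Turn the parallel 'hourly' lists into one row per timestamp."""
    hourly = payload["hourly"]
    return [
        {"team": team, "time": time, "wind_speed": speed}
        for time, speed in zip(hourly["time"], hourly["wind_speed_10m"])
    ]


rows = hourly_rows(json.loads(sample_response), team="ARI")

# Written to an in-memory buffer here; a real run would open a CSV file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["team", "time", "wind_speed"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The `zip` call plays the role of the Ungroup node: it turns two list-valued cells into one row per hour.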
Since we are interested in both historical data and wind speed forecasts for our NFL game prediction model, we will also develop a workflow to collect wind speed forecasts.
It looks practically the same as the workflow for collecting historical data. The difference is that it creates a different URL path that refers to the following page: https://open-meteo.com/en/docs
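A hedged sketch of the forecast variant: the endpoint and the `forecast_days` parameter are assumptions based on the Open-Meteo documentation page linked above, and the coordinates are again approximate and illustrative. Instead of a start and end date, the request asks for a number of days ahead.

```python
from urllib.parse import urlencode

# Assumed endpoint of Open-Meteo's forecast API; see the docs page for details.
FORECAST_ENDPOINT = "https://api.open-meteo.com/v1/forecast"


def build_forecast_url(latitude, longitude, forecast_days=7,
                       hourly="wind_speed_10m"):
    """Return a request URL for an hourly forecast at one location."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "hourly": hourly,                # assumed variable name
        "forecast_days": forecast_days,  # assumed parameter: days ahead
    }
    return f"{FORECAST_ENDPOINT}?{urlencode(params)}"


forecast_url = build_forecast_url(33.5276, -112.2626)
print(forecast_url)
```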
The KNIME workflows with all the examples can be found on my KNIME Community Hub space.
Merging wind data with NFL games
Let’s go back to our NFL team rating workflow from last time and update it with the new wind speed data. In order to merge the wind data with our table of NFL games, however, we need to apply a special solution.
The reason for this is that, on the one hand, we have the match date and the start time for the game, while for the weather data we have values for all hours of the day.
Directly joining weather data to match timestamps is not feasible due to variable match start times and extended game durations.
Our solution involves creating two new columns: one representing the time 30 minutes before the match, and the other representing the match’s end time. Subsequently, we join the match data to the weather data using the home-team column.
To ensure that we only refer to the relevant weather data for the game, we filter the joined data so that only weather rows with a timestamp between 30 minutes before kickoff and 3 hours after kickoff are taken into account. We achieve this using a Rule-based Row Filter node, in which we can set up the corresponding rule.
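The join-then-filter logic can be sketched in Python as follows. The rows, column names, and wind values are hypothetical; the window (30 minutes before kickoff to 3 hours after) follows the description above, with the assumption that the "end time" column is simply kickoff plus three hours.

```python
from datetime import datetime, timedelta

# Hypothetical rows; column names and wind values are illustrative only.
games = [{"home_team": "ARI", "kickoff": datetime(2023, 11, 26, 16, 5)}]
weather = [
    {"team": "ARI", "time": datetime(2023, 11, 26, hour, 0), "wind_speed": speed}
    for hour, speed in [(14, 8.0), (15, 9.1), (16, 9.4), (17, 11.2),
                        (18, 7.9), (19, 6.5), (20, 5.0)]
]

BEFORE_KICKOFF = timedelta(minutes=30)
AFTER_KICKOFF = timedelta(hours=3)  # assumed game duration


def weather_during_game(game, weather_rows):
    """Join on the home team, then keep only rows inside the game window."""
    window_start = game["kickoff"] - BEFORE_KICKOFF
    window_end = game["kickoff"] + AFTER_KICKOFF
    return [
        row for row in weather_rows
        if row["team"] == game["home_team"]
        and window_start <= row["time"] <= window_end
    ]


matched = weather_during_game(games[0], weather)
print([row["time"].strftime("%H:%M") for row in matched])
# prints ['16:00', '17:00', '18:00', '19:00']
```

For a 16:05 kickoff the window runs from 15:35 to 19:05, so the hourly readings at 16:00, 17:00, 18:00 and 19:00 survive the filter.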
But we still have too many rows per game, because there is a row of weather information for every hour. In the example below, the Arizona Cardinals game started at 16:05. Within the three-hour period from 16:00 to 19:00, four hourly wind speed values were recorded.
How can we now select or aggregate the best value from these four to provide the best information for our predictions?
- Maximum wind speed: This could be used to predict the likelihood of a turnover or other unexpected event caused by the wind.
- Mean wind speed: This could be used to predict the overall impact of the wind on the game, such as whether it will favor passing or running attacks.
We decide on a third alternative: median wind speed. The median wind speed would be a more appropriate choice for aggregating wind speeds in cases where there are significant variations throughout the game. This would provide a more representative and robust measure of the wind conditions during the game.
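To make the difference between the three aggregates concrete, here is a small Python comparison on four made-up hourly readings, where the last one is an unusually strong gust. The numbers are illustrative and not taken from the game data.

```python
from statistics import mean, median

# Four made-up hourly wind speeds for one game; the last value is a gust.
readings = [9.4, 11.2, 7.9, 24.0]

summary = {
    "max": max(readings),
    "mean": mean(readings),
    "median": median(readings),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In this example the single gust pulls the mean up to roughly 13.1, while the median stays at roughly 10.3, close to the three typical readings. That robustness to a single outlying hour is the reason for preferring the median here.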
Predicting game winner with regression analysis
The model without the wind speed data achieves an accuracy of 73.89% by the end of week 12.
By including the wind speed, the model’s accuracy increases to 76.67%, a gain of 2.78 percentage points. This is an impressive increase. But this time we don’t just want to know who will win, but also how big the points differential will be.
To achieve this, we will create a metamodel with a further regression analysis that will use the win probability of the teams to predict the margin of victory (MOV).
Since we have such a small data set, we will aggregate the win probabilities from the first model into bins. By binning the probabilities of home wins into quantiles, we see that there is a non-linear relationship with the margins of victory.
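A minimal sketch of the quantile binning, using synthetic (win probability, margin of victory) pairs that are made up for illustration. The games are sorted by probability, split into equally sized bins, and the mean margin of victory is computed per bin. The number of bins is an assumption.

```python
from statistics import mean

# Synthetic (home win probability, margin of victory) pairs, made up for
# illustration; they are not results from the model in this article.
games = [
    (0.15, -17), (0.22, -10), (0.31, -6), (0.38, -3), (0.45, -1), (0.52, 1),
    (0.58, 3), (0.66, 4), (0.74, 8), (0.81, 13), (0.88, 20), (0.93, 27),
]


def quantile_bins(pairs, n_bins):
    """Sort by probability and split into n_bins groups of (nearly) equal size."""
    ordered = sorted(pairs)
    n = len(ordered)
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]


bins = quantile_bins(games, n_bins=4)  # 4 bins is an arbitrary choice
bin_summary = [
    (mean(p for p, _ in chunk), mean(m for _, m in chunk)) for chunk in bins
]

for avg_prob, avg_mov in bin_summary:
    print(f"mean probability {avg_prob:.2f} -> mean MOV {avg_mov:.1f}")
```

In this synthetic data the mean margin rises by 10 points from the first to the second bin, by 6 points from the second to the third, and by 15 points from the third to the fourth, so the steps are clearly unequal. A pattern of this kind is what a straight line cannot capture.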
So the next step is to build a polynomial regression model in KNIME to predict the margin of victory using the probabilities of winning. This model not only gives us the predictions for the margins of victory, but also outperforms the first model by more than three percentage points and achieves an accuracy of 80.56%.
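For illustration, a polynomial fit of this kind can be written in plain Python. This is a hand-rolled least-squares sketch via the normal equations, not the KNIME node used in the workflow. The degree (3) is an assumption, since the degree is not stated above, and the data points are synthetic.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares fit; returns c with y ~ c[0] + c[1]*x + ... + c[degree]*x**degree."""
    n = degree + 1
    # Normal equations A c = b with A[i][j] = sum(x^(i+j)), b[i] = sum(y * x^i).
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - rest) / a[i][i]
    return coeffs


def predict(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))


# Synthetic (home win probability, margin of victory) pairs, made up here.
probs = [0.15, 0.22, 0.31, 0.38, 0.45, 0.52, 0.58, 0.66, 0.74, 0.81, 0.88, 0.93]
movs = [-17, -10, -6, -3, -1, 1, 3, 4, 8, 13, 20, 27]

coeffs = fit_polynomial(probs, movs, degree=3)  # degree 3 is an assumption
predicted = [predict(coeffs, p) for p in probs]

# The predicted winner is the home team when the predicted margin is positive.
hits = sum((pred > 0) == (actual > 0) for pred, actual in zip(predicted, movs))
print(f"share of winners called correctly: {hits / len(movs):.2%}")
```

The sign of the predicted margin doubles as a winner prediction, which is how a margin model can also be scored on accuracy.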
Now we are ready to make our forecasts for week 13! Check a portion of the predictions in Fig. 18 — tadaaa!
Are you also interested in the predictions for all games in week 13? Then download the workflow for free from my KNIME Community Hub space to see the full predictions, and let me know if you have any questions.
I wish you lots of fun!
Material for this project:
- KNIME workflow: KNIME Community Hub
- NFL data: https://github.com/nflverse/nfldata/tree/master
- NFL data dictionary: https://nflreadr.nflverse.com/articles/dictionary_schedules.html
- Weather data from https://open-meteo.com/
Thanks for reading and may the Data Force be with you! Please feel free to share your thoughts or reading tips in the comments.