Low Code for Data Science

A journal of articles written by (and for) the KNIME Community around visual programming, data science algorithms & techniques, integration with external tools, case studies, success stories, data processing, and (of course) KNIME Software.

DATA STORIES | SPORT ANALYTICS | KNIME ANALYTICS PLATFORM

Winning NFL Betting with KNIME, Wind Data and Regression Analysis

8 min read · Dec 1, 2023

Fig 1: Winning NFL betting with Yodime (image by author).

Last time we saw that we could use logistic regression to estimate team ratings based on past results and some extra indicators such as rest days and whether it was a divisional game or not.

Furthermore, adding wind speed as an indicator improved the model significantly, by 5 percentage points.

The full article can be found here:

The downside of this additional indicator was that we could not (yet) use it for predictions, because we had no weather forecast data available. Moreover, the nfldata data set had missing wind speed values, which we imputed with the mean wind speed.

For that reason, this time we will be looking at the automatic collection of wind speed data.

Getting the wind data from a Weather API

The website open-meteo.com seems to be made for us: it offers both historical weather data and weather forecasts for the next few days.

Fig 2: The Open-Meteo website (image taken from homepage).

So let’s understand how to create the appropriate URL for the API calls, including the parameters required to load the necessary data into KNIME.

If you don’t know KNIME yet, you can find a good “Getting Started Guide” here, where you can also download the open-source software for free.

The visual programming language of KNIME is self-explanatory and therefore easy to learn.

Fig 3: Getting Set Up with KNIME Analytics Platform (image from KNIME).

The input mask for Open-Meteo requires two geographic parameters: longitude and latitude, along with the desired date range.

A wide range of weather data can also be called up, such as wind speed, precipitation, and more.

Fig 4: Input mask on open-meteo (image from author).

The output of your selection can be downloaded as an Excel or CSV file. We only want the API URL, not the file.

To determine the required wind speed data, we will only need to adjust the longitude and latitude data and the start and end dates. The following figure shows where the corresponding values can be found in the URL.

Fig 5: Generated API URL (image from author).
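For readers who want to see the URL assembly outside KNIME, here is a minimal Python sketch. It assumes the historical endpoint `https://archive-api.open-meteo.com/v1/archive` and the hourly variable name `wind_speed_10m`; both should be checked against the URL that the input mask generates. The function name `build_wind_url`, the coordinates and the dates are placeholders for illustration.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Open-Meteo historical API; verify it against
# the URL that the input mask generates.
ARCHIVE_ENDPOINT = "https://archive-api.open-meteo.com/v1/archive"


def build_wind_url(latitude, longitude, start_date, end_date,
                   endpoint=ARCHIVE_ENDPOINT):
    """Return an API URL for hourly wind speed at one location."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "start_date": start_date,    # ISO format: YYYY-MM-DD
        "end_date": end_date,
        "hourly": "wind_speed_10m",  # assumed variable name
    }
    return f"{endpoint}?{urlencode(params)}"


# Placeholder coordinates and dates, for illustration only.
url = build_wind_url(33.5276, -112.2626, "2023-09-07", "2023-11-27")
print(url)
```

Only the four values passed to the function change from stadium to stadium, which is exactly what the String Manipulation node will take care of later in the workflow.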

The first step will be to create a table with the names of the stadiums of all NFL teams and search for the corresponding geographical values on Google Maps. The second step involves constructing a KNIME workflow to automate data collection.

Fig 6: Search for geographical values of NFL stadiums (image from author).

A table containing team names, their respective stadiums, and geographical coordinates is imported into KNIME. A separate table holds the start and end dates for data retrieval.

The Cross Joiner node merges both tables, enabling us to loop through each stadium (blue node) and retrieve the corresponding weather data.

Fig 7: KNIME Workflow to retrieve historic weather info (image from author).
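What the Cross Joiner does can be sketched in a few lines of Python. This is only an illustration of the idea, not the KNIME node itself; the two stadium rows use approximate coordinates and the date range is a placeholder.

```python
from itertools import product

# Illustrative rows with approximate coordinates; the real table
# holds the stadiums of all NFL teams.
stadiums = [
    {"team": "ARI", "latitude": 33.5276, "longitude": -112.2626},
    {"team": "GB", "latitude": 44.5013, "longitude": -88.0622},
]
# Placeholder date range for the data retrieval.
date_ranges = [{"start_date": "2023-09-07", "end_date": "2023-11-27"}]

# Cross join: every stadium row is combined with every date-range row.
requests_table = [{**stadium, **dates}
                  for stadium, dates in product(stadiums, date_ranges)]

# One iteration per row corresponds to one pass of the loop.
for row in requests_table:
    print(row["team"], row["start_date"], row["end_date"])
```

With a single date-range row, the result has exactly one row per stadium, each carrying the start and end dates it needs for its API call.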

The following four nodes do most of the work. The String Manipulation node creates the URL path with the geographical coordinates of the corresponding stadium.

Fig 8: The String Manipulation node creates URL path (image from author).

The GET Request node takes the URL and sends a request to the Open-Meteo API. The response is returned as a JSON column that contains all the needed information. Now we need to convert this JSON column into a suitable form. KNIME offers a simple and efficient solution for this, too.

Fig 9: The GET Request node sends request to API (image from author).

The JSON to Table node converts the JSON column into a table column and the Ungroup node finally converts it into a table with the date values and the respective weather information in different columns.

Fig 10: Converting JSON column to suitable table (image from author).
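The same reshaping step can be sketched in Python. The sample below is hand-written and assumes that the response stores timestamps and values as parallel lists under an `hourly` key, with the variable named `wind_speed_10m`; the helper `hourly_rows` is a hypothetical name, and the wind values are made up.

```python
import json

# Hand-written sample that mimics the assumed response layout:
# parallel lists of timestamps and values under "hourly".
sample_response = """
{
  "latitude": 33.5,
  "longitude": -112.25,
  "hourly": {
    "time": ["2023-11-26T16:00", "2023-11-26T17:00"],
    "wind_speed_10m": [9.4, 11.2]
  }
}
"""


def hourly_rows(payload, variable="wind_speed_10m"):
    """Turn the parallel hourly lists into one row per timestamp."""
    hourly = payload["hourly"]
    return [{"time": t, variable: v}
            for t, v in zip(hourly["time"], hourly[variable])]


rows = hourly_rows(json.loads(sample_response))
for row in rows:
    print(row)
```

Pairing the two lists element by element plays the role of the Ungroup node: each timestamp ends up in its own row next to its wind speed.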

Show me a solution that can create this more easily!

The consolidated weather data for all NFL stadiums, spanning the entire season to date, is then written to a CSV file.

Since our NFL game prediction model needs both historical data and wind speed forecasts, we will also develop a workflow to collect the forecasts.

It looks practically the same as the workflow for collecting historical data. The difference is that it creates a different URL path that refers to the following page: https://open-meteo.com/en/docs

Fig 11: The complete NFL Weather KNIME Workflow (image from author).

The KNIME workflows with all the examples can be found on my KNIME Community Hub space.

Merging wind data with NFL games

Let’s go back to our NFL team rating workflow from last time and update it with the new wind speed data. In order to merge the wind data with our table of NFL games, however, we need to apply a special solution.

Fig 12: Joining the wind speed data to our rating workflow (image from author).

The reason is that the games table contains the match date and the start time of each game, while the weather data contains values for every hour of the day.

Directly joining weather data to match timestamps is not feasible due to variable match start times and extended game durations.

Our solution involves creating two new columns: one representing the time 30 minutes before the match, and the other representing the match’s end time. We then join the weather data to the match data on the home-team column.

Fig 13: Metanode “Weather” with a Workflow inside to join the weather data (image from author).

To ensure that we only refer to the relevant weather data for the game, we filter the data so that only weather observations from 30 minutes before the start of the game to 3 hours after it are taken into account. We achieve this using a Rule-based Row Filter node, in which we can set up the corresponding rule.
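The filter logic can be sketched in Python as follows. This is a minimal illustration, assuming that the window runs from 30 minutes before kickoff to 3 hours after kickoff; the function names and the hourly wind values are made up.

```python
from datetime import datetime, timedelta


def weather_window(kickoff, before=timedelta(minutes=30),
                   after=timedelta(hours=3)):
    """Return the (start, end) of the weather window around kickoff."""
    return kickoff - before, kickoff + after


def rows_in_window(weather_rows, kickoff):
    """Keep hourly rows whose timestamp lies inside the window."""
    start, end = weather_window(kickoff)
    return [r for r in weather_rows if start <= r["time"] <= end]


# Hypothetical hourly rows for one stadium (wind values are made up).
kickoff = datetime(2023, 11, 26, 16, 5)
weather = [{"time": datetime(2023, 11, 26, h), "wind": w}
           for h, w in [(14, 7.0), (15, 8.1), (16, 9.4), (17, 11.2),
                        (18, 10.3), (19, 8.8), (20, 6.5)]]

selected = rows_in_window(weather, kickoff)
print([r["time"].hour for r in selected])
```

For a 16:05 kickoff the window runs from 15:35 to 19:05, so the hourly rows at 16:00, 17:00, 18:00 and 19:00 are kept.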

But we still have too many rows per game, because there is a row of weather information for every hour. In the example below, the Arizona Cardinals game started at 16:05. Within the three-hour period from 16:00 to 19:00, four distinct wind speed conditions were recorded.

Fig 14: Different wind speed conditions for ARI game (image from author).

How can we now select or aggregate the best value from these four to provide the best information for our predictions?

Maximum wind speed: This could be used to predict the likelihood of a turnover or other unexpected event caused by the wind.

Mean wind speed: This could be used to predict the overall impact of the wind on the game, such as whether it will favor passing or running attacks.

We decide on a third alternative: median wind speed. When wind speeds vary significantly during a game, the median is less affected by a single extreme hour than the mean or the maximum, so it provides a more representative and robust measure of the wind conditions during the game.
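A small Python example shows how the three aggregations differ. The four hourly values are made up and deliberately include one gusty hour.

```python
from statistics import mean, median

# Made-up hourly wind speeds for one game, including one gusty hour.
wind_speeds = [9.4, 11.2, 10.3, 24.0]

summary = {
    "max": max(wind_speeds),
    "mean": mean(wind_speeds),
    "median": median(wind_speeds),
}
print(summary)
```

The single gusty hour determines the maximum and pulls the mean upwards, while the median stays close to the three typical values.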

Predicting game winner with regression analysis

The model without the wind speed data achieves an accuracy of 73.89% by the end of week 12.

Fig 15: Model fit without wind speed (image by author).

By including the wind speed, the model’s accuracy increases to 76.67%, a gain of almost three percentage points. This is an impressive increase. But this time we don’t just want to know who will win, but also how big the point differential will be.

To achieve this, we will create a metamodel with a further regression analysis that will use the win probability of the teams to predict the margin of victory (MOV).

Since we have such a small data set, we will aggregate the win probabilities from the first model into bins. By binning the probabilities of home wins into quantiles, we see that there is a non-linear relationship with the margins of victory.

Fig 16: Non-linear relationship between pWin and MOV (image by author).
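The binning idea can be sketched in Python. The helper `quantile_bins` is a hypothetical name, and the pairs of home win probability and margin of victory are made up; the sketch only illustrates equal-sized bins with a mean per bin.

```python
from statistics import mean


def quantile_bins(pairs, n_bins):
    """Sort (p_win, mov) pairs by p_win, split them into n_bins groups
    of near-equal size and return (mean p_win, mean MOV) per group."""
    ordered = sorted(pairs)
    size, extra = divmod(len(ordered), n_bins)
    result, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins get one additional row.
        end = start + size + (1 if i < extra else 0)
        chunk = ordered[start:end]
        result.append((mean(p for p, _ in chunk),
                       mean(m for _, m in chunk)))
        start = end
    return result


# Made-up (home win probability, margin of victory) pairs.
games = [(0.22, -10), (0.31, -6), (0.42, -3), (0.48, -1),
         (0.55, 2), (0.63, 3), (0.74, 8), (0.86, 16)]

bins = quantile_bins(games, n_bins=4)
for p_win, mov in bins:
    print(round(p_win, 3), mov)
```

Plotting the mean margin of victory against the mean win probability per bin is what makes a curved relationship visible even in a small data set.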

So the next step is to build a polynomial regression model in KNIME to predict the margin of victory using the probabilities of winning. This model not only gives us the predictions for the margins of victory, but also outperforms the first model by more than three percentage points and achieves an accuracy of 80.56%.

Fig 17: Non-linear relationship between pWin and MOV (image by author).
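To make the idea of a polynomial regression concrete, here is a minimal sketch in plain Python. It fits the polynomial by least squares via the normal equations, which is not necessarily how the KNIME node computes it. The degree of 3 and the data points are assumptions for illustration only.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations.
    Returns [c0, c1, ...] for c0 + c1*x + c2*x**2 + ..."""
    n = degree + 1
    # Augmented normal-equation matrix [X^T X | X^T y].
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))]
         for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n + 1):
                a[row][k] -= factor * a[col][k]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (a[i][n] - rest) / a[i][i]
    return coeffs


def predict(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


# Made-up home win probabilities and margins of victory.
p_win = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
mov = [-9.0, -5.5, -3.0, -0.5, 1.5, 4.0, 8.5, 15.0]

coeffs = fit_polynomial(p_win, mov, degree=3)  # degree is an assumption
predicted_mov = predict(coeffs, 0.75)
print(round(predicted_mov, 2), "home win" if predicted_mov > 0 else "away win")
```

In this sketch, a positive predicted margin is read as a home win and a negative one as an away win, so one model delivers both the winner and the expected point differential.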

Now we are ready to make our forecasts for week 13! Check a portion of the predictions in Fig. 18 — tadaaa!

Fig. 18: The first 4 predictions of the winning teams in week 13. To check the full prediction table for all games in week 13, download for free the workflow from my KNIME Community Hub space.
Fig. 19: Team ratings.

Are you also interested in the predictions for all games in week 13? Then download for free the workflow from my KNIME Community Hub space to see the full predictions and let me know if you have any questions.

I wish you lots of fun!

Material for this project:

Thanks for reading and may the Data Force be with you! Please feel free to share your thoughts or reading tips in the comments.

Follow me on Medium, LinkedIn or Twitter and follow my Facebook Group “Data Science with Yodime”.


Written by Dennis Ganzaroli

Data Scientist with over 20 years of experience. Degree in Psychology and Computer Science. KNIME COTM 2021 and Winner of KNIME Best blog post 2020.
