DATA STORIES | WEB SCRAPING | KNIME ANALYTICS PLATFORM

Scraping NFL Data with KNIME — Part 1

Learn how to use KNIME for efficient NFL data scraping, gaining a competitive edge in sports analytics and betting

Dennis Ganzaroli
Low Code for Data Science
9 min read · Oct 16, 2023


Fig 1: Scraping NFL Data with KNIME (image by author).

In the fast-paced world of sports betting and analytics, gaining access to accurate and up-to-date NFL (National Football League) results and odds data is crucial. Web scraping is a powerful technique that allows you to extract this valuable information from websites efficiently.

In this introduction, we will explore how to leverage KNIME, an open-source, versatile, and no-code data analytics platform, to perform web scraping of NFL results and odds.

This marks the beginning of a series of articles dedicated to sports analytics, with a primary focus on the NFL.

In this series, we will explore various aspects of NFL analytics, starting with web scraping to gather essential data, then delving into the calculation of power ratings and advanced machine learning techniques to predict game results.

These analytics will prove valuable to both football enthusiasts and those engaged in the world of sports betting.

Stay tuned for more in-depth coverage in the upcoming articles.

>> Read Part 2 of Scraping NFL Data with KNIME <<

1. Choosing what to scrape

In the world of web scraping, the initial step is pivotal: deciding which specific information to extract from which websites.
For our purposes, we first want to retrieve data on NFL game results and the associated point spreads.

NFL game results play a pivotal role in our subsequent steps, as they are needed to calculate team power ratings. These power ratings are essential for assessing a team’s performance and making predictions for future games.

The point spread, on the other hand, is determined by sportsbooks to create a balanced betting environment.

Fig 2: Point spread of the NFL game New England@Miami (image by author).

For example, if the Patriots are favored by -7.5 points and you bet on them, they must win by more than 7.5 points for your bet to win. If you bet on the Dolphins at +7.5 points, they can either win the game or lose by less than 7.5 points for your bet to be successful.

The spread makes betting more competitive and balanced, which is why it is considered more attractive than straight win/lose odds. In theory, a handicap turns a game into an equal proposition in the eyes of the betting public, with bettors just as likely to take one team as the other no matter how great the mismatch might be.
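As a quick illustration, grading a spread bet can be expressed in a few lines of Python. This is a hypothetical helper for clarity, not part of the KNIME workflow:

```python
def grade_spread_bet(favorite_score: int, underdog_score: int,
                     spread: float = 7.5) -> str:
    """Grade a bet on the favorite against the point spread.

    The favorite 'covers' only if it wins by more than the spread;
    otherwise the underdog side of the bet wins. With a whole-number
    spread, an exact tie against the spread is a 'push' (stake returned).
    """
    margin = favorite_score - underdog_score
    if margin > spread:
        return "favorite covers"
    if margin == spread:
        return "push"
    return "underdog covers"

# Patriots -7.5: a 10-point win covers, a 3-point win does not.
print(grade_spread_bet(24, 14))  # favorite covers
print(grade_spread_bet(20, 17))  # underdog covers
```

A half-point spread like -7.5 can never push, which is exactly why sportsbooks often use it.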

For our purposes, the point spread will be a critical factor in evaluating and refining our future betting models.

Furthermore, other statistics are also collected in the NFL in so-called boxscores, which evaluate the game event according to various performance criteria. These detailed statistics will help us at a later stage to implement appropriate rules and features in our machine learning model.

Fig 3: NFL Scoreboard on USA Today for Week 14, 2022 (image by author).

For the reasons described above, we decided to scrape the USA Today website, as it already contains all the necessary data.

2. Scraping the Data with KNIME

For those who are not yet familiar with KNIME, I recommend reading the following article. It gives a brief introduction to the installation and use of the freely available open source tool KNIME.

KNIME is available for Windows, macOS, and Linux, ensuring compatibility across platforms.

Let’s now start and scrape some data. We’ll first look at an example based on week 14 of the 2022 season, then scrape all games of the 2022 season.

The following KNIME workflow does it all; the screenshot below shows the output of our scraped data.

The KNIME workflows with all the following examples can be found on my KNIME Community Hub space.

Fig 4: KNIME Workflow with scraped NFL data of Week 14, 2022 (image by author).

Let’s now take a closer look at the individual KNIME nodes. To “open” a node, you just have to double-click on it.

In KNIME, individual tasks are represented by nodes. They are the smallest possible unit in KNIME and have been created to perform all sorts of tasks, including reading/writing files, transforming data, training models, creating visualizations, and so on.

Fig 5: Intro to the Nodes and Workflows (image by KNIME).

In the “Table Creator” node we define the URL of the website we want to access. This URL is then passed to the “Webpage Retriever” node.

Fig 6: Loading the website into KNIME (image by author).

The “Webpage Retriever” node has some settings that we will not change at the moment, since we are only scraping one website. But later we need to adjust the connection delay.

The appropriate delay for scraping a website, often referred to as the “crawl delay”, can vary depending on several factors. As a general guideline, a delay of 1–5 seconds between requests is often considered reasonable for most websites. But more about that later.
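Outside of KNIME, the same politeness rule is easy to sketch. The helper below is an illustrative Python stand-in for the Webpage Retriever’s delay setting, not a replacement for it:

```python
import time
import urllib.request

def fetch_politely(urls, delay=3.0, fetch=None):
    """Fetch each URL in turn, sleeping `delay` seconds between
    requests so the server is not hammered (a simple crawl delay)."""
    if fetch is None:
        # Default fetcher; any callable taking a URL can be swapped in.
        fetch = lambda url: urllib.request.urlopen(url).read()
    pages = []
    for i, url in enumerate(urls):
        if i > 0:  # no need to wait before the very first request
            time.sleep(delay)
        pages.append(fetch(url))
    return pages
```

With three URLs and a 3-second delay, the whole run waits 6 seconds in total; the delay only applies between requests.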

Fig 7: The settings of the “Webpage Retriever” Node (image by author).

Now it’s the turn of the “XPath” node, where most of the scraping takes place. KNIME has a dedicated node for this task as well.

XPath (XML Path Language) is a powerful and widely used language for navigating and extracting data from XML and HTML documents. It’s a crucial tool for web scraping when you need to extract specific information from web pages.

In the image below you can already see all the XPath commands in the query to retrieve the corresponding data, such as scores and spread.

Fig 8: The settings of the “XPath” Node (image by author).

To extract the XPath queries you want, you don’t need to write all the code. It is enough to follow the steps below:

A good tool to test XPath queries efficiently is XPath Helper. This is a browser add-on for Chrome or Firefox that can be installed quickly; just search for “XPath Helper” in your browser’s extension store.

Right-click on the element you wish to obtain an XPath query for and select “Inspect”. In our case it is the rectangle with the information about the Raiders@Rams game. This automatically opens the HTML inspector on the right side of the window.

Fig 9: Opening the HTML web page inspector (image by author).

Now you need to navigate to the information you want. In our example below it is the Raiders score.

If the Raiders’ score lights up, the correct line has been selected. Now right-click in the HTML inspector and select “Copy/Copy XPath”.

For control purposes, the selected XPath query can be tested in the black window of the XPath Helper in the upper left corner. As you can see, the score 16 is displayed. So we have found the right query.

Fig 10: Finding the right XPath Query (image by author).

But we don’t want only the score of the visiting team (Raiders) in this game; we want the scores of all visiting teams that played in Week 14. To achieve this, we need to analyze our XPath query below in more detail and adjust it a bit.

Fig 11: XPath Query of Raiders score (image by author).

XPath allows you to precisely locate and extract data from specific elements on a webpage. The numbers in the brackets of the commands are like coordinates.

But when you replace a number in an XPath command with an asterisk (*), it acts as a wildcard, matching any numeric index. This means that it selects all elements at that level of the hierarchy, regardless of their position.

So in our case, replacing the query with the following command will result in our desired selection.

Fig 12: XPath Query of all visiting teams (image by author).

Now all scores of the visiting team for week 14 are selected. You can now copy the code into the XPath node of KNIME and create a new column “ascore” (=away team score) with it.
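The effect of generalizing a positional index can be reproduced on a toy fragment with Python’s standard library (whose XPath subset is more limited than the full XPath that KNIME’s node supports):

```python
import xml.etree.ElementTree as ET

# Toy stand-in for the scoreboard markup -- not the real page structure.
html = """<div>
  <ul>
    <li>16</li>
    <li>21</li>
    <li>34</li>
  </ul>
</div>"""
root = ET.fromstring(html)

# A positional predicate pins the query to exactly one element ...
second = [li.text for li in root.findall(".//li[2]")]   # ['21']

# ... while generalizing away the index (the role the asterisk plays
# in the KNIME query) matches every element at that level.
scores = [li.text for li in root.findall(".//li")]      # ['16', '21', '34']
```

The same principle carries over to the real query: keep the indices that pin down *which kind* of element you want, and generalize the one that distinguishes the individual games.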

In the XPath node, select “Integer cell” as the “Return type” since the score is a number, and “Multiple rows” in the “Multiple tag options”. If the returned cell is a name like the team names, select “String cell” as “Return type”.

Fig 13: Creating columns in the XPath node (image by author).

Following the same principle, we extract the other information about the game.
However, to determine the links to the boxscores with the additional game statistics, we need to adjust one more query.

By simply copying the XPath we do not get the link to the score, but only the text of the cell.

Fig 14: Finding the link to the boxscore (image by author).

Here a bit of XPath know-how is needed: by appending “/@href” to the query, it returns the link itself instead of the text.

Fig 15: XPath Query for the boxscore link (image by author).
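The same distinction between an element’s text and its attributes shows up in Python’s standard library; the boxscore path below is made up for illustration:

```python
import xml.etree.ElementTree as ET

# Toy scoreboard cell linking to a boxscore (hypothetical href value).
cell = ET.fromstring('<td><a href="/boxscore/raiders-rams">16 - 17</a></td>')
link = cell.find(".//a")

print(link.text)         # the visible text of the cell: '16 - 17'
print(link.get("href"))  # the link itself, what "/@href" selects
```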

With this simple workflow, we were able to find all the required data for week 14.

Fig 16.: Result of scraping week 14 (image by author).

We now want to extract all the other weeks of the season. To achieve this, we need to understand how this website is designed.

As we see below, an NFL season is divided into different time periods.
At the beginning of the season, the preseason games take place in weeks 1 to 4.

Then the regular season runs from week 1 to week 18, during which each team plays 17 games.

In the playoffs, a single-elimination format determines the two conference champions, who then meet in the Super Bowl.

Fig 17: Links for the different time periods in an NFL season (image by author).

We will summarize the links as follows and create a table from them in Excel, which we will use as input in KNIME.

Fig 18: All links of the NFL Season 2022 (image by author).
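Instead of typing the link table by hand, it could also be generated programmatically. The URL pattern below is purely hypothetical; the real links have to be copied from the USA Today site as shown in Fig 18:

```python
# Hypothetical base URL -- replace with the real links from the site.
BASE = "https://example.com/nfl/scores/2022/"

def season_links():
    """Build the full list of week pages for one NFL season:
    4 preseason weeks, 18 regular-season weeks, 4 playoff rounds."""
    links = [f"{BASE}preseason-week-{w}" for w in range(1, 5)]
    links += [f"{BASE}week-{w}" for w in range(1, 19)]
    links += [f"{BASE}{r}" for r in
              ("wild-card", "divisional", "conference", "super-bowl")]
    return links

print(len(season_links()))  # 26 pages to scrape
```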

We load the created Excel file into KNIME with the “Excel Reader” node and adjust the “Webpage Retriever” node so that it always waits 3000 ms or 3 seconds per query. With this “crawl delay” we want to prevent overloading the website server and minimize the risk of IP blocking.

Fig 19: Scraping the whole season using a “crawl delay” (image by author).

All games of the season have now been extracted. To save the scraped data, we write them to an Excel file.

And … it’s good!

Fig 20: Results of the whole NFL season 2022 (image by author).

In the second part of this web scraping tutorial, we will go further and scrape all the seasons with the games and their corresponding boxscores.

So stay tuned and follow me on Medium so you don’t miss the latest articles.

Material for this project:

If you want to know how American football is played, watch this short video:

A Beginner’s Guide to American Football

Thanks for reading and may the Data Force be with you!
Please feel free to share your thoughts or reading tips in the comments.

Follow me on Medium, LinkedIn or Twitter
and join my Facebook Group “Data Science with Yodime”



Data Scientist with over 20 years of experience. Degree in Psychology and Computer Science. KNIME COTM 2021 and Winner of KNIME Best blog post 2020.