DATA STORIES | WEB SCRAPING | KNIME ANALYTICS PLATFORM

Scraping NFL Data with KNIME — Part 2

In Part 1, we scraped game scores from an NFL season. This time we look at advanced scraping techniques to build an NFL database for further analysis.

Dennis Ganzaroli
Low Code for Data Science
6 min read · Oct 27, 2023


Fig 1: Scraping NFL Data with KNIME — Part 2 (image by author).

Last time we saw how easy it is to use KNIME and XPath to scrape the results of an entire NFL season.

As you may recall, the KNIME workflow below was sufficient to obtain all the data we needed.

Fig 2: KNIME Workflow with scraped NFL data of Week 14, 2022 (image by author).

This time we will go one step further and scrape all available seasons.

The KNIME workflows with all the following examples can be found on my KNIME Community Hub space.

We can see what seasons are available on the USA Today website. Now our task will be to select all the seasons one by one and execute the workflow we created in the last tutorial to save all the games with the results and odds in an Excel file.

Fig 3: NFL season selection on USA Today (image by author).

To achieve this, we need to construct the right URL for each season.
Looking closer at the URL analysis from the last tutorial, which covered the different time periods in a season, you will immediately notice that the year of the season, marked in green in the image below, always appears in the URL.

So it should be enough to change the year in the URL to retrieve the correct season.

Fig 4: Links for the different time periods in the NFL 2022 season (image by author).
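To make the idea concrete outside of KNIME, the same logic fits in a few lines of Python. This is only a sketch: the path below is a placeholder, not the real USA Today URL from Part 1.

```python
# Minimal sketch: swap the year into the URL to get each season's page.
# The template is a placeholder -- use the actual path identified in Part 1.
url_template = "https://example.com/nfl/scores/{year}/"

season_urls = [url_template.format(year=y) for y in range(2019, 2022)]
print(season_urls[0])  # https://example.com/nfl/scores/2019/
```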

Furthermore, we would like to save the results of each season in the same Excel file on different sheets. The Excel file becomes a kind of database where we can store all our games.

We could, of course, use KNIME to store the games in a real database such as MySQL. But this way we get to know some interesting nodes in KNIME.

For those of you who want to know what data access capabilities KNIME provides, I highly recommend the following article:

Now let’s identify the different blocks of the URL path. As you can see in the image below, the URL path consists of three different components or blocks. The first block always consists of the first part of the page link from USA Today (Block 1). The second block contains the year of the season (Block 2). And in the third block we always find the different periods of the season (Block 3).

Fig 5: Different blocks of the URL path (image by author).

The goal now is to compile the URL path for all seasons and time periods from these three blocks. Nothing could be easier with KNIME, but first we need to prepare an Excel file for this.

The Excel file consists of three sheets. In the first sheet, the base of the URL is given, in the second sheet the seasons are listed, and finally in the third sheet the periods during the season can be found.

Fig 6: Excel file for creating the URL paths (image by author).

The following KNIME workflow does it all. The created Excel sheets are imported into KNIME using the Excel Reader node. Then, the String Manipulation node combines the individual blocks into the final URL paths we need for our scraper.
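If you prefer to see this step in code, here is a minimal pandas sketch of what the Excel Reader and String Manipulation nodes do. The file, sheet, and column names are assumptions, so adjust them to your own Excel file.

```python
import pandas as pd

# Read the three blocks from their sheets (names are assumptions).
base    = pd.read_excel("url_blocks.xlsx", sheet_name="base")     # Block 1
seasons = pd.read_excel("url_blocks.xlsx", sheet_name="seasons")  # Block 2
periods = pd.read_excel("url_blocks.xlsx", sheet_name="periods")  # Block 3

# Cross-join the blocks and concatenate them into the final URL paths.
urls = base.merge(seasons, how="cross").merge(periods, how="cross")
urls["url"] = urls["base"] + urls["season"].astype(str) + "/" + urls["period"]
print(urls["url"].head())
```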

Since we want to save each season in the Excel file as a separate sheet, we use loop nodes to accomplish this. In the Group Loop Start and Loop End nodes, the different seasons are executed in a loop so that the corresponding URLs can be created.

Fig 7: Combining the blocks for the URL path (image by author).

The Webpage Retriever node allows you to send a request to each URL in the list provided by our String Manipulation node. Every URL results in one request, which in turn produces one row in the output table.

Don’t forget to put a delay in the Webpage Retriever node. With this “crawl delay” we want to prevent overloading the website server and minimize the risk of IP blocking.

Fig 8: Using a “crawl delay” to prevent overloading of the website server (image by author).
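In plain Python, the combination of request and crawl delay looks roughly like this sketch. The URLs are placeholders, and the one-second pause is an assumption you should adapt to the site’s robots.txt.

```python
import time
import requests

# Placeholder URLs; in the workflow they come from the String Manipulation node.
season_urls = ["https://example.com/nfl/scores/2019/",
               "https://example.com/nfl/scores/2020/"]

pages = {}
for url in season_urls:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    pages[url] = response.text  # one request -> one row in the output table
    time.sleep(1.0)             # "crawl delay": don't hammer the server
```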

Finally, the results of the scraped websites are saved in the respective sheet of the Excel file.
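The per-season sheets can be sketched in pandas as well. The `games` table below is made up for illustration; in the workflow, the rows come from the scraper inside the Group Loop.

```python
import pandas as pd

# Dummy data standing in for the scraped games (columns are assumptions).
games = pd.DataFrame({
    "season": [2019, 2019, 2020],
    "home":   ["GB", "CHI", "KC"],
    "away":   ["MIN", "DET", "HOU"],
})

# Write each season to its own sheet of the same Excel file.
with pd.ExcelWriter("nfl_games.xlsx") as writer:
    for season, season_games in games.groupby("season"):
        season_games.to_excel(writer, sheet_name=str(season), index=False)
```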

Since the odds are not always available for pre-season games and these games have no impact on the regular season results, we will exclude them from our scraping.

We will later use only the Regular Season and Playoff games for the calculation of our Power Ratings, as these games have a higher value than the pre-season games.
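In code, this exclusion is a simple row filter; the column name and its labels are assumptions about the scraped table.

```python
# Equivalent of a Row Filter: keep only Regular Season and Playoff games.
# The "period" column and its labels are assumptions about the scraped table.
games = games[games["period"].isin(["Regular Season", "Playoffs"])]
```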

The following KNIME workflow has been adapted accordingly and restricted to scraping only the 2019–2021 seasons. This allows you to test the functionality. Scraping all pages is much more time-consuming, but can be unlocked by adjusting the Row Filter node at the top left of the workflow.

Fig 9: The KNIME workflow to scrape all NFL seasons (image by author).

Not much documentation or annotation is needed to maintain the KNIME workflow, because the workflow itself already acts as a map of the engineering process. This is the reason “Why every Data Engineer should learn a Visual Programming Language like KNIME”.

If this conclusion interests you, you can read more about it in the following article:

The resulting Excel file consists of the three scraped seasons saved in different sheets. In the last column you can also find the link to the boxscore statistics. In another loop, these could then be used for further scraping. Try it out!

Fig 10: Resulting Excel file with scraped games (image by author).
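As a starting point for that extra loop, here is a hypothetical sketch; the sheet and column names are assumptions about the resulting Excel file.

```python
import pandas as pd

# Read each season's sheet back in and walk through the boxscore links.
for season in ["2019", "2020", "2021"]:
    sheet = pd.read_excel("nfl_games.xlsx", sheet_name=season)
    for boxscore_url in sheet["Boxscore"].dropna():
        ...  # fetch and parse each boxscore page here, again with a crawl delay
```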

In the next tutorial, we will create power ratings of the teams based on the scraped games.

So stay tuned and follow me on Medium so you don’t miss the latest news.

Material for this project:

Thanks for reading and may the Data Force be with you!
Please feel free to share your thoughts or reading tips in the comments.

Follow me on Medium, LinkedIn, or Twitter, and follow my Facebook Group “Data Science with Yodime”.


Dennis Ganzaroli
Low Code for Data Science

Data Scientist with over 20 years of experience. Degree in Psychology and Computer Science. KNIME COTM 2021 and Winner of KNIME Best blog post 2020.