DATA STORIES | SPORT ANALYTICS | KNIME ANALYTICS PLATFORM

Creating Your Sports Analytics Dataset and Discover Insights

A no-code solution with KNIME Analytics Platform

Hans Samson

Published in

Low Code for Data Science

8 min readNov 15, 2023

Based on the current matchday and ranking, the heatmap shows the likelihood of finishing in the top 4.

A while ago, I wrote a blogpost where I explored the possibility of predicting the outcome of a football match (and making money). The research was conducted using KNIME Analytics Platform. In that blog post, I didn’t delve deep into the details of how to configure the different nodes. In this blog post, where the creation of a dataset takes center stage, I will provide that information.

To arrive at my prediction of the result of a football match, I had access to a vast dataset of match results and rankings. I won’t be sharing this dataset, but what I will share is my KNIME flow that allows you to reproduce the dataset yourself. This dataset serves as the ideal starting point for creating model features, but it can also be used to generate insights about teams, leagues, matches, goals and more.

This KNIME workflow is not only useful for processing football matches into a comprehensive dataset. It can actually be used for any sport in which teams compete against each other, such as basketball, hockey, and volleyball.

Getting Started

So the only thing you need to get started is KNIME Analytics Platform (https://www.knime.com/downloads), and the match results from some competition in e.g. an Excel file.

You will find the KNIME workflow in my publicly accessible space on the KNIME Community Hub. You only need to drag and drop the workflow into to your local KNIME Analytics Platform canvas — that’s it.

What the KNIME workflow does

The KNIME workflow reads in a file containing the results of league matches. The result is a file that displays the ranking of each team for each round of the competition. It transitions from (input) matches between teams to (output) the ranking of each team for each matchday. The final file is structured with the following fields:

The data fields after executing the KNIME workflow.

To go from the input file to the output file, I divided the KNIME workflow into 4 steps.

As you can see the actual KNIME nodes are not visible from the outside. They are hidden within so-called metanodes. You can open the metanodes by double-clicking on them or by right-clicking and choosing the “Metanode -> Open” option. Metanodes are useful to clean up messy workflows. You can identify isolated blocks of logical operations in your workflows and include them inside a metanode. You can read more on the topic here: Metanodes and Components.

Input

It all starts with data. You can create the input file yourself or look for a dataset on the Internet. I obtained my match results easily with the CSV Reader node from football-data.co.uk. With this node, you are not limited to local csv files, but you can read csv files from URLs too (see figure below).

The KNIME workflow expects a file with seven columns. These are (1) match_id - a unique identifier; (2) matchday, or competition round; (3) date, not necessary but helpful; (4) home_team; (5) away_team; (6) goals scored, gf; (7) goals conceded, ga;

It’s not a problem if the column names in your own input data do not match the names and format expected in the KNIME workflow. You can easily adjust this using the Table Manipulator node. The Table Manipulator node has multiple options to restructure your table in one go.

Adapting the input file to meet the requirements of the KNIME workflow using the Table Manipulator node.

Once this is fixed and the column names match, you can run the KNIME workflow. There’s no need to configure the nodes in steps 1, 2 and 3. In this blog, I will describe them step by step anyway.

Step 1: From games to teams

In this step, we go from games to teams. That means that the number of records that served as input will be doubled. This sets the foundation for creating the rankings at the end of each matchday.

The Rule Engine node (green) is used to create new variables (or modify existing variables). In the first Rule Engine node, the winner of a match is determined based on the goals scored and goals conceded. The other two nodes following this Rule Engine node are calculating the number of points earned by the home team and the away team.

The configuration of the first three Rule Engine nodes.

After the Match Formula node, where the absolute value of the goal difference is calculated, the workflow is split into two branches. The match result is examined from both the perspective of the home team and the away team. By making clever use of the Column Rename node, everything comes back together nicely in the Concatenate node. So, where the match between Ajax and Volendam ends in a 2–0 victory for Ajax, the result for Volendam is recorded as a 0–2 loss.

So we go from games:

to teams:

Step 2: Ranking teams (per round)

Step 2 focuses on the teams and creates the ranking per round per team. The position on the league table is determined by the cumulative number of points, the cumulative number of goals scored, and goals conceded. To achieve this, the results of the matches of each team (sorted by matchday) are processed individually. For this, the Group Loop node in combination with the Moving Aggregation node is very suitable.

The Moving Aggregation node is a super handy node with a lot of functionality. In total, there are 18 teams in the Dutch competition (Eredivisie). This means that the Group Loop offers a subset of the data to the Moving Aggregation node 18 times (for each team separately). There are various aggregation methods to choose from in the configuration of this node. Since we are looking for the cumulative number of points, the cumulative number of goals scored, “Sum” is the appropriate aggregation method.

The configuration of the Moving Aggregation node (do not forget the checkbox).

An example of the result after the calculations in the Moving Aggregation node are done for one of the 18 teams (in this case, Ajax), is shown in the table below (for readability, I have omitted a few columns).

Sometimes, matches are postponed for various reasons (e.g. unplayable field due to bad weather) and rescheduled for a later date. This means that a match is missing from a matchday, which has implications for generating the ranking for that matchday and the next. Therefore, it’s preferable to create a dataset that includes only matchdays in which all matches have been played.

Step 3: Calculate League Table

In the previous step (step 2), it was determined for each team how many points, goals scored, and goals conceded they had accumulated up to each matchday. In the Loop End node of step 2, this information from all teams in the competition came together. This data is the starting point for step 3, the compilation of the league table for each matchday.

Again, a Group Loop is used for this purpose. It’s crucial that the teams are presented in the correct order to the Group Loop. The Sorter node at the beginning of this flow creates the ranking. The ranking within a matchday is determined (in this order) (1) by the number of points earned (descending) up to that round, (2) goal difference (descending), and if that’s equal, (3) the number of goals scored (descending).

Each matchday, with the team (based on the sorting) in the correct order, goes through the loop one by one. Thanks to the Constant Value Column node, each record has been assigned a value of 1 in a new variable called “ranking.” As in step 2, by choosing Cumulative Computation in the Moving Aggregation node, all 18 teams are ranked from 1 to 18 in ascending order.

The result is a file with the ranking for each team at each matchday, together with a lot of other fields. Now, one last step to go, and you have a great starting point for finding various insights and other cool stuff or even creating features for predictive models.

Step 4: Final add-ons

In this step, you can configure a number of KNIME nodes yourself that provide variables with information to help keep track when you combine the results from multiple leagues and seasons. I recommend creating separate files for each season and each league; don’t mix seasons from different leagues. Due to the uniform structure of the files, it’s very easy to merge (Concatenate node) them back into a single file.

Using the Constant Value Column, you assign the same value to all records in one go. In this case, the value is “2023–2024” for season and “ned” for the variable “source”. Finally, a unique ID is created with the String Manipulation node.

The creation of the id_unique field, with te String Manipulation node.

So, let’s kick off your Sports Analytics journey!

This blog post has provided you with the initial steps to create your dataset, which can serve as the cornerstone for a wide range of insightful analyses. To get started, simply download the workflow from my public space on the KNIME Community Hub.

Whether you’re passionate about football, basketball, hockey, or any competitive sport, now is the perfect time to begin. Take the first step today, and let KNIME empower you to uncover data-driven insights that will elevate your sports analytics to the next level.