DATA STORIES | SOCCER ANALYTICS | KNIME ANALYTICS PLATFORM

Using Graph Theory and KNIME Analytics Platform to Predict the EURO 2024 Outcome

A network graph of different football clubs and national teams and the Authority Score to determine the winner

Martin D Aus A
Low Code for Data Science

--

It’s that time of the year again when a major football tournament is fast approaching. The UEFA European Football Championship will kick off on June 14th with the first match between the tournament host Germany and Scotland.

It’s also that time of the year again where friends and colleagues start pestering you to join fantasy football leagues or tipping competitions in order to predict the outcomes of the different matches and to ultimately bet on who will be the winner of the tournament.

Over the years, I’ve encountered different types of players in these competitions.

Player stereotypes

From left to right in no particular order:

  • The Intuitive Enthusiast — This person relies on their gut feeling and passion for the sport.
  • The Stat Geek — This person uses math and statistics to make their picks.
  • The Lucky Novice — This person has no idea what they’re doing but somehow ends up lucky.

I’m firmly on team “Stat Geek” myself, and this article goes out to all the Stat Geeks (and Lucky Novices). I am already looking forward to passionate discussions with the Intuitive Enthusiasts as to why they disagree with my predicted outcome :-).

The Graph Theory intuition for predictions

All joking aside, over the past years I’ve used opportunities like this to learn about approaches to predicting outcomes and to try them out, with more successes than failures when it comes to my placement in different leagues.

This time it is no different — what is different though is the approach I am taking. I came across this blog post on the usage of graph theory to make 2022 World Cup predictions.

The blog post shows how the connections between football clubs and national teams (in terms of how many players of each club are selected for each national team) were used to create a network graph. For the relationships in that network graph, the “eigenvector centrality” was determined for every national team to establish a ranking. For the 2018 World Cup, applying this method, 3 of the top 5 teams with the highest eigenvector centrality ultimately made it into the semi-finals, and France, as the highest-ranked team, won the tournament. For 2022, three of the predicted top 4 teams made it to the semi-finals, and the highest-ranked team, again France, ended up losing to Argentina in a penalty shootout.

That is pretty impressive accuracy for a method that doesn’t involve analyzing teams’ previous match results.

In this article I’m exploring if and how this can be done using a low-code approach in KNIME Analytics Platform.

Project Overview

1. Technical Requirements

Let’s look into what we need first:

  1. KNIME Analytics Platform: I’m using version 5.2.3.
  2. KNIME Network Mining Extension: this contains functionality to create and analyze networks.
  3. KNIME REST Client Extension: we’ll grab data about the squads and matches from Wikipedia, and this extension has what we need to retrieve and parse websites.
  4. KNIME XML-Processing Extension: the website data comes in XML format, and this extension has what we need to extract the relevant data and wrangle it into the right format.

There might be some other, more common extensions that need to be installed. Should you decide to take a look at my workflow (linked at the end of the article), KNIME will prompt you to install them when you first open it. The ones listed above are those worth mentioning in the context of this project.

2. Approach: A network graph and the Authority Score to determine every match’s winner

We first extract from Wikipedia the players of all national teams that have qualified for the tournament, including the clubs they play for.

We wrangle this data into a format that tells us how many players each club sends to each national team, and we use it to create our network: each club and national team is a node, and every connection from club to national team is an edge. The resulting network is then analysed to determine the “Authority Score” of each team. This is a slight deviation from the approach used in the blog post that inspired me, but the Authority Score is a metric that comes out of the box with KNIME, whereas the math behind eigenvector centrality somewhat escaped my capabilities ;-).

In the context of this project, I’d explain Authority Score like this:

“The Authority Score is like a strength rating for national football teams. It shows which teams are considered stronger based on the number of players the top clubs send to them. A higher Authority Score means a team has more players from well-known clubs, making them likely to perform better in matches.”
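To make the idea a little more tangible, here is a minimal Python sketch of how an authority score can be computed with the textbook HITS power iteration on a tiny, made-up club-to-country network. The club and country names, the player counts and the function name `authority_scores` are all hypothetical, and I am assuming a plain weighted HITS scheme; the KNIME node may differ in its normalisation and in how it handles weights.

```python
from math import sqrt

# Hypothetical toy data: (club, country, players sent). Not real squad data.
edges = [
    ("Club A", "Country X", 5),
    ("Club A", "Country Y", 2),
    ("Club B", "Country X", 3),
    ("Club C", "Country Y", 1),
]


def authority_scores(edges, iterations=50):
    """Weighted HITS by power iteration; authority scores are scaled so the best node gets 1."""
    nodes = {node for src, dst, _ in edges for node in (src, dst)}
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}

    for _ in range(iterations):
        # Authority update: weighted sum of the hub scores of nodes pointing at you.
        auth = {node: 0.0 for node in nodes}
        for src, dst, weight in edges:
            auth[dst] += weight * hub[src]

        # Hub update: weighted sum of the authority scores of nodes you point at.
        hub = {node: 0.0 for node in nodes}
        for src, dst, weight in edges:
            hub[src] += weight * auth[dst]

        # Rescale both vectors so the numbers stay bounded.
        auth_norm = sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {node: v / auth_norm for node, v in auth.items()}
        hub = {node: v / hub_norm for node, v in hub.items()}

    best = max(auth.values()) or 1.0
    return {node: v / best for node, v in auth.items()}


scores = authority_scores(edges)
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{node}: {score:.3f}")
```

In this toy network the clubs end up with an authority of 0 because nothing points at them, while the country that receives more players from the busier clubs gets the top score.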

We then sort the national teams based on their Authority Score — that’s where the other article stops and I decided to take it a step further.

We also extract the different matches of the group stage (they are all known and scheduled already), as well as the bracket for the knock-out rounds and the logic to determine which third-placed teams advance to the main round (in general, there are 24 teams that qualified; 16 advance to the main round: the first and second team of each group move on, and in addition to that, the best four teams that ranked third in their group). The source is Wikipedia again.

We then simulate the full tournament and use the ranking according to Authority Score to determine every match’s winner (i.e., the better-ranked team wins).

The reason for this additional step is that, especially once we get to the knock-out rounds, it depends on the sequence of opponents how far a team makes it into the tournament — e.g., if the team that ranked 4th meets the team that ranked 2nd already in the quarter-finals because of how the draw worked out, the 4th-ranked team will not make it into the top 4.
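The effect of the draw can be shown with a few lines of Python. This is only a sketch with eight hypothetical teams `T1` (best) to `T8`, and `play_round` is a helper name I made up; it simply lets the better-ranked team of each pairing advance.

```python
def play_round(teams, rank):
    """Pair neighbours in bracket order; the better-ranked (lower number) team advances."""
    return [min(a, b, key=rank.get) for a, b in zip(teams[::2], teams[1::2])]


# Hypothetical ranking: T1 is the strongest team, T8 the weakest.
rank = {f"T{i}": i for i in range(1, 9)}

# A draw in which the top four teams avoid each other in the quarter-finals.
kind_draw = ["T1", "T8", "T4", "T5", "T2", "T7", "T3", "T6"]
# A draw in which the 2nd and 4th ranked teams already meet in the quarter-finals.
unlucky_draw = ["T1", "T8", "T2", "T4", "T3", "T6", "T5", "T7"]

print(play_round(kind_draw, rank))     # ['T1', 'T4', 'T2', 'T3']
print(play_round(unlucky_draw, rank))  # ['T1', 'T2', 'T3', 'T5']
```

With the second draw, the 4th-ranked team drops out in the quarter-finals and the 5th-ranked team takes its place among the semi-finalists, which is exactly why sorting by score alone is not enough.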

3. What’s covered in this article

This is quite a large workflow so for brevity I will:

  1. focus on how the Network Mining Extension was used to generate & analyze the network, and how I used it to visualize the knock-out bracket.
  2. share the outcome and reveal who will be champion according to the analysis.
  3. discuss some potential limitations of this approach.

There’s also a video covering aspects like data gathering and processing in more detail so feel free to check it out below if this project “tickles your fancy” :-).

Workflow Overview: Step-by-Step

Workflow Overview.

Above you can see the overall workflow, which contains various “Metanodes” bundling a lot of the logic. Find below a high-level explanation of what is happening in sections 1 to 5:

Section 1: Using KNIME for Data Extraction

1. Fetch Team and Squad Data:

  • Wikipedia Pages: Use the Webpage Retriever node in KNIME to pull data from the Wikipedia pages.
  • Parsing Data: Use the XPath nodes to parse XML data and extract relevant information like team names, player lists, and their respective clubs.

2. Fetch Match Schedule Data:

  • Webpage Retriever: Again, use this node to fetch match schedule data.
  • XPath Node: Extract match details, including groups, knockout stages, and tiebreaker logic.
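For readers who have not worked with XPath before, the following Python sketch shows the general idea of the extraction step using only the standard library. The markup is a heavily simplified, hypothetical stand-in (the `squad` class and the `data-team` attribute are invented for this example); the real Wikipedia pages are structured differently and need their own XPath queries.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified squad table. Real Wikipedia markup differs.
page = """
<html><body>
  <table class="squad" data-team="Country X">
    <tr><th>Player</th><th>Club</th></tr>
    <tr><td>Player One</td><td>Club A</td></tr>
    <tr><td>Player Two</td><td>Club B</td></tr>
  </table>
</body></html>
""".strip()

root = ET.fromstring(page)

rows = []
# XPath-style query: every table element whose class attribute is "squad".
for table in root.findall(".//table[@class='squad']"):
    country = table.get("data-team")
    for tr in table.findall("./tr"):
        cells = tr.findall("./td")
        if len(cells) == 2:  # skips the header row, which only has th cells
            rows.append({"Country": country, "Player": cells[0].text, "Club": cells[1].text})

for row in rows:
    print(row)
```

The result is one row per player with the country and club attached, which is the shape the later steps of the workflow build on.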

Section 2: Creation and Analysis of the Network Graph

1. Building Nodes and Edges:

  • Data Pre-processing: Use the Column Filter and Row Filter nodes to clean and format the data.
  • Creating Edges: With the Column Expressions node, create a new column to join club and country data, forming unique edges.
  • Object Inserter: Define nodes (clubs and countries) and edges (player connections) to build the network graph.
  • Network Viewer: Configure the Network Viewer node to visualize the graph, displaying node sizes and colors according to their authority scores.

2. Analysing the Network:

  • Network Analyzer: Calculate the authority score for each team. This score reflects a team’s strength based on the number of players it receives from top clubs.
  • Sorting and Ranking: Use the Sorter and Row Filter nodes to rank teams based on their authority scores.

Section 3: Simulating the Tournament

With the teams ranked by their authority scores, we simulate the tournament. For each match, the team with the higher rank is predicted to win. This method, while simplistic, provides a structured way to forecast the tournament’s outcome.

  1. Group Stage Predictions: Use the Joiner node to align match schedules with team rankings. Determine the third-placed teams and sort them by their ranking; the top four advance. Work out which scenario applies for the four third-placed teams, based on which groups they come from, to determine the round of 16 match-ups in the knock-out round.
  2. Knock-out Rounds: Progress through the Round of 16, quarter-finals, semi-finals, and finals by comparing team rankings.
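The handling of the third-placed teams can be sketched in Python as follows. All team names and ranks are hypothetical, and `third_in_group` and `best_thirds` are helper names I made up. The sketch relies on the fact that, when the better-ranked team always wins, the group standings simply follow the ranking. The resulting scenario key (the groups the four advancing thirds come from) is what would then be looked up in the official pairing table, which I am not reproducing here.

```python
def third_in_group(group_teams, rank):
    """If the better-ranked team always wins, group standings follow the ranking."""
    return sorted(group_teams, key=rank.get)[2]


def best_thirds(thirds_by_group, rank, n=4):
    """Return the scenario key and the n best third-placed teams (best first)."""
    ordered = sorted(thirds_by_group.items(), key=lambda item: rank[item[1]])
    advancing = ordered[:n]
    scenario = "".join(sorted(group for group, _ in advancing))
    return scenario, [team for _, team in advancing]


# One hypothetical group: rank 1 is the strongest team overall.
group_rank = {"P": 3, "Q": 10, "R": 14, "S": 21}
print(third_in_group(["S", "P", "R", "Q"], group_rank))  # R

# Hypothetical third-placed teams of the six groups and their overall ranks.
thirds = {"A": "A3", "B": "B3", "C": "C3", "D": "D3", "E": "E3", "F": "F3"}
rank = {"A3": 14, "B3": 9, "C3": 17, "D3": 11, "E3": 20, "F3": 12}

scenario, teams = best_thirds(thirds, rank)
print(scenario, teams)  # ABDF ['B3', 'D3', 'F3', 'A3']
```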

Section 4: Consolidate matches and outcomes from all rounds

Merge all matches including the opposing teams, their ranking and the winner according to the applied logic into one table. Write that table into Excel.

Section 5: Results and Visualization

Wrangle the Match Outcome dataset for the knock-out stage into a format that can be visualized in a Network View. Extract features to display the winning teams as green circles, the losing teams as red crosses and the winner of the final as a yellow asterisk.

How to use the Network Mining extension

Let’s take a closer look at the Metanode “Create and Analyse Network” from Section 2 of the main workflow.

Create and Analyse Network Metanode.

1. Table Reader

We start by reading the data extracted from Wikipedia using a Table Reader node. The Wikipedia data was saved after extraction as it may change so in order to ensure reproducibility of the results I opted to save it and read it back in again in this Metanode. The data contains one row for every player that was selected into a national team. As the player name does not matter in this analysis, the saved table was reduced to three columns: Country, Club and count (although count is 1 for each row).

2. GroupBy

GroupBy Node and Configuration.

Next we use a GroupBy node to aggregate the data. We want to know how many players a club sends to a national team in total, so in the Configuration Dialogue we choose the Country and Club columns as “Groups”, and in the “Manual Aggregation” tab we add the column count using the aggregation method count.
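In code, the same aggregation could look roughly like the Python sketch below. The rows are hypothetical and mirror the three columns Country, Club and count; because count is 1 on every row, summing it gives the same result as counting rows.

```python
from collections import Counter

# Hypothetical input, one row per selected player: (Country, Club, count).
players = [
    ("Country X", "Club A", 1),
    ("Country X", "Club A", 1),
    ("Country X", "Club B", 1),
    ("Country Y", "Club A", 1),
]

# Group by (Country, Club) and add up the count column.
players_per_pair = Counter()
for country, club, count in players:
    players_per_pair[(country, club)] += count

for (country, club), total in players_per_pair.items():
    print(country, club, total)
```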

3. Column Expressions

Column Expressions and Configuration.

We send the aggregated data to a Column Expressions node. As mentioned earlier, each club and each national team will be represented as a node in our network graph, and the connections will be represented as edges. To make sure that we don’t lose the connections, we use the Column Expressions node to add a column “edgeID” which combines the Club and Country columns, separated by “<->”. We need this column later on to create our network graph and to attach the features that we want to visualize in the Network Viewer.
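As a rough Python equivalent of that expression, the sketch below builds the edge identifier for a few hypothetical aggregated rows. Putting Club before Country is my reading of the description above.

```python
# Hypothetical aggregated rows, as they might come out of the GroupBy step.
aggregated = [
    {"Country": "Country X", "Club": "Club A", "count": 2},
    {"Country": "Country X", "Club": "Club B", "count": 1},
    {"Country": "Country Y", "Club": "Club A", "count": 1},
]

# Combine Club and Country into one identifier per connection.
for row in aggregated:
    row["edgeID"] = f'{row["Club"]}<->{row["Country"]}'

edge_ids = [row["edgeID"] for row in aggregated]
print(edge_ids)
```

Since every Club and Country pair appears only once after the aggregation, the identifier is unique per edge.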

4. Network Creator & Object Inserter

Network Creator and Object Inserter.

The table containing the Node and Edge data is sent to an Object Inserter node. The Network Creator node does not require any configuration, but its green-squared output port needs to connect to the Object Inserter.

The Object Inserter node will add the Nodes and Edges from our data table into the network. In the configuration dialogue under “Node settings”, we select the appropriate columns (Note: Country goes into “Second node id column” as we want to connect Club to Country). We can also use the Club and Country names as the labels for the created nodes so that it is easy to understand what node one is looking at. In the “Edge settings” section, we select the edge column and use the count column as the label. This way every connection will be labelled with the number of players that are sent from the club to the country.

We also check the box for “Create directed Edges” to make sure it is a “one-way street”, i.e. Clubs send players to Countries, but not the other way around. Under Weight settings we set the radio button “Column” and pick “count” from the drop-down. This ensures that the number of players sent from Club to Country is considered in determining the importance of each node. Et voilà: our Network Object is ready to be analysed.
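Conceptually, the network object now holds something like the structure in the Python sketch below: a directed, weighted edge from each club to each country it supplies. The data and the variable names are hypothetical, and the dictionary layout is just one convenient way to picture it, not how KNIME stores the network internally.

```python
from collections import defaultdict

# Hypothetical edge table: (Club, Country, count).
rows = [
    ("Club A", "Country X", 2),
    ("Club B", "Country X", 1),
    ("Club A", "Country Y", 1),
]

# node -> {target node: weight}. Edges only ever run from club to country.
successors = defaultdict(dict)
for club, country, count in rows:
    successors[club][country] = count
    successors.setdefault(country, {})  # countries are nodes too, without outgoing edges

# Weighted in-degree: total number of players arriving at each node.
weighted_in_degree = defaultdict(int)
for source, targets in successors.items():
    for target, weight in targets.items():
        weighted_in_degree[target] += weight

print(dict(successors))
print(dict(weighted_in_degree))
```

Because the edges are directed, only countries receive incoming weight, which is what later separates them from the clubs in the analysis.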

5. Network Analyzer

Network Analyser.

This is the node that gets us to the ranking of teams very quickly. In the configuration, make sure to check the box for “Hubs & Authority” so that the Authority Score is calculated. The output will be a value between 0 and 1, with the “best” team having the value 1. All Clubs will show a value of 0, since no edge points to a club, and only the Country nodes will show a score. This is good, as the clubs are irrelevant for what we need to do in the following steps.

6. Sorter

Sorter.

The data contained in the top output port of the Network Analyzer is sent to a Sorter node. The nodes in the input are in no particular order, so we want to change that: the Nodes should be ordered from highest to lowest Authority Score. In the Sorter node we select Authority Score under Sort by and pick Descending from the radio buttons.

In the article referenced above, this is where the analysis stopped and the prediction of the top 4 / 5 teams was made. We will progress from here to simulate the tournament!

In order not to spoil the outcome I’ve hidden the Node names contained in the first column for now — but I promise it is not too much longer until all will be revealed!

7. Row Filter and Column Expressions

Row Filter and Column Expressions.

I mentioned already that we don’t need any information on the Nodes that represent Clubs going forward. The Row Filter Node is configured to only keep rows 1–24 and to remove anything else. Given that the table is sorted in descending order by Authority Score, this makes sure we only keep the National Teams.

Then we use the Column Expressions node to do two things:

  1. Assigning a numeric rank based on the position in the table. We add a new column “rank” and use the rowIndex() formula.
  2. Some cleaning of the Country names. Later on in the tournament simulation I noticed that some hidden whitespace remained from extracting the data from Wikipedia, which prevented the combination with the match schedules via Joiner Nodes. We “overwrite” the Object id column that contains the Country names and use a regular expression in the regexReplace() function to target any whitespace and to replace it with “” (nothing…).
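Taken together, the Sorter, Row Filter and Column Expressions steps amount to something like the Python sketch below. The scores and names are hypothetical, I only keep three teams instead of 24 to keep the example short, and starting the rank at 1 is my assumption.

```python
import re

# Hypothetical analyzer output: (Object id, Authority Score), in no particular order.
# Two names carry stray whitespace, similar to what the extraction left behind.
analyzer_output = [
    ("Club A", 0.0),
    ("Country Y\u00a0", 0.4),
    (" Country X", 1.0),
    ("Country Z", 0.7),
]
NUM_TEAMS = 3  # 24 in the article

# Sort descending by score and keep only the top rows (the national teams).
ordered = sorted(analyzer_output, key=lambda row: row[1], reverse=True)[:NUM_TEAMS]

# Add a rank by position and strip every whitespace character from the names.
ranking = [
    {"Object id": re.sub(r"\s+", "", name), "Authority Score": score, "rank": position}
    for position, (name, score) in enumerate(ordered, start=1)
]

for row in ranking:
    print(row)
```

Note that removing every whitespace character also removes the spaces inside multi-word names, so the names in the match schedule need the same treatment for the join to work.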

The output of the Column Expressions Node goes to the Metanode output port and is important input for the tournament simulation.

8. Size and Color Manager + Visualization Property Extractor

Size and Color Manager + Visualization Property Extractor

The upper branch prepares the data from the Network Analyser for pretty visualisation.

  1. Size Manager Node extracts the size for Nodes based on the Authority Score and scales it up by a factor of 45.
  2. Visualization Property Extractor Node is used to add this size as a new column to the data set named “scaling factor”.
  3. The data is sent to a Color Manager Node. Based on the “scaling factor” column we can assign a color range from lowest (0 = red) to highest (45 = green).
  4. Another Visualization Property Extractor Node is used to add this color as a new column named “color” to the dataset.
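The mapping from score to size and color can be pictured with the Python sketch below. It assumes a plain multiplication by 45 and a linear red-to-green gradient; the function names are hypothetical and the KNIME nodes may well scale and interpolate differently.

```python
SCALE = 45  # scaling factor used in the article


def node_size(authority_score):
    """Scale an authority score between 0 and 1 up to a display size."""
    return authority_score * SCALE


def node_color(size, low=0.0, high=SCALE):
    """Linear gradient from red (low) to green (high) as an (R, G, B) tuple."""
    t = (size - low) / (high - low)
    t = min(max(t, 0.0), 1.0)  # clamp to the valid range
    return (round(255 * (1 - t)), round(255 * t), 0)


for score in (0.0, 0.5, 1.0):
    size = node_size(score)
    print(score, size, node_color(size))
```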

9. Feature Inserter

Feature Inserter.

Now that we have extracted properties for visualization we need to insert them into the network graph object. That is the job of Feature Inserter nodes.

Features can be inserted for any Node or Edge that is part of the network graph. The color and size properties are node related: we want Nodes to differ in size and color depending on their Authority Score. So in the first two Feature Inserter nodes, we select the Object id column that contains the names of Clubs and Countries (our Nodes), assign appropriate feature names, and select the columns that the Visualization Property Extractor nodes added to the dataset.

The last Feature Inserter node illustrates how to insert an Edge feature: the number of players each Club sends to a national team. This is somewhat redundant, as we selected this already in the Object Inserter node. However, for the sake of having an Edge example, I chose to include it. The last Feature Inserter node sends the green-squared network object output port to the Metanode output.

Next, it is time to move up one level again in the Workflow to configure a Network Viewer and to reveal the strongest team according to Authority Score!

Configuring the Network Viewer

The last step before looking at the result of the network analysis is to configure the Network Viewer node so that it produces a reasonably pleasing view.

Network Viewer Node.

It’s a busy configuration dialogue, as we have to make selections related to all the features we have inserted previously. In the top screenshot, you see all the different tabs of the node dialogue, with arrows pointing to the tabs that need to be configured:

  1. Layout Settings: The drop-down menu presents plenty of options on how to visualize the network. I found “cose” to be the best option.
  2. Node Settings: We add all our Node-visualization-related features: “label” was generated in the Object Inserter and contains the name of the Node, we use “color” for Node fill and Node outline color, and the “Size” feature to determine the Node size. I chose circle as the default shape for all nodes (we did not create a shape feature) and I played around with the font size for the Node label text.
  3. Edge Settings: We can use the count feature — I opted for the edgePlayerCount one that we added using Feature Inserter node.

Network Analysis Outcome

Now it is time for the first drum roll. The screenshot below shows the outcome:

Network Viewer.

Identify the winner and the top 4 teams

The largest and greenest Nodes are Germany and France, with a clear edge for Germany, indicating that Germany is considered the strongest team. No matter how we simulate the tournament in the next steps, this also means that Germany will be the winning team after the simulation. (Note: according to the logic that will be applied, the strongest team can’t lose to any other team.)

However, it is not guaranteed that Germany will play against France in the semi-final (e.g., if due to the draw Germany and France meet earlier in the knock-out stage, France may be eliminated earlier). Other teams, pointed at with a yellow arrow above, that seem to be of somewhat similar strength, are Spain, Portugal, England, Netherlands and Italy. No surprises for me here. If you remember: The Metanode we discussed in a lot of detail also had a table with the full rankings:

Rankings Table.

The top four teams are Germany, France, Spain and Portugal. Looking at the Authority Scores, Germany and France are not separated by a large margin. However, after that, the score starts dropping more significantly.

Tournament simulation for Semi-finals

Let’s see next what the outcome of the simulation is to validate if the draw may prevent any of these top four teams from advancing to the Semi-finals.

Simulation Outcome.

After simulating the entire tournament, three out of the four top teams make it to the Semi-finals: Germany, France and Portugal. Spain is unlucky, meets Germany in the Quarter Finals and is eliminated, making way for Italy, which is ranked 5th. The screenshot above only includes a snapshot of the knock-out rounds; the Concatenate node also contains the full results of every group match.

The “Visualization Tournament Bracket” metanode contains some more processing and feature inserting based on the match schedules to create the connections from one match to the next in the knock-out rounds. The output is then visualized with another Network Viewer node.

Knock-out Round Development.

Unfortunately the “tree-like” structure is hard to see at first; however, it is possible to follow each team’s progress throughout the tournament by following the connections. In general, a green circle indicates that a team played a match and won to advance to the next round. Accordingly, the team’s name will also appear in the node it is connected to. A red cross indicates the team is eliminated, and the yellow star indicates the winner of the final.

Let’s take Portugal as an example:

  1. The first Node with only one connection is green at the very bottom. The next Node is also labelled “Portugal” and is connected to a red Node “Croatia” indicating that Portugal beat Croatia in the round of 16.
  2. The second Node indicates the outcome of the Quarter Finals match. Portugal beats the Netherlands and advances to the Semi-final (third Node, a red cross labelled Portugal).
  3. The third Node is connected to a green Germany Node, indicating that Portugal lost the Semi-final.
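One possible way to derive such markers and connections from a table of match results is sketched in Python below. The matches are hypothetical, `bracket_features` is a made-up helper, and the encoding (one node per team and round, with an edge from a winner to its node in the next round) is my own simplification rather than the exact structure of the workflow.

```python
# Hypothetical knock-out results: (round, team_a, team_b, winner).
matches = [
    ("Semi-final", "T1", "T4", "T1"),
    ("Semi-final", "T2", "T3", "T2"),
    ("Final", "T1", "T2", "T1"),
]
ROUNDS = ["Semi-final", "Final"]


def bracket_features(matches, rounds):
    """Return node markers and the edges that link a winner to its next-round node."""
    nodes, edges = {}, []
    last_round = rounds[-1]
    for round_name, team_a, team_b, winner in matches:
        for team in (team_a, team_b):
            node_id = f"{team}@{round_name}"
            if team != winner:
                nodes[node_id] = ("red", "cross")      # eliminated in this round
            elif round_name == last_round:
                nodes[node_id] = ("yellow", "star")    # winner of the final
            else:
                nodes[node_id] = ("green", "circle")   # advances to the next round
                next_round = rounds[rounds.index(round_name) + 1]
                edges.append((node_id, f"{team}@{next_round}"))
    return nodes, edges


nodes, edges = bracket_features(matches, ROUNDS)
for node_id, marker in nodes.items():
    print(node_id, marker)
print(edges)
```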

Not perfect, but somehow the other alternatives I explored, including the Generic ECharts combined with K-AI did not work out!

Final thoughts

This project has been a lot of fun and I was really surprised how much insight just the simple connections between Clubs and National teams can provide. Bear in mind that in this process not a single line of “Code” was hurt, as not a single line of code was written!

I’m admittedly pleased with the analysis outcome as I’m terribly biased. Having experienced what is still known as the “German Summer Fairytale” of 2006, I’m beyond psyched for the EURO in Germany and am very aware how much it can impact a team if every single game is played in front of a home crowd.

That said, a prediction, and moreover a prediction approach is only worth as much as what reality delivers.

There’s certainly more to football tournaments than just the players that were selected and the clubs they usually play for, and on top of that the simulation approach used in this article is very much binary and very simplistic. That said, the results of the eigenvector centrality predictions for 2018 and 2022 look promising, so let’s wait and see how the network-analysis-based predictions hold up against reality in less than a week’s time, when the first matches will be played.
