DATA STORIES | UEFA EURO 2020 | KNIME ANALYTICS PLATFORM

If you want to be a data scientist, change hobbies!

Dennis Ganzaroli
Low Code for Data Science
12 min read · Jun 10, 2021

Data Master Yodime shows Lucky how to use his interest in football as motivation to gain experience in Data Science by collecting football matches to predict the winner of the UEFA Euro 2020 with the open source data science tool, KNIME Analytics Platform.

Fig. 1: Data Analytics must become your first hobby! (image by author).

Sometimes people ask me on social media:
“What do I need to learn to get a data science job?”
and I have to answer them all the time:
“You have to change hobbies! Because learning is not enough.”

But let’s start from the beginning. Let me tell you the whole story.
Let’s call our interested friend: Lucky. He must be a distant cousin of Luke Skywalker from Star Wars. And I will humbly take the role of Data Master Yodime.

Lucky: I saw your post about the dashboard and would like some guidance on a job. Help me out!
Yodime: How old are you and what experience do you have?
Lucky: I am 28 years old and have 6 years of experience.
Yodime: Hmm… What kind of experience?
Lucky: IT-Support
Yodime: And you’re interested in Data Analytics?
Lucky: Yes, I am! Please guide me!
Yodime: What kind of hobbies do you have?
Lucky: Cooking, cars, travelling, football…
Yodime: Hmm… Data Analytics is not one of your hobbies!
Lucky: Uhm? Sorry, didn’t get you.

Yodime: “Data Analytics must become your first hobby!”

Lucky: Okay.
Yodime: Learning is not enough. You must live it, you must love it!
Lucky: Okay.
Yodime: I work for a big telco as a Data Scientist, but I still do data science projects in my spare time.
Lucky: I want to start learning, but I don’t know where to start. Should I first learn Python, SQL or Tableau?
Yodime: You first need a problem.
Lucky: A problem? What kind of problem?
Yodime: A real problem. Hmm, let me see your hobbies again…
Do you like football?
Lucky: Yes! Very much!
Yodime: Let’s start from there.
Lucky: But what does football have to do with Data Science?
Yodime: It’s not about football. It’s about the data behind it.

“And data is everywhere.”

Fig. 2: Data is everywhere (image by author/original source Unsplash by Markus Spiske).

Lucky: Please show me how I can combine football with Data Science!
Yodime: When you work with data, it’s always about answering questions.
And to answer the questions, you need to tell the right story.

“Data Science is all about storytelling.”

Fig. 3: Data Science is all about storytelling (image by author/original source clevertize.com).

Lucky: Now I’m even more confused! Data Science, football and storytelling?
I need some examples, please!
Yodime: Hmmm… no, you need some questions. What questions would you like to answer about football?
Lucky: I see now… I would like to know which team will win the upcoming European Championship.
Yodime: Difficult question, but a good place to start. Let’s begin!
First of all you need the data. You will be able to answer your questions with the results of the football matches.
Lucky: With the results of the games? How?
Yodime: Wait! First you must get the historical data. Like Confucius said:

“Study the past, if you would divine the future.”

Lucky: Con-Who?
Yodime: You need the results of the past matches of all participating teams of the tournament.
Lucky: I see… I need a database where I can fetch all this data. Where can I buy such a database? Do you know a good API?
Yodime: You can’t always start with a ready database. Sometimes you have to start from scratch and gather the data yourself.
Lucky: But I want to start building models with Machine Learning!
Yodime: Everyone wants to teach you Machine Learning and Deep Learning first. But without the right data, you won’t get anywhere. And often the data you need is on some website and you have to scrape it to get it into a database.
“Web scraping is still underrated in Data Science education!”

Lucky: What tools should I learn to accomplish all these tasks? Everyone tells me that I should learn to program with Python.
Yodime: The tool is not important if you want to accomplish the task. Every tool can be good enough.

“The tool becomes important when you need to be fast”

Fig. 4: The tool becomes important when you need to be fast (image by author).

Yodime: I’ll show you how to do everything with the Data Science tool KNIME. The visual programming language of KNIME is self-explanatory and therefore easy to learn. Here is a good “Getting Started Guide” where you can also download the open-source software for free. The following article will also give you a good starting point: “Seven things to do after installing KNIME”.

Fig. 5: A simple KNIME-Workflow (image by KNIME).

But before we can really begin, we need to get to know our subject better. Because never forget:

“Business Knowledge is always key!”

Fig. 6: Business Knowledge is always key! (image by author).

Yodime: Do you still remember your initial question?
Lucky: Yes, I want to know which team will win the European Championship!
Yodime: So let’s get an overview of the tournament. Wikipedia can help us here.

UEFA Euro 2020 was supposed to have taken place in 2020, but was postponed to this year due to the Covid pandemic.

The tournament has three stages:

1. Qualification for the final stage:
55 European UEFA national teams were divided into 10 groups and played twice against the same group members. 20 teams composed of the group winners and runners-up qualified directly for UEFA Euro 2020.

Fig. 7: Qualification for the final stage 55 teams in 10 groups (source UEFA.com).

Another 4 teams, which had failed in the group stage of the qualification, were still allowed to qualify for the final tournament via the play-offs. The selection for the playoffs was based on their performance in the 2018–19 UEFA Nations League.
This sums up to 24 teams playing in the final stage.

2. Tournament group stage:
The 24 qualified teams were drawn into six groups of four. Each team plays a total of 3 games and plays once against each other group member. Group winners, runners-up, and the best four third-placed teams advance to the round of 16 and so to the knockout phase of the tournament.

Fig. 8: The 24 qualified teams in the final group stage (source UEFA.com).

3. Tournament Knockout phase:
In the knockout phase, if a match is tied at the end of normal playing time, extra time is played (two periods of 15 minutes each). If still tied after extra time, the match is decided by a penalty shoot-out.

Fig. 9: Euro 2020/2021 Final Stage Match Schedule (image courtesy of ukcontracting.co.uk).

Lucky: I had never really realized how complicated this qualification process is. Is there no easier way to determine the winner?
Yodime: The goal of this process is to determine the best team from the 55 European teams. If every team played against every other team once, then according to Gauss’ famous formula, with n = 55 we would have:
n*(n-1)/2 = 55*54/2 = 1,485 matches
If each team played every other team once at home and once away, like in a national league, you would end up with 2,970 games. UEFA’s qualification procedure requires far fewer matches.
Lucky: But why should they have to play every team home and away? One time should be enough.
Yodime: This is a very important point. As many statistics have shown, there is a home field advantage in football, as in other team sports.
Without such an advantage, one might naively expect home wins, draws and away wins to be evenly distributed, with a share of about 33% each. But prior research has found that in football leagues across Europe, home teams win about 50% of the matches played in their stadiums. And this effect persists even without spectators, as this study showed.
Lucky: Ok, I see. But how is it possible now to calculate who will win the tournament?
Yodime: We are taking a top-down approach here. Let’s assume we already knew which two teams would play in the final. Which of the two would be the favorite?
Lucky: The one with the most wins and scores must be the stronger one.
Yodime: There is a pitfall to this approach: What if the team with the most wins and scores simply had a much easier schedule than its opponent?

Therefore, we need to use a different approach here. We will model the outcome of a football game with the following formula:

Fig. 10: Model of the outcome of a football game.

where:
- d_ij is the difference of the scores of the playing teams i and j.
If, for example, team i beats team j by a score of 3–1, then d_ij is +2.
- h is the homefield advantage. (We use a simple approach here and assume that every team has the same homefield advantage.)
- r_i is the rating of team i
- r_j is the rating of team j
- ε_ij is an additive error term which is normally distributed.

The goal here is to find the team ratings for which the predicted score differences deviate as little as possible from the actual results.
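Since Fig. 10 is an image, here is how the model can be written out from the definitions above. This is a reconstruction, not a copy of the figure: it assumes that team i is the home team (consistent with the worked Belgium vs. France example later in this article), and the variance symbol σ² is notation introduced here. The second line states the fitting goal as a least-squares problem, which is what a linear regression minimizes.

```latex
d_{ij} = h + r_i - r_j + \varepsilon_{ij},
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)

\min_{h,\; r_1, \dots, r_n} \; \sum_{\text{matches } (i,j)}
\bigl( d_{ij} - (h + r_i - r_j) \bigr)^2
```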

There are several scientific articles and some books on rating teams with mathematical models. For our purpose, we will fit the model above with one of the most famous methods in machine learning: regression analysis!

Fig. 11: Function of the Regression Analysis.

Lucky: I learned in a statistics class that regression analysis can be used to predict the outcome of a dependent variable Y with an independent variable X. However, in your model there is only the dependent variable Y, which is represented by d_ij.
Yodime: The independent variable X is already in there, too. But we have to abstract a little bit. As you have already noticed, Y corresponds to d_ij, and α is equal to h, the homefield advantage. The β coefficients will be the ratings of the teams, which we have to calculate. For X, we need to code the teams as follows:

0: if the team has not played the match
1: if the team has played at home
-1: if the team has played away

So let’s take the following example with 4 teams (t1, t2, t3, t4) playing these 3 matches:

By coding them like in the description above we get the following matrix:

Fig. 12: Matrix of matches and results to use with the regression model.
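The coding scheme can be sketched in a few lines of Python. The three matches and their goal differences below are made up for illustration and are not the ones shown in Fig. 12; the function names are hypothetical as well.

```python
# Hypothetical example data: (home team, away team, goal difference d
# from the home team's point of view). NOT the matches from Fig. 12.
TEAMS = ["t1", "t2", "t3", "t4"]
MATCHES = [
    ("t1", "t2", 2),   # t1 wins at home by two goals, e.g. 3-1
    ("t3", "t4", 0),   # draw
    ("t2", "t3", -1),  # t2 loses at home by one goal
]


def encode_match(home, away, teams):
    """Return one matrix row: 1 for the home team, -1 for the away team, 0 otherwise."""
    return [1 if t == home else -1 if t == away else 0 for t in teams]


def build_matrix(matches, teams):
    """Return (X, d): the coded team columns and the goal differences."""
    X = [encode_match(home, away, teams) for home, away, _ in matches]
    d = [diff for _, _, diff in matches]
    return X, d


X, d = build_matrix(MATCHES, TEAMS)
for row, diff in zip(X, d):
    print(row, diff)
```

Each row contains exactly one 1 and one -1, so a single match only ever compares two teams.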

So we have to feed our regression model with the independent variables t1, t2, t3, t4 and the dependent variable d. By fitting this model, we get the parameters β1, β2, β3 and β4 which are the ratings of the teams, and α which corresponds to the homefield advantage. Our regression analysis formula now looks like this for the 4 teams:

Fig. 13: Regression model for 4 teams.
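Figs. 11 and 13 are images, so the following is a reconstruction from the surrounding text rather than a copy of them. The first line is the generic regression function, the second is the same function for the 4 teams, and the third shows what remains for a single match in which t1 plays at home against t2 (t1 = 1, t2 = -1, t3 = t4 = 0). That last line has the same shape as the match model with h and the team ratings.

```latex
Y = \alpha + \beta X + \varepsilon

d = \alpha + \beta_1 t_1 + \beta_2 t_2 + \beta_3 t_3 + \beta_4 t_4 + \varepsilon

d = \alpha + \beta_1 - \beta_2 + \varepsilon
\quad \text{(t1 at home against t2)}
```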

“As you can see, a good foundation in math is key to being successful in Data Science.”

Fig. 14: A good foundation in math is key to being successful in Data Science (image by author/original source by fr.phonekey.com).

We are now ready to start our analysis.

Lucky: But how many matches do we need to calculate the teams’ ratings? When we consider games that are far in the past, they do not reflect a team’s current skills. On the other hand, we don’t have enough games if we take only the qualifiers, since many teams didn’t play against each other.
Yodime: This is in fact a real issue and it shows the difference between theory and reality. And that’s exactly why it’s so important to learn Data Science using real-world examples.

“Because in the real world you will always have to deal with constraints.”

Yodime: Now let’s start gathering our data.
(If you want to skip this step, you can work directly with the data provided on Kaggle. Its maintainers have collected in their GitHub repository all international football matches, from the very first official match in 1872 up to the present day.)

From previous analysis we know that we need matches from the last three years, because bigger competitions like the Euros and the World Cup take place every 4 years, with qualifiers and sometimes other competitions like the UEFA Nations League in between.
Lucky: What about friendlies? Do we take them into account as well, or do they have no impact because they have no competitive importance?
Yodime: While they have no competitive significance, they still have some importance and can help adjust the ratings, since not every team plays against every other team in major competitions and qualifiers. We will include them in our analysis with a weighting factor resulting from the regression analysis.
And you will see that this will improve our ratings.
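For readers who start from a ready-made CSV file instead of scraping, the selection of recent matches and the flagging of friendlies could look roughly like this in Python. The column names, the sample rows and the cutoff date (three years before this article) are assumptions for illustration, so check them against the file you actually use.

```python
import csv
import io
from datetime import date

# Made-up sample rows; the column names are an assumption about the CSV layout.
SAMPLE_CSV = """date,home_team,away_team,home_score,away_score,tournament
2016-06-10,Team A,Team B,2,1,UEFA Euro
2019-03-21,Team C,Team D,0,0,UEFA Euro qualification
2020-10-07,Team A,Team C,1,3,Friendly
"""


def load_recent_matches(csv_text, since):
    """Keep matches played on or after `since` and flag friendlies with 1."""
    matches = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        played = date.fromisoformat(row["date"])
        if played < since:
            continue  # too old to reflect a team's current strength
        matches.append({
            "home": row["home_team"],
            "away": row["away_team"],
            "dfres": int(row["home_score"]) - int(row["away_score"]),
            "friendly": 1 if row["tournament"] == "Friendly" else 0,
        })
    return matches


recent = load_recent_matches(SAMPLE_CSV, since=date(2018, 6, 10))
for match in recent:
    print(match)
```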

Everything is done with the following KNIME workflow, from scraping the football matches and building the matrix for our regression model to calculating the ratings of the teams.

Fig. 15: KNIME workflow to calculate the team ratings (image by author).

The workflow is divided into three parts:

  1. Web scraping (blue part) of the football matches.
    The “Webpage Retriever” enables us to perform typical Website information retrieval tasks.
  2. Building the Matrix (yellow part). We transform the information of the football matches into a matrix, which is suitable for our regression analysis.
  3. Calculation of the Ratings (green part) of the European teams.
    For the calculation of the rating we will use a linear regression model.

Web Scraping
We will scrape the football matches from the BetExplorer website by putting the desired URLs in the Table Creator node. We will use all matches of the Euro qualifiers, the matches of the last UEFA Nations League, and the friendly matches from 2018 until today. Afterwards we use the Palladian node “Webpage Retriever” to put the input data in a form that can be queried with XPath expressions.

Fig. 16: KNIME workflow: Web scraping the football matches (image by author).

At the end, we get a table with all football matches, including the match date, the result and so on.
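To give an idea of what querying a page with XPath looks like outside of KNIME, here is a small Python sketch using only the standard library. The markup is invented and heavily simplified, so the XPath expressions are placeholders and will not match the real BetExplorer pages. Real-world HTML is also rarely well-formed XML, so in practice a dedicated HTML parser would be needed.

```python
import xml.etree.ElementTree as ET

# Invented, simplified markup. The real pages are structured differently,
# so the XPath expressions below are placeholders only.
PAGE = """
<html><body>
  <table class="results">
    <tr><td class="match">Team A - Team B</td><td class="score">3:1</td><td class="date">01.06.2021</td></tr>
    <tr><td class="match">Team C - Team D</td><td class="score">0:0</td><td class="date">02.06.2021</td></tr>
  </table>
</body></html>
"""


def extract_matches(page):
    """Return one dict per table row with the match, score and date text."""
    root = ET.fromstring(page.strip())
    matches = []
    # ElementTree supports a subset of XPath, including attribute predicates.
    for tr in root.findall(".//table[@class='results']/tr"):
        matches.append({
            "match": tr.findtext("td[@class='match']"),
            "score": tr.findtext("td[@class='score']"),
            "date": tr.findtext("td[@class='date']"),
        })
    return matches


scraped = extract_matches(PAGE)
for match in scraped:
    print(match)
```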

Building the Matrix
Now we have to transform this data into a suitable form for our regression model. First, we have to split the results into a home-team score and an away-team score. Then we compute the difference of the scores (dfres). Finally, we code the games: for every team that has played a game, we create a column with the name of the team and code it with 0 for “no game played”, 1 for “game played at home” and -1 for “game played away”.

Fig. 17: KNIME workflow: Building the matrix (image by author).
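The first two steps, splitting the result and computing dfres, can be sketched in Python as follows. The result format "home:away" and the handling of entries without a score are assumptions made for this example; the function names are hypothetical.

```python
def split_result(result):
    """Split a result string such as '3:1' into (home_goals, away_goals).

    Returns None for entries without a usable score, e.g. postponed matches.
    """
    parts = [p.strip() for p in result.strip().split(":")]
    if len(parts) != 2 or not all(p.isdigit() for p in parts):
        return None
    return int(parts[0]), int(parts[1])


def goal_difference(result):
    """Return dfres = home goals - away goals, or None if the result is unusable."""
    score = split_result(result)
    if score is None:
        return None
    home_goals, away_goals = score
    return home_goals - away_goals


for raw in ["3:1", "0:2", "1:1", "POSTP."]:
    print(raw, "->", goal_difference(raw))
```

Rows for which `goal_difference` returns None would be dropped before building the matrix.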

Calculation of the Ratings
Finally, we have to calculate the ratings of the teams. For this task we use the “Linear Regression Learner” node in KNIME. As the target variable we use the difference of the result, dfres, and as input the columns with the names of the teams.
Since we also want to give less weight to friendlies, we add a column with this information as well.

Fig. 18: KNIME workflow: Calculation of the team ratings (image by author).
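To make the fitting step less of a black box, here is a minimal pure-Python sketch of an ordinary least-squares fit of d = h + r_home - r_away via the normal equations. It is a stand-in for illustration, not what the KNIME node does internally. It makes three assumptions: the first team is pinned to a rating of 0 (the ratings are only determined up to a common constant), the friendly column is left out, and the five matches are made up and noise-free, so the fit should recover h = 1 and ratings of 0, 1, 2 and 3 up to rounding.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_ratings(matches, teams):
    """Least-squares fit; returns (home advantage, {team: rating})."""
    free = teams[1:]  # teams[0] is the reference team with rating 0
    rows, d = [], []
    for home, away, diff in matches:
        coded = [1.0 if t == home else -1.0 if t == away else 0.0 for t in free]
        rows.append([1.0] + coded)  # leading 1.0 is the intercept (home advantage)
        d.append(float(diff))
    k = len(free) + 1
    # Normal equations: (X^T X) beta = X^T d
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xtd = [sum(r[i] * y for r, y in zip(rows, d)) for i in range(k)]
    beta = solve_linear_system(XtX, Xtd)
    ratings = {teams[0]: 0.0}
    ratings.update(zip(free, beta[1:]))
    return beta[0], ratings


# Made-up, noise-free matches: (home, away, goal difference for the home team).
TEAMS = ["t1", "t2", "t3", "t4"]
MATCHES = [
    ("t2", "t1", 2), ("t3", "t2", 2), ("t4", "t3", 2),
    ("t1", "t4", -2), ("t1", "t3", -1),
]

h, ratings = fit_ratings(MATCHES, TEAMS)
print(round(h, 3), {team: round(r, 3) for team, r in ratings.items()})
```

With real, noisy results the fit will not be exact, and enough matches connecting all teams are needed for the system to have a unique solution.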

And voilà there are our ratings!

England is at the top with a rating of 7.045, followed by Italy with 6.965. The homefield advantage is 0.456. This means that if you want to predict the result between Belgium and France, for example, Belgium would win at home by 6.412 - 6.071 + 0.456 = 0.797 goals. If they played in France, the home team France would be favored by only 6.071 - 6.412 + 0.456 = 0.115 goals, which is more like a draw than a win. The model fits quite well: the correlation between the predicted and actual results is 0.81.
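The prediction arithmetic can be checked with a few lines of Python. The ratings and the homefield advantage are the values quoted above; the function name is hypothetical.

```python
# Values quoted in the article.
HOME_ADVANTAGE = 0.456
RATINGS = {"Belgium": 6.412, "France": 6.071}


def predicted_goal_difference(home, away, ratings=RATINGS, h=HOME_ADVANTAGE):
    """Expected goal difference from the home team's point of view."""
    return h + ratings[home] - ratings[away]


print(round(predicted_goal_difference("Belgium", "France"), 3))  # 0.797
print(round(predicted_goal_difference("France", "Belgium"), 3))  # 0.115
```

A positive value favors the home team, so in both venues the model leans slightly towards the side playing at home.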

Lucky: Hurrah! I will put all my money on England!
Yodime: Be careful, young Data Scientist!

Models are always only an approximation of the reality.

But they can often help you gain helpful and important insights.
And that’s why it’s so exciting to deal with the force of Data Science.

May the Data Science Force be with you!

Material for this project:
knime-workflow:
knime-hub

References:
KNIME
- KNIME Getting Started Guide
- Seven things to do after installing KNIME
Euro 2020
- UEFA Euro 2020 Wikipedia
Regression Analysis
- Basic Regression Models
Football Rating
- Pronostics footballistiques, Diego Kuonen, 1996
- Was France’s World Cup win pure chance?, Diego Kuonen, 1998

Follow me on Medium, LinkedIn or Twitter, and follow my Facebook group “Data Science with Yodime”.

Dennis Ganzaroli
Low Code for Data Science

Data Scientist with over 20 years of experience. Degree in Psychology and Computer Science. KNIME COTM 2021 and Winner of KNIME Best blog post 2020.