International Football Results: An Exploratory Data Analysis

7 min read · Mar 1, 2022

source: https://www.dfb.de/fileadmin/_processed_/201401/csm_wmpokal_gi_2720_e3ee390a5b.jpg

Analyzing International football results from 1872 to 2021

- An up-to-date dataset of over 40,000 international football results

This report looks at past results of international football matches using Python data analysis tools such as NumPy, Pandas, Matplotlib and Seaborn. It also considers what those results may suggest about the prospects of some countries at the upcoming FIFA World Cup 2022 in Qatar. We will look at factors such as the best teams of a given period, along with player and team performances with regard to the venue.

About the Dataset

This dataset is collected from the dataset library of Kaggle.com, an interactive online Data Science project hosting site. The link to the raw dataset can be found here: https://www.kaggle.com/martj42/international-football-results-from-1872-to-2017?select=shootouts.csv

Methodology

The following steps have been undertaken to perform this EDA:

Data preparation and cleaning: finding and dealing with missing values, inconsistent data; outlier treatment, etc.
Non-graphical analysis: variables, data types, basic metrics.
Exploratory Analysis and Visualizations: using tools such as charts, graphs and images.
Asking and Answering questions.
Inferences and Conclusions.
Scope for future work.

Downloading the Dataset

Now, there are three ways of downloading the dataset:

importing the urllib.request module in Python and using its urlretrieve function to download the CSV files from a raw URL
downloading the dataset directly from the website link and then uploading the files to the Jupyter directory
using a helper library, e.g. opendatasets, and its download function
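As a rough sketch of the first option, the helper below wraps urlretrieve. The function name download_csv, the "data" folder and the example URL in the comment are our own placeholders, not values from the dataset page.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_csv(url: str, dest_dir: str = "data") -> Path:
    """Download one CSV file from a raw URL into dest_dir and return its path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Use the last part of the URL as the local file name.
    target = dest / url.rsplit("/", 1)[-1]
    urlretrieve(url, target)
    return target


# Hypothetical usage (the URL is a placeholder, not the real raw link):
# results_path = download_csv("https://example.com/raw/results.csv")
```

The same function would be called once per file (results.csv and shootouts.csv).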

Downloading using opendatasets:

The dataset has been downloaded and the directory has been set. Now we can see the files in the directory:

Data Preparation and Cleaning

Before we start our data analysis, we need to prepare and clean our data so that it is complete and can be manipulated easily. This process may include tasks like exploring the values and ranges; handling missing, invalid or incorrect data; and any additional steps needed to get the data ready for further examination.

Loading the data:

We will first import the Numpy library for arrays and Pandas library for loading and manipulating dataframes.

Then, we will import the raw dataset of the ‘results’ table:

In the same way, importing the raw dataset of the ‘shootouts’ table:
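The loading step might look roughly like the sketch below. To keep it runnable on its own, two tiny in-memory CSV strings stand in for the real files. The column names follow the Kaggle dataset's schema as we understand it (an assumption), and the teams and scores are made up.

```python
import io

import pandas as pd

# Stand-ins for results.csv and shootouts.csv (made-up rows, assumed columns).
RESULTS_CSV = """date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
2020-01-01,Alpha,Beta,2,0,Friendly,Alpha City,Alpha,False
2020-01-05,Beta,Gamma,1,1,Friendly,Beta City,Beta,False
"""
SHOOTOUTS_CSV = """date,home_team,away_team,winner
2020-01-05,Beta,Gamma,Gamma
"""

# With the real files this would be pd.read_csv("<folder>/results.csv") etc.
results_raw_df = pd.read_csv(io.StringIO(RESULTS_CSV), parse_dates=["date"])
shootouts_raw_df = pd.read_csv(io.StringIO(SHOOTOUTS_CSV), parse_dates=["date"])

print(results_raw_df.head())
print(shootouts_raw_df.head())
```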

The .info() method gives us an idea of the number of columns, any missing values, and the data types of the values. We can see here that there are no missing values in any column, so we can deduce that the dataset is complete.

The .describe() method gives us summary statistics for the numeric columns in the table. We can see here that the most goals a home team has scored in a single match is 31, whereas the most an away team has scored is 21.

Another way to check whether a dataset has missing values is to chain the .isnull() and .sum() methods:

A zero for every column means that .isnull() returned False for every value in that column, and the sum of all-False booleans is zero!
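A small illustration of the isnull().sum() check, using a made-up three-row frame in which one value is deliberately missing so the count is not all zeroes:

```python
import numpy as np
import pandas as pd

# Made-up sample; one home_score is missing on purpose.
sample_df = pd.DataFrame({
    "home_team": ["Alpha", "Beta", "Gamma"],
    "home_score": [2, np.nan, 1],
    "away_score": [0, 1, 1],
})

# isnull() gives a True/False table; sum() counts the True values per column.
missing_per_column = sample_df.isnull().sum()
print(missing_per_column)
```

On the real results dataframe, every entry in this series would be zero.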

One additional piece of information that we may want in our results dataframe is the winning team of each match, stored as an extra column. For this we will use the np.select() function of the NumPy library, which returns, for each row, the choice that corresponds to the first condition that is met.
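A minimal sketch of this step on made-up rows is shown below. The column names and the "Tie" label for drawn matches are assumptions, since the original notebook cell is not reproduced here.

```python
import numpy as np
import pandas as pd

# Made-up sample matches.
results_df = pd.DataFrame({
    "home_team": ["Alpha", "Beta", "Gamma"],
    "away_team": ["Beta", "Gamma", "Alpha"],
    "home_score": [2, 0, 1],
    "away_score": [0, 3, 1],
})

# Conditions are checked in order; the matching choice is used for each row.
conditions = [
    results_df["home_score"] > results_df["away_score"],
    results_df["home_score"] < results_df["away_score"],
]
choices = [results_df["home_team"], results_df["away_team"]]

# Rows where neither condition holds (equal scores) get the default label.
results_df["winner"] = np.select(conditions, choices, default="Tie")
print(results_df)
```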

This appends the winner column into our dataframe:

Finalizing and selecting our required dataframes:

Some fundamental information about our two dataframes, i.e. the number of rows of data, can be found using the .shape attribute and taking the first element of the tuple it returns:

Exploratory Analysis and Visualization

Before we ask questions about the implications of these match results, it may be helpful to compute some statistics about the winning and losing teams, and to explore different distributions and relationships using plots and charts.

We will begin by importing matplotlib.pyplot and seaborn:

To start off, let's take a look at the top 10 teams that have played the highest number of home matches.
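One way to get such a ranking is value_counts() followed by head(). The sketch below uses made-up teams and keeps only the top 2 so the output stays small. With the real data we would call head(10).

```python
import pandas as pd

# Made-up fixtures; column names are assumed to match the dataset.
results_df = pd.DataFrame({
    "home_team": ["Alpha", "Alpha", "Beta", "Alpha", "Gamma", "Beta"],
    "away_team": ["Beta", "Gamma", "Gamma", "Gamma", "Alpha", "Alpha"],
})

# value_counts() sorts teams by number of appearances, highest first.
top_home = results_df["home_team"].value_counts().head(2)
top_away = results_df["away_team"].value_counts().head(2)

print(top_home)
print(top_away)
```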

Now we will visualize this information using a bar plot that shows the teams with the most home matches and how many home matches each played within this timeframe.

Now, we can do the same calculation to look at the top 10 teams that have played the most away matches, i.e. at venues other than their home venues.
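A bar plot of such counts could be sketched with plain Matplotlib as below. The counts are made up, and the labels and figure size are our own choices rather than the original notebook's.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up home-match counts, already sorted from highest to lowest.
home_counts = pd.Series({"Alpha": 3, "Beta": 2, "Gamma": 1})

fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.bar(home_counts.index, home_counts.values)
ax.set_xlabel("Team")
ax.set_ylabel("Home matches played")
ax.set_title("Teams with the most home matches")
fig.tight_layout()
```

In a notebook, the figure is displayed when the cell finishes, or plt.show() can be called explicitly.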

We will show this information in another bar plot.

From these two graphs, one thing we can infer is that England is one of the top countries in both lists along with Argentina, Germany, Sweden and Hungary. That means that these teams have good experience playing both in their home venues and in other venues.

Similarly, we may also want to take a look at the top 10 teams that have won the most penalty shootouts over the years.

Let us also visualize this information, this time using a different kind of chart.

The visualization shows that teams like Egypt, South Korea and Argentina have won the most shootouts, which suggests they have very good penalty takers and may be more likely to win if a match goes to a penalty shootout.
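The shootout ranking can be sketched the same way. As an optional extra, which is our own addition, the example also counts how often each team appeared in a shootout, so raw win counts can be read next to a win rate. The rows are made up and the column names are assumed.

```python
import pandas as pd

# Made-up shootouts.
shootouts_df = pd.DataFrame({
    "home_team": ["Alpha", "Beta", "Alpha", "Gamma"],
    "away_team": ["Beta", "Gamma", "Gamma", "Beta"],
    "winner": ["Alpha", "Beta", "Alpha", "Beta"],
})

# Shootouts won per team.
wins = shootouts_df["winner"].value_counts()

# Shootouts played per team, whether listed as home or away.
appearances = pd.concat(
    [shootouts_df["home_team"], shootouts_df["away_team"]]
).value_counts()

# Teams that never won have no entry in wins, so fill those with 0.
summary = pd.DataFrame({"wins": wins, "appearances": appearances}).fillna(0)
summary["win_rate"] = summary["wins"] / summary["appearances"]
print(summary.sort_values("wins", ascending=False))
```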

Asking and Answering Questions

In this section we will ask some interesting questions about our data and gain insight into some of the different aspects of the world of football.

Q1: What is the average score by a home team compared to the average score of the away team in a particular match?

We use the .mean() method of a Pandas Series here to find the average scores of the home team and the away team.

Therefore, the dataset suggests that, on average, the home team scores about twice as many goals as the away team in a match, and so may have a ‘home advantage’.
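A minimal sketch of the calculation on made-up scores (the column names are assumed to match the dataset):

```python
import pandas as pd

# Made-up scores for four matches.
results_df = pd.DataFrame({
    "home_score": [2, 1, 3, 2],
    "away_score": [0, 1, 2, 1],
})

avg_home = results_df["home_score"].mean()
avg_away = results_df["away_score"].mean()

print(f"Average home score: {avg_home:.2f}")
print(f"Average away score: {avg_away:.2f}")
```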

Q2: Which team dominated the results in the last 10 years?

We will look at the winners for the last 10 years from the dataframe. This process can of course be repeated to find the dominant teams of earlier periods. First we will split our date column into year, month, day and weekday columns using the Pandas datetime accessors.

Now let's query the results dataframe to find the recent results from 2012 till now:
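The conversion could look like the sketch below, which uses pd.to_datetime and the .dt accessor on three made-up dates. The new column names are our own choice.

```python
import pandas as pd

# Made-up dates stored as text, as they would be after a plain read_csv.
results_df = pd.DataFrame({"date": ["2011-12-31", "2018-06-14", "2021-03-01"]})

# Parse the strings, then pull out the individual date parts.
results_df["date"] = pd.to_datetime(results_df["date"])
results_df["year"] = results_df["date"].dt.year
results_df["month"] = results_df["date"].dt.month
results_df["day"] = results_df["date"].dt.day
results_df["weekday"] = results_df["date"].dt.day_name()

print(results_df)
```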
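A sketch of that query on made-up rows follows. It assumes a winner column in which drawn matches carry the label "Tie". That label is an assumption about how the column was built, so draws are dropped before counting wins.

```python
import pandas as pd

# Made-up results with a precomputed winner column.
results_df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2010-05-01", "2013-06-02", "2015-07-03", "2019-08-04", "2021-09-05"]
    ),
    "winner": ["Alpha", "Alpha", "Tie", "Beta", "Alpha"],
})

# Keep only matches played from 2012 onwards.
recent = results_df[results_df["date"].dt.year >= 2012]

# Count wins per team, ignoring drawn matches.
recent_wins = recent.loc[recent["winner"] != "Tie", "winner"].value_counts()
print(recent_wins.head(10))
```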

It is quite clear from the list that Brazil has been the most dominant team over the past 10 years with regards to the number of matches won. We can plot a bar chart with this series information:

Q3: What is the percentage of winning for a home team compared to the away team and ties?

Now we will find out how many times the home team and away teams are declared the winner out of the 43,182 match results:

Now we can easily find out the percentages of the events:
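One way to get these percentages is to label each match and use value_counts(normalize=True). The sketch below does this on eight made-up scorelines. The labels "Home win", "Away win" and "Tie" are our own.

```python
import numpy as np
import pandas as pd

# Made-up scorelines.
results_df = pd.DataFrame({
    "home_score": [2, 0, 1, 3, 1, 2, 0, 4],
    "away_score": [0, 3, 1, 0, 1, 1, 2, 1],
})

home = results_df["home_score"]
away = results_df["away_score"]

# Label each match by its outcome; equal scores fall through to the default.
outcome = np.select([home > away, home < away], ["Home win", "Away win"], default="Tie")

# normalize=True returns shares between 0 and 1; multiply by 100 for percent.
percentages = pd.Series(outcome).value_counts(normalize=True) * 100
print(percentages.round(1))
```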

Now, let us visualize this percentage information more clearly in a pie chart:

It is quite evident from the pie chart that the home team won the highest percentage of matches, which further strengthens the hypothesis that the home team has a home advantage.

Q4: What is the trend of winning or losing for the most dominant team of the last 10 years?

Now let us look back at some statistics of one of the most dominant football teams of the past 10 years, Brazil.

We will look at the number of Brazil's wins against each team:

Then we find the total matches played by Brazil in this timeframe:

Hence, we can find the winning percentage of Brazil for this time period:
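The three steps above (wins, matches played, winning percentage) can be sketched as one small helper. The function name team_win_rate and the made-up rows are ours, and the winner column is assumed to hold the winning team's name.

```python
import pandas as pd


def team_win_rate(results: pd.DataFrame, team: str, start_year: int) -> float:
    """Return the percentage of matches won by team since start_year."""
    recent = results[results["date"].dt.year >= start_year]
    # Matches the team played, either at home or away.
    played = recent[(recent["home_team"] == team) | (recent["away_team"] == team)]
    if played.empty:
        return 0.0
    wins = (played["winner"] == team).sum()
    return 100 * wins / len(played)


# Made-up results.
results_df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2011-03-01", "2013-03-01", "2014-03-01", "2016-03-01", "2018-03-01", "2020-03-01"]
    ),
    "home_team": ["Alpha", "Alpha", "Gamma", "Beta", "Beta", "Alpha"],
    "away_team": ["Beta", "Beta", "Alpha", "Alpha", "Gamma", "Gamma"],
    "winner": ["Alpha", "Alpha", "Tie", "Alpha", "Beta", "Gamma"],
})

print(team_win_rate(results_df, "Alpha", 2012))
```

With the real data, the call would be team_win_rate(results_df, "Brazil", 2012).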

Q5: How did the total goals (between home team and away team) vary throughout the year of a World Cup?

Let us simply take a look at 2018, the year of the latest FIFA World Cup.

Let us visualize this very interesting revelation using a ‘cool’ Heatmap!

To do that, we first need to arrange this information as a matrix:
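One possible way to build such a matrix is a pivot table of total goals. The layout here, months as rows and days of the month as columns, is an assumption, since the original cell is not reproduced. The scores are made up.

```python
import pandas as pd

# Made-up 2018 matches.
results_df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2018-03-23", "2018-06-14", "2018-06-15", "2018-06-15", "2018-09-06"]
    ),
    "home_score": [1, 2, 0, 3, 2],
    "away_score": [1, 0, 1, 3, 1],
})

matches_2018 = results_df[results_df["date"].dt.year == 2018].copy()
matches_2018["month"] = matches_2018["date"].dt.month
matches_2018["day"] = matches_2018["date"].dt.day
matches_2018["total_goals"] = matches_2018["home_score"] + matches_2018["away_score"]

# Rows are months, columns are days; cells hold the goals scored that day.
goal_matrix = matches_2018.pivot_table(
    index="month", columns="day", values="total_goals", aggfunc="sum", fill_value=0
)
print(goal_matrix)
```

A matrix like goal_matrix can then be passed directly to sns.heatmap().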

Now we can plot the data using the sns.heatmap() plotting tool:

From the above visualization, we can see that the highest number of goals was scored in month 6, i.e. June, when the FIFA World Cup 2018 started in Russia.

Inferences and Conclusion

Now we can recapitulate some of the important insights we got from this dataset:

We found out that the dataset is complete and does not have any null values.
We appended the ‘winner’ column to the results dataframe, to record which of the home team and away team won.
We found the number of match and shootout results available to us in this dataset.
We identified the top 10 teams that have played the most home matches (Brazil) and away matches (Uruguay), and also the number of matches they played at home and abroad respectively.
Furthermore, we found out that, out of all the countries, Egypt and South Korea have won the most matches that went to a penalty shootout.
We also found from our results that the home team scored about twice as many goals as the away team on average in a match.
We saw that Brazil topped the list of winners for the past 10 years, from 2012 till now.
Then we saw that home wins were almost as frequent as away wins and ties combined.
Moreover, we deduced that Brazil won more than two-thirds of the matches it has played over the past 10 years, from 2012 till now.
And lastly, we found that in 2018 more goals were scored in the month in which the FIFA World Cup was held than in the other months, when other tournaments are hosted.

References and Future Work

References:

Numpy documentation: https://numpy.org/doc/
Pandas documentation: https://pandas.pydata.org/docs/
Matplotlib documentation: https://matplotlib.org/
Seaborn documentation: https://seaborn.pydata.org/
Data Visualization cheat sheet: https://jovian.ml/aakashns/dataviz-cheatsheet
Bar Charts: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html
Pie Charts: https://matplotlib.org/stable/gallery/pie_and_polar_charts/pie_features.html
Heatmaps: https://seaborn.pydata.org/generated/seaborn.heatmap.html
Dataset Collection: https://www.kaggle.com/martj42/international-football-results-from-1872-to-2017

Future Work: There is so much more to explore in this fascinating dataset. It might be interesting to look at some more aspects of this data like:

Which is the best football team of all time?
Which teams play better at home and which teams play better abroad?
Which teams play more friendly matches and which play more international championships?