Analyzing International goals scored from 1872 to 2017 in Football

Omar Wanis
16 min readSep 16, 2023

--

Credit: Image by Jarmoluk from Pixabay

Introduction

International football brings together the finest players from every country, uniting them into teams that compete in tournaments across their continents and around the world.

International matches unfold within prestigious tournaments such as the Euro in Europe, Copa America in South America, AFCON in Africa, and the global spectacle of the World Cup. Qualifying matches determine which nations from each continent earn a spot in their regional championship or the World Cup. Additionally, friendly matches provide teams with a chance to refine their skills in preparation for significant upcoming events.

Data Set

I got the data set from Kaggle.com (https://www.kaggle.com/datasets/martj42/international-football-results-from-1872-to-2017) using the goalscorers.csv file only.
This file contains data about each goal scored in each international match from 1872 to 2017. Each row represents a goal scored.

Columns:
date - date of the match
home_team - the name of the home team
away_team - the name of the away team
team - name of the team scoring the goal
scorer - name of the player scoring the goal
minute - minute at which the goal was scored
own_goal - whether the goal was an own-goal
penalty - whether the goal was a penalty

You can also check the files of this project on Github using this link:
https://github.com/OWanis/Data_Analysis_International_Football_From_1872_to_2017.git

EDA (Exploratory Data Analysis)

I downloaded the data set and used Excel to clean the data. First, I check for duplicates finding none, then I filtered out the null values found in the minute column and deleted them.

Analysis

Using MYSQL for queries, and Tableau for Data Visualizations, I started my analysis.

I created the schema, the table, and imported the data into MYSQL.

CREATE SCHEMA internationalFootballDb;

CREATE TABLE goal_scorers
(
date DATE,
home_team VARCHAR(50),
away_team VARCHAR(50),
team VARCHAR(50),
scorer VARCHAR(50),
minute INT,
own_goal VARCHAR(50),
penalty VARCHAR(50)
);

LOAD DATA INFILE 'C:/ProgramData/MySQL/MySQL Server 8.0/Uploads/goalscorers.csv' INTO TABLE goal_scorers
FIELDS TERMINATED BY ','
IGNORE 1 LINES;

I then selected all data from the table to preview it, here’s a preview.

-- PREVIEWING THE DATASET
SELECT * FROM goal_scorers;

Now that I have my dataset ready to work on, I first counted the number of rows indicating the number of goals scored.

-- COUNTING THE NUMBER OF GOALS SCORED
SELECT COUNT(*) as num_goals FROM goal_scorers;

The number of goals scored is 40,855 goals.

Then I wrote a query to count the number of matches. A match is not uniquely identified by a row, as each row represents a single goal scored in a match, and each match can have 1 or more goals scored at. To identify a match, I need to group the data by the date, home_team, and away_team columns, as no 2 teams ever played more than 1 match at the same day.

-- COUNTING THE NUMBER OF MATCHES PLAYED
SELECT COUNT(*) AS num_matches
FROM
( SELECT date
FROM goal_scorers
GROUP BY date, home_team, away_team
) AS subquery;

The number of matches played in this dataset is 13,200 matches. Note that these are the number of matches that only saw a goal scored, any matches that ended in a goalless draw are not present in this dataset.

Now that I have a basic idea about my data, I decided to divide my analysis into 3 categories: team related analysis, player related analysis, and goal related analysis.

Team Analysis:

The goal here was to provide insights about the performance of the individual teams (countries) through goal scoring.

I wrote a query to isolate the top 10 teams that scored the most goals throughout the history of international football.

-- COUNTING THE NUMBER OF GOALS FOR EACH TEAM (MAX)
SELECT COUNT(*) as goals_num, home_team
FROM goal_scorers
GROUP BY home_team
ORDER BY goals_num DESC
LIMIT 10;

And here’s the result. Brazil and Argentina are leading the way, while countries from Europe occupy half the list.

Now I want to show the top 10 teams that scored the least goals throughout history.

-- COUNTING THE NUMBER OF GOALS FOR EACH TEAM (MIN)
SELECT COUNT(*) as goals_num, home_team
FROM goal_scorers
GROUP BY home_team
ORDER BY goals_num ASC
LIMIT 10;

And here’s the result. These teams are not popular in the world of football, and thus it makes sense seeing them at the bottom of the list.

Football is known to have matches where the difference in quality between the 2 teams is jarring. These ties can witness the biggest amounts of goals scored by 1 team in a match. As such, I decided to write a query to isolate the top 10 teams that scored the most goals in a single match.

-- COUNTING THE MOST NUMBER OF GOALS SCORED BY A TEAM IN A MATCH
SELECT
team,
COUNT(*) as num_goals,
CONCAT(home_team, " - ", away_team) as teams,
date
FROM
goal_scorers
GROUP BY
date,
home_team,
away_team,
team
ORDER BY
num_goals DESC
LIMIT 10;

Surprisingly the result here doesn’t show the top teams from Europe or South America, but most teams from Asia or Australia. As we can see in the result above, the Australian team bashed other team like American Samoa, Tonga, and Cook Islands.

Now, I needed to know teams goal scoring statistics each minute. So I wrote a query showing which team scored the most goals for each minute of the match.

-- SELECTING WHICH TEAM SCORES THE MOST EVERY MINUTE
WITH team_minute AS (
SELECT COUNT(*) as goals_num, team, minute
FROM goal_scorers
GROUP BY team, minute
ORDER BY goals_num DESC
)
SELECT a.*
FROM team_minute a
JOIN
(
SELECT MAX(goals_num) as max_goals, minute
FROM team_minute
GROUP BY minute
) b
ON a.minute = b.minute AND a.goals_num = b.max_goals
ORDER BY minute;

And here’s the result. Notice that some teams appear more than once like Brazil and Argentina.

Working on the last result, I wanted to show the top 10 teams that appeared on that list the most.

-- NUMBER OF MINUTES A TEAM SCORED THE MOST GOALS
WITH team_minute_number AS (
WITH team_minute AS (
SELECT COUNT(*) as goals_num, team, minute
FROM goal_scorers
GROUP BY team, minute
ORDER BY goals_num DESC
)
SELECT a.*
FROM team_minute a
JOIN
(
SELECT MAX(goals_num) as max_goals, minute
FROM team_minute
GROUP BY minute
) b
ON a.minute = b.minute AND a.goals_num = b.max_goals
ORDER BY minute
)
SELECT COUNT(*) AS num_as_most_goals, team
FROM team_minute_number
GROUP BY team
ORDER BY num_as_most_goals DESC
LIMIT 10;

And here’s the result. As expected, Brazil and Argentina lead this stats, as we witness 6 out of 10 countries in the list from Europe.

I plotted a bar graph showing Brazil at the top scoring the most goals at 32 different minutes, followed by Argentina who scored the most goals at 25 different minutes of the game, and Germany with 23.

Players Analysis:

The goal here was to provide insights about the performance of the individual teams (players) through goal scoring.

The queries will be somehow similar to the ones used with the countries, with some added tweaks.

The first query shows the top 10 goalscorers in the history of international football.

-- SELECTING THE TOP 10 GOAL SCORERS
SELECT COUNT(*) as goals_num, scorer
FROM goal_scorers
GROUP BY scorer
ORDER BY goals_num DESC
LIMIT 10;

And here’s the result. Cristiano Ronaldo leads the way, followed by Lewandowski and Messi.

As shown in the team-based analysis, some matches can have a team score so many goals against another. These matches often have a player or 2 scoring record-breaking amount of goals in a tie.
I wrote a query to show the top 10 players who scored the most goals in a single match.

-- COUNTING THE MOST NUMBER OF GOALS SCORED BY A PLAYER IN A MATCH
SELECT
scorer,
COUNT(*) as num_goals,
CONCAT(home_team, " - ", away_team) as teams,
date
FROM goal_scorers
GROUP BY
date,
home_team,
away_team,
scorer
ORDER BY
num_goals DESC
LIMIT 10;

As expected, the same matches featured in the team-based analysis of the same kind, are present here like Australia vs American Samoa where not only Archie Thompson scored a whooping 13 goals, but also David Zdrilic scored 8 goals in the same match.

Now I wanted to break down the statistics by years first, and then later by minutes. Starting with years, I wrote a query showing the total number of goals scored by each player each year.

SELECT
COUNT(*) as goals_num,
EXTRACT(YEAR FROM date) as date_year,
scorer
FROM goal_scorers
GROUP BY date_year, scorer

and here’s the result:

Afterwards, I wanted to use this data to know which player scored the most goals each year.

WITH goals_per_year AS (
SELECT
COUNT(*) as goals_num,
EXTRACT(YEAR FROM date) as date_year,
scorer
FROM goal_scorers
GROUP BY date_year, scorer
)
SELECT
a.*
FROM
goals_per_year a
JOIN
(
SELECT MAX(goals_num) as max_goals, date_year
FROM goals_per_year
GROUP BY date_year
) b
ON
a.date_year = b.date_year
AND a.goals_num = b.max_goals
GROUP BY
a.date_year,
a.scorer

And here’s the result of the query. Note that all players who scored the most amount of goals in the same year appear together.

Lastly, I can use this data to filter the top 10 years where the top scorer scored the most goals by just adding an ORDER BY statement. Here’s the result.

As you can see in the last graph, Karim Bagheri scored the most goals ever in a single year with 19 goals in 1997. Harry Kane in 2021, and Archie Thompson in 2001 also had a good tally with 16 goals each in a single year. You can also note that with the exception of Just Fontaine in 1958, the list was dominated by matches played in the last 30 years indicating the increase in the amount of goals scored as time passed. This might be due to the increase in the amount of matches, or simply because goal scorers are getting better by time.

Adding on the last query, I can now show how many times a player was the top scorer in a year. Here’s the whole query together.

-- TOP SCORER EACH YEAR
WITH top_scorer_count AS (
WITH goals_per_year AS (
SELECT
COUNT(*) as goals_num,
EXTRACT(YEAR FROM date) as date_year,
scorer
FROM goal_scorers
GROUP BY date_year, scorer
)
SELECT
a.*
FROM
goals_per_year a
JOIN
(
SELECT MAX(goals_num) as max_goals, date_year
FROM goals_per_year
GROUP BY date_year
) b
ON
a.date_year = b.date_year
AND a.goals_num = b.max_goals
GROUP BY
a.date_year,
a.scorer
)
-- NUMBER OF TIMES A PLAYER WAS A TOP SCORER IN A YEAR
SELECT COUNT(*) AS top_scorer_count, scorer
FROM top_scorer_count
GROUP BY scorer
ORDER BY top_scorer_count DESC
LIMIT 10;

and here’s the result. As shown, that stats didn’t happen much as many players were top scorers in different years. Only Muller and Petrone were able to achieve this stat 3 times.

Now, onto the minutes stats for players. Similar to the query created for the teams, I created another one for the players to show how many minutes each player scored the most goals, and how many times did each player achieve that stat (isolating the top 10 only).

-- NUMBER OF MINUTES A PLAYER SCORED THE MOST GOALS
WITH scorer_minute_number AS (
WITH scorer_minute AS (
SELECT COUNT(*) as goals_num, scorer, minute
FROM goal_scorers
GROUP BY scorer, minute
ORDER BY goals_num DESC
)
SELECT a.*
FROM scorer_minute a
JOIN
(
SELECT MAX(goals_num) as max_goals, minute
FROM scorer_minute
GROUP BY minute
) b
ON a.minute = b.minute AND a.goals_num = b.max_goals
ORDER BY a.minute
)
SELECT COUNT(*) AS num_as_most_goals, scorer
FROM scorer_minute_number
GROUP BY scorer
ORDER BY num_as_most_goals DESC
LIMIT 10;

And here’s the result. Ronaldo leads the way as he is the top goal scorer for 14 different minutes, followed by Muller and Keane.

Goals Analysis:

The last bit of analysis I did was for the overall goals scored stats themselves.

I created a query to show the number of penalty goals vs non-penalty goals, and the number of own goals vs non-own goals.

-- COUNTING THE PENALTY VS NON-PENALTY GOALS
SELECT penalty, COUNT(*) as count,
COUNT(*) * 100 / (SELECT COUNT(*) FROM goal_scorers) as percentage
FROM goal_scorers
GROUP BY penalty;

-- COUNTING THE OWN GOALS VS NON-OWN GOALS
SELECT own_goal, COUNT(*) as count,
COUNT(*) * 100 / (SELECT COUNT(*) FROM goal_scorers) as percentage
FROM goal_scorers
GROUP BY own_goal;

I also created a query to show the number of goals scored by a home team versus an away team.

-- COUNTING THE NUMBER OF HOME GOALS VS AWAY GOALS
SELECT
SUM(CASE WHEN home_team = team THEN 1 ELSE 0 END) as home_goals,
SUM(CASE WHEN away_team = team THEN 1 ELSE 0 END) as away_goals
FROM goal_scorers;

Then I created query to show the number of goals scored each year.

-- COUNTING THE NUMBER OF GOALS FOR EACH YEAR
SELECT EXTRACT(YEAR FROM date) as date_year, COUNT(*) as goals_num
FROM goal_scorers
GROUP BY date_year
ORDER BY date_year ASC;

I plotted the result in a line chart. First, you can notice that the number of goals scored in each year increases by time. Also, you can see big spikes happening every few years. These often happen in a year prior to a big event, like 1993 preceeding World Cup 1994, 2001 preceeding World Cup 2002, 2011 preceeding Euro 2012, and 2021 preceeding World Cup 2022 as more qualification matches are played in these years to identify the teams who will qualify to these big events.

If we filter the recent years from 1990, you can really notice these big spikes.

After zooming in on the graph, a few more things stand out. First, 2020 witnesses the least amount of goals scored in a single year since 1964 with only 97 goals due to the Coronoavirus pandemic that halted the entirety of football then. While 2021 had the most international goals scored in history with 2,051 goals making the difference between the 2 years probably the greatest different in goal scoring between 2 consecutive years in history.

You can also see a very irregular boost in number of goals scored in 1993 starting the first bigspike of the list. This was probably due to the sheer increase in number of matches then.
But to drive this hypothesis home, let’s plot the same graph for the number of matches played every single year.

I wrote a query showing the number of matches played each year.

-- COUNTING THE NUMBER OF MATCHES FOR EACH YEAR
SELECT COUNT(*) AS num_matches, date_year
FROM
( SELECT EXTRACT(YEAR FROM date) as date_year, home_team, away_team
FROM goal_scorers
GROUP BY date_year, home_team, away_team ) AS subquery
GROUP BY date_year
ORDER BY date_year ASC;

Now plotting the same line graph for the number of matches per year, we can find the following:

As predicted, the same spikes in the number of goals are matched here in the number of matches which further solidifies that the increase in the latter directly affects the increase in the former.

Let’s get a closer look here as well, filtering the data starting from year 1990.

Now we have a better look at the graph, and we can spot that the spike in 1993 goals is related to the increase in number of matches from 186 matches in 1992 to 344 in 1993 (almost the double), as well as the whooping different from the corona year 2020 with only 31 matches to 645 matches in 2021 scoring the most number of matches in a single year.

Since the number of goals and number of match appear to be heavily correlated, let’s build a line chart for the average number of goals per match each year.

First, I wrote a query to join both tables together.

-- COUNTING THE NUMBER OF GOALS/MATCH EACH YEAR
WITH matches_per_year AS (
SELECT
EXTRACT(YEAR FROM date) as date_year,
home_team,
away_team
FROM goal_scorers
GROUP BY
date_year,
home_team,
away_team
)
SELECT a.date_year,
COUNT(*) AS num_matches,
b.goals_num,
(goals_num/COUNT(*)) AS goal_ratio
FROM matches_per_year a
JOIN
(
SELECT
EXTRACT(YEAR FROM date) as date_year,
COUNT(*) as goals_num
FROM goal_scorers
GROUP BY date_year
) b
ON a.date_year = b.date_year
GROUP BY date_year
ORDER BY date_year ASC;

And here’s a preview of the result.

Now, let’s plot the same line graph for this stats as the last two.

The blue line represents the average number of goals per match each year, and the faint orange line represents the number of matches per year. As seen in the graph, the average numbers where very inconsistent as the number of matches were so few.

In the graph I highligted the extreme outliers: year 1925 with the most average goals per match (8.6) scored in 6 matches, and 1921 with the least average goals per match (2.3) scored in 3 matches. These extreme numbers happened as a result of a very few matches being played at that time and can’t be a good indicator to the goalscoring abilities of the teams.

Also, you can see the curve starting to flatten in the mid 1960s as the number of matches is becoming more consistent.

Taking a closer look into the line chart by filtering the data from 1990 onwards.

We can see that the year 1993 didn’t only see a boost in the number of matches, but also the number of goals. But the findings on the far right are more surprising since the huge gap in the number of goals between 2020 (the pandemic year) and 2021 is only justified by the sheer difference in the number of matches. Even though 2020 had only 31 matches played, and 2021 had 645 matches, the average goal ratio is almost identical at 3.1 goals per match.

Moving on, as minutes are an important factor in football, I moved to analyzing goals scored for each minute isolating the top and bottom 10 minutes where goals were scored.

-- SELECTING THE 10 MINUTES WHERE THE MOST GOALS SCORED
SELECT COUNT(*) as goals_num, minute
FROM goal_scorers
GROUP BY minute
ORDER BY goals_num DESC
LIMIT 10;

-- SELECTING THE 10 MINUTES WHERE THE LEAST GOALS SCORED
SELECT COUNT(*) as goals_num, minute
FROM goal_scorers
GROUP BY minute
ORDER BY goals_num ASC
LIMIT 10;

As shown in the graph below, the 90th minute (last minute of the original time) is a dramatic minute where teams push to score goals as to avoid drawing or even losing the game. This minute alone witnessed 1,606 goals almost double the 45 minute mark (last minute of the first half) with only 835 goals. However, minutes past 90 have very few goals as very few matches enter the Extra-time phase where the match time is extended to 120 instead of 90 as a result of a draw.

Football analysts love dividing a match into segments, one of the popular ones are 15-minutes segments. I created a query to show the number of goals scored per each 15-minutes segment.

SELECT
time_span,
COUNT(*) AS count
FROM (
SELECT
CONCAT(
FLOOR((minute - 1) / 15) * 15 + 1,
'-',
FLOOR((minute - 1) / 15) * 15 + 15
) AS time_span,
minute
FROM goal_scorers
) AS subquery
GROUP BY time_span
ORDER BY count DESC;

And here’s the result. As predicted from the last result as well, more goals tend to be scored towards the end of a match.

Lastly, I wanted to break down the goals scored for each half of a match, as a football match is divided into two 45-minutes halves with a 15 minutes break between them.

-- COUNTING THE NUMBER OF GOALS PER HALF (45 MINS)
SELECT
COUNT(*) AS count,
(CASE
WHEN time_span = '1-45' THEN 'First Half'
WHEN time_span = '46-90' THEN 'Second Half'
ELSE 'Extra Time'
END) as time
FROM (
SELECT
CONCAT(
FLOOR((minute - 1) / 45) * 45 + 1,
'-',
FLOOR((minute - 1) / 45) * 45 + 45
) AS time_span,
minute
FROM goal_scorers
) AS subquery
GROUP BY time_span, time
ORDER BY count DESC;

And here’s the result. As shown, the second half has more goals scored in, further agreeing with the late drama goal scoring theory mentioned above, while very few goals were scored in the extra time (the 30 minutes extended after the original 90 minutes “sometimes” when the result is a draw in knockout phases).

If you arrived here, thanks a lot for taking the time to read my analysis.

--

--