FIFA World Cup Statistics: A Data Science Perspective — Part I

Record Evolution
8 min readJun 11, 2018

--

Image by Wuilmar Matias-Morales

Data-driven Historical Stats from the FIFA World Cup between 1930 and 2014 — Data Modeling with Repods

Author: Rafael Queiroz, Full-Stack Developer and IT-Infrastructure Management @Record Evolution GmbH <rafael.queiroz@record-evolution.de>

Following our tutorial series on Repods, we can now introduce this slightly more complex example. If you are just starting to work with the platform, we suggest that you get started with our introductory tutorial. However, if you feel confident with data modeling, enjoy the following analysis — an analysis of the most widely viewed sports event in the world: The FIFA World Cup.

You can also check out the pod we used on this tutorial here.

Table of Contents

The dataset provided by the Fédération Internationale de Football Association (FIFA) is divided into three files containing detailed information about matches, players, and the World Cups in general. The goal of this article is to create an iterative and incremental data analysis starting from a model and readapting it according to features observed in the original dataset.

0. Planning the Model

Before starting an analysis, as mentioned in our introductory article, it is essential to be aware of the scope of our information and to keep in mind the relations between real-world entities.

Let’s get started by breaking down the World Cup into 9 entities as displayed in Figure 0. Now we can create a quick draft by sketching the relations using a simplified Entity-Relationship model.

As we don’t know which are the attributes of each entity, we simply describe them with an ID and a Name. Then we add attributes that we consider necessary for the interconnections.

Figure 0. World Cup Data Model Draft

We have arranged the relations around the entity Match — after all, this is what the World Cup is about. Now that we have created this overview, we can go to Repods and get to work.

Go to https://repods.io, log in into your account (or sign up if you don’t have an account yet), and create a free Data Pod. You can call your Pod FIFAWorldCupAnalysis.

1. Importing Files

Once this is done, you can download the datasets World Cup Players, World Cup Matches, and World Cups from here (provided as a courtesy of the FIFA World Cup Archive to Kaggle) and import them to Repods. If you're not sure where and how to do that, take a quick look at our introductory article first.

After uploading the files, you can select each of them, and click to have a look at the Raw Table, File Structure, and Column Datatypes. Take a quick look at how the data is structured; carefully observe the data types and the possible formatting/transformations.

Figure 1. Import Files

At this point, it is important to apply smaller transformations that help to specify the appropriated data types for what we have imported. For example, you might see that the WorldCupMatches.csv file contains a column Datetime with something like "12 Jun 1930 15:40" — you can convert it to a date or keep it as text.

But here comes the first tricky part: All dates are displayed in the current user time. However, the information about the match time is only relevant in the local time of the match (since it might indicate weather conditions and players' physical/psychological conditions). So how do we deal with this?

There are many options. The simplest option is to accept the time modification and tell whoever uses this data that the time presented is the user time. Another option is to ignore this fact completely, which may lead to misinterpretations. As we want to keep this information, we can simply postpone the data type formatting to the Pipes Transformation stage, leaving this field as a text.

Figure 1.2. Raw Table Icon

After the importing is finalized, we can open the raw table by clicking on the Raw Table icon and then adding a "Time" field with the transformation shown in Code Snippet 1.1:

to timestamp(substring("Datetime", 15, 21, 'HH24:MI:SS')::time

Code Snippet 1.1. Extracting Time from Datetime Column

Now we are ready to keep the time independently as in the original format and still be able to use the date later.

2. Start Modeling

In our approach, we start creating the tables based on the draft we have drawn in the beginning of this article and later incrementing it with more data and details. Once you have created and entered your Pod, click on Core Data (shortcut Alt+3) and then click on Add Table.

Figure 2.1. Model View

Let's start creating the World Cup table — here you can choose between an Event Table and a State Table. Even though the World Cup is an event per se, we want to track the evolution and events that happen within it. So we can opt for a State Table. On the other hand, matches are entities defined by a specific point in time, which is why we choose the table type Event Table here.

Figure 2.2. Match Entity Columns

Event tables have a default field called EVENT_DATE that we can fill in with the date from the raw sources later on.

The center of our model is the Match entity. It is the entity with the largest number of relations. We can observe this when we create it with columns that connect to Round, Team, and Venue entities.

So let's create all tables as described in the draft we have shown in the beginning of this article (Figure 0).

Since we are creating all tables at once, we can't establish relations until the draft is ready. The tables without relations will be shown as in Figure 2.3.

Figure 2.3. Overview after Adding Tables without Relations

Once we’ve created the entities, we can start interconnecting them using the draft fields. Let's start with the model table. Go to Relations to other Core Tables and select the columns and the tables to which the columns belong as shown in Figure 2.4.

We can see that Repods creates intuitive reverse names for the relations. This function is helpful when it comes to creating reports in the future or checking whether the created connections have semantic relations.

Figure 2.4. Establishing Relations between Tables

We can repeat the procedure for all entities following the model draft until reaching the model overview shown in Figure 2.5. In this view, it is possible to see the relation names and observe the relations from the point of view of any entity.

Figure 2.5. Model Overview after Creation of Relations

Once you have completed this step, let's start populating our entities with some data.

3. Pipes Transformation

Pipes is where data modeling happens in Repods. Here we can create SQL queries that combine different sources and modify the original data to fit into our model.

Let's go to Pipes (Alt+2) to see all created entities and imported files. It will look somewhat like this:

Figure 3.1. Empty Pipes

Now we can simply click on Add Pipe to populate our model. Since we have created the tables from the outset, we can toggle the button to Select From Existing Tables, choose the table we need, and Repods will fill out the appropriated Pipe Name automatically. After that, you can also select the correct source.

Let's start by loading the Matches as shown in Figure 3.2:

Figure 3.2. Creating a Pipe

After we press Create, we are going to be redirected to Pipe Transformation. In the present panel, we can use the Datetime field, mentioned in Importing Files, to be the event date. You can simply copy and paste Code Snippet 3.1.

SELECT
to_date(r1."Datetime", 'DD Mon yyyy') as EVENT_DATE,
r1."MatchID" as "ID", -- bigint
r1."RoundID" as "RoundID", -- bigint
r1."Home Team Initials" as "HomeTeam", -- bigint
r1."Away Team Initials" as "AwayTeam", -- bigint
null as "VenueID"
FROM
S_WORLDCUPMATCHES r1 -- replace by your table name
WHERE
r1."Datetime" is not null and r1."MatchID" is not null

Code Snippet 3.1. Draft of Pipe Transformation for Matches.

As you can see, we are converting the day, month, and year to a database date type, using it as EVENT_DATE. Also, at this point, we don’t have anything to be used as "Venue ID", so we are explicitly inserting null values here to remember to populate them later.

Figure 3.4. Flow State Button Indicating Blocked Flow

We can save this pipe transformation and flow the data into our table by unblocking the pipe with the button shown in Figure 3.4.

Now we can proceed by creating other pipes for the entities, following a sequence of snippets for each of them:

SELECT
to_date(r1."Year" || '-1-1', 'yyyy-mm-dd') as VALID_FROM,
to_date(r1."Year" || '-12-31', 'yyyy-mm-dd') as VALID_TO,
r1."Year" as "ID", -- bigint
r1."Country" as "Name" -- text
FROM
S_WORLDCUPS r1

Code Snippet 3.2. World Cup Pipe Transformation

To simplify, we select the year of the event as a validity interval of the World Cup. It is also possible to derive this interval from our data, choosing the day of the first and last matches as VALID_FROM and VALID_TO, respectively.

For the remaining pipes, you can copy and paste the code snippets from this file — use the code snippets numbered from 3.1 to 3.6.

Code Snippets for First Pipes Transformation: https://gist.github.com/querafael/413ec093de3dea1f5adc319a914ad884

After these basic pipes, we can see how the data flows and interconnects entities within Repods. If we want to run some analysis, we need to include more fields and more data on reports. Check out below for more transformations on Code Snippets 3.7 to 3.10:

Updating previous code snippets: https://gist.github.com/querafael/8f7609be1a4cc9bc7bc71bb632280876

These snippets together are enough to load all the data we have from FIFA. With that, we can create Reports and Workbooks. Once the process is finalized, the pipes will look like what we have in Figure 3.5.

Figure 3.5. World Cup Pipes

Feel free to play with the Analysis section to create Reports and Workbooks they are intuitive and easy to adjust. Follow Part 2 of our tutorial Data-driven Historical Stats from the FIFA World Cup between 1930 and 2014 — Reporting and Analyzing with Repods to learn how to insert more data and create compelling reports.

You can also take a look at the Infographics section of the Analysis panel to prepare for our next tutorial where we show how to Create Your First Custom Infographics — Data Visualization with Repods.

Repods is a data platform designed to create and manage Data Pods. Data Pods are compact data warehouses with flexible Storage, vCores, and Memory plus all the tooling needed for complex data analyses.

You can also check the pod used in this tutorial here.

--

--