Photo by Fauzan Saari on Unsplash

Applying Machine Learning clustering on soccer World Cup results

Using Python and Tableau

Bernard Kurka
Feb 17, 2019

The Data Science Process

In this post, I will walk you through the data science process to cluster soccer teams using unsupervised Machine Learning. All data science projects can be tackled using the following 6 steps:

  • Define the problem
  • Gather the data
  • Clean & Explore the data
  • Model the data
  • Evaluate the model
  • Answer the problem

Define the problem

Suppose we are creating a video game about soccer World Cup teams. To build the game we have to grade the teams based on their performance, and we only have 4 categories to which any given team can be assigned.

In other words, the problem is to separate international soccer teams into groups with similar characteristics, a task known as clustering.

Gather the data

The data was retrieved from Kaggle. It contains nearly 40,000 international soccer results from 1872 to 2018, including FIFA World Cup and friendly matches. The data was saved in a CSV file and imported into Python using the Pandas library.

Note: this data does not include the 2018 FIFA World Cup results.
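As a rough illustration of the import step, here is a minimal sketch. The column names (date, home_team, away_team, home_score, away_score, tournament) are my assumption about the CSV layout, and the three inline rows are only a stand-in for the real file so the snippet runs on its own.

```python
import io

import pandas as pd

# A few sample rows standing in for the Kaggle CSV.
# The column names are assumed, not taken from the original post.
SAMPLE_CSV = """date,home_team,away_team,home_score,away_score,tournament
1930-07-13,France,Mexico,4,1,FIFA World Cup
1930-07-30,Uruguay,Argentina,4,2,FIFA World Cup
1872-11-30,Scotland,England,0,0,Friendly
"""


def load_results(source):
    """Read match results from a file path or file-like object."""
    return pd.read_csv(source, parse_dates=["date"])


results = load_results(io.StringIO(SAMPLE_CSV))
print(results.shape)
print(results["tournament"].value_counts())
```

With the real file, the call would be something like load_results("results.csv"), where the file name is again hypothetical.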

Clean & Explore the data

I performed the data cleaning and exploratory analysis in these steps:

  1. Checked for nulls and outliers.
  2. Kept only FIFA World Cup matches and teams.
  3. Calculated the total number of wins, losses, and ties for each team.
  4. Calculated the win %, loss %, and tie % for each team.
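The steps above can be sketched with Pandas as follows. This is not the original notebook code: the input column names are assumed, the teams A, B, and C are made up, and the output column names simply mirror the feature names used later in the post.

```python
import pandas as pd


def team_records(matches):
    """Return one row per team with games, wins, ties, losses and percentages."""
    cols = ["team", "goals_for", "goals_against"]
    # Build one row per team per match by stacking the home and away views.
    home = matches.rename(columns={
        "home_team": "team", "home_score": "goals_for", "away_score": "goals_against",
    })[cols]
    away = matches.rename(columns={
        "away_team": "team", "away_score": "goals_for", "home_score": "goals_against",
    })[cols]
    per_team = pd.concat([home, away], ignore_index=True)

    per_team["world_cup_wins"] = (per_team["goals_for"] > per_team["goals_against"]).astype(int)
    per_team["world_cup_ties"] = (per_team["goals_for"] == per_team["goals_against"]).astype(int)
    per_team["world_cup_losses"] = (per_team["goals_for"] < per_team["goals_against"]).astype(int)

    totals = per_team.groupby("team")[
        ["world_cup_wins", "world_cup_ties", "world_cup_losses"]
    ].sum()
    totals["world_cup_games"] = totals.sum(axis=1)
    for outcome, total_col in [
        ("win", "world_cup_wins"),
        ("tie", "world_cup_ties"),
        ("loss", "world_cup_losses"),
    ]:
        totals[f"world_cup_{outcome}_pct"] = totals[total_col] / totals["world_cup_games"]
    return totals


# Made-up matches between hypothetical teams A, B and C.
matches = pd.DataFrame({
    "home_team": ["A", "B", "C", "A"],
    "away_team": ["B", "C", "A", "C"],
    "home_score": [2, 0, 1, 0],
    "away_score": [1, 0, 3, 1],
    "tournament": ["FIFA World Cup", "FIFA World Cup", "FIFA World Cup", "Friendly"],
})

print(matches.isnull().sum())  # step 1: check for nulls
world_cup = matches[matches["tournament"] == "FIFA World Cup"]  # step 2
records = team_records(world_cup)  # steps 3 and 4
print(records)
```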

Using Tableau I created some nice plots:

1) Winning percentage:

Created using Tableau

2) Map of Winning percentage:

3) Tie percentage:

Created using Tableau

Plot highlights:

Plot nº 1: Brazil and Germany have winning percentages over 60%, while Saudi Arabia, Bulgaria, and North Korea have the lowest, at around 14%.

Plot nº 2: Most of the teams with high winning percentages are in Europe and South America.

Plot nº 3: Ireland, Paraguay, and Tunisia are the countries with the highest tie percentages. Ireland has done a really good job at this: its tie percentage is 67%, nearly twice that of the second-placed team.

Model the data

I selected 7 features for the clustering model:

features = ['world_cup_games', 'world_cup_wins', 'world_cup_ties', 'world_cup_losses', 'world_cup_win_pct', 'world_cup_tie_pct', 'world_cup_loss_pct']

I scaled the data using StandardScaler from Sklearn and generated the clusters using K-Means from the same library, with the number of clusters k = 4. The model creates 4 clusters based on the distance between points in the 7-dimensional feature space.
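A minimal sketch of this modeling step might look like the following. The team statistics here are randomly generated placeholders rather than the real World Cup data, and the n_init and random_state values are my own choices, not settings taken from the original code.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = [
    "world_cup_games", "world_cup_wins", "world_cup_ties", "world_cup_losses",
    "world_cup_win_pct", "world_cup_tie_pct", "world_cup_loss_pct",
]

# Synthetic placeholder data for 40 hypothetical teams.
rng = np.random.default_rng(0)
games = rng.integers(3, 110, size=40)
wins = rng.binomial(games, 0.35)
ties = rng.binomial(games - wins, 0.4)
losses = games - wins - ties
teams = pd.DataFrame({
    "world_cup_games": games,
    "world_cup_wins": wins,
    "world_cup_ties": ties,
    "world_cup_losses": losses,
    "world_cup_win_pct": wins / games,
    "world_cup_tie_pct": ties / games,
    "world_cup_loss_pct": losses / games,
})

# Standardize so that counts and percentages contribute on the same scale.
X_scaled = StandardScaler().fit_transform(teams[features])

# Fit K-Means with k = 4 and store each team's cluster label.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
teams["cluster"] = kmeans.fit_predict(X_scaled)
print(teams["cluster"].value_counts())
```

Scaling matters here because the raw game counts range into the hundreds while the percentages stay between 0 and 1, so without it the counts would dominate the distance calculation.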

How does the model work?

  1. Randomly select 4 points and set them as the cluster centroids.
  2. Assign each point to its closest centroid using Euclidean distance.
  3. Calculate new centroids by taking the mean of the coordinates of the data points that belong to each cluster.
  4. Check whether the centroid values remain unchanged. If not, go to the next iteration, repeating steps 2 and 3.
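To make the four steps concrete, here is a small NumPy sketch of the loop. It is a simplified illustration, not the Sklearn implementation (which, among other things, uses a smarter initialization by default); the function name, the max_iter cap, and the toy data are all my own.

```python
import numpy as np


def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Plain K-Means loop following the four steps described above."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Toy data: two well-separated blobs in 2 dimensions.
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 2)),
    rng.normal(5.0, 0.1, size=(20, 2)),
])
labels, centroids = kmeans_sketch(points, k=2)
print(labels)
print(centroids)
```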

Click here to check out a complete walkthrough of K-Means clustering.

I saved the model's cluster assignment for each observation.

Evaluate the model

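To check how the model is performing, I looked at the silhouette score (silhouette_score in Sklearn, which indicates how well each object is matched to its own cluster and how poorly it is matched to neighboring clusters) and the inertia (the model's inertia_ attribute, the sum of squared distances from each point to its cluster centroid, which gives insight into how compact the clusters are).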
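Both metrics are available directly in Sklearn. The sketch below computes them on synthetic blobs generated with make_blobs, which stand in for the scaled team features. Looping over several values of k is my addition, a common way to sanity-check the choice of k = 4, and not something described in the original analysis.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled 7-feature team data.
X, _ = make_blobs(n_samples=80, centers=4, n_features=7, random_state=0)

scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    scores[k] = {
        # Within-cluster sum of squared distances: lower means tighter clusters.
        "inertia": model.inertia_,
        # Ranges from -1 to 1: higher means better-separated clusters.
        "silhouette": silhouette_score(X, model.labels_),
    }

for k, result in scores.items():
    print(k, round(result["inertia"], 1), round(result["silhouette"], 3))
```

Inertia tends to keep falling as k grows, so it is usually read together with the silhouette score rather than on its own.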

To visualize the clusters I’ve built a plot in Tableau:

From Tableau

Cluster number 0 has countries like Brazil, Argentina, Germany, and Italy (all have won world championships).

Cluster number 1 has countries like Spain, France, Uruguay, Mexico, and so forth; only some have won a championship.

Cluster number 2 has countries like USA, Cuba, Portugal, Poland, and so forth. These teams have several World Cup appearances but modest performance, and none of them has won a World Cup.

Cluster number 3 has countries like Australia, Canada, South Africa, Egypt, China, Ecuador, and so forth: teams that have had very few World Cup appearances and very bad results.

Answer the problem

The clusters are described in the image below:

My interpretation for the clusters:

Cluster 0: The highest-winning teams throughout history, with lots of World Cup games.

Cluster 1: Teams that have achieved success in some World Cups but have not done so consistently. A medium number of World Cup games.

Cluster 2: Teams that have gone to some World Cup tournaments but have not achieved significant results.

Cluster 3: Teams that have been to 1 or 2 World Cups and performed terribly.
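One way to arrive at this kind of interpretation is to average the features within each cluster. The sketch below does that with a Pandas groupby; the six teams and their numbers are invented purely for illustration and are not the real cluster statistics.

```python
import pandas as pd

# Invented example rows; real values would come from the clustered team table.
teams = pd.DataFrame({
    "team": ["A", "B", "C", "D", "E", "F"],
    "cluster": [0, 0, 1, 1, 2, 2],
    "world_cup_games": [100, 90, 50, 40, 6, 3],
    "world_cup_win_pct": [0.65, 0.60, 0.40, 0.35, 0.10, 0.00],
})

# Summarize each cluster by size, average games played and average win rate.
profile = (
    teams.groupby("cluster")
    .agg(
        n_teams=("team", "count"),
        avg_games=("world_cup_games", "mean"),
        avg_win_pct=("world_cup_win_pct", "mean"),
    )
    .sort_values("avg_win_pct", ascending=False)
)
print(profile)
```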

Recommendation

Use game scores as features, and consider the location of each World Cup by creating a feature that captures whether a team played on a neutral field or at home. Teams could also be clustered on their offensive and defensive power. Creating features that capture the recent success of each team is also a good idea.

Check out my code.

Please comment if there are any questions and thanks for reading!

Bernard Kurka

Passionate about science, technology, and business. I love to use technology to solve problems and help people.