Applying Machine Learning clustering on soccer World Cup results
Using Python and Tableau
The Data Science Process
In this post, I will walk you through the data science process to cluster soccer teams using unsupervised Machine Learning. Most data science projects can be tackled using the following 6 steps:
- Define the problem
- Gather the data
- Clean & Explore the data
- Model the data
- Evaluate the model
- Answer the problem
Define the problem
Suppose we are creating a video game about soccer World Cup teams. To build the game we have to grade the teams based on their performance, and we only have 4 categories to which any given team can be assigned.
In other words, the problem is to separate international soccer teams into groups with similar characteristics, also known as clustering.
Gather the data
The data was retrieved from Kaggle. It contains nearly 40,000 international soccer results from 1872 to 2018, including FIFA World Cup and friendly matches. The data was saved as a CSV file and imported into Python using the Pandas library.
Note: this data doesn't contain the 2018 FIFA World Cup results.
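As a rough sketch of the import step, the snippet below reads match results into a Pandas DataFrame. The column names are an assumption about the Kaggle file's layout, the sample rows are made up, and load_results is a hypothetical helper; with the real file you would pass its path instead of the inline sample.

```python
import io

import pandas as pd

# Made-up rows standing in for the Kaggle CSV; the column layout is assumed.
SAMPLE_CSV = """date,home_team,away_team,home_score,away_score,tournament
1950-06-24,Team A,Team B,2,0,FIFA World Cup
1950-06-25,Team C,Team D,1,1,FIFA World Cup
1951-03-10,Team A,Team C,3,2,Friendly
"""


def load_results(source):
    """Read match results from a file path or file-like object."""
    return pd.read_csv(source, parse_dates=["date"])


results = load_results(io.StringIO(SAMPLE_CSV))
print(results.shape)
print(results["tournament"].unique())
```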
Clean & Explore the data
I performed the following data cleaning and exploratory analysis steps:
- Checked for nulls and outliers
- Filtered only FIFA World Cup teams
- Calculated the total number of wins, losses, and ties
- Calculated win %, loss %, and tie % for each team
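The filtering and aggregation steps above could look roughly like this. It is only a sketch on made-up matches: the input column names are assumed, and percentages are stored as fractions between 0 and 1.

```python
import pandas as pd

# Made-up matches for illustration; input column names are an assumption.
matches = pd.DataFrame({
    "home_team": ["Team A", "Team B", "Team A", "Team C"],
    "away_team": ["Team B", "Team C", "Team C", "Team A"],
    "home_score": [2, 1, 0, 3],
    "away_score": [0, 1, 1, 3],
    "tournament": ["FIFA World Cup", "FIFA World Cup", "FIFA World Cup", "Friendly"],
})

# Keep only World Cup matches.
world_cup = matches[matches["tournament"] == "FIFA World Cup"]

# Reshape to one row per team per match, with goals for and against.
home = world_cup.rename(columns={
    "home_team": "team", "home_score": "goals_for", "away_score": "goals_against"})
away = world_cup.rename(columns={
    "away_team": "team", "away_score": "goals_for", "home_score": "goals_against"})
cols = ["team", "goals_for", "goals_against"]
per_team = pd.concat([home[cols], away[cols]], ignore_index=True)

per_team["win"] = per_team["goals_for"] > per_team["goals_against"]
per_team["tie"] = per_team["goals_for"] == per_team["goals_against"]
per_team["loss"] = per_team["goals_for"] < per_team["goals_against"]

# Totals per team, then each outcome as a share of games played.
stats = per_team.groupby("team").agg(
    world_cup_games=("win", "size"),
    world_cup_wins=("win", "sum"),
    world_cup_ties=("tie", "sum"),
    world_cup_losses=("loss", "sum"),
)
stats["world_cup_win_pct"] = stats["world_cup_wins"] / stats["world_cup_games"]
stats["world_cup_tie_pct"] = stats["world_cup_ties"] / stats["world_cup_games"]
stats["world_cup_loss_pct"] = stats["world_cup_losses"] / stats["world_cup_games"]
print(stats)
```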
Using Tableau I created some nice plots:
1) Winning percentage:
2) Map of winning percentage:
3) Tie percentage:
Plot highlights:
Plot nº 1: Brazil and Germany have winning percentages over 60%, while Saudi Arabia, Bulgaria, and North Korea have the lowest percentages, around 14%.
Plot nº 2: Most of the high percentage winners are in Europe and South America.
Plot nº 3: Ireland, Paraguay, and Tunisia are the countries with the highest tie percentage. Ireland has done a really good job here: its tie percentage is 67%, which is nearly twice the percentage of the second place.
Model the data
I selected 7 features for the clustering model:
features = ['world_cup_games', 'world_cup_wins', 'world_cup_ties', 'world_cup_losses', 'world_cup_win_pct', 'world_cup_tie_pct', 'world_cup_loss_pct']
I scaled the data using StandardScaler from scikit-learn and generated the clusters using K-Means from the same library, with the number of clusters set to k = 4. The model creates 4 clusters based on the distances between points in the 7-dimensional feature space.
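Here is a minimal sketch of the scaling and clustering step. The team table is randomly generated stand-in data, and n_init and random_state are my own choices for reproducibility rather than settings taken from the original code.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = ['world_cup_games', 'world_cup_wins', 'world_cup_ties',
            'world_cup_losses', 'world_cup_win_pct', 'world_cup_tie_pct',
            'world_cup_loss_pct']

# Randomly generated stand-in for the per-team table (not real results).
rng = np.random.default_rng(0)
games = rng.integers(3, 100, size=40)
wins = rng.binomial(games, 0.4)
ties = rng.binomial(games - wins, 0.4)
losses = games - wins - ties
teams = pd.DataFrame({
    'world_cup_games': games,
    'world_cup_wins': wins,
    'world_cup_ties': ties,
    'world_cup_losses': losses,
    'world_cup_win_pct': wins / games,
    'world_cup_tie_pct': ties / games,
    'world_cup_loss_pct': losses / games,
})

# Standardize each feature to zero mean and unit variance so that raw
# counts do not dominate the percentage columns in the distance.
X = StandardScaler().fit_transform(teams[features])

# Fit K-Means with k = 4 and store the cluster label of each team.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
teams['cluster'] = kmeans.fit_predict(X)
print(teams['cluster'].value_counts().sort_index())
```

Scaling matters here because K-Means relies on distances: without it, a count such as world_cup_games would outweigh the percentage features, which only range from 0 to 1.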
How does the model work?
- Randomly select 4 points and set them as cluster centroids.
- Assign each point to its closest centroid using Euclidean distance.
- Calculate new centroids by taking the mean of the coordinates of the data points that belong to each cluster.
- Check whether the centroid values remain unchanged; if not, go to the next iteration, repeating steps 2 and 3.
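The four steps above can be sketched in plain NumPy as follows. This is an illustration of the basic algorithm on made-up 2-D points, not scikit-learn's implementation (which, by default, uses a smarter k-means++ initialization); the function name and parameters are my own.

```python
import numpy as np


def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-Means: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid (Euclidean).
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j)
            else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Made-up demo data: four blobs of 25 points each.
rng = np.random.default_rng(1)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]], dtype=float)
points = np.vstack([c + rng.normal(scale=0.5, size=(25, 2)) for c in centers])

labels, centroids = kmeans(points, k=4)
print(np.bincount(labels, minlength=4))
```

Because the starting centroids are random, the result can depend on the seed, which is one reason libraries usually rerun the algorithm several times and keep the best run.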
Click here to check out a complete walkthrough of K-Means clustering.
I saved the model's cluster assignment for each observation.
Evaluate the model
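To check how the model is performing, I looked at the silhouette score (computed with silhouette_score; it indicates whether each object is well matched to its own cluster and poorly matched to neighboring clusters) and the inertia (the fitted model's inertia_ attribute, the within-cluster sum of squared distances, which gives insight into cluster density).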
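Both metrics can be read off a fitted model in a couple of lines. The sketch below uses synthetic data from make_blobs as a stand-in for the scaled team features, so the printed numbers say nothing about the real clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 7 scaled team features.
X_raw, _ = make_blobs(n_samples=80, centers=4, n_features=7, random_state=0)
X = StandardScaler().fit_transform(X_raw)

model = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# Silhouette ranges from -1 to 1; values near 1 mean points sit well
# inside their own cluster and far from the neighboring ones.
silhouette = silhouette_score(X, model.labels_)

# Inertia is the sum of squared distances from each point to its
# assigned centroid; lower means tighter clusters.
inertia = model.inertia_

print(f"silhouette: {silhouette:.3f}")
print(f"inertia: {inertia:.1f}")
```

Inertia always shrinks as k grows, so it is mostly useful for comparing models with the same k or for spotting where adding clusters stops helping.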
To visualize the clusters I’ve built a plot in Tableau:
Cluster 0 has countries like Brazil, Argentina, Germany, and Italy (all have won world championships).
Cluster 1 has countries like Spain, France, Uruguay, and Mexico; only some have won a championship.
Cluster 2 has countries like the USA, Cuba, Portugal, and Poland. These teams have several World Cup appearances but a mild performance, and none of them has won a World Cup.
Cluster 3 has countries like Australia, Canada, South Africa, Egypt, China, and Ecuador: teams that have had very few World Cup appearances and very bad results.
Answer the problem
The clusters are described in the image below:
My interpretation for the clusters:
Cluster 0: Highest winning teams throughout history with lots of world cup games.
Cluster 1: Teams that have achieved success in some World Cups but have not done it consistently. A medium number of World Cup games.
Cluster 2: Teams that have gone to some World Cup tournaments but have not achieved significant results.
Cluster 3: Teams that have been to 1 or 2 world cups and had terrible performance.
Recommendation
Use game scores as features, and consider the location of each World Cup by creating a feature that captures whether a team played on a neutral field or at home. Teams could also be clustered on their offensive and defensive power. Creating features that capture the recent success of each team is also a good idea.
Check out my code.
Please comment if there are any questions and thanks for reading!