# Clustering MLS Forwards Using K-Means

# Introduction

Major League Soccer (MLS) is the highest level of professional soccer in the United States and Canada. Fans often have favorite players and teams, and they may even know a lot about a variety of them. However, with over 700 players in a season, it’s difficult to track how each player performs, especially over the course of multiple years. In order to accomplish this task, I used k-means clustering to separate players into groups that are simpler and quicker to analyze. I focused my analysis on MLS forwards, but my techniques can be applied to other positions and even other leagues or sports. Both fans and team managers, of any league and for any sport, may find the techniques I used interesting or helpful for understanding how players perform.

# Data Collection

To collect player statistics, I downloaded the CSV file from https://www.kaggle.com/josephvm/major-league-soccer-dataset?select=all_players.csv, which was scraped from https://www.mlssoccer.com/stats/. The CSV file contains 15,076 rows and 28 columns, with one row per player per year played from 1996 to 2019, split further into regular season and postseason. The columns most useful for my analysis were Player, POS (position), MINS (minutes played), G (goals), A (assists), Year, and Season (regular or post).

I coded in Python in a Jupyter Notebook, reading the CSV file with the **.read_csv()** function from the pandas library.
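As a rough sketch of the loading step, the snippet below reads a tiny in-memory stand-in for all_players.csv. The rows are invented for illustration, and the Season labels "reg" and "post" are assumptions about how the file encodes regular season and postseason.

```python
import io

import pandas as pd

# Tiny stand-in for all_players.csv (the real file has 28 columns).
# These rows are made-up illustration values, not real statistics.
sample_csv = io.StringIO(
    "Player,POS,MINS,G,A,Year,Season\n"
    "Player One,F,2500,15,6,2018,reg\n"
    "Player One,F,180,1,0,2018,post\n"
    "Player Two,M,1900,3,7,2018,reg\n"
)

# With the downloaded file this would be pd.read_csv("all_players.csv").
df = pd.read_csv(sample_csv)
print(df.shape)
print(df.dtypes)
```

Checking the shape and dtypes right after loading is a quick way to confirm that the numeric columns were parsed as numbers rather than strings.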

I also retrieved player salaries by downloading the CSV file from https://data.world/dataremixed/mls-player-salaries-2010-2018/workspace/file?filename=MLS_Salaries.csv. The CSV file contained both total compensation and base salaries for each player for each year they played from 2007 to 2018.

I read the CSV file using the **.read_csv()** function again.

# Data Processing

To combine the two datasets, I first created Player and Year columns in the salaries DataFrame (salaries_df) so that it shared columns and values with the player statistics DataFrame (df).
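The exact layout of MLS_Salaries.csv isn't shown here, so the following is only a sketch under assumptions: the raw column names `first_name`, `last_name`, `season`, and `total_compensation` are hypothetical, and the real file may need different handling.

```python
import pandas as pd

# Hypothetical raw salary columns; the real MLS_Salaries.csv may differ.
salaries_df = pd.DataFrame({
    "first_name": ["Player ", "Player"],
    "last_name": ["One", " Two"],
    "season": ["2017", "2018"],
    "total_compensation": [150000.0, 90000.0],
})

# Build a full-name key that matches the "Player" column in df.
salaries_df["Player"] = (
    salaries_df["first_name"].str.strip()
    + " "
    + salaries_df["last_name"].str.strip()
)

# Store the year as an integer so it compares equal to df["Year"].
salaries_df["Year"] = salaries_df["season"].astype(int)
```

Stripping whitespace and matching dtypes matters because the later merge compares key values exactly.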

I filtered df to only include regular seasons because post seasons often have different statistics. This makes the analysis simpler, but post seasons could be included in future analysis.

I did an inner merge on the Player and Year columns of the two DataFrames using the pandas **.merge()** function. An inner merge keeps only the Player and Year combinations that appear in both DataFrames, so the merged result, all_info_df, only includes the years 2007–2018, and some players are missing because their names don’t appear in both DataFrames.
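The filtering and merging steps could look roughly like this. The rows are invented, and "reg" is an assumed label for regular-season rows.

```python
import pandas as pd

# Made-up player statistics.
df = pd.DataFrame({
    "Player": ["Player One", "Player One", "Player Two", "Player Three"],
    "Year": [2018, 2018, 2018, 2005],
    "Season": ["reg", "post", "reg", "reg"],
    "G": [15, 1, 3, 4],
    "A": [6, 0, 7, 2],
})

# Made-up salaries; "total_compensation" is an assumed column name.
salaries_df = pd.DataFrame({
    "Player": ["Player One", "Player Two", "Player Four"],
    "Year": [2018, 2018, 2018],
    "total_compensation": [150000.0, 90000.0, 70000.0],
})

# Keep regular-season rows only.
regular_df = df[df["Season"] == "reg"]

# Inner merge: keep only (Player, Year) pairs present in both frames.
all_info_df = regular_df.merge(salaries_df, on=["Player", "Year"], how="inner")
print(all_info_df)
```

Here Player Three (no salary row) and Player Four (no statistics row) both drop out, which mirrors how names that differ between the two sources are lost in an inner merge.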

Next, I thought of ways to measure offensive performance, because I felt that the data I have measures offense better than defense, and I wanted to limit the number of dimensions for the k-means clustering to reduce complexity. I ultimately created a column called “Offense Score” that adds twice the number of goals (‘G’) to the number of assists (‘A’), that is, Offense Score = 2 × G + A. This weights goals at double the value of assists because I consider them more important for offense in soccer. I chose this calculation because I consider goals and assists to be the statistics most closely tied to offensive performance, and combining them into one column reduces the number of dimensions for the clustering. Shots and shots on goal, by contrast, depend more on context: taking more shots isn’t necessarily good or bad. Depending on who wants to understand players’ offensive performance, the weight on goals could be different, or other columns could be combined in other ways.
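A minimal sketch of the Offense Score calculation on made-up rows. The `goal_weight` variable is a hypothetical name, added only to show how the weighting could be changed.

```python
import pandas as pd

# Made-up goals and assists for three players.
all_info_df = pd.DataFrame({
    "Player": ["Player One", "Player Two", "Player Three"],
    "G": [15, 3, 0],
    "A": [6, 7, 2],
})

# Goals count double; change goal_weight to try a different weighting.
goal_weight = 2
all_info_df["Offense Score"] = goal_weight * all_info_df["G"] + all_info_df["A"]
print(all_info_df)
```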

I then decided whether to include midfielders, who also contribute to offense in soccer. After grouping by position with the pandas **.groupby()** function and viewing the statistics for each position with the pandas **.describe()** function, I noticed that midfielders have much lower offense scores, and not simply because they record many more assists than goals.

Therefore, I only included forwards for this analysis, which also makes the most sense for comparing salaries because offense is the primary focus for forwards but not for midfielders.
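The position comparison and the forwards-only filter might be sketched as follows. The scores are invented, and the position codes "F", "M", and "D" are assumptions about how POS is encoded.

```python
import pandas as pd

# Made-up rows; "F", "M", and "D" are assumed position codes.
all_info_df = pd.DataFrame({
    "POS": ["F", "F", "M", "M", "D"],
    "Offense Score": [36, 20, 13, 5, 2],
})

# Summary statistics (count, mean, std, quartiles) of the score per position.
by_position = all_info_df.groupby("POS")["Offense Score"].describe()
print(by_position[["count", "mean", "max"]])

# Keep forwards only; .copy() makes an independent frame for adding columns later.
forwards_df = all_info_df[all_info_df["POS"] == "F"].copy()
```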

# Initial Data Visualizations & Analysis

Before performing k-means clustering, I checked that the columns I was interested in were appropriate to cluster. In addition to the “Offense Score” column that I created in my data processing, I intended to use the “MINS” column, which contains the total number of minutes each player played in each year. I thought that high minutes would reflect that the player is important to the team, and that clustering by minutes would let me compare offense scores among players with fewer minutes.

I grouped by “Year” for the “Offense Score” column to check that the offense score doesn’t differ much across years.

While the offense score was a bit higher in more recent years, minutes were also a bit higher in more recent years, so including all these years was reasonable for comparing these two columns.

Using matplotlib’s pyplot module, I created a scatter plot of offense score against minutes to check that the two were positively correlated, but not too strongly.

Because they follow this pattern, they are good data to cluster. If they weren’t positively correlated, then the comparison of the two would be more complicated. If they were too strongly correlated, then there wouldn’t be any need for clustering because the highest minutes would have the highest offense scores, and the lowest minutes would have the lowest offense scores.
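The check described here was visual. As a complementary sketch, not something the analysis above used, a Pearson correlation coefficient puts a number on "positively correlated but not too much". The values below are made up.

```python
import pandas as pd

# Made-up minutes and offense scores for five forwards.
forwards_df = pd.DataFrame({
    "MINS": [300, 900, 1500, 2100, 2700],
    "Offense Score": [2, 10, 6, 20, 15],
})

# Pearson correlation: +1 is a perfect positive linear relationship, 0 is none.
r = forwards_df["MINS"].corr(forwards_df["Offense Score"])
print(f"Pearson r = {r:.2f}")
```

A value well above 0 but clearly below 1 would match the pattern the scatter plot is meant to reveal.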

I then checked whether total compensation differs across years, and there was indeed a large difference: the mean and standard deviation were each about 5 times higher in 2018 than in 2007. There may be a way to standardize the total compensation for each year, but for now, I’ve excluded it from the clustering.
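One possible way to standardize compensation per year, which was not part of this analysis, is a within-year z-score: subtract each year's mean and divide by that year's standard deviation. The column name `total_compensation` and the figures below are assumptions for illustration.

```python
import pandas as pd

# Made-up compensation figures for two seasons.
salaries = pd.DataFrame({
    "Year": [2007, 2007, 2007, 2018, 2018, 2018],
    "total_compensation": [50000.0, 100000.0, 150000.0,
                           250000.0, 500000.0, 750000.0],
})

# Compute each row's z-score relative to its own year.
by_year = salaries.groupby("Year")["total_compensation"]
salaries["comp_z"] = (
    salaries["total_compensation"] - by_year.transform("mean")
) / by_year.transform("std")
print(salaries)
```

After this transformation, a player one standard deviation above the 2007 average and a player one standard deviation above the 2018 average get the same value, even though their raw pay differs fivefold.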

I still checked how correlated total compensation was to offense score for forwards because forwards should be paid more if they are scoring a lot of goals or getting a lot of assists for their team.

I created the scatter plot for total compensation vs offense score for 2007–2018, which showed no clear correlation.

I also created a scatter plot for only 2018 to ignore the differences in total compensation for different years.

The outliers of over $1 million made it difficult to see whether there was a correlation, so I excluded those, and there was still no clear correlation.

The fact that there was no clear correlation means that my analysis might be helpful for team managers keeping or acquiring players. There are factors for salary other than offense score, like how the forward’s offense score compares to how the whole team performs and how many times the forward gains/loses possession of the soccer ball. However, such a variety of offense scores for total compensations indicates that there are likely forwards that are worth paying more for or acquiring from other teams. Although the salary isn’t part of my k-means clustering, it’s information that’s important to show after creating the clusters.

# Unweighted K-Means

Without weights for the k-means clustering of Offense Score and MINS, the clusters are based primarily on MINS because differences in MINS are much larger than differences in Offense Score: MINS ranges from 0 to about 3000, while Offense Score only ranges from 0 to about 70. This was still a useful clustering because it allowed me to find the forwards that have the highest offense scores relative to their minutes.

I first created a dictionary containing all of the values of the columns I wanted to cluster, and I made the dictionary keys numbers, which would require less code if I were to add more dimensions for the clustering.

```python
cols = ['MINS', 'Offense Score']
data_dict = {}
n = 0
for col in cols:
    data_dict[n] = forwards_df[col].to_list()
    n += 1
```

I then matched each offense score value to its minutes value by zipping the lists of values from the dictionary. I fit the sklearn library’s **KMeans()** model on these values for every number of clusters from 2 to 50.

Using the elbow method, the optimal number of clusters is 7 because that’s where the inertia values (the sum of squared distances of samples to their closest cluster center) begin decreasing in a linear fashion.
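The elbow search can be sketched as below on synthetic data. The random points, the reduced range of 2 to 10 clusters, and the `n_init` and `random_state` settings are choices made for this illustration rather than details from the analysis.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-ins for the MINS and Offense Score lists.
rng = np.random.default_rng(0)
mins = rng.uniform(0, 3000, size=200)
offense = np.clip(mins / 100 + rng.normal(0, 5, size=200), 0, None)

# Pair each minutes value with its offense score.
points = list(zip(mins, offense))

# Fit one model per cluster count and record its inertia.
inertias = {}
for k in range(2, 11):  # the analysis above scans 2 to 50
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia))
```

Plotting these inertia values against k and looking for the point where the curve flattens is the usual way to read off the elbow.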

I therefore used the KMeans() function again to create 7 clusters, and I assigned their labels in forwards_df so that the players could be looked up to see all their statistics.
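A sketch of the final fit and label assignment, again on synthetic data. The column name "Cluster" and the `random_state` value are assumptions for this example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic forwards table with the two clustering columns.
rng = np.random.default_rng(1)
forwards_df = pd.DataFrame({
    "MINS": rng.integers(0, 3000, size=100),
    "Offense Score": rng.integers(0, 70, size=100),
})

# Fit 7 clusters and store each row's label next to its statistics.
kmeans = KMeans(n_clusters=7, n_init=10, random_state=0)
forwards_df["Cluster"] = kmeans.fit_predict(forwards_df[["MINS", "Offense Score"]])

print(forwards_df["Cluster"].value_counts().sort_index())
```

Keeping the label as a DataFrame column means any later filter or sort can use it alongside the other statistics.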

I created a scatter plot to visualize what the clusters look like.

Below is an example of how I was able to find the forwards with the highest offense scores relative to their minutes.

Cluster 2 (green) had the highest minutes and offense scores, and those with the top 10 offense scores are in fact MLS forwards that are famous for their skills. Any columns from the DataFrame could be included, but I felt that these would best reveal if there were any oddities regarding the minutes and offense score.
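The lookup for one cluster could be sketched like this. The players and values are placeholders, and "Cluster" is an assumed name for the label column.

```python
import pandas as pd

# Placeholder rows; labels mimic the output of a k-means fit.
forwards_df = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E"],
    "MINS": [2800, 2700, 2900, 1200, 2600],
    "Offense Score": [50, 62, 41, 30, 55],
    "Cluster": [2, 2, 2, 0, 2],
})

# Rows in cluster 2 with the highest offense scores (use 10 on the real data).
top = (
    forwards_df[forwards_df["Cluster"] == 2]
    .nlargest(3, "Offense Score")[["Player", "MINS", "Offense Score"]]
)
print(top)
```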

Similarly, I was able to find the forwards with the highest offense scores who played fewer minutes.

Cluster 0 (blue) is the cluster in the middle, with offense scores about as high as even the average for cluster 2. These are players to look out for because, if they played more, they could contribute to even more goals. Unlike cluster 2, where the forwards started (GS) almost all of the games they played (GP), some of these forwards only started about half of their games, so perhaps they should be considered for starting more often.

I personally don’t know most of these players, so after finding the forwards with the highest offense scores relative to their minutes, I filtered forwards_df by their names to see their historical data. This time I included the salaries, to see how they were paid relative to their performance, and the cluster labels, to more easily see whether their minutes changed.

This kind of filtering is especially helpful for team managers because they can see how players perform and are paid over multiple years. For example, Alan Gordon, who had the highest offense score in cluster 0, only had such a high offense score for that one year. The next year, he had a much lower offense score despite starting more games and playing more minutes.

Robbie Keane, on the other hand, who was also in cluster 0, had consistently high offense scores without an especially high salary, so he would be a forward to really look into. Because he scored so many goals, he is already very well known, but for forwards with fewer goals, such as those in another cluster, it would be more difficult for fans and team managers to realize how good those forwards are without this kind of analysis.

# Weighted K-Means

In order to find the forwards with the highest combinations of minutes and offense scores, I had to normalize the minutes and offense scores so that the minutes weren’t so much higher than the offense scores.

To normalize, I created a function to apply to the lists in the dictionary that I created for the unweighted k-means. The function divides each value in a list by the norm of the whole list, computed with numpy’s **linalg.norm()** function (the Euclidean norm by default).
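A minimal sketch of such a normalization function. The name `normalize`, the zero-norm guard, and the toy values are assumptions for illustration.

```python
import numpy as np

def normalize(values):
    """Divide every value by the Euclidean norm of the whole list."""
    arr = np.asarray(values, dtype=float)
    norm = np.linalg.norm(arr)
    # Guard against an all-zero list, which would otherwise divide by zero.
    return arr / norm if norm > 0 else arr

# Two toy columns on very different scales.
data_dict = {0: [300, 400], 1: [6, 8]}
normalized_dict = {key: normalize(vals) for key, vals in data_dict.items()}
print(normalized_dict)
```

After normalization, both toy columns hold the same values even though the raw numbers differ by a factor of 50, so neither dimension dominates the distance calculation in k-means.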

The optimal number of clusters based on the elbow method was again 7, so I used the same code that I used for the unweighted k-means to produce the clusters and a scatter plot that visualizes the clusters.

For these weighted clusters, I considered the ones that were higher and more toward the right to be the most important forwards because they have the highest combination of minutes and offense score. This would be more important to fans than to team managers because these forwards aren’t necessarily the best or the ones with the highest potential; they are just the ones that are most likely to be well known for contributing to goals and playing a lot.

To see which players belong to each cluster, and for how many years, I used **Counter()** from the collections module on the players in each label.
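Counting player-seasons per cluster might look like the sketch below. The rows are placeholders, and "Cluster" is an assumed name for the label column.

```python
from collections import Counter

import pandas as pd

# Placeholder rows: one row per player per year, with a cluster label.
forwards_df = pd.DataFrame({
    "Player": ["A", "A", "B", "A", "B"],
    "Year": [2014, 2015, 2015, 2016, 2016],
    "Cluster": [6, 6, 6, 3, 3],
})

# For each label, count how many seasons each player spent in that cluster.
counts_by_cluster = {
    label: Counter(group["Player"])
    for label, group in forwards_df.groupby("Cluster")
}

for label, counts in counts_by_cluster.items():
    print(label, counts.most_common())
```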

Cluster 6 (pink) had the highest combination of minutes and offense scores, and as intended, it contained especially famous forwards.

Cluster 3 (red) had the second highest combination of minutes and offense scores, and most of its forwards were again famous, but it also contained forwards that had a particularly great year, like Alan Gordon from my unweighted k-means analysis.

With these weighted clusters, I was also able to analyze the forwards that had the highest offense scores relative to their minutes played. Unlike the unweighted clusters, which primarily grouped by minutes and therefore allowed me to always sort by offense score, for the weighted clusters I had to look at how the clusters were organized.

Cluster 6 was not widely spread over minutes, so all of the forwards in the cluster could be considered important forwards. Meanwhile, clusters 3 (red), 1 (orange), 5 (brown), and 0 (blue), which had the next highest combinations of minutes and offense scores, had a large spread over minutes with a lot of overlap between them. Therefore, in those clusters, the forwards with the lowest minutes (farthest to the left within their cluster) had typically maintained offense scores similar to others in the cluster despite playing less.

So, for each of those clusters, instead of looking at the highest offense scores, I looked at the lowest minutes. For example, below are the results after filtering cluster 3. Many of the rows are the same as those of the unweighted k-means clustering’s highest offense scores; it’s just that they belong to different clusters.

# Conclusion

K-means clustering was very helpful for finding out and understanding which MLS forwards performed the best or were the most important over the years that they played. The clusters allowed me to do this without having to look at and compare the statistics of over 1,000 records with several columns. The main limitation of my analysis is that it doesn’t factor in how well the club performed. For example, a forward could be one of the best in MLS but not have that high of an offense score due to their team performing poorly. This would be reflected by statistics like the team’s total goals and wins, but the player statistics CSV contains only the MLS team (Club column) that the player last played for, even if the player played for different teams in previous years. Therefore, I would need to extract the data myself or do some mapping to correctly provide the players’ teams’ performances.

Overall, however, I think that my analysis was successful; it mainly just had the potential to miss some forwards that perform well despite their limiting circumstances. It could be expanded on by clustering different columns or by including additional columns in the DataFrame filters after the clusters are made. A similar process can be used for other positions, such as measuring the performance of goalkeepers by how many shots they save compared to the number of goals they concede. It can also be applied to more recent years, for more relevance to current players, or to older years for understanding historical performance. Lastly, my techniques can be used for other leagues or sports, as long as there is data available for them.