# Data science applied to sport: identifying players with similar performances over time

**Introduction**

I have always wanted to apply data science to sport, and to make a difference by doing so. Basketball is an amazing sport in which isolated episodes rarely decide the result: schemes and tactics usually contribute directly to the final score. That makes it a good arena for experimenting with some data modelling!

**Summary**

In this tutorial, we will see how data science can be applied to identify **similar players over time**. We won’t focus much on the first part, related to retrieving and parsing the data; instead, we will mainly consider some interesting investigations which can be performed on preprocessed basketball data. We will go through some basic descriptive explorations and then, most importantly, we will try to understand how basketball players **with similar performances over time** can be identified.

**Motivation**

This analysis might be useful at different levels. Basketball coaches might want to understand the characteristics of the opposing team’s players to study possible countermoves. A hiring manager might want to identify players with specific performances to improve the team’s skills during the next transfer window. An athlete could be interested in looking at his own performances to understand the critical moments in which his efficiency drops. Finally, even supporters or gamblers might want to use this information to support their predictions of the match outcome. Hence, identifying players with similar performances could bring **concrete benefits** to the multiple actors involved in the game.

**Downloading, parsing and preprocessing the data**

- The first step was to scrape the data. As in every good old-fashioned data science process, this is surely the most annoying and demanding step. We won’t go through the details here, but the data was scraped from an Italian reference basketball website reporting **summaries and a play-by-play** of match events starting from 2016.
- The next step was to parse the data. This operation took most of my time and a good dose of patience (together with tears and desperate cries!) before I reached a decent result.
- Finally, the different data sources were merged to produce a comprehensive table containing all the available information.

Here is a small summary of the dataset obtained after these steps. *game_details* is a Pandas dataframe reporting the information of each game at event level.

The first list of elements describes the columns available in the dataframe. Specifically, the following are included: the date and place of the match, both the final score and the scores after each quarter, the list of referees, and the player who performed the event at the specific time.

Next, a list of all the teams which played since 2016 in the major Italian Basketball league is reported.

Then, we have some basic statistics of the merged dataframe. There are **478 games**, corresponding to **~220,000 events**. Unfortunately, since not all the games on the reference web portal included the play-by-play information, some matches are missing from the dataset. However, this is still a considerable number of events. Moreover, the games of the last season (2018–2019) are all included.

Finally, the list of all the events is presented. The last empty string refers to team events which the web portal did not assign to any of the players.

**Descriptive statistics**

Once the data is downloaded, parsed and processed, it is time to perform some exploratory analyses on the matches! In this regard, there are several hypotheses which might be tested and explorations which might be carried out. In the following, we just consider some of them.

We might be eager to **compare home vs away team performances over time**. As we can see, the overall and per-quarter scoring distributions are quite similar and it seems, as expected, that *teams tend to score more when playing at home*.
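A comparison of this kind can be sketched with a few lines of Pandas. The snippet below is only an illustration: the table `games`, its column names and all the scores are invented, and it compares average final and last-quarter scores rather than the full distributions.

```python
import pandas as pd

# Toy game-level table: column names and scores are invented for illustration.
games = pd.DataFrame({
    "home_score": [85, 78, 91, 70],
    "away_score": [80, 82, 77, 69],
    "home_q4": [22, 18, 25, 15],
    "away_q4": [20, 21, 19, 16],
})

# Average points scored at home and away, for the final score and the last quarter.
summary = pd.DataFrame(
    {
        "home": games[["home_score", "home_q4"]].mean().to_numpy(),
        "away": games[["away_score", "away_q4"]].mean().to_numpy(),
    },
    index=["final", "q4"],
)
summary["home_minus_away"] = summary["home"] - summary["away"]
print(summary)
```

A positive `home_minus_away` value would point in the same direction as the home advantage described above; on real data a plot of the two distributions is more informative than a single mean.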

Next, we investigate how specific teams behave with respect to the same patterns. Look, for example, at what happens when we compare the home and away performances of *a|x armani exchange milano* with those of the rest of the teams.

From the plots, it seems that **the team’s performances when playing at home** (*yellow line*) **clearly exceed those of the rest of the teams** (*blue line*). This especially occurs during the last quarter (last plot). In other words, *a|x armani exchange milano* appears to have a decisive impact on the game under favourable conditions (i.e. playing at home), even when players get tired (last quarter). This **might be one of the key reasons why they reached the highest positions** during the last two seasons.

Another simple analysis is to visualize the evolution of team events over time. For example, we can **explore how many substitutions are performed during the game**. Here are the results considering all the teams together:

There is a clear pattern in the plot. During the first quarter there is a steep increase starting at around minute 5 of the game. Then, during the second quarter, the trend stays high, and we have a peak around the middle of the match, when coaches probably need the team to be more dynamic. Another increase occurs during the third quarter, followed by a decrease during the last part of the game, when **coaches probably do not want to affect the balance of the team too much, since this is the riskiest part of the match**. Also, notice the increase in substitutions as we approach the final minutes of each quarter, when players get tired.
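Counting substitutions over time boils down to a filter followed by a group-by. Here is a minimal sketch on a toy event table; the column names `minute` and `event`, the label `"substitution"` and the data are assumptions, not the real schema of *game_details*.

```python
import pandas as pd

# Toy play-by-play table: column names, labels and values are invented.
events = pd.DataFrame({
    "minute": [4, 5, 5, 9, 14, 19, 19, 20, 33],
    "event": ["rebound", "substitution", "substitution", "substitution",
              "2pt made", "substitution", "substitution", "substitution",
              "substitution"],
})

subs = events[events["event"] == "substitution"]

# Substitutions per game minute (0-39), keeping minutes with no substitution.
subs_per_minute = subs.groupby("minute").size().reindex(range(40), fill_value=0)

# The same counts aggregated per quarter (10 minutes each).
subs_per_quarter = subs_per_minute.groupby(subs_per_minute.index // 10).sum()
print(subs_per_quarter)
```

Plotting `subs_per_minute` as a line or bar chart would give a view comparable to the one discussed above.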

**Identify similar players**

We now focus on identifying players **having similar performances over time**. The basic idea is to first derive the statistical distributions of player performances over time. Then, we define a **distance metric** among these probability distributions. Finally, using this metric, we evaluate the distance among all the player distributions and perform a hierarchical clustering to identify similar players.

We now proceed to detail the exact sequence of steps which were undertaken, also sharing the ideas and some of the Python code involved in the process.

*Filtering the data considering only the matches related to last season*. This is mainly because not all the games of the previous seasons had the play-by-play of the match, so the information related to different players might be **incomplete and unbalanced** (with some players being more represented than others).

*Filtering the set of positive events*, such as a rebound or a strike. The player performance distributions were constructed considering these events only.
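The two filters can be sketched as follows. The column names, the season label format and the set `POSITIVE_EVENTS` are hypothetical; the actual list of positive events used in the analysis is not given here.

```python
import pandas as pd

# Hypothetical labels: the real set of positive events is an assumption.
POSITIVE_EVENTS = {"rebound", "2pt made", "3pt made", "assist", "steal"}

# Toy event table with invented column names and values.
game_details = pd.DataFrame({
    "season": ["2017-2018", "2018-2019", "2018-2019", "2018-2019"],
    "player": ["player_a", "player_a", "player_b", "player_b"],
    "event": ["rebound", "turnover", "assist", "3pt made"],
})

# Keep only last season, then only the positive events.
last_season = game_details[game_details["season"] == "2018-2019"]
positive = last_season[last_season["event"].isin(POSITIVE_EVENTS)]
print(positive)
```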

*Creating player distributions over time*. This step is accomplished by simply **considering the histograms of the number of positive events the players have along the matches**. To visualize these distributions, we can use whisker plots. For example, below we plot the whisker chart for the players of *germani basket brescia*. As we can see, Vitali Luca has, for instance, a much more spread-out distribution than Caroli Matteo.
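As a rough sketch, the per-player histograms can be built with NumPy from the minutes at which positive events occurred. The player names, the `minute` column and the values below are invented, and one-minute bins over a 40-minute game are an assumption.

```python
import numpy as np
import pandas as pd

# Toy table of positive events: names, column names and minutes are invented.
positive = pd.DataFrame({
    "player": ["player_a"] * 5 + ["player_b"] * 3,
    "minute": [2, 11, 18, 27, 38, 21, 22, 24],
})

# 40 one-minute bins covering a regular game (assumption: no overtime).
bins = np.arange(0, 41)

# One histogram of positive events per player.
histograms = {
    player: np.histogram(group["minute"], bins=bins)[0]
    for player, group in positive.groupby("player")
}

# Simple spread statistics, i.e. what a whisker plot summarises visually.
spread = positive.groupby("player")["minute"].agg(["min", "max", "std"])
print(spread)
```

In this toy data `player_a` is spread over the whole game while `player_b` is concentrated in a few minutes, which is the kind of difference the whisker chart makes visible.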

*Creating player distributions over time removing substitution times*. The previous approach has a pitfall: **we are not taking into account the time during which players are substituted**. But in our analysis we are interested in the **performances the players have while being in the game**, not outside of it. Therefore, we should remove the times in which athletes are not in the match. After this transformation, some changes can be appreciated:

By comparing the two plots we can notice **critical changes** in the distributions, which drastically modify our initial estimate. Note that the minimum minute for each player is now at minute zero, as expected.
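One possible way to remove the time spent on the bench is to map each game-clock minute to the minutes the player has actually played so far. The sketch below assumes that, for each player and game, we already have a sorted list of stints `(start, end)` on the court; both the helper `minutes_played` and this representation are hypothetical.

```python
def minutes_played(event_minute, stints):
    """Map a game-clock minute to the minutes actually played so far.

    `stints` is a sorted list of (start, end) intervals, in game minutes,
    during which the player was on the court (hypothetical representation).
    """
    played = 0.0
    for start, end in stints:
        if event_minute >= end:
            # The whole stint happened before the event.
            played += end - start
        elif event_minute >= start:
            # The event happened during this stint.
            return played + (event_minute - start)
        else:
            # The event precedes this stint: nothing more to add.
            break
    return played


stints = [(5, 10), (20, 30)]
print(minutes_played(7, stints))   # 2.0
print(minutes_played(25, stints))  # 10.0
```

With this mapping, a player who enters the game at minute 5 has his first possible event at played minute zero, consistent with the observation above.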

*Filtering players and creating time bins*. Both these important steps are carried out to reduce noise, **making the cluster more stable**. The former consists of selecting players having a critical mass of positive events (i.e. filtering out players without enough positive events in the dataframe). The latter groups the distribution by binning on match quarters, so that instead of considering the number of events per minute we consider the number of events per quarter. This allows **an easier, more stable and reliable computation** of the pairwise distribution distances.

*Defining a distance among player time distributions*. We are ready to define a measure of similarity. Given two players, it returns a non-negative number which **indicates the distance among their distributions**: the smaller the value, the more similar the players. There are many options in this regard, but one good choice is the Wasserstein distance. Scipy has a handy built-in method for the computation.
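Here is a minimal sketch of these two steps using `scipy.stats.wasserstein_distance`. The threshold `MIN_EVENTS`, the helper names and the toy minutes are assumptions; the quarters are encoded as the values 1 to 4 and the event counts are passed as weights.

```python
import numpy as np
from scipy.stats import wasserstein_distance

QUARTERS = np.array([1, 2, 3, 4])
MIN_EVENTS = 3  # hypothetical threshold for the "critical mass" of events


def quarter_counts(minutes):
    """Count events per quarter (minutes past 39 are folded into the last quarter)."""
    quarter_index = np.minimum(np.asarray(minutes, dtype=int) // 10, 3)
    return np.bincount(quarter_index, minlength=4)


def player_distance(counts_a, counts_b):
    """Wasserstein distance between two per-quarter event distributions."""
    return wasserstein_distance(QUARTERS, QUARTERS,
                                u_weights=counts_a, v_weights=counts_b)


# Toy data: minutes of positive events for three invented players.
minutes_by_player = {
    "player_a": [1, 12, 25, 38],
    "player_b": [0, 3, 5, 9],
    "player_c": [17],
}

# Drop players without enough events, then bin the others by quarter.
counts = {
    player: quarter_counts(minutes)
    for player, minutes in minutes_by_player.items()
    if len(minutes) >= MIN_EVENTS
}

d = player_distance(counts["player_a"], counts["player_b"])
print(d)
```

The weights are normalised internally by Scipy, so two players with different numbers of events can still be compared.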

*Computing the distance among all the distributions*. **Given each pair of distributions, we can compute their distance according to this metric**. As a result, we obtain an upper triangular matrix (the distance is symmetric, so one value per pair is enough), which can be flattened and provided to the hierarchical clustering function of Scipy to build the cluster.
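The pairwise computation can be sketched as below. The per-quarter counts are invented; `itertools.combinations` visits the pairs in the same order Scipy uses for its condensed (flattened upper triangular) distance matrices, and `squareform` is used only to display the full symmetric matrix.

```python
from itertools import combinations

import numpy as np
from scipy.spatial.distance import squareform
from scipy.stats import wasserstein_distance

quarters = np.array([1, 2, 3, 4])

# Toy per-quarter counts of positive events for three invented players.
counts = {
    "player_a": [10, 10, 10, 10],
    "player_b": [12, 9, 11, 8],
    "player_c": [40, 0, 0, 0],
}
players = list(counts)

# Flattened upper triangle: d(0,1), d(0,2), d(1,2).
condensed = np.array([
    wasserstein_distance(quarters, quarters, counts[p], counts[q])
    for p, q in combinations(players, 2)
])

# Full symmetric matrix, handy for inspection.
square = squareform(condensed)
print(np.round(square, 3))
```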

*Running a hierarchical cluster*. Bottom-up hierarchical clustering works by **successively** merging single objects into groups until a final cluster is obtained.

At the very beginning, each point forms its own cluster. Then, the **closest** points are iteratively merged to form larger clusters until all the points are in a single cluster or certain termination conditions are met.

This method adopts a bottom-up approach in which a **dendrogram** is formed from the leaves, representing single instances, up to the root, which contains all the instances in the dataset.

To find groups of players having low distances among their distributions we can **run the Scipy hierarchical clustering**, which accepts the flattened upper triangular distance matrix (a condensed distance matrix, in Scipy’s terms) as input:
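A minimal sketch of the call on a toy condensed matrix for three players. The distances are invented, and `method="average"` is an assumption: the linkage criterion used in the original analysis is not stated.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy condensed distance matrix for 3 players: d(0,1), d(0,2), d(1,2).
condensed = np.array([0.1, 1.5, 1.4])

# Average linkage is an assumption; other criteria are available.
Z = linkage(condensed, method="average")

# Each row: first cluster, second cluster, their distance, size of the merge.
print(Z)
```

Here players 0 and 1 are merged first because they are the closest pair; the resulting cluster gets the index 3 and is then merged with player 2.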

The output of the function *linkage* summarises the steps taken in creating the clusters. Each row is formed by four columns which respectively indicate the **first and the second clusters merged**, the distance between them, and the number of instances present in the new merged cluster.

For example, the first row means the algorithm merged the instance number 37 with the instance number 86 because they had a Wasserstein distance of 0.002. The merged cluster is formed by 2 elements.

The algorithm proceeds by iteratively merging the closest clusters, until all the elements are in one single group. The final result can be represented as a **dendrogram**, a particular tree which describes the evolution of the hierarchical algorithm:
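As a sketch, Scipy can build the dendrogram structure from the linkage output, and `fcluster` can cut the tree into a chosen number of groups. The distances and labels below are invented, average linkage is an assumption, and `no_plot=True` only computes the tree layout without drawing it (drawing requires matplotlib).

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

players = ["player_a", "player_b", "player_c", "player_d"]

# Toy condensed distances: d(0,1), d(0,2), d(0,3), d(1,2), d(1,3), d(2,3).
condensed = np.array([0.1, 1.5, 1.4, 1.4, 1.5, 0.2])
Z = linkage(condensed, method="average")

# Compute the dendrogram layout without plotting it.
tree = dendrogram(Z, labels=players, no_plot=True)
print(tree["ivl"])  # leaf labels in the order they appear along the axis

# Cut the tree into two groups.
groups = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(players, groups)))
```

Dropping `no_plot=True` draws the tree, with the colours of the branches marking the main groups.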

From the clustering, we can immediately note the **existence of two big groups** (green and red) in which players differ. Also, notice the **presence of an outlier** (player number 53) which has a **completely different statistical distribution** with respect to the other players. We can investigate further and have a look at this distribution over the game:

If we compare this player with the plots of other players, which we will produce in a minute, it seems evident that he performs quite differently over time. Unlike other athletes, **he is much more constant in performances during the game**.

Now, we are finally ready to **carry out some comparisons among players**. The following code, given two players, plots a whisker plot and a box plot of their performances. Since we are interested in using the cluster to compare the performances, we will just focus on the box plot.

And here are some plots, taken from some of the lowest leaves of the cluster:
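A minimal sketch of such a comparison is shown here. It is not the original plotting code: the helper `compare_players`, the column names and the data are assumptions, and it computes the five-number summary that a box plot draws instead of rendering the chart.

```python
import pandas as pd


def compare_players(events, player_a, player_b):
    """Return the box-plot statistics of the event minutes of two players."""
    subset = events[events["player"].isin([player_a, player_b])]
    stats = subset.groupby("player")["minute"].describe()
    return stats[["min", "25%", "50%", "75%", "max"]]


# Toy table of positive events: names and minutes are invented.
events = pd.DataFrame({
    "player": ["player_a"] * 4 + ["player_b"] * 4 + ["player_c"],
    "minute": [2, 12, 22, 32, 3, 13, 23, 33, 1],
})

summary = compare_players(events, "player_a", "player_b")
print(summary)
```

Calling `boxplot(column="minute", by="player")` on the same filtered subset would draw the two boxes side by side, provided matplotlib is installed.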

That’s amazing! Across all four quarters, the plots reveal a **close similarity among the players we selected**. The distributions are indeed really close, and hence this approach lets us identify players having similar performances over time. Hooorrayyy!!

**Next steps**

In this journey, we just scratched the surface of all the investigations which might be performed starting from the data. Other ideas might consider questions such as whether some athletes have **better performances when playing together**, or worse performances when certain players are on the opposing team. For example, there might exist a set of **charismatic players who are truly beneficial for all the team members**.

Another line of analysis might consider the game events as a sequence, trying for example to understand whether there are patterns or recurrent events that reveal where each team is weak or strong (association rules might be needed here!).

Even starting from the analyses we performed, there is plenty of work to do. **Can we do inference from these clusters**? For example, are a team’s players balanced in their performances? Which teams are more balanced? Is there any correlation between a team’s final position at the end of the season and its level of balance?

As we can see, there is a huge set of questions which could be investigated. Still, the analysis we performed can already be beneficial for the different actors of a match. Also, it can be easily replicated for other sports (such as football, soccer, tennis or whatever). After all, **it is just a matter of using a defined set of positive events recorded over time**.

Thank you for making it this far! I really hope you enjoyed the article as much as I enjoyed carrying out the analyses and sharing the project with you. If you have questions, feel free to comment!

See you around next time, stay gold! :)