Enhancing Basketball Squad Productivity: The Power of Multivariate Regression Modeling

Nachi Lieder
Sports Analytics and Data Science
9 min read · Jul 17, 2023

What to expect from this article

In this article you will go through the full process of lineup optimization using multivariate regression modeling, such as Bayesian mix models, time-series modeling, and other GLMs. You will understand the effect of each lineup on the targeted goal of a high scoring differential versus the opposing team, learn how to incorporate constraints such as effectiveness as a function of time on the court, and learn how to build a game plan based on these lineups.

The demonstrations use a repo I created with Python and the official Israeli Basketball League data.

This article will be split into several parts.

Part 1: Intro to the problem + Challenges + Dimensionality

Part 2: MMM — modeling

Part 3: Simulation and Optimization

Introduction

A team can play with many different lineup combinations. Some are fast-paced, some are intense, others may be conservative momentum changers. What is the production and contribution of each lineup?

Is it safe to say that the lineup with the largest Adjusted +/- metric is the most beneficial?

How would a coach build the optimal rotation while optimizing each active five-player combination's minute share? We can't leave players in for too long or they become ineffective, but on the other hand we want effective players on the court for as long as possible to maximize productivity.

In this part we will go through the different steps of building the ideal model, in order to understand the underlying contribution of each set of five players to the total score differential. We will explore different approaches to building the model, from Bayesian methods to ARIMA and classic GLMs.

Once we understand the different contributions, we will go through how to build a simulator for a full game's "puzzle". This will answer questions such as "Should I play lineup A for 10 possessions and lineup B for 30? Or 20 and 20?" Obviously we will be looking at a multi-dimensional set, which makes the problem quite hard to solve.

The Challenge

Example: given five different sets of players (in a small, limited universe), can we understand the contribution of each lineup? How many minutes should we play each lineup, if at all?

One expected output might state that lineup A contributes 50% of the final score's +/-, suggesting this lineup may be key to winning, but that its productive effect diminishes (fatigue) after 10 possessions, so we would cap its usage around then. If lineup B, which in this example has the second-highest contribution, contains the same players as lineup A except for one, would we play this lineup at its maximum too? Or do we have to take into account the players' reduced effectiveness after they have already played in lineup A? This becomes an NP-hard problem: a classic optimization problem in the form of a knapsack problem. We will address this properly later on; a toy sketch of the formulation follows.
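To make the knapsack analogy concrete, here is a minimal sketch, with entirely hypothetical numbers and a made-up `lineup_curves` structure, of allocating a fixed budget of possessions across lineups via dynamic programming. It ignores shared-player fatigue, which is exactly the complication discussed above; the real curves would come from the fitted model in Part 2.

```python
# Toy possession-allocation knapsack: choose how many possessions each
# lineup plays so total contribution is maximized. All numbers are
# hypothetical illustrations, not fitted values.
lineup_curves = {
    "A": lambda n: 0.8 * min(n, 10) - 0.2 * max(n - 10, 0),  # decays after 10
    "B": lambda n: 0.5 * n,                                   # steady producer
    "C": lambda n: 0.3 * min(n, 25),                          # flat after 25
}
BUDGET = 40  # total possessions to allocate

def best_allocation(lineups, budget):
    """DP over (lineups processed, possessions used) -> best value + allocation."""
    best = {0: (0.0, {})}  # possessions used -> (value, allocation dict)
    for name, curve in lineups.items():
        new_best = {}
        for used, (val, alloc) in best.items():
            for n in range(budget - used + 1):
                key, cand = used + n, val + curve(n)
                if key not in new_best or cand > new_best[key][0]:
                    new_best[key] = (cand, {**alloc, name: n})
        best = new_best
    return max(best.values(), key=lambda t: t[0])

value, allocation = best_allocation(lineup_curves, BUDGET)
print(allocation, round(value, 1))  # e.g. {'A': 10, 'B': 30, 'C': 0} 23.0
```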

How do we find the optimal number of possessions for which each lineup is effective?

We will look at the problem on two levels:

- Team level

- Lineup Cluster level

Team level: For each team, we will take the given year of data and build a model to optimize that team's lineups individually.

Lineup Cluster level: Here we will look at a broader granularity, so that each cluster of lineups "together" produces a certain outcome, leveraging the entire league's data.

There are several issues we will encounter.

Small data: We are dealing with small data; a season in the Israeli league contains roughly 20–22 games before the playoffs. Given that a team uses 20+ distinct lineups, we run into a degrees-of-freedom problem. Moving to a cluster level, we can reduce the number of features from the number of lineups to the number of clusters, and multiply the number of observations by a factor of close to 10 (the number of teams in the league).

Dynamic team lineups: Teams tend to change their lineups during the season, especially toward its end. It is therefore hard to keep the same lineups throughout the year, and the data changes constantly.

Dimensionality — The Good Fight

One of the challenges we encounter is lineups that change too frequently. How can we build a model when a large number of lineups don't even exist historically, due to players being traded, waived, injured, etc.?

If we have 30+ different lineups played within a season of 20–25 games, we have a dimensionality problem (M >> N). How can we reduce the number of features (lineups) in our set? How can we enrich our dataset to contain more data points?

One thought would be to remove all lineups that aren't applicable or are too sparse. The tradeoff is that we may miss useful lineups, or further reduce a sample size that is already very small.

Another approach would be to reduce the number of lineups by clustering them together. How would that work? The assumption is that we can find similar lineups based on the actual players. Imagine two lineups that are exactly the same other than one player. That differing player could be very similar across the two lineups, and it would be a shame to split the production of these two lineups over that single player. Obviously, if the two players are very different, then it would make sense to keep the lineups separate.

Once the players are split into clusters, we can create combinations of five. Instead of a 1:1 match between lineups, with close to 20 observations per lineup, we now have N cluster-lineups (where the number of cluster combinations is forced to be smaller than the original number of lineups) but 10–30x the observations, since on average each team will log an entry for a given clustered lineup per match. This way we reduce the number of lineups by about a factor of 3, and expand our dataset by a factor of 30, as sketched below.
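As a minimal sketch of that collapse (the `player_cluster` assignments and lineups are hypothetical stand-ins, not real IBL data), two player-level lineups that differ by one similar player map to the same cluster-lineup key and pool their observations:

```python
from collections import Counter

# Hypothetical player -> cluster assignments; real labels come from the
# clustering step described below.
player_cluster = {"p1": 1, "p2": 1, "p3": 3, "p4": 5, "p5": 8, "p6": 1}

# Two player-level lineups that differ only by one similar player...
lineups = [("p1", "p2", "p3", "p4", "p5"),
           ("p6", "p2", "p3", "p4", "p5")]

def to_cluster_lineup(lineup):
    """Map a 5-player lineup to its sorted cluster-label key, e.g. '1-1-3-5-8'."""
    return "-".join(str(c) for c in sorted(player_cluster[p] for p in lineup))

# ...collapse to one cluster-lineup, doubling its observation count.
counts = Counter(to_cluster_lineup(l) for l in lineups)
print(counts)  # Counter({'1-1-3-5-8': 2})
```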

In addition, any new lineup can be assigned to the closest cluster at any given time based on the distance metric of our choice.

The disadvantage here is that we won't be able to differentiate between lineups within a grouped lineup cluster. Meaning, if our model later says to play 10 minutes with a lineup from the cluster "1–1–3–5–8" (a lineup of five, each position denoting a player's cluster id, which will be explained in the following part), we won't know precisely which players to play. This won't be fully optimized, but we have reduced our problem drastically and can re-iterate this method on a smaller "universe" of players within a team. In terms of management, this may give the coach a little more wiggle room and flexibility, which may turn out to be an advantage. In addition, if we use a hierarchical clustering model, we can first cluster at a higher node and, once we have reduced the problem, split to a lower granularity.

Image captured from https://towardsdatascience.com/hierarchical-clustering-explained-e59b13846da8

Note: From experience working with management in professional teams, giving the coaching staff room to make their own internal decisions can be the connecting point that gets them to cooperate with these predictions and recommendations.

Another advantage concerns new players. A new player who joins the league or team mid-year doesn't suffer from a lack of sample size: by assigning the player to a cluster, we can gain inference easily. One only needs to carry over the player's past stats (normalized, of course) and assign the player a cluster for further analysis.
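As a hedged sketch of that assignment, assuming a scikit-learn KMeans workflow like the one described in the clustering section below, and with a random stand-in for the 127-feature matrix and a hypothetical `new_player_stats` vector:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Stand-in for the (players x 127 features) box-score matrix used to fit
# the clusters; real data would replace this random draw.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 127))

scaler = StandardScaler().fit(X)
kmeans = KMeans(n_clusters=17, n_init=10, random_state=0).fit(scaler.transform(X))

# Hypothetical newcomer: their carried-over stat line, scaled the same way,
# is assigned to the nearest existing cluster.
new_player_stats = rng.normal(size=(1, 127))
cluster = kmeans.predict(scaler.transform(new_player_stats))[0]
print(f"new player assigned to cluster {cluster}")
```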

Player Clustering

This is tricky: players have their own style, contribution, physical form, mental "state", etc. How can we cluster these players into a closed set of tightly fit clusters?

This is a form of art. We can start by looking at features from the box score, plus other numerical attributes, that begin to separate these players in an embedding space.

I started with a large set of features from the extended box score. This is very much dependent on the dataset you start with and how "rich" it is; in my case, the Israeli league (IBL) data left me with 127 features. Among them were classic box-score categories, play-calling categories, opponent actions broken down by play, percentages, and types of drives and shots. This is quite extensive and may require some refinement, but the initial set is very solid. My current assumption is that there is no need to add demographic or personal-attribute data, since that information may already be implicit in the extensive set of features and their combinations.

Several features, such as percentages, need to be processed, so there may be an exhaustive pre-processing step. Normalization, standardization, and scalers might be needed as well, depending on your feature set. This is crucial when running distance-metric methods, to remove any bias toward a given feature.
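As a minimal sketch of that pre-processing (all column names here are hypothetical, not the actual IBL schema), raw counts can be converted to per-possession rates and everything standardized before clustering:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw box-score frame; the real data has ~127 such columns.
raw = pd.DataFrame({
    "pts": [410, 220, 305], "ast": [120, 40, 88],
    "fg_made": [150, 80, 110], "fg_att": [320, 200, 240],
    "possessions": [800, 500, 650],
})

feats = pd.DataFrame({
    # rates instead of raw counts, so playing time doesn't dominate
    "pts_per_poss": raw["pts"] / raw["possessions"],
    "ast_per_poss": raw["ast"] / raw["possessions"],
    "fg_pct": raw["fg_made"] / raw["fg_att"],  # percentages stay as-is
})

# Standardize so no single feature biases the distance metric.
X = StandardScaler().fit_transform(feats)
print(X.round(2))
```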

I tested how many clusters to create by calculating the silhouette score, performing a form of elbow method to find the optimal number of clusters. I found 17 clusters to be ideal, though the reduction of the final feature set may then not be as drastic as stated in the previous part. Therefore I will test two splits, one with 17 clusters and one with 9–10, then estimate the goodness of fit of both final models and choose between the two. (This can be set as a parameter within a hyperparameter-tuning job.)
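A minimal sketch of that selection loop, again with a random stand-in for the standardized feature matrix rather than the real data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical standardized feature matrix: ~120 players x 127 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 127))

# Score a range of cluster counts; the peak / elbow suggests k.
scores = {}
for k in range(5, 21):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```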

When clustering into 9 labels and running a dimensionality reduction to plot in 2D, we get the following separation:

9 clusters — Generated from notebook — LSA

Running with 17 clusters yields a better split:

17 clusters — Generated from notebook — LSA

Broadly speaking, there seems to be a fair split, but not a complete one. Iterating over the features and refining them might perform better, and the current split can serve as a benchmark.
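The plot captions above mention LSA; as a hedged sketch (with a hypothetical feature matrix standing in for the real one), the 2D projection could be produced with scikit-learn's TruncatedSVD:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

# Hypothetical standardized player-feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 127))

labels = KMeans(n_clusters=17, n_init=10, random_state=0).fit_predict(X)

# LSA-style reduction to 2 components for plotting.
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab20", s=15)
plt.title("Player clusters, 2D LSA projection")
plt.show()
```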

Regroup To Lineups and Ranking

Once we have a good enough split into player types, we can move up to lineup granularity: lineups of the five players active on the court at any given time.

For the sake of the discussion, future references to cluster-based lineups will use label ids. In other words, the label 1–1–1–1–1 represents a lineup of five players, all from cluster 1.

Each player is assigned a label, and historically we have all the lineups from play-by-play data. This means that for each play/possession we know which lineup was playing at that time (later we will address the opposing lineup as well, and its limitations).

Before we even start discussing diminishing returns due to fatigue and time limitations, let's first try to understand the general production of lineups over recent history. First we assign each player id its cluster label.

transformation from ID to Cluster Label
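A hedged sketch of that transformation on play-by-play rows (the column names, ids, and `player_cluster` mapping are hypothetical):

```python
import pandas as pd

# Hypothetical cluster assignments from the clustering step.
player_cluster = {101: 1, 102: 1, 103: 3, 104: 5, 105: 8, 106: 3}

# Hypothetical play-by-play rows: five on-court player ids plus the
# possession's point differential.
pbp = pd.DataFrame({
    "p1": [101, 101], "p2": [102, 102], "p3": [103, 106],
    "p4": [104, 104], "p5": [105, 105],
    "plus_minus": [2, -1],
})

id_cols = ["p1", "p2", "p3", "p4", "p5"]
pbp["lineup_cluster"] = pbp[id_cols].apply(
    lambda row: "-".join(str(c) for c in sorted(player_cluster[p] for p in row)),
    axis=1,
)
print(pbp[["lineup_cluster", "plus_minus"]])
```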

Next we want to see each cluster's raw historical stats in terms of plus-minus: in total, per minute, and per possession. An example output is shown below.

Plus minus stats per lineup cluster
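A minimal sketch of that aggregation, assuming a possession-level frame like the hypothetical `pbp` above, extended with minutes:

```python
import pandas as pd

# Hypothetical possession-level data keyed by lineup cluster.
pbp = pd.DataFrame({
    "lineup_cluster": ["1-1-3-5-8", "1-1-3-5-8", "2-3-3-4-9", "2-3-3-4-9"],
    "plus_minus": [2, -1, 3, 1],
    "minutes": [0.4, 0.5, 0.3, 0.6],
})

stats = (
    pbp.groupby("lineup_cluster")
    .agg(total_pm=("plus_minus", "sum"),
         possessions=("plus_minus", "size"),
         minutes=("minutes", "sum"))
)
stats["pm_per_possession"] = stats["total_pm"] / stats["possessions"]
stats["pm_per_minute"] = stats["total_pm"] / stats["minutes"]
print(stats)
```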

Obviously this is not enough. We can see how the lineups were utilized in the past and the high-level outcome per setup, but this cannot serve as a robust recommendation for the future. If we plotted these lineups over time, there is a good chance some produced positively for a large number of possessions and then decayed, resulting in negative aggregate production.

What's Next?

To summarize this part of the process: we reviewed the initial problem, offered a clustering-based solution, and prepared the data for contribution modeling.

Next we will dive into the modeling methods, to understand the underlying contribution of each cluster and derive usage "instructions" by identifying their diminishing returns.

Stay tuned for the next part!
