Examples of Data Science being used in Basketball
The data science approach can be used to answer questions in many fields, but one application I find particularly fascinating is how data science is being used in the NBA and in basketball more broadly. While exploring this topic I found several examples, which I will go over here. I am still learning about data science and will do my best to detail the methods used in each example.
Example 1: How ‘stretchiness’ can determine NBA potential
While height is a major factor in determining what position a player can play and how good an athlete they will be, another variable NBA scouts focus on is wingspan, and more specifically the wingspan-to-height ratio. The higher the ratio, the greater the potential a player has in certain offensive and defensive categories. For example, a longer wingspan means a player can cover more area when guarding an opponent on defense. On offense, a higher ratio can mean the player is able to use certain techniques and moves to adjust and shoot over a defender. According to “Finding the perfect body in the 2017 NBA draft” by J.H. Yeh, the best players have a wingspan-to-height ratio ranging from 1.08 to 1.10. Recent Finals MVP Kawhi Leonard has a w/h ratio of 1.12, and Defensive Player of the Year Rudy Gobert has a w/h ratio of 1.09. The average w/h ratio in the NBA is 1.04. In the article, the author consolidates information on the 2017 NBA draft, ranks the players by projected draft pick, and then compares their wingspan-to-height ratios.
Data Science Methods Used:
To gather the draft measurements, the author used the DraftExpress website. To collect data yourself, you can either scrape sites such as Basketball Reference or use an API that provides basketball data, and then store the information in a database. Next, you can use Python to clean the raw data and keep the variables you are interested in; in this case these were a player's rank, name, position, wingspan and height. After that, you can create a new wingspan/height variable by defining a function that divides each player's wingspan by their height, and then use the Matplotlib library in Python to create visualizations such as scatter plots and histograms. You can also go a step further and run SQL queries to find which players have the largest wingspan and what their projected draft rank is. An interesting question to explore down the road would be to gather this information for current NBA rosters and compare players' wingspan-to-height ratios with their offensive and defensive ratings.
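As a rough illustration of the steps described above, here is a minimal sketch that stores a few prospects in a database, computes the wingspan/height variable with a small function, and sorts with a SQL query. The rows, the table name, the column names and the function name are all made up for this example (they are not the article's data or code), and the sketch uses Python's built-in sqlite3 module with an in-memory database as a stand-in for whatever database you choose. It sorts by the ratio rather than raw wingspan, and it leaves out the scraping and plotting steps.

```python
import sqlite3

# Hypothetical prospects: (draft_rank, name, position, wingspan_in, height_in).
# These rows are invented for illustration and are not real measurements.
prospects = [
    (1, "Prospect A", "PG", 81.0, 75.0),
    (2, "Prospect B", "SF", 84.0, 80.0),
    (3, "Prospect C", "C", 91.0, 84.0),
    (4, "Prospect D", "SG", 78.0, 78.0),
]


def wingspan_height_ratio(wingspan, height):
    """Return wingspan divided by height (both in the same unit)."""
    return wingspan / height


# An in-memory SQLite database stands in for a real database here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prospects ("
    "draft_rank INTEGER, name TEXT, position TEXT, "
    "wingspan REAL, height REAL, ratio REAL)"
)
conn.executemany(
    "INSERT INTO prospects VALUES (?, ?, ?, ?, ?, ?)",
    [(r, n, p, w, h, wingspan_height_ratio(w, h)) for r, n, p, w, h in prospects],
)

# Sort players from the highest to the lowest wingspan-to-height ratio.
ranked = conn.execute(
    "SELECT name, draft_rank, ROUND(ratio, 3) FROM prospects ORDER BY ratio DESC"
).fetchall()

for name, draft_rank, ratio in ranked:
    print(f"{name}: ratio {ratio}, projected pick {draft_rank}")
```

With real data you would replace the hard-coded list with rows loaded from your scraper or API, and the same query could be extended to filter by position.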
Example 2: Predicting the Career of NBA players
To understand how predictive analytics works, a basic understanding of regression modeling is essential. Regression modeling looks at the relationship between independent variables and an outcome of interest, which lets you draw conclusions about which variables matter, estimate how much impact each independent variable has on the outcome, and make approximate predictions. Simple examples of these models include linear regression and logistic regression. Picking the right type of model depends on the type of data you are working with and the question you are trying to answer. Linear regression is the better choice when the outcome is measured as a continuous variable, while logistic regression is the better choice when the outcome variable is categorical or binary.
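To make the distinction concrete, here is a minimal sketch of a one-variable linear regression fitted by ordinary least squares, plus the sigmoid function that logistic regression uses to turn a linear score into a probability between 0 and 1. The data and the function names are assumptions chosen for illustration; a real analysis would normally use a statistics library and many more variables.

```python
import math


def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sigmoid(z):
    """Map a linear score to a probability, as logistic regression does."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical data: x = seasons played, y = some continuous performance metric.
seasons = [1, 2, 3, 4]
metric = [3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_simple_linear_regression(seasons, metric)
predicted_season_5 = intercept + slope * 5
print(intercept, slope, predicted_season_5)

# For a binary outcome, the same kind of linear score is passed through sigmoid.
print(sigmoid(0.0))
```

The slope answers "how much does the outcome change per unit of the independent variable", which is the kind of conclusion the paragraph above describes.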
Predicting the performance of an NBA player can be extremely beneficial to teams, both when deciding which player to sign to improve the roster and when determining how much that player is worth. The article “We’re Predicting The Career Of Every NBA Player. Here’s How.” by Nate Silver uses similar but more complex modeling techniques to project a player's future performance. The author takes variables such as age, previous offensive and defensive metrics, and players with a similar playing style into consideration, and then predicts the player's value later in their career. In this case the outcome variable, or performance metric, is wins above replacement (WAR). WAR is a useful way to gauge a player's value because it describes how valuable the player actually is to their team. It measures how many wins a player is responsible for by estimating how many fewer wins the team would have if the player of interest were replaced by a replacement-level player (a league-minimum player playing all of the player of interest's minutes). In 2018 an excellent player like LeBron James had a WAR of 17.5, while Blake Griffin had a WAR of 6.5 and DeMar DeRozan had a WAR of 2.5.
Data Science Methods Used:
The data science method used to predict a player's future performance in this case is time series forecasting. Time series forecasting predicts future values from previous values while taking the order of the observations in time into account. In other words, a time series adds an order dependence between observations.
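To show what "predicting future values from previous values" can look like, here is a minimal sketch of simple exponential smoothing, one of the most basic time series forecasting techniques. This is not the method used in the article's model; it is a stand-in chosen for its simplicity, and the WAR values, the function name and the smoothing factor alpha are assumptions. Because more recent observations receive more weight, the order of the values matters.

```python
def exponential_smoothing_forecast(values, alpha=0.5):
    """One-step-ahead forecast using simple exponential smoothing.

    values must be ordered from oldest to newest; alpha (between 0 and 1)
    controls how strongly recent observations dominate the forecast.
    """
    if not values:
        raise ValueError("values must not be empty")
    level = values[0]
    for value in values[1:]:
        # Blend the newest observation with the running level.
        level = alpha * value + (1 - alpha) * level
    return level


# Hypothetical season-by-season WAR values, oldest first.
past_war = [2.0, 4.0, 6.0]
forecast = exponential_smoothing_forecast(past_war, alpha=0.5)
print(forecast)

# Reversing the order changes the forecast, which is the order dependence
# that distinguishes a time series from an unordered sample.
print(exponential_smoothing_forecast(list(reversed(past_war)), alpha=0.5))
```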
Furthermore, the model takes the same independent variables into consideration and lists the 10 players throughout history who are most similar to the player of interest. For example, LeBron is compared to:
The data science technique used for this feature of the model is a nearest neighbor algorithm (k-nearest neighbors, or KNN). A simple example of how nearest neighbors works:
Let's say I want to estimate the weight of person 11 from their age and height, using the age, height and weight I have recorded for other people.
Once you plot age (x axis) against height (y axis), you can see how close these individuals are to one another and how their weights, the variable in question, compare. To get a good approximation of person 11’s weight, you first calculate the distance between person 11 and each of the other people, and then keep the k nearest neighbors. You can calculate the distance with the Euclidean distance formula. For two people with ages $x_1, x_2$ and heights $y_1, y_2$, the formula is $d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$.
In this case let's say k = 3. Based on the distances calculated, the 3 closest neighbors are taken into consideration, those being persons 1, 5 and 6. The average weight of persons 1, 5 and 6 is then the estimate for the weight of person 11.
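The procedure just described can be sketched in a few lines of Python. The people below are hypothetical (they are not the values from the original example), and the function name is my own. The sketch computes the Euclidean distance from the query person to everyone else, keeps the k closest, and averages their weights.

```python
import math


def knn_predict(train, query, k):
    """Predict a value for query as the mean of its k nearest neighbors.

    train is a list of ((age, height), weight) pairs.
    Returns (prediction, nearest_neighbors).
    """
    # math.dist computes the Euclidean distance between two points.
    by_distance = sorted(train, key=lambda row: math.dist(row[0], query))
    nearest = by_distance[:k]
    prediction = sum(weight for _, weight in nearest) / len(nearest)
    return prediction, nearest


# Hypothetical people: ((age in years, height in feet), weight in kg).
people = [
    ((30, 5.0), 60.0),
    ((45, 5.6), 80.0),
    ((36, 5.4), 70.0),
    ((40, 5.8), 75.0),
    ((23, 5.2), 55.0),
    ((39, 5.5), 74.0),
]

new_person = (38, 5.5)  # age and height known, weight unknown
estimate, neighbors = knn_predict(people, new_person, k=3)
print(estimate)
print(neighbors)
```

One thing this sketch makes visible is that age, measured in years, dominates the distance because its numeric range is much larger than that of height in feet. In practice the variables are usually rescaled to comparable ranges before distances are computed.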
The value of k can be chosen by looking at validation and training error curves or by using a grid search. Examples:
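As a minimal sketch of the grid search idea, the code below tries several candidate values of k and keeps the one with the lowest validation error. It assumes leave-one-out validation (each point is predicted from all the others) and mean squared error as the error measure; both are my choices for illustration, as are the toy data and function names.

```python
import math


def knn_predict(train, query, k):
    """Mean target value of the k training points closest to query."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return sum(target for _, target in nearest) / len(nearest)


def leave_one_out_error(data, k):
    """Mean squared error when each point is predicted from all the others."""
    total = 0.0
    for i, (features, target) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        total += (knn_predict(rest, features, k) - target) ** 2
    return total / len(data)


def choose_k(data, candidates):
    """Evaluate every candidate k and return (best_k, errors_by_k)."""
    errors = {k: leave_one_out_error(data, k) for k in candidates}
    return min(errors, key=errors.get), errors


# Hypothetical one-feature data set: ((feature,), target).
data = [((1.0,), 1.0), ((2.0,), 2.0), ((3.0,), 3.0), ((4.0,), 4.0), ((5.0,), 5.0)]

chosen_k, errors = choose_k(data, candidates=[1, 2, 3])
for k, error in errors.items():
    print(k, round(error, 3))
print("chosen k:", chosen_k)
```

Plotting the values in errors against k gives the validation error curve mentioned above; the chosen k is simply the lowest point on that curve.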
The CARMELO model uses nearest neighbors to assign a similarity score. When the player of interest is compared to other players, every comparison starts with a similarity score of 100. Points are then subtracted according to how different the other player is from the player of interest on the various characteristics, in other words according to how far apart the two players are. The top 10 similarity scores are then displayed for each player.
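The "start at 100 and subtract points for differences" idea can be sketched as follows. This is not CARMELO's actual formula: the attributes, the penalty weights, the player records and the function names are all assumptions made up for illustration.

```python
def similarity_score(player, other, penalties):
    """Start at 100 and subtract a weighted penalty for each difference."""
    score = 100.0
    for attribute, weight in penalties.items():
        score -= weight * abs(player[attribute] - other[attribute])
    return score


def most_similar(player, candidates, penalties, top_n=10):
    """Return the top_n (score, name) pairs, highest score first."""
    scored = [
        (similarity_score(player, candidate, penalties), candidate["name"])
        for candidate in candidates
    ]
    return sorted(scored, reverse=True)[:top_n]


# Hypothetical players and penalty weights (points lost per unit of difference).
target = {"name": "Target", "age": 25, "height": 80}
historical = [
    {"name": "Player A", "age": 26, "height": 80},
    {"name": "Player B", "age": 30, "height": 78},
    {"name": "Player C", "age": 25, "height": 84},
]
penalties = {"age": 2.0, "height": 1.5}

top_matches = most_similar(target, historical, penalties, top_n=10)
for score, name in top_matches:
    print(name, score)
```

The weights play the same role as scaling in the nearest neighbor distance: a larger weight means a difference on that characteristic costs more similarity points.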
Another cool feature of the CARMELO model is that it is available online and very easy to interact with. One thing I noted was that age is a major factor (as it should be) when determining future WAR. For example, a good rookie's projection will continue to increase over time, while the projected WAR of an exceptional player who is at the tail end of their career will drop. Also, an exceptional player's future WAR may still be better than that of a decent player who is still in their prime.