Using KNN Methods to Predict the Cleveland Cavaliers’ Outcomes Based on Halftime Data

Henry Myers
Kenyon College Sports Analytics
7 min read · Apr 4, 2018

The NBA is often considered a 4th-quarter league. Because points can be scored so quickly, almost no deficit is insurmountable at any point in a game, and as a result most games are won or lost in the 4th quarter. This article, however, uses statistical methods to predict the outcomes of Cleveland Cavaliers games by considering only the state of the game at halftime. The analysis includes three types of models. Most interestingly, a K-Nearest-Neighbors (KNN) machine learning model is employed to make predictive inferences. Then, a logistic (logit) model is used to make predictions. Finally, two rudimentary models provide a barometer for the quality of the KNN and logit predictions: the first uses ESPN’s in-game win probabilities to predict final outcomes, and the second simply predicts that whichever team is winning at halftime will win the game.

Image taken from https://ftw.usatoday.com/2018/01/lebron-james-lebron-son-retirement-career-thoughts-cavs-reflection-nba-family

Methods

The KNN Model

The KNN model is a nonparametric predictive technique that is widely applicable to real-world data. Fundamentally, KNN predicts the outcome of a new observation by finding the previous observations that most resemble it. The model calculates a “distance” between the new observation and every previous observation, then combines the outcomes of the k nearest neighbors, by averaging them or, for categorical outcomes like wins and losses, by majority vote, to produce a prediction for the observation in question.

It is a rather abstract idea to consider “distances” between observations, so let’s walk through an instructive example to get a feel for what is going on. Consider the (fabricated) data[1] presented in Table 1. It provides data for five games that have been completed and one game (the Rockets game) that is currently at halftime. For each game, the table lists the number of blocks the Cavs recorded in the first half and the number of points by which they led at halftime. Lastly, the result of the game from the Cavs’ perspective is recorded. The goal is to predict the outcome of the game against the Rockets based on the Cavs’ blocks and lead at halftime. The KNN method calculates a distance between each completed game and the Rockets game. There are many ways of defining this “distance”; the two most popular are the Euclidean distance (think Pythagorean theorem) and the Manhattan distance (think city blocks). In this article, the Euclidean distance is used.

The Euclidean distance is the square root of the sum of the squared differences between each pair of corresponding values. For example, consider the distance between the Rockets game and the Bucks game.
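In R, that calculation is one line. The block and halftime-lead values below are placeholders rather than the actual Table 1 entries, which are fabricated and not reproduced in the text.

```r
# Hypothetical halftime numbers standing in for two rows of Table 1:
# first-half blocks and halftime lead for the Bucks and Rockets games
bucks   <- c(blocks = 4, lead = 6)
rockets <- c(blocks = 3, lead = 2)

# Euclidean distance: square root of the sum of squared differences
sqrt(sum((bucks - rockets)^2))
#> [1] 4.123106
```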

The distance between the Rockets game and each of the other observations is calculated in the same way; the resulting distances can be seen in Table 2. There, the Celtics, Bucks, and Suns games (highlighted in green) are the three closest in distance to the Rockets game. If the analysis considers the three nearest neighbors to the Rockets game, then these three observations are used to make the prediction; in other words, the “k” in KNN is three here. So, how is the prediction made? Referring back to Table 1, two of the three nearest neighbors to the Rockets game ended in a Cavs win. As a result, the prediction is that the Cavs will beat the Rockets.
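The same toy exercise can be run end to end with the knn() function from R’s class package. Again, the five completed games below are hypothetical stand-ins for Table 1, not the fabricated values shown there.

```r
library(class)  # provides the knn() function

# Hypothetical completed games: first-half blocks, halftime lead, final result
past_games <- data.frame(blocks = c(4, 2, 5, 1, 3),
                         lead   = c(6, -3, 4, -8, 1))
results <- factor(c("Win", "Loss", "Win", "Loss", "Win"))

# The Rockets game, currently at halftime
rockets_game <- data.frame(blocks = 3, lead = 2)

# Majority vote among the three nearest neighbors (k = 3)
knn(train = past_games, test = rockets_game, cl = results, k = 3)
```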

To determine which variables to use as predictors in this model, the best subsets command in R was used to gauge which predictors carry the most predictive power. Based on that output, subsets of different sizes were chosen to create a few different models. Finally, I had some fun and chose one set of predictors myself, based purely on my gut and what I know about the NBA. The four sets of predictors can be found below; they are the predictor sets used in the four KNN models.
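In R, best subsets selection is commonly run with regsubsets() from the leaps package; the command below is a sketch under that assumption, and the data frame name and halftime variables are hypothetical placeholders rather than the article’s actual variable list.

```r
library(leaps)  # regsubsets() performs best-subsets selection

# halftime_stats is assumed to contain one row per game, with the final
# result coded 0/1 and a set of halftime box-score statistics
best <- regsubsets(result ~ lead + fg_pct + rebounds + assists +
                     turnovers + blocks + steals,
                   data = halftime_stats, nvmax = 7)
summary(best)  # which predictors enter the best model of each size
```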

Finally, let’s spend some time on how the KNN models are used to produce predictive results. Often in KNN analysis, the full set of observations is split into two groups: the first group is used to build the model (the training set) while the second is used to test it (the test set). The idea is that the training set is used to make predictions about the test set, and the test set’s predicted outcomes are then compared to its actual outcomes, which gives some indication of the model’s predictive power. However, the Cavs have played only 72 games this season at the time of writing. Splitting those games into training and test sets quickly leaves too few observations to make accurate predictions. Instead, the methods in this article predict each game once by letting the other 71 games serve as the training set and the single game of interest serve as the test set. In this way, there is a prediction for each game based on the outcomes of the other 71 games. To determine the quality of the predictions, simply tabulate the percentage of correct predictions. The results can be found in the Results section.
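Below is a minimal R sketch of that leave-one-out scheme, again leaning on the class package. The predictor data frame, the percentage-of-max scaling step, and the choice of k = 5 are illustrative assumptions rather than the exact setup used in the article.

```r
library(class)

# predictors: halftime statistics, one row per game (72 rows)
# results:    factor of final outcomes ("Win"/"Loss")

# Scale each column to a percentage of its maximum so that every predictor
# carries equal weight in the Euclidean distance (see the Analysis section)
scaled <- as.data.frame(lapply(predictors, function(x) x / max(x)))

preds <- factor(rep(NA, nrow(scaled)), levels = levels(results))
for (i in seq_len(nrow(scaled))) {
  # Train on the other 71 games, predict the held-out game; k is illustrative
  preds[i] <- knn(train = scaled[-i, ], test = scaled[i, , drop = FALSE],
                  cl = results[-i], k = 5)
}

# Percentage of games predicted correctly
mean(preds == results)
```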

Other Models

A logit model and two rudimentary models are also considered to provide comparison values. The logit model uses the same predictors as the KNN model for which I chose the predictors based on my own intuition. Because the probabilities associated with the predictions are the important part of this analysis, the logit is presented in probability form.
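Writing the halftime predictors as x1 through xk, the standard logistic form of that probability is:

$$\hat{p}(\text{Cavs win}) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}}$$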

If the logit predicts the Cavs to win the game with greater than 50% probability, this is counted as a predicted win. ESPN’s in-game win probabilities are used in a similar way: if ESPN predicts the Cavs to win with probability greater than 50% at halftime, this is a predicted win. Finally, in the lead-as-predictor model, if the Cavs are winning at halftime, this is a predicted win as well.
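A hedged sketch of these cutoff rules in R: glm() with family = binomial is the standard logit fit, while the data frame and column names (including the ESPN probability column) are hypothetical.

```r
# Fit the logit on halftime statistics (column names are hypothetical)
logit_fit <- glm(result ~ lead + fg_pct + turnovers + blocks,
                 data = halftime_stats, family = binomial)

# Predicted win probability for each game, then the 50% cutoff
p_hat      <- predict(logit_fit, type = "response")
logit_pred <- ifelse(p_hat > 0.5, "Win", "Loss")

# ESPN's halftime win probability, thresholded the same way
espn_pred <- ifelse(halftime_stats$espn_prob > 0.5, "Win", "Loss")

# Rudimentary baseline: whoever leads at halftime is predicted to win
lead_pred <- ifelse(halftime_stats$lead > 0, "Win", "Loss")
```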

Results

Consider how the seven models stack up against each other. In Table 3, each model is presented along with the percentage of the time that it made the correct outcome prediction. We can see that the KNN model in which I followed my gut to choose predictors has the most correct predictions.

Analysis

The most interesting result here is that in the KNN models, more predictors is not necessarily better. The best way to account for this pattern is to consider the normalization step in the KNN analysis. Part of the KNN procedure is to normalize each value into a percentage-of-max value, rather than using the raw value itself, when determining the Euclidean distance. This process forces each predictor to carry equal weight, which is why more predictors are not necessarily tied to more predictive power in the KNN model. In most regressions, we hand the model a set of predictors and ask it to show us which ones matter most via a t-statistic for each predictor, and adding variables never decreases the total amount of variability explained. In the KNN model, we are not afforded that luxury.

This normalization aspect of the KNN process forces every variable included in the model to carry equal weight. This forced equal contribution isn’t necessarily good or bad; it simply puts more of the choices into the hands of the user. What matters most in creating a high-quality, normalized KNN model is choosing predictors that are harmonious, because each added predictor not only carries as much weight as the pre-existing predictors but also dilutes their predictive contributions. For additional reading, consider the article by Li and Gong[2], which examines these issues by studying KNN methods that use weighting mechanisms to allow predictive contributions to vary across the explanatory variables.
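As a rough illustration of that idea (and not the specific estimator studied by Li and Gong), a weighted Euclidean distance lets the analyst dial each predictor’s contribution up or down. The observations and weights below are hypothetical.

```r
# Weighted Euclidean distance: w controls each predictor's contribution
weighted_dist <- function(a, b, w) sqrt(sum(w * (a - b)^2))

# Hypothetical scaled observations, with weights favoring the halftime lead
game_a  <- c(blocks = 0.8, lead = 0.6)
game_b  <- c(blocks = 0.6, lead = 0.2)
weights <- c(blocks = 0.25, lead = 0.75)

weighted_dist(game_a, game_b, weights)
```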

Conclusion

The KNN model is capable of predicting Cavs game outcomes based on halftime team statistics. In fact, in the best predictive model, 71% of predicted outcomes were correct. This compares to the 67% that the logit model predicts correctly, the 70% that ESPN’s in-game win probability predicted correctly, and the 68% that can be predicted correctly by simply assuming the team winning at halftime will win the game. The primary limitation of the KNN models presented in this article is that all variables are treated as equally important when computing predicted outcomes. As a result, adding predicting variables does not necessarily lead to a more accurate model.

Henry Myers is a senior economics and mathematics double major at Kenyon College.

Works Cited

Li, Rui, and Guan Gong. “K-Nearest-Neighbour non-Parametric estimation of regression functions in the presence of irrelevant variables.” Econometrics Journal, vol. 11, no. 2, 2008, pp. 396–408.

Thirumuruganathan, Saravanan. “A Detailed Introduction to K-Nearest Neighbor (KNN) Algorithm.” WordPress, 17 May 2010.

Willems, Karlijn. “Machine Learning in R for beginners.” DataCamp Community, 25 Mar. 2015, www.datacamp.com/community/tutorials/machine-learning-in-r.

[1] The data provided in Table 1 is fabricated. It is intended only for instructive purposes and does not represent any data used in this article.

[2] Formal citation in Works Cited section
