Using A.I. To Hack Your Fantasy Lineup

Samuel Mohebban
The Sports Scientist
8 min read · Jun 29, 2020


Photo Credit: Jason Oh

In my last blog, I showed you how to scrape seasons' worth of NBA data. In this tutorial, I will show you how to use that data in your very own machine learning model. I will implement three types of models: (1) Linear Regression, (2) Artificial Neural Network (ANN), and (3) Multivariate Recurrent Neural Network (LSTM-RNN).

Before we get into coding each model, I will briefly go over the advantages and disadvantages of each and how they can be useful for predicting fantasy sports.

Multivariable Linear Regression Model

As the name implies, this model fits your data with a linear relationship. Like the famous equation we all learned in high school, y = mx + b, a multivariable regression applies the same concept in multiple dimensions. This method measures the relationship between more than one independent variable (inputs) and one or more dependent variables (outputs).

In our case, this model is useful because we will be using multiple inputs (player average stats: rebounds, assists, etc.) in order to predict an output with a continuous value (fantasy points scored).
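Concretely, the fitted model takes the form: predicted fantasy points ≈ b0 + b1·(avg. minutes) + b2·(avg. rebounds) + b3·(avg. assists) + …, with one coefficient per input stat. Training simply finds the coefficients that minimize the error between predicted and actual points.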

Advantages

  • Very easy to create
  • Requires minimal computational power when compared with Neural Networks

Disadvantages

  • Can only capture linear relationships
  • Deals with noise and outliers poorly

Artificial Neural Networks (ANN)

An ANN is the most basic type of neural network architecture. In neural networks, patterns are learned using input, hidden, and output layers. Each layer contains nodes that compute an activation function and pass their output to the next layer. Think of the input layer as the set of inputs you use to create an output; in our case, this layer will consist of 16 nodes corresponding to 16 different stats per player. The number and size of the hidden layers vary depending on the complexity of the problem and the computational power you have available. Within each node inside the hidden layers, an activation function (ReLU, LeakyReLU, Sigmoid, etc.) transforms the weighted sum of its inputs. This process is repeated across every layer, hundreds or even thousands of times, until the network finds optimal weights for each input. In our case, the network will attempt to place a weight on each input variable (stat type) and use those weights to make future predictions.

ANN Architecture (Photo Credit: Tavish Srivastava)
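To make the forward pass a bit more concrete, here is a minimal NumPy sketch of what a single hidden layer computes; the layer sizes and the random weights are purely illustrative, not values from the actual model.

```python
import numpy as np

# A minimal sketch of what a single hidden layer computes during a forward
# pass. The sizes and random weights here are purely illustrative.
rng = np.random.default_rng(0)

x = rng.random(16)           # one player's 16 averaged stats (the input layer)
W = rng.random((8, 16))      # weights the network learns during training
b = rng.random(8)            # one bias per hidden node

hidden = np.maximum(0, W @ x + b)   # ReLU activation at each of the 8 hidden nodes
print(hidden.shape)                 # (8,) -> this vector feeds the next layer
```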

Pros

  • Great for categorical and binary classification
  • Can learn complex and non-linear relationships

Cons

  • Requires a lot of data and takes a lot of time to train
  • Very easy to overfit (too many epochs, low batch size, etc.)
  • Requires substantial computational power and sometimes specific hardware depending on the model (GPU)

Multivariate-Recurrent Neural Network (LSTM-RNN)

A multivariate RNN is very similar to an ANN and is widely used for predicting sequential data. Instead of using a single feed-forward pass (like ANNs), RNNs use a chain-like structure that acts as a memory for understanding conditional patterns. To do this, outputs are shared between nodes across time steps, unlike ANNs, where nodes act independently of one another.

RNN with 1 hidden layer (Photo Credit: Dhanoop Karunakaran)
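As a rough illustration of that memory, here is a minimal NumPy sketch of a single vanilla recurrent step (an LSTM adds gating on top of this idea); the sizes and weights are illustrative only.

```python
import numpy as np

# A minimal sketch of a recurrent step: the hidden state from the previous
# time step is fed back in alongside the current input.
# Sizes and weights are illustrative only; an LSTM adds gates on top of this.
rng = np.random.default_rng(0)
n_features, n_hidden = 16, 8

W_x = rng.random((n_hidden, n_features))   # weights for the current input
W_h = rng.random((n_hidden, n_hidden))     # weights for the previous hidden state
b = rng.random(n_hidden)

h = np.zeros(n_hidden)                     # the "memory" starts empty
sequence = rng.random((5, n_features))     # a short sequence of stat vectors
for x_t in sequence:
    h = np.tanh(W_x @ x_t + W_h @ h + b)   # h carries information between steps
```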

Pros

  • Ideal for mixed inputs (continuous and categorical)
  • Popular for language processing and sequential time-series predictions

Cons

  • Like other neural network architectures, these require a lot of data
  • Require more computational power than an ANN as layers increase in size
  • Prone to overfitting

Photo Credit: Radu Vrabie

**Check out my GitHub repository to follow the code implemented in this project**

In addition to the game stats we scraped in the first tutorial, we will also use fantasy stats (player salary, salary change, and fantasy points scored). A problem I found with incorporating this data is that fantasy websites like DraftKings and FanDuel do not post past stats, making web scraping an unviable option. To get this information, I purchased data from rotoguru.com, a website that hosts fantasy stats dating as far back as the early 2000s.

Once I had the fantasy stats for each season, I performed an inner join on player_name, gameID, and date, giving me a data frame with a shape of (48204, 31): 48,204 rows and 31 columns.
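For reference, a minimal pandas sketch of that merge might look like the following; the file names and non-key column names are assumptions, since only the join keys are stated above.

```python
import pandas as pd

# A minimal sketch of the merge. The file names and non-key column names are
# assumptions; only the join keys below come from the text.
box_scores = pd.read_csv("nba_box_scores.csv")        # scraped game stats
fantasy = pd.read_csv("rotoguru_fantasy_stats.csv")   # purchased fantasy stats

merged = box_scores.merge(
    fantasy,
    on=["player_name", "gameID", "date"],
    how="inner",
)
print(merged.shape)   # (48204, 31) in my case
```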

As input variables for each model, we will use each player's stat averages over their last 2 games. The reason I decided on 2 games is that injuries tend to skew averages over longer time periods, giving inaccurate predictions. For the output variable, we will use the number of FanDuel points scored. To make this logic clearer: we will be using a player's stat averages over their last 2 games to predict their current fantasy performance.

To get stat averages for each player, I iterated through each unique name in the data frame, making a new data frame consisting of every data point for a single player. Once I had that data frame, I looped over each row (skipping the first two), calculated the average for each stat, and appended it to its own column. On my GitHub, I show the full notebook, which allows you to choose the number of games you want to use for the averages.

Stat Averages for Each Player
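Below is a rough equivalent of that step using pandas groupby and rolling windows instead of the explicit per-player loop from my notebook; the column names are assumptions.

```python
import pandas as pd

# A rough groupby-based equivalent of the per-player loop. Column names are
# assumptions; N_GAMES matches the 2-game window described above.
N_GAMES = 2
stat_cols = ["minutes", "rebounds", "assists", "steals", "blocks", "points"]

merged = merged.sort_values(["player_name", "date"])
for col in stat_cols:
    # shift(1) so that a game's features only use stats from *previous* games
    merged[f"avg_{col}"] = (
        merged.groupby("player_name")[col]
        .transform(lambda s: s.shift(1).rolling(N_GAMES).mean())
    )

# drop each player's first two games, which have no averages yet
merged = merged.dropna(subset=[f"avg_{c}" for c in stat_cols])
```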

The data frame will now have almost twice as many columns, with the second half of the new table looking like this.

Now that we have formatted the data, we can move onto actually creating and training different models.

Load and Organize Data

In this step, we will load our merged data, split it into training and test sets, then visualize the distribution of fantasy points scored.

Load and Visualize Data
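A minimal sketch of this step could look like the following, assuming the merged data was saved to a CSV and the averaged-stat columns share an avg_ prefix (both assumptions for illustration).

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# A minimal sketch: load the merged data, split it, and plot the target
# distribution. The file name and column names are assumptions.
df = pd.read_csv("merged_fantasy_data.csv")

feature_cols = [c for c in df.columns if c.startswith("avg_")]
X = df[feature_cols]
y = df["fanduel_points"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

y.hist(bins=50)
plt.xlabel("FanDuel points scored")
plt.ylabel("Number of player-games")
plt.show()
```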

Distributions:

Looking at the distributions, we see there are a lot of rows where the player scored 0 fantasy points. Although we can assume this will cause an accuracy problem, let's include these points in the linear model.

Fantasy Point Distribution

Linear Model

One of the benefits of a linear model is how easy and fast it is to create. As shown in the code snippet below, this can be done in just a few lines of code.

Code Implementation for Linear Model
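Here is a minimal sketch of that snippet using Sklearn's LinearRegression, continuing from the train/test split above.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Fit the linear model on the training split and evaluate on the test split.
lin_model = LinearRegression()
lin_model.fit(X_train, y_train)

preds = lin_model.predict(X_test)
print("Test MSE:", mean_squared_error(y_test, preds))
```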

Below is a scatter plot of the predictions vs. the actual values. The threshold for a good prediction was 7 points; meaning, if the actual score was 30 and the algorithm predicted anything between 23 and 37, it counted as a good prediction (green). Although there can be issues with this in terms of optimization, I only used it for visualization so it is easier to see where the model made a useful prediction.

Linear Model: Scatterplot for Predicted vs. Actual
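A sketch of how that plot could be produced, reusing the predictions from the linear model sketch above and coloring points by the 7-point threshold:

```python
import numpy as np
import matplotlib.pyplot as plt

# Color predictions within 7 points of the actual score green, the rest red.
THRESHOLD = 7
errors = np.abs(preds - y_test.to_numpy())
colors = np.where(errors <= THRESHOLD, "green", "red")

plt.scatter(y_test, preds, c=colors, s=8, alpha=0.5)
plt.plot([0, 80], [0, 80], "k--", linewidth=1)   # perfect-prediction reference line
plt.xlabel("Actual FanDuel points")
plt.ylabel("Predicted FanDuel points")
plt.show()
```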

A really great feature of Sklearn's linear regression models is that you can inspect each input's coefficient. The coefficient for an input is similar to a neural network's weight, in that it reflects how strongly that input drives the predictions. Below is a bar graph illustrating each input's coefficient.
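Producing that chart takes just a couple of lines, assuming the lin_model and feature_cols from the earlier sketches:

```python
import matplotlib.pyplot as plt

# Bar chart of the fitted coefficients, one per input stat.
plt.figure(figsize=(10, 4))
plt.bar(feature_cols, lin_model.coef_)
plt.xticks(rotation=90)
plt.ylabel("Coefficient")
plt.tight_layout()
plt.show()
```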

As we can see, averages for minutes played, total rebounds, assists, steals, blocks, 2-pointers, and FanDuel points have the greatest effect on predictions.

ANN Model

For both neural network models in this project, we will use the same data, except we will ignore every row where FanDuel points = 0. The reason for doing this is to get a better fit for the model and thus make better predictions. Before creating and fitting the ANN model, we must preprocess our data using Sklearn's preprocessing module. In this project, I used MinMaxScaler, but feel free to try other scalers to see if you get more accurate predictions.

Code implementation for ANN
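The exact architecture lives in the repo; a minimal Keras sketch in the same spirit (MinMaxScaler followed by a small stack of dense layers trained for 50 epochs) might look like this, with the layer sizes chosen purely for illustration and rows with 0 FanDuel points assumed to be dropped already:

```python
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Scale the inputs, then fit a small fully connected network.
# Layer sizes are illustrative; rows with 0 FanDuel points were dropped earlier.
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

ann = Sequential([
    Dense(64, activation="relu", input_shape=(X_train_s.shape[1],)),
    Dense(32, activation="relu"),
    Dense(1),                       # single continuous output: fantasy points
])
ann.compile(optimizer="adam", loss="mse")

history = ann.fit(
    X_train_s, y_train,
    validation_data=(X_test_s, y_test),
    epochs=50, batch_size=32,
)
```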

After running the code above for 50 epochs, we see that it performed almost identically to the linear model, showing a loss of 110.53 when validating on the test set. We can also see that the model started overfitting our data at about the 9th epoch.

Loss/Validation ANN

Multivariate RNN Model

As with the previous ANN model, you must first normalize your data using MinMaxScaler() and separate it into a training and test set. Before feeding x_train and y_train into an RNN model, you must reshape each x array (x_train, x_test) to be 3-dimensional, such that the shape is (# samples, time steps, features). In this example, # samples is the number of rows, time steps is 1, and features is 15 (one per player stat).

Although this model performed the best out of the three, it still shows a pretty big loss. Similar to the ANN model, the RNN model began overfitting at around 10 epochs; however, validation loss consistently stayed below 100 MSE.

Code implementation for RNN (LSTM)
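A minimal Keras sketch of the LSTM setup, reusing the scaled arrays from the ANN sketch and reshaping them to a single time step as described above (layer sizes are illustrative):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Reshape the scaled arrays to (samples, time steps, features) with one time
# step, then fit a small LSTM. Layer sizes are illustrative.
X_train_3d = X_train_s.reshape(X_train_s.shape[0], 1, X_train_s.shape[1])
X_test_3d = X_test_s.reshape(X_test_s.shape[0], 1, X_test_s.shape[1])

rnn = Sequential([
    LSTM(64, input_shape=(1, X_train_3d.shape[2])),
    Dense(1),
])
rnn.compile(optimizer="adam", loss="mse")

rnn.fit(
    X_train_3d, y_train,
    validation_data=(X_test_3d, y_test),
    epochs=50, batch_size=32,
)
```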
Loss/Validation RNN

Conclusion

Comparing the loss of each model, the best turned out to be the RNN (by a marginal difference). What we can conclude from these values is that there is no strong latent pattern within our dataset. A takeaway from this project is that predicting continuous values with an ANN is, in most cases, no different from predicting values with a linear regression. However, I still believe there is an application for RNNs in fantasy basketball due to the sequential nature of the data.

Future Directions/Optimizations

  • More data points! (Include more categorical data)
  • Reorganize the population data so that there are multiple samples rather than just one.
  • Increase timestep from 1 to 10 or more.
  • In the future, I plan to include another column indicating home vs. away games, as well as the number of games each player participated in over the last week. This statistic will be different from the averages because it shows whether a player has played many games, consequently leading to poorer performance (fatigue). In the NBA, some teams will play 3–4 games a week while others play two with larger breaks in between. These differences could reveal a pattern that would be useful for training an RNN.
  • We could also use an RNN to predict a single player's performance. In this method, we would sample the population to gather hundreds of datasets, each consisting of only a single player's games, then run an RNN on that data frame. By doing this, the algorithm may be able to learn the patterns of a single player rather than those of the entire league.

**Check out my GitHub repository to follow the code implemented in this project**


Samuel Mohebban
The Sports Scientist

Data Scientist | Senior Machine Learning Engineer | BA George Washington University | MS Stevens Institute of Technology