Machine Learning | Sports Predictions

Joe Sasson
Published in Analytics Vidhya
6 min read · Feb 3, 2021

Let’s start with an important question: What is Machine Learning?

Machine Learning is a subset of artificial intelligence that uses algorithms to detect patterns and trends in data.

Machine learning is commonly divided into two families of methods: supervised and unsupervised learning. This article covers the former, which in turn has two types of tasks: regression and classification. Here we tackle a classification task: predicting whether an NCAA basketball team will win or lose its next game.

In computer science, an algorithm is a sequence of commands the computer is programmed to follow. In machine learning, we “train” these algorithms on historical data, to make decisions and predictions — so when new data is received, the algorithm can provide accurate insight into the future. The better the algorithm, the more accurate the decisions and predictions will become.

Examples of machine learning are all around us in today’s world. Digital AI bots recommend movies, songs, and even vacation destinations to us, based on the things we’ve watched, listened to, or searched for online.

How does machine learning work?

I. Select & prepare the data set

II. Implement the desired algorithm (build AI model), and train it on the prepared data

III. Identify model metrics

Select & prepare the data set

Step 1: Installs / Imports

Modules you’ll likely need to install before running this code:

  • category-encoders · PyPI
  • sportsreference · PyPI
  • Sportsipy: A free sports API written for python, sportsipy 0.1.0 documentation (sportsreference.readthedocs.io)

Step 2: Fetch & process data for model input

Now that we have a list of all NCAAB teams, we can fetch the data for each team, and begin building the dataset for the machine learning model.
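A fetch step of this kind could be sketched as follows. This is an illustration only, not the original function: it assumes the sportsreference package exposes `Teams` under `sportsreference.ncaab.teams`, that each team has a `schedule` with a `dataframe` property, and the names `fetch_ncaab_games` and `max_teams` are hypothetical.

```python
import pandas as pd


def fetch_ncaab_games(max_teams=None):
    """Return one DataFrame of game rows for NCAAB teams (hypothetical helper)."""
    # Imported lazily so this module still loads where sportsreference is absent.
    # Assumption: Teams() iterates over every NCAAB team for the current season.
    from sportsreference.ncaab.teams import Teams

    frames = []
    for n, team in enumerate(Teams()):
        if max_teams is not None and n >= max_teams:
            break
        games = team.schedule.dataframe  # assumed: one row per game for this team
        if games is None:
            continue  # team with no games recorded
        games = games.copy()
        games["team"] = team.abbreviation
        frames.append(games)
    return pd.concat(frames)
```

Calling `fetch_ncaab_games(max_teams=5)` would request data over the network for a handful of teams, which is a cheaper way to check the output shape before fetching every team.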

BREAKDOWN:

  • The function above fetches the data for all NCAAB teams. Next, I split the data into ‘features’ (X), and ‘target’ (y).
  • The feature data contains the game statistics that will be used to predict win or loss. See below:
  • The target data contains what we are trying to predict: in this case, the result of the game. Note that the API returns the result of the same game the statistics come from (win or loss). Since we want the model to be predictive, we shift the result back one game, so each row of game stats is paired with the result of the following game. The model is therefore trained to predict what will happen in the next game, based on what happened in the most recent game.

The best way to conceptualize the process thus far is this:

  • Remember that the target is the result of the next game. Example: Brigham-Young, Citadel, and Duke all won game 2. That result is stored on each team’s game 1 row, so the model is trained to predict the next game.
  • Imagine the target as the last column of the big data frame above. The algorithm is then trained, based on the features (game stats), to classify a win or loss.
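The shift described above can be sketched with pandas. The game log here is made up for illustration, and the column names (`team`, `game`, `points`, `result`) are assumptions rather than the columns returned by the API.

```python
import pandas as pd

# Hypothetical game log: one row per team per game, in chronological order.
games = pd.DataFrame({
    "team": ["DUKE", "DUKE", "DUKE", "CITADEL", "CITADEL", "CITADEL"],
    "game": [1, 2, 3, 1, 2, 3],
    "points": [81, 75, 68, 70, 88, 64],
    "result": ["Loss", "Win", "Win", "Win", "Win", "Loss"],
})

# Shift each team's results back one game, so row k carries the result of game k + 1.
# Grouping by team keeps one team's results from leaking into another team's rows.
games["target"] = games.groupby("team")["result"].shift(-1)

# A team's last game has no "next game", so its target is missing; drop those rows.
games = games.dropna(subset=["target"])

X = games[["points"]]  # features: stats from the game just played
y = games["target"]    # target: result of the following game
```

Each remaining row now pairs the stats of one game with the outcome of the next one, which is what makes the trained model predictive rather than descriptive.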

One of the benefits of tree-based classification models is that we can actually visualize the model’s training process to understand how it learns and makes decisions. I will show an example of this with an infographic in the following sections.

Build and train machine learning model

Step 1: split the data for training / testing

This splits our feature matrix (X) and target vector (y) into training and testing data sets. The model is trained on the training set, and its initial accuracy is measured on the testing set, which it has not seen during training.
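A minimal sketch of this step with scikit-learn's `train_test_split`, using random placeholder data. The 80/20 split and the `random_state` value are assumptions, not settings taken from the article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 games, 3 numeric features, binary win/loss target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```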

Step 2: Define baseline accuracy

The baseline accuracy can be defined as: the accuracy score if the model predicted the majority class (win or loss), every time.

The reason for evaluating this metric is to see how balanced our target classes are. In this case, they are fairly evenly balanced: one class occurs 56% of the time, and the other 44%. The baseline accuracy is therefore 56%, and a useful model needs to beat that score.
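The baseline can be computed directly from the target vector. The sketch below uses a made-up target with the 56% / 44% balance quoted above; which class is the majority is an assumption made for illustration.

```python
from collections import Counter

# Hypothetical target vector with a 56% / 44% class balance.
y = ["Loss"] * 56 + ["Win"] * 44

# The baseline model always predicts the most frequent class.
counts = Counter(y)
majority_class, majority_count = counts.most_common(1)[0]

# Its accuracy is simply the share of rows belonging to that class.
baseline_accuracy = majority_count / len(y)
print(majority_class, baseline_accuracy)
```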

Step 3: Build model and train it

BREAKDOWN:

  • Define the model using the sklearn.pipeline module. This just makes the code more robust; the model does not have to be defined in a pipeline.
  • Use the category_encoders module to ordinally encode the categorical features.
  • Instantiate machine learning model: RandomForestClassifier.
  • Fit the model on the training data.
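The steps above can be sketched as a small pipeline. To keep the example dependent on scikit-learn only, it swaps in `sklearn.preprocessing.OrdinalEncoder` for the category_encoders encoder named above; the data, column names, and hyper-parameter values are all assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training data: one categorical and one numeric feature.
X_train = pd.DataFrame({
    "location": ["Home", "Away", "Home", "Neutral", "Away", "Home"],
    "points": [81, 64, 77, 70, 59, 85],
})
y_train = ["Win", "Loss", "Win", "Loss", "Loss", "Win"]

# Ordinally encode the categorical column; pass numeric columns through unchanged.
# Categories unseen during training are mapped to -1 instead of raising an error.
encode_categoricals = ColumnTransformer(
    [("ordinal",
      OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
      ["location"])],
    remainder="passthrough",
)

# Chain the encoder and the classifier so both are fitted with a single call.
model = make_pipeline(
    encode_categoricals,
    RandomForestClassifier(n_estimators=100, random_state=42),
)
model.fit(X_train, y_train)

predictions = model.predict(X_train)
```

Wrapping both steps in one pipeline means the same encoding is applied automatically whenever `model.predict` is called on new data.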

The next section, visualizing the model, is not one of the steps outlined above because it is not necessary for building the model, but it is important for understanding how machine learning works.

Visualize the model

This code is not part of the model building process, and it is not the focus of this section (see below), so I am not going to do a breakdown. If you’re interested, please read the comments in the cell.

This is the infographic I alluded to above

Earlier I mentioned that machine learning models use the features (game stats) to predict the target (win or loss).

The infographic above is a visual of this particular machine learning model, but the idea is the same for all tree-based models. What you are looking at is the training: this is our model’s decision-making process. The model seeks to accurately classify a win or loss. In technical terms, it tries to minimize impurity (discussed in depth below) through its splits.
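For readers who want to inspect a tree without a plotting library, a text rendering is one option. The sketch below trains a single shallow tree on synthetic data and prints its splits with `sklearn.tree.export_text`; the data, the feature names, and the rule used to generate the target are assumptions, and this is not the code behind the infographic.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data: two numeric features for 200 games.
rng = np.random.default_rng(0)
season_losses = rng.integers(0, 15, size=200)
points = rng.integers(50, 100, size=200)
X = np.column_stack([season_losses, points])

# Synthetic target: teams with fewer than 6 season losses are labeled as winners.
y = np.where(season_losses < 6, "Win", "Loss")

# A single tree limited to 3 levels, split by entropy.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

# Each line of the output is one split or one leaf of the fitted tree.
tree_text = export_text(tree, feature_names=["season_losses", "points"])
print(tree_text)
```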

What is a split?

Great question! Each split, or “node”, is a test on one of the features from our data set. As you can see above, the root node is season_losses < 5.5, and the samples at that node are 293 wins and 379 losses. This yields an entropy, or “impurity”, of 0.988 (we expect high entropy at the root node!). The model above is very simple and was only created for the purpose of this visual: there are only 3 layers in 1 tree. Therefore, as expected, this model never successfully minimizes entropy (the model is not learning).

What is entropy?

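Entropy is a measure of impurity: how mixed the classes are among the samples at a node. With two classes, it is 0 when every sample at the node belongs to the same class and 1 when the classes are split evenly.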

How does the model minimize entropy?

When you call the ‘fit’ function, the model evaluates candidate splits on the features (columns) of the training set. The nodes you see in the infographic above are the splits that result in the lowest entropy.
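The search for a low-entropy split can be illustrated on a single feature. This is a simplified sketch, not scikit-learn's implementation: it tries every midpoint between consecutive values and keeps the threshold with the lowest weighted entropy. The helper names and the toy data are hypothetical.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * log2(p) for p in probs)


def best_threshold(values, labels):
    """Return (threshold, weighted_entropy) of the best split on one feature."""
    best = None
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x < threshold]
        right = [y for x, y in zip(values, labels) if x >= threshold]
        # Weight each side's entropy by the share of samples it receives.
        weighted = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(labels)
        if best is None or weighted < best[1]:
            best = (threshold, weighted)
    return best


# Toy data: teams with few season losses went on to win.
season_losses = [2, 3, 5, 6, 8, 9]
results = ["Win", "Win", "Win", "Loss", "Loss", "Loss"]
threshold, weighted_entropy = best_threshold(season_losses, results)
```

On this toy data the chosen threshold separates the two classes perfectly, so the weighted entropy of the split is zero.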

Tree-based models learn by minimizing entropy. The formula can be found below:

Shannon Entropy Formula
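In its standard form, the Shannon entropy of a node can be written as below. The symbol names are assumptions chosen to match the breakdown that follows: p_i is the fraction of samples at the node that belong to class i, and n is the number of classes.

```latex
H = -\sum_{i=1}^{n} p_i \log_2 p_i
```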

BREAKDOWN:

  • i indexes the classes and n is the number of classes: repeat the computation above for each class from 1 to n and sum the results.
  • This model only has two classes (win or loss), which can also be seen in the implementation below (each log2 function represents one class).

Entropy at the root node
Root Node
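The root node figure quoted earlier can be checked by hand. This sketch plugs the 293 wins and 379 losses into the two-class entropy computation; the variable names are illustrative.

```python
from math import log2

# Class counts at the root node, as quoted in the text.
wins, losses = 293, 379
total = wins + losses

# Fraction of samples in each class.
p_win = wins / total
p_loss = losses / total

# One log2 term per class, summed and negated.
root_entropy = -(p_win * log2(p_win) + p_loss * log2(p_loss))
print(round(root_entropy, 3))  # 0.988
```

The result is close to 1, the maximum for two classes, which is why entropy at the root is described as high.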

The model continues to split, running thousands of these computations to minimize entropy, and this is how it “learns”. When new data is presented, the model uses the splits it learned during training to classify a win or a loss!

Identify model metrics

Step 1: Calculate accuracy score

We use accuracy_score from sklearn.metrics to score the model. The parameters are the actual results and the model’s predictions.
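A minimal example of the call, with made-up labels standing in for the test set results and the model output:

```python
from sklearn.metrics import accuracy_score

# Hypothetical actual results and model predictions for five games.
y_test = ["Win", "Loss", "Win", "Win", "Loss"]
y_pred = ["Win", "Loss", "Loss", "Win", "Loss"]

# accuracy_score takes the true labels first, then the predictions.
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)  # 0.8
```

Four of the five predictions match, so the score is 0.8. Comparing this number against the baseline accuracy shows whether the model adds anything over always guessing the majority class.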

Step 2: Identify feature importance

NOTE — the ‘features’ variable was defined in the first section: Preparing the data.

Very cool! We can visualize how important each feature (game stat) was in contributing to the prediction.
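Feature importances can be read from a fitted random forest through its `feature_importances_` attribute. The sketch below uses synthetic data in which only one feature determines the target; the feature names and data are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: three features, but the target depends only on the first one.
rng = np.random.default_rng(0)
features = ["season_losses", "points", "assists"]
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=features)
y = (X["season_losses"] < 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Pair each importance with its feature name and sort from most to least important.
importances = pd.Series(model.feature_importances_, index=features)
importances = importances.sort_values(ascending=False)
print(importances)
```

If the classifier sits inside a pipeline, the fitted estimator has to be pulled out of the pipeline first (for example through its `named_steps` attribute) before reading `feature_importances_`.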

This model is ready for use! You could still tweak the hyper-parameters, choose a different classifier, engineer new features, and so on, but at this point I hope the message of this article has been received. You should know how to build an ML model for NCAAB predictions, and if you are not a programmer, I hope you were able to learn a bit about AI!

Thanks for reading!

Senior Machine Learning Consultant @ Ashling Partners. Passionate about innovation, problem solving, and learning.