Beat the Bookmakers With Tree-Based Machine Learning Algorithms

Nicholas Amirsoleimani
Published in Analytics Vidhya
20 min read · Apr 14, 2021

Introduction
Basics of Machine Learning Algorithms
Decision Trees
Tree-Based Algorithms
Data Acquisition and Preprocessing
Feature Engineering
Machine Learning Infrastructure
Initial Results and Optimization
Final Results and Testing Betting Strategies
Conclusion

All of the programming was done in MATLAB, but this article does not dwell on coding syntax or anything specific to a particular programming environment; it focuses on the methodology behind each step of creating the model. The data was gathered from here and here.

Introduction

Sports are one of the few forms of entertainment shared by nearly every culture throughout human history. The rivalry, teamwork, and seeming unpredictability of sports cultivate massive fan bases around the world and make them one of many people's favorite pastimes. However, with the advent of modern machine learning, we will try to do the impossible and unravel some of this uncertainty. Specifically, this article will attempt to predict whether the home team of a baseball game will win or lose using various tree-based algorithms.

Well, why baseball and why tree-based algorithms? The answer to the former is that baseball is a relatively popular sport with a long history of recorded data. It is also a sport that involves many players; if one player has an off game, the rest of the team can hopefully make up for him, unlike in an individual sport such as tennis (although it is certainly possible to use a model to predict tennis results and even get great results). This should make predicting the winners an easier task. As for the latter question, tree-based algorithms are among the most popular techniques for classification problems, so this article will hopefully teach the reader one possible way of applying a tree-based algorithm from start to finish.

Basics of Machine Learning Algorithms

There are two primary forms of machine learning: supervised learning and unsupervised learning (there is also reinforcement learning, but we will ignore it for now). In supervised learning, you give a model both inputs and outputs so that it can learn to estimate the output from the input. Supervised learning has two primary sub-categories: classification and regression. Classification problems have outputs that are discrete classes, such as win or lose, whereas regression problems have outputs that are continuous, such as price. Unsupervised learning, on the other hand, only uses inputs and tries to cluster them together using various metrics.

For example, let's say I want to predict someone's biological sex based on their weight and height. The inputs (also known as features, predictors, covariates, independent variables, or regressors) to the model would be the weight and height, whereas the output (also known as the target, dependent variable, or response) would be biological sex. This particular problem would be supervised since you are giving the model both an input and an output. More specifically, it would be a supervised classification problem since biological sex has only two classes. If we were predicting weight instead, it would be a supervised regression problem since weight is continuous.

Decision Trees

There are many different machine learning techniques for classification, but some of the most popular are tree-based algorithms. Tree-based algorithms use collections of decision trees to make predictions. A decision tree is itself an algorithm that can be used for both regression and classification. Essentially, a decision tree is a sequence of questions that partitions the data by asking whether certain features are above or below specific values, then predicts the target using the resulting partitions. A standard configuration of a decision tree is depicted in Figure 1.

Figure 1: Decision tree with height and weight features and a biological sex target

There are three components to a decision tree:

  • Root Node: The first decision split in the tree
  • Internal Nodes: All subsequent decision splits
  • Leaf Nodes: The final layer of the tree that classifies the groups into the target variable

In the tree above, the root node is the node containing the split Height > 67.5 in., the only internal node is the node containing the split Weight > 160 lbs, and the leaf nodes are the nodes containing the target variables (pink diamonds in this case).
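To make the picture above concrete, here is a minimal MATLAB sketch of fitting and drawing such a tree. The height/weight numbers are made up purely for illustration, and the exact splits the toolbox chooses will not necessarily match Figure 1.

% Toy data: height (in), weight (lbs) and a biological sex label (illustrative values only)
height = [70; 72; 66; 64; 69; 62; 75; 63; 68; 61];
weight = [180; 200; 150; 130; 170; 120; 210; 125; 155; 115];
sex    = categorical(["M";"M";"F";"F";"M";"F";"M";"F";"M";"F"]);

X = table(height, weight, sex);

% Fit a classification tree; the root and internal nodes hold the splits,
% the leaves hold the predicted class
tree = fitctree(X, 'sex');

view(tree, 'Mode', 'graph');   % draws the tree, similar in spirit to Figure 1
predict(tree, table(71, 190, 'VariableNames', {'height','weight'}))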

Split Criterion

But how do decision trees decide which splits to use? To explore this, let's use the example from above and imagine that we are trying to predict biological sex from someone's weight and height with a decision tree. As noted in Figure 1, most decision trees split a set of data into the two groups that best separate it according to some metric. For classification, the two primary metrics used are Gini Impurity and cross-entropy. Cross-entropy is similar to Gini Impurity but draws on the concept of entropy from information theory. This article won't go in depth about it, but essentially, as the cross-entropy decreases, the homogeneity within the groups increases, so we want to minimize it. Both splitting criteria are widely used since they are differentiable, but Gini Impurity is usually preferred because it is less computationally expensive, so it will be used in this project. Once a decision tree makes a split, it continues making splits until some stopping criterion is met, such as a maximum number of leaf nodes, a minimum improvement in the splitting criterion for each leaf, or a maximum number of splits.

Gini Impurity

Gini Impurity measures the likelihood of incorrectly classifying a random element from a partition if it were labeled randomly according to the distribution of classes in that partition. The specific formula is given below: G = Σ p_k(1 - p_k) = 1 - Σ p_k², where the sum runs over the K classes and p_k is the proportion of class k in the partition.

Gini Impurity Definition

Imagine we have a group of 100 individuals, 60 of whom are male and 40 of whom are female. The probability of picking a male is 0.6 and the probability of classifying him incorrectly is 1 - 0.6. Conversely, the probability of picking a female is 0.4 and the probability of incorrectly classifying her is 1 - 0.4. Therefore, the Gini Impurity for this group is 0.6*(1 - 0.6) + 0.4*(1 - 0.4) = 0.48. Using the simplified form, the calculation is 1 - 0.6² - 0.4² = 0.48. This alternative formulation can also be thought of as the complement of the probability of correctly classifying each sample. The higher the Gini Impurity, the higher the probability of incorrectly classifying a member of the group, which is why we want to minimize this value. A decision tree using Gini Impurity as its split criterion will find the feature split that minimizes the weighted sum of the two new groups' Gini Impurities, where the weights are the fraction of observations in each leaf.
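As a quick sanity check, the 60/40 example and the weighted split score can be computed directly in MATLAB. The two child-node impurities in the second part are illustrative numbers, not values from the real data.

p = [0.6 0.4];                 % class proportions in the partition (60 males, 40 females)
gini = sum(p .* (1 - p))       % 0.6*(1-0.6) + 0.4*(1-0.4) = 0.48
gini_alt = 1 - sum(p.^2)       % 1 - 0.6^2 - 0.4^2 = 0.48, the simplified form

% Weighted impurity of a candidate split: weights are the fraction of
% observations that fall into each child node (illustrative numbers)
nLeft = 70;  nRight = 30;
giniLeft = 0.30;  giniRight = 0.10;
weightedGini = (nLeft*giniLeft + nRight*giniRight) / (nLeft + nRight)   % = 0.24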

Tree-Based Algorithms

Figure 2: Collection of decision trees (Source)

A single decision tree can produce very accurate results on the data it is trained on. However, such a tree is usually overfit to its training data and therefore will not achieve as low an error on new samples that were not used to train it. This is why decision trees on their own are often referred to as weak learners. There are two primary ideas for preventing overfitting: limit the number of splits in a tree, or average the decisions of many trees. If we limit the number of splits to, say, 4, the tree will certainly be less overfit, but many complex relationships in the data will be ignored since at most 4 features will be used. Therefore, it is commonly preferred to create many overfit decision trees and find some way of combining all of them into one decision. This is where tree-based algorithms come into play. Tree-based algorithms are ensembles in which many weak learners (decision trees in this case) are combined to make a strong learner. Different algorithms have different ways of building these trees and combining them. The two primary families are those that rely on bootstrap aggregation and those that rely on boosting.

Bagging

Figure 3: Standard bagging configuration (Source)

Bootstrap aggregation, otherwise known as bagging, selects random samples with replacement from the original data set and fits a model to each. Imagine we have our original data set for predicting biological sex from our various features. If we fit a decision tree on all of our data, well, that's it. We don't have any more data, so that is the only decision tree that can be made. If, however, we used bagging, we would sample M data points from our original dataset with replacement and fit a decision tree on that new sample. M is generally set equal to N, the number of original data points. If M equals N, it can be shown that on average 1/e, or 36.8%, of the original data points are not used, because sampling with replacement means many points are chosen more than once. Now, repeat this process many times and choose some rule for predicting the output, such as the majority vote of all the decision trees. This process has been shown to decrease the variance of a model and is the basis of algorithms like random forest.
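A minimal sketch of the bagging idea, written out by hand for clarity and continuing with the toy height/weight table X (with response 'sex') from the earlier snippet. In practice, fitcensemble with 'Method','Bag' does all of this for you.

N = size(X, 1);            % number of observations in the toy table
B = 100;                   % number of bootstrapped trees
trees = cell(B, 1);

for b = 1:B
    idx = randsample(N, N, true);          % sample N rows with replacement (~36.8% left out)
    trees{b} = fitctree(X(idx, :), 'sex'); % fit one tree on the bootstrap sample
end

% Predict one new observation by majority vote across all trees
votes = strings(B, 1);
xnew = X(1, 1:end-1);                      % predictors only (sex is the last column)
for b = 1:B
    votes(b) = string(predict(trees{b}, xnew));
end
prediction = mode(categorical(votes))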

Random Forest

Figure 4: Random forest schematic (Source)

Random forest is a machine learning algorithm that can be used for both classification and regression. The algorithm constructs many different decision trees using bagging and either averages the predictions of the trees for regression or takes their majority vote for classification. It is widely used over individual decision trees since it tends to reduce overfitting. What distinguishes it from plain majority-vote bagging is that it only considers a random subset of the features at each tree split. Generally, sqrt(p) predictors are used for classification problems and p/3 predictors for regression problems, where p is the total number of features. If there were a few strong predictors, they would be used in every tree, making the trees strongly correlated; combining such trees is less useful since they all make similar predictions. By selecting a subset of predictors at each split, a tree cannot always rely on the strongest predictors and is forced to find patterns within other features. This creates less correlated trees and usually increases accuracy on test data.
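In MATLAB this can be sketched with TreeBagger, which implements bagging with a random subset of predictors considered at each split. The snippet reuses the toy height/weight variables from above; the number of trees is arbitrary, and the sqrt(p) rule is written out explicitly even though it is also the toolbox default for classification.

p = 2;                                          % number of predictors (height, weight)
rf = TreeBagger(200, [height weight], sex, ...
    'Method', 'classification', ...
    'NumPredictorsToSample', max(1, floor(sqrt(p))), ... % random feature subset per split
    'OOBPrediction', 'on');                     % keep out-of-bag info for error estimates

oobErr = oobError(rf);                          % OOB misclassification rate vs number of trees
plot(oobErr); xlabel('Number of trees'); ylabel('OOB error');

label = predict(rf, [71 190])                   % majority vote across the 200 trees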

Boosting

Figure 5: Standard boosting schematic (Source)

Boosting is another family of tree-based ensemble algorithms. There are many different boosting algorithms, each with its own unique properties, so I will not go in depth about every one, but essentially they train trees iteratively, with each new tree trying to improve on the weaknesses of the trees trained before it. In a random forest, each tree is trained independently of how the other trees performed. In boosting, however, each tree tries to fix the mistakes of the previous trees. This involves weighting certain observations differently, minimizing specific loss functions, and other nuances. To further prevent overfitting, these trees are usually either tree stumps (one-level decision trees) or shallow decision trees (perhaps around 10 splits).

A detailed treatment of each algorithm is beyond the scope of this article, but there are many resources online that cover them. Some of the most common boosting algorithms are AdaBoost, LPBoost, TotalBoost, XGBoost, LogitBoost, and RUSBoost.
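As an illustration, here is a minimal MATLAB sketch of boosted tree stumps using fitcensemble and AdaBoostM1 (the two-class AdaBoost implementation in the toolbox). The 100 learners, 0.1 learning rate, and the toy height/weight data are all just illustrative choices.

% Tree 'stumps': one split per tree, so each learner is deliberately weak
stump = templateTree('MaxNumSplits', 1);

boosted = fitcensemble([height weight], sex, ...
    'Method', 'AdaBoostM1', ...       % each new tree focuses on previously misclassified points
    'Learners', stump, ...
    'NumLearningCycles', 100, ...
    'LearnRate', 0.1);

resuberr = resubLoss(boosted)         % training error (optimistic; use CV for a real estimate)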

Data Acquisition and Preprocessing

Before starting, it's important to understand that a large part of machine learning work is simply gathering, cleaning, and processing data before use. The steps used for this process are:

  • Gather Data (web scraping or API)
  • Fill missing data (or use surrogate splits)
  • Remove Outliers
  • Normalize or standardize data

Gather Data

First, we need to actually acquire the data. APIs are usually faster and more convenient, but they may not exist for your specific case or may be too expensive. In this case, web scraping was used with BeautifulSoup4 in Python.

Fill Missing Data

Some common techniques for filling in missing feature values are using that feature's mean, linear interpolation, shape-preserving piecewise cubic spline interpolation, and modified Akima cubic Hermite interpolation. However, some machine learning algorithms have ways of bypassing missing values, such as surrogate splits. A surrogate split is simply a backup split that is used when the original value is missing. For example, suppose our decision tree decides that height is the best feature for splitting a certain group, but height is missing for some observations. A surrogate split can be made with the next best predictor, such as weight, and used whenever the value for height is missing. The advantage of surrogate splits is that all observations can be used without having to guess the values of missing data. For this reason, surrogate splits are used for this problem.
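The interpolation options above map directly onto MATLAB's fillmissing methods, and surrogate splits can be switched on in the tree template. A short sketch, where Xraw and y are placeholder names for a numeric feature matrix (with NaNs for missing values) and the targets:

% Option 1: impute missing feature values (one column per feature)
Xmean   = fillmissing(Xraw, 'constant', mean(Xraw, 'omitnan')); % fill with column means
Xlinear = fillmissing(Xraw, 'linear');                          % linear interpolation
Xpchip  = fillmissing(Xraw, 'pchip');                           % shape-preserving cubic
Xmakima = fillmissing(Xraw, 'makima');                          % modified Akima cubic Hermite

% Option 2 (used here): leave the NaNs in place and let the trees use surrogate splits
surrTree = templateTree('Surrogate', 'on');
mdl = fitcensemble(Xraw, y, 'Method', 'Bag', 'Learners', surrTree);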

Removing Outliers

Outliers are observations whose feature values deviate substantially from the rest of the observations. Although outliers are not common, they can severely impact a model's performance. Luckily, outliers do not affect tree-based models much, since the split criterion is not substantially affected if an outlier is included in the training set. However, we will still remove extreme outliers as good practice and to potentially get rid of erroneous data. Since multivariate outlier removal can be a fairly complex subject, we will use a simple rule that removes any observation containing a feature value more than 3 standard deviations from that feature's mean.
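MATLAB's isoutlier with the 'mean' method flags exactly this rule (values more than three standard deviations from the mean). A short sketch, again using Xraw and y as placeholder names:

% Flag any feature value more than 3 standard deviations from its column mean
outlierMask = isoutlier(Xraw, 'mean');          % logical matrix, same size as Xraw

% Drop every observation (row) that contains at least one extreme value
keep   = ~any(outlierMask, 2);
Xclean = Xraw(keep, :);
yclean = y(keep);

fprintf('Removed %d of %d observations as outliers\n', sum(~keep), numel(keep));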

Normalize or Standardize

It is also important to mention that many machine learning algorithms rely on normalization or standardization. Standardization is usually thought of as transforming the data so that it has a mean of 0 and a standard deviation of 1, whereas normalization generally means rescaling the data so that all of the values lie between 0 and 1. Luckily, these transformations are not needed for tree-based models since each split depends on only one feature rather than on interactions between features.

Feature Engineering

After the data is cleaned and processed, we will use feature engineering to create various features to be used in the model. The initial data types that will be used for the model are:

  • Data from both teams of the previous year: This includes results (such as the win rate, batting average, and points scored) for each game from the previous year that are then averaged.
  • Data from both teams from that same year: This includes the same results as before (win rate, batting average, and average points), except they are averaged over previous games from that same year.
  • Odds of the game: These features utilize the bookmakers' predictions of how the game was going to go. For every game, the bookmaker gives odds for each team winning. These odds can then be converted to implied probabilities showing the chance each team has to win (a small conversion sketch follows below this list). If you're curious to know a little more about odds and why they're unfair, you can refer to my previous article written here.
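As a quick illustration of the conversion, decimal odds can be turned into implied probabilities as follows. The example odds are made up, and because of the bookmaker's margin the raw probabilities for the two teams sum to more than 1.

oddsHome = 1.80;  oddsAway = 2.10;             % illustrative decimal odds

pHome = 1 / oddsHome;                          % implied probability of a home win (~0.556)
pAway = 1 / oddsAway;                          % implied probability of an away win (~0.476)

overround = pHome + pAway - 1;                 % the bookmaker's margin (~3.2% here)

% One common way to strip out the margin before using the probabilities as features
pHomeFair = pHome / (pHome + pAway);
pAwayFair = pAway / (pHome + pAway);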

Binning

Although binning was not used in this project, it is a common method of feature engineering. Binning takes continuous variables and discretizes them into groups (turns them into categorical features), usually either with an equal number of observations in each group or with groups of equal width. Binning can reduce training time and improve performance in some models.
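Although not used here, both flavors of binning are one-liners in MATLAB with discretize. A sketch, where runsScored is a placeholder feature vector:

% Equal-width bins: 4 bins spanning the range of the feature
binsWidth = discretize(runsScored, 4);

% Equal-count bins: edges at the quartiles, so each bin holds roughly 25% of the observations
edges = quantile(runsScored, 0:0.25:1);
binsCount = discretize(runsScored, edges);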

Manipulation of features

Another form of feature engineering is manually manipulating features to make new features that seem like they could have predictive power. For example, the data from the previous year gives the total number of hits H from that year as well as the total number of runs scored R. A new feature could be H divided by R, essentially a hitting efficiency showing how many hits it takes, on average, to score a run. This form of feature engineering relies on knowledge of the subject at hand, so a baseball expert could probably suggest other useful features.

Moving Averages

Another way to aggregate the data into features is to take a moving average over the last few games played. For example, one of the pieces of data given is the number of runs scored per game by each team. To turn this into a feature for a specific team's game tomorrow, I could average the number of runs the team scored in its last 5 games. This is a simple average, but other kinds could be used as well (exponential, weighted, smoothed, etc.). The number of previous games to average is up to the user, and different window lengths could be analyzed for optimization. This project uses a simple average of the previous 5 games for all of its same-year features.

For all of these transformations, the features actually used are the differences between the home team's values and the away team's values. For instance, if the average number of runs scored last year was 7 for the home team and 5.6 for the away team, the feature used would be 1.4.
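A sketch of both transformations in MATLAB, with homeRuns and awayRuns as placeholder column vectors of runs scored per game in chronological order. In a real pipeline each team's history is tracked separately and then aligned per matchup; this only shows the window and the home-minus-away difference.

% Trailing average of the previous 5 games (excluding the current game)
last5     = movmean(homeRuns, [4 0]);          % mean of games t-4 ... t
homeLast5 = [NaN; last5(1:end-1)];             % shift by one game: mean of games t-5 ... t-1

last5Away = movmean(awayRuns, [4 0]);
awayLast5 = [NaN; last5Away(1:end-1)];

% The feature fed to the model is the home/away difference
runDiffFeature = homeLast5 - awayLast5;        % e.g. 7 - 5.6 = 1.4 in the example above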

Machine Learning Infrastructure

Initial Setup

Now it comes time to actually create the model and apply machine learning. As mentioned, the previous steps usually take the majority of the time and are the drier, more monotonous portions, but now comes the fun part. The steps used to create the model were:

  1. Partition data into train/test splits
  2. Determine which types of models will be tested
  3. Initialize models with parameters

1. Train/Test Split

First, the data must be partitioned into a training set and a testing set. Sometimes people also use a validation set, but we will use k-fold cross-validation as a substitute. With k-fold cross-validation, the training data is split into separate folds; each fold in turn is used as a test set while the remaining folds are used as the training set. This process is repeated until every fold has served as the test set, and the errors are then averaged across the folds. It is important to note that throughout the entire model optimization process, the testing set should not be used until the very last model is tested, or else we risk overfitting to the test set. The training set will be 80% of the data and the test set the remaining 20%.

Usually, the data points that go into the training and testing sets are chosen at random, but we also need to make sure the data is stratified, meaning the training and test targets have the same class distribution. In our data, the target is 1 if the home team wins and 0 if the home team loses. The mean value of our targets is 0.53, which means the home team wins approximately 53% of the time. If we stratify our data, the targets for the training and test sets should each be made up of about 53% ones and 47% zeros. If we didn't stratify, the two sets could end up with different class distributions, which could hinder the model's performance on new observations.
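In MATLAB, cvpartition handles both the stratified hold-out split and the k-fold partition; stratification on the class labels is the default behavior when a grouping variable is supplied. A sketch, with X and y now standing for the full game feature matrix and the 0/1 home-win targets (placeholder names):

% 80/20 stratified split: the class balance (~53% home wins) is preserved in both sets
holdout = cvpartition(y, 'HoldOut', 0.2);
Xtrain = X(training(holdout), :);   ytrain = y(training(holdout));
Xtest  = X(test(holdout), :);       ytest  = y(test(holdout));

% 5-fold cross-validation on the training set only; the test set is untouched until the end
cvp   = cvpartition(ytrain, 'KFold', 5);
cvMdl = fitcensemble(Xtrain, ytrain, 'Method', 'Bag', 'CVPartition', cvp);

cvError = kfoldLoss(cvMdl)          % misclassification rate averaged over the 5 folds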

2. Determine which types of models will be used

There are many different machine learning tree-based models we could use, so deciding on one is a difficult task. This is why we will simply try a couple of them and find the one with the highest accuracy. We will use a random forest model as well as various boosting models. We will also implement various tactics along the way to improve our models and analyze their performance. Before we do any optimizations, the models will all be tested with basic parameters to get a baseline of their performance.

3. Initialize models with parameters

The parameters used for each model are shown in Table 1. These are just the initial parameters used and will be optimized for each model over future steps.

Table 1: Initial properties of various models trained

Initial Results and Optimization

Confusion Matrix

The first step in understanding our models is the confusion matrix. A confusion matrix shows exactly how a model made its predictions and where its errors came from. There are 4 different types of predictions a model can make with respect to the targets:

  • True Positives: The model predicts a win and the home team actually wins
  • True Negatives: The model predicts a loss and the home team actually loses
  • False Positives (Type I Error): The model predicts a win for the home team but they actually lose
  • False Negatives (Type II Error): The model predicts a loss for the home team but they actually win

Each model has its own confusion matrix containing the distribution of the 4 types of predictions listed above. For each matrix, the cross-validated results were used. The confusion matrix for one of the models, the Random Forest model, is shown in Figure 6. The results imply that the model was able to find some patterns in the data, since its accuracy is better than just guessing. If we only bet when the model says a team is going to win, we would be correct 57.9% of the time.
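With the cross-validated predictions in hand, the confusion matrix and the positive predictive value discussed below can be computed directly. This sketch continues from the cross-validated ensemble cvMdl and training labels ytrain defined in the earlier snippet:

yhat = kfoldPredict(cvMdl);                   % cross-validated predictions (0 = loss, 1 = win)

cm = confusionmat(ytrain, yhat);              % rows = true class, columns = predicted class
confusionchart(ytrain, yhat);                 % same information as Figure 6

TP = cm(2, 2);  FP = cm(1, 2);                % with classes ordered [0 1]
TN = cm(1, 1);  FN = cm(2, 1);

ppv = TP / (TP + FP)                          % how often a predicted home win is a real win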

Figure 6: Confusion matrix for initial Random Forest model

Implementing a cost matrix

For our particular project, one strategy could be to bet on the home team only when our model predicts they will win. Under this strategy, we want to minimize the number of false positives, since these are the bets that end up losing money. In other words, we want to maximize the probability that a predicted home win is correct. This quantity is called the positive predictive value (PPV): the probability that a positive prediction corresponds to an actual positive, i.e. PPV = TP / (TP + FP). To improve our PPV, we can implement a cost matrix.

Table 2: Default cost matrix
Table 3: Adjusted cost matrix

A standard cost matrix is shown in Table 2; it shows the cost assigned to each type of prediction. If the model guesses correctly, the cost should obviously be zero. However, if the model guesses incorrectly, the costs can differ depending on how the matrix is configured. Since we want to minimize false positives as much as possible, we set a higher cost for the case where the model predicts a win but the true result is a loss, as shown in Table 3. Knowing exactly what to set this cost to requires trial and error and even differs from model to model. Usually, I use a cost of around 1.1–1.5 and tweak it from there until I get a high enough PPV without sacrificing too many positive predictions.
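In MATLAB the cost matrix is passed straight into training via the 'Cost' name-value pair; rows correspond to the true class and columns to the predicted class, so the false-positive penalty goes in the (true loss, predicted win) cell. The 1.3 below is just one value from the 1.1 to 1.5 range mentioned above, and the variables are the placeholders from the earlier snippets:

% Penalize false positives (true loss, predicted win) more heavily than false negatives
costMat = [0   1.3;    % true class 0 (home loss): cost 0 if predicted 0, 1.3 if predicted 1
           1   0  ];   % true class 1 (home win):  cost 1 if predicted 0, 0 if predicted 1

cvMdlCost = fitcensemble(Xtrain, ytrain, 'Method', 'Bag', ...
    'Cost', costMat, 'CVPartition', cvp);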

Misclassification Rate

The next step in optimizing our models is to plot the misclassification rate against the number of learners in each ensemble. We can look at the loss as a function of the number of learners and find the optimal number for each model. All of the comparisons used only cross-validated observations, with the exception of the random forest model, which was also able to use the out-of-bag observations. Examples of the plots are shown in Figures 7 and 8.
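Curves of this kind come from the cumulative-loss option of the ensemble objects. A sketch, where cvMdl is the cross-validated ensemble from before and bagEns is a bagged ensemble fit here just so the out-of-bag curve is available:

% Cross-validated misclassification rate as a function of the number of learners
cvCurve = kfoldLoss(cvMdl, 'Mode', 'cumulative');
plot(cvCurve); xlabel('Number of learners'); ylabel('Misclassification rate');

% For the bagged (random forest style) ensemble, the out-of-bag curve is also available
bagEns   = fitcensemble(Xtrain, ytrain, 'Method', 'Bag');
oobCurve = oobLoss(bagEns, 'Mode', 'cumulative');
hold on; plot(oobCurve); legend('Cross-validated', 'Out-of-bag'); hold off;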

Figure 7: Plot of the misclassification rate versus the number of learners for the Random Forest model
Figure 8: Plot of the misclassification rate versus the number of learners for the AdaBoost model

Feature Importance

Finally, we can examine which features were useful for the models. Too many features can lead to overfitting and therefore poor model performance. Two methods will be used: out-of-bag predictor importance by permutation, and change in node impurity. The first method permutes a feature's observations and then sums the errors of each tree on the out-of-bag samples. If the feature is important, permuting its observations should have a large negative impact on the performance of the decision trees, since its values are effectively randomized. Because this method requires bagged samples, it can only be used on models that bag observations, which in this case is only the Random Forest model. The second method looks at each split that used a particular feature and measures how much, on average, the feature improved the split. Since these algorithms use Gini Impurity, they measure how well each feature reduced the Gini Impurity on the splits where it was used. This method does not require bagging and is therefore used for the rest of the models.

Examples of each type of feature importance are shown in Figures 9 and 10. First, we plot the importance for each model. Then, we find a threshold that a feature must reach in order to be used in the optimized version. For the Random Forest Model, a threshold of 0.2 was used and for the AdaBoost model, a threshold of 5E-5 was used.
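Both importance measures are available as one-liners in MATLAB, after which the thresholds from the text can be applied to select which features to keep. In this sketch, bagEns is the bagged ensemble from the previous snippet and boostMdl is a boosted ensemble fit here purely for illustration:

% Out-of-bag permutation importance (bagged / random forest ensembles only)
impRF = oobPermutedPredictorImportance(bagEns);
bar(impRF); xlabel('Feature'); ylabel('OOB permutation importance');

% Impurity-based importance (works for the boosted ensembles as well)
boostMdl = fitcensemble(Xtrain, ytrain, 'Method', 'AdaBoostM1');
impBoost = predictorImportance(boostMdl);

% Keep only the features above a chosen threshold and retrain on the reduced set
keepRF    = impRF    > 0.2;       % threshold used for the Random Forest model
keepBoost = impBoost > 5e-5;      % threshold used for the AdaBoost model
XtrainReduced = Xtrain(:, keepBoost);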

Figure 9: Feature importance for Random Forest model using OOB predictor permutation
Figure 10: Feature importance for AdaBoost model

Final Results and Testing Betting Strategies

Now that we have made adjustments to the models, we can retrain them all and keep repeating this process until we get a model that is to our liking. After optimizing the models and observing their results, the RUSBoost model had the highest performance. The confusion matrix for this model is shown in Figure 11. Although the results are not perfect, they are an improvement over the original models.

Figure 11: Confusion matrix for RUSBoost model

One Last Feature

There is one last feature that will be added to the model. Our model is wrong around 40% of the time, but what if we could build another model to predict when the current one is wrong? We could then use that prediction as another feature in our model, and it could potentially be very useful. This idea is implemented using the same splits and optimization strategies shown above, and it is a useful trick that can be applied to many different problems.
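One way to sketch this idea in MATLAB: label each training observation by whether the current model's cross-validated prediction was wrong, train a second model on those labels, and append its out-of-fold score as a new feature. This is only an outline of the idea described above, reusing the placeholder names from the earlier snippets; care is needed to generate the new feature out-of-fold so it does not leak the answer.

% 1) Where was the current model wrong on the (cross-validated) training data?
yhat    = kfoldPredict(cvMdl);
isWrong = double(yhat ~= ytrain);                 % 1 = the model missed this game

% 2) Train a second model to predict those mistakes from the same features
errMdl  = fitcensemble(Xtrain, isWrong, 'Method', 'Bag', 'CVPartition', cvp);

% 3) Use its out-of-fold score as an extra feature for the main model
[~, errScore] = kfoldPredict(errMdl);
XtrainAug = [Xtrain errScore(:, 2)];              % column 2 = score for the 'wrong' class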

Testing Betting Strategies

So now we have a model with seemingly good results, but how do we actually test whether we could build a betting strategy with it? One potential strategy is to bet on a team winning only when the bookmaker says that team will win and our model's score exceeds a confidence threshold. Note that the model's score refers to the confidence the model assigns to its prediction; the range of this value depends on the model. If the bookmaker says a team will win with 60% probability, we might only bet if our model's score is above 0.65, for example. We can find the score thresholds that optimize accuracy for each implied probability given by the bookmaker on our training set and then test those thresholds on our test set. The results of this method are shown in Figure 12. Overall, with the exception of implied probabilities between 72% and 75%, the score threshold improved on the accuracy of the bookmaker's model, with an average increase of 9.2%.
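A rough sketch of the threshold test on the held-out set. All names here are placeholders: finalMdl stands for the optimized RUSBoost ensemble, impliedProb for the bookmaker's implied home-win probabilities, and ytest for the true outcomes. The single 0.65 threshold is only an example; in practice a separate threshold is tuned per implied-probability bin on the training set.

[~, scores] = predict(finalMdl, Xtest);        % scores from the optimized ensemble
scoreTest = scores(:, 2);                      % score for the 'home win' class

threshold = 0.65;                              % example threshold tuned on the training set
betMask = (impliedProb > 0.60) & (scoreTest > threshold);   % only bet when both models agree

accuracyWithModel   = mean(ytest(betMask) == 1);            % accuracy of the filtered bets
accuracyBookieAlone = mean(ytest(impliedProb > 0.60) == 1); % bookmaker favorite alone

fprintf('Filtered bets: %.1f%% vs bookmaker alone: %.1f%%\n', ...
    100*accuracyWithModel, 100*accuracyBookieAlone);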

Figure 12: Change in accuracy of bookmaker’s model when adjusted with RUSBoost model

Conclusion

Although its results were not flawless, the RUSBoost model was able to improve upon the bookmaker's model. The model was not tested specifically for profitability, but a 9% increase in accuracy is almost certainly enough to build a strategy that makes consistent returns. If more time were spent creating interesting features and gathering more data, the results could be even better. The results could also be improved by using different boosting algorithms, such as XGBoost, or different machine learning algorithms altogether.

What makes machine learning beautiful is that the state-of-the-art algorithms that the most advanced companies in the world are using can be improved with a little bit of ingenuity and creativity in modeling, free of charge in our office chairs.

Thanks for reading this article! I hope you learned something and make sure to keep an eye out for future articles written — Thanks!
