Combinatorial Optimization Using Genetic Algorithms on a Baseball Related Dataset

A powerful metaheuristic modeled on natural selection.

Robby Sneiderman
The Startup
11 min read · Oct 27, 2020


Figure 1: Image Citation: http://www.gilmannews.com/currentyear/2017/6/9/baseballstatistics

In this article, we will demonstrate how combinatorial optimization (in particular, Genetic Algorithms) can be used for model selection.

One popular and rising use of data science techniques is in the field of sports analytics. Baseball-related statistics in particular are so popular that they have their own name: sabermetrics. We will use the sport as a chance to learn about metaheuristics and combinatorial optimization. In particular, we will analyze a baseball salary dataset in R.

Before we dive into the dataset, we need to introduce the general problem we are facing.

Introduction to Combinatorial Optimization:

In combinatorial problems, we have a large set of predictors, and W denotes the space of all possible combinations of them.

Figure 2: W is the space of all possible combinations of predictors. Image from Author.

Usually we want to find the set of values M that maximizes some objective function f; minimization is the same problem applied to the negative of f.

Figure 3: We define M to be the set of values that maximize f. This could be one value, or it could be obtained by multiple w. Image from Author.

Suppose we are looking to maximize a function f(w), where w = (w1, …, wp) is a p-dimensional feature vector. Each w is a candidate solution, and as p grows, the number of possible candidates becomes astronomical. Even with sophisticated computers and smart code, it may not be feasible to try all possible combinations.

In this tutorial we will analyze a baseball related dataset and showcase how a Genetic Algorithm can assist in model selection.

The baseball dataset consists of 337 observations of baseball players (no pitchers), with the outcome variable being salary (in $1000s). For each observation there are 27 unique predictors (batting average, home runs, triples, etc.) which can help to predict salary.

Figure 4: Please check out the textbook that covers these algorithms in more detail. https://www.stat.colostate.edu/computationalstatistics/

Much of the code to analyze this data is adapted from Computational Statistics, 2nd edition, by Givens and Hoeting. Please check out this book and its website, as it is an excellent resource for computational statistics and optimization.

Now let's get into it: we will build a genetic algorithm in R from scratch.

Baseball Code:

Let us first load the data and examine the basic structure.

Figure 5: The first 15 rows and 10 columns of the original baseball dataset. Image from Author.

Let us first do some preprocessing so our data is ready for model fitting. Because the salaries appear skewed, we will work with a log-transformed salary. We also form a dataframe of just the predictors, as we don't want the outcome to be included as a possible predictor.
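As a sketch of this step (in the real analysis you would load baseball.txt from the linked repository; here we build a small synthetic stand-in with assumed column names so the snippet runs on its own):

```r
# A minimal preprocessing sketch. In the real analysis you would load the file:
#   baseball <- read.table("baseball.txt", header = TRUE)
# The column names below are illustrative stand-ins, not the file's exact names.
set.seed(1)
baseball <- data.frame(
  salary   = exp(rnorm(10, mean = 6, sd = 1)),  # right-skewed, like real salaries
  obp      = runif(10, 0.25, 0.40),
  homeruns = rpois(10, 15)
)

log_salary <- log(baseball$salary)                      # log transform the skewed outcome
predictors <- baseball[, names(baseball) != "salary"]   # keep only the predictors
```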

It is very unlikely that any one set of predictors provides a perfect prediction of salary. In the real world, it is more likely that several candidate models could be proposed. How do we evaluate how ‘good’ a linear model is? There are several possible criteria, but we will use AIC to tell us how well our linear model fits the data. Recall that AIC trades off goodness of fit against model complexity, and that lower AIC is better.

This is where metaheuristics really shine. They can provide a list of candidate solutions instead of outputting a single model. Many metaheuristics are based on real-world processes (such as Darwinian evolution, beehive dynamics, etc.).

We will work through the code line by line, and explain what is going on in each part.

Genetic Algorithms:

Genetic algorithms are a powerful family of search algorithms based on the process of evolution by natural selection, and they borrow much of its vocabulary to build a useful selection algorithm. In a genetic algorithm, each candidate solution is represented by a chromosome, and an allele is the value at a specific position on that chromosome.

Each possible candidate solution is represented by a chromosome, and a chromosome in turn consists of alleles. We consider the most basic encoding, in which the alleles are binary symbols, 0 or 1: a 1 in position i means the ith predictor is included in the linear model, and a 0 means it is not. Each predictor thus corresponds to one allele position.

Figure 6: The general process taken by a genetic algorithm. Image Citation: https://blog.floydhub.com/introduction-to-genetic-algorithms/

Figure 6 illustrates the general approach taken by genetic algorithms, and we will follow this pathway. We initialize our parents randomly; our fitness function is negative AIC; we select using a combination of random and rank-based selection (see below); and we reproduce using crossover and mutation. Our termination condition is simply 100 generations (99 generations of reproduction).

In natural selection, each individual has two parents, and the child is some combination of the two, with some differences introduced by mutation. We will see how this is coded in our algorithm as well.

We will consider P=20 candidate solutions per generation, and will run our algorithm for 100 generations.

Fitness Function:

In order to compare models, we need some measure of how ‘good’ a model is. We define fitness in terms of AIC: a model with a smaller AIC is fitter than a model with a larger AIC.
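A sketch of this fitness evaluation in R (fitness_aic is our own helper name, not from the book's code; the demo data is synthetic):

```r
# Fitness of a chromosome: fit the linear model it encodes and return its AIC.
# 'chromosome' is a 0/1 vector of length M; a smaller AIC means a fitter model.
fitness_aic <- function(chromosome, predictors, log_salary) {
  if (sum(chromosome) == 0) {
    fit <- lm(log_salary ~ 1)   # empty chromosome: intercept-only model
  } else {
    fit <- lm(log_salary ~ ., data = predictors[, chromosome == 1, drop = FALSE])
  }
  AIC(fit)
}

# Tiny demonstration on synthetic data:
set.seed(1)
X <- data.frame(a = rnorm(30), b = rnorm(30))
y <- X$a + rnorm(30)             # 'a' truly drives the outcome
fitness_aic(c(1, 0), X, y)       # model using 'a' only
fitness_aic(c(0, 0), X, y)       # intercept-only model
```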

For example, consider a model (a candidate solution, or ‘individual’) that includes all M = 27 predictors in the linear model.

Then this candidate would be represented by the chromosome:

Chromosome_A=111111111111111111111111111

On the other end of the spectrum is the empty model, where we don’t include any predictors (i.e., just an intercept).

Chromosome_B=000000000000000000000000000

Our goal is to find a combination of predictors whose AIC is the minimum, or as near the minimum as possible.

Note that simply trying every combination is extremely time consuming: there are 2²⁷ possible combinations, which is well over 100 million possible ‘chromosomes’.
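The size of the search space is easy to verify:

```r
# The number of 0/1 chromosomes of length 27:
2^27   # 134,217,728 possible predictor subsets
```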

In each generation we will have P = 20 candidate solutions, each with an AIC value. To avoid premature convergence to a local optimum, we also introduce the phi function. It converts each generation's fitness values into ranks, and we use these ranks to help select which parents breed with each other.

Again, phi is used to rank our individuals: instead of using the raw AIC (a numeric value) as the fitness, we consider each individual's rank relative to the others in its generation. This reduces the chance that we get stuck in a local optimum. For example, if we had four models with AICs of [-1000, -231, -400, -200], we would also store the phi values, calculated as:

Figure 7: The definition of the phi function used for ranking individuals among each generation. Image from Author.
Figure 8: We use the ranks to calculate the phi values, the phi values will be used to give a probability of selecting a parent in each generation. For example, we could define our new parents to be one chosen at random, and the other chosen with probability proportional to phi. In this example, P=4. Image from Author.

If we were to use the raw numeric values, model 1 with an AIC of -1000 would dominate future generations. Notice that with the above formula, the best model is selected with probability 2/(P+1), while the worst candidate is selected with probability 2/(P(P+1)). This prevents one chromosome from dominating the population and pulling us away from other regions of the search space.
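Using the worked numbers above, the ranking step can be sketched in R (phi_from_aic is our own helper name):

```r
# Rank-based selection weights: the best (lowest-AIC) model gets rank P, and
# phi_i = 2 * r_i / (P * (P + 1)), so the weights sum to 1 over the generation.
phi_from_aic <- function(aic) {
  P <- length(aic)
  r <- rank(-aic)              # lowest AIC receives the highest rank
  2 * r / (P * (P + 1))
}

phi_from_aic(c(-1000, -231, -400, -200))   # 0.4 0.2 0.3 0.1
```

Note that the best model's weight is 2/(P+1) = 0.4 and the worst's is 2/(P(P+1)) = 0.1, matching the probabilities stated above.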

Each generation will have P = 20 candidate solutions, each with its own AIC value. How do we create the next generation? This can be done in many ways, but we will do so using crossover.

Choosing Parents:

We randomly select the first parent from the list of 20 individuals in the current generation. We select its mate with probability proportional to the phi value, so that higher-fitness individuals are chosen with higher probability. We then form a child via random crossover and mutation. Notice that this can result in an individual mating with itself, but crossover and mutation will still introduce changes; alternatively, we could sample the two parents without replacement to avoid this.
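A sketch of this pairing step (select_parents is our own helper name; it returns the two parents' row indices):

```r
# Select a breeding pair: the first parent uniformly at random, the second
# with probability proportional to phi, so fitter individuals mate more often.
select_parents <- function(phi) {
  P  <- length(phi)
  p1 <- sample(P, 1)               # uniform choice
  p2 <- sample(P, 1, prob = phi)   # rank-weighted choice
  c(p1, p2)
}

set.seed(2)
select_parents(c(0.4, 0.2, 0.3, 0.1))   # two indices between 1 and 4
```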

Crossover:

Once we have selected two parents, how do we form a new organism? The answer is through genetic crossover. To decide where we perform the crossover, we can randomly generate a cut point between 1 and M − 1, where M is the number of predictors (cutting at M would simply copy the first parent).

Figure 9: We use crossover with a random cutoff point. We paste two parents together to form the next individual. Image Citation: https://becominghuman.ai/understanding-genetic-algorithms-a-use-case-in-organizational-field-2087c30fb61e

Example: we randomly generate a crossover point between 1 and 26 (say it happens to be 16), and combine the two parents using a cut-and-paste mechanism.

Chromosome_A=1111111111111111|11111111111

Chromosome_B=0000000000000000|00000000000

Chromosome_C=111111111111111100000000000
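The cut-and-paste above can be sketched as follows (crossover is our own helper name):

```r
# Single-point crossover: the child takes alleles 1..cut from parent 1 and the
# rest from parent 2. Cutting between 1 and M - 1 guarantees genetic material
# from both parents.
crossover <- function(parent1, parent2) {
  M   <- length(parent1)
  cut <- sample(1:(M - 1), 1)
  c(parent1[1:cut], parent2[(cut + 1):M])
}

child <- crossover(rep(1, 27), rep(0, 27))   # a run of 1s followed by a run of 0s
```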

Mutations:

As in real natural selection, small mutations introduce changes that allow us to explore the solution space. We incorporate this into the algorithm as well.

Figure 10: Similar to in Natural Selection, Mutations allow for unseen combinations of candidate solutions and can lead to unexpected improvement. Image Citation: https://geneticliteracyproject.org/2018/04/02/understanding-genetic-mutations-why-do-some-cause-disease-while-others-dont/

We do this for each child by looping through each allele (each of the 27 alleles in the chromosome) and randomly mutating it (switching a 0 to a 1 or a 1 to a 0) with probability mu = 0.15.

So Chromosome_C from above could become Chromosome_C=011111111111111100000000001 after mutation (if only its first and last alleles were mutated).
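The mutation step can be sketched as follows (mutate is our own helper name; rbinom draws the per-allele coin flips):

```r
# Mutation: flip each allele independently with probability mu = 0.15.
mutate <- function(chromosome, mu = 0.15) {
  flips <- rbinom(length(chromosome), size = 1, prob = mu)
  abs(chromosome - flips)   # a flip turns 0 into 1 and 1 into 0
}

set.seed(3)
mutate(rep(0, 27))   # on average about 4 of the 27 alleles flip to 1
```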

Initialise Data:

Now it is time for the actual code of the algorithm. We start by defining several matrices that will store the information for each generation (the chromosomes, the AIC values, and the phi values). N is the number of observations, M is the total number of predictors, and P is the number of individuals per generation; we run the algorithm for 100 generations with a mutation rate of 0.15.

For each generation, we need matrices to store the AIC and phi values of each of the 20 chromosomes: these are the matrices chromosomeAIC and phi.

Each generation has 20 chromosomes, each of length M, so we store these in a P×M matrix called chromosome. We will also need to update each generation, so we need another P×M matrix (chromosome.next).

We initialize a blank best_chromosome, which will store the predictors corresponding to the lowest AIC, along with a starting bestAIC of 0. We also initialize a cache matrix that stores each generation's AIC values (cacheAIC), and finally a matrix that stores the best AIC of each generation (bestAIC_eachGen).
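A sketch of this bookkeeping (object names follow the article; the exact shapes are our assumption about the book's code):

```r
# Algorithm settings from the article.
M <- 27        # number of predictors (chromosome length)
P <- 20        # individuals per generation
n_gens <- 100  # number of generations
mu <- 0.15     # per-allele mutation rate

chromosome      <- matrix(0, nrow = P, ncol = M)  # current generation
chromosome.next <- matrix(0, nrow = P, ncol = M)  # offspring under construction
chromosomeAIC   <- matrix(0, nrow = P, ncol = 1)  # AIC of each individual
phi             <- matrix(0, nrow = P, ncol = 1)  # rank-based selection weights

best_chromosome <- rep(0, M)             # best predictor subset found so far
bestAIC         <- 0                     # AIC of that subset (starting value)
cacheAIC        <- matrix(0, n_gens, P)  # every AIC from every generation
bestAIC_eachGen <- matrix(0, n_gens, 1)  # best AIC per generation, for plotting
```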

Generate Parent Generation:

We will form the first 20 chromosomes by ‘flipping a coin’ to decide whether each predictor (allele) is included. We then calculate the AIC of each parent, save the values, and calculate the ranks to get the phi values.

For each of the 27 predictors, we use rbinom to decide whether the predictor is included for each individual. We then take the subset of the baseball predictors that are included, fit the linear model, and extract the AIC. We also save the best AIC from the generation and the predictors that formed it. Finally, we calculate the ranks, compute phi, and record the generation's best AIC in our matrix for plotting later.
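A self-contained sketch of this initialization step (we substitute synthetic data for the baseball predictors, and a smaller M, so the snippet runs on its own):

```r
set.seed(5)
n <- 50; M <- 8; P <- 20   # smaller than the real M = 27, for illustration
predictors <- as.data.frame(matrix(rnorm(n * M), n, M))
log_salary <- predictors[[1]] + rnorm(n)

# Coin-flip initialization: each allele of each parent is Bernoulli(0.5).
chromosome <- matrix(rbinom(P * M, size = 1, prob = 0.5), nrow = P, ncol = M)

# Score every parent by the AIC of the linear model its chromosome encodes.
chromosomeAIC <- apply(chromosome, 1, function(ch) {
  if (sum(ch) == 0) return(AIC(lm(log_salary ~ 1)))
  AIC(lm(log_salary ~ ., data = predictors[, ch == 1, drop = FALSE]))
})

r   <- rank(-chromosomeAIC)    # best (lowest) AIC gets the highest rank
phi <- 2 * r / (P * (P + 1))   # selection weights, summing to 1
```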

Let us take a look at what our parent generation looks like; this is our chromosome matrix:

Figure 11: Our chromosome matrix, for our initial 20 parents. 1 indicates that the predictor corresponding to that column is included in the linear model, and 0 indicates it is not. Image from Author, created using the View function in R.

We also want to save the AIC value for each parent, and obtain their ranks and associated phi values.

Figure 12: A column bind view of our parent generation chromosomeAIC, their associated ranks and their associated phi values. (cbind(chromosomeAIC,r,phi)). Image from Author.

Now that we have initialized our parent generation, we can move on to the selection algorithm. We select one parent at random, and we select the other parent with probability proportional to phi, meaning we have a higher probability of selecting a fit parent.

We then randomly choose a position between 1 and M − 1 and cut and paste our parents at that point to form the new child. We then allow for random mutation on each allele.

Making New Generations:

And that is it; the algorithm takes only a few seconds to run.
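Putting the pieces together, here is a condensed, self-contained sketch of the generation loop, run on synthetic stand-in data (the real analysis uses the 27 baseball predictors and 100 generations; the helper aic_of is our own name, not the book's):

```r
set.seed(6)
n <- 50; M <- 8; P <- 20; n_gens <- 30; mu <- 0.15
predictors <- as.data.frame(matrix(rnorm(n * M), n, M))
log_salary <- predictors[[1]] + rnorm(n)

aic_of <- function(ch) {
  if (sum(ch) == 0) return(AIC(lm(log_salary ~ 1)))
  AIC(lm(log_salary ~ ., data = predictors[, ch == 1, drop = FALSE]))
}

chromosome <- matrix(rbinom(P * M, 1, 0.5), P, M)   # coin-flip parents
bestAIC <- Inf; best_chromosome <- rep(0, M)
bestAIC_eachGen <- numeric(n_gens)

for (gen in 1:n_gens) {
  chromosomeAIC <- apply(chromosome, 1, aic_of)
  phi <- 2 * rank(-chromosomeAIC) / (P * (P + 1))   # rank-based weights
  if (min(chromosomeAIC) < bestAIC) {               # track the best so far
    bestAIC <- min(chromosomeAIC)
    best_chromosome <- chromosome[which.min(chromosomeAIC), ]
  }
  bestAIC_eachGen[gen] <- min(chromosomeAIC)
  chromosome.next <- matrix(0, P, M)
  for (i in 1:P) {
    p1 <- chromosome[sample(P, 1), ]                # one parent uniformly
    p2 <- chromosome[sample(P, 1, prob = phi), ]    # one parent by fitness rank
    cut <- sample(1:(M - 1), 1)                     # single crossover point
    child <- c(p1[1:cut], p2[(cut + 1):M])
    flips <- rbinom(M, 1, mu)                       # mutate each allele w.p. mu
    chromosome.next[i, ] <- abs(child - flips)
  }
  chromosome <- chromosome.next                     # offspring become parents
}
```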


Let us take a look at the best AIC per each generation.

Figure 13: The best negative AIC per generation. We already reach our best AIC at generation 50, but continue generating to obtain multiple ‘good’ candidate solutions. Image from Author.

We can also visualize all 20 AIC values per generation. Notice that even in later generations, there are still some higher-AIC individuals.

Figure 14: Over time, we obtain consistently more negative AIC per each individual chromosome. This image shows the AIC values of each of the 20 individuals for each generation. Image from Author.

Final Chromosome and predictors:

In the end we can check the predictors of our best model, best_chromosome.

Figure 15: Our final best chromosome and its associated best AIC value. Image from Author.

Finally, we should check which predictors these correspond to. The predictors included according to the best chromosome are: obp, runs, triples, rbis, sos, freeagent, arbitration, runsperso, hitsperso, soserrors, sbsobp, and sbsruns.

Possible Enhancements:

The above code is a very basic implementation of a genetic algorithm. Some possible enhancements:

  • Multiple starts. We could generate multiple starting generations and run the algorithm several times.
  • Multiple crossover points. In the code above, we generated only a single random crossover point when forming new individuals. We could instead use multiple crossover points.
  • Non-constant mutation rate (possibly tied to diversity). In our example, we kept a constant mutation rate of 0.15. It may be wise to let the mutation rate increase when diversity among the candidates decreases.
  • Generation gap. In our example, P stayed constant: every generation consisted of 20 new candidates, which is known as a generation gap of G = 1. If instead G = 1/P, we would form only 1 new individual per generation and carry the remaining 19 over.
  • Expanded alphabet. Our alphabet was restricted to 0 and 1. It could easily be expanded to allow many other encodings.

Conclusion:

Genetic algorithms are a powerful and interesting metaheuristic. They take inspiration from Darwinian evolution and apply it to building or fitting models. There is much ongoing research into enhancements, and there are countless possible use cases. Genetic algorithms also provide us with several candidate solutions, not just a single answer. We may not always find the exact optimum, but we can generate many answers that are close, and possibly the best overall as well.

Thank you for reading!

Link to R script and baseball.txt:

https://github.com/Robby955/geneticAlgorithm

Sources:

[1] Baseball data from M.R. Watnik (1998), “Pay for Play: Are Baseball Salaries Based on Performance?”, Journal of Statistics Education, Volume 6, Number 2. (http://www.amstat.org/publications/jse/v6n2/datasets.watnik.html)

[2] Computational Statistics, 2nd edition, Geof H. Givens and Jennifer A. Hoeting. https://www.stat.colostate.edu/computationalstatistics/

[3] Holland J.H. (1984) Genetic Algorithms and Adaptation. In: Selfridge O.G., Rissland E.L., Arbib M.A. (eds) Adaptive Control of Ill-Defined Systems. NATO Conference Series (II Systems Science), vol 16. Springer, Boston, MA. https://doi.org/10.1007/978-1-4684-8941-5_21

[4] M. Srinivas and L. M. Patnaik, “Genetic algorithms: a survey,” in Computer, vol. 27, no. 6, pp. 17–26, June 1994, doi: 10.1109/2.294849.

[5] Practical Genetic Algorithms, Randy L. Haupt and Sue Ellen Haupt.

[6] Data: http://jse.amstat.org/v6n2/datasets.watnik.html

[7] Christensen, R (1996), Analysis of Variance, Design and Regression: Applied Statistical Methods.
