Genetic Algorithms for Fun and Profit
Genetic algorithms… the name feels so familiar and is still intimidating. In truth, genetic algorithms (GA — singular, GAs — plural) are super powerful and not terribly complicated. That mysterious “genetic” qualifier is going to wind up being not so scary after all.
This is a technical topic but our goal will be to keep things accessible to folks who may not have a background in computer science. In this post we’re going to:
- Break down GAs — what they are, how they work, and when to use them.
- Showcase a non-trivial use of GAs.
Let’s get to it.
Breaking Down Genetic Algorithms
If ever you’ve ever given the phrase “survival of the fittest” any thought, you’re well on your way to understanding how GAs work. As mentioned above, GAs are all about finding solutions to optimization problems, problems where you have a whole bunch of possible answers and you’re looking for the best one — i.e. the fittest one. GAs take heavy inspiration from biology and the process of natural selection, as we’ll see from our terminology.
Suppose we have an optimization problem we would like to solve, at any point in time we’re going to be considering a bunch of potential solutions to that problem. We’ll refer to the current batch of solutions as the population and each individual solution as an… individual. Peeking ahead, our goal will be to cull that population down to the most fit individuals, breed them to refill the population, then rinse and repeat. With each new generation, we hope to raise the overall fitness of the population until, ultimately, a satisfactory solution emerges.
Let’s take a step back and talk about what an individual solution might look like. There are no hard and fast rules but a simple, and often convenient, strategy is to use sequences of 0s and 1s — i.e. bit strings. Often, these bit string representations are referred to as chromosomes 😁.
At iVantage, we do a lot of benchmarking within our performance management applications — i.e. how am I performing relative to a group of my peers? The peer group is a fundamental component of any benchmarking application. Our team of analysts and experts work with clients to craft peer groups that will provide a robust, apples-to-apples, basis for comparison. Finding the best peer group is an optimization problem.
If I were going to tell you which organizations were in your peer group, I would give you a list of names: Your peer group consists of Hospital-0, Hospital-3, Hospital-7, … If I wanted to represent that same group in the context of a GA, I’d use a sequence of 0s and 1s: 10010001...
. The GA sequence will have one digit for every org in our database. It’ll show a 1
in sequence positions corresponding to hospitals in the peer group, and a 0
for those not in the group. As long as we agree on which hospitals match up with which sequence positions we can translate between lists for humans and GA sequences pretty easily.
One of these such sequences is an individual, a whole bunch of them together make up a population. Often you’ll have a target population size in mind and a common way to fill out your population initially is to grab as many random sequences as you need. The individuals in our initial population may not represent very good solutions, but they’ll get us started. Once we have our population filled out, running the GA is a matter of selecting individuals to carry traits forward based on their fitness, breeding them using one or more genetic operators to spawn the next generation, and then repeating that process until we’re done.
We glossed over some sticky pointers there —how do we measure fitness? How exactly do we go about selecting individuals? What are these genetic operators? And how do we know when we’re done?
Measuring fitness is the art that lies at the center of a GA. How we measure fitness is going to depend on the problem at hand. At a minimum, we need to be able to look at any two possible solutions and say which is better — that lets us order them from best to worst. Ideally, we’d build ourselves a function that provides a numeric fitness score for individuals. Concrete scores are handy, they let us track the overall fitness of a population and can help us figure out when to stop the algorithm. As the GA runs we can ask questions like is the median fitness still improving? Measuring fitness is the most difficult part of writing a GA. On the flip side, it can also reward deep understanding of the problem domain. If measuring fitness still feels a little hazy, the example in the second half of this post will hopefully clear things up with a concrete example.
Selection can be as simple as lining up your individuals from best to worst and throwing away all but some number of the top ones. In practice, we will often see greater results using a selection process that allows for more diversity of fitness. I like to assign each individual a selection probability relative to its fitness rank, where fitness rank is an individual’s position in a best-to-worst list. Something like this:
Fitness Prob. ~ (Size of Pop. - Fitness Rank)/Size of Pop.
So, for a population of 10, the most fit individual (rank = 1) would have a probability of (10 - 1)/10=0.9 of being selected, the 5th most fit would be selected with probability (10 - 5)/10=0.5, and so on. The specifics of the selection process is another knob you get to play with as the GA author.
After selecting our champions, the next step is to apply genetic operators to them and rebuild the population. Genetic operator is a fancy term for a function that takes in one or more individuals and then spits out one or more individuals. Apparently, whoever came up with these terms felt like it was time to draw a line in the sand on the biology metaphor and refused to call this “reproduction” — but that’s what is going on. A mutation is a genetic operator, for this operator we randomly change some of the digits in an individual: 0000 --> 0010
. Another example is a crossover, where child solutions get pieces of their sequences from multiple parents. In a single-point crossover, each child will get half(-ish) of their sequence from one parent and half from the other: 00000 + 11111 --> 00111, 11000
. We’re just splitting each parent sequence at a single-point and then mix-and-matching those partial sequences to make new ones. Tools like mutation and single-point crossover are what we’ll use to bring us from generation to generation.
When are we done? If we’re using a numeric scoring function, and we have a target fitness score in mind, we can let the algorithm run until we reach that score threshold. Otherwise we can set a limit on the number of generations we want to work through, running time, resource consumption, etc.
A Non-Trivial Application
Solving toy problems is great but I also wanted to make sure we got to look at a non-trivial example from the honest-to-goodness wild.
In addition to benchmarking, iVantage provides market analytics to help organizations with strategic planning. This involves isolating the geographic area a care provider serves and generating reports using that area. Usually, service areas are defined as a collection of ZIP Codes or FIPS county codes. The map below shows a Nashville, TN based hospital in the midst of its ZIP Code service area.
Figuring out which ZIP Codes should be in your service area is a question we’ll turn to genetic algorithms to help us solve.
First, let’s get more specific about what we’re looking for. We want a service area that’s a collection of ZIP Codes, but we don’t want just any old list of ZIP Codes. For this example, we’ll say we want the smallest contiguous collection of ZIP Codes that represent at least 75% of our hospital’s patient volume. Patient volume refers to the patients that come to our hospital for care, each of those patients will reside in one ZIP Code — we want to make sure at least 75% of our patients reside amidst the ZIP Codes in our service area. We also want our service area to be contiguous and we want it to use the fewest number of ZIP Codes possible.
The folks following along at home will recognize that we have two primary tasks to complete before we can let a GA loose on this problem. Specifically, we need:
- A way to encode solutions as sequences of 1s and 0s.
- A way to measure the fitness of a solution.
Just like our peer group example above, 1s and 0s do a great job of flagging something as being in or out. Figuring out which ZIP Codes to even consider as being part of the final service area is an interesting question that’s out of scope for this post. Let’s just assume a friendly co-worker (me) went out of their way to whittle down your list of candidate codes from the 42,000 in the US to just the 341 we’re most likely to need. Cool — so each potential answer will be a sequence of 341 1s and 0s, with each digit corresponding to a particular code in our working subset.
Side note: How you order your sequences can matter depending on the way your genetic operators work. In the case of single-point crossover, positions next to each other are less likely to be separated by that operator. End aside.
Now that we have a way to encode solutions as sequences, how do we go about measuring fitness and comparing solutions? I always prefer to define a scoring function where possible. Getting quantitative feedback from the algorithm as we churn through generations is so valuable for all the reason mentioned previously. Here’s how we might think through framing our fitness function:
- The first thing we want is to make sure we get to the 75% volume mark. The function should reward accumulation of volume, but that reward can taper off after we hit 75%.
- Once we hit 75% volume, we should focus on solidifying the service area into a contiguous block.
- If the service area is contiguous and it represents 75% volume, we can start dealing out rewards for using fewer and fewer ZIP Codes.
If pseudo code (or fake Python) is more your thing, let’s try this:
As long as the scoring function is creating the right incentives we should see better and better solutions as the algorithm progresses.
This particular problem also provides plenty of opportunities to tweak steps in the process for better results or faster conversion. Some ideas:
- Instead of starting with completely random individuals in our initial population we could force codes with high volumes to show up more frequently.
- When mutating, instead of randomly flipping bits we could favor turning on bits that neighbor other codes in the service area (i.e. “grow” a contiguous block). Or we could favor turning off bits when they represent isolated codes.
- We could have our fitness function adapt to information gathered as we observe generations rise and fall. For example, as soon as we see any contiguous service area that achieves 75% volume, we no longer need to consider service areas larger than that one. To extend the biology metaphor, this would be akin to environment changes a species might need to adapt to.
- And lots more! Part of what makes GAs so beautiful is how malleable they are and how many opportunities they provide the author to leverage their domain expertise.
Wrapping Up
Hopefully you enjoyed this post and you’re leaving with a little more intuition around GAs than you had two thousand words ago.
The last thing I want to touch on is when it might make sense to reach for GAs as a strategy. When it comes to solving optimization problems, GAs are rarely bad. In fact, it is often said that the second-best way to solve any optimization problem is a GA. Of course, if there’s an efficient and direct way to find your best solution… you should do that 😄. Also, keep in mind that GAs don’t guarantee the absolute best solution.
GAs tend to really shine in scenarios where it’s easier to tell how good a potential solution is than it is to find the best solution directly. Our service area example is one such case. Without making additional assumptions, solving the problem directly would take exponential time and require some non-trivial logic to implement. Brute force isn’t really an option, the search space includes 2³⁴¹ possible solutions — that’s one for every atom in the observable universe with plenty to spare. On my modest laptop, using roughly the fitness function given above (read: fewer than 20 lines of code), we were able to find a production ready service area in the amount of time it takes to chug a medium iced latte through a crazy straw.
As a companion to this post, we’ve provided a minimal GA implementation complete with runnable demos here. The code is written in idiomatic Python using only built in modules. It’s meant to be read (not run in production!) and follow the process outlined here. Of course, contributions are more than welcome.