Adapting Genetic Algorithm Tools to Solve Restrictive Problems

Joseph George Lewis
Geek Culture
Published in
10 min readSep 15, 2022

--

Photo by Nolan Issac on Unsplash

Genetic algorithms are an amazing intersection of biology and computer science. I’ve wanted to write about it for a long time and finally, that time is here! So please enjoy the travelling coffee drinker — A coded walk-through of the solution I used to find the quickest route to every Starbucks in London!

The full code for this project is available in a notebook on my GitHub page. Link at the end of the article.

Introduction

Problem Statement

The travelling salesman problem (TSP) has been explored in mathematics for a long time. In its simplest form (the version solved in this project) the TSP was most probably first stated in the 18th Century and the full definition is given below:

“What is the shortest route a traveling salesman can take to visit n cities and return back to his home city, only going through each city once?” — Biron, 2006

So in a more modern adaptation of the problem:

“What is the shortest route a coffee fanatic can take to visit 165 Starbucks cafés and return back to his home café, only sampling each café once?” — Lewis, 2022

Exploration

The data used comes from Kaggle with a link at the end of this article, and luckily is all licensed as Open Source! The data shows that there are 975 total Starbucks Cafés within the GB country code. With most of those being in London (165):

Source: Author. Starbucks Cafés across the UK

Now how can we reach each one of these cafés whilst travelling the least possible distance? At this scale, it would take a long time to do the manual calculation for each possible solution (165 factorial possible solutions!). So instead we can optimise and use a Genetic Algorithm.

The Genetic Algorithm is brilliantly explained by user Vijini Mallawaarachchi in a link at the end of the article. However, what we are seeking to do is apply the processes of natural selection and evolution to a set of possible solutions, more on this in the methodology.

However, a generic genetic algorithm is not tailored to meet the constraints of a TSP, so is not always an appropriate solution. This article shows that genetic algorithms can be tailored solutions to restrictive problems like the TSP. But first, it is important to look at why the TSP is so restrictive and where a genetic algorithm could go wrong.

The Constraints

A solution to the TSP has to have:

  • The traveller begin and end their journey at the same café
  • The traveller not revisiting any intermediate café so each one in the solution must be unique
  • The traveller visiting all 165 + 1 (for returning) = 166 cafés

An out-of-the-box implementation of a genetic algorithm will truly mimic biological processes and is not fit for use on restrictive problems. This means that mutations are random (so could cause duplication breaking point 2 above), crossovers between solutions are also random (so could break point 1) and deletions are also possible mutations where a gene could be removed entirely (breaking point 3). So how can they be altered and developed for the TSP?

Methodology

In building the genetic algorithm existing packages like NumPy and PyGAD can be leveraged. PyGAD is a package dedicated to Genetic Algorithms written in Python. It is really useful and has many use cases most heavily weighted toward training neural networks. This means that it has not been fully adapted to solving the TSP. However, it does have very intuitive customisation options and parameters as you build a genetic algorithm instance. Which are explored below to make it more bespoke for the restrictive TSP.

Building the Genetic Algorithm

A genetic algorithm is made up of a set number of parts:

  1. An initial population
  2. A fitness function
  3. A crossover operation
  4. A mutation operation

These constituents in the case of this solution to fix the genetic algorithm for the restrictive TSP will be:

  1. A pool of solutions programmatically generated with a random order of stores starting and ending at a given store
  2. A function to determine the distance between each store in a solution
  3. A function to imitate breeding in a natural process, where parts of two very fit solutions are passed on to the children — using the partial match crossover operator which upholds restrictions
  4. A function to randomly change an element each solution to imitate mutations in natural processes — an inversion which again upholds restrictions

There is also parent selection however this is taken care of by PyGAD. Now it’s important to make sure our biology is up to scratch here too and we can start to replace some of the lengthier sentences above with shorter more precise ones:

gene — a store to visit in a given solution

chromosome — a set of stores strung together to make a solution

population — the pool of chromosomes

Using PyGAD’s documentation as a source the methodology when applying the package is given in the helpful flow chart below:

Source: PyGAD Docs (link in References). Flow Chart of PyGAD implementation

Implementation

Now for the code. There is a lot of code for this project so it won’t all be included here. Especially the boilerplate genetic algorithm code and operations. The most important takeaways are the application of a custom population builder, fitness function, crossover operator and mutation operator. As well as solution verification.

First things first the population needs to be generated. The simplest method for getting random values is making use of random in Python. So from the data, a list of stores is generated. A hash (dictionary) containing each store and some lon/lat information for where the store is, and some random selections courtesy of the random package in Python can be used to build the chromosomes:

Notably, the first and final genes are set to 0 indicating the traveller's return.

Now for the fitness function. To determine which chromosomes will make it to the next generation and breed, a fitness value is needed. The fittest ones will make it through. The fitness of a solution in this context is how long or short the distance that the traveller has to go before reaching the end of the line. So a solution that bounces the coffee drinker from North to South London and back again won’t be very fit. The work here makes use of GeoPy and measures distance as the crow flies:

This is where things get really interesting. Out-of-the-box genetic algorithms in PyGad come with several crossover operators. However, none of these is suitable for the TSP! All of the crossover operations would result in invalid solutions. So there are two options either carry on as normal and generate incorrect values but apply a mutation that rematches the chromosome to validate the solution. Or develop a custom crossover method that does not result in invalid solutions. This project uses option two and applies the Partially Matched Crossover (PMX) method.

The reason behind crossover is to combine very good solutions in an effort to make two even better child solutions. PMX does this by:

  1. Define a run of genes in parent 1
  2. Pass these genes on to the child
  3. Find the first gene in parent 2 that is not already in the child
  4. If this gene the position of this gene is already filled in the child then match to parent 1 to see what gene it is
  5. Iterate until a position is found for every gene that the child is missing from parent 2
  6. Repeat the process with parent 2 as parent 1

The diagram below comes from a paper based on the PMX crossover, in biology:

Image: Sun, Zhao, Zhu (2021) link in references. PMX Crossover illustrated

So the process is defining a “run” of consecutive genes to take into a child from parent one in the exact same position. Then fill in the rest of the child's chromosome with the remaining genes in the safe positions from parent two. To create the second child the parents just swap and the “run” of consecutive genes is taken from parent two and parent one is used to fill in the gaps. The code implementation is split into two functions below:

The final impactful function to design is the mutation function. For the TSP again there are no suitable out-of-the-box mutation functions. So the options are to build one or to build a custom mutation that re-validates a solution. The first option is more favourable because it is more in line with the concepts of genetic algorithms and involves less meddling. The mutation used is an inversion, in which a sequence of genes of random length is defined and then reversed so […, 1, 2, 4, …] would become […, 4, 2, 1, …]. The code implementation is more simple than the crossover:

Now the genetic algorithm ga_instance can be initialised and run below:

The key things to note are the implementation of the custom functions. As well as a few other parameters namely:

  • num_generations: the number of iterations to run the initialisation, selection, breed and mutation process
  • sol_per_pop: the number of totals solutions per generation
  • num_parents: the number of parents that will breed based on fitness rankings (this value is kept fairly high here to ensure diversity in the solution pool)
  • keep_parents: keep the two fittest parents at each generation (elitism) so that they can still be considered fit solutions until they are outranked and selected against later

Results

Analysing the results of the genetic algorithm is where using the pre-built PyGAD package is very useful. The results are super accessible with just a few simple function calls:

To show the results in a more simple format initially a version of the algorithm is trained on just 10 cafés. And the best solution found is given below:

Toy Solution with 10 Stores

Now at first this doesn’t look like anything too clever, but zooming into the middle section it’s clear how well the solution has been optimised. As there can be no repeated sections the algorithm has found paths on the separate sides of train tracks to minimise the distance even at this low level:

Example of optimisation

Now for the kicker. All 165 stores. The map for this one is a lot more cluttered:

There are some further optimisations that seem possible here for example on the right of the plot some shorter routes could be possible. However, given the scale of the problem, this seems like a very valid and well-optimised solution. In numeric terms, the total distance covered after 25 generations of optimisation is 590 miles (or 990km!). Not too bad for 165 cups of coffee!

Discussion

Advantages of the Genetic Algorithm Approach with PyGAD

  1. No need to build the boilerplate genetic algorithm code to find solutions, select parents and loop through generations while assessing fitness.
  2. The full simulation would require a user to test every possible solution, but the Genetic Algorithm application means only the best options are considered at each generation.
  3. Anyone inheriting the analysis can make use of all of PyGAD’s existing documentation!

Limitations and Developments

  1. There are other crossover and mutation operators that have not been tested and could be more appropriate for finding an optimal solution.
  2. The distance measured in GeoPy is as the crow flies, an API could be used to get actual travel times like Google Maps.
  3. The genetic algorithm still needs to consider many options so can be slower as the generations scale, PyGAD has a PyTorch extension that I hope to investigate and apply in future.
  4. To expand on this work directly more testing with higher numbers of generations, multiple mutation types and more diversity enhancement techniques would also be beneficial.

--

--

Joseph George Lewis
Geek Culture

Data Scientist focused on reproducibility, package design, data vis, data ethics and natural language processing. (AP Data Science BSc Hons, First Class)