Genetic algorithms, please teach my computer how to spell

Simon Copsey
The Curious Coffee Club
7 min read · Aug 13, 2018

We’re going to teach your computer how to spell, using Genetic Algorithms. Spelling is a useful skill for a computer, and Genetic Algorithms are a fun form of AI.

And I promise it’s safe.

Well, mostly.

Er — yeah, best be careful when toying with genetics. Image credit: @PuraVida_Fotografie

Before we get into the detail, a quick recap: we previously demystified Artificial Intelligence together as a set of technologies generally underpinned by Machine Learning:

Machine learning is different from usual programming. Usually, programmers write a piece of software similar to how we write a mathematical equation: they tell the computer what operations to perform on an input (e.g.: circle_radius) to create the desired output (e.g.: circle_area).

circle_area = π * circle_radius²

Machine Learning turns that on its head. With Machine Learning, programmers instead show the computer both the input and the desired output, and let the computer itself derive the operations that need to be performed to get from the input to the output.
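To make the contrast concrete, here's the circle example as traditional programming, where we spell out the operations ourselves (a trivial Python sketch for illustration only):

```python
import math

# Traditional programming: we tell the computer exactly
# what operations turn the input into the output.
def circle_area(circle_radius):
    return math.pi * circle_radius ** 2
```

With Machine Learning, we'd instead hand the computer pairs of radii and areas and let it work out the `π * r²` part for itself.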

Though the lines are blurry, we can say Genetic Algorithms are a form of Machine Learning. They are generally used to find the best solution to a single problem (such as NASA designing a single aerial with the best radiation pattern), versus creating a behaviour that can be used to solve many problems of the same type (such as being able to repeatedly identify photos containing a cat). Don’t worry, this will become clearer as we move into examples.

Genetic Algorithms are particularly fun because they’re a simple simulation of evolution. The computer simulates generations of solutions to a problem, taking the best performing solution from one generation to pollinate the next generation of solutions. This means solutions should slowly improve in quality generation-on-generation.

Generally, the algorithm boils down to three steps:

  1. Variation: Create a random set of potential solutions to the problem.
  2. Selection: Identify which of the potential solutions are the best.
  3. Heredity: Discard all but the best solutions. Splice and mutate those together to create an entirely new set of potential solutions. Go to step 2.

And that’s it. If we repeatedly perform steps 2 and 3, the potential solutions should naturally get better and better as we move from one generation to the next.
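The three steps above can be sketched as one generic loop. This is just an illustrative skeleton, assuming hypothetical `random_solution`, `fitness` and `breed` functions supplied by whoever uses it:

```python
import random

def evolve(random_solution, fitness, breed,
           generations=100, population_size=50, keep=5):
    # Step 1 - Variation: create a random set of potential solutions
    population = [random_solution() for _ in range(population_size)]
    for _ in range(generations):
        # Step 2 - Selection: identify the best-performing solutions
        fittest = sorted(population, key=fitness, reverse=True)[:keep]
        # Step 3 - Heredity: breed an entirely new set from the fittest
        population = [breed(random.choice(fittest), random.choice(fittest))
                      for _ in range(population_size)]
    return max(population, key=fitness)
```

Everything that follows in this article is a specialisation of this loop: only the three plugged-in functions change.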

The Knapsack Problem

As always, an example makes this clearer.

Let’s solve the Knapsack Problem. It’s Friday night and we want to go camping for the weekend. However, we’re indecisive about what to pack. We’ve settled on stuffing as much as we possibly can into our knapsack, which can only hold 45 litres. Unfortunately, all the things we want to pack have awkward volumes (2L, 5L, 1.5L, 3L, 7L, etc).

The question becomes: what is the combination of objects we should pack that would allow us to fill the 45L knapsack as much as possible?

OMG — what should I pack? THE CAT, ALWAYS THE CAT!

Given we’re so lazy, we decide to program a Genetic Algorithm to solve this for us. It follows those same 3 steps:

  1. Variation: Create random combinations of items we could pack
  2. Selection: Identify which combinations contain items that come closest to totalling the 45-litre capacity
  3. Heredity: Take the identified combinations and splice/mutate them into new combinations. Go to step 2.
Left-to-right: (1) Created random varied population, (2) select top-performers, (3) splice top-performers into new solutions (heredity) and repeat selection
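As a sketch, a knapsack solution could be a simple yes/no choice per item, scored by how full the knapsack gets without overflowing. The item volumes below are made up for illustration:

```python
import random

items = [2, 5, 1.5, 3, 7, 4, 6, 2.5, 8, 1, 9, 10]  # hypothetical volumes in litres
CAPACITY = 45

def random_packing():
    # Variation: a solution is a random yes/no choice for every item
    return [random.random() < 0.5 for _ in range(len(items))]

def fitness(packing):
    # Selection: the fuller the knapsack, the better...
    total = sum(volume for volume, packed in zip(items, packing) if packed)
    # ...but an overfull knapsack scores nothing
    return total if total <= CAPACITY else 0
```

Heredity would then splice two packings together (say, the first half of one parent's choices with the second half of the other's), exactly as in the generic loop.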

So, how do we make your computer spell?

So now you’re awesome with Genetic Algorithms, let’s help your computer learn to spell!

Our aim with this example is to encourage your computer to type a sentence that contains as many real words as possible, with little or no gobbledygook. Let us think about those three steps again.

Step 1: Variation

We’ll create a ‘sentence’ by randomly stringing together lots of letters and spaces, a little like threading beads onto a piece of string in a random order.

We’ll do this lots of times until we have lots of sentences (pieces of string) containing lots of letters and spaces (beads) in a random order.

It’s just like stringing together random letter beads to form sentences. Image credit: @linharex
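A minimal sketch of this bead-threading might look like the below. The sentence length and the lowercase-only alphabet are arbitrary assumptions here:

```python
import random
import string

SENTENCE_LENGTH = 40  # an arbitrary choice; tweak freely

def generate_random_sentence():
    # Thread random 'beads' (letters and spaces) onto the string
    beads = string.ascii_lowercase + " "
    return "".join(random.choice(beads) for _ in range(SENTENCE_LENGTH))

def generate_random_generation(size):
    # Lots of pieces of string, each with beads in a random order
    return [generate_random_sentence() for _ in range(size)]
```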

Step 2: Selection

At this point we’ll look through all of those lovely sentences we created to identify how ‘good’ each sentence is. We’ll then take the best sentences and use those in the next step.

But, wait — how do we measure how ‘good’ a sentence is?

Well — we’ll look at every word in the sentence — where a word is a combination of letters surrounded by spaces. If the word is something that actually exists in the English dictionary, the sentence receives points. However, if the word isn’t in the dictionary and is just a meaningless jumble of letters, the sentence does not receive points. We will then total together the points from every word in a sentence — a little like Scrabble — and the sentences with the most points win!

The points are often called the ‘fitness’ of a solution and are determined by a ‘fitness function’. It’s just fancy talk, but I like it.
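A fitness function along these lines might look like the sketch below. The tiny stand-in dictionary and the one-point-per-letter scoring are assumptions for illustration; the real code loads a full dictionary file:

```python
# A small stand-in dictionary; the real code loads dictionary.txt
dictionary = {"hot", "rice", "toy", "tub", "mum"}

def get_fitness_score(sentence):
    # Score each real word, Scrabble-style: here, a point per letter.
    # Meaningless jumbles of letters score nothing.
    return sum(len(word) for word in sentence.split() if word in dictionary)
```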

Step 3: Heredity

We’ll take the 5 sentences with the highest fitness and use them as inspiration for creating a whole new generation of sentences. We’ll do this by splicing — creating an entirely new sentence by combining pieces of two different sentences.

We’ll also sprinkle in some new randomly-generated sentences (generated in the same way as during Step 1) to ensure we maintain some biodiversity in the next generation.

We’ll take this new generation of potential solutions into Step 2. We’ll rinse and repeat Steps 2 and 3 many times, allowing optimal solutions to evolve and surface.
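Splicing plus the biodiversity sprinkle could be sketched like this (the generation size, newcomer count and sentence length are arbitrary, and this is a simplification of the real `create_next_generation`):

```python
import random
import string

BEADS = string.ascii_lowercase + " "

def generate_random_sentence(length=40):
    return "".join(random.choice(BEADS) for _ in range(length))

def splice(parent_a, parent_b):
    # The child takes the start of one parent and the end of the other
    cut = random.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]

def create_next_generation(fittest, size=100, newcomers=10):
    # Mostly spliced children of the fittest sentences, plus a few
    # fresh random sentences to maintain some biodiversity
    children = [splice(random.choice(fittest), random.choice(fittest))
                for _ in range(size - newcomers)]
    return children + [generate_random_sentence() for _ in range(newcomers)]
```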

And, what does that look like in code?

First, the fun bit: let’s look at the output from running our Genetic Algorithm. The computer simulates generation after generation of solutions, pollinating each new generation using the best solutions from the one before.

OMG! THERE ARE SOME REAL WORDS THERE! Mum tho! Hot rice! Toy tub!

For brevity, you can see we are outputting the results only every 200 generations. You’ll see the generation number (e.g.: Gen 00200) followed by the details of the best solution in that generation — its score in brackets and then the sentence itself.

As you can see, we start to see more dictionary words (and less gobbledygook) as we move from one generation to the next.

Levers for change

The effectiveness of the solutions generated by the algorithm can be further improved by tweaking several aspects:

  • The dictionary against which to validate words
  • The fitness function (e.g.: to favour longer words over shorter)
  • How many solutions from one generation to use to pollinate the next
  • The number of solutions we generate as part of each generation
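For instance, pulling the second lever, favouring longer words, could be a one-line tweak to the fitness function: squaring each word's length instead of just counting letters (a hypothetical variant, not the scoring the real code uses):

```python
dictionary = {"hot", "rice"}  # stand-in; the real code loads dictionary.txt

def get_fitness_score(sentence):
    # Squaring each word's length rewards longer words disproportionately:
    # 'rice' is now worth 16 points to 'hot''s 9, not 4 points to 3
    return sum(len(word) ** 2 for word in sentence.split() if word in dictionary)
```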

Coding it up

The Python code is available below for you to instantly play with and run (also here). I find the best way to understand the code is to press ‘run’ and see what happens. Then, tweak a few things and run again, particularly some of the numbers defined at the top (e.g.: GENERATION_SIZE, NUMBER_OF_SENTENCES_TO_SEED_NEXT_GENERATION).

What I love about Python is that it’s fairly human readable. The code is also written in a way that should be fairly intuitive given the above explanation.

The second thing to do is scroll to the very bottom of the file. You should see this:

# Load in the dictionary
dictionary = set(open("dictionary.txt").read().split())

# Step 1 - VARIATION: Create the first random set of solutions!
generation = generate_random_generation(GENERATION_SIZE)

# Let's start evolution!
generation_number = 0
while True:
    # Step 2 - SELECTION: identify the best solutions
    fittest = get_fittest(generation, NUMBER_OF_SENTENCES_TO_SEED_NEXT_GENERATION)

    # Output the best solution to the screen every 200 generations
    if generation_number % 200 == 0:
        print("Gen %05d (%07.1f): %s" % (generation_number, get_fitness_score(fittest[0]), fittest[0]))

    # Step 3 - HEREDITY: generate the next generation, seeded by the fittest from the current generation
    generation = create_next_generation(fittest)
    generation_number += 1

Look: the three magical steps! Variation, selection, heredity! Each step uses some of the functions defined further up the file, and the three steps are a useful way to orient yourself as you read the code.

LET’S MAKE SOME WORDZ!

Though this article offers some depth on Genetic Algorithms, I realise it didn’t dive deep into explaining the code. I’ll certainly add this additional detail if it proves of interest to you and your fellow readers.

Also, please don’t interpret the code as ideal by any means: it lacks the unit tests and object orientation that could have made it even easier to modify, but the aim was to keep it simple.

Anyway. Hope that’s enough to get you interested in the wonderful world of Genetic Algorithms!
