PyData Amsterdam 2019 was hosted last week by Booking.com in Amsterdam. The two days conference (plus one with workshops) allowed experts and users of data analytics tools to share their research, approaches, and mistakes.
During the conference, one challenge was posted by Vincent D. Warmerdam (GoDataDriven) with a Lego Set as a prize. In this post, I present my approach to the problem which arrived fifth in the final classific (No Lego for me 😢) but it reached that point with fewer requests compared to the other methods.
In a few words, the challenge was to find the correct order of all the alphabet letters present in the cipher. An API was available to return how good was the order provided. The easiest option would have been trying a brute force approach: generate all the possible permutation of letters and send them to the API till the good one was found. This requires way too much time, given the fact a max number of request per second was implemented in the service. I did not have an entire life to win the lego set, therefore my idea was “Evolutionary Algorithm will solve it!”
Evolutionary Algorithms (EA)are inspired by Darwin’s theory of evolution. The solution for a given problem can be reached through evolution.
The algorithm starts with a set of solutions (each solution is also called individual or genome) called population. Normally this population is randomly generated (Initialization). Some individuals from the current population are used to form the next population (Parent selection). Ideally, the new population will be better than the old one. Individuals are selected to form new solutions (Offspring) according to their fitness value, better they perform in the environment, more chances they have to reproduce. This step includes two sub-steps: Recombination and Mutation. In the recombination process, the genome of the “parents” is used to create children, meaning the children will have a piece of the genome from each parent. In the mutation step, we introduce a small change in the genome of the children probabilistically, in order to make them slightly different from the parents. Then Survivor Selection is applied, in order to keep the population size the same for the entire evolution
This process is repeated until some end condition (i.e. max number of generation reached, precision on the solution reached) is satisfied.
Definition of the problem
The first step in defining an EA is to understand what kind of representation of the data to use. In this case, it was easy, given the constraint of using the letters of the alphabet. The Permutation Representations is what fits this requirement, allowing all the genes (alleles) in the genome to occur only one time. This was tested with the API where having the same letter repeated more times, special accents, numbers, and symbols returned an error message.
Permutation Representation only allows special mutation operator, given the fact the gene must occur only once in the genome. There are four versions that can be applied in this case:
- Swap mutation: Pick two genes at random and swap their positions.
- Insert mutation: Pick two allele values at random and move the second to follow the first, shifting the rest along to accommodate.
- Scramble mutation: Pick a subset of genes at random and randomly rearrange the alleles in those positions.
- Inversion mutation: Pick two alleles at random and then invert the substring between them.
Many specialized operators have been devised which focus on combining order or adjacency information from the two parents (more information on ):
- Order 1 crossover
- Partially Mapped Crossover (PMX)
- Cycle crossover
- Edge Recombination
Another important step in defining an EA is the definition of a fitness function. In this case, the fitness was defined by the challenge and was a scored based on how the cipher submitted was close to the real one plus some noise.
Tackling the problem
The challenge also suggested some open source library to use to solve this problem, that basically facilitates the implementation of an EA. Unfortunately, I did not check the link during the conference (I was focused on following the talks) otherwise I would have definitely tried that. My approach was easy, few lines of code implementing the main step of the EA, the swap mutation and no recombination. I decided to not implement recombination due to not having code ready for the Premutation Recombination and the fact I wanted to follow the talks of the day (this approach was implemented the second day of the conference, also the last day of the challenge).
The main loop is super easy. All the component of an EA are listed here: initialization of the population, evaluation of the individual, offspring generation, evaluation entire population, and survivor selection. Two different end conditions were implemented: either the fitness was optimal (global optimum reached => perfect order found) or the algorithm reached the 100M iteration. A population size of 20 was chosen arbitrally.
The class implementing the individual is very simple. It has two attributes, the fitness, and the genome. It also implements a really bad implementation of the swap mutation (do not judge please): generates two different random indexes and swaps the letters in those positions. Then it returns the new genome.
Finally the core of the algorithm. The EA implemented is a (20 + 100), meaning from 20 parents, 100 individuals are generated. Then the fitness of all these 120 individuals is evaluated and only the top 20 individuals (highest fitness) stay alive for the next generation. The matching between the genome and the alphabet letters is kept in a dictionary, for a fast translation between the two. The evaluate fitness method translates the genome into the cipher and sends it to the API. If the API returns an error, the method simply sleeps and tries again after 1 second. The Survival Selection basically orders the individual based on their fitness and erases the one not needed in order to keep the dimension of the population fixed to 20.
Does this work?
Yes, it worked, even though it did not reach the first place and the dreamt LEGO set.
The model just described scored under the name “luke_skywalker” and reached only the 5 positions. I would say this is a nice result, given it was coded during a presentation and launched only the last day of the competition. An even more remarkable result if we consider it reached that point with only 10000 requests to the API, compared to all the other methods which required way more request. The first methods run for the entire conference and got pretty close to 1 with more than 1.5Million of iteration, that is crazy!
Take Away: EA can be used to solve a lot of problems, especially if the solution space is huge. Just take more time than me in thinking which operators to use, that will help a lot in finding the optimal solution faster (remember that EA algorithms are not optimal algorithms, therefore they are not always good if you want the optimal solution)
Evolution Picture from “https://www.inverse.com/topic/evolution”
Too Damn High Picture from “https://tenor.com/search/too-damn-high-gifs”