A Gentle Introduction to Genetic Programming
Introduction
GP is a systematic, domain-independent method for getting computers to automatically solve problems starting from a high-level statement of what needs to be done.
Given a task, GP starts with random programs and improves generation by generation using some genetic operations until it the program is fit for our use. It is like simulating biological breeding but for programs.
Genetic programming is commonly used to generate high-quality solutions to optimization and search problems by relying on biologically inspired genetic operators. Some of the most famous problems solved with genetic programming include symbolic regression, path finding, antenna design, game playing and so on.
Representation
Program — In GP, all possible solutions are called programs. Program is similar to a computer program. Sometimes it can be an expression like x*y + z, sometimes it can be an artifact of some type like a circuit. Every problem can be directly or indirectly solved by a program. If it cannot be done directly, we construct programs whose execution could construct the solution.
Programs are usually represented as syntax trees in GP. For example, the expression/program x + 2y will be represented like this.
Here x, y and 2 are called terminals whereas + and * are called functions.
Since the tree based representation can be memory consuming, the same tree can also be expressed in form of string by using prefix-notation.
Terminal Set
Terminal set consists of mainly 2 things:-
- Variables — The external inputs given to program (e.g. x, y, etc)
- Constants — Either pre-specified or randomly generated
Sometimes the terminal set consists of functions with no arguments. For example, rand() function which will return different values on different invocations.
Function Set
It consists of all the operations that can be done on the terminals. The function set is extremely domain specific. It can be a set of mathematical operations like {+, -, *, /, log, exp} or a set of actions like {move, turn}.
These two sets combined is called primitive set.
Initialization
For any population to evolve, you need some parents. In GP the initial population is generated randomly using different approaches. The individuals of initial population are usually unfit (unable to solve our problem). The population should contain programs of varying shape and size. Here are 3 common methods for initializing the population:-
Full Method
Firstly a maximum depth parameter is fixed. Then nodes are taken at random from the function set until this max depth is reached. After this only terminals are chosen. As the name suggests thing method generates full trees.
Grow Method
This method allows creation of trees of various shapes. Here nodes are taken at random from either the function set or terminal set until a maximum depth is reached or all the leaf nodes are terminals.
Here are examples of tree constructed using these two methods.
Grow method is quite sensitive to sizes of function and terminal sets. If there are too many terminals, mostly shallow trees will be produced. If there are too many functions, mostly full or almost full trees will be produced.
Ramped half-and-half Method
It is combination of above two methods to overcome the limitations of grow method. Half the initial population is constructed using full method and other half is constructed using grow method and a range of depths are used in both cases.
Bonus: Sufficiency
Primitive set should have this property. Basically the solution should be representable using the primitive set.
Note: Population size is an important control parameter
Genetic Operators
This is where GP actually gets interesting. After creation of an initial population we apply some genetic operators on them and create a new generation of programs. Below are the most important genetic operators and there examples.
Selection
If we try to apply genetic operators on the entire population then the set of programs would expand exponentially and its not feasible. So, we only allow these special genetic operations for some individuals. These individuals are chosen from a probability distribution based on their fitness (closeness to the solution). Fitness will be discussed in the next section. This is done in order to ensure that only good programs (or there combinations) move to the next generation.
For example lets say we are generating programs for self-driven cars, consider two programs A and B. If A is able to drive better (which can be measured by using various metrics like accidents, speed, time taken to reach destination, etc.) then there will be a higher chance of A being selected for genetic operations than B.
Crossover
This operation creates an offspring program by combining randomly chosen parts of 2 selected parent programs. It drives inspiration from sexual reproduction where the child contains traits of both parents. If the appropriate traits are passed on, the offspring would be fitter than the parents.
Mutation
This operation creates an offspring program by randomly altering a randomly chosen part of a selected program. This drives inspiration from biological mutation where DNA sequence changes during cell division. Many times these mutations enable an organism to survive better in an environment.
Reproduction
This operation simply selects a program and keeps it in the next generation as well.
Note: Probability of performing a genetic operation is an important control parameter.
Bonus: Closure
Type Consistency — because the mixing is random, all functions must be type consistent otherwise it wouldn’t be so easy. If not possible then an automatic conversion mechanism needs to be employed (e.g. conversion between boolean and integers) or GP system needs to be extended so primitives include type information and operations are constrained.
Evaluation Safety — Sometimes commonly used function can fail/throw an exception. e.g. dividing by zero or move_forward when there’s a wall ahead. This needs to be handled by using protected versions of functions or assigning a low fitness score if exception occurs in the program.
Fitness Evaluation
Once we have the population, we need a measure to evaluate the fitness/performance of individuals. A fitness measure is required to specify the end goal of our search. Fitness can be measured in terms of the amount of error between output of our program and desired output, the amount of time, fuel, money, etc to get a state, accuracy of a output, compliance of a user specification, and so on.
The fitness feature of GP can be multi-objective as well. i.e. many elements that are in competition with each other can be combined to assign a fitness score.
Termination
Once we have the fitness measure, we can specify when our search is supposed to terminate. A termination criteria needs to be specify to finally get a solution. It can either be a maximum number of generations produced or a specified level fitness reached by an individual program.
References
The entire content of this blog is based on this book. It is beautifully written after considering 450+ pieces of literature. Must read!
Poli, Riccardo, et al. “Genetic programming: An introductory tutorial and a survey of techniques and applications.” Univ. Essex School of Computer Science and Eletronic Engineering Technical Report No. CES-475 (2007): 1–112.