Symbolic Regression From Scratch in C# Part 2

Taran Marley
2 min read · Aug 17, 2019


Continuing from part one, we will now finish off what is required to implement the symbolic regression algorithm and see some results.

The first thing to create is the computation engine, which applies the operations contained in the gene set to a given value. This allows us to test fitness and therefore mutate chromosomes to better fit a given data set.

Computation Manager:

The computation function takes in a list of genes and an x value and computes the corresponding y value.

It’s worth pointing out that division has been deliberately left out to avoid a potential divide-by-zero error. Division by a constant is handled by multiplying by a decimal instead (for example, multiplying by 0.5 rather than dividing by 2).
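As a rough illustration of the idea, here is a minimal sketch of such a computation function. The `Operation` enum, the `Gene` class and the `ComputationSketch.Compute` method are hypothetical names made up for this example, and the choice of a running value that starts at 0 is an assumption based on the shape of the equations printed later in this article. The types from part one may well look different.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical gene model for illustration; not the types from part one.
public enum Operation
{
    AddConstant,        // y = y + c
    SubtractConstant,   // y = y - c
    MultiplyByConstant, // y = y * c  (also covers "division", e.g. c = 0.5)
    AddX,               // y = y + x
    SubtractX,          // y = y - x
    AddSinOfScaledX     // y = y + sin(x * c)
}

public sealed class Gene
{
    public Operation Op { get; }
    public double Constant { get; } // unused by AddX and SubtractX

    public Gene(Operation op, double constant = 0.0)
    {
        Op = op;
        Constant = constant;
    }
}

public static class ComputationSketch
{
    // Applies each gene in order to a running value and returns the final y.
    // Assumption: the running value starts at 0.
    public static double Compute(IReadOnlyList<Gene> genes, double x)
    {
        double y = 0.0;
        foreach (Gene gene in genes)
        {
            switch (gene.Op)
            {
                case Operation.AddConstant: y += gene.Constant; break;
                case Operation.SubtractConstant: y -= gene.Constant; break;
                case Operation.MultiplyByConstant: y *= gene.Constant; break;
                case Operation.AddX: y += x; break;
                case Operation.SubtractX: y -= x; break;
                case Operation.AddSinOfScaledX: y += Math.Sin(x * gene.Constant); break;
                default: throw new ArgumentOutOfRangeException(nameof(genes), "Unknown operation");
            }
        }
        return y;
    }
}
```

With this sketch, the genes `AddSinOfScaledX(1.5)`, `AddConstant(0.5)` and `MultiplyByConstant(0.75)` evaluated at x = 0 give (0 + sin(0) + 0.5) * 0.75 = 0.375. Note that there is no division case at all, so no gene can trigger a divide by zero.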

Program:

As a console application used for experimentation, the implementation is somewhat rough.

Here I have entered enough data points from (sin(x*1.5)+0.5)*0.75 for the genetic algorithm to generate an equation.

Below this I will put in the loop that runs our console app:
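If you would rather generate the data points than type them in, something like the following sketch would do it. `TargetData`, `Target` and `Sample` are hypothetical names, and the starting x, spacing and number of points are assumptions; the article does not say which x values were used.

```csharp
using System;
using System.Collections.Generic;

public static class TargetData
{
    // The target function used in this article.
    public static double Target(double x) => (Math.Sin(x * 1.5) + 0.5) * 0.75;

    // Samples `count` evenly spaced points, starting at `start` and
    // advancing by `step`. Range and spacing are illustrative choices.
    public static List<(double X, double Y)> Sample(double start, double step, int count)
    {
        var points = new List<(double X, double Y)>(count);
        for (int i = 0; i < count; i++)
        {
            double x = start + i * step;
            points.Add((x, Target(x)));
        }
        return points;
    }
}
```

For example, `TargetData.Sample(-3.0, 0.5, 13)` covers x from -3 to 3 in steps of 0.5, which spans more than one full period of the sine term.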

This code starts by randomly generating a new chromosome called BestParent and displaying it. It then enters a for loop in which a child chromosome is cloned from the parent, mutated, and has its fitness compared against the parent's. Survival of the fittest is applied: if the child is fitter than the parent, it becomes the new BestParent, and the cycle begins anew.

Testing it out, after a few runs I was able to get the following equation:
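The clone, mutate and compare cycle described above can be sketched in a few lines. To keep the example short and self-contained, the chromosome here is only an array of three constants plugged into a fixed formula, which is a deliberate simplification and not the gene list this series uses. `HillClimbSketch`, `Evaluate`, `Error` and `Evolve` are hypothetical names, and the mutation size is an arbitrary choice.

```csharp
using System;

public static class HillClimbSketch
{
    // Stand-in model: y = (sin(x * p[0]) + p[1]) * p[2].
    public static double Evaluate(double[] p, double x) =>
        (Math.Sin(x * p[0]) + p[1]) * p[2];

    // Fitness expressed as a sum of squared errors; lower is better.
    public static double Error(double[] p, double[] xs, double[] ys)
    {
        double total = 0.0;
        for (int i = 0; i < xs.Length; i++)
        {
            double diff = Evaluate(p, xs[i]) - ys[i];
            total += diff * diff;
        }
        return total;
    }

    public static double[] Evolve(double[] xs, double[] ys, int generations, Random rng)
    {
        // Randomly generate the starting chromosome.
        var bestParent = new[] { rng.NextDouble() * 2, rng.NextDouble() * 2, rng.NextDouble() * 2 };
        double bestError = Error(bestParent, xs, ys);

        for (int g = 0; g < generations; g++)
        {
            // Clone the parent and nudge one randomly chosen constant.
            var child = (double[])bestParent.Clone();
            int index = rng.Next(child.Length);
            child[index] += (rng.NextDouble() - 0.5) * 0.1;

            // Survival of the fittest: the child replaces the parent only if it fits better.
            double childError = Error(child, xs, ys);
            if (childError < bestError)
            {
                bestParent = child;
                bestError = childError;
            }
        }
        return bestParent;
    }
}
```

Because a child is only ever accepted when it is strictly better, the error of BestParent can never get worse from one generation to the next. That same property is what makes this kind of loop prone to getting stuck.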

(0+sin((x * 1.49995247530749))-x+x+0.0444218926338581+0.455150786999218))*0.750109340413525

The -x+x terms cancel and the two constants add up to roughly 0.49957, so this simplifies to:

(sin(x * 1.49995)+0.49957)*0.75011

This is remarkably close to our initial equation, (sin(x*1.5)+0.5)*0.75.
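One way to put a number on "close" is to evaluate both equations over a range of x values and record the largest gap. The sketch below does that using the constants printed by the run above; `ResultCheck` and its method names are hypothetical, and the range you check over is up to you.

```csharp
using System;

public static class ResultCheck
{
    // The equation the data points were taken from.
    public static double Target(double x) => (Math.Sin(x * 1.5) + 0.5) * 0.75;

    // The evolved equation with the cancelling -x+x terms left out.
    public static double Evolved(double x) =>
        (Math.Sin(x * 1.49995247530749) + 0.0444218926338581 + 0.455150786999218)
        * 0.750109340413525;

    // Largest absolute gap between the two over [from, to], sampled at steps + 1 points.
    public static double MaxAbsoluteDifference(double from, double to, int steps)
    {
        double worst = 0.0;
        for (int i = 0; i <= steps; i++)
        {
            double x = from + (to - from) * i / steps;
            worst = Math.Max(worst, Math.Abs(Target(x) - Evolved(x)));
        }
        return worst;
    }
}
```

Calling `ResultCheck.MaxAbsoluteDifference(-5, 5, 1000)` gives a gap well under 0.01. The gap grows slowly as x moves away from zero, because the small error in the 1.49995 factor is multiplied by x inside the sine.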

Extension

The algorithm here can get caught in a local minimum fairly easily, as its mutations have a limited scope. The genetic algorithm could track an age measure so that if a chromosome grows old with little improvement, it makes more radical changes in an attempt to escape. More significant mutations, or similar mechanisms such as crossover, will be required to break out of local minima.

More obvious utility extensions would be things like loading data points from external files, and saving the equations found at different ages during mutation so they can be compared later to get better results.

Allowing a single gene to contain two mathematical operations would also be likely to bring efficiency gains.
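As one possible shape for the age idea, the sketch below scales the mutation size by how many generations have passed since the last improvement. Everything here is an assumption made for illustration: the `AgeSketch.MutationScale` name, the doubling rule and the default values are not taken from the program in this article.

```csharp
using System;

public static class AgeSketch
{
    // Returns a mutation step size that grows with `age`, the number of
    // generations since the best parent last improved.
    // While age is below `patience` the normal `baseScale` is used; after that
    // the scale doubles for every `patience` generations, capped at `maxScale`.
    public static double MutationScale(
        int age, double baseScale = 0.05, int patience = 200, double maxScale = 2.0)
    {
        if (age < patience)
        {
            return baseScale;
        }

        int doublings = age / patience;
        double scale = baseScale * Math.Pow(2, doublings);
        return Math.Min(scale, maxScale);
    }
}
```

In the main loop, the age would be reset to 0 whenever a child replaces BestParent and incremented otherwise, with the returned scale used to size the next mutation. The cap keeps a long stagnant run from producing mutations so large that they amount to starting over.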
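For the data-loading part, a small parser is enough to get started. This sketch assumes a plain text file with one `x,y` pair per line, which is a format chosen for the example; `DataPointLoader`, `Parse` and `Load` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

public static class DataPointLoader
{
    // Parses lines of the form "x,y". Blank lines and lines starting
    // with '#' are skipped; anything else that is malformed throws.
    public static List<(double X, double Y)> Parse(IEnumerable<string> lines)
    {
        var points = new List<(double X, double Y)>();
        foreach (string raw in lines)
        {
            string line = raw.Trim();
            if (line.Length == 0 || line.StartsWith("#", StringComparison.Ordinal))
            {
                continue;
            }

            string[] parts = line.Split(',');
            if (parts.Length != 2)
            {
                throw new FormatException($"Expected 'x,y' but found: {line}");
            }

            // Invariant culture so that "0.375" parses the same on every machine.
            double x = double.Parse(parts[0].Trim(), CultureInfo.InvariantCulture);
            double y = double.Parse(parts[1].Trim(), CultureInfo.InvariantCulture);
            points.Add((x, y));
        }
        return points;
    }

    // Reads the file lazily, line by line, and parses it.
    public static List<(double X, double Y)> Load(string path) =>
        Parse(File.ReadLines(path));
}
```

Keeping `Parse` separate from `Load` means the parsing can be tried out on an in-memory array of strings without touching the disk.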

Conclusion

For me personally, genetic programming applied to symbolic regression is one of the most fun things to experiment with in machine learning. It lets you play with machine learning principles and learn intuitively things that can otherwise be difficult to grasp.

I had fun trying out different equations and then copying and pasting the results into an online graphing calculator to view the symbolic regression result.


Taran Marley

A programmer of machine learning systems for stock management. I work at www.esprofessionals.com