Symbolic Regression From Scratch in C# Part 1

Taran Marley
3 min read · Aug 16, 2019

The aim of this series is for an intermediate C# programmer to build a useful symbolic regression algorithm that they can easily extend in whatever direction fits their needs. I believe the best way to understand any machine learning concept is to implement it, and that is what I seek to facilitate here.

I completed this using .NET Core 2.2 with C#, though the from-scratch nature of this project means that a different C# setup is unlikely to cause issues.

All the code for this project can be found at https://github.com/Benzidrine/SymbolicRegression/

Definition:

Symbolic regression means taking a dataset and having a machine formulate an equation that best matches the data.

An example dataset of x,y values of [0,1], [1,2], [2,3], [3,4] is easily solvable by a human as: y = x + 1.

However, something like [0,0], [1,1], [2,0.14], [3,-0.98], [4,-0.28] presents a far more difficult challenge, even though it comes from a simple equation: y = sin(x*1.5)
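To check a dataset like this one, the quoted equation can be evaluated at each x value and rounded to two decimals. This is only a minimal sketch; the class and method names are hypothetical and not part of the project:

```csharp
using System;

public static class ExampleData
{
    // Evaluates y = sin(x * 1.5), the equation quoted in the text.
    public static double Target(double x) => Math.Sin(x * 1.5);

    // Prints the [x, y] pairs for x = 0..4, with y rounded to two decimals.
    public static void Print()
    {
        for (int x = 0; x <= 4; x++)
        {
            Console.WriteLine($"[{x}, {Math.Round(Target(x), 2)}]");
        }
    }
}
```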

This is where a computer and genetic programming can help us produce equations from limited data sets. This can be useful in real world scenarios for determining patterns without human bias and in forecasting.

The Project:

I created a new .NET console app for convenience in these early development stages. I generally start small projects by defining the models, and that is the approach I’ve taken here. Three models are required for the project: a Geneset model that defines the types of genetic expression, a Gene model to contain the expression plus its strength, and a Chromosome model to contain a list of genes.

Geneset:

The geneset for our purposes is the set of mathematical operations available to the genetic algorithm.

The geneset is an enum of the primitive mathematical expressions that, combined, can generate approximations to quite a few functions. This enum neatly defines the primitive that each gene will be able to contain. In later parts we will likely remove some of these and add others.
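As a rough idea of what such an enum could look like, here is a sketch. The member names are my assumption for illustration and may differ from the ones in the repository:

```csharp
// Hypothetical geneset: each member is one primitive operation a gene can hold.
// The exact members are an assumption; the project may use a different set.
public enum Geneset
{
    Addition,
    Subtraction,
    Multiplication,
    Division,
    Sine
}
```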

Gene:

A gene here is a single operation drawn from the Geneset, paired with a value that represents the strength at which that operation will be applied.
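A gene model along these lines might be as small as the sketch below. The property names `Operation` and `Value` are assumptions, and the enum is repeated (with hypothetical members) only so the snippet compiles on its own:

```csharp
// Hypothetical geneset, repeated here so the snippet is self-contained.
public enum Geneset { Addition, Subtraction, Multiplication, Division, Sine }

// One gene: which primitive to apply, and how strongly.
public class Gene
{
    // The primitive operation this gene expresses (assumed property name).
    public Geneset Operation { get; set; }

    // The strength of the operation, e.g. the 1.5 in "multiply by 1.5".
    public double Value { get; set; }
}
```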

Chromosome:

The chromosome stores a list of Genes and contains a lot of methods related to its function. Here we are getting to the heart of the project. The list of genes contained in the chromosome is the mathematical equation that we are trying to make best fit the data.

The chromosome GenerateParent method is called to generate a random list of genes using the GenerateGene method. In our symbolic regression program we will call this method to generate a random parent and then mutate it for better fitness. This is how the process can get started.
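The sketch below shows one way GenerateParent and GenerateGene could fit together. The method names come from the text, but the signatures, the `length` parameter, the enum members, and the value range of -5 to 5 are all assumptions made for illustration:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical geneset and gene model, repeated so the snippet is self-contained.
public enum Geneset { Addition, Subtraction, Multiplication, Division, Sine }

public class Gene
{
    public Geneset Operation { get; set; }
    public double Value { get; set; }
}

public class Chromosome
{
    private static readonly Random Rng = new Random();

    public List<Gene> Genes { get; } = new List<Gene>();

    // Picks a random operation and a random strength.
    // The range [-5, 5) is an arbitrary assumption.
    public static Gene GenerateGene()
    {
        var operations = (Geneset[])Enum.GetValues(typeof(Geneset));
        return new Gene
        {
            Operation = operations[Rng.Next(operations.Length)],
            Value = Rng.NextDouble() * 10.0 - 5.0
        };
    }

    // Builds a random parent with the requested number of genes.
    public static Chromosome GenerateParent(int length)
    {
        var parent = new Chromosome();
        for (int i = 0; i < length; i++)
        {
            parent.Genes.Add(GenerateGene());
        }
        return parent;
    }
}
```

Calling `Chromosome.GenerateParent(5)` would give a chromosome with five random genes, ready to be mutated.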

Mutating the chromosome is very easy: we use the same GenerateGene method we used to generate the parent, but apply it to a single random entry in the Chromosome’s genes.
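A mutation step of this kind can be sketched as follows. To keep the snippet standalone I have written it generically, with the gene generator passed in as a delegate; the helper class, its signature, and the returned index are assumptions, not the project's actual method:

```csharp
using System;
using System.Collections.Generic;

public static class MutationSketch
{
    // Replaces one randomly chosen entry with a freshly generated gene.
    // Returns the index that was replaced, which is handy for debugging.
    public static int Mutate<T>(IList<T> genes, Func<T> generateGene, Random rng)
    {
        if (genes.Count == 0)
        {
            throw new ArgumentException("Cannot mutate an empty gene list.", nameof(genes));
        }

        int index = rng.Next(genes.Count); // uniform pick in [0, Count)
        genes[index] = generateGene();
        return index;
    }
}
```

In the project itself, `generateGene` would simply be the chromosome's GenerateGene method.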

Now that we have a chromosome, we need to establish its fitness, which is where the results of its genetic expression are compared to the dataset. This will involve a call to another class, which we will get to later, to compute the list of genes. The fitness in our project will be the amount of error, with 0 being the least possible error. To find this, we take the difference between the Chromosome’s Y prediction for an X value and the known Y value.
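Here is a small sketch of an error-based fitness. Since the code that computes the genes comes later, a `predict` delegate stands in for the chromosome's prediction. Summing the absolute differences over all points is my assumption about how the per-point differences are combined, and the names are hypothetical:

```csharp
using System;
using System.Collections.Generic;

public static class FitnessSketch
{
    // Sums |predicted Y - known Y| over the dataset; 0 means a perfect fit.
    // `predict` is a stand-in for evaluating the chromosome's genes at x.
    public static double TotalError(
        Func<double, double> predict,
        IEnumerable<(double X, double Y)> data)
    {
        double error = 0.0;
        foreach (var (x, y) in data)
        {
            error += Math.Abs(predict(x) - y);
        }
        return error;
    }
}
```

With the first example dataset, a candidate of y = x + 1 gives an error of 0, while y = x is off by 1 at each of the four points.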

We follow this with a few utility functions. A display function will print out the mathematical equation represented by the list of genes and the clone function will help us make copies of the chromosome that are not linked in memory.
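The two utilities might look roughly like the sketch below. The Display here only lists each gene as Operation(value), because how genes combine into an equation is covered later. The enum members, property names, and output format are assumptions:

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical geneset and gene model, repeated so the snippet is self-contained.
public enum Geneset { Addition, Subtraction, Multiplication, Division, Sine }

public class Gene
{
    public Geneset Operation { get; set; }
    public double Value { get; set; }
}

public class Chromosome
{
    public List<Gene> Genes { get; } = new List<Gene>();

    // Simplified display: one "Operation(value)" entry per gene, in order.
    public string Display()
    {
        return string.Join(" ", Genes.Select(g =>
            g.Operation + "(" + g.Value.ToString("0.##", CultureInfo.InvariantCulture) + ")"));
    }

    // Deep copy: each gene is duplicated, so changing the clone's genes
    // never affects the original chromosome.
    public Chromosome Clone()
    {
        var copy = new Chromosome();
        foreach (var gene in Genes)
        {
            copy.Genes.Add(new Gene { Operation = gene.Operation, Value = gene.Value });
        }
        return copy;
    }
}
```

Copying each gene, rather than just the list, is what keeps the clone from being linked in memory to the original.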

Now that we have all the object models defined, we need only create a function to compute the outcome of the equation contained in the list of genes, and a console interface that allows us to use it all to analyze data and generate the best-fit equation.

This is handled in part 2.


Taran Marley

A programmer of machine learning systems for stock management. I work at www.esprofessionals.com