Python Symbolic Regression with gplearn: how to discover analytical relationships in your data

Andrea Castiglioni
Published in Analytics Vidhya
7 min read · Aug 3, 2020

In this tutorial I want to introduce you to Genetic Programming in Python with the library gplearn.

Symbolic regression is a machine learning technique that aims to identify an underlying mathematical expression that best describes a relationship. It begins by building a population of naive random formulas to represent a relationship between known independent variables and their dependent variable targets in order to predict new data. Each successive generation of programs is then evolved from the one that came before it by selecting the fittest individuals from the population to undergo genetic operations.

1. Introduction

Imagine you are a scientist working in any field. Your work can roughly be split into three steps: first, gather the data; second, propose a phenomenological description of those data; third, explain the data and the phenomenological observations from first principles.
Imagine you are about to gather data and discover gravity. There are three scientists we should remember in this field. Tycho Brahe (data acquisition) took extensive and precise measurements of the positions of the planets over time.
Johannes Kepler (data analysis) derived, from Brahe’s measurements, analytical expressions that describe the motion of the solar system in a concise manner.
Finally, the theorist, Isaac Newton, realized that the mechanisms underlying the planets travelling around the Sun could be formulated into a universal law (derivation from first principles).

Machine learning (ML) models are currently the tools of choice for uncovering such physical laws. Although they have shown promising performance in predicting material properties, typical parameterized machine learning models do not allow the ultimate step of scientific research: automating Newton. Consider, for example, an ML/deep learning algorithm that can predict with a perfect score the number of infected people in a global pandemic: if it cannot explain that number, it is of limited use. That is why simpler mechanistic models like SIR are still widely used. ML models can be predictive, but their descriptions are often too verbose (e.g. deep learning models with thousands of parameters) or mathematically restrictive (e.g. assuming the target variable is a linear combination of the input features).

In this post we focus on an alternative to such machine learning models: symbolic regression. Symbolic regression searches simultaneously for the optimal form of a function and its set of parameters, and it is a powerful regression technique when little if any a-priori knowledge of the data structure or distribution is available.

Genetic programming flowchart depicting the iterative solution-finding process. Source: arXiv.

2. Initial dataset and data analysis

We generate the data as we did in the regression tutorial. The only difference is that we change the number of samples and the coefficients to obtain a more complex function: the result is the combination of a linear term in x, a sin(x) term and a cubic term.

import numpy as np
import pandas as pd

nsample = 400
sig = 0.2
# evenly spaced x values and a design matrix with a linear, a sine, a cubic and a constant term
x = np.linspace(-50, 50, nsample)
X = np.column_stack((x/5, 10*np.sin(x), (x-5)**3, np.ones(nsample)))
beta = [0.01, 1, 0.001, 5.]
# noiseless target plus Gaussian noise
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
df = pd.DataFrame()
df['x'] = x
df['y'] = y

The shape of the curve is illustrated below:

Function to be predicted.
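The figure itself is not reproduced here; a minimal matplotlib sketch (assuming matplotlib is available) that shows the same curve would be:

import matplotlib.pyplot as plt

# plot the noisy samples together with the noiseless target
plt.figure(figsize=(10, 5))
plt.scatter(df['x'], df['y'], s=5, alpha=0.5, label='noisy samples')
plt.plot(x, y_true, color='red', label='true function')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()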

3. GPlearn imports and implementation

We will import SymbolicRegressor from gplearn, as well as the decision tree and random forest regressors from sklearn, against which we will compare the results.

We also import from sympy: in particular we will use sympify to make the resulting expression more readable.

Basic libraries to import from gplearn, sklearn and sympy.
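The import block appears in the original article only as an image; a reasonable sketch of what it contains, based on the caption and on the code that follows, is:

from gplearn.genetic import SymbolicRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sympy import sympify, sin, cos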

Since we will be comparing the results of different approaches, we split our dataset into train and test sets:

X = df[['x']]
y = df['y']
y_true = y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

Our first fit will be extremely simple: in the function set we include most, if not all, of the default functions that come with the library. Then we initialize the regressor with some standard parameters.

With this code we tell the regressor to use the functions in the function_set list. The number of generations (evolutions) of the expression will be 40, with a population of 5000 individuals. Those 5000 programs are combined with each other to form new expressions, which are then evaluated; by default the fitness function measures the error with the mean absolute error metric. We also set stopping_criteria=0.01: if this fitness value is reached before the last generation, the evolution stops early. Then there are four parameters linked to evolution: p_crossover, p_subtree_mutation, p_hoist_mutation and p_point_mutation. For more detail about their specific use see the documentation; one thing you should note, however, is that their sum must be less than or equal to 1. We can also try to control the total length of our expression with parsimony_coefficient: the higher this parameter, the shorter the regressor will try to keep the expression. Finally, with max_samples=0.9 each program is evaluated on a random 90% subsample of the data, so an out-of-bag fitness can be reported as well.

# First Test
function_set = ['add', 'sub', 'mul', 'div', 'cos', 'sin', 'neg', 'inv']
est_gp = SymbolicRegressor(population_size=5000,
                           function_set=function_set,
                           generations=40, stopping_criteria=0.01,
                           p_crossover=0.7, p_subtree_mutation=0.1,
                           p_hoist_mutation=0.05, p_point_mutation=0.1,
                           max_samples=0.9, verbose=1,
                           parsimony_coefficient=0.01, random_state=0,
                           feature_names=X_train.columns)

We define a dictionary called converter: its use will become clear later, but essentially we need it to convert a function name, i.e. a string, into its equivalent Python/sympy mathematical operation.

converter = {
    'sub': lambda x, y: x - y,
    'div': lambda x, y: x/y,
    'mul': lambda x, y: x*y,
    'add': lambda x, y: x + y,
    'neg': lambda x: -x,
    'pow': lambda x, y: x**y,
    'sin': lambda x: sin(x),
    'cos': lambda x: cos(x),
    'inv': lambda x: 1/x,
    'sqrt': lambda x: x**0.5,
    'pow3': lambda x: x**3
}
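To see how the converter works, here is a hypothetical example (not an actual output of this run): gplearn prints its programs in prefix notation, and sympify uses the converter to map each function name to the corresponding mathematical operation.

# hypothetical program string in gplearn's prefix notation
expr = sympify('add(mul(x, x), sin(x))', locals=converter)
print(expr)  # x**2 + sin(x)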

We then fit the data, evaluate the R2 score and, with the help of sympify, print the resulting expression:

est_gp.fit(X_train, y_train)
print('R2:', est_gp.score(X_test, y_test))
next_e = sympify(str(est_gp._program), locals=converter)
next_e

With verbose=1 the output will look something like this: the first column indicates the generation number, followed by the average length and fitness of the whole population (5000 individuals). On the right we have the length, fitness and out-of-bag fitness of the best individual.

With the code above, the result will be like:

R squared and analytical function from gplearn.

4. Comparing Gplearn to traditional ML approaches

We reached a good level of correlation with the original expression. However, I must say that the resulting equation is particularly difficult to interpret! Let's see how it compares with standard ML tools:

est_tree = DecisionTreeRegressor(max_depth=5)
est_tree.fit(X_train, y_train)
est_rf = RandomForestRegressor(n_estimators=100,max_depth=5)
est_rf.fit(X_train, y_train)
y_gp = est_gp.predict(X_test)
score_gp = est_gp.score(X_test, y_test)
y_tree = est_tree.predict(X_test)
score_tree = est_tree.score(X_test, y_test)
y_rf = est_rf.predict(X_test)
score_rf = est_rf.score(X_test, y_test)

From the scores above, the result does not look impressive at first: both ML tools scored slightly better than our analytical expression:

RF score: 0.993
DT score: 0.992
GPlearn score: 0.978

However, if we plot the results, we can see that our analytical function is actually better in the middle range, while it goes off at the extreme points.

fig = plt.figure(figsize=(12, 10))
for i, (y, score, title) in enumerate([(y_test, None, "Ground Truth"),
                                       (y_gp, score_gp, "SymbolicRegressor"),
                                       (y_tree, score_tree, "DecisionTreeRegressor"),
                                       (y_rf, score_rf, "RandomForestRegressor")]):
    ax = fig.add_subplot(2, 2, i+1)
    points = ax.scatter(X, y_true, color='green', alpha=0.5)
    test = ax.scatter(X_test, y, color='red', alpha=0.5)
    plt.title(title)
plt.show()
Comparison of full data (green points) and test data (red points) for three different approaches. Top left indicates original data.
Above: relationship between y_pred and y_true. Below: prediction error for the three algorithms.
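The code that produces these panels is not shown in the original post; a minimal sketch that generates similar predicted-versus-true and error plots, assuming the variables defined above, could be:

# sketch: predicted vs. true values (top row) and prediction error (bottom row) for the three models
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for j, (y_pred, title) in enumerate([(y_gp, 'SymbolicRegressor'),
                                     (y_tree, 'DecisionTreeRegressor'),
                                     (y_rf, 'RandomForestRegressor')]):
    axes[0, j].scatter(y_test, y_pred, alpha=0.5)
    axes[0, j].set_xlabel('y_true')
    axes[0, j].set_ylabel('y_pred')
    axes[0, j].set_title(title)
    axes[1, j].scatter(X_test, y_test - y_pred, alpha=0.5)
    axes[1, j].set_xlabel('x')
    axes[1, j].set_ylabel('prediction error')
plt.tight_layout()
plt.show()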

5. Improving the result

By looking at the function and using some prior mathematical knowledge, we can improve our algorithm by feeding it additional functions. gplearn allows us to create custom functions, and here we will implement a simple cubic function.

from gplearn.functions import make_function

def pow_3(x1):
    f = x1**3
    return f

pow_3 = make_function(function=pow_3, name='pow3', arity=1)

# add the new function to the function_set
function_set = ['add', 'sub', 'mul', 'div', 'cos', 'sin', 'neg', 'inv', pow_3]

est_gp = SymbolicRegressor(population_size=5000,
                           function_set=function_set,
                           generations=45, stopping_criteria=0.01,
                           p_crossover=0.7, p_subtree_mutation=0.1,
                           p_hoist_mutation=0.05, p_point_mutation=0.1,
                           max_samples=0.9, verbose=1,
                           parsimony_coefficient=0.01, random_state=0,
                           feature_names=X_train.columns)
est_gp.fit(X_train, y_train)
print('R2:', est_gp.score(X_test, y_test))
next_e = sympify(str(est_gp._program), locals=converter)
next_e
Result of gplearn

The R2 is now really close to 1; also note how much simpler our equation is! If we compare it again to the traditional ML tools, we can really appreciate the improvement brought by this step.

Above: relationship between y_pred and y_true. Below: prediction error for the three algorithms.
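The comparison figure is not reproduced here, but a quick way to verify the improvement is to print the three R2 scores again after the new fit (the exact values depend on the run):

# re-evaluate the three models on the test set
print('GPlearn (with pow3) R2:', est_gp.score(X_test, y_test))
print('Decision tree R2:', est_tree.score(X_test, y_test))
print('Random forest R2:', est_rf.score(X_test, y_test))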

6. Conclusion

I think symbolic regression is a great tool to be aware of. It may not be perfect for every kind of problem, but it gives you another option that can be really useful, since the outcome is readily understandable.

