Essential Python for Machine Learning: gplearn

The Formula Finder

Dagang Wei
4 min read · Jan 24, 2024
source: https://github.com/trevorstephens/gplearn

This is the 9th chapter of my book Essential Python for Machine Learning.

Introduction

Machine learning empowers us to extract insights and patterns from the vast amounts of data around us. While libraries like scikit-learn provide powerful numerical algorithms, there’s another fascinating approach to understanding data: symbolic regression. Let’s dive into how Python’s gplearn library unlocks its potential.

What is Symbolic Regression?

At its core, symbolic regression aims to find a mathematical formula that accurately describes a dataset. Unlike traditional regression techniques (like linear or polynomial regression), where you pre-define the model’s structure, symbolic regression searches for both the ideal structure and its parameters. This makes it remarkably adaptable to uncover hidden relationships that might be missed by conventional approaches.

Simple Example

Imagine you have data for the following:

x, y
1, 3
2, 6
3, 11
4, 18

A symbolic regression algorithm might automatically discover that the relationship between x and y is best described by the formula y = x² + 2. This simple example demonstrates how symbolic regression goes beyond fitting a predefined line or curve; it actually discovers the underlying mathematical relationship.
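As a quick sanity check (plain Python, no gplearn needed), evaluating y = x² + 2 at x = 1 through 4 shows what "discovering the formula" means in practice:

```python
# Evaluate the discovered formula y = x**2 + 2 at the tabulated inputs.
def formula(x):
    return x**2 + 2

print([formula(x) for x in [1, 2, 3, 4]])  # -> [3, 6, 11, 18]
```

Symbolic regression's job is to arrive at an expression like this automatically, starting from nothing but the data.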

Symbolic Regression vs. Other Models

  • Numerical Regression: Numerical regression methods (e.g., linear, polynomial) excel when you have a strong intuition about the underlying relationship in your data. They are generally faster to train but require you to specify the functional form of the model beforehand.
  • Neural Networks: Neural networks are universal approximators. They can model complex non-linear relationships but often act as ‘black boxes,’ making it difficult to interpret the exact formula they represent.
  • Symbolic Regression: Symbolic regression bridges the gap. It uncovers human-readable mathematical expressions from your data, enhancing interpretability. This makes it valuable for scientific discovery and scenarios where understanding the relationships between variables is paramount.

When to Use Symbolic Regression vs. Other Models?

The choice of when to use symbolic regression depends heavily on your problem and your priorities:

  • Interpretability is Crucial: If understanding the underlying mathematical relationship between variables is paramount (e.g., in scientific modeling, deriving physical laws, or financial risk modeling), symbolic regression is invaluable.
  • Limited Data: Symbolic regression can work well on smaller datasets, where highly flexible models such as neural networks are prone to overfitting.
  • Suspected Nonlinear Relationships: If you believe there are complex, nonlinear relationships within your data that aren’t easily captured by linear or polynomial models, symbolic regression can explore a much wider space of potential solutions.
  • Domain Knowledge: If you have prior knowledge about potential function forms or variables, you can guide the symbolic regression search by defining specific operations or building blocks.

When to Consider Other Models?

  • Simple is Sufficient: For problems where a linear or simple polynomial relationship is likely, numerical regression methods may suffice and are usually faster.
  • Interpretability Less Important: If your primary goal is high predictive accuracy, and understanding the exact equation driving the predictions isn’t essential, neural networks might be the better choice.
  • Massive Datasets: Symbolic regression can become computationally expensive on very large datasets, even with an efficient implementation like gplearn’s. Neural networks often handle that scale more gracefully.

Genetic Programming for Symbolic Regression

Symbolic regression often employs genetic programming (GP) as its search engine. Modeled after biological evolution, GP works by:

  • Generating Populations: It starts with a random population of mathematical expressions.
  • Fitness Evaluation: Each expression is evaluated on how well it fits the dataset.
  • Selection: The best-fitting expressions are ‘selected’ for the next step.
  • Mutation & Crossover: Similar to biological evolution, selected expressions undergo ‘mutation’ (random changes) and ‘crossover’ (combining parts) to create a new generation of potential solutions.

This process of evaluation, selection, mutation, and crossover iterates until a highly accurate mathematical formula emerges.
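The four steps above can be sketched in a few dozen lines. The toy below is an illustrative sketch, not gplearn’s actual internals: expressions are nested tuples built from `add`, `mul`, the variable `x`, and integer constants, and the fitness includes a small size penalty (in the spirit of gplearn’s `parsimony_coefficient`) to keep formulas compact.

```python
import random

random.seed(0)

# Expressions are nested tuples: ('add', l, r), ('mul', l, r),
# the variable 'x', or an integer constant.

def random_expr(depth=2):
    if depth == 0 or random.random() < 0.3:
        return 'x' if random.random() < 0.5 else random.randint(-3, 3)
    return (random.choice(['add', 'mul']),
            random_expr(depth - 1), random_expr(depth - 1))

def evaluate(e, x):
    if e == 'x':
        return x
    if isinstance(e, int):
        return e
    op, l, r = e
    a, b = evaluate(l, x), evaluate(r, x)
    return a + b if op == 'add' else a * b

def size(e):
    return 1 if not isinstance(e, tuple) else 1 + size(e[1]) + size(e[2])

def fitness(e, data):
    # Mean absolute error plus a size penalty (lower is better).
    mae = sum(abs(evaluate(e, px) - py) for px, py in data) / len(data)
    return mae + 0.01 * size(e)

def mutate(e):
    if random.random() < 0.1:
        return random_expr()          # replace this subtree wholesale
    if not isinstance(e, tuple):
        return e
    op, l, r = e
    return (op, mutate(l), mutate(r))

def crossover(a, b):
    # Crude subtree crossover: graft b somewhere inside a.
    if not isinstance(a, tuple) or random.random() < 0.3:
        return b
    op, l, r = a
    return (op, crossover(l, b), r)

data = [(px, px**2 + 2) for px in range(-5, 6)]     # target: y = x**2 + 2
population = [random_expr() for _ in range(200)]    # 1. generate population
initial_best = min(fitness(e, data) for e in population)

for _ in range(20):
    population.sort(key=lambda e: fitness(e, data))  # 2. fitness evaluation
    survivors = population[:50]                      # 3. selection
    children = [crossover(random.choice(survivors),  # 4. crossover...
                          random.choice(survivors)) for _ in range(100)]
    mutants = [mutate(random.choice(survivors)) for _ in range(50)]  # ...and mutation
    population = survivors + children + mutants

best = min(population, key=lambda e: fitness(e, data))
best_fit = fitness(best, data)
print(best, best_fit)
```

Because the best survivors carry over unchanged each generation, the best fitness can only improve over time; gplearn implements the same loop with far more sophisticated operators and bookkeeping.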

gplearn: Your Gateway to Symbolic Regression

gplearn is a fantastic Python library built specifically for symbolic regression using genetic programming. Here's why it stands out:

  • Flexibility: It offers different fitness functions, selection methods, and genetic operators to tailor the search process to your specific problem.
  • scikit-learn Integration: Its estimators follow scikit-learn’s fit/predict conventions, so they slot seamlessly into your existing machine learning workflows.
  • Customizable: You can define your own mathematical building blocks, ensuring the discovered equations align with your domain knowledge.

Code Example

Let’s see gplearn in action with a simple example:

import numpy as np
from gplearn.genetic import SymbolicRegressor
import matplotlib.pyplot as plt

# Generate synthetic data (x never hits 0, so log2(x**2) stays finite)
np.random.seed(0)
x = np.linspace(-10, 10, 100)
y = x + np.log2(x**2) + 3 * np.sin(x) + np.random.normal(0, 0.1, 100)

# Create the symbolic regressor
function_set = ['add', 'sub', 'mul', 'div', 'sqrt', 'log',
                'abs', 'sin', 'cos', 'tan']
symbolic_regressor = SymbolicRegressor(population_size=5000,
                                       generations=20,
                                       function_set=function_set,
                                       stopping_criteria=0.01,
                                       p_crossover=0.6,
                                       p_subtree_mutation=0.2,
                                       p_hoist_mutation=0.05,
                                       p_point_mutation=0.1,
                                       max_samples=0.9,
                                       verbose=1,
                                       parsimony_coefficient=0.01,
                                       random_state=0)

# Fit the model
symbolic_regressor.fit(x.reshape(-1, 1), y)

# Print the best evolved program
print(symbolic_regressor._program)

# Predictions
y_pred = symbolic_regressor.predict(x.reshape(-1, 1))

# Plot the actual data against the model's predictions
plt.figure(figsize=(10, 6))
plt.scatter(x, y, label="Actual Data")
plt.plot(x, y_pred, color='red', label="Symbolic Regression Model")
plt.legend()
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Symbolic Regression Example")
plt.show()

This snippet trains a symbolic regression model and prints the best evolved program, an expression that should closely approximate the function used to generate the data.
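gplearn prints programs in a prefix notation such as `add(mul(X0, X0), 2.0)`. A handy follow-up (my own sketch, not a gplearn feature) is converting that string into a sympy expression for simplification. The program string below is a hypothetical example of what `str(symbolic_regressor._program)` might return:

```python
import sympy as sp

# Hypothetical program string; in practice use str(symbolic_regressor._program).
program_str = "add(mul(X0, X0), 2.0)"

# Map gplearn's function names (only those in this article's function_set)
# onto sympy operations, and X0 onto a symbol.
locals_map = {
    'add': lambda a, b: a + b,
    'sub': lambda a, b: a - b,
    'mul': lambda a, b: a * b,
    'div': lambda a, b: a / b,
    'sqrt': sp.sqrt, 'log': sp.log, 'abs': sp.Abs,
    'sin': sp.sin, 'cos': sp.cos, 'tan': sp.tan,
    'X0': sp.Symbol('x'),
}
expr = sp.simplify(sp.sympify(program_str, locals=locals_map))
print(expr)
```

This turns the raw program into a readable algebraic form, which is exactly the interpretability payoff symbolic regression promises.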

Conclusion

By harnessing the power of gplearn, you can unlock the secrets hidden within your data, gaining both accurate predictions and valuable insights into the underlying relationships. So, dive into the fascinating world of symbolic regression and empower your machine learning journey with interpretable and explainable models!
