Auto Machine Learning Using TPOT

Harrison Miller
5 min read · Aug 15, 2019


TPOT is a Python automated machine learning tool that optimizes machine learning pipelines using genetic programming. TPOT automates one of the most tedious parts of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data. TPOT is built on top of scikit-learn, so all of the code it generates will look familiar if you have used scikit-learn before.

Installation

To install TPOT on your system, run the command ‘pip install tpot’ in your terminal. TPOT is built on top of several existing Python libraries, including numpy, scipy, scikit-learn, DEAP, update_checker, tqdm, stopit, and pandas. Most of these packages come with the Anaconda Python distribution, or you can install them separately. Optionally, you can also install XGBoost if you would like TPOT to use eXtreme Gradient Boosting models.

How It Works

Knowing which model to use for your machine learning project can be a challenge, as there are so many to choose from: Decision Trees, SVM, KNN, and so on. That's where TPOT comes in. TPOT uses genetic programming to find the best model for you. Genetic algorithms are inspired by the Darwinian process of natural selection, and they are used in computer science to generate solutions to optimization and search problems. Genetic algorithms generally have three stages:

  • Selection: you start with a population of candidate solutions to a given problem and a fitness function. At every iteration, you evaluate how fit each solution is using the fitness function.
  • Crossover: you then select the fittest solutions and perform crossover on them to create a new population.
  • Mutation: you take those children, mutate them with small random modifications, and repeat the process until you arrive at the fittest, or best, solution.
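The three stages above can be sketched as a toy genetic algorithm in plain Python. This is a minimal illustration, evolving a bit string toward all 1s, not TPOT's actual implementation; all names and constants are made up for the example:

```python
import random

random.seed(42)

# Toy problem: evolve a bit string toward all 1s.
# The "fitness" of an individual is simply how many 1s it contains.
GENOME_LEN, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 50, 0.05

def fitness(individual):
    return sum(individual)

def select(population):
    # Selection: keep the fitter half of the population.
    ranked = sorted(population, key=fitness, reverse=True)
    return ranked[: len(ranked) // 2]

def crossover(parent_a, parent_b):
    # Crossover: splice two parents at a random cut point.
    cut = random.randint(1, GENOME_LEN - 1)
    return parent_a[:cut] + parent_b[cut:]

def mutate(individual):
    # Mutation: flip each bit with a small probability.
    return [bit ^ 1 if random.random() < MUTATION_RATE else bit
            for bit in individual]

# Start from a random population and iterate the three stages.
population = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
              for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    parents = select(population)
    population = [mutate(crossover(random.choice(parents),
                                   random.choice(parents)))
                  for _ in range(POP_SIZE)]

best = max(population, key=fitness)
print(fitness(best))
```

TPOT applies this same loop to entire scikit-learn pipelines: the "individuals" are pipelines, and the fitness function is cross-validated model performance.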

The TPOTClassifier has a wide variety of parameters. The most notable ones are:

  • generations: Number of iterations to run the pipeline optimization process. (default 100)
  • population_size: Number of individuals to retain in the genetic programming population every generation (default 100)
  • offspring_size: Number of offspring to produce in each genetic programming generation. (default 100)
  • mutation_rate: Mutation rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the GP algorithm how many pipelines to apply random changes to every generation. (default 0.9)
  • crossover_rate: Crossover rate for the GP algorithm in the range [0.0, 1.0]. This parameter tells the GP algorithm how many pipelines to breed every generation. (default 0.1)
  • scoring: Function used to evaluate the quality of a given pipeline for the classification problem, such as accuracy, average_precision, roc_auc, recall, etc. (default accuracy)
  • cv: Cross-validation strategy used when evaluating pipelines. TPOT makes use of sklearn.model_selection.cross_val_score to evaluate pipelines. (default 5)
  • random_state: The seed of the pseudo-random number generator used in TPOT. Use this parameter to make sure that TPOT will give you the same results each time you run it against the same data set with that seed.

TPOT will evaluate population_size + generations × offspring_size pipelines in total.
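As a quick sanity check of that formula (the helper function name here is just for illustration):

```python
# Total pipelines TPOT evaluates:
# population_size + generations * offspring_size
def total_pipelines(population_size, generations, offspring_size):
    return population_size + generations * offspring_size

# With the defaults listed above (100 each):
print(total_pipelines(100, 100, 100))  # 10100
```

So at the default settings, TPOT evaluates on the order of ten thousand candidate pipelines.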

(Images: the classifiers and preprocessors currently available in TPOT)

Running TPOT

You run TPOT on your data set with the ‘.fit’ function:

(Image: an example of TPOT running 50 generations with a population of 50 using the ‘.fit’ function)

You can then proceed to evaluate the final pipeline on the testing set with the .score function.

Finally, you can tell TPOT to export the corresponding Python code for the optimized pipeline to a file with the ‘.export’ function:

(Image: the exported pipeline code)

Once this code finishes running, tpot_exported_pipeline.py will contain the Python code for the optimized pipeline, ready for use in the rest of your project. You could also tune its hyperparameters further from here to suit your needs.

TPOT can also stack classifiers, including the same classifier multiple times. The developers of TPOT explain how it works here:

The pipeline ExtraTreesClassifier(ExtraTreesClassifier(input_matrix, True, 'entropy', 0.10000000000000001, 13, 6), True, 'gini', 0.75, 17, 4) does the following:

Fit all of the original features using an ExtraTreesClassifier

Take the predictions from that ExtraTreesClassifier and create a new feature using those predictions

Pass the original features plus the new “predicted feature” to the 2nd ExtraTreesClassifier and use its predictions as the final predictions of the pipeline

This process is called stacking classifiers, which is a fairly common tactic in machine learning.
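The three steps above can be sketched with plain scikit-learn. This is a simplified illustration of the stacking idea, not TPOT's own code, and it skips the cross-validated predictions a careful stacking setup would use to avoid leakage:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Step 1: fit a first ExtraTreesClassifier on the original features.
first = ExtraTreesClassifier(random_state=42).fit(X_train, y_train)

# Step 2: use its predictions as a new synthetic feature.
train_stacked = np.column_stack([X_train, first.predict(X_train)])
test_stacked = np.column_stack([X_test, first.predict(X_test)])

# Step 3: fit a second ExtraTreesClassifier on the original features plus
# the "predicted feature"; its predictions are the pipeline's final output.
second = ExtraTreesClassifier(random_state=42).fit(train_stacked, y_train)
final_score = second.score(test_stacked, y_test)
print(final_score)
```

The second model can learn to correct cases where the first model is systematically wrong, which is why stacking often squeezes out extra accuracy.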

Limitations

TPOT can take a long time to finish its search, depending on the parameters you give it. This makes sense because it is considering multiple machine learning algorithms in a pipeline with numerous preprocessing steps, the hyperparameters for all of the models and preprocessing steps, as well as multiple ways to ensemble or stack the algorithms within the pipeline. With the default settings (100 generations with a population size of 100), TPOT will evaluate over 10,000 pipeline configurations before finishing. To put that in context, imagine a grid search over 10,000 hyperparameter combinations for a single machine learning algorithm and how long it would take. Another limitation is that TPOT can recommend different solutions for the same dataset. This usually occurs when you don't give TPOT enough time to work and the different runs don't converge, or when multiple pipelines perform more or less the same on your dataset.

If you would like to explore TPOT further for your future projects, visit the TPOT documentation.
