Automating the machine learning model selection process

The traditional machine learning model selection process is largely iterative, with data scientists searching for the best model and the best hyperparameters to fit a given dataset. Following the philosophy I've learnt from the fast.ai courses, I've automated the machine learning classification process I have been using into a Python class, built from elements of scikit-learn, pandas, numpy, scipy and bokeh. This blog post is an introduction to the process; a more comprehensive example can be found here.

The intended audience is data analysts learning data science, with a few weeks of Python experience and a basic understanding of numpy and pandas. For new learners, this can serve as a top-down way to learn the process. This post, along with the code example, will let you:

  1. Pre-process the data to build a model on
  2. Split the data into train/dev sets
  3. Perform one-hot encoding using scikit-learn's DictVectorizer and save it for later use
  4. Use scikit-learn's learning curve to select the optimum training set size for hyperparameter optimization, with a t-test to decide the optimum data size. This makes the learning process more scalable for large datasets
  5. Perform hyperparameter optimization using k-fold cross validation on the training set, and select a model for the training task from a gradient boosting classifier, random forest, logistic regression, naive Bayes or an SVM (steps 3 to 5 are sketched in plain scikit-learn after this list)
  6. Get the accuracy/F1 score metrics for the models on the training and a test set
  7. Decide which model to apply on the final test set to get an estimate of real world performance
  8. Get performance estimates based on the Lorenz curve for the models on the train/dev/test sets
  9. Apply the model on the full data and save it to use later using another class in the module
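
To make steps 3 to 5 concrete, here is a minimal sketch in plain scikit-learn, pandas and scipy of the kind of logic the class wraps. Treat it as illustrative rather than the module's exact implementation: the file name, the outcome column, the parameter grid and the 0.05 threshold are assumptions for the example, not the module's defaults.

import numpy as np
import pandas as pd
from scipy.stats import ttest_rel
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import GridSearchCV, learning_curve

# Hypothetical training file with an integer target column named "outcome"
train = pd.read_csv("train.csv")
y = train["outcome"].values
features = train.drop(columns=["outcome"])

# Step 3: one-hot encode with DictVectorizer and keep the fitted vectorizer for later use
dv = DictVectorizer(sparse=False)
X = dv.fit_transform(features.to_dict(orient="records"))

# Step 4: learning curve plus a paired t-test to find a training size
# beyond which the cross-validated score stops improving significantly
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
    scoring="neg_log_loss", n_jobs=-1)
for i in range(len(sizes) - 1):
    _, p_value = ttest_rel(val_scores[i], val_scores[i + 1])
    if p_value > 0.05:  # no significant gain from adding more rows
        print(f"{sizes[i]} rows look sufficient for hyperparameter search")
        break

# Step 5: hyperparameter optimization with k-fold cross validation
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5]},
    scoring="neg_log_loss", cv=5, n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)

The class automates this loop across several model families and scorers; the sketch only shows a single random forest to keep the idea visible.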

If all of the above feels daunting (and if you're just starting out in ML, it probably does), you can find all the code and theory in Sebastian Raschka's Python Machine Learning, which is what I used to learn all of the above.

Instantiating the class

The module is meant to work as a Python class and can be used with one line of code at instantiation time; the only thing left to do afterwards is choose the best-fit model.

import Model_selection
ms = Model_selection.model_train(modelBase=train, target="outcome", learning_curve=False, scoring="neg_log_loss", automated=True, n_jobs=-1, models=["random_forest"])

The following are the most important parameters (the rest are described in the class docstring):

  1. modelBase: the training data
  2. target: the target class column name to predict; the target values must be integers
  3. scoring: any one of scikit learn’s scorers (the default is roc_auc) — http://scikit-learn.org/stable/modules/model_evaluation.html
  4. n_jobs: number of CPU cores to use. -1 will use all cores but may slow down the computer
  5. learning_curve: if true, scikit learn’s learning curve will be used to determine the ideal training set size for cross validation
  6. automated: if false, you'll have to call the functions apply_models() and evaluate_models() yourself to get the performance characteristics. This lets you change the default target value of 1 used to model the Lorenz curve characteristics
  7. models: the models we want to test. We can use any or all of gbm, svm, random_forest, logistic_regression and naive_bayes
  8. categorical_columns: a list of columns that will be treated as categorical. Especially useful for columns that are integers but need to be used as categories. Python changes the type back at times, so I prefer to pad these values with an underscore (see the sketch after this list)
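
To make items 6 and 8 concrete, here is a rough sketch of how integer-coded categorical columns can be padded and how the non-automated flow runs. The file name and column names are made up, and I'm assuming apply_models() and evaluate_models() take no arguments; treat it as illustrative rather than the module's exact API.

import pandas as pd
import Model_selection

train = pd.read_csv("train.csv")  # hypothetical training file

# Integer columns that are really categories: pad the values with an underscore
# so they stay strings and don't get silently converted back to numbers
categorical_columns = ["store_id", "region_code"]  # hypothetical column names
for col in categorical_columns:
    train[col] = train[col].astype(str) + "_"

# automated=False: apply and evaluate the models as separate, explicit steps
ms = Model_selection.model_train(modelBase=train, target="outcome", scoring="roc_auc",
                                 automated=False, n_jobs=-1,
                                 models=["random_forest", "logistic_regression"],
                                 categorical_columns=categorical_columns)
ms.apply_models()
ms.evaluate_models()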

Selecting the best model and testing on a final test set

ms.select_retrain_model(["random_forest"]) # selected random forest
ms.test_on_holdout_data(X=X_holdout, y=y_holdout) # to test on the holdout sample; X_holdout/y_holdout are placeholder names for your held-out features and labels
predictions = Model_selection.model_apply(new_data, model="random_forest.pkl", predict_proba=False)
Select the best model; test on the holdout set and retrain on the full set to use in production

To save the model objects for later use, just call the function save_model_objects; the saved models can then be loaded and applied with the helper function model_apply in the module.
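
The save-and-reuse flow then looks roughly like this, continuing from the ms object above. Whether save_model_objects takes any arguments is an assumption on my part; model_apply is used exactly as shown earlier.

ms.save_model_objects()  # persist the fitted model objects, e.g. random_forest.pkl (assumed no-argument call)

# later, in a separate scoring script
import Model_selection
new_data = ...  # a DataFrame of new observations with the same columns as the training data
predictions = Model_selection.model_apply(new_data, model="random_forest.pkl", predict_proba=True)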

This should get you started building pretty good classification models for work or for online competitions, which you can then try to beat by modifying the defaults, adding your own functions or even building your own automated methods. A more comprehensive code example, with a link to the data used, can be found here. Try it on the sample Kaggle data and then on your own. The module can be downloaded from here.

If you have any questions or ideas to share regarding AI and machine learning, reach out to me at kanishkd4@gmail.com, on Twitter @kanishkd4 or on LinkedIn.