Comparing AutoML/Non-Auto-ML Multi-Classification Models

Zeineb Ghrib · Published in The Startup · 7 min read · Oct 9, 2020

Introduction

In this post we will show how to use the Prevision.io SDK to create a multi-classification use case using the white wine quality data from the UCI Machine Learning Repository. The machine learning objective is to predict white wine quality from chemical characteristics such as acidity, pH, density, and sulphates.

Furthermore, we will compare Prevision's performance with hand-coded models, and show how both approaches can be compared within exactly the same scope (same cross-validation folds and test evaluation) despite the black-box nature of the auto-ML solution offered by the Prevision platform.

Check out my previous post to see how to install the SDK; if you want to test it, you have free trial access on the public cloud instance.

Auto-ML approach:

Let’s get the dataset:

import pandas as pd
df = pd.read_csv('winequality-white.csv', sep=';')

Create train / test subsets

Let's create a sub-sample (about 20% of the overall dataset) that we will use as a holdout dataset, in order to evaluate the generalization error of our models. This sub-sample will be put aside and not used for training; hence, we will find out how well our models perform on new data (not seen during the training phase).

There are many ways to create this sample; the simplest is to use scikit-learn's train_test_split():

from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

During the feature engineering step you can construct two types of features:

  • Business derived features: for example, if we have some knowledge in chemistry we can combine fixed acidity, volatile acidity and citric acid into a new explanatory feature (see the short sketch after this list).
  • Transformation based features: these features are derived from ML operations such as scaling, encoding, normalization, PCA components… to create new features more valuable to the models.
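
Here is a minimal sketch of what a hand-crafted business feature could look like; the two derived columns below are purely illustrative, not validated chemistry:

# purely illustrative business features, computed before the train/test split
# (their chemical relevance is an assumption made for this example)
df['acidity_ratio'] = df['fixed acidity'] / (df['volatile acidity'] + 1e-6)
df['total_acidity'] = df['fixed acidity'] + df['volatile acidity'] + df['citric acid']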

The second kind of feature engineering is supported by the platform: once you launch a use case on your dataset, you can select the transformations you want to apply, and they will be automatically computed and added as stand-alone features. To get more information about the feature transformations supported by the platform, consult this link.

Launch the auto-ML pipeline in about 10 lines of code:

Step 0: Connect to your instance via the SDK

import previsionio as pio
import pandas as pd

URL = 'https://XXXX.prevision.io'
TOKEN = '''YOUR_MASTER_TOKEN'''

# initialize client workspace
pio.prevision_client.client.init_client(URL, TOKEN)

Check my previous post to see how to get your MASTER_TOKEN.

Step 1: Register the train/test subsets as Prevision datasets

To launch a new use case you need to store the datasets in your workspace:

pio_train = pio.Dataset.new(name='white-wine-train', dataframe=train_set)
pio_test = pio.Dataset.new(name='white-wine-test', dataframe=test_set)

Step 2: Dataset and training configuration

To launch a new use case, you need to define some configuration parameters, and then simply use the SDK's BaseUsecase derived methods to have the platform automatically start everything for you. For example, here we need to tell the platform that the target column is quality:

col_config = pio.ColumnConfig(target_column='quality')

Now we will also specify some training parameters, such as which models are used, which transformations are applied, and how the models are optimized:


uc_config = pio.TrainingConfig(
    models=[pio.Model.LinReg, pio.Model.LightGBM, pio.Model.RandomForest],
    simple_models=[],
    features=pio.Feature.Full.drop(pio.Feature.PCA,
                                   pio.Feature.KMeans,
                                   pio.Feature.PolynomialFeatures),
    profile=pio.Profile.Quick)

To get the exhaustive list of `previsionio.ColumnConfig` and `previsionio.TrainingConfig` parameters, check out this Churn tutorial.

Step 3: Launch a multi-classification use case

Now we will use the `fit()` method of the `previsionio.MultiClassification` class, selecting log loss as the performance metric. Here is a good post to understand the log loss and binary cross-entropy metrics.
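
For intuition, here is a toy example (independent of Prevision) showing how multi-class log loss penalizes confident mistakes:

from sklearn.metrics import log_loss
# three samples, three classes; the last prediction is confident but wrong
y_true = [0, 1, 2]
y_proba = [[0.8, 0.1, 0.1],
           [0.2, 0.7, 0.1],
           [0.6, 0.3, 0.1]]
print(log_loss(y_true, y_proba, labels=[0, 1, 2]))  # ~0.96, dominated by the last row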

# launch the multi-classification auto-ML use case
uc = pio.MultiClassification.fit('wine_quality_from_sdk',
                                 dataset=pio_train,
                                 metric=pio.metrics.MultiClassification.log_loss,
                                 holdout_dataset=pio_test,
                                 column_config=col_config,
                                 training_config=uc_config,
                                 with_blend=False)

You can visit this post to get more information about the `fit()` options.

Result analysis

For a multi-classification use case, Prevision supports the OvA (one-versus-all) strategy: the platform generates a probability for each modality of the target column, and the predicted class is the argmax.
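
Concretely, the predicted class is the column with the highest probability; here is a minimal NumPy sketch (the quality levels below are illustrative):

import numpy as np
# per-class probabilities for two samples over three hypothetical quality levels
proba = np.array([[0.1, 0.7, 0.2],
                  [0.5, 0.3, 0.2]])
classes = np.array([5, 6, 7])
predicted = classes[np.argmax(proba, axis=1)]
print(predicted)  # -> [6 5]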

Let's find out the best performance found by the platform:

print("{} models were trained on this usecase:\n".format(len(uc)))
for model in uc.models:
print(model.name,' === ' ,'%.3f'%model.score, 'log_loss')

print("\n##################################################")
print('The best model found in this use case is a', best.algorithm)
print("the best cross validation performance is {} of log_loss".format('%.3f'%best.score))
print("\n##################################################")
print('The best model performance in the holdout-dataset is ', '%.3f'%uc._status['holdoutScore'])

=> The best model is a random forest and it’s cross val score is 0.93

Now we will train a random forest with the scikit-learn API.

Scikit-learn based approach:

import numpy as np
# Define features X for train and test subsets
X_train = np.asarray(train_set.drop('quality', axis=1))
X_test = np.asarray(test_set.drop('quality', axis=1))
# Define target y for train and test subsets
y_train = np.asarray(train_set['quality'])
y_test = np.asarray(test_set['quality'])

To make sure that we use the same cross-validation folds, we will construct a custom iterator for the cv parameter of scikit-learn's cross_val_predict() method; check this post for further information.

The cross-validation folds are readily available through the cross_validation property of the best model (from the previsionio.model.ClassificationModel class). This tutorial provides further details of Prevision model attributes.

best_cv = best.cross_validation
folds = best_cv['__fold__']

# create a customized cv iterator reproducing Prevision's folds
def custom_pio_folds(X, pio_cv=folds):
    for i in sorted(pio_cv.unique()):
        idx_test = np.where(pio_cv == i)[0]
        idx_train = np.where(pio_cv != i)[0]
        yield idx_train, idx_test

Now we will construct a random forest model with scikit-learn's default hyper-parameters, and evaluate its cross-validation predictions on the same folds as Prevision with the log_loss metric:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import log_loss
forest_clf = RandomForestClassifier(random_state=42)
custom_cv = custom_pio_folds(X_train)
y_pred_forest = cross_val_predict(forest_clf, X_train, y_train, cv=custom_cv, method='predict_proba')
sk_score = log_loss(y_train, y_pred_forest)
print('scikit-learn default random forest log loss CV is {}'.format('%.3f' % sk_score))

it returns => 1.08

Now let's manually check the performance of the best model found by the platform:

predictions = best.cross_validation.drop(['ID', 'quality', 'pred_quality', '__fold__'], axis=1)
predictions = predictions.loc[:, ['pred_quality_'+str(i) for i in sorted(best.cross_validation['quality'].unique())]]
predictions = predictions.values
true_values = best.cross_validation['quality'].values
pio_score = log_loss(true_values, predictions)
print('%.3f' % pio_score)

it returns=> 0.938 which is the same value as the returned one by the sdk

Use Prevision hyper-parameters

Let's replace the hyper-parameters of the random forest classifier with those of the best model found by the platform:

forest_clf = RandomForestClassifier(**best.hyperparameters)
custom_cv = custom_pio_folds(X_train)
y_pred_forest = cross_val_predict(forest_clf, X_train, y_train, cv=custom_cv, method='predict_proba')
hyper_sk_score = log_loss(y_train, y_pred_forest)
print('%.3f' % hyper_sk_score)

=> it had returned 0.967!! which is quite better than the previous 1.087

It is even simpler than grid-search or random-search hyper-parameter optimization techniques: the platform has already computed it for you (a clever trick to use in Kaggle challenges ;)).
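
For comparison, here is a sketch of what the traditional route would look like over the same custom folds; the search space below is arbitrary, chosen only for illustration:

from sklearn.model_selection import RandomizedSearchCV

# sketch of a manual hyper-parameter search over the same Prevision folds
# (the parameter grid is an arbitrary example, not a recommended search space)
param_dist = {'n_estimators': [100, 300, 500],
              'max_depth': [None, 10, 20, 30],
              'min_samples_leaf': [1, 2, 5]}
search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_dist,
                            n_iter=10,
                            scoring='neg_log_loss',
                            cv=list(custom_pio_folds(X_train)),
                            random_state=42)
search.fit(X_train, y_train)
print(search.best_params_, '%.3f' % -search.best_score_)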

Holdout performances:

We have evaluated the cross-validation performances. Let's now check the generalization error using the test subset:

1- Random forest with scikit-learn default settings:

forest_clf = RandomForestClassifier(random_state=42)
forest_clf.fit(X_train, y_train)
preds = forest_clf.predict_proba(X_test)
log_loss(y_test, preds, labels=forest_clf.classes_)

=> it returned 0.95

2- Random forest with Prevision best model hyper-parameters:

forest_clf = RandomForestClassifier(**best.hyperparameters)
forest_clf.fit(X_train, y_train)
preds = forest_clf.predict_proba(X_test)
log_loss(y_test, preds, labels=forest_clf.classes_)

=> it returned 0.86 much better than the default one!!

3- Prevision.io best model:

uc._status['holdoutScore']

=> the result is 0.83 the best one!

This small gap between the Prevision model's performance and the customized random forest is due to the feature transformations constructed and added by the platform before training the models.

Conclusion:

In this post we have addressed a multi-classification problem using both the Prevision auto-ML and hand-coded approaches. With the second one we kept the default parameters and got a poor result (1.08), but by using the hyper-parameters of the best model found by the auto-ML platform the loss decreased to 0.967; still, the best cross-validation performance, 0.938, belonged to the platform's best model.

Once again, I want to show through this post how auto-ML tools can facilitate our job. For this use case it took only a few minutes to create, train and evaluate a dozen models and find the best one. If you want to test it on your own, just log in to the Prevision cloud instance and you will get free trial access.

I really do believe that auto-ML tools are the future of AI; hence we should put most of our effort into thinking about how to integrate the models into our systems rather than spending time creating and optimizing them with the traditional approach.
