AutoML with Prevision.io

Zeineb Ghrib
Prevision.io
9 min read · Jun 8, 2020


Overview of the Prevision.io AutoML platform

What is Automated Machine Learning?

Any data scientist, whatever their level of expertise, will tell you that applying a traditional end-to-end machine learning process to real-world business problems is tedious, time- and resource-consuming, and challenging.

Automated machine learning addresses these issues by systematizing the iterative, time-consuming tasks required to develop a machine learning model. Repetitive steps such as model building, training, selection, and tuning are fully automated and parallelized.

What about Prevision.io?

Prevision.io provides an automated machine learning platform to generate and deploy highly accurate predictive models, on cloud or on-premise. It offers a user-friendly interface that can be used to build standalone models without any prior technical knowledge or infrastructure.

At the implementation level, the platform incorporates machine learning best practices from top-ranked Kagglers and data scientists to ensure highly efficient models, while hiding all the complexity from users.

Prevision.io's activity is mainly focused in France, and it is the only French vendor among AI cloud providers. In January 2020, Prevision.io was named a Visionary in the Gartner Magic Quadrant.

How to use the Prevision AutoML solution?

Data scientists and machine learning developers who want to keep hold of their ML workflow source code can take advantage of Prevision's AutoML services via the Prevision Python package, without using our front-end application. We developed a Software Development Kit (SDK) that allows you to build and launch machine learning use cases within Prevision services. Hence, you can interact with the service from any Python environment (Jupyter notebooks, PyCharm, VS Code, …). An R package is also provided.

In this post I will show you how to use automated machine learning via the Prevision Python package to create a regression model that predicts median housing prices. I will then walk through the traditional approach and show how much easier it is to use AutoML compared to hand-coding a machine learning model yourself.

Prerequisites

MASTER_TOKEN:

In order to initialize the client workspace via the SDK and interact with its corresponding Prevision.io platform instance, an API token is required for authentication. It is obtained by going to the user menu and clicking on the API key item:

To copy the key to the clipboard, go to the right of the screen and click on copy as shown below:

Install:

git clone https://github.com/previsionio/prevision-python.git
cd prevision-python
python setup.py install
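
Alternatively, assuming the package is published on PyPI for the SDK version you need, a plain pip install may be simpler:

pip install previsionio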

Setting up the configuration and connection

In the code block below we import the Python package and initialize the client instance to interact with its Prevision.io workspace. This needs to be done once per session.

import previsionio as pio
import pandas as pd

URL = 'https://XXXX.prevision.io'
TOKEN = '''YOUR_MASTER_TOKEN'''

# initialize client workspace
pio.prevision_client.client.init_client(URL, TOKEN)

Do not forget to replace TOKEN with your generated key and URL with your instance endpoint before running the rest of this notebook.

Importing Data with Prevision in Python

Once the client instance is successfully initialized, we will start our end-to-end machine learning use case with the Prevision.io AutoML platform and show how easy such a tool is to use, especially compared to the traditional self-coded method that we'll walk through later. We will work on a regression problem using the California Housing dataset. The main task consists of predicting the house price given a set of input features.

We get the pre-registered dataset from the Prevision.io platform by its name as follows:

data = pio.dataset.Dataset.get_by_name(name='housing')

data is a Prevision Dataset object. To load the data content in memory as a pandas DataFrame, use the to_pandas() method as follows:

data = data.to_pandas()
data.head()

Manipulate and transform data

Once we have retrieved the dataset, we can explore it, transform it, and add or delete features. We then split it into training and testing subsets, and finally register the resulting datasets in order to launch our use case.

Example of new features that can be added (source)

data["rooms_per_household"] = data["total_rooms"]/data["households"]
data["bedrooms_per_room"] = data["total_bedrooms"]/data["total_rooms"]
data["population_per_household"]=data["population"]/data["households"]

Train/Test split

As usual, let's make a random split: 80% for the training subsample and 20% for testing:

import numpy as np

# equivalent to sklearn.model_selection.train_test_split
def split_train_test(data, test_ratio=0.2, SEED=42):
    idx = np.random.RandomState(seed=SEED).permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    # the first test_set_size shuffled rows form the test set,
    # the remaining rows form the training set
    return (data.iloc[idx[test_set_size:]].reset_index(drop=True),
            data.iloc[idx[:test_set_size]].reset_index(drop=True))

train, test = split_train_test(data, test_ratio=0.2, SEED=42)

Now let's register the transformed datasets in the Prevision platform:

# register the modified datasets
train_pio = pio.Dataset.new('housing_train', dataframe=train)
test_pio = pio.Dataset.new('housing_test', dataframe=test)

Prevision AutoML with Python

Once the datasets are stored in your instance, all you have to do is configure an AutoML regression use case, and the tool will take care of automatically creating and evaluating several models of different types. You need to set two configurations:

  1. Column Configuration

Set the column configuration, which defines at least the target column, via a previsionio.ColumnConfig object. In the case of a tabular use case (regression / classification / multi-classification), we instantiate a previsionio.ColumnConfig object with the following attributes (a fuller illustrative example follows the list):

  • target_column: the name of the target column
  • id_column: if the dataset contains an ID column, it does not carry any predictive signal, and we should flag it via this attribute while configuring the use case. Later, each prediction is provided with its corresponding ID.
  • fold_column: optionally, we can set the fold_column corresponding to the stratification feature. By default the stratification strategy is based on the target column. Please note that this choice can significantly impact the quality of predictions (about the importance of stratification, Kohavi concludes that stratification is generally a better scheme, both in terms of bias and variance, when compared to regular cross-validation).
  • weight_column (optional): typically, this column contains a numeric feature indicating the importance of a given row. The higher the weight, the more important the row. If not provided, all rows are considered equally important (which is the case in most use cases).
  • drop_list (optional): a list of column names to drop
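
For illustration, a fuller configuration combining these optional attributes might look like the sketch below (the id, fold, weight, and dropped column names are hypothetical and do not belong to the housing dataset):

# hypothetical example combining the optional attributes
col_config_full = pio.ColumnConfig(
    target_column='median_house_value',
    id_column='row_id',             # hypothetical ID column
    fold_column='district',         # hypothetical stratification column
    weight_column='sample_weight',  # hypothetical row-importance column
    drop_list=['unused_feature']    # hypothetical columns to exclude
)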

In our case we will set only the target column:

#configure columns of datasets
col_config = pio.ColumnConfig(target_column='median_house_value')

  2. Use Case Training Configuration

Define the use case settings, such as the training profile, the types of models you want to experiment with, the metric used to train the models, the feature engineering to apply, etc. This is done by instantiating a previsionio.TrainingConfig object.

Example of configuration:

# configure the use case profile
uc_config = pio.TrainingConfig(
    models=[pio.Model.XGBoost, pio.Model.RandomForest, pio.Model.LightGBM],
    features=pio.Feature.Full.drop(pio.Feature.EntityEmbedding,
                                   pio.Feature.PCA,
                                   pio.Feature.KMeans),
    profile=pio.Profile.Normal,
    with_blend=True
)
  • models: the list of model types you want to experiment with, chosen from the selection offered by Prevision.io. In our example we will use Random Forest, XGBoost, and LightGBM.
  • features: the list of Prevision feature engineering operations for your use case. Here we kept all the feature engineering offered by Prevision except the PCA transformation, entity embedding (a projection of categorical features into a new latent space called the embedding space), and the K-Means clustering encoding.
  • profile: a parameter specific to Prevision. According to your needs, you can set it to previsionio.Profile.Quick (recommended for first iterations), previsionio.Profile.Normal (used in our example), or previsionio.Profile.Advanced (used for optimization purposes in the final steps of a project).
  • with_blend: we activated the blending option, which means that in addition to the base models that were chosen, Prevision will pick some of the generated models and use them to train additional models; the resulting models are blends of the others.

Launch the use case

Now all we have to do is launch the regression AutoML use case with the established configurations. For this purpose, we use the fit method of previsionio.Regression, a subclass of previsionio.Supervised (use previsionio.Classification for a classification use case, previsionio.MultiClassification for a multi-classification use case, etc.).

# launch the Regression AutoML use case
uc = pio.Regression.fit(name='housing',
                        dataset=train_pio,
                        metric=pio.metrics.Regression.RMSE,
                        holdout_dataset=test_pio,
                        column_config=col_config,
                        training_config=uc_config)

The previous code block launches a process that sweeps several machine learning algorithms with different hyper-parameter settings according to the pre-defined configuration. The AutoML platform can then find the best model (or the fastest) by minimizing the given accuracy metric (here RMSE).
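
Note that training runs asynchronously on the platform. Depending on your SDK version, the use case object may expose a wait_until helper (this is an assumption about the API; check the SDK documentation for your version) to block until a given condition holds, for instance until at least one model has been trained:

# block until at least one model has been trained
# (wait_until availability depends on your SDK version)
uc.wait_until(lambda usecase: len(usecase) > 0)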

Interacting with a Prevision.io use case in Python

List the created models:

The Prevision Python package provides utilities to retrieve all the models created within the current use case:

# get the number of models generated for the use case
print("{n} models have been trained for this use case".format(n=len(uc)))
# list the models
models = uc.models
for model in models:
    print(model.name)

The output shows that 19 distinct models have been trained, either of different types (XGBoost, DT for Decision Tree, LGB for LightGBM, …) or different models of the same type (XGB-6, XGB-7, …).

1. The best model

One of the most powerful features of AutoML is that, within a few minutes, it can find the best model for your use case launched with the desired configuration (best in terms of score).

To get the best model we use the best_model property:

best_model = uc.best_model

Each of the created models is automatically evaluated with the cross-validation technique. Let's get the cross-validation performance of the best model:
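
As a sketch, assuming your SDK version exposes a get_cv() helper on the model object (returning the cross-validation predictions as a pandas DataFrame; check the SDK documentation), this could look like:

# retrieve the cross-validation data of the best model
# (get_cv() availability depends on your SDK version)
cv_df = best_model.get_cv()
cv_df.head()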

Traditional Approach:

With the self-coded approach, you have to go through all the feature engineering and model selection/tuning work yourself. Below is an extract of what we have to go through to address this regression use case.

1. Model selection:

As a first step, we will create several kinds of models and individually evaluate their cross-validation performance:

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor

# Random Forest
forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
scores = cross_val_score(forest_reg, X_train.values, y_train.values,
                         scoring="neg_mean_squared_error",
                         cv=7)
rf_scores = np.sqrt(-scores)

# Linear Regression
lin_reg = LinearRegression()
scores = cross_val_score(lin_reg, X_train.values, y_train.values,
                         scoring="neg_mean_squared_error",
                         cv=7)
lr_scores = np.sqrt(-scores)

# XGBoost
xgb = XGBRegressor(n_estimators=10, max_depth=2)
scores = cross_val_score(xgb, X_train.values, y_train.values,
                         scoring="neg_mean_squared_error",
                         cv=7)
xgb_scores = np.sqrt(-scores)

Let's check the resulting scores:

d = {0: 'Random Forest', 1: 'Linear Regression', 2: 'XGBoost'}
for i, scores in enumerate([rf_scores, lr_scores, xgb_scores]):
    print(d.get(i) + ' model')
    print('Cross Validation Mean Score', scores.mean())
    print('Cross Validation Upper Deviation Score ', scores.mean() + scores.std())
    print('Cross Validation Lower Deviation Score', scores.mean() - scores.std())
    print("**********************")

Here we find that the best scores are achieved by the Random Forest model.

2. Model Fine Tuning:

We will use the grid search technique (sketched below) to find the best hyper-parameters for our Random Forest model, which seems to perform well on this use case.
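
For reference, a minimal sketch of such a grid search (the parameter grid values here are illustrative, not the ones actually used):

from sklearn.model_selection import GridSearchCV

# illustrative hyper-parameter grid for the Random Forest
param_grid = {'n_estimators': [100, 200, 300],
              'max_features': [4, 6, 8]}
grid_search = GridSearchCV(RandomForestRegressor(random_state=42),
                           param_grid,
                           scoring="neg_mean_squared_error",
                           cv=7)
grid_search.fit(X_train.values, y_train.values)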

std_cv = grid_search.cv_results_['std_test_score'][grid_search.best_index_]
print('Cross Validation Mean Score',
      np.sqrt(-grid_search.best_score_))
print('Cross Validation Upper Deviation Score ',
      np.sqrt(-(grid_search.best_score_ - std_cv)))
print('Cross Validation Lower Deviation Score',
      np.sqrt(-(grid_search.best_score_ + std_cv)))

Conclusion:

The best performance after the model selection and fine-tuning steps is 49711.26, meaning that the best model, found within a few minutes by the Prevision AutoML tool, does better by 10.86%.

Beyond the resulting performance, the real added value of the AutoML approach is the time saved and the complexity removed by automating the repetitive tasks of each new use case. This way, data scientists can focus more on finding meaningful business insights and KPIs. Moreover, thanks to AutoML, even people with no technical background can tackle machine learning use cases. Some even think that it is the future and that some day it will replace data scientists. To know more about that, check this post.

Feel free to comment on this post and ask me any questions about my work. I am planning to write other posts comparing Prevision.io with other AutoML solutions such as H2O, DataRobot, or AzureML, and any feedback on that would be great. I will also show how I used Prevision Studio to solve real business use cases for renowned clients such as La Poste (France) and Renault.

Thanks for reading!! I hope this was informative in some way :)
