GETTING STARTED | DATA PREPARATION | KNIME ANALYTICS PLATFORM

Data preparation for Machine Learning with KNIME and the Python “vtreat” package

TL;DR: Speed up your ML data preparation with the great Python package vtreat and just a few lines of code — wrapped in a convenient KNIME workflow

Markus Lauber
Low Code for Data Science

--

KNIME loves vtreat

Before you can do any fancy machine learning, it makes sense to prepare your data, e.g. deal with missing values, remove highly correlated variables, and so on. You can do a lot of this by hand, or you can employ a ready-made tool like vtreat, which has been implemented in both R and Python.

To read about KNIME and Python, you can check out this article: “How to Set Up Your Python Extensions”, or read one of my previous Medium stories: “KNIME and Python — Setting up and managing Conda environments”.

To learn more about the vtreat package itself, please check out the video “vtreat for KNIME!” and the accompanying articles.

Based on the examples from the GitHub repository, I created these KNIME workflows to bring everything into a convenient environment with the latest Python nodes — similar to what is described in the video above.

Vtreat for Binary Classification (0/1)

(https://hub.knime.com/-/spaces/-/latest/~2gqD6JuPrcxKuZrp/)

The input to the Python Script node consists of all the columns that could help predict the outcome, plus the target variable (0/1) named “Target”. One output is a pickled object (the blue square) containing the treatment plan, which is stored as a ZIP file with the help of the KNIME Model Writer; the other outputs are the transformed training data and an overview of the treatment plan as a CSV file.

Workflow to learn and apply a vtreat data preparation: create the preparation on the training data, store the procedure, and apply it to the test data (https://hub.knime.com/-/spaces/-/latest/~2gqD6JuPrcxKuZrp/).

The Python code to ‘learn’ the treatment would look like this (following “Using vtreat with Classification Problems”). Note: your Target variable will be used to create the plan (on the training data)! Only a few settings are used here; you can find more options under the link.

import knime.scripting.io as knio
import vtreat

# the table imported from the KNIME workflow with the training data
input_table_1 = knio.input_tables[0].to_pandas()

# vtreat for KNIME!
# https://win-vector.com/2020/06/28/vtreat-for-knime/

# define the treatment according to:
# https://github.com/WinVector/pyvtreat/blob/main/Examples/Classification/ClassificationWarningExample.md
vtreat_transform = vtreat.BinomialOutcomeTreatment(
    outcome_name='Target',    # outcome variable
    outcome_target='1',       # outcome of interest
    cols_to_copy=['Target'],  # columns to "carry along" but not treat as input variables
    params=vtreat.vtreat_parameters({
        'filter_to_recommended': True,
        # the value being imported as a Flow Variable from KNIME
        'indicator_min_fraction': knio.flow_variables['v_vtreat_indicator_min_fraction']
    })
)

# learn the treatment plan and apply it to the training data
d_prepared = vtreat_transform.fit_transform(input_table_1, input_table_1['Target'])

# save the transformation rules
vtreat_transform_as_data = vtreat_transform.description_matrix()

knio.output_tables[0] = knio.Table.from_pandas(d_prepared)
knio.output_tables[1] = knio.Table.from_pandas(vtreat_transform_as_data)

# the transformation is exported as a Pickle object to be stored
knio.output_objects[0] = vtreat_transform
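
If you want to check why certain derived variables were kept, the fitted treatment also exposes a score frame. This is part of the pyvtreat API; a minimal sketch, reusing the vtreat_transform object from the node above:

# one row per derived variable; 'recommended' marks the variables
# that pass vtreat's significance filter
score_frame = vtreat_transform.score_frame_
print(score_frame[['variable', 'treatment', 'recommended']])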

To apply the treatment to the test data (and, later, to new data), the stored model is read back into KNIME and applied with this code inside a KNIME Python node:

import knime.scripting.io as knio
import vtreat

# loads the saved vtreat model into the Python environment
vtreat_model_load = knio.input_objects[0]

# load the (new) data
input_table = knio.input_tables[0].to_pandas()

output_table = vtreat_model_load.transform(input_table)

knio.output_tables[0] = knio.Table.from_pandas(output_table)
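
Behind the scenes, the port object is simply the pickled Python object. Outside of KNIME you could reproduce the round trip with plain pickle. A minimal sketch, assuming a fitted vtreat_transform and a pandas DataFrame new_data:

import pickle

import vtreat  # the vtreat classes must be importable when unpickling

# serialize the fitted treatment plan (this is what the KNIME port object carries)
blob = pickle.dumps(vtreat_transform)

# ... later, possibly in another process ...
vtreat_model_load = pickle.loads(blob)
prepared = vtreat_model_load.transform(new_data)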

Vtreat also offers an unsupervised treatment. Here you do not ‘leak’ information from your target into your features (more on leaks in a moment).

I have also created a workflow for the unsupervised version (https://hub.knime.com/-/spaces/-/latest/~bgKqtVG7xLgxwDZf/).
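
The code follows the same pattern; instead of BinomialOutcomeTreatment you use vtreat’s UnsupervisedTreatment, which never sees the target. A minimal sketch, assuming the same KNIME input and output wiring as above:

import knime.scripting.io as knio
import vtreat

input_table_1 = knio.input_tables[0].to_pandas()

# unsupervised treatment: the target is only carried along,
# it is never used to build the treatment plan
vtreat_transform = vtreat.UnsupervisedTreatment(
    cols_to_copy=['Target']  # columns to "carry along" but not treat as input variables
)

# no outcome column is passed in, so nothing can leak
d_prepared = vtreat_transform.fit_transform(input_table_1)

knio.output_tables[0] = knio.Table.from_pandas(d_prepared)
knio.output_objects[0] = vtreat_transform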

If you now use this kind of preparation with some models, you can see that the vtreat treatment gives you a (small) boost:

Binary Classification — use the Python XGBoost package and other nodes to build models and deploy them through KNIME Python nodes

Comparison of some model performances with and without vtreat (https://hub.knime.com/-/spaces/-/latest/~In0Rxt7EhzQfycx3/).

Whether this advantage is sustainable (and worth the effort) will depend very much on your data and business case.

You can read even more about ML classification tasks and KNIME here:
H2O.ai AutoML in KNIME for classification problems

Vtreat for Regression Targets

(https://hub.knime.com/-/spaces/-/latest/~tgdedu04tUX8ldcC/)

There is also a treatment for regression targets, similar to the one for classification tasks.

import knime.scripting.io as knio
import vtreat

input_table_1 = knio.input_tables[0].to_pandas()

# vtreat for KNIME!
# https://win-vector.com/2020/06/28/vtreat-for-knime/

# define the treatment according to:
# https://github.com/WinVector/pyvtreat/blob/main/Examples/Regression/Regression.md
vtreat_transform = vtreat.NumericOutcomeTreatment(
    outcome_name='Target',    # outcome variable
    cols_to_copy=['Target'],  # columns to "carry along" but not treat as input variables
    params=vtreat.vtreat_parameters({
        'error_on_duplicate_frames': True,
        'filter_to_recommended': True,
        # the value being imported as a Flow Variable from KNIME
        'indicator_min_fraction': knio.flow_variables['v_vtreat_indicator_min_fraction']
    })
)
d_prepared = vtreat_transform.fit_transform(input_table_1, input_table_1['Target'])

# save the transformation rules
vtreat_transform_as_data = vtreat_transform.description_matrix()

knio.output_tables[0] = knio.Table.from_pandas(d_prepared)
knio.output_tables[1] = knio.Table.from_pandas(vtreat_transform_as_data)

knio.output_objects[0] = vtreat_transform

An initial evaluation shows the benefit of using vtreat: a lower (better) RMSE (Root-Mean-Square Error) on this task of predicting Kaggle house prices.
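
For reference, RMSE is just the square root of the mean squared residual. A minimal sketch with numpy, using hypothetical y_true / y_pred arrays:

import numpy as np

def rmse(y_true, y_pred):
    # Root-Mean-Square Error: sqrt of the average squared residual
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))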

XGBoost models with and without vtreat data preparation compared: XGBoost with vtreat has a better (lower) RMSE than the model without preparation.

If you use the unsupervised version (https://hub.knime.com/-/spaces/-/latest/~W_QgpFcBs-11AxVB/), you still get some benefit, but not as much as when you let the vtreat model know what the target variable is.

XGBoost with vtreat has a better (lower) RMSE than the model without preparation — also when the preparation is unsupervised.

What about Leaks?

This might be where an artificial task differs from a real-world example. In a real-world setting I would not worry too much about such ‘leaks’ if (!) the correlation between the target variable and your features can be expected to remain stable over time, i.e. they should give you the same information the next round. Of course, you should always watch the performance of your models and be careful with the encoding of some variables, such as individual identifiers (customer numbers are a no-go) or regional ones. Are you sure this sales region will behave the same way next time around, and do you want to ‘hard-code’ its performance into your model? On the other hand, you might expect managers in general to earn more than unskilled workers next month too, so why not use this information on the ‘census-income’ dataset? You will always have to watch out for significant changes (workers getting a big raise … some pandemic ruining all your historic data).
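
If you want to be defensive about identifier-like columns, you can drop them before the vtreat step. A minimal, hypothetical helper (the threshold is an assumption you would tune for your data):

import pandas as pd

def drop_id_like_columns(df: pd.DataFrame, max_unique_ratio: float = 0.95) -> pd.DataFrame:
    # drop columns where (nearly) every row is unique,
    # e.g. customer numbers or other identifiers
    keep = [c for c in df.columns
            if df[c].nunique(dropna=True) / len(df) <= max_unique_ratio]
    return df[keep]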

As always, such model workflows should be the start of your exploration, and different tasks might call for other preparations. I have compared a few regression models and also used vtreat. You can check out the resulting workflow on the KNIME Hub:

Score Kaggle House Prices: Advanced Regression Techniques — prepare data with vtreat — use H2O.ai nodes and other models — measure results with RMSE (https://hub.knime.com/-/spaces/-/latest/~_4d8QTJEXO50qzi7/).

Spoiler alert: a KNIME GBM model with vtreat takes the ‘regression’ lead, ahead of H2O.ai AutoML (again with vtreat).
Edit: if you look at the actual workflow, an even better model is a “RProp MLP Learner” — I am still exploring that one with some other preparations like normalization …

You can read even more about ML regression tasks and KNIME here:
H2O.ai AutoML in KNIME for regression problems

Vtreat for Multi-Class Targets

(https://hub.knime.com/-/spaces/-/latest/~KCc7hJcSXdEJcMWy/)

The latest pair of workflows deals with vtreat and multi-class classification problems. I named the target variable “Class” for this:

import knime.scripting.io as knio
import vtreat

input_table_1 = knio.input_tables[0].to_pandas()

# vtreat for KNIME!
# https://win-vector.com/2020/06/28/vtreat-for-knime/

# define the treatment according to:
# https://github.com/WinVector/pyvtreat/blob/main/Examples/Multinomial/MultinomialExample.md
vtreat_transform = vtreat.MultinomialOutcomeTreatment(
    outcome_name='Class',    # outcome variable
    cols_to_copy=['Class'],  # columns to "carry along" but not treat as input variables
    params=vtreat.vtreat_parameters({
        'error_on_duplicate_frames': True,
        'filter_to_recommended': True,
        # the value being imported as a Flow Variable from KNIME
        'indicator_min_fraction': knio.flow_variables['v_vtreat_indicator_min_fraction']
    })
)

d_prepared = vtreat_transform.fit_transform(input_table_1, input_table_1['Class'])

# save the transformation rules
vtreat_transform_as_data = vtreat_transform.description_matrix()

knio.output_tables[0] = knio.Table.from_pandas(d_prepared)
knio.output_tables[1] = knio.Table.from_pandas(vtreat_transform_as_data)

knio.output_objects[0] = vtreat_transform

If you check out the simple example results (with KNIME’s XGBoost), vtreat does not seem to have much of an impact. I then put the preparation into a larger workflow and again used H2O.ai AutoML (only allowing GBM models and stacked versions of them), and here the combination comes out on top:

Score UCI Wine Quality Dataset — multiple Targets (multiclass) with H2O.ai nodes and other models — measure results with LogLoss
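
For reference, LogLoss can be computed with scikit-learn. A minimal sketch, assuming y_true holds the observed classes and proba is an (n_rows, n_classes) array of predicted class probabilities:

from sklearn.metrics import log_loss

# lower is better; LogLoss heavily penalizes confident but wrong probabilities
score = log_loss(y_true, proba)
print(f"LogLoss: {score:.4f}")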

I gave the model 10 minutes of training time, so a longer run might very well improve the model further, while H2O.ai tries to make sure that no overfitting happens (“A Deep dive into H2O’s AutoML”).
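
In the workflow this restriction is configured in the KNIME H2O nodes; in plain Python the equivalent setup would look roughly like this (a sketch with hypothetical frame and column names):

import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.H2OFrame(d_prepared)             # the vtreat-prepared training data
train['Class'] = train['Class'].asfactor()   # a multiclass target must be a factor

# only GBM models and their stacked ensembles, 10 minutes of training time
aml = H2OAutoML(max_runtime_secs=600,
                include_algos=['GBM', 'StackedEnsemble'],
                sort_metric='logloss')
aml.train(y='Class', training_frame=train)
print(aml.leaderboard)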

Again: please take these examples with a grain of salt and keep in mind that these are somewhat artificial datasets used to demonstrate the functions. Your real-life datasets and tasks might pose other challenges.

If you want to read more about KNIME and machine learning, you can check out my meta-collection on the KNIME Hub:

Machine Learning Meta Collection (with KNIME)

If you are interested in more about KNIME and automated machine learning, I have these links to offer:

--

Markus Lauber
Low Code for Data Science

Senior Data Scientist working with KNIME, Python, R and Big Data Systems in the telco industry