KNIME, XGBoost and Optuna for Hyper Parameter Optimization

TL;DR: Machine Learning gets better with hyper parameter optimisation and a tool like Optuna is there to help. Also you can integrate the results with KNIME …

Markus Lauber
12 min read · Mar 4, 2023


I recently wrote a blog post about hyper parameter optimisation for LightGBM using BayesSearchCV. This article does something similar, but with the powerful XGBoost library and Optuna, combining KNIME and Python.

KNIME loves XGBoost and Optuna
The whole KNIME workflow (https://hub.knime.com/-/spaces/-/latest/~GABT_OgeoWxWJW9P/)

You can just go ahead and download the workflow; you will find all the code in “kn_example_python_xgboost_hyper_parameter_optuna.ipynb” in the subfolder /data/notebooks/ and can use it right away (yes, I am thinking about opening a GitHub page …). The KNIME part will come at the end.

Jupyter Notebooks in the /data/ subfolder of the KNIME workflow (https://hub.knime.com/-/spaces/-/latest/~GABT_OgeoWxWJW9P/)

In this walkthrough I will point out some things I noticed while setting this up, which will hopefully help you and which I have not found on other blogs about these tools. I try to set up the code so that you can version your models, handle string and numeric variables in a consistent way, and store the results for later use. Maybe that is one added value from an experienced data scientist.

And also to combine the results with your favourite low-code tool, KNIME. So you can either set up such models yourself or let someone create them and then implement them in an easy way.

Walkthrough - XGBoost / Optuna Python code

At the beginning I like to give the model an individual name and add a timestamp so we can later identify the version together with its components. You also set the data and model paths here. If you start this from the sub-folder of the KNIME workflow, it will point to its data folder.

# http://strftime.org
import time
var_timestamp_day = "{}".format(time.strftime("%Y%m%d"))
print("var_timestamp_day: ", var_timestamp_day)

var_timestamp_time = "{}h".format(time.strftime("%H%M"))
print("var_timestamp_time: ", var_timestamp_time)

# _edit: if you want to have another model name
var_model_name = "XGBoost_Optuna_Classification"

var_model_name_full = var_model_name + "_" + var_timestamp_day + "_" + var_timestamp_time + "_jupyter"
print("var_model_name_full: ", var_model_name_full)

# if you do not want to store the files in the working directory
var_path_data = "../result/"
var_path_model = "../model/"

Next, import the data. We again use the census-income (or adult) dataset. I like to have the data stored in Parquet format since it preserves the variable types and is compact, widely used and well supported, also by KNIME.

import numpy as np
import pandas as pd
import pyarrow.parquet as pq

import json
import pickle

import xgboost as xgb

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, auc, average_precision_score, precision_recall_curve

import optuna
from optuna.visualization import plot_optimization_history, plot_param_importances
import matplotlib.pyplot as plt
import plotly
import matplotlib
import kaleido

# define the datasets
data = pq.read_table(var_path_data + "train.parquet").to_pandas()
data_test = pq.read_table(var_path_data + "test.parquet").to_pandas()

data = data.reset_index(drop=True)
data_test = data_test.reset_index(drop=True)

Then I like to define the variables we are about to use and set aside things like IDs (customer numbers) and the like. This also makes it easy to add or remove parts of the variable settings. We will later store this information for further use.

excluded_features = ['row_id']
label = ['Target']

features = [feat for feat in data.columns if feat not in excluded_features and feat not in label]

num_cols = data[features].select_dtypes(include='number').columns.tolist()
cat_cols = data[features].select_dtypes(exclude='number').columns.tolist()

rest_cols = [feat for feat in data.columns if feat not in cat_cols]

print(f'''{"data shape:":20} {data.shape}
{"data[features] shape:":20} {data[features].shape}
categorical columns: {cat_cols}
numerical columns: {num_cols}
feature columns: {features}
rest columns: {rest_cols}''')

# THX David Gutmann

We now put our newfound variable lists to good use and do some basic transformations and cleaning. This is obviously not a full-blown data preparation, but it is necessary to make XGBoost run smoothly.

data[cat_cols] = data[cat_cols].astype('category')
data[label] = data[label].astype('int32')

If you are interested in some quick (but also quite sophisticated) data preparation, there is my article about “Data preparation for Machine Learning with KNIME and the Python ‘vtreat’ package”.

Next comes the familiar split into training and test sets (we keep data_test as a separate test file for now ... or was it validation, never mind) and the transformation of the data into XGBoost matrices.

# split training data into X and y
X = data[features]
y = data[label]

# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)

# one special feature is the use of xgb Matrices with categorical features enabled
D_train = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)
D_test = xgb.DMatrix(X_test, label=y_test, enable_categorical=True)

The next step is the definition of the hyper parameters to search. Besides reading about the parameters I also (again) employed ChatGPT to ‘discuss’ them. You should check whether they work for you, and you can modify them. We use AUCPR since this is a robust statistic that also works well with unbalanced datasets.

# number of iterations
var_n_boost_round = 200

def objective(trial):
    param = {
        'eta': trial.suggest_float('eta', 0.01, 0.3),  # Step size shrinkage used in updates to prevent overfitting.
        'max_depth': trial.suggest_int('max_depth', 6, 15),  # Maximum depth of a tree. Larger values can lead to overfitting.
        'subsample': trial.suggest_float('subsample', 0.1, 1.0),  # Subsample ratio of the training instances. Lower values can prevent overfitting.
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.1, 1.0),  # Subsample ratio of columns when constructing each tree. Lower values can prevent overfitting.
        'gamma': trial.suggest_float('gamma', 1e-8, 1.0),  # Minimum loss reduction required to make a further partition on a leaf node of the tree.
        'alpha': trial.suggest_float('alpha', 1e-8, 1.0),  # L1 regularization term on weights. Can help with sparsity of the model.
        'lambda': trial.suggest_float('lambda', 1e-8, 1.0),  # L2 regularization term on weights. Can help with overfitting.
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),  # Minimum sum of instance weight (hessian) needed in a child. Can help with overfitting.
        'max_delta_step': trial.suggest_int('max_delta_step', 0, 10),  # Maximum delta step we allow each tree's weight estimation to be. Can help with convergence speed.
        'scale_pos_weight': trial.suggest_float('scale_pos_weight', 0.1, 10),  # Controls the balance of positive and negative weights, for imbalanced datasets.
        'tree_method': trial.suggest_categorical('tree_method', ['auto', 'exact', 'approx', 'hist']),  # Algorithm used to construct trees ('gpu_hist' excluded).
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.2),  # Step size shrinkage used in updates to prevent overfitting.
        'objective': 'binary:logistic',  # Objective function to optimize. In this case, binary logistic regression.
        'eval_metric': 'aucpr',  # Metric to evaluate the model during training. In this case, AUC-PR.
        'min_split_loss': trial.suggest_float('min_split_loss', 1e-8, 1.0),  # Minimum loss reduction required to make a split in the tree. Can be used to control the complexity of the tree.
        'max_bin': trial.suggest_int('max_bin', 64, 512),  # Maximum number of bins to use for continuous features.
        # 'early_stopping_rounds': trial.suggest_int('early_stopping_rounds', 25, 50),
        # 'booster': trial.suggest_categorical('booster', ['gbtree', 'gblinear']),
        # 'sample_type': trial.suggest_categorical('sample_type', ['uniform', 'weighted', 'weighted_unique']),
        # 'normalize_type': trial.suggest_categorical('normalize_type', ['tree', 'forest', 'none'])
        'seed': 42,  # The random seed.
        'n_jobs': -1  # Number of CPU threads to use for parallel execution, -1 means use all available CPU threads.
    }

    # Train model with the given hyperparameters
    model = xgb.train(param, D_train, num_boost_round=var_n_boost_round, evals=[(D_test, 'val')], early_stopping_rounds=25, verbose_eval=50)

    # Predict and calculate AUCPR score on the validation set
    y_val_pred = model.predict(D_test)
    score = average_precision_score(y_test, y_val_pred)

    return score

The training then starts. I have set the number of trials equal to the number of boosting rounds; you can adapt that.

var_n_trials = var_n_boost_round

# Create Optuna study and optimize the objective function
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=var_n_trials)

# Get the best hyperparameters and best AUCPR score
best_params = study.best_params
best_score = study.best_value
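
Before retraining, you can take a quick look at how the search went. A minimal sketch of such an inspection (the params_* column names follow Optuna's naming convention for the parameters suggested above):

# quick inspection of the study results
print(f"Best AUCPR: {best_score:.4f}")
print("Best parameters:", best_params)

# all trials as a pandas DataFrame, best trials first
df_trials = study.trials_dataframe().sort_values("value", ascending=False)
print(df_trials[["number", "value", "params_eta", "params_max_depth"]].head())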

With these optimal settings you then train a final model:

# Train the best model with all the training data and the best hyperparameters
best_trial = study.best_trial
best_params = best_trial.params

# note: study.best_params only contains the parameters suggested via trial.suggest_*,
# so the fixed settings from the objective function are added back here
best_params.update({'objective': 'binary:logistic', 'eval_metric': 'aucpr', 'seed': 42})

best_model = xgb.train(best_params, D_train, num_boost_round=var_n_boost_round)

You can make an initial evaluation on the internal test data split from your first dataframe:

# evaluate the best model on the test data

y_pred = best_model.predict(D_test)
test_score = average_precision_score(y_test, y_pred)

# evaluate the initial values based on the (internal Test data)
auc_pred = roc_auc_score(y_test, y_pred, average='weighted')
print(f'Test AUC: {auc_pred:.4f}')

aucpr = average_precision_score(y_test, y_pred, average='weighted', pos_label=1)
print(f'Test AUCPR: {aucpr:.4f}')
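
Since precision_recall_curve was imported above, you can also look at the full precision-recall trade-off instead of just the summary score. A minimal sketch; the "_pr_curve.png" file name is just an example following the naming scheme used elsewhere in this notebook:

# plot the precision-recall curve on the internal test split
precision, recall, _ = precision_recall_curve(np.ravel(y_test), y_pred)

plt.figure()
plt.plot(recall, precision, label=f"AUCPR = {aucpr:.4f}")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall curve (internal test split)")
plt.legend()
plt.savefig(var_path_model + var_model_name_full + "_pr_curve.png")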

Also, you should store your model as JSON and as a pickle file to have it available later.

import pickle
# set the path for the pickle file
path_model = var_path_model + var_model_name_full + "_model_stored.pkl"
path_model_json = var_path_model + var_model_name_full + "_model_stored.json"
# Save object as pickle file
pickle.dump(best_model, open(path_model, 'wb'), pickle.HIGHEST_PROTOCOL)

best_model.save_model(path_model_json)

Two interesting features of the Optuna optimization are that you can plot the optimization history and the hyper parameter importances. The history can give you an idea whether there might still be room for improvement. Be careful: the PNG files can be quite large, so you might best store them on disk before they bloat your Jupyter notebook.

var_path_opt_history_png = var_path_model + var_model_name_full + "_opt_history.png"

# Plot the optimization history and save to a file
fig = plot_optimization_history(study)
fig.write_image(var_path_opt_history_png)
With this relatively simple dataset a good level of parameter optimisation is found early

Also the importance of the hyper parameters can be plotted. This might give you a hint where to look for further improvement.

var_path_param_importances_png = var_path_model + var_model_name_full + "_param_importances.png"

# Plot the hyperparameter importances and save to a file
fig_para = plot_param_importances(study)
fig_para.write_image(var_path_param_importances_png)
Learning rate is the most important hyper parameter in this case

Then there are three types of feature importance that you might also store as a data frame. Yes, I always like to store things in files and give them IDs; we will see in a moment what this is for.

# Get feature importance based on weight
importance_type = 'weight'
scores_weight = best_model.get_score(importance_type=importance_type)

# Get feature importance based on gain
importance_type = 'gain'
scores_gain = best_model.get_score(importance_type=importance_type)

# Get feature importance based on cover
importance_type = 'cover'
scores_cover = best_model.get_score(importance_type=importance_type)

# Create a pandas dataframe with feature importance information
feature_imp = pd.DataFrame({'Feature': list(scores_weight.keys()),
                            'Weight': list(scores_weight.values()),
                            'Gain': list(scores_gain.values()),
                            'Cover': list(scores_cover.values())})

# Calculate the average importance rank across all methods
feature_imp['Feature_Rank'] = feature_imp[['Weight', 'Gain', 'Cover']].rank(method='min', ascending=False).mean(axis=1)

feature_imp = feature_imp.sort_values(by='Feature_Rank', ascending=True, na_position='last')

feature_imp = feature_imp.reset_index(drop=True)
feature_imp['Feature_Rank'] = feature_imp.index

feature_imp.to_parquet(var_path_model + var_model_name_full + "_feature_importance.parquet", compression='gzip')

ChatGPT says: In a machine learning model that uses decision trees, feature importance measures help us to understand which features have the greatest impact on the model’s predictions. Here are explanations of the three feature importance measures used in the code you provided:

Weight: The weight of a feature is the number of times it is used to split the data across all trees in the ensemble model. Features with higher weight values are more important because they are used more frequently to make decisions and split the data.

Gain: The gain of a feature measures the improvement in the model’s performance that results from splitting the data based on that feature. Specifically, the gain of a feature is calculated by summing the reduction in impurity (e.g., entropy or Gini index) that results from splitting the data on that feature over all trees in the ensemble. Features with higher gain values are more important because they contribute more to the model’s ability to distinguish between classes.

Cover: The cover of a feature is the average number of samples that pass through the splits that use that feature. Features with higher cover values are more important because they have a greater influence on the model’s predictions by affecting a larger number of samples.

By considering all three feature importance measures, the code you provided is able to obtain a more comprehensive ranking of feature importance that takes into account different aspects of the decision-making process in the model.
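
If you just want a quick visual check of one of these measures, XGBoost also ships a plotting helper. A minimal sketch using xgb.plot_importance; the PNG file name is only an example:

# optional: plot the gain-based importance directly with XGBoost
ax = xgb.plot_importance(best_model, importance_type='gain', max_num_features=15, show_values=False)
ax.figure.tight_layout()
ax.figure.savefig(var_path_model + var_model_name_full + "_feature_importance_gain.png")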

Then you should store all the parameters and variables you have used (and not used) in a JSON file; the initial AUCPR values are stored as well. This meta information will later be used by KNIME to weigh the various optimisation runs you have made.

# store the variables list as a dictionary in a JSON file to read back later

v_variable_list = {
    "var_model_name": var_model_name,
    "var_model_name_full": var_model_name_full,
    "num_cols": num_cols,
    "cat_cols": cat_cols,
    "rest_cols": rest_cols,
    "label": label,
    "features": features,
    "excluded_features": excluded_features,
    "Test_AUC": f'{auc_pred:.5f}',
    "Test_AUCPR": f'{aucpr:.5f}'
}

# Write the dictionary to a JSON file
with open(var_path_model + var_model_name_full + "_variable_list.json", "w") as f:
    json.dump(v_variable_list, f)

Score a new dataset in Python

You can now (re-)use your model and score the remaining test file (or a new file) with it.

# Load XGBoost model from the JSON file
path_apply_model = var_path_model + var_model_name_full + "_model_stored.json"

xgboost_apply = xgb.Booster()
xgboost_apply.load_model(path_apply_model)

import json

# Read the JSON file back into a Python dictionary
with open(var_path_model + var_model_name_full + "_variable_list.json", "r") as f:
    loaded_dict = json.load(f)

# fill the list of categorical columns
new_cat_cols = loaded_dict['cat_cols']
new_features = loaded_dict['features']

Your dataset “data_test” is the one you imported before, but you can import any set you like (as long as it has the same data structure, obviously). Note two things: you only keep the features you also used for training, and you make sure to convert the strings to category and enable their use when creating the data matrix. The original data stays the same, including additional values like IDs or customer numbers; only a prediction score (“P1”) will be added.

df_test_apply = data_test[new_features].copy()
df_test_apply[new_cat_cols] = df_test_apply[new_cat_cols].astype('category')

# Create DMatrix for new data
D_new = xgb.DMatrix(df_test_apply, enable_categorical=True)

# Get the predicted probabilities for each class
probabilities = xgboost_apply.predict(D_new)
probabilities_df = pd.DataFrame(probabilities, columns = ['P1'])

# Join the original target column with the predicted probabilities
result = pd.concat([data_test, probabilities_df], axis=1)

result.to_parquet(var_path_data + var_model_name_full + "_scored_test_data.parquet", compression='gzip')

If you have a Target variable in the file, you can now check the statistics again:

# evaluate the best model on the test data
auc_pred = roc_auc_score(result['Target'], result['P1'], average='weighted')
print(f'Test AUC: {auc_pred:.4f}')

# from sklearn.metrics import average_precision_score
aucpr = average_precision_score(result['Target'], result['P1'], average='weighted', pos_label=1)
print(f'Test AUCPR: {aucpr:.4f}')

You can make several runs with different settings and parameters. Each run will get its own timestamp and its settings will be stored.
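
If you want to compare those runs directly in Python (KNIME will do the same thing with a JSON Reader below), a small sketch that collects all stored config files and ranks them by their Test_AUCPR value could look like this (the glob pattern and keys follow the naming used above):

import glob

# collect all stored model configurations and rank them by Test AUCPR
runs = []
for f_name in glob.glob(var_path_model + "*_variable_list.json"):
    with open(f_name, "r") as f:
        cfg = json.load(f)
    runs.append({"model": cfg["var_model_name_full"], "Test_AUCPR": float(cfg["Test_AUCPR"])})

df_runs = pd.DataFrame(runs).sort_values("Test_AUCPR", ascending=False)
print(df_runs)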

Use the best XGBoost prediction in KNIME

OK, this is supposed to be about low-code, so let us come to that. You as a data scientist have provided the models, and now we can use KNIME to deploy them within a nice KNIME workflow. I often use the power of KNIME to prepare and clean the data, and of course you can also build great machine learning models with KNIME itself. But sometimes I like to throw in some input from the wider world of (Python or R based) machine learning algorithms.

In the workflow there is a JSON Reader that imports all the config files we have created, also the ones from other algorithms; you can check them out and they might feature in another blog. The JSON Reader can take in all config files that match the criteria at once.

import model settings from JSON files (https://hub.knime.com/-/spaces/-/latest/~GABT_OgeoWxWJW9P/)

With a JSON Path node we can extract the path, name and the AUC(PR) statistics:

extract JSON data with Path (https://hub.knime.com/-/spaces/-/latest/~GABT_OgeoWxWJW9P/)

Only the best model (per algorithm) will make the cut

filter duplicates

The imported settings are used to load the model and the additional information and then score the incoming data from KNIME.

KNIME workflow to apply the XGBoost model (https://hub.knime.com/-/spaces/-/latest/~GABT_OgeoWxWJW9P/)

Just like in the Jupyter notebook, but using KNIME flow variables and the parameters from the JSON file, the model will be applied.

python code to apply the model embedded in a KNIME Python node
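
As a rough sketch of what such a node can contain (assuming the knime.scripting.io API of recent KNIME versions; the flow variable names "v_path_model_json" and "v_path_variable_list" are just placeholders for the values extracted from the JSON config):

import knime.scripting.io as knio
import xgboost as xgb
import json

# placeholder flow variable names - in the workflow these come from the JSON config
path_model_json = knio.flow_variables["v_path_model_json"]
path_variable_list = knio.flow_variables["v_path_variable_list"]

# read back the stored variable lists
with open(path_variable_list, "r") as f:
    cfg = json.load(f)

# load the stored XGBoost model
booster = xgb.Booster()
booster.load_model(path_model_json)

# score the incoming KNIME table and append the prediction column
df = knio.input_tables[0].to_pandas()
df_apply = df[cfg["features"]].copy()
df_apply[cfg["cat_cols"]] = df_apply[cfg["cat_cols"]].astype("category")

df["P1"] = booster.predict(xgb.DMatrix(df_apply, enable_categorical=True))
knio.output_tables[0] = knio.Table.from_pandas(df)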

You can now integrate your best XGBoost model into your KNIME workflow, and if you keep improving the model the workflow will automatically select the best one. You will have the settings and things like variable importance ready, as well as a comparison of various models:

Inspect the results from various Jupyter notebooks with advanced ML models

The workflow contains some more Python based model developments, also for LightGBM with Optuna and for H2O.ai; you can already explore them if you like. They might feature in a future blog.


Markus Lauber

Senior Data Scientist working with KNIME, Python, R and Big Data Systems in the telco industry