GETTING STARTED | HYPERPARAMETER OPTIMIZATION | KNIME ANALYTICS PLATFORM

Hyperparameter optimization for LightGBM — wrapped in KNIME nodes

TL;DR: Let a KNIME node find the right hyperparameters for your LightGBM model. Also use vtreat and see if that improves the results.

Markus Lauber
Low Code for Data Science
13 min read · Feb 4, 2023


LightGBM is a popular machine learning package, and there are quite a few examples out there on how to do hyperparameter tuning with it. For this article, I toyed around with ChatGPT (yes, everyone does it these days) and (mostly) let it write the code. The process was not straightforward at all, but I will not go into details here (if you want to read more and see some more examples, please see my LinkedIn entry). Additionally, the handling of string variables is not as simple as some other examples might suggest. But we are getting ahead of ourselves.

If you are curious about KNIME Analytics Platform itself, there is this great article: “The Best kept Secret in Data Science is KNIME”.

First, we need to set up a Python environment to work with LightGBM and some other packages for optimization. Since this is supposed to result in a KNIME workflow, we will make the Conda environment fit to be used with KNIME as well. Please keep in mind: these articles are aimed at the moderately ambitious Python user, so if you are familiar with this stuff you can skip a good portion of this blog post.

KNIME workflow LightGBM (https://hub.knime.com/-/spaces/-/latest/~GABT_OgeoWxWJW9P/).

If you want to learn more about setting up a Conda environment for KNIME you can check my article “KNIME and Python — Setting up and managing Conda environments”.

Here is the YAML file to create such a Conda environment (yes, I like to have useful comments ready within my file). See the article above on how to use it if you are not already familiar with this. Please note: the specific settings might evolve, so you might have to tweak them in the future (for example, the restriction on the numpy version).

# https://medium.com/low-code-for-advanced-data-science/knime-and-python-setting-up-and-managing-conda-environments-2ac217792539
# conda env create -f="/Users/m_lauber/Dropbox/knime-workspace/Machine_Learning/ml_binary/kn_example_ml_binary_lightgbm_hyper_parameter_opt/data/py3_knime_lightgbm.yml"
# conda env create -f="C:\\Users\\x123456\\knime-workspace\\Machine_learning\\ml_binary\\kn_example_ml_binary_lightgbm_hyper_parameter_opt\\data\\py3_knime_lightgbm.yml"

# conda activate py3_knime_lightgbm
# conda update -n py3_knime_lightgbm --all

# conda env update --name py3_knime_lightgbm --file "/Users/m_lauber/Dropbox/knime-workspace/Machine_Learning/ml_binary/kn_example_ml_binary_lightgbm_hyper_parameter_opt/data/py3_knime_lightgbm.yml" --prune
# conda env update --name py3_knime_lightgbm --file "C:\\Users\\x123456\\knime-workspace\\Machine_learning\\ml_binary\\kn_example_ml_binary_lightgbm_hyper_parameter_opt\\data\\py3_knime_lightgbm.yml" --prune

# conda env update --name py3_knime_lightgbm --file "/Users/m_lauber/Dropbox/knime-workspace/Machine_Learning/ml_binary/kn_example_ml_binary_lightgbm_hyper_parameter_opt/data/py3_knime_lightgbm.yml"
# conda env update --name py3_knime_lightgbm --file "C:\\Users\\x123456\\knime-workspace\\Machine_learning\\ml_binary\\kn_example_ml_binary_lightgbm_hyper_parameter_opt\\data\\py3_knime_lightgbm.yml"
# conda update -n base conda

# KNIME official Python integration guide
# https://docs.knime.com/latest/python_installation_guide/index.html#_introduction

# KNIME and Python — Setting up and managing Conda environments
# https://medium.com/low-code-for-advanced-data-science/knime-and-python-setting-up-and-managing-conda-environments-2ac217792539

# file: py3_knime_lightgbm.yml with some modifications
# THX Carsten Haubold (https://hub.knime.com/carstenhaubold) for hints
name: py3_knime_lightgbm # Name of the created environment
channels: # Repositories to search for packages
# - defaults # edit: removed to just use conda-forge
# - anaconda # edit: removed to just use conda-forge
- conda-forge
# https://anaconda.org/knime
- knime # conda search knime-python-base -c knime --info # to see what is in the package
dependencies: # List of packages that should be installed
- python=3.9 # Python
- knime-python-base # dependencies of KNIME - Python integration
# - knime-python-scripting # everything you need to also build Python packages for KNIME
- cairo # SVG support
- pillow # Image inputs/outputs
- matplotlib # Plotting
- IPython # Notebook support
- nbformat # Notebook support
- scipy # Notebook support
- jpype1
# Jupyter Notebook support
- jupyter # Jupyter Notebook
- pandas-profiling # create overview of your data
- sweetviz # In-depth EDA (target analysis, comparison, feature analysis, correlation) in two lines of code!
# Machine Learning Modules
- lightgbm
- hyperopt
- scikit-optimize # skopt
- optuna # A hyperparameter optimization framework
- pip # Python installer
- pip:
  # - JPype1 # Databases
  - vtreat # https://medium.com/low-code-for-advanced-data-science/data-preparation-for-machine-learning-with-knime-and-the-python-vtreat-package-efcaf58fa783

If you are wondering, “Wasn’t this supposed to be low-code?”: yes, you can always just wrap your code in a component for ease of reusability and shareability, and low-code is not no-code! KNIME and Python are best friends. :-)

Walkthrough of the Jupyter notebook

First, you would initialize the necessary packages.

You can find the Jupyter notebook on GitHub; it is also in the subfolder /data/ of the workflow, named “kn_example_python_lightgbm_hyper_parameter_bayes_search_cv.ipynb”. If you do not care about the details, you can just download the KNIME workflow and jump to the part where we talk about integrating it into KNIME itself.

# https://github.com/ml-score/knime_meets_python/tree/main/machine_learning/binary
# initialize the Python packages in py3_knime_lightgbm environment
import numpy as np
import pandas as pd
import pyarrow.parquet as pq
import json
import pickle

import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, auc, average_precision_score, precision_recall_curve
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
import matplotlib.pyplot as plt

Next, we need to prepare our data. I often find this part somewhat short in other articles, and some quirks that arise when you use mixed data with numeric variables and strings are not always addressed. So I try to deal with that here.

# import the data from Parquet files as pandas DataFrames
data = pq.read_table("train.parquet").to_pandas()
data_test = pq.read_table("test.parquet").to_pandas()

# not strictly necessary in a notebook environment
# but might be useful if used in KNIME
data = data.reset_index(drop=True)
data_test = data_test.reset_index(drop=True)

I like to get an overview of the variables. If you still have variables that should not be in the model, you can add them under excluded_features.

# obviously you could adapt that to your needs
excluded_features = ['row_id']

# your target variable should be a string with "0" / "1" ("1" being the positive class)
label = ['Target']
features = [feat for feat in data.columns if feat not in excluded_features and feat not in label]

num_cols = data[features].select_dtypes(include='number').columns.tolist()
cat_cols = data[features].select_dtypes(exclude='number').columns.tolist()

rest_cols = [feat for feat in data.columns if feat not in cat_cols]

print(f'''{"data shape:":20} {data.shape}
{"data[features] shape:":20} {data[features].shape}
categorical columns: {cat_cols}
numerical columns: {num_cols}
feature columns: {features}
rest columns: {rest_cols}''')

# THX David Gutmann
# you want to make sure your text variables are categories
# and your Target is an integer
data[cat_cols] = data[cat_cols].astype('category')
data[label] = data[label].astype('int32')

We will keep the test data separated and split the original training data further into another Training and Test set.

# split data into X and y only keeping the features
X = data[features]
y = data[label]

# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)

# well, yes, this seems to be necessary, otherwise LightGBM would complain about data types
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

Once we have the data ready, we can start defining the optimization. First, we need to set the search space of the parameters we want to optimize. You would want to check the documentation and maybe change some ranges. I removed some recommended settings since I found them of no great use, but this might be different in other cases. The number of iterations is also worth a look: depending on the power of your machine and the nature of your data, this might run for several hours.

# define the search space for hyperparameters
# https://scikit-optimize.github.io/stable/modules/generated/skopt.BayesSearchCV.html
# the basic structure and comments have been provided by ChatGPT

# number of iterations
var_n_iter = 500

space = {
    'learning_rate': Real(0.01, 0.5, 'log-uniform'), # step size shrinkage used to prevent overfitting. Lower values = more accuracy but slower training
    'num_leaves': Integer(25, 250), # the maximum number of leaves in any tree
    'max_depth': Integer(6, 15), # the maximum depth of any tree
    'min_child_samples': Integer(1, 20), # minimum number of samples required in a child node to be split
    'feature_fraction': Real(0.1, 0.9), # fraction of features used for each boosting iteration
    'bagging_fraction': Real(0.1, 0.9), # fraction of the training data to be used for each iteration
    'bagging_freq': Integer(1, 10), # number of iterations to perform bagging (sample of data to grow trees)
    # 'lambda_l1': Integer(0, 100), # L1 regularization term on weights
    # 'lambda_l2': Integer(0, 100), # L2 regularization term on weights
    'reg_alpha': Real(0, 2), # L1 regularization term on weights
    'reg_lambda': Real(0, 2), # L2 regularization term on weights
    'class_weight': Categorical(['balanced', None]), # weighting of positive classes in binary classification problems
    'boosting_type': Categorical(['gbdt', 'dart']), # type of boosting algorithm to use
    'objective': Categorical(['binary']), # objective function to use for training
    'metric': Categorical(['aucpr']), # evaluation metric to use for early stopping and model selection
    'subsample': Real(0.1, 1.0, 'uniform'), # fraction of data samples used for each iteration
    'colsample_bytree': Real(0.1, 1.0, 'uniform'), # fraction of features used for each iteration
    # 'min_gain_to_split': Integer(0, 15), # minimum gain required to make a split
    'min_split_gain': Real(0, 1.0, 'uniform'), # minimum gain required to make a split
    'n_estimators': Integer(250, 1000), # number of trees in the model
    # 'early_stopping_rounds': Integer(25, 100), # number of iterations with no improvement after which training will stop
    'importance_type': Categorical(['split', 'gain']), # type of feature importance to use for feature selection
    'scale_pos_weight': Real(0.1, 10.0, 'uniform') # control the balance of positive and negative weights
}

Next, we need to define the optimizer itself. The number of iterations comes from the variable above. Another interesting setting is n_points, which determines how many parameter settings are tested in parallel. A sensible value for the latter also depends on the power of your machine and your data.

# define the LightGBM classifier
clf = lgb.LGBMClassifier()

# define the optimizer
opt = BayesSearchCV(
    clf,
    space,
    n_iter=var_n_iter,
    # n_points=1 - Number of parameter settings to sample in parallel. If this does not align with n_iter,
    # the last iteration will sample less points.
    n_points=3,
    cv=5, # depending on the size of the data you might reduce the number of cross-validations
    n_jobs=-1,
    return_train_score=True,
    refit=True,
    # https://scikit-optimize.github.io/stable/modules/generated/skopt.Optimizer.html#skopt.Optimizer
    optimizer_kwargs={'base_estimator': 'GP'},
    scoring='average_precision',
    random_state=42 # set the seed here
)

# perform the hyperparameter search
opt.fit(X_train, y_train)

The result will be the best model, which you can extract and save for further use. You can also store the parameters for further inspection.

# extract the best model
best_model = opt.best_estimator_

# show the best parameters
best_parameters = best_model.get_params()
print(best_parameters)

# export the best parameters to a txt file
with open("ml_model_lightgbm_jupyter_model_parameters.txt", "w") as file:
print(best_parameters, file=file)

# you can re-import the settings and put them in a pandas data frame
df_best_parameters = pd.read_csv("ml_model_lightgbm_jupyter_model_parameters.txt", sep="@", header=None, names=["lgbm_parameters"])

# evaluate the best model on the (internal) test data
y_pred = best_model.predict_proba(X_test)[:,1]

# evaluate the initial values based on the (internal Test data)
auc_pred = roc_auc_score(y_test, y_pred, average='weighted')
print(f'internal Test AUC: {auc_pred:.4f}')

aucpr = average_precision_score(y_test, y_pred, average='weighted', pos_label=1)
print(f'internal Test AUCPR: {aucpr:.4f}')

# set the path for the pickle file (you could adapt the path)
path = 'ml_model_lightgbm_jupyter.pkl'
# Save the model as pickle file
pickle.dump(best_model, open(path, 'wb'), pickle.HIGHEST_PROTOCOL)

Additionally, you can store the feature importance in a file. If you have many variables you might also use an initial round to cut back on the features, using the first 100 out of 500 or so. Or you could employ other methods for dimensionality reduction (see the end of this article).

# extract the feature importance
# https://www.kaggle.com/code/ashishpatel26/feature-importance-of-lightgbm
feature_imp = pd.DataFrame(sorted(zip(best_model.feature_importances_,X_test.columns)), columns=['Value','Feature'])

feature_imp = feature_imp.sort_values(by='Value', ascending=False, na_position='last')

feature_imp = feature_imp.reset_index(drop=True)
feature_imp['Feature_Rank'] = feature_imp.index

feature_imp.to_parquet("ml_model_lightgbm_jupyter_feature_importance.parquet", compression='gzip')
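If you want to use this importance list for an initial feature cut as mentioned above, a minimal sketch could look like this; the cutoff of 100 features is just an illustrative assumption:

# minimal sketch (assumption): keep only the 100 most important features
top_n = 100
top_features = feature_imp.sort_values(by='Value', ascending=False)['Feature'].head(top_n).tolist()

# restrict the data to these features for a second, faster tuning round
X_train_top = X_train[top_features]
X_test_top = X_test[top_features]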

Another step would be to save the variable lists in order to have them ready for the treatment of new data.

# store the variable lists as a dictionary in a JSON file to read back later
v_variable_list = {
    "num_cols": num_cols,
    "cat_cols": cat_cols,
    "rest_cols": rest_cols,
    "label": label,
    "features": features,
    "excluded_features": excluded_features
}

# Write the dictionary to a JSON file
with open("ml_model_lightgbm_jupyter_variable_list.json", "w") as f:
json.dump(v_variable_list, f)

Apply the LightGBM model with all the settings

Now that we have our model and the most important features saved, we can move on to applying the model. First, we read back the model and the settings. I like to do this right after creating the model, to check that everything has been stored correctly and that I can apply the model to a completely new dataset (since that is what we will have to do later; we are all software engineers now …).

import pickle
# set the path for the pickle file
path = 'ml_model_lightgbm_jupyter.pkl'
clf_apply = pickle.load(open(path, 'rb'))

# Read the JSON file back into a Python dictionary
with open("ml_model_lightgbm_jupyter_variable_list.json", "r") as f:
loaded_dict = json.load(f)

# restore the lists of categorical columns and features
new_cat_cols = loaded_dict['cat_cols']
new_features = loaded_dict['features']

Prepare the new data for processing. I think it is important to actually save the results in a usable data format (like Parquet).

df_test_apply = data_test[new_features].copy()
df_test_apply[new_cat_cols] = df_test_apply[new_cat_cols].astype('category')

# Get the predicted probabilities for each class and a single 0/1 prediction
probabilities = clf_apply.predict_proba(df_test_apply)
prediction = clf_apply.predict(df_test_apply)

# convert the results to a pandas dataframe
probabilities_df = pd.DataFrame(probabilities, columns = ['P0','P1'])
prediction_df = pd.DataFrame(prediction, columns = ['Target_pred'])

# Join the original target column with the predicted probabilities
# you will keep all your original data and get 3 new columns
result = pd.concat([data_test, probabilities_df, prediction_df], axis=1)

# export the result as a parquet file for further processing
result.to_parquet("ml_model_lightgbm_jupyter.parquet", compression='gzip')

You could now check the AUC and AUCPR of the results. This can also later be done in KNIME with the help of the “H2O Binomial Scorer” or the “Binary Classification Inspector”.

# evaluate the best model on the test data
auc_pred = roc_auc_score(result['Target'], result['P1'], average='weighted')
print(f'Test AUC: {auc_pred:.4f}')

aucpr = average_precision_score(result['Target'], result['P1'], average='weighted', pos_label='1')
print(f'Test AUCPR: {aucpr:.4f}')

Next you could draw a nice graphic to check the results.

import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_curve, auc
import matplotlib.pyplot as plt

y_true = result["Target"].astype(int).values
y_score = result["P1"].values

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
auc_pr = auc(recall, precision)  # sklearn's auc handles the decreasing recall values returned by precision_recall_curve

plt.step(recall, precision, color='b', alpha=0.2, where='post')
plt.fill_between(recall, precision, step='post', alpha=0.2, color='b')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('Precision-Recall curve: AUC={0:0.3f}'.format(auc_pr))
plt.show()
AUCPR curve

Doing the model training and deployment in KNIME

This code can be put into a Python Script node in KNIME and wrapped in a component. The inputs are the training and test data and a flow variable with the number of iterations. The outputs are the scored test data, the feature importance list, and the winning model parameters as a text file.

In this scenario, the target variable would be named “Target” (yes, with a capital letter at the front), stored as a string with “0” and “1” (the positive class) as values. You could, of course, define more flow variables if you want to set further parameters, or do it within the node.
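To give you an idea of how the notebook code maps to such a node, here is a minimal sketch of the Python Script node skeleton. The flow variable name “v_n_iterations” and the port order are just assumptions for illustration; the actual component on the KNIME Hub may use different names.

# minimal sketch of a KNIME Python Script node (assumed port order and flow variable name)
import knime.scripting.io as knio

# read the KNIME input tables into pandas DataFrames
data = knio.input_tables[0].to_pandas()       # training data
data_test = knio.input_tables[1].to_pandas()  # test data

# read the number of iterations from a flow variable (hypothetical name)
var_n_iter = int(knio.flow_variables["v_n_iterations"])

# ... here goes the code from the Jupyter notebook above ...

# hand the scored test data back to KNIME
knio.output_tables[0] = knio.Table.from_pandas(result)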

KNIME workflow with a component for the LightGBM model (https://hub.knime.com/-/spaces/-/latest/~GABT_OgeoWxWJW9P/).

Inside the component, the nodes are arranged in a classic KNIME machine learning setting, with model writers and readers and the export of the variable list. You could do all of this in one go, but this way you get a better overview and can re-use components.

Inside the KNIME component for LightGBM (https://hub.knime.com/-/spaces/-/latest/~GABT_OgeoWxWJW9P/).

Data Preparation with vtreat

Also in the workflow, you will find another component that automatically prepares the data with the help of the Python vtreat package.

See the article “Data preparation for Machine Learning with KNIME and the Python ‘vtreat’ package”, published in Low Code for Data Science.

In this example, vtreat does not give the model a large boost, but you have it there just in case.
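For context, this is roughly what such a vtreat step does in Python. A minimal sketch, assuming the column names from this example and 1 as the positive class; the actual component wraps similar code in a Python Script node:

# minimal sketch of a vtreat binomial treatment (assumed column names from this example)
import vtreat

treatment = vtreat.BinomialOutcomeTreatment(
    outcome_name="Target",    # the target column
    outcome_target=1,         # the positive class (assumption: integer target)
    cols_to_copy=["Target"]   # keep the target in the prepared frame
)

# learn the treatment on the training data and apply it to both sets
data_prepared = treatment.fit_transform(data, data["Target"])
data_test_prepared = treatment.transform(data_test)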

To make sure you have all the necessary Python packages at hand you can use the Conda Environment Propagation node, since the lightgbm and vtreat packages are not part of the bundled Python environment in KNIME.

Conda Environment Propagation with KNIME; see “KNIME and Python — Setting up and managing Conda environments” (https://medium.com/p/2ac217792539).

You can read about KNIME and Conda in this article: KNIME and Python — Setting up and managing Conda environments.

That concludes the development of a machine learning model with LightGBM and hyperparameter tuning in KNIME. I hope you enjoyed it!

To sum up the Code and Nodes used

Next on the agenda could be two more hyperparameter tuning approaches, for example with hyperopt or Optuna (both are already included in the Conda environment above).

You can also follow me on the KNIME Forum (https://forum.knime.com/u/mlauber71/) and check out my repository on the KNIME hub (https://hub.knime.com/mlauber71) with further sample workflows and also check my other Medium stories (https://medium.com/@mlxl).


Markus Lauber
Low Code for Data Science

Senior Data Scientist working with KNIME, Python, R and Big Data Systems in the telco industry