Maximizing Model Performance with LightGBM, MLflow, and Optuna: A Titanic Dataset Case Study

Dec 31, 2023

Introduction

In the rapidly evolving field of data science, there is a growing interest in leveraging advanced techniques like Large Language Models (LLMs) for AI applications. While LLMs have garnered attention, it’s essential to remember that tabular data remains a cornerstone of applied machine learning, and getting the most out of it still takes careful modeling. This article explores the integration of LightGBM, MLflow, and Optuna in a unified workflow, focusing on optimizing model parameters without delving into feature engineering. The analysis centers on the renowned Titanic dataset, a tabular dataset with a binary survival outcome.

Titanic Dataset Overview

The Titanic dataset is a classic dataset widely used for machine-learning exercises. For the purpose of this study, we will focus on specific features: Pclass, Sex, Age, SibSp, Parch, Fare, Embarked, and Survived. The goal is to build a model that predicts whether a passenger survived or not based on these features. It's important to note that we will not perform any feature engineering in this analysis, but rather concentrate on optimizing the LightGBM model through parameter tuning.

The dataset can be accessed and downloaded from Kaggle.

Example Rows of Titanic Dataset:

| Pclass | Sex    | Age  | SibSp | Parch | Fare     | Embarked | Survived |
|--------|--------|------|-------|-------|----------|----------|----------|
| 3      | male   | 22.0 | 1     | 0     | 7.25     | S        | 0        |
| 1      | female | 38.0 | 1     | 0     | 71.2833  | C        | 1        |
| 3      | female | 26.0 | 0     | 0     | 7.925    | S        | 1        |
| 1      | female | 35.0 | 1     | 0     | 53.1     | S        | 1        |
| 3      | male   | 35.0 | 0     | 0     | 8.05     | S        | 0        |

Why LightGBM?

Before delving into model development, let’s understand why LightGBM is a powerful choice for this task. LightGBM, a gradient-boosting framework, stands out for its speed and efficiency. Its leaf-wise tree growth strategy and histogram-based approach make it well-suited for large datasets. The ability to handle categorical features and robust performance in diverse scenarios make LightGBM an essential tool in a data scientist’s arsenal.

Key Parameters and Capabilities of LightGBM:

boosting_type: Specify the boosting method (e.g., gbdt, dart).
num_leaves: Control the maximum number of leaves in one tree.
learning_rate: Adjust the step size in each iteration.
max_depth: Set the maximum depth of the tree.
objective: Define the optimization objective (e.g., binary classification).
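As a rough illustration of how the parameters listed above fit together, the sketch below collects them in a dictionary of the kind passed to lgb.train. The values are illustrative placeholders rather than tuned settings, and the helper leaves_fit_depth is a hypothetical name introduced here. It checks one interaction worth keeping in mind: a tree of depth d can hold at most 2**d leaves, so a num_leaves value above that limit is never reached.

```python
# Hypothetical starting configuration; values are illustrative, not tuned.
params = {
    "boosting_type": "gbdt",   # gradient-boosted decision trees
    "num_leaves": 31,          # maximum number of leaves in one tree
    "learning_rate": 0.1,      # step size applied at each boosting round
    "max_depth": 6,            # a value <= 0 means no depth limit
    "objective": "binary",     # binary classification
}


def leaves_fit_depth(num_leaves: int, max_depth: int) -> bool:
    """Return True when num_leaves does not exceed what max_depth allows.

    A binary tree of depth d has at most 2**d leaves. A max_depth <= 0
    is treated as "no limit", so any num_leaves passes.
    """
    if max_depth <= 0:
        return True
    return num_leaves <= 2 ** max_depth


print(leaves_fit_depth(params["num_leaves"], params["max_depth"]))  # 31 <= 64, prints True
```

A check like this can help when choosing search ranges for num_leaves and max_depth, since many combinations of a large num_leaves with a small max_depth behave identically.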

Initial Model Evaluation

Now, let’s proceed with the initial model evaluation. Before diving into hyperparameter optimization, we evaluate the performance of a basic LightGBM model on the Titanic dataset using the F1 score, a metric that balances precision and recall.

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, confusion_matrix
import lightgbm as lgb
import optuna
import mlflow
import mlflow.lightgbm


# Load Titanic dataset
titanic_data = pd.read_csv('titanic.csv')  # Assuming the dataset is stored in 'titanic.csv'

# Select specific features
selected_features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Survived']
titanic_data = titanic_data[selected_features]

# Convert categorical features to numerical using one-hot encoding
titanic_data = pd.get_dummies(titanic_data, columns=['Sex', 'Embarked'], drop_first=True)


# Extract features and target variable
X = titanic_data.drop('Survived', axis=1)
y = titanic_data['Survived']

# Split dataset
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

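# Train an initial LightGBM model for 100 boosting rounds.
# Note: with an empty params dict LightGBM uses its default 'regression' objective;
# the 0.5 threshold below turns the continuous outputs into class labels.
initial_model = lgb.train({}, lgb.Dataset(X_train, label=y_train), 100)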

# Make predictions and calculate F1 score
y_pred_initial = initial_model.predict(X_valid)
f1_initial = f1_score(y_valid, (y_pred_initial > 0.5).astype(int))

# Display F1 Score and Confusion Matrix
print(f"Initial F1 Score: {f1_initial}")
print("Confusion Matrix:")
print(confusion_matrix(y_valid, (y_pred_initial > 0.5).astype(int)))

After training the initial LightGBM model on the Titanic dataset, the model achieved an F1 Score of approximately 0.79. Breaking down the results further with a confusion matrix:

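[[92 13]
 [17 57]]

scikit-learn's confusion_matrix puts true labels in rows and predicted labels in columns, with class 0 (did not survive) first. Reading the matrix that way:

True Negative (TN): 92
False Positive (FP): 13
False Negative (FN): 17
True Positive (TP): 57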
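To connect the matrix to the reported score, here is a small worked calculation. It assumes scikit-learn's layout for confusion_matrix (rows are true labels, columns are predicted labels, class 0 first), so the top-left cell counts true negatives and the bottom-right cell counts true positives.

```python
# Counts taken from the matrix above, in scikit-learn's layout:
# [[TN, FP],
#  [FN, TP]]
tn, fp = 92, 13
fn, tp = 17, 57

precision = tp / (tp + fp)   # share of predicted survivors who actually survived
recall = tp / (tp + fn)      # share of actual survivors the model found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.814 recall=0.770 f1=0.792
```

The result matches the F1 score of roughly 0.79 printed by the training script.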

The Significance of Tabular Data and Parameter Tuning

In the contemporary AI landscape, Large Language Models (LLMs) often steal the spotlight, but the importance of tabular data and advanced techniques in maximizing model performance should not be overlooked. In this case study, we refrain from feature engineering and concentrate on optimizing the LightGBM model through parameter tuning.

MLflow for Experiment Tracking

MLflow simplifies the machine learning lifecycle by offering tools to manage experiments, track parameters, and store models. Its logging capabilities facilitate collaboration among team members, enabling reproducibility and efficient model comparison.

Key Capabilities of MLflow:

Tracking Experiments: Log parameters, metrics, and artifacts to record and compare experiments.
Model Packaging: Easily package and share models in a standardized format.
Model Deployment: Simplify the deployment process with MLflow’s deployment options.

Optuna for Hyperparameter Optimization

Optuna automates the hyperparameter optimization process, helping data scientists find the best combination of parameters efficiently. Its framework supports various optimization algorithms, making it adaptable to different optimization scenarios.

Key Capabilities of Optuna:

Distributed Optimization: Parallelize hyperparameter search across multiple nodes.
Pruning: Accelerate optimization by early-stopping unpromising trials.
Scalability: Scale up optimization efforts for complex models and large datasets.

Integrating LightGBM, MLflow, and Optuna

Below is a simplified example of integrating LightGBM with MLflow and Optuna using the Titanic dataset:

# Define LightGBM objective function for Optuna
def objective(trial):
    params = {
        'objective': 'binary',
        'metric': 'binary_error',
        'boosting_type': 'gbdt',
        'num_leaves': trial.suggest_int('num_leaves', 2, 256),
        'max_depth': trial.suggest_int('max_depth', 2, 32),
        'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.5, log=True),
        'feature_fraction': trial.suggest_float('feature_fraction', 0.1, 1.0),
    }

    # Set Train and Valid datasets
    lgb_train = lgb.Dataset(X_train, y_train)
    lgb_valid = lgb.Dataset(X_valid, y_valid, reference=lgb_train)

    # Train the model, recording per-iteration metrics in evals_result
    evals_result = {}
    model = lgb.train(params, lgb_train, num_boost_round=100,
                      valid_sets=[lgb_train, lgb_valid],
                      valid_names=['train', 'valid'],
                      callbacks=[lgb.record_evaluation(evals_result)])

    # Make predictions and calculate F1 score
    y_pred = model.predict(X_valid, num_iteration=model.best_iteration)
    f1 = f1_score(y_valid, (y_pred > 0.5).astype(int))

    # Log parameters and metrics with MLflow
    with mlflow.start_run():
        mlflow.log_params(params)
        mlflow.log_metric('f1', f1)
        # Save the model as an artifact
        mlflow.lightgbm.log_model(model, "model")

    return f1

# Optimize hyperparameters with Optuna
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

1. Objective Function Definition:

The objective function defines the hyperparameters that Optuna will optimize.
It sets up a dictionary (params) containing hyperparameters such as the objective, metric, boosting type, number of leaves, max depth, learning rate, and feature fraction.

2. Dataset Preparation:

LightGBM datasets (lgb_train and lgb_valid) are created using the training (X_train, y_train) and validation (X_valid, y_valid) data.

3. Model Training:

The LightGBM model is trained with the current set of hyperparameters using the training dataset (lgb_train).
The model is validated on the validation dataset (lgb_valid), and training progress is recorded with the record_evaluation callback.

4. Prediction and Metric Calculation:

Predictions are made on the validation set, and the F1 score is calculated using the f1_score function.

5. Logging with MLflow:

A new MLflow run is started, and the hyperparameters (params) and F1 score are logged.
The trained LightGBM model is saved as an artifact with the name “model.”

6. Optimization with Optuna:

The Optuna study is created to optimize the objective function (maximize F1 score) over a specified number of trials (n_trials).

This process iteratively refines hyperparameters to find the combination that maximizes the F1 score. Each trial's hyperparameters, F1 score, and trained model are logged to MLflow as a separate run.

Tracking with MLflow

To view model runs and metrics logged in MLflow, follow these steps:

Install MLflow: pip install mlflow
Start the MLflow UI: Run mlflow ui in your terminal.
Open your browser and go to http://localhost:5000 to access the MLflow UI.
Explore the recorded runs, metrics, and parameters.

Logging models with MLflow provides an organized record of each iteration’s configuration, facilitating easy comparison and decision-making. The MLflow UI acts as a user-friendly control center, offering insights into model performance and aiding in the seamless exploration and utilization of trained models.

Training the Final Model

Training once more with the optimal parameters identified by the Optuna study, we establish our ‘final_model.’ This final model is then evaluated on the same validation split that was used during tuning, so its score is likely somewhat optimistic compared with truly unseen test data.

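# Train the final model with the best hyperparameters.
# study.best_params holds only the values suggested by trials, so the fixed
# settings from the objective function are added back explicitly.
best_params = {'objective': 'binary', 'metric': 'binary_error', 'boosting_type': 'gbdt', **study.best_params}
final_model = lgb.train(best_params, lgb.Dataset(X_train, label=y_train), num_boost_round=100)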

# Make predictions and calculate F1 score
y_pred_final = final_model.predict(X_valid)
f1_final = f1_score(y_valid, (y_pred_final > 0.5).astype(int))

# Display F1 Score and Confusion Matrix
print(f"Final Model F1 Score: {f1_final}")
print("Confusion Matrix:")
print(confusion_matrix(y_valid, (y_pred_final > 0.5).astype(int)))

After optimizing the LightGBM model with Optuna hyperparameter tuning, the final F1 Score achieved is approximately 0.81.

Confusion Matrix:

[[92 13]
 [15 59]]

Reading rows as true labels and columns as predicted labels (scikit-learn's layout):

True Negative (TN): 92
False Positive (FP): 13
False Negative (FN): 15
True Positive (TP): 59

The F1 Score of 0.81 is a modest improvement over the initial model’s 0.79: the tuned model classifies two more validation passengers correctly (151 versus 149 out of 179). Through the streamlined Optuna, MLflow, and LightGBM process, we achieved this gain without any feature engineering, which underscores the value of systematic parameter tuning.

Conclusion

In conclusion, this case study illustrates the significance of advanced techniques in maximizing model performance with tabular data. By leveraging LightGBM, MLflow, and Optuna, the article demonstrates a streamlined workflow for optimizing model parameters without engaging in feature engineering. Parameter tuning alone produced a measurable gain here, which suggests that further improvement is likely available once feature engineering is added on top. As the AI industry embraces new buzzwords like Large Language Models (LLMs), it remains imperative to tackle tabular data with advanced techniques for robust model development.

Written by Tom Haber