From Classroom to Kaggle Competitions: Speed with LightAutoML in the AutoML Grand Prix

SamiraAlipour

Welcome back, data enthusiasts! Today, I’m thrilled to take you through another exciting chapter of our Kaggle journey. If you missed my previous blog on classifying academic outcomes, be sure to read “Classifying Academic Outcomes with Machine Learning”. This time, we’re diving into the fast-paced world of the AutoML Grand Prix, a competition designed to test the best Automated Machine Learning (AutoML) tools. Join me as I share our experiences and strategies in leveraging LightAutoML to navigate this thrilling challenge.

1- The AutoML Grand Prix: A New Frontier in Kaggle Competitions

Before this competition began, we discovered an exciting announcement on Kaggle: the 2024 AutoML Grand Prix, in partnership with the International Conference on Automated Machine Learning. This multi-competition challenge invites participants to test their AutoML skills and tools across a series of monthly Tabular Playground Series competitions.

The AutoML Grand Prix is structured to run from May to September, featuring monthly challenges where participants have only 24 hours to submit their best models. Given this tight timeframe, the competition emphasizes speed, efficiency, and the power of automated machine learning.

Our team, guided by our insightful supervisor, Reza Shokrzad, eagerly decided to participate. Each group within our team chose a different AutoML package to tackle the competition, and our group selected LightAutoML. We applied it during the first 24 hours of the Kaggle Playground Series — Season 4, Episode 6: Academic Success competition.

Unlike our original approach to the academic success competition, where we had a month to refine our models and perform extensive preprocessing, this challenge required us to make quick decisions and rely heavily on the automated features of LightAutoML. For those interested in a detailed explanation of the competition and dataset, I recommend reading my previous blog post on the academic success challenge. In that post, we had the luxury of time for comprehensive preprocessing and customized feature engineering.

However, in the AutoML Grand Prix, with just 24 hours to deliver results, we leaned on LightAutoML’s capabilities to handle preprocessing tasks efficiently. This shift in approach presented both challenges and opportunities, testing our ability to adapt to a high-speed, automated environment.

2- Tackling the Academic Success Prediction Challenge

Academic success is a critical measure of educational outcomes, reflecting students’ achievements and overall learning effectiveness. For an in-depth understanding of the dataset, preprocessing, and feature engineering steps, refer to my previous blog post on classifying academic outcomes. In this competition, we faced an imbalanced target variable. We employed SMOTE (Synthetic Minority Over-sampling Technique) to address this issue effectively.

In the following sections, we’ll explore our implementation of SMOTE and the use of LightAutoML to streamline our model development process and enhance prediction accuracy. Stay tuned to learn about these powerful techniques and their application in our fast-paced competition journey.

3- Understanding AutoML

AutoML, or Automated Machine Learning, is a revolutionary approach designed to automate the end-to-end process of applying machine learning to real-world problems. It aims to make machine learning accessible to non-experts and improve the efficiency of experts by automating repetitive tasks. AutoML covers several stages of a machine learning pipeline, including data preprocessing, feature engineering, model selection, hyperparameter tuning, and model evaluation.

AutoML systems employ a variety of sophisticated techniques to achieve their goals. At their core, many AutoML tools use meta-learning, which allows them to understand how different algorithms perform on various types of datasets. This knowledge enables AutoML systems to make smart choices about which models to try and how to set them up, much like an experienced data scientist would.

Another key feature of AutoML is the use of ensemble methods. By combining multiple models, AutoML tools can often achieve better performance than any single model alone. This is similar to getting advice from a group of experts instead of just one. Additionally, AutoML systems often use advanced optimization techniques like Bayesian optimization or evolutionary algorithms. These methods help efficiently search through the vast space of possible models and settings, finding good solutions much faster than a human could.
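To give a feel for how such optimization works in practice, here is a minimal sketch using Optuna, the same optimizer LightAutoML integrates (as we will see in Section 7–3). The search space, model, and toy dataset are illustrative assumptions, not any tool's internal configuration:

import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy data standing in for a real tabular dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

def objective(trial):
    # Optuna proposes hyperparameters informed by the results of earlier trials
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'max_depth': trial.suggest_int('max_depth', 2, 10),
    }
    model = RandomForestClassifier(**params, random_state=42)
    return cross_val_score(model, X, y, cv=3, scoring='accuracy').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=20)
print(study.best_params, study.best_value)

Each trial's result updates the optimizer's internal model of the search space, so promising regions are sampled more often than a plain grid or random search would.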

Some AutoML systems go even further by automating feature engineering. Using techniques like deep feature synthesis, they can create new, potentially useful features from the raw data. This is particularly exciting because feature engineering is often considered one of the most creative and time-consuming parts of machine learning. By automating these complex tasks, AutoML not only saves time but can often discover model configurations that even human experts might miss, potentially leading to better performance on a wide range of machine learning tasks.

3–1- Why Use AutoML?

AutoML has emerged as a game-changer in the field of machine learning, offering numerous benefits that make it an attractive option for both novices and experienced data scientists. Here’s why AutoML is becoming increasingly popular:

  1. Democratization of Machine Learning: AutoML breaks down barriers, allowing professionals from various fields to use the power of machine learning without extensive coding or data science expertise. This democratization opens up new possibilities for innovation across industries.
  2. Rapid Prototyping: In fast-paced business environments, AutoML enables quick development of proof-of-concept models. This flexibility allows teams to validate ideas and make data-driven decisions more quickly, potentially leading to faster time-to-market for new products or services.
  3. Handling Complex Data: As datasets grow in size and complexity, AutoML systems can efficiently navigate through intricate data structures, identifying patterns and relationships that might be challenging for human analysts to discover manually.
  4. Continuous Learning and Adaptation: Many AutoML platforms incorporate online learning capabilities, allowing models to adapt to new data in real-time. This feature is particularly valuable in dynamic environments where data patterns may shift rapidly.
  5. Cost-Effectiveness: By reducing the need for large teams of specialized data scientists, AutoML can significantly lower the costs associated with implementing machine learning solutions, making it more accessible for small to medium-sized businesses.
  6. Standardization of Best Practices: AutoML tools often incorporate industry best practices and state-of-the-art techniques, ensuring that even less experienced users can produce high-quality models that follow established standards.
  7. Explainable AI: Advanced AutoML platforms are increasingly focusing on model interpretability, providing insights into how models make decisions. This transparency is crucial for building trust in AI systems, especially in regulated industries.

By leveraging these advantages, organizations can accelerate their AI initiatives, drive innovation, and stay competitive in an increasingly data-driven world. AutoML not only enhances efficiency but also empowers a wider range of professionals to contribute to data science projects, fostering a culture of data-driven decision-making across the entire organization.

3–2- Popular AutoML Packages

1. H2O.ai: H2O.ai provides an open-source machine learning platform known for its scalability and ease of use. It’s suitable for both small and large datasets and supports a wide range of machine learning algorithms. H2O.ai Documentation

2. Google AutoML: Google AutoML offers a suite of machine learning tools from Google Cloud that provide powerful capabilities for custom model development, including image and text classification. Google AutoML Documentation

3. Auto-Sklearn: Auto-Sklearn is an open-source AutoML tool built on the popular scikit-learn library. It’s particularly useful for classification and regression tasks, making it ideal for traditional machine learning problems. Auto-Sklearn Documentation

4. TPOT: TPOT (Tree-based Pipeline Optimization Tool) uses genetic algorithms to optimize machine learning pipelines. It’s well-suited for users who need to automate the model selection and hyperparameter tuning process. TPOT Documentation

5. Azure AutoML: Azure AutoML is a cloud-based service from Microsoft Azure that provides automated machine learning capabilities, making it easy to build and deploy models at scale. Azure AutoML Documentation

6. LightAutoML: LightAutoML is a lightweight AutoML framework designed for both time series and tabular data. It offers fast and accurate model training and is particularly suited for quick iterations and handling large datasets. LightAutoML Documentation, LightAutoML Repository

4- Exploring LightAutoML

4–1- LightAutoML: A Deeper Dive

LightAutoML (LAMA) is an open-source AutoML framework developed by Sberbank AI Lab that’s revolutionizing the way we approach machine learning tasks. Designed to be both lightweight and powerful, LAMA excels in handling tabular and time-series data efficiently, making it a go-to solution for data scientists and researchers alike.

4–2- Key Features and Advantages:

1. Lightning-Fast Performance (Speed and Efficiency): LAMA’s speed is its standout feature. It blazes through model training and hyperparameter tuning, making it perfect for time-sensitive projects and large datasets.

2. User-Friendly Interface (Ease of Use): With its intuitive APIs, LAMA simplifies complex machine learning pipelines. You don’t need to be a seasoned data scientist to build robust models — LAMA’s got your back.

3. Versatility at Its Core (Flexibility): Whether you’re tackling classification, regression, or time-series forecasting, LAMA adapts to your needs. It’s like having a Swiss Army knife for machine learning tasks.

4. Smart Feature Engineering: LAMA doesn’t just process your data; it enhances it. Its automated feature engineering capabilities can uncover hidden patterns, boosting your model’s predictive power.

5. Ensemble Learning (Stacking): Through advanced model stacking techniques, LAMA combines multiple models to create a super-predictor that’s greater than the sum of its parts.

6. Scalability: From academic research to large-scale industrial applications, LAMA scales effortlessly to meet your data demands.

7. Community-Driven Innovation: As an open-source project, LAMA benefits from a vibrant community of contributors, ensuring it stays at the cutting edge of AutoML technology.

8. Interpretability Matters: LAMA doesn’t just give you predictions; it helps you understand them. Its built-in tools for model interpretation demystify the decision-making process.

9. Customizability: While LAMA automates much of the ML pipeline, it still gives you the flexibility to adjust and customize where it counts.

10. Resource-Savvy (Efficient Resource Utilization): LAMA is optimized to squeeze every bit of performance from your hardware, making efficient use of computational resources.

4–3- Real-World Impact:

In our recent academic success prediction challenge, LAMA proved to be an invaluable tool, allowing us to quickly iterate and optimize our models within the constrained 24-hour timeframe. Its automated capabilities and robust performance were crucial in tackling this complex challenge, showcasing the true power of AutoML in high-pressure, fast-paced environments.

5- Handling Target Imbalance in Machine Learning

Target imbalance is a common challenge in machine learning, particularly in classification problems. It occurs when the classes in the target variable are not represented equally. This imbalance can lead to biased models that perform poorly on minority classes. Let’s explore some techniques to address this issue, with a special focus on SMOTE, which we used in our AutoML implementation.

Common Techniques for Handling Imbalanced Data:

  1. Oversampling: This involves increasing the number of instances in the minority class.
  • Random Oversampling: Randomly duplicates examples from the minority class.
  • SMOTE (Synthetic Minority Over-sampling Technique): Creates synthetic examples of the minority class.

  2. Undersampling: This involves reducing the number of instances in the majority class.
  • Random Undersampling: Randomly removes examples from the majority class.
  • Tomek Links: Removes majority class examples that are close to minority class examples.

  3. Combination Methods: These combine oversampling and undersampling techniques.
  • SMOTETomek: Applies SMOTE followed by Tomek Links.
  • SMOTEENN: Applies SMOTE followed by Edited Nearest Neighbors.

  4. Algorithmic Ensemble Techniques: These use ensemble methods to handle imbalance.
  • Balanced Random Forest: A random forest that undersamples each bootstrap sample.
  • EasyEnsemble: Trains multiple classifiers on balanced subsets of the data.

  5. Cost-Sensitive Learning: Assigns different costs to misclassification of different classes (several of these techniques are illustrated in the sketch below).
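To make these options concrete, here is a minimal sketch of how several of them look in code with the imbalanced-learn library. The toy dataset and the particular samplers chosen are illustrative assumptions, not our competition setup:

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek

# Toy imbalanced dataset (roughly 90% majority / 10% minority)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

for sampler in [RandomOverSampler(random_state=42),
                RandomUnderSampler(random_state=42),
                SMOTE(random_state=42),
                SMOTETomek(random_state=42)]:
    X_res, y_res = sampler.fit_resample(X, y)
    print(f"{type(sampler).__name__}: {(y_res == 1).sum()} minority samples after resampling")

All of these samplers share the same fit_resample interface, which makes it easy to swap techniques and compare their effect on validation scores.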

5–1- Deep Dive into SMOTE

In our AutoML implementation, we chose to use SMOTE (Synthetic Minority Over-sampling Technique) to handle the class imbalance. Here’s a detailed look at how SMOTE works:

5–1–1- Basic Principle:

SMOTE creates synthetic examples in the feature space rather than duplicating existing examples. This helps to avoid overfitting that can occur with simple oversampling.

5–1–2- Algorithm Steps:

For each minority class sample, the following steps are performed (a simplified code sketch follows the list):

  • Find its k-nearest neighbors (typically k=5): Identify the closest minority class instances in the feature space.
  • Randomly select one of these neighbors: Choose one of the nearest neighbors at random.
  • Create a synthetic example along the line segment joining the sample and the selected neighbor: Generate a new sample by interpolating between the original sample and the selected neighbor.
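The interpolation at the heart of SMOTE is simple enough to sketch by hand. The following is a simplified, illustrative implementation of generating one synthetic sample; production libraries such as imbalanced-learn handle many details this sketch omits:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_one_sample(X_minority, i, k=5, rng=np.random.default_rng(42)):
    """Generate one synthetic sample for minority point i (simplified SMOTE)."""
    # Step 1: find the k nearest minority-class neighbors (k+1 includes the point itself)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, idx = nn.kneighbors(X_minority[i:i + 1])
    # Step 2: randomly select one of the neighbors (skip index 0, the point itself)
    j = rng.choice(idx[0][1:])
    # Step 3: interpolate along the segment between the point and its neighbor
    gap = rng.random()
    return X_minority[i] + gap * (X_minority[j] - X_minority[i])

X_min = np.random.RandomState(0).rand(20, 3)  # toy minority-class points
print(smote_one_sample(X_min, i=0))

Because the gap is drawn uniformly from [0, 1], the synthetic point always lies somewhere on the line segment between the two real minority samples.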

5–1–3- Advantages:

SMOTE offers several advantages in handling class imbalance, making it a valuable technique in many scenarios:

  • Increases the number of minority samples without exact replication.
  • Helps the classifier to build larger decision regions that contain nearby minority class points.
  • Can be more effective than simple random oversampling in many scenarios.

5–1–4- Considerations:

While SMOTE is a powerful technique for addressing class imbalance, it’s important to consider its limitations:

  • The synthetic examples are created based on feature space similarities, not on the underlying data generation process: This means that the generated samples may not always accurately represent real-world data patterns.
  • It may not be suitable for high-dimensional data or when the minority class is extremely rare: In these cases, SMOTE might struggle to create meaningful synthetic examples, potentially leading to suboptimal results.

5–2- Why We Initially Chose SMOTE and Our Subsequent Findings

In the context of our academic success prediction task during the 24-hour AutoML Grand Prix, SMOTE was our initial choice for handling class imbalance for several reasons:

  1. Preserving Information: Unlike undersampling techniques, SMOTE doesn’t discard any majority class examples, preserving all available information.
  2. Avoiding Overfitting: By creating synthetic examples rather than duplicating existing ones, SMOTE helps prevent the model from overfitting to specific minority class instances.
  3. Improved Model Generalization: The synthetic examples created by SMOTE can help the model learn more general decision boundaries for the minority classes.
  4. Compatibility with AutoML: SMOTE integrates well with many machine learning algorithms, making it a suitable choice for our AutoML pipeline.

However, it’s important to note that our initial choice of SMOTE was made under the time constraints of the 24-hour competition. In our subsequent month-long experiments, which are detailed in our previous blog post, we discovered that for this specific dataset, using class weights yielded better results.

This finding underscores the importance of thorough experimentation and the need to adapt our approaches based on the specific characteristics of each dataset. While SMOTE is a powerful technique for many imbalanced datasets, our extended analysis revealed that class weighting was more suitable for this particular academic success prediction challenge.

This experience highlights a crucial lesson in data science: initial choices made under time constraints should always be revisited and validated when more time is available for in-depth analysis and experimentation.
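For reference, the class-weighting alternative that ultimately performed better in our extended experiments requires no resampling at all, often just a single estimator parameter. The sketch below is a hedged illustration on toy data, not our exact competition code:

from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Toy 3-class imbalanced dataset (60% / 30% / 10%)
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=6,
                           weights=[0.6, 0.3, 0.1], random_state=42)

# 'balanced' reweights each class inversely to its frequency,
# penalizing mistakes on minority classes more heavily during training
model = LGBMClassifier(class_weight='balanced', random_state=42)
print(cross_val_score(model, X, y, cv=5, scoring='f1_weighted').mean())

Unlike SMOTE, this leaves the data untouched and instead changes the loss, which avoids the risk of synthetic samples that do not reflect the true data-generating process.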

6- Implementing AutoML with LightAutoML

Implementing AutoML involves several steps, from data preprocessing to model training and evaluation. Below, we provide a detailed explanation of our implementation using LightAutoML, a powerful AutoML tool designed for quick and efficient model training.

6–1- Installation

To get started with LightAutoML, we first need to install the library:

!pip install lightautoml

6–2- Setting Up the Task

Next, we define the task for our AutoML pipeline. In this case, we are dealing with a multiclass classification problem, where we aim to predict the academic success of students.

from lightautoml.tasks import Task

# Define the task: multiclass classification, optimizing accuracy (higher is better)
task = Task('multiclass', greater_is_better=True, metric='accuracy')

6–3- Parameter Configuration

Setting parameters appropriately is crucial for the success of any AutoML process. These parameters control various aspects of the model training and evaluation, ensuring that the process runs efficiently and effectively. We set several parameters for our AutoML process, including timeout, number of threads, number of folds for cross-validation, and a random state for reproducibility.

# Set parameters
TIMEOUT = 3600 # 1 hour timeout
N_THREADS = 4
N_FOLDS = 5
RANDOM_STATE = 42
TARGET_NAME = 'Target'

6–3–1- Detailed Explanation of Parameters

1- Timeout: The TIMEOUT parameter specifies the maximum amount of time (in seconds) that the AutoML process is allowed to run. Here, we set it to 3600 seconds, which equals one hour. This ensures that the AutoML process will not exceed our time constraints, making it a critical parameter when working with limited time resources. This constraint pushes the AutoML tool to prioritize quicker algorithms and strategies, balancing between speed and performance.

2- Number of Threads: The N_THREADS parameter defines the number of CPU threads to be used during the AutoML process. By setting this to 4, we allow LightAutoML to utilize four threads concurrently. This helps in speeding up the computations by parallelizing tasks, which is particularly beneficial when dealing with large datasets or complex models. The number of threads can be adjusted based on the computational resources available.

3- Number of Folds for Cross-Validation: Cross-validation is a technique used to evaluate the performance of a model by dividing the dataset into several subsets (folds). The N_FOLDS parameter specifies the number of these folds. In this case, we set it to 5, meaning the dataset will be split into five parts. Each part will be used as a validation set once, while the remaining parts serve as the training set. This helps in obtaining a more reliable estimate of the model's performance by ensuring that each data point is used for both training and validation. It also helps in detecting overfitting.

4- Random State: Ensures that the randomness in the process is controlled, making results reproducible and consistent.

5- Target Column: The TARGET_NAME parameter specifies the name of the column in the dataset that contains the labels to be predicted, guiding the AutoML process on what to optimize.

By carefully configuring these parameters, we can efficiently utilize LightAutoML to automate the machine learning process, ensuring that we get reliable and reproducible results within our specified constraints.

6–4- Data Preprocessing

For consistent label encoding, we combine the training and test datasets, excluding the target column in the training data. We then label encode the categorical features.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Combine train and test for consistent label encoding (excluding target column in train)
combined = pd.concat([train_data.drop(columns=[TARGET_NAME]), test_data], ignore_index=True)

# Label encode categorical features
for col in categorical_features:
    le = LabelEncoder()
    combined[col] = le.fit_transform(combined[col].astype(str))

# Separate back into train and test (.copy() avoids pandas SettingWithCopyWarning)
train_encoded = combined.iloc[:len(train_data)].copy()
test_encoded = combined.iloc[len(train_data):].copy()

# Add back the target column to the train dataset only
train_encoded[TARGET_NAME] = train_data[TARGET_NAME].values

6–5- Feature Engineering

Given the time constraint of 24 hours, we performed basic feature engineering. However, our previous blog details a more thorough and refined approach to feature engineering. We created simple new features by combining existing ones and scaling numerical features.
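As a rough illustration of what this quick feature engineering looked like, the sketch below combines two columns and scales the results. The column names here are hypothetical placeholders, not the actual competition schema:

from sklearn.preprocessing import StandardScaler

# Hypothetical quick feature combinations; 'units_approved', 'units_enrolled',
# 'admission_grade', and 'age_at_enrollment' are illustrative placeholder names
for df in (train_encoded, test_encoded):
    df['approval_ratio'] = df['units_approved'] / (df['units_enrolled'] + 1)
    df['grade_age'] = df['admission_grade'] * df['age_at_enrollment']

# Scale the new numerical features so they share a comparable range;
# fit the scaler on train only, then apply it to test
numerical_features = ['approval_ratio', 'grade_age']  # assumed list
scaler = StandardScaler()
train_encoded[numerical_features] = scaler.fit_transform(train_encoded[numerical_features])
test_encoded[numerical_features] = scaler.transform(test_encoded[numerical_features])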

6–6- Handling Class Imbalance with SMOTE

To address the imbalance in our target variable, we used the Synthetic Minority Over-sampling Technique (SMOTE) to generate synthetic samples for the minority classes.

from imblearn.over_sampling import SMOTE

# Handle class imbalance using SMOTE
smote = SMOTE(random_state=RANDOM_STATE)
X_resampled, y_resampled = smote.fit_resample(
    train_encoded.drop(columns=[TARGET_NAME]), train_encoded[TARGET_NAME]
)

# Combine resampled features and target into one DataFrame for LightAutoML
train_resampled = pd.DataFrame(X_resampled, columns=train_encoded.columns.drop(TARGET_NAME))
train_resampled[TARGET_NAME] = y_resampled

6–7- Initializing LightAutoML

We initialize the LightAutoML framework with our specified parameters, including the task, timeout, number of threads, cross-validation settings, and hyperparameter tuning time.

from lightautoml.automl.presets.tabular_presets import TabularAutoML

# Initialize LightAutoML with our task, time budget, and cross-validation settings
automl = TabularAutoML(
    task=task,
    timeout=TIMEOUT,
    cpu_limit=N_THREADS,
    reader_params={
        'n_jobs': N_THREADS,
        'cv': N_FOLDS,
        'random_state': RANDOM_STATE,
        'stratified': True
    },
    tuning_params={'max_tuning_time': 1200},
)

6–7–1- Detailed Explanation of Parameters

  1. Task: Defines the nature of the problem (e.g., classification, regression) and specifies the evaluation metric and whether a higher or lower value is better for the chosen metric.
  2. Timeout: Limits the total time for the AutoML process to ensure it completes within the given timeframe.
  3. Number of Threads (cpu_limit): Sets the maximum number of CPU threads to use, facilitating parallel processing to speed up computations.
  4. Reader Parameters (reader_params):
  • Number of Jobs (n_jobs): Controls the number of threads for data reading and processing.
  • Cross-Validation (cv): Specifies the number of folds for cross-validation to ensure robust model evaluation.
  • Random State (random_state): Ensures reproducibility by controlling the randomness in data splitting and algorithm behavior.
  • Stratified: Indicates whether to maintain the distribution of the target variable across the folds during cross-validation.

  5. Tuning Parameters (tuning_params):
  • Maximum Tuning Time (max_tuning_time): Limits the time spent on hyperparameter tuning to optimize model performance within a reasonable timeframe.

These parameters collectively configure LightAutoML to efficiently handle the dataset, perform cross-validation, ensure reproducibility, and optimize model performance within the specified constraints.

6–8- Model Training and Prediction

We fit the AutoML model to our resampled training data and predict on both the training and test datasets. We also save the out-of-fold predictions for further analysis.

import pickle
from sklearn.metrics import accuracy_score, f1_score

# Fit and predict with LightAutoML (fit_predict returns out-of-fold predictions)
oof_preds = automl.fit_predict(
    train_resampled,
    roles={'target': TARGET_NAME},
    verbose=1
)

# Save the out-of-fold predictions
with open('lightautoml_oof_preds.pkl', 'wb') as f:
    pickle.dump(oof_preds.data, f)

# Predict on test data
test_predictions = automl.predict(test_encoded)

# Evaluate the model on the (un-resampled) training set; note that this assumes
# the target labels are integer-encoded in the same order as the prediction columns
y_pred_train = automl.predict(train_encoded.drop(columns=[TARGET_NAME])).data.argmax(axis=1)
accuracy = accuracy_score(train_encoded[TARGET_NAME], y_pred_train)
f1 = f1_score(train_encoded[TARGET_NAME], y_pred_train, average='weighted')
print(f"Accuracy on the training set: {accuracy}")
print(f"F1-score on the training set: {f1}")

6–9- Out-of-Fold Predictions

Out-of-fold (OOF) predictions are a technique used in machine learning to evaluate model performance in a way that closely mimics how the model will perform on unseen data. This is achieved by using cross-validation.

6–9–1- How Out-of-Fold Predictions Work

The process of generating out-of-fold predictions involves several key steps:

1. Cross-Validation: The training data is divided into k folds. In each iteration, one fold is used as the validation set, and the remaining k−1 folds are used as the training set.

2. Training and Prediction: The model is trained on the k−1 training folds and then makes predictions on the validation fold. This process is repeated k times, so each fold serves as the validation set once.

3. Combining Predictions: The predictions for each fold are combined to create a complete set of predictions for the entire training dataset. These are the out-of-fold predictions.
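Outside of LightAutoML (whose fit_predict returns OOF predictions automatically), the same mechanics are available in scikit-learn via cross_val_predict. This is a sketch on toy data, not our competition pipeline:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, random_state=42)

# Each prediction comes from a model that never saw that row during training
oof = cross_val_predict(RandomForestClassifier(random_state=42), X, y, cv=5)
print('OOF accuracy:', accuracy_score(y, oof))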

6–9–2- Advantages of Out-of-Fold Predictions

Out-of-fold predictions offer several significant advantages:

  • Robust Evaluation: OOF predictions provide a robust estimate of model performance by leveraging multiple splits of the data.
  • Bias Reduction: By training on different subsets of the data, the model’s bias is reduced, leading to more reliable performance metrics.
  • Ensembling: OOF predictions are commonly used in ensemble methods like stacking, where the predictions of base models are used as features for a meta-model.

6–9–3- Why Use Out-of-Fold Predictions?

Out-of-fold predictions are essential for:

  • Accurate Model Evaluation: They provide a realistic estimate of how the model will perform on new, unseen data.
  • Avoiding Data Leakage: By ensuring that the validation data is not used during training, OOF predictions prevent data leakage and give an unbiased performance estimate.
  • Improving Generalization: They help in identifying and mitigating overfitting, leading to models that generalize better to new data.

In summary, out-of-fold predictions are a crucial component in the model evaluation process, ensuring that we have a robust and unbiased measure of our model’s performance.

6–10- Saving and Submitting Predictions

Finally, we save the test predictions and create a submission file.

# Optionally save the test predictions (argmax picks the highest-probability class)
test_data['predictions'] = test_predictions.data.argmax(axis=1)

# Convert predictions to categorical labels; this index-to-label mapping
# must match LightAutoML's internal class ordering
test_data['predictions'] = test_data['predictions'].replace({0: 'Graduate', 1: 'Dropout', 2: 'Enrolled'})

# Create a DataFrame with id column and predictions
submission_df = pd.DataFrame({'id': test_ids, 'Target': test_data['predictions']})

# Save the DataFrame to a CSV file
submission_df.to_csv('submissionautoml1.csv', index=False)

7- Output Analysis of Running LightAutoML

The following is the detailed output log from running LightAutoML. Let’s delve into some significant aspects that are crucial for understanding how LightAutoML operates and how it can be leveraged for your projects.

7–1- Training and Validation

LightAutoML leverages LightGBM for training and provides detailed logging on the validation scores across iterations. The logs show the validation scores improving over iterations and illustrate early stopping mechanisms when improvements plateau. This ensures the model does not overfit and saves computational resources.

INFO:lightautoml.automl.base:Time left 3404.82 secs
INFO3:lightautoml.ml_algo.boost_lgbm:Training until validation scores don't improve for 200 rounds
DEBUG:lightautoml.ml_algo.boost_lgbm:[100] valid's multi_error: 0.162936
DEBUG:lightautoml.ml_algo.boost_lgbm:[200] valid's multi_error: 0.15315
...
DEBUG:lightautoml.ml_algo.boost_lgbm:[1200] valid's multi_error: 0.137389
DEBUG:lightautoml.ml_algo.boost_lgbm:Did not meet early stopping. Best iteration is:
[1172] valid's multi_error: 0.137043
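If you want to reproduce this early-stopping behavior with LightGBM directly, outside of LightAutoML, the standard mechanism is a callback. The parameters and toy data below are illustrative, assuming a recent LightGBM version:

import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=42)

model = lgb.train(
    {'objective': 'multiclass', 'num_class': 3, 'metric': 'multi_error', 'verbosity': -1},
    lgb.Dataset(X_tr, label=y_tr),
    num_boost_round=2000,
    valid_sets=[lgb.Dataset(X_val, label=y_val)],
    # Stop if the validation error does not improve for 200 rounds
    callbacks=[lgb.early_stopping(stopping_rounds=200), lgb.log_evaluation(100)],
)
print('Best iteration:', model.best_iteration)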

7–2- Model Selection and Training Parameters

LightAutoML goes through several models and hyperparameter configurations, showing its automated hyperparameter tuning capabilities. The parameters logged here for LightGBM include specific settings such as learning_rate, num_leaves, and feature_fraction, showcasing the fine-tuning process that AutoML undertakes to optimize model performance.

[22:38:39] Selector_LightGBM fitting and predicting completed
INFO:lightautoml.ml_algo.base:Selector_LightGBM fitting and predicting completed
[22:38:53] Start fitting Lvl_0_Pipe_1_Mod_0_LightGBM ...
INFO:lightautoml.ml_algo.base:Start fitting Lvl_0_Pipe_1_Mod_0_LightGBM ...
DEBUG:lightautoml.ml_algo.base:Training params: {'task': 'train', 'learning_rate': 0.04, 'num_leaves': 128, 'feature_fraction': 0.7, 'bagging_fraction': 0.7, 'bagging_freq': 1, 'max_depth': -1, 'verbosity': -1, 'reg_alpha': 1, 'reg_lambda': 0.0, 'min_split_gain': 0.0, 'zero_as_missing': False, 'num_threads': 2, 'max_bin': 255, 'min_data_in_bin': 3, 'num_trees': 2000, 'early_stopping_rounds': 100, 'random_state': 42}

7–3- Hyperparameter Optimization with Optuna

The integration with Optuna for hyperparameter optimization is another key aspect. Optuna performs a series of trials to find the best set of hyperparameters, significantly enhancing model performance. The log entries show the optimization process and the resulting best parameters, highlighting the sophisticated search mechanisms in LightAutoML.

INFO:lightautoml.ml_algo.tuning.optuna:Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ... Time budget is 1.00 secs
INFO:optuna.storages._in_memory:A new study created in memory with name: no-name-17578caf-a604-447b-a514-25ae2cae9528
INFO3:lightautoml.ml_algo.boost_lgbm:Training until validation scores don't improve for 200 rounds
...
INFO:optuna.study.study:Trial 0 finished with value: 0.8579779172981165 and parameters: {'feature_fraction': 0.6872700594236812, 'num_leaves': 244, 'bagging_fraction': 0.8659969709057025, 'min_sum_hessian_in_leaf': 0.24810409748678125, 'reg_alpha': 2.5361081166471375e-07, 'reg_lambda': 2.5348407664333426e-07}. Best is trial 0 with value: 0.8579779172981165.

7–4- Ensembling Different Models

LightAutoML employs an ensemble approach, training and evaluating different models such as LightGBM and CatBoost. The logs show the performance of each model, and the ensemble method combines these to achieve better overall accuracy. This ensemble strategy is a powerful feature of LightAutoML, leveraging the strengths of multiple models.

[23:00:45] Fitting Lvl_0_Pipe_1_Mod_2_CatBoost finished. score = 0.8508190806090784
INFO:lightautoml.ml_algo.base:Fitting Lvl_0_Pipe_1_Mod_2_CatBoost finished. score = 0.8508190806090784
[23:00:45] Lvl_0_Pipe_1_Mod_2_CatBoost fitting and predicting completed
INFO:lightautoml.ml_algo.base:Lvl_0_Pipe_1_Mod_2_CatBoost fitting and predicting completed

7–5- Blending Optimization

The blending optimization process is crucial for enhancing the final model’s performance by combining predictions from individual models. This process iteratively adjusts the weights assigned to each model to minimize prediction errors. The optimization details are as follows:

Initial Score: The process begins with equal weights and an initial score of 0.8515756930191474.

[23:10:21] Blending: optimization starts with equal weights and score 0.8515756930191474
INFO:lightautoml.automl.blend:Blending: optimization starts with equal weights and score 0.8515756930191474
[23:10:22] Blending: iteration 0: score = 0.8530706502128887, weights = [0. 0.6437206 0.3562794 0. ]
INFO:lightautoml.automl.blend:Blending: iteration 0: score = 0.8530706502128887, weights = [0. 0.6437206 0.3562794 0. ]
[23:10:23] Blending: iteration 1: score = 0.8533737461210941, weights = [0. 0.76393205 0.23606798 0. ]
INFO:lightautoml.automl.blend:Blending: iteration 1: score = 0.8533737461210941, weights = [0. 0.76393205 0.23606798 0. ]
[23:10:24] Blending: iteration 2: score = 0.8533737461210941, weights = [0. 0.76393205 0.23606798 0. ]
INFO:lightautoml.automl.blend:Blending: iteration 2: score = 0.8533737461210941, weights = [0. 0.76393205 0.23606798 0. ]
[23:10:24] Blending: no score update. Terminated

The blending process terminates when no further score improvement is observed, indicating that the optimal weights have been found.

7–6- Final Model Description

The final model is a combination of several base models, each contributing based on the optimized weights. The model description is as follows:

Final prediction for new objects (level 0) = 
0.76393 * (2 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) +
0.23607 * (3 averaged models Lvl_0_Pipe_1_Mod_2_CatBoost)

This formula illustrates that the final prediction is primarily influenced by two averaged LightGBM models, contributing a combined weight of 0.76393. Additionally, three averaged CatBoost models contribute with a weight of 0.23607. This weighted combination ensures that the strengths of both LightGBM and CatBoost are leveraged to produce robust and accurate predictions.
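In code, this final blend reduces to a weighted average of each model's predicted class probabilities, as the sketch below shows with dummy arrays:

import numpy as np

# Dummy predicted-probability matrices (rows: samples, columns: classes)
p_lgbm = np.array([[0.7, 0.2, 0.1], [0.3, 0.5, 0.2]])
p_catboost = np.array([[0.6, 0.3, 0.1], [0.2, 0.6, 0.2]])

# Weighted blend using the weights LightAutoML found
p_final = 0.76393 * p_lgbm + 0.23607 * p_catboost
predicted_class = p_final.argmax(axis=1)
print(p_final, predicted_class)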

By understanding these outputs and their implications, you can better appreciate the model’s performance and the effectiveness of the blending optimization process in LightAutoML. This comprehensive analysis provides a solid foundation for evaluating and utilizing automated machine learning models in various applications.

8- Conclusion

As we conclude this thrilling adventure through the AutoML Grand Prix, it’s clear that leveraging LightAutoML has been a game changer for our team. The combination of speed, simplicity, and sophistication offered by this powerful framework allowed us to navigate the challenges of the competition effectively, even under tight time constraints.

Throughout this experience, we learned the importance of adaptability in the fast-paced world of machine learning. Whether you are a seasoned data scientist looking to enhance your productivity or a newcomer eager to explore the vast possibilities of machine learning, LightAutoML stands out as a valuable partner in pushing the boundaries of what’s possible in data science.

I would like to extend my heartfelt thanks to my supervisor, Reza Shokrzad, for his guidance and support throughout this competition. His insights were invaluable in shaping our approach and strategy. Additionally, I want to acknowledge my classmates for their collaboration and enthusiasm, which made this experience even more enriching.

As we continue our journey in the world of data science, I encourage you all to embrace the tools and techniques that can help streamline your workflows and improve your models. The AutoML Grand Prix has not only tested our skills but has also inspired us to keep pushing forward in our quest for knowledge and innovation. Thank you for joining me on this adventure, and stay tuned for more insights and experiences from our Kaggle journey!

Your Turn!

Hope you enjoyed this read. We eagerly await your experiences, results, or alternative approaches in the comments below! Join the conversation and enjoy the collective learning journey with us!
