From Classroom to Kaggle Competitions: Unlocking the Secrets of Academic Success

SamiraAlipour
21 min read · Jul 20, 2024



Classifying Academic Outcomes: Insights from Kaggle Playground Series

Welcome back, data enthusiasts! Today, I’m excited to take you through another chapter of our Kaggle journey. If you missed my previous blog on predicting flood probabilities, be sure to check out “From Classroom to Kaggle Competitions: Mastering Regression for Flood Prediction”. This time, we’re diving into a new challenge: predicting academic success using classification models. Join me as I share our experiences from the Kaggle Playground Series — Season 4, Episode 6, and the strategies we used to classify academic outcomes effectively.

Here’s a quick look at the competition to get you started:

Classification with an Academic Success Dataset. Kaggle competition

Our team, guided by our insightful supervisor, Reza Shokrzad, started this mission to predict academic success and build robust classification models. We decided to use traditional machine learning techniques for this dataset, saving our LightAutoML experiments for a future blog post.

Here’s a brief overview of our approach, the lessons we learned, and the unique insights we gained from this competition. One of the standout challenges was the imbalanced multiclass target. We explored several methods to handle this imbalance, including SMOTE and adjusting class weights.

For feature engineering, we utilized K-Modes clustering to capture patterns in the categorical data. This approach, along with our comprehensive preprocessing steps, helped us enhance our models’ performance. Read on to discover more about our journey, the strategies we employed, and how you can apply these insights to your own projects.

1- Competition and Academic Success Dataset Description

1–1- About the Competition and Problem Statement

In the “Classification with an Academic Success Dataset” competition, our task is to predict student outcomes — whether they will drop out, remain enrolled, or graduate — based on a variety of factors. This challenge provides a valuable opportunity to apply and improve our machine learning skills in an educational context, helping to identify students at risk and potentially reducing dropout rates. The competition evaluates models using accuracy to measure how well our predictions align with actual student outcomes.

1–2- Academic Success Dataset Description

The dataset for this competition, both training and testing sets, is generated from a deep learning model trained on the Predict Students’ Dropout and Academic Success dataset from the UCI Machine Learning Repository. We merged this dataset with additional data from Kaggle to improve our scores. The feature distributions are similar to, but not identical to, the original dataset. Participants are encouraged to use the original dataset to explore differences and potentially improve model performance by incorporating it into their training process.

This dataset, compiled from various databases of a higher education institution, includes data on students enrolled in diverse undergraduate degrees such as agronomy, design, education, nursing, journalism, and management. The original dataset has been carefully preprocessed to handle anomalies, outliers, and missing values. It consists of 38 attributes, including:

  • id: Unique identifier for each student.
  • Marital Status: Categories include single, married, widower, divorced, facto union, and legally separated.
  • Application Mode: Various modes of application, such as general contingent, special contingent, transfer, and international student.
  • Application Order: Order of application preferences.
  • Course: Various undergraduate degrees.
  • Daytime/Evening Attendance: Indicates if the student attends daytime or evening classes.
  • Previous Qualification: The level of education before enrollment.
  • Previous Qualification (Grade): Grade of the previous qualification.
  • Nationality: Nationality of the student.
  • Parents’ Qualification: Education levels of the mother and father.
  • Parents’ Occupation: Occupations of the mother and father.
  • Admission Grade: Grade at the time of admission.
  • Displaced: Indicates if the student is displaced from their home.
  • Educational Special Needs: Indicates if the student has special educational needs.
  • Debtor: Indicates if the student has outstanding debts.
  • Tuition Fees Up to Date: Indicates if tuition fees are current.
  • Gender: Gender of the student.
  • Scholarship Holder: Indicates if the student is on a scholarship.
  • Age at Enrollment: Age of the student at the time of enrollment.
  • International: Indicates if the student is an international student.
  • Curricular Units (1st and 2nd Semesters): Number of units credited, enrolled, evaluated, approved, and grades for both semesters.
  • Economic Indicators: Unemployment rate, inflation rate, and GDP.
  • Target: The classification target — dropout, enrolled, or graduate.

The dataset was created to help reduce academic dropout and failure in higher education by using machine learning techniques to identify students at risk early in their academic journey. By analyzing these features, we can implement strategies to support these students and improve their chances of success.

2- Exploratory Data Analysis (EDA)

In our journey to predict academic success, the first crucial step was to dive deep into our data through Exploratory Data Analysis (EDA). This process allowed us to uncover hidden patterns, identify potential challenges, and gain invaluable insights that would guide our subsequent preprocessing and modeling decisions.

2–1- Statistical Information of Numerical and Categorical Values

We began our EDA by examining the numerical and categorical features to understand their basic statistics. Among the numerical features, for example, we observed that “Marital status” was mostly 1, with values ranging from 1 to 6, and that “Age at enrollment” averaged about 22 years, ranging from 17 to 70. You can find the ranges and detailed analysis for the other features in my code on Kaggle and in the GitHub repository. For the categorical features, we found an imbalance in the target variable, with “Graduate” being the most frequent category. While these statistical summaries helped us understand the data’s structure, they are not enough on their own; we need more advanced techniques to uncover deeper patterns and relationships within the data.
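For reference, these summaries come straight from pandas; a minimal sketch, assuming the training file is loaded as train_data (the name used throughout this post):

import pandas as pd

train_data = pd.read_csv('train.csv')  # competition training file (path is an assumption)
print(train_data.describe())  # numerical summary: mean, std, min, max, quartiles
print(train_data.describe(include='object'))  # categorical summary: unique values, top value, frequency
print(train_data['Target'].value_counts(normalize=True))  # class balance of the target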

2–2- Visualizing Data Distributions

To efficiently analyze our features, we developed a custom visualization function. This tool enabled us to generate appropriate plots for each feature type, providing a comprehensive view of our data landscape.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew

def plot_feature_distributions(data, target):
    cat_cols = [col for col in data.columns if (data[col].dtype == 'O' or data[col].nunique() < 100) and col != target]
    num_cols = [col for col in data.columns if col not in cat_cols and col != target]

    # Number of subplots including the pie chart for the target column
    total_plots = len(cat_cols) + len(num_cols) + 1  # +1 for the pie chart
    n_cols = 2
    n_rows = int(np.ceil(total_plots / n_cols))

    fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 5))
    axes = axes.flatten()

    # Plot pie chart for the target variable
    target_counts = data[target].value_counts(normalize=True)
    colors = ['#66c2a5', '#fc8d62', '#8da0cb']  # Add more colors for additional classes
    axes[0].pie(target_counts, labels=target_counts.index, autopct='%1.1f%%', colors=colors[:len(target_counts)], startangle=90)
    axes[0].set_title(f"Distribution of {target}")

    # Loop through categorical columns
    for idx, col in enumerate(cat_cols):
        contingency_table = pd.crosstab(data[col], data[target], normalize='index')
        contingency_table.plot(kind="bar", stacked=True, color=colors[:len(target_counts)], ax=axes[idx + 1])  # +1 to account for the pie chart
        axes[idx + 1].set_title(f"Percentage Distribution of {target} across {col}")
        axes[idx + 1].set_xlabel(col)
        axes[idx + 1].set_ylabel("Percentage")
        axes[idx + 1].legend(title=target, loc='upper right')

    # Loop through numerical columns
    for idx, col in enumerate(num_cols, start=len(cat_cols) + 1):  # +1 to account for the pie chart
        # Use more bins and a tighter KDE bandwidth for strongly skewed features
        if data[col].dtype != 'O' and skew(data[col]) > 0.75:
            sns.histplot(data=data, x=col, hue=target, kde=True, ax=axes[idx], palette=colors[:len(target_counts)], bins=50, kde_kws={'bw_adjust': 0.5})
        else:
            sns.histplot(data=data, x=col, hue=target, kde=True, ax=axes[idx], palette=colors[:len(target_counts)], bins='auto', kde_kws={'bw_adjust': 0.5})

        axes[idx].set_title(f"Distribution of {col} colored by {target}")
        axes[idx].set_xlabel(col)
        axes[idx].set_ylabel("Density")

    # Remove any extra empty plots if the number of subplots is odd
    for i in range(total_plots, len(axes)):
        fig.delaxes(axes[i])

    plt.tight_layout()
    plt.show()

plot_feature_distributions(train_data, 'Target')

2–2–1- Target Variable Distribution

We started with a pie chart to visualize the distribution of our target variable. This simple yet effective visualization immediately revealed the proportion of students in each category: Graduate, Dropout, and Enrolled. Understanding this distribution is crucial for addressing any class imbalances in our predictive model.

2–2–2- Categorical and Discrete Numerical Features

For categorical features and numerical features with fewer than 100 unique values, we employed stacked bar charts. These visualizations displayed the percentage distribution of our target variable across the different categories, allowing us to identify potential correlations between certain categorical variables (such as application mode or marital status) and academic outcomes.

Observation: Our stacked bar charts revealed interesting patterns. Students applying through certain modes had higher graduation rates, while others showed a higher proportion of dropouts; for instance, every student whose Application mode was 10 or 11 graduated.

Visualizing Data Distributions

2–2–3- Continuous Numerical Features

Histograms served as our primary tool for visualizing continuous numerical features. These plots revealed the shape of each feature’s distribution, enabling us to identify skewness and unusual patterns that might require attention during preprocessing.

Observation: We found that several numerical features (Curricular units 1st sem (without evaluations), Curricular units 2nd sem (without evaluations), Curricular units 2nd sem (credited), Curricular units 1st sem (credited), Age at enrollment, and Application order) exhibited significant skewness. This was crucial to note, as skewed distributions can violate the assumptions of many machine learning algorithms and potentially hurt our model’s performance.

For a complete view of all the plots, you can find them on my Kaggle notebook.

Histogram plots

2–2–4- The Power of Visualization

Our EDA process underscored the critical importance of data visualization in understanding complex datasets. While statistical summaries are valuable, visualizations reveal patterns and relationships that numbers alone might miss. These visual insights guided our preprocessing decisions and helped us form hypotheses about potential predictors of academic success. By thoroughly analyzing our data’s distributions and skewness, we laid a solid foundation for the next stages of our project, ensuring that our model would be built on a deep understanding of the underlying data patterns.

2–3- Identifying and analyzing outliers

In our analysis of outliers within the academic success dataset, we employed an approach combining visual and statistical methods. Boxplots served as our primary visual tool, allowing us to quickly identify features with significant deviations. We complemented this with a detailed statistical analysis, calculating the interquartile range (IQR) for each numerical feature and determining outliers as values falling outside the established bounds. This process revealed varying proportions of outliers across different features, with some, such as “Curricular units 1st sem (grade)” and “Curricular units 2nd sem (grade),” showing notably high percentages of outliers exceeding 20%.

Outlier
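As a minimal sketch of the IQR check described above (the helper name below is ours, not the competition notebook’s):

import numpy as np

def iqr_outlier_share(series, k=1.5):
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return ((series < lower) | (series > upper)).mean()  # fraction of values outside the bounds

for col in train_data.select_dtypes(include=np.number).columns:
    share = iqr_outlier_share(train_data[col])
    if share > 0:
        print(f'{col}: {share:.1%} outliers')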

2–4- Correlation Analysis

We took a close look at how different features relate to each other and to our target variable. We did this by creating a correlation matrix, which is like a map showing how strongly each feature is connected to others. We then turned this matrix into a colorful heatmap, making it easy to quickly see which features were strongly connected to each other. This step was crucial because it helped us understand which features might be most useful for predicting academic success.
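A short sketch of how such a heatmap can be produced (styling details here are illustrative):

import matplotlib.pyplot as plt
import seaborn as sns

corr = train_data.select_dtypes(include='number').corr()  # pairwise correlations of numerical features

plt.figure(figsize=(18, 14))
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.title('Feature correlation heatmap')
plt.tight_layout()
plt.show()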

2–4–1- Advantages of Correlation Analysis

Understanding feature relationships through correlation analysis offers several benefits. First, it helps us pick the most promising features for our model, potentially improving its accuracy. Second, it warns us about features that are too similar, which could confuse our model. Third, it gives us ideas for creating new, more powerful features by combining existing ones. Lastly, when we pair this data-driven approach with our knowledge about education, we can make smarter choices about which features to use and how to use them. This careful analysis of correlations helped us build a stronger, more accurate model for predicting student success.

3- Preprocessing

3–1- Handling Unique Categorical Values Not in Test Data

A critical challenge in our Kaggle project was ensuring that categorical values were consistent between the train and test datasets, which is essential for maintaining the integrity of our predictive model.

Our initial analysis revealed that several features, though numerically coded, were inherently categorical, such as “Course” and “Nationality”. Upon closer inspection, we found discrepancies in unique values between the train and test sets. For instance, “Application mode” in the train set contained values {4, 9, 12, 57, 26} that were absent in the test set. Similar inconsistencies were observed in other features like “Course”, “Previous qualification” and “Nacionality”.

Unseen values in ‘Application mode’, ‘Course’ and ‘Previous qualification’

To address this, we developed a nuanced imputation strategy. Initially, we used the mode of each feature, but due to our imbalanced dataset, this approach was inadequate. Instead, we opted to impute unseen categorical values using the third most frequent value within each target group. This method ensured that the imputation respected the distribution within each class, maintaining the dataset’s inherent characteristics while addressing discrepancies.

Imputation of unseen values in ‘Application mode’, ‘Course’ and ‘Previous qualification’
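The sketch below captures the idea under a helper name of our own choosing (the exact notebook code is on Kaggle and GitHub): train-set category values that never appear in the test set are replaced with the third most frequent test-known value of that feature within the row’s target group.

def impute_unseen_categories(train_df, test_df, feature, target='Target'):
    unseen = set(train_df[feature].unique()) - set(test_df[feature].unique())
    for cls, group in train_df.groupby(target):
        # rank values the test set does contain, most frequent first
        ranked = [v for v in group[feature].value_counts().index if v not in unseen]
        fill_value = ranked[2] if len(ranked) > 2 else ranked[0]  # 3rd most frequent, with a fallback
        mask = (train_df[target] == cls) & (train_df[feature].isin(unseen))
        train_df.loc[mask, feature] = fill_value
    return train_df

for col in ['Application mode', 'Course', 'Previous qualification', 'Nacionality']:
    train_data = impute_unseen_categories(train_data, test_data, col)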

3–2- Encoding Categorical Features

In the feature analysis and transformation phase, accurately identifying and preprocessing both categorical and numerical features is essential for the effectiveness of our predictive models. During our initial data inspection, we discovered that several features, although numerically coded, were inherently categorical. This distinction matters because it determines the appropriate preprocessing technique for each feature type. For categorical data, it’s crucial to avoid implying any order or distance between categories that do not possess such properties. As a first step we applied label encoding; on its own this introduces an arbitrary integer ordering, so we paired it with tree-based models that are largely insensitive to that ordering, as discussed below.

from sklearn.preprocessing import LabelEncoder

categorical_features = [
    'Marital status', 'Application mode', 'Course', 'Daytime/evening attendance',
    'Previous qualification', 'Nacionality', 'Mother\'s qualification', 'Father\'s qualification',
    'Mother\'s occupation', 'Father\'s occupation', 'Displaced', 'Educational special needs',
    'Debtor', 'Tuition fees up to date', 'Gender', 'Scholarship holder', 'International'
]

# Label encode categorical features on the combined train + test frame
for col in categorical_features:
    le = LabelEncoder()
    combined[col] = le.fit_transform(combined[col].astype(str))

While one-hot encoding is typically recommended to prevent ordinal assumptions by creating binary columns for each category, it can lead to a high-dimensional dataset that may be computationally expensive. Therefore, we opted for label encoding in conjunction with CatBoost, a machine learning algorithm designed to handle categorical features effectively. CatBoost can manage categorical data without misinterpreting ordinal relationships, making it a suitable choice for our model. By understanding the nature of our features and selecting the appropriate preprocessing techniques, we balanced simplicity with model performance.
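As an aside, CatBoost can also be handed the categorical columns directly via its cat_features argument; a minimal sketch of that alternative (not the label-encoded pipeline we actually submitted):

from catboost import CatBoostClassifier

cat_model = CatBoostClassifier(loss_function='MultiClass', random_state=42, verbose=0)
# Passing the column names tells CatBoost to apply its own categorical encodings internally
cat_model.fit(X_train, y_train, cat_features=categorical_features)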

3–3- Handling Inconsistencies in the Data

In our Kaggle competition project, we encountered several inconsistencies within the dataset that needed to be addressed to ensure data integrity and accuracy in our predictive modeling. Detecting and correcting these misleading values was crucial to prevent erroneous information from influencing the model’s learning process.

One significant inconsistency was the presence of zero grades for students marked as graduated. Given the importance of graduation status as a target class, this was particularly problematic. We decided to drop these samples: graduates form the dominant class, so removing a small number of contradictory rows did not significantly affect the overall dataset.

Another issue was the occurrence of approved units exceeding the number of enrolled or evaluated units. We corrected this anomaly by adjusting the approved units to ensure they did not surpass the enrolled or evaluated units.

To systematically address these anomalies, we created a function called “Find_inconsistencies”, which automatically detected and corrected these issues and streamlined the data cleaning process, preventing misleading values from skewing the model’s learning.
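A condensed, illustrative version of such a function is sketched below; the exact column names for enrolled, evaluated, and approved units are assumptions based on the dataset description, and the full Find_inconsistencies implementation lives in the notebook.

def find_inconsistencies(df, is_train=True):
    # 1) Graduates with zero grades in both semesters are contradictory; drop them from the train set
    if is_train:
        grade_cols = ['Curricular units 1st sem (grade)', 'Curricular units 2nd sem (grade)']
        contradictory = (df['Target'] == 'Graduate') & (df[grade_cols].sum(axis=1) == 0)
        df = df.loc[~contradictory].copy()

    # 2) Approved units should never exceed enrolled or evaluated units; cap them
    for sem in ['1st', '2nd']:
        approved = f'Curricular units {sem} sem (approved)'
        enrolled = f'Curricular units {sem} sem (enrolled)'
        evaluations = f'Curricular units {sem} sem (evaluations)'
        df[approved] = df[[approved, enrolled]].min(axis=1)
        df[approved] = df[[approved, evaluations]].min(axis=1)
    return df

train_data = find_inconsistencies(train_data)
test_data = find_inconsistencies(test_data, is_train=False)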

3–4- Handling Outliers

We initially attempted to address outliers by imputing them using a 1.5 IQR bound, replacing outliers with values at the boundary of the interquartile range. However, this approach led to a decrease in model accuracy. We decided to retain the outliers in their original form, as they might contain valuable information essential for the model’s predictive performance. Retaining these outliers helped maintain the dataset’s integrity and ensured we did not lose potentially significant data points.

3–5- Feature Transformation

To address skewness in our features, we applied logarithmic transformations using np.log1p to normalize the distributions. While transforming skewed data is a common practice to improve algorithm performance, this approach did not yield the expected results for our dataset. In fact, we found that the skewed features, even in their original form, might have positively contributed to the model. Consequently, we chose to proceed without log transformations, favoring the retention of the original feature distributions. This decision highlights the importance of empirical validation when applying preprocessing techniques, especially in the context of imbalanced target variables.
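For completeness, the transformation we tried (and later reverted) looked roughly like this, using the skewed columns identified during EDA:

import numpy as np

skewed_cols = ['Curricular units 1st sem (without evaluations)', 'Curricular units 2nd sem (without evaluations)',
               'Curricular units 1st sem (credited)', 'Curricular units 2nd sem (credited)',
               'Age at enrollment', 'Application order']

# log1p maps 0 to 0, so zero-heavy columns are handled gracefully
train_data[skewed_cols] = np.log1p(train_data[skewed_cols])
test_data[skewed_cols] = np.log1p(test_data[skewed_cols])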

4- Feature Engineering

In our feature engineering process, we created new features to enhance our model’s performance. We focused on derived features and clustering for some of our categorical variables.

  1. Derived Features: We generated features such as total evaluations and curricular units per semester, approval ratios, average grades, and academic load metrics, which together give a comprehensive view of a student’s academic performance (a sketch follows this list). Note: We included a small epsilon (1e-9) in the denominator of some calculations to handle zero values and avoid division-by-zero errors.
  2. K-Modes Clustering: For some of our categorical variables, we used K-Modes clustering to group similar records together. K-Modes is effective for categorical data, helping to uncover patterns that enhance predictive power. We determined the optimal number of clusters (five) and added the resulting cluster labels as a new feature (kmodes_cluster).
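Before turning to K-Modes, here is a sketch of the derived features from point 1; the curricular-unit column names follow the dataset description, and the exact feature set in our notebook is richer than this excerpt.

eps = 1e-9  # guards the ratio denominators against division by zero

for df in (X_train, test_data):
    for sem in ['1st', '2nd']:
        enrolled = df[f'Curricular units {sem} sem (enrolled)']
        approved = df[f'Curricular units {sem} sem (approved)']
        evaluations = df[f'Curricular units {sem} sem (evaluations)']

        df[f'approval_ratio_{sem}'] = approved / (enrolled + eps)
        df[f'evaluations_per_unit_{sem}'] = evaluations / (enrolled + eps)

    # totals and averages across both semesters
    df['total_approved'] = df['Curricular units 1st sem (approved)'] + df['Curricular units 2nd sem (approved)']
    df['avg_grade'] = (df['Curricular units 1st sem (grade)'] + df['Curricular units 2nd sem (grade)']) / 2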

4–1- K-Modes Clustering

K-Modes clustering is particularly useful for categorical data because it assigns cluster centers (modes) based on the most frequent categories, minimizing dissimilarity within clusters. This algorithm helps capture hidden patterns and relationships within categorical features that might not be apparent when using the original categories alone. By converting categorical variables into cluster labels, we introduced a new dimension of information that can significantly improve the model’s ability to make accurate predictions.

Here’s how we applied K-Modes clustering to our categorical features:

  1. Select Categorical Features: We chose features related to parental qualifications and occupations:
categoricalKmodes_columns = ['Mother\'s qualification', 'Father\'s qualification', 'Mother\'s occupation', 'Father\'s occupation']
categorical_Kmodesdata = X_train[categoricalKmodes_columns].copy()
categorical_Kmodesdata_test = test_data[categoricalKmodes_columns].copy()

2. Determine Optimal Clusters: We tested different values of k (number of clusters) and used inertia to find the optimal number:

from kmodes.kmodes import KModes

k_range = range(2, 6)
inertia_dict = {}
for k_val in k_range:
    km = KModes(n_clusters=k_val, init='Cao', n_init=5, verbose=1, random_state=42)
    clusters = km.fit_predict(categorical_Kmodesdata)
    inertia_dict[k_val] = km.cost_
    print(f'K: {k_val}, Inertia: {km.cost_}')
best_k = min(inertia_dict, key=inertia_dict.get)
print(f'Selected K: {best_k}')

3. Fit and Apply K-Modes: We fit the final K-Modes model with the selected number of clusters and added the cluster labels to our dataset:

km_final = KModes(n_clusters=best_k, init='Cao', n_init=5, verbose=1, random_state=42)
clusters_final = km_final.fit_predict(categorical_Kmodesdata)
clusters_final_test = km_final.predict(categorical_Kmodesdata_test)
X_train['kmodes_cluster'] = clusters_final
test_data['kmodes_cluster'] = clusters_final_test

4–2- Encouraging Creativity in Feature Engineering

Feature engineering is a creative process, and the features we created are tailored to our specific dataset and goals. The approaches we implemented are just one way to tackle the feature engineering step in a Kaggle competition. I encourage you to think critically about your data, experiment with different ideas, and don’t be afraid to innovate. Whether your ideas work or not, the experience you gain will be invaluable. Remember, feature engineering is an iterative process, and every experiment brings you closer to a deeper understanding of your data.

5- Model Selection and Training

In this section, we delve into the specifics of our model selection and training process. As mentioned in the introduction, we approached this task using two methods: a custom pipeline detailed below and LightAutoML, which will be covered in a separate blog. Our target variable is imbalanced, and we experimented with two methods to handle this: SMOTE (used in the LightAutoML blog) and class weights (discussed here).

5–1- Handling Target Imbalance

Imbalanced datasets pose a significant challenge for machine learning models as they can lead to biased predictions towards the majority class. There are several approaches to handle target imbalance, including:

1. Resampling Techniques: Adjust the class distribution by either increasing minority samples or decreasing majority samples.

  • Oversampling: Increases the number of minority class samples by duplicating them or generating new samples synthetically (e.g., SMOTE).
  • Undersampling: Reduces the number of majority class samples to balance the class distribution.

2. Algorithm-Level Approaches: Modify the learning algorithm to give more importance to the minority class during training.

  • Class Weights: Adjusts the algorithm to pay more attention to the minority class by assigning higher penalties to misclassifications of the minority class.
  • Cost-Sensitive Learning: Integrates different costs for misclassifications into the learning algorithm.

3. Ensemble Methods: Combine multiple models trained on balanced subsets to improve overall performance on imbalanced data.

  • Balanced Bagging and Boosting: Combines multiple models trained on balanced subsets of the data.

5–1–1- SMOTE (Synthetic Minority Over-sampling Technique)

SMOTE generates synthetic samples for the minority class by interpolating between existing minority instances. This helps balance the class distribution, allowing the model to learn equally from both classes. More details on SMOTE and its implementation can be found in our LightAutoML blog.
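For reference, applying SMOTE with imbalanced-learn takes only a few lines; the sketch below resamples the training split only (never the validation or test data):

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print(y_resampled.value_counts())  # classes are now balanced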

5–1–2- Class Weights

Class weights assign a higher penalty to misclassifications of the minority class during training. This ensures the model pays more attention to the minority class. We used class weights in our custom pipeline for several classifiers: XGBoost, LightGBM, and CatBoost. In our dataset, class weights proved more effective than SMOTE, possibly due to the meaningful relationships between feature values in each class, which SMOTE might disrupt by creating synthetic samples.

5–1–2–1- Why Use Class Weights?

Class weights are particularly useful when the minority class is too small for effective oversampling or when synthetic samples might disrupt the natural patterns in the data. By assigning higher weights to the minority class, the model learns to give more importance to the minority class during training, which can lead to improved performance on imbalanced datasets.

Implementation of Class Weights:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)
class_weights = compute_class_weight('balanced', classes=classes, y=y_train)
class_weights_dict = {i: class_weights[i] for i in range(len(classes))}

In XGBoost, LightGBM, and CatBoost, class weights are incorporated as hyperparameters:

  • XGBoost: scale_pos_weight
  • LightGBM: class_weight
  • CatBoost: class_weights

These hyperparameters adjust the training process to emphasize the minority class. The weights are typically calculated based on the inverse of class frequencies, ensuring the minority class has a proportionally higher impact on the training process.
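As a rough sketch of how these weights can be wired in (assuming the class_weights array, the class_weights_dict from above, and an integer-encoded y_train); note that scale_pos_weight is defined for binary problems, so per-sample weights are the more general multiclass route for XGBoost, even though we kept the scale_pos_weight ratio in our own tuning code below:

from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier

lgbm_model = LGBMClassifier(objective='multiclass', class_weight=class_weights_dict)
catboost_model = CatBoostClassifier(loss_function='MultiClass', class_weights=list(class_weights), verbose=0)

# XGBoost alternative for multiclass targets: pass per-sample weights at fit time
xgb_model = XGBClassifier(objective='multi:softprob', num_class=len(class_weights))
sample_weight = y_train.map(class_weights_dict)  # assumes y_train holds the encoded labels 0..2
xgb_model.fit(X_train, y_train, sample_weight=sample_weight)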

5–2- Classifiers and Their Characteristics

We selected three powerful classifiers for our model: XGBoost, LightGBM, and CatBoost. Each of these has unique characteristics that make them suitable for different types of data and problems.

5–2–1- XGBoost

XGBoost (Extreme Gradient Boosting) is known for its efficiency and performance in handling large datasets. It uses a boosting technique to sequentially add new models that correct the errors of previous ones. The scale_pos_weight parameter is used to balance the positive and negative weights.

5–2–2- LightGBM

LightGBM (Light Gradient Boosting Machine) is designed for speed and efficiency, capable of handling large-scale data with lower memory usage. It grows trees leaf-wise, which can lead to better accuracy. The class_weight parameter adjusts the weight for each class.

5–2–3- CatBoost

CatBoost (Categorical Boosting) is particularly effective with categorical features and requires minimal pre-processing. It handles categorical data internally and reduces the chances of overfitting. The class_weights parameter is used to set different weights for each class.

5–3- Model Training and Hyperparameter Tuning

To ensure our models were properly evaluated and tuned, we integrated them into a pipeline with a StandardScaler for feature scaling. We used Optuna for hyperparameter tuning, allowing us to efficiently search for the best parameters. For a detailed explanation of Optuna, refer to my previous blog here.

5–3–1- Hyperparameter Tuning with Optuna:

import numpy as np
import optuna
from optuna.samplers import TPESampler
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

def objective(trial, model_class, X_train, y_train, class_weights, n_splits=5, random_state=42):
    if model_class == XGBClassifier:
        params = {
            'n_estimators': trial.suggest_int('n_estimators', 100, 2000),
            'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
            'max_depth': trial.suggest_int('max_depth', 3, 10),
            'reg_alpha': trial.suggest_float('reg_alpha', 0.01, 10.0, log=True),
            'reg_lambda': trial.suggest_float('reg_lambda', 0.01, 10.0, log=True),
            # scale_pos_weight is primarily a binary-classification parameter; here it is fixed to a single ratio
            'scale_pos_weight': trial.suggest_categorical('scale_pos_weight', [class_weights[1] / class_weights[0]])
        }
        model = model_class(**params, objective='multi:softprob', num_class=len(np.unique(y_train)),
                            random_state=random_state)
    elif model_class == LGBMClassifier:
        params = {
            'n_estimators': trial.suggest_int('n_estimators', 100, 2000),
            'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
            'max_depth': trial.suggest_int('max_depth', 3, 10),
            'reg_alpha': trial.suggest_float('reg_alpha', 0.01, 10.0, log=True),
            'reg_lambda': trial.suggest_float('reg_lambda', 0.01, 10.0, log=True),
            'class_weight': class_weights_dict  # dict built earlier from compute_class_weight
        }
        model = model_class(**params, objective='multiclass', random_state=random_state)
    elif model_class == CatBoostClassifier:
        params = {
            'iterations': trial.suggest_int('iterations', 1000, 2000),
            'learning_rate': trial.suggest_float('learning_rate', 0.1, 0.15, log=True),
            'depth': trial.suggest_int('depth', 3, 7),
            'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 2, 5, log=True),
            'class_weights': class_weights
        }
        model = model_class(**params, loss_function='MultiClass', random_state=random_state, verbose=0)

    # Scale the features and evaluate the model with stratified k-fold cross-validation
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', model)
    ])

    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    scores = []
    for train_index, test_index in skf.split(X_train, y_train):
        X_tr, X_te = X_train.iloc[train_index].copy(), X_train.iloc[test_index].copy()
        y_tr, y_te = y_train.iloc[train_index], y_train.iloc[test_index]
        pipeline.fit(X_tr, y_tr)
        y_pred = pipeline.predict(X_te)
        scores.append(accuracy_score(y_te, y_pred))

    return np.mean(scores)

def tune_hyperparameters(X_train, y_train, model_class, class_weights, n_trials=50):
    study = optuna.create_study(direction='maximize', sampler=TPESampler())
    study.optimize(lambda trial: objective(trial, model_class, X_train, y_train, class_weights), n_trials=n_trials)
    return study.best_params

5–4- Voting Classifier

To leverage the strengths of each model, we combined XGBoost, LightGBM, and CatBoost into a Voting Classifier. This ensemble method uses the predictions of multiple models and outputs the majority class (hard voting) or the average probability (soft voting). Soft voting, which we used, is particularly beneficial as it combines the probability outputs of each model, leading to a more balanced and robust final prediction.

from sklearn.ensemble import VotingClassifier

voting_clf = VotingClassifier(
    estimators=[
        ('xgb', xgb_model),
        ('lgbm', lgbm_model),
        ('catboost', catboost_model)
    ],
    voting='soft',
    weights=[7, 2, 2]  # ensemble weights favouring the XGBoost model
)

6- Model Evaluation

6–1- Cross-Validation with Stratified Folds

To evaluate our models’ performance reliably, we used cross-validation with stratified folds. Cross-validation is a robust technique that helps ensure our models generalize well to unseen data by dividing the data into several subsets, or folds. The model is trained on a combination of these folds and validated on the remaining fold, repeating this process multiple times. This provides a comprehensive view of the model’s performance across different subsets of the data.

For imbalanced datasets, stratified cross-validation is particularly crucial. Stratified cross-validation ensures that each fold maintains the same proportion of classes as the original dataset. This is important because it prevents the model from being trained or validated on folds that do not represent the true distribution of the target classes. Without stratification, some folds might end up with a disproportionate number of samples from the majority class, leading to biased performance metrics that do not accurately reflect the model’s ability to handle the minority class.

Stratified cross-validation provides several benefits:

  1. Balanced Representation: Ensures each fold is representative of the entire dataset, maintaining the class distribution.
  2. Reduced Variance: Helps in reducing the variance of the performance estimates, giving a more stable and reliable measure.
  3. Improved Generalization: By training and validating on folds that mirror the actual data distribution, the model is better equipped to generalize to real-world data.

6–2- Evaluation Metrics

In the context of our Kaggle competition, the primary evaluation metric was accuracy, which measures the proportion of correctly classified instances out of the total instances. However, accuracy alone can be misleading for imbalanced datasets, as it can be high even if the model performs poorly on the minority class. For example, if the minority class constitutes only 1% of the data, a model that predicts the majority class for all instances would still achieve 99% accuracy, despite having zero recall for the minority class.

To address this limitation, we also reported the macro F1 score, providing a more comprehensive evaluation of our model’s performance across all classes. The macro F1 score is the average of the F1 scores for each class, calculated as:

Macro F1 = (1 / N) · Σ F1_i, with F1_i = 2 · (Precision_i · Recall_i) / (Precision_i + Recall_i), where N is the number of classes.

Precision (positive predictive value) is the ratio of correctly predicted positive observations to the total predicted positives, while recall (sensitivity or true positive rate) is the ratio of correctly predicted positive observations to all actual positives. The F1 score balances precision and recall, offering a single metric that accounts for both false positives and false negatives.

The macro F1 score treats each class equally by averaging the F1 scores of all classes without considering their support (the number of true instances for each class). This is particularly important for imbalanced datasets as it ensures the minority class is evaluated on equal footing with the majority class. High accuracy does not necessarily mean good performance on minority classes, but a high macro F1 score indicates the model performs well across all classes, providing a more balanced assessment.

Using both accuracy and macro F1 score offers a dual perspective:

  1. Accuracy: Gives an overall measure of correctness but can be skewed by the majority class.
  2. Macro F1 Score: Ensures that the performance on minority classes is not overshadowed by the majority class, highlighting any issues in class imbalance handling.

In summary, by using stratified cross-validation and evaluating with both accuracy and macro F1 score, we ensure a robust and fair assessment of our model’s performance on imbalanced data, capturing both overall correctness and balanced class-specific performance.
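Both metrics are one-liners with scikit-learn; the sketch below assumes the y_te and y_pred pairs produced inside the cross-validation loop from Section 5:

from sklearn.metrics import accuracy_score, f1_score

accuracy = accuracy_score(y_te, y_pred)
macro_f1 = f1_score(y_te, y_pred, average='macro')  # unweighted mean of the per-class F1 scores
print(f'Accuracy: {accuracy:.4f} | Macro F1: {macro_f1:.4f}')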

7- Submission Process

7–1- Submission Strategy: Navigating the Kaggle Competition

After finalizing our models, we prepared for submission by fitting the complete pipeline on the entire training set and predicting on the test set. Consistent preprocessing steps were applied to the test set to maintain uniformity.

We formatted our predictions according to the sample_submission.csv provided by Kaggle and submitted the submission.csv file. This process allowed us to see our model’s public score on the leaderboard, based on 20% of the test data, offering an immediate reflection of our model’s competitive performance.

By following these steps, we achieved a competitive score in the Kaggle competition, validating our model development and evaluation strategies.
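A sketch of that final step, with final_pipeline and target_encoder as placeholder names for the fitted ensemble and the label encoder used on the target:

import pandas as pd

final_pipeline.fit(X_train, y_train)  # refit on the full training set
test_preds = final_pipeline.predict(test_data)

submission = pd.read_csv('sample_submission.csv')
submission['Target'] = target_encoder.inverse_transform(test_preds)  # back to Dropout / Enrolled / Graduate
submission.to_csv('submission.csv', index=False)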

Kaggle LeaderBoard
Our public scores

8- Key Lessons Learned

  1. Data Preprocessing is Crucial: The importance of thorough data cleaning and preprocessing cannot be overstated. Addressing inconsistencies was an essential step that significantly improved our model’s performance.
  2. Feature Engineering Enhances Predictive Power: Creating new features from existing ones and using clustering techniques like K-Modes helped us capture hidden patterns in the data, ultimately boosting our model’s accuracy.
  3. Balancing Imbalanced Data: Effectively managing imbalanced datasets using methods like class weights was critical in ensuring our models performed well across all classes, not just the majority class.
  4. Model Selection and Tuning Matter: Selecting the right models and fine-tuning their hyperparameters with tools like Optuna played a key role in achieving optimal results. Combining multiple models through techniques like soft voting further enhanced our predictions.

9- Conclusion: Learning Through Data Challenges

Engaging in Kaggle competitions is a rewarding way to transition theoretical knowledge into practical skills. The journey from classroom learning to applying machine learning techniques in real-world scenarios is filled with valuable lessons. In this blog, we navigated through the complexities of predicting academic success, handling multiclass imbalances, and the significance of feature engineering.

The key takeaway is that every step, from EDA to model evaluation, plays a vital role in building an effective machine learning model. The iterative process of experimenting, failing, and improving is what leads to mastery. By participating in these challenges, you not only sharpen your technical skills but also develop a problem-solving mindset that is crucial in the ever-evolving field of data science.

Special thanks to our supervisor, Reza Shokrzad, and my dedicated teammates. Their guidance and collaboration were instrumental in navigating this complex challenge and achieving our goals.

As we continue our Kaggle journey, remember that each competition is a learning opportunity. Embrace the process, stay curious, and keep pushing the boundaries of what you can achieve with data. The future is bright for those who persist and innovate. Stay tuned for the next chapter in our blog series, where we will continue to explore the intricacies of data and uncover creative solutions together. Happy Kaggling and data diving!


Your Turn!

Hope you enjoyed this read. We eagerly await your experiences, results, or alternative approaches in the comments below! Join the conversation and enjoy the collective learning journey with us!

