From Classroom to Kaggle Competitions: Our Abalone Age Prediction Adventure

SamiraAlipour
15 min read · Jun 26, 2024


A Journey into Machine Learning with the Kaggle Abalone Dataset

Welcome to the first post in our series on team-based experiences in Kaggle competitions! In this blog, we dive into our recent adventure with the ‘Regression with an Abalone Dataset’ competition on Kaggle. This series aims to provide valuable insights and practical tips for students, data scientists, and machine learning engineers alike, helping you sharpen your regression modeling skills. Join us as we explore the challenges and joys of working with real-world data, beginning with this exciting project.

Regression with an Abalone Dataset competition on Kaggle

You know how sometimes you think you’re signing up for a simple class project, and suddenly you’re knee-deep in sea snail data? Yeah, that was me a few weeks ago. Our supervisor, Reza Shokrzad, decided to throw us into the deep end with this Kaggle challenge. At first, I was like, “Predicting the age of abalone? How hard could it be?”

As it turns out, pretty challenging! But also incredibly rewarding. Our supervisor split us into small, three-person teams for this mini-hackathon-style competition. Working together, brainstorming ideas, and learning from each other made this experience both fun and insightful.

Our weekly brainstorming sessions taught me more about real-world data science than any textbook ever could. Participating in this Kaggle competition was more than a technical exercise; it was a journey of discovery and collaboration. We navigated the complexities of the Abalone dataset, transforming raw data into actionable insights. This experience demonstrated that real learning happens through practice and teamwork, bridging the gap between theoretical knowledge and real-world applications.

Now, let me take you through our journey of mastering regression with the Abalone dataset. We’ll explore everything from initial data analysis to advanced modeling techniques, and I’ll share the insights we gained along the way.

1. Competition and the Abalone Dataset Description

1.1. About the Competition and Problem Statement

In the “Regression with an Abalone Dataset” competition, our goal is to predict the age of abalone from physical characteristics. This competition simplifies the complex lab process traditionally used to determine abalone age, making it a perfect challenge for sharpening our machine learning techniques. We’ll be judged on Root Mean Squared Logarithmic Error (RMSLE), which measures how well our model predicts values on a logarithmic scale, emphasizing relative accuracy.

Traditionally, determining an abalone’s age involves cutting the shell, staining it, and counting rings under a microscope, a tedious process. This competition aims to use easily obtainable measurements to predict age, making the process faster and less invasive.

1.2. Understanding the Dataset

For this competition, Kaggle created a modified version of the UCI Abalone Dataset by training a deep learning model on the original data. This model generated a new dataset with slightly different feature distributions to provide a fresh challenge. The key attributes include ID, Sex (M/F/I), Length, Diameter, Height, Whole Weight, Shucked Weight, Viscera Weight, Shell Weight, and Rings (+1.5 = age in years). The target variable is the number of rings in the shell, which directly correlates to the abalone’s age.

Our experience showed that combining this Kaggle dataset with the original UCI dataset improved our model’s predictions and resulted in higher scores. To merge these two datasets, consider renaming some features to maintain consistency and avoid conflicts.
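
For readers who want to try the same trick, here is a minimal sketch of what such a merge might look like. The file paths and the UCI column list are assumptions (the classic abalone.data file ships without a header), and the renaming maps the UCI weight columns onto the ‘Whole weight.1’ / ‘Whole weight.2’ naming used later in this post; adjust the names to whatever your copy of the competition file uses.

import pandas as pd

# Load the Kaggle competition data and the original UCI abalone data
# (file paths and the UCI column list are illustrative; adjust to your files)
train = pd.read_csv('train.csv', index_col='id')
uci = pd.read_csv('abalone.data', header=None,
                  names=['Sex', 'Length', 'Diameter', 'Height', 'Whole weight',
                         'Shucked weight', 'Viscera weight', 'Shell weight', 'Rings'])

# Rename the UCI weight columns to match the competition's naming before concatenating
uci = uci.rename(columns={'Shucked weight': 'Whole weight.1',
                          'Viscera weight': 'Whole weight.2'})

combined_train = pd.concat([train, uci], ignore_index=True)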

2. Exploratory Data Analysis (EDA)

During our Exploratory Data Analysis (EDA) of the Abalone dataset, we utilized distplots, boxplots, and heatmaps to explore the data’s structure and relationships. We found that “Rings,” our target variable, exhibited skewness, pointing to potential outliers. Additionally, features such as “Length” and “Whole Weight” displayed strong correlations with “Rings.” The skewed and imbalanced distribution of “Rings” highlighted the need for careful handling in preprocessing. These findings were essential in shaping our approach to subsequent data preparation and model development.

Distribution of numerical Features
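
If you want to reproduce this kind of overview, here is a minimal sketch of the plots we relied on, assuming the training data (numerical features plus the target) sits in a DataFrame called train_df.

import matplotlib.pyplot as plt
import seaborn as sns

numerical_cols = train_df.select_dtypes(include='number').columns

# Distribution of each numerical feature (histograms with KDE curves)
fig, axes = plt.subplots(3, 3, figsize=(14, 10))
for ax, col in zip(axes.ravel(), numerical_cols):
    sns.histplot(train_df[col], kde=True, ax=ax)
plt.tight_layout()
plt.show()

# Boxplots to spot outliers, and a heatmap of pairwise correlations
sns.boxplot(data=train_df[numerical_cols], orient='h')
plt.show()

sns.heatmap(train_df[numerical_cols].corr(), annot=True, cmap='coolwarm')
plt.show()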

3. Data Preprocessing

3.1. Handling Skewed Target Variable

3.1.1. Understanding and Addressing Skewness

In machine learning, the distribution of the target variable is crucial for model performance, particularly in competitions where metrics like RMSLE (Root Mean Squared Logarithmic Error) are used. RMSLE penalizes under-predictions more than over-predictions and is sensitive to relative differences, making it ideal for datasets with skewed targets. The Abalone dataset’s target variable, “Rings,” exhibited significant skewness, indicating outliers and a non-normal distribution, which could bias predictions and distort the model’s understanding of feature relationships.

3.1.2. Applying np.log1p Transformation

To address the skewness in the “Rings” target variable, we applied the `np.log1p` transformation (logarithm of 1 plus the value), which normalized the distribution and gracefully handled zero values. This transformation reduced the impact of outliers, resulting in a more balanced and symmetric distribution. It aligned the target variable with the assumptions of our regression models, thereby enhancing our model’s performance. Specifically, it improved the accuracy and reliability of predictions and significantly boosted our competition score (RMSLE metric), highlighting the importance of handling skewed data effectively in machine learning.
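
In code, the transformation and its inverse are one-liners. This is a minimal sketch, assuming the target vector is y_train and model stands in for any regressor.

import numpy as np

# Train on the log-transformed target; log1p also handles zeros gracefully
y_train_log = np.log1p(y_train)
model.fit(X_train, y_train_log)

# Map predictions back to the original "Rings" scale with the inverse transform
rings_pred = np.expm1(model.predict(X_valid))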

3.2. Handling Missing Values

3.2.1. Identifying and Imputing Missing Values

Our next step involved handling missing values, specifically zeros in the “Height” column, which were unrealistic based on domain knowledge. Zero height values were treated as missing data that needed imputation.

3.2.2. Imputation Strategy

Although we could have imputed these missing values with the mean or median of the column, we chose a data-driven approach instead. First, we identified “Diameter” as the feature most correlated with “Height” (correlation value: 0.92). Using this relationship, we imputed missing “Height” values by calculating the median “Height” within each “Diameter” group. This method ensured that the imputed values reflected realistic and contextually accurate estimates. Here’s a code snippet of our approach:

# Group by 'Diameter' and calculate the median 'Height' within each group
median_height_by_diameter = X_train.groupby('Diameter')['Height'].median()

# Create a function to impute zero values with the median 'Height' for the row's 'Diameter'
def impute_height(row):
    if row['Height'] == 0:
        diameter_median_height = median_height_by_diameter.get(row['Diameter'])
        if diameter_median_height is not None:
            return diameter_median_height
    return row['Height']

# Apply the function to impute missing 'Height' values
X_train['Height'] = X_train.apply(impute_height, axis=1)

3.2.3. Lesson Learned

The key takeaway from this step was the importance of leveraging correlations within the data to inform imputation strategies. This not only preserves the integrity of the dataset but also maintains the statistical relationships crucial for accurate modeling.

Additionally, we learned that missing values are not always explicitly marked as null or NaN. In some cases, values that are not acceptable for certain features, such as zeros for “Height” in this dataset, should be considered missing. Recognizing these contextually inappropriate values as missing is crucial for effective data preprocessing.

3.3. Encoding Categorical Features

3.3.1. Transforming Categorical Data

Handling categorical variables is essential for regression models, which require numerical input. In the Abalone dataset, the “Sex” column was categorical with values ‘M’, ‘F’, and ‘I’ (infant). We used one-hot encoding via pd.get_dummies to transform this categorical feature into numerical format. This technique created separate binary columns for each category, enabling the model to interpret the categorical data effectively.
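
As a rough sketch (assuming the training and validation features live in X_train and X_valid), the encoding looks like this; the reindex step is a general precaution to keep the validation columns aligned with the training columns.

import pandas as pd

# One-hot encode the 'Sex' column (M / F / I) into binary indicator columns
X_train = pd.get_dummies(X_train, columns=['Sex'], prefix='Sex')
X_valid = pd.get_dummies(X_valid, columns=['Sex'], prefix='Sex')

# Keep validation columns aligned with the training columns
X_valid = X_valid.reindex(columns=X_train.columns, fill_value=0)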

3.3.2. Practical Insight

Using one-hot encoding for categorical features provided a clear representation of these categories without assuming any ordinal relationship between them. This approach ensured that our model could leverage all the categorical information without introducing unnecessary biases.

3.4. Outlier Detection and Handling

3.4.1. Identifying Outliers

Outliers can distort model performance by introducing extreme values. We identified outliers in numerical features using the Interquartile Range (IQR) method. Outliers were detected if values were below the lower bound or above the upper bound, calculated as follows:

X_numerical_features = X_train.select_dtypes(include=[np.number])

# Define a function to find outliers based on IQR
def find_outliers(df):
    outliers = {}
    for col in df.columns:
        v = df[col]
        q1 = v.quantile(0.25)
        q3 = v.quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr
        outliers_count = ((v < lower_bound) | (v > upper_bound)).sum()
        perc = outliers_count * 100.0 / len(df)
        outliers[col] = (perc, outliers_count)
        print(f"Column {col} outliers = {perc:.2f}% ({outliers_count} out of {len(df)})")
    return outliers

# Find outliers in the DataFrame
outliers = find_outliers(X_numerical_features)

3.4.2. Handling Outliers

To handle the identified outliers, we replaced values outside the IQR bounds with the lower and upper boundaries using a custom class. This method retained the integrity of the data while minimizing the impact of extreme values:

from sklearn.base import BaseEstimator, TransformerMixin

class OutlierBoundaryImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        Q1 = np.percentile(X, 25, axis=0)
        Q3 = np.percentile(X, 75, axis=0)
        IQR = Q3 - Q1
        self.lower_bound = Q1 - 1.5 * IQR
        self.upper_bound = Q3 + 1.5 * IQR
        return self

    def transform(self, X):
        # Clip values outside the IQR bounds to the nearest boundary
        X_outlier_imputed = np.where(X < self.lower_bound, self.lower_bound, X)
        X_outlier_imputed = np.where(X_outlier_imputed > self.upper_bound, self.upper_bound, X_outlier_imputed)
        return X_outlier_imputed

This approach ensured a more consistent dataset, which improved the model’s stability and performance.

3.5. Scaling Numeric Columns

In our experiments, StandardScaler consistently outperformed MinMaxScaler when preparing the numeric features for our models.

3.6. Integrating Preprocessing into a Pipeline

3.6.1. Why Use a Pipeline?

A machine learning pipeline streamlines the preprocessing and modeling workflow, ensuring consistent application of transformations across training and validation sets. It automates the sequential steps, from handling outliers and scaling features to encoding categorical variables and fitting the model, thus minimizing errors and improving efficiency.

3.6.2. Implementation in Our Project

In our project, we integrated the preprocessing steps described above into a pipeline. This included outlier handling with the OutlierBoundaryImputer, feature scaling with “StandardScaler”, and the regression model itself. Importantly, we applied “fit_transform” on the training set and only “transform” on the validation and test sets.

pipeline_model = Pipeline(steps=[('outlier_imputer', OutlierBoundaryImputer()),
                                 ('scaler', StandardScaler()),
                                 ('model', CatBoostRegressor())])

To ensure robust model development and evaluation, we used 5-fold cross-validation. This approach was crucial because the training and validation sets change in each fold, making a pipeline the better choice for consistent application of “fit_transform” and “transform” during outlier handling and scaling. The pipeline ensured that these steps were applied correctly across all folds, enhancing the reliability of our model performance.
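
A minimal sketch of that setup, assuming pipeline_model from above and the log1p-transformed target y_train_log: because the target is already on the log scale, plain RMSE inside the cross-validation corresponds to the competition's RMSLE.

from sklearn.model_selection import KFold, cross_val_score

# 5-fold CV: the pipeline re-fits the outlier imputer and scaler on each training
# fold and only transforms the corresponding validation fold
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline_model, X_train, y_train_log,
                         scoring='neg_root_mean_squared_error', cv=cv)
print("RMSLE per fold:", -scores)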

4. Feature Engineering

4.1. Creating New Features and Transformations for Better Predictions

Feature engineering is a critical step in machine learning, aiming to create new features from existing ones to boost model performance. In our exploration with the Abalone dataset, we experimented with various new features derived from the original attributes to capture complex relationships and improve predictive power.

Initially, we hypothesized that features related to the size and composition of abalones, such as Volume, Shell Thickness, Density, and Shell Surface Area, might provide valuable insights into the age prediction. Volume was calculated from Length, Diameter, and Height, reflecting overall size. Shell Thickness was derived as the difference between Diameter and Height, hypothesizing that older abalones might have thicker shells. Density was computed using estimated mass from Whole Weight divided by Volume, and Shell Surface Area was approximated using the cylindrical surface area formula. We also explored several ratio features, including the ratios of Length to Shell Weight, Shucked Weight to Whole Weight, Length to Diameter, and Height to Whole Weight, among others. These ratios aimed to capture relationships between different physical aspects of abalones.

Despite these efforts, most of these features did not significantly enhance model performance. After rigorous feature importance analysis and model evaluation, we found that only one newly created feature consistently improved our model’s predictive performance: the Proportion of Shucked Weight to Whole Weight. This ratio provided a meaningful indication of the abalone’s composition, reflecting how the distribution of meat relative to the overall weight could correlate with age. Here’s the code for generating this feature:

# Feature Engineering

def Feature_Engineering(data):
    # Proportion of Shucked Weight to Whole Weight
    data['WholeW1_to_WholeW_Proportion'] = data['Whole weight.1'] / data['Whole weight']
    return data

X_train = Feature_Engineering(X_train)

4.2. Lesson Learned

The key lesson from this experience is that not all theoretically relevant features contribute to model improvement. It’s essential to empirically validate each feature’s impact through feature importance analysis and model testing. Our findings emphasize the value of simplicity and relevance over quantity in feature engineering. Creating a wide array of new features can be insightful, but practical effectiveness depends on their actual contribution to the model’s predictive power. Our most effective strategy was leveraging features that intuitively and statistically aligned with the target variable, such as the ratio reflecting the abalone’s composition. Thus, feature engineering should be both data-driven and guided by domain knowledge to ensure meaningful enhancements in model performance.

5. Model Selection and Training

Comparing Regression Models and Tuning Hyperparameters for Optimal Performance

Model selection was a critical phase in our approach to predicting abalone age. We evaluated a variety of regression models to identify the most effective ones for this task. Here’s a detailed account of our model selection process:

5.1. Model Testing and Pipeline Integration

We tested several regression models, each integrated into a preprocessing pipeline. This pipeline, previously described, included outlier handling, feature scaling, and the final model. Each model was evaluated using a 5-fold cross-validation to ensure robust performance assessment. Here’s a summary of the models tested and their roles:

  • Linear Regression: Used as a baseline model due to its simplicity and interpretability.
  • Lasso and Ridge Regression with Polynomial Features: Added polynomial features to capture non-linear relationships, with Lasso and Ridge providing regularization to manage overfitting.
  • RandomForest Regressor: Chosen for its ability to handle complex interactions between features without requiring scaling.
  • XGBoost: Known for its efficiency and performance on structured data through gradient boosting.
  • LightGBM (LGBM): Preferred for its speed and ability to handle large datasets efficiently. LGBM’s leaf-wise tree growth strategy allowed it to capture important patterns and nuances in the data effectively.
  • CatBoost: Particularly useful due to its ability to handle categorical variables directly and mitigate overfitting through ordered boosting techniques.
  • GradientBoosting Regressor: Utilized for its capability to build strong predictive models by combining multiple weak learners iteratively, effectively capturing non-linear relationships.

5.2. Hyperparameter Tuning

Hyperparameter tuning was a game-changer in our model optimization process. We used GridSearchCV to conduct a systematic search for the best hyperparameters within a 5-fold cross-validation framework. This strategy ensured that each model was assessed comprehensively across the entire dataset, avoiding the biases that can occur with a single train-test split. Our experience reinforced the importance of tuning hyperparameters to enhance model performance significantly, offering a finer control over the model’s behavior and results.
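
To give a concrete picture, here is a sketch of how such a search can be wired around the pipeline; the grid below targets the CatBoost step and its values are purely illustrative, not the grids we actually searched.

from sklearn.model_selection import GridSearchCV

# Illustrative grid for the 'model' step of the pipeline (CatBoost here)
param_grid = {
    'model__depth': [4, 6, 8],
    'model__learning_rate': [0.03, 0.1],
    'model__iterations': [500, 1000],
}

grid_search = GridSearchCV(pipeline_model, param_grid,
                           scoring='neg_root_mean_squared_error',
                           cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train_log)
print(grid_search.best_params_, -grid_search.best_score_)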

5.3. Ensemble Methods: Stacking and Voting Regressors

After evaluating individual models, “LightGBM”, “CatBoost”, and “GradientBoosting Regressor” emerged as the top performers. To further enhance performance, we used ensemble methods:

  • Stacking: This approach combined the predictions of LightGBM, CatBoost, and GradientBoosting Regressor using a meta-model. The meta-model learned how to best integrate the strengths of each base model, leading to improved overall predictive performance.
  • Voting Regressor: This method aggregated predictions by averaging the outputs from the three top models. By assigning different weights to each based on their individual performance, the Voting Regressor capitalized on their complementary strengths, delivering the best results in our tests.

For each ensemble model, we built a pipeline that included outlier handling, feature scaling, and the regression model itself. This pipeline structure was crucial in maintaining consistency throughout the modeling process, especially during cross-validation. Our experience highlighted that ensemble methods like stacking and voting could integrate the combined strengths of individual models, delivering superior predictions.
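
Here is a rough sketch of how these ensembles can be assembled; the voting weights and the Ridge meta-model are placeholders, and in practice we chose them based on cross-validation scores. Pipeline, StandardScaler, and OutlierBoundaryImputer come from the preprocessing section above.

from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor, VotingRegressor
from sklearn.linear_model import Ridge
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

base_models = [('lgbm', LGBMRegressor()),
               ('cat', CatBoostRegressor(verbose=0)),
               ('gbr', GradientBoostingRegressor())]

# Weighted voting over the three top models
voting = VotingRegressor(estimators=base_models, weights=[2, 2, 1])

# Stacking with a simple Ridge meta-model that learns how to combine the base predictions
stacking = StackingRegressor(estimators=base_models, final_estimator=Ridge(), cv=5)

# Each ensemble is wrapped in the same preprocessing pipeline as before
voting_pipeline = Pipeline(steps=[('outlier_imputer', OutlierBoundaryImputer()),
                                  ('scaler', StandardScaler()),
                                  ('model', voting)])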

6. Model Evaluation

Evaluating Model Performance: Metrics and Techniques for Regression

Evaluating model performance accurately is crucial for regression tasks, particularly when preparing for competition. Here’s how we approached it:

6.1. Evaluation Metrics

We used a range of metrics to assess our model’s effectiveness:

  • RMSLE (Root Mean Squared Logarithmic Error): The primary competition metric, RMSLE, was ideal because it penalizes underpredictions and aligns well with our log-transformed target. This metric focuses on the relative error, which helps in scenarios where the target variable spans several orders of magnitude.
  • R² Score: This metric helped us understand how well our model explained the variance in the data. It provided a straightforward indication of the proportion of variance in the target that was predictable from the features.
  • RMSE (Root Mean Squared Error): RMSE provided a direct measure of prediction error in the same units as the target variable. It highlighted the absolute errors between predicted and actual values, giving us a clear sense of the average magnitude of prediction errors.
  • MSLE (Mean Squared Logarithmic Error): MSLE offered another logarithmic perspective on prediction error, complementing RMSLE by considering the log-transformed differences between predicted and actual values.

RMSLE evaluation metric

Since we transformed our target variable using log1p, we applied `np.expm1` to both the true and predicted values during the evaluation phase to transform back to the original scale. This adjustment was essential for interpreting the metrics accurately and ensuring they reflected the original data distribution.
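
A small helper along these lines keeps that bookkeeping in one place; this is a sketch, assuming both arguments are on the log1p scale.

import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error, r2_score

def evaluate_on_original_scale(y_true_log, y_pred_log):
    # Undo the log1p transform before computing metrics on the original "Rings" scale
    y_true = np.expm1(y_true_log)
    y_pred = np.expm1(y_pred_log).clip(min=0)  # guard against tiny negative predictions
    msle = mean_squared_log_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    print(f"RMSLE: {np.sqrt(msle):.4f}  MSLE: {msle:.4f}  RMSE: {rmse:.4f}  R2: {r2:.4f}")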

6.2. Lessons Learned

Metric Selection: Using multiple evaluation metrics provided a rounded view of model performance. While RMSLE was crucial for the competition, RMSE and MSLE gave additional insights into error characteristics, and R² helped in understanding the model’s explanatory power.

Data Transformation: Transforming the target variable with log1p and reversing this transformation during evaluation was essential for meaningful metric interpretation. This practice was particularly important for RMSLE and MSLE, as they are sensitive to the scale of the target variable.

Cross-Validation: The 5-fold cross-validation was key in obtaining a robust performance estimate. It mitigated overfitting and offered a reliable indication of how well the model would perform on unseen data, reinforcing its suitability for real-world applications.

7. Submission Process

7.1. Submission Strategy: Navigating the Kaggle Competition

From Prediction to Public Leaderboard

After finalizing our models, we prepared for submission by fitting the complete pipeline on the entire training set and predicting on the test set. The preprocessing steps were consistently applied to the test set to maintain uniformity.

We formatted our predictions according to the `sample_submission.csv` provided by Kaggle and submitted the `submission.csv` file. This process allowed us to see our model’s public score on the leaderboard, based on 20% of the test data, offering an immediate reflection of our model’s competitive performance.
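
For reference, the final step looked roughly like this; final_pipeline stands in for whichever pipeline performed best, and the target column name follows the competition's sample file.

import numpy as np
import pandas as pd

# Fit the final pipeline on the full training data and predict on the test set
final_pipeline.fit(X_train, y_train_log)
test_pred = np.expm1(final_pipeline.predict(X_test))

# Fill the sample submission format and write submission.csv
submission = pd.read_csv('sample_submission.csv')
submission['Rings'] = test_pred
submission.to_csv('submission.csv', index=False)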

By following these steps, we achieved a competitive score in the Kaggle competition, validating our model development and evaluation strategies.

Kaggle Leaderboard

Our community, which you can see in the picture above, shared an enriching experience. This initial gathering on Kaggle was a milestone, showing our collaborative efforts and individual scores. Our supervisor, with his remarkable cleverness, patience, and kindness, guided us through this journey, encouraging us to tackle even more challenging competitions in the future, such as NLP tasks and computer vision projects.

8. Key Lessons Learned

  1. Data-driven approaches are crucial: From imputing missing values to feature engineering, letting the data guide our decisions was key.
  2. Not all engineered features are useful: Despite creating numerous features, only a few significantly improved our model.
  3. Ensemble methods can provide a significant boost: Combining models often outperformed individual algorithms.
  4. Cross-validation is essential: It provided a more robust estimate of our model’s performance on unseen data.
  5. The power of collaboration: Weekly brainstorming sessions and teamwork were invaluable in tackling this challenge.

9. Conclusion: Embracing the Learning Process

Our journey through the abalone dataset shows how powerful data science can be. Competitions like this mix theory with real practice, turning what we learn from books into hands-on skills. Working on this project allowed us to apply complex machine learning concepts to real-world problems, improving both our problem-solving skills and teamwork.

During the challenge, we tried different algorithms, carefully examined the results, and improved our models with ongoing feedback. This practical experience taught us that real learning goes beyond the classroom, growing as we tackle real data challenges and develop creative solutions.

We learned that data is often complex, which helped us use various tools and approaches better, getting us ready for bigger tasks in data science. I hope our experience encourages you, whether you’re a student, a data science fan, or just curious about machine learning, to take on similar challenges and enjoy the learning journey.

Remember, in the world of machine learning, every dataset tells a story. Our task is to listen, uncover these stories, and translate them into valuable insights. Success in this journey requires continuous learning, collaboration, and staying committed. So, roll up your sleeves, dive into the data, and let the adventure of learning begin!

Acknowledgments

I am deeply grateful to our supervisor, Reza Shokrzad, for his invaluable wisdom, patience, and unwavering support. His guidance was crucial, especially when we hit a challenging phase in model tuning, consistently encouraging us to explore new knowledge and skills. I also want to thank my teammates for their active involvement and contributions. Their collaboration, particularly during the late-night coding sessions, made our journey both insightful and enjoyable, enhancing our collective experience.

In future posts, I’ll be sharing more stories from these advanced competitions. Each competition is a unique learning adventure. I’m excited to dive into and discuss more challenging problems with you, exploring the details of data and the creative solutions they inspire.

Additional Resources

Your Turn!

Hope you enjoyed this read. We eagerly await your experiences, results, or alternative approaches in the comments below! Join the conversation and enjoy the collective learning journey with us!
