Kaggle Tabular Playground Series: March Edition (Part 4)
Welcome to the March chapter of the Kaggle Tabular Playground Series.
This is the final article of a 4-part series in which I cover the Kaggle Playground Series datasets: describing the data, processing it, deriving insights, and making predictions with various Python libraries.
In the first article of this series, we took a first look at the dataset and introduced some new features that helped with the later analysis. We then performed additional processing on the dataset to prepare it for the machine learning model training and evaluation that we will be looking at in this article. You can find the notebook here.
Baseline Models
Using some custom code, I took a list of 13 regression algorithms and fit each of them on the training data. The algorithms are as follows:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, BaggingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.naive_bayes import GaussianNB
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

clfs = [LinearRegression(),
        Ridge(random_state=42),
        Lasso(random_state=42),
        ElasticNet(alpha=0.5, random_state=42),
        DecisionTreeRegressor(random_state=42),
        RandomForestRegressor(random_state=42),
        KNeighborsRegressor(),
        GaussianNB(),
        GradientBoostingRegressor(random_state=42),
        CatBoostRegressor(random_state=42),
        XGBRegressor(random_state=42),
        BaggingRegressor(random_state=42),
        LGBMRegressor(random_state=42)]
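The fitting and scoring loop itself is not shown above; a minimal sketch of what it could look like, assuming the X_train/y_train and X_val/y_val splits prepared in the previous article, is:
from sklearn.metrics import mean_absolute_error

# Fit every baseline model on the training set and score it on the validation set
results = {}
for clf in clfs:
    clf.fit(X_train, y_train)
    preds = clf.predict(X_val)
    results[clf.__class__.__name__] = mean_absolute_error(y_val, preds)

# Rank the models from best (lowest MAE) to worst
for name, mae in sorted(results.items(), key=lambda kv: kv[1]):
    print(f'{name}: {mae:.4f}')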
The results after evaluation on the validation set look like this:
- You can observe that tree-based models perform better and ensemble algorithms perform the best, especially the CatBoostRegressor and BaggingRegressor.
- Let’s take CatBoostRegressor and perform some hyperparameter tuning on it.
Hyperparameter Tuning
- To tune our CatBoostRegressor, we use the following parameter grid:
param_grid_cat = {'iterations': [100, 150, 200],
                  'learning_rate': [0.03, 0.1],
                  'depth': [2, 4, 6, 8],
                  'l2_leaf_reg': [0.2, 0.5, 1, 3]}
- We will perform a GridSearchCV (you can go for RandomizedSearchCV as well). With 3 × 2 × 4 × 4 = 96 parameter combinations and 5-fold cross-validation, this means 480 model fits. The code looks like this:
from sklearn.model_selection import GridSearchCV

cbr = CatBoostRegressor(random_state=42)
cbrcv = GridSearchCV(estimator=cbr, param_grid=param_grid_cat, scoring='neg_mean_absolute_error', cv=5)
cbrcv.fit(X_train, y_train)
- From this tuning, the best set of hyperparameters we obtain is:
[Hyperparameters]: {'depth': 8, 'iterations': 200, 'l2_leaf_reg': 0.5, 'learning_rate': 0.1}
- The best cross-validation score (negative MAE, since we used the neg_mean_absolute_error scoring) is:
Best Score: -6.6831468984375135
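For reference, both of these values come straight from the fitted search object; a small sketch (the print labels are my own):
print('[Hyperparameters]:', cbrcv.best_params_)
print('Best Score:', cbrcv.best_score_)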
- Let’s take the best model from this tuning, train it, and evaluate it on the validation set:
from sklearn.metrics import mean_absolute_error

clf = cbrcv.best_estimator_
clf.fit(X_train, y_train)
y_pred_train = clf.predict(X_train)
y_pred_val = clf.predict(X_val)
print('[MAE Train]:', mean_absolute_error(y_train, y_pred_train))
print('[MAE Validation]:', mean_absolute_error(y_val, y_pred_val))
- The output:
[MAE Train]: 6.662079098373852
[MAE Validation]: 7.036600206150604
- A plot of predicted versus actual values on the validation set:
- We can see a decent to good fit according to this plot.
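The plot itself is an image in the original post; a minimal matplotlib sketch that produces a comparable predicted-versus-actual scatter (the exact styling is my own assumption) would be:
import matplotlib.pyplot as plt

# Scatter of actual vs. predicted congestion on the validation set,
# plus a reference line where prediction equals actual
plt.figure(figsize=(8, 6))
plt.scatter(y_val, y_pred_val, s=5, alpha=0.3)
plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], color='red')
plt.xlabel('Actual congestion')
plt.ylabel('Predicted congestion')
plt.title('Predicted vs. actual (validation set)')
plt.show()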
- Let’s now make predictions on the test set:
y_pred = clf.predict(X_test)
my_submission = pd.DataFrame({'row_id': data_test['row_id'], 'congestion': y_pred.ravel()})
my_submission.to_csv('submission.csv', index=False)
- Let’s upload these predictions to the competition page and check our results:
- An MAE score of 5.203 and a leaderboard rank of 588! Not bad for a lightly tuned model.
- You can always experiment further with different models and different combinations of features.
- New features can also be engineered, which may allow for better training; one illustrative possibility is sketched below.
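Purely as an illustration of the kind of feature that could be engineered (this is not a feature from the original series; it assumes the raw data still has the competition's datetime 'time' column, and df stands for whichever DataFrame you are adding features to), a cyclical encoding of the hour of day looks like this:
import numpy as np
import pandas as pd

# Hypothetical example: put the hour of day on a circle so that 23:00 and 00:00
# end up close together; assumes a 'time' column holding timestamps
df['hour'] = pd.to_datetime(df['time']).dt.hour
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)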
Conclusion
- If you have made it this far, congratulations! The series is over.
- Thank you very much for reading through this attempt at the competition.
- We described the dataset, performed EDA, prepared the data for machine learning, and finally built a model that gave us a decent score in the competition.
- I hope you learned something from this series. The work will keep improving, and I will be posting about the April competition as well, so keep an eye out for my articles!
Final Thoughts and Closing Comments
There are some vital points many people fail to understand while pursuing their Data Science or AI journey. If you are one of them and are looking for a way to address these gaps, check out the certification programs provided by INSAID on their website. If you liked this story, I recommend the Global Certificate in Data Science, as it covers your foundations plus machine learning algorithms (basic to advanced).