Kaggle Tabular Playground Series: March Edition (Part 3)
Welcome to the March chapter of the Kaggle Tabular Playground Series.
This is the third article of a 4-part series in which I cover the Kaggle Playground Series datasets: describing the data, processing it, deriving insights, and making predictions using various Python libraries.
In the first article of this series, we took a first look at the dataset and introduced some new features that support the analysis in this article. This article covers feature importance and the basic post-processing required to prepare our data for model building. The notebook can be found here.
- This particular dataset doesn’t require a lot of feature engineering (unless you want to explore further with combinations of the x, y, and direction features; a small illustrative sketch follows this list).
- We will simply run a feature importance analysis using a random forest regressor and the yellowbrick package, and determine which features can be used for model building.
- So let’s get started!
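As an aside, if you do want to experiment with such combinations, one simple option is to concatenate x, y, and direction into a single identifier for each roadway. This is purely an illustrative sketch; the roadway feature name is my own choice and is not used in the rest of this series:
# Optional sketch: build one categorical "roadway" feature from x, y, and direction
data_train['roadway'] = (
    data_train['x'].astype(str) + '_'
    + data_train['y'].astype(str) + '_'
    + data_train['direction']
)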
Feature Extraction
- We first extract X (the independent features) and y (the target feature) from the train dataset:
X = data_train.drop(columns=['row_id', 'time', 'congestion'])
y = data_train['congestion']
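For reference, the code in this article assumes the following imports (shown here for completeness; the exact yellowbrick import path may differ slightly between versions):
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from yellowbrick.model_selection import FeatureImportances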
- We can see that some features, like direction and period, are string-valued and cannot be fed directly into the model.
- We will encode each of them with its own LabelEncoder() (separate encoders, so the direction encoder can be reused on the test set later):
direction_encoder = LabelEncoder()
period_encoder = LabelEncoder()
X['direction'] = direction_encoder.fit_transform(X['direction'])
X['period'] = period_encoder.fit_transform(X['period'])
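To sanity-check the encodings, each fitted encoder exposes its original string labels in the order they were mapped to integers (0, 1, 2, ...):
# classes_ lists the original labels; their positions are the encoded integers
print(direction_encoder.classes_)
print(period_encoder.classes_)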
- Now, let’s proceed with the FeatureImportances() class of Yellowbrick using a RandomForestRegressor():
model = RandomForestRegressor(random_state=42)
viz = FeatureImportances(model)
viz.fit(X, y)
viz.show(figsize=(15, 7))
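Beyond the plot, the fitted visualizer also stores the feature names and their (relative) importances, so you can inspect the ranking programmatically. This is a small sketch based on the yellowbrick attributes features_ and feature_importances_:
# features_ and feature_importances_ are populated by fit(); pair them up and rank
importances = pd.Series(viz.feature_importances_, index=viz.features_)
print(importances.sort_values(ascending=False).head(6))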
- We can see that our top 6 features are y, direction, x, hour, weekday, and month.
- We will proceed with these features for our model building, selecting them from the test set as well:
X_test = data_test.loc[:, ['x', 'y', 'direction', 'month', 'weekday', 'hour']]
Feature Encoding
- Using the direction encoder fitted on the training data, we will also transform the direction feature of our test set:
X_test['direction'] = direction_encoder.transform(X_test['direction'])
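Note that transform() raises a ValueError if the test set contains a direction label the encoder never saw during training. A quick check, run on the raw test data before the transform, is shown below as an illustrative guard:
# Run before the transform above: verify every test direction was seen in training
unseen = set(data_test['direction']) - set(direction_encoder.classes_)
assert not unseen, f"Unseen direction labels: {unseen}"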
Feature Scaling
- Most of our features are categorical in nature, so we don’t need to perform any feature scaling on this dataset.
Train-Test Split
- A simple train-test split is performed to obtain a validation set out of our big training set:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
print('Training Data Shape:', X_train.shape, y_train.shape)
print('Validation Data Shape:', X_val.shape, y_val.shape)
Output:
Training Data Shape: (679068, 12) (679068,)
Validation Data Shape: (169767, 12) (169767,)
- And that’s it! Our data is ready for model training and evaluation.
Conclusion
- In the next article, we will build various models and check their relevant scores and errors.
- We will also perform hyperparameter tuning for some of the models and compare their final scores on the validation set and in the final Kaggle submissions.
Final Thoughts and Closing Comments
There are some vital points many people overlook while pursuing their Data Science or AI journey. If you are one of them and are looking for a way to fill these gaps, check out the certification programs provided by INSAID on their website. If you liked this story, I recommend going with the Global Certificate in Data Science, as it covers your foundations plus machine learning algorithms (basic to advanced).