Kaggle Tabular Playground Series: March Edition (Part 1)
Welcome to the March chapter of the Kaggle Tabular Playground Series.
This is the first article in a four-part series in which I will cover the Kaggle Playground Series datasets: describing the data, pre-processing it, deriving insights, and making predictions using various Python libraries.
This article covers the dataset description and the information available to users, along with the appropriate pre-processing methods that can be applied before generating visualizations, insights, and predictions. You can find the corresponding notebook here.
Prerequisites
- Before we begin, I would like to mention that I will be accessing the dataset directly using my Kaggle API token (a short sketch of this step follows this list). You can refer to this article on how to do the same.
- Once you are set up for the series, let’s begin.
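For reference, here is a minimal sketch of that download step, assuming the API token (kaggle.json) is already in ~/.kaggle/ and that the competition slug is tabular-playground-series-mar-2022:

```python
import zipfile

import kaggle

# Authenticate against the Kaggle API using the token in ~/.kaggle/kaggle.json.
kaggle.api.authenticate()

# Download the competition files into a local "data" directory.
# The slug below is an assumption based on the March 2022 edition's URL.
kaggle.api.competition_download_files(
    "tabular-playground-series-mar-2022", path="data", quiet=False
)

# Kaggle ships the files as a single zip archive; extract it in place.
with zipfile.ZipFile("data/tabular-playground-series-mar-2022.zip") as zf:
    zf.extractall("data")
```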
Introduction to March Chapter of Playground Series
This month, we are challenged to forecast twelve hours of traffic flow in a U.S. metropolis. The time series in this dataset are labeled with both location coordinates and a direction of travel, a combination of features that will test our skill at spatio-temporal forecasting within a highly dynamic traffic network.
You can find the competition here.
Dataset Description
- Initially, we can see that the training dataset has 848,835 rows and 6 features: ‘row_id’, ‘time’, ‘x’, ‘y’, ‘direction’, and our target feature, ‘congestion’.
- The test dataset has 2,340 rows with the 5 remaining features; it does not contain the ‘congestion’ feature.
- We will predict that feature with the help of the rest of the features.
- On checking the dataset description, we have 4 numeric features — ‘row_id’, ‘x’, ‘y’, and ‘congestion’.
- Further, inspecting the dataset information, we see that both datasets have 3 int64 features (row_id, x, y) and 2 object features (time, direction).
- The target feature, ‘congestion’, is also of the int64 data type. The snippet below reproduces these checks.
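These checks can be reproduced with a few lines of pandas. The file names train.csv and test.csv are assumed from the standard competition layout:

```python
import pandas as pd

# Load the competition files extracted earlier into the "data" directory.
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

print(train.shape)  # expected: (848835, 6)
print(test.shape)   # expected: (2340, 5)

# Summary statistics for the numeric features: row_id, x, y, congestion.
print(train.describe())

# Dtypes: row_id, x, y, congestion are int64; time and direction are object.
train.info()
```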
Profiling Report before pre-processing
- A pandas profiling report gives a great overview of the dataset, including univariate plots for the features, correlations, missing values, duplicated rows, cardinality, and much more.
- We won’t look at the univariate analysis just yet, but we can note some important observations from the report (a snippet to generate it follows this list):
- There are no missing values in either dataset.
- There are no duplicated rows in either dataset.
- According to the reports, there are no obvious (Pearson) correlations between the features.
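For completeness, a report like this can be generated as shown below. Note that the package is now distributed as ydata-profiling; older releases import it as pandas_profiling:

```python
from ydata_profiling import ProfileReport

# Build the report; minimal=True skips the most expensive computations,
# which helps on a dataset with ~850k rows.
profile = ProfileReport(train, title="Train Profiling Report", minimal=True)

# Write the interactive report to an HTML file for inspection in a browser.
profile.to_file("train_profile.html")
```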
Data Pre-processing
- We will first convert the ‘time’ column into the appropriate datetime format.
- Then, we will extract some important features such as month, weekday, period of the day, is_Monday, is_Friday, is_weekend, is_month_start, and is_month_end, and later check their significance w.r.t. congestion.
- We will map the hour of the day to a period of the day, dividing the day into 6 parts (the mapping is sketched in the code after this list).
- After successfully pre-processing our data, calling head() on the datasets will show the new columns.
- Now that we are done with the basic dataset description and pre-processing, we can proceed with the univariate and multivariate analysis of our data.
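Here is one possible sketch of these pre-processing steps. The six 4-hour bins for the period of the day are an illustrative assumption; the exact cut-offs in the notebook may differ:

```python
# Parse the raw 'time' strings into proper pandas datetimes.
train["time"] = pd.to_datetime(train["time"])

# Calendar-based features.
train["month"] = train["time"].dt.month
train["weekday"] = train["time"].dt.weekday            # Monday = 0
train["is_Monday"] = (train["weekday"] == 0).astype(int)
train["is_Friday"] = (train["weekday"] == 4).astype(int)
train["is_weekend"] = (train["weekday"] >= 5).astype(int)
train["is_month_start"] = train["time"].dt.is_month_start.astype(int)
train["is_month_end"] = train["time"].dt.is_month_end.astype(int)

# Map the hour of the day to one of six periods (assumed 4-hour bins).
def period_of_day(hour: int) -> str:
    periods = ["late_night", "early_morning", "morning",
               "afternoon", "evening", "night"]
    return periods[hour // 4]

train["period"] = train["time"].dt.hour.apply(period_of_day)

# The same transformations should be applied to the test set as well.
print(train.head())
```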
Conclusion
- In the next article, we will extract important insights from the dataset.
Final Thoughts and Closing Comments
There are some vital points many people fail to understand while pursuing their Data Science or AI journey. If you are one of them and are looking for a way to counterbalance these gaps, check out the certification programs provided by INSAID on their website. If you liked this story, I recommend the Global Certificate in Data Science, because it covers your foundations plus machine learning algorithms (basic to advanced).