Kaggle Tabular Playground Series: March Edition (Part 2)
Welcome to the March chapter of the Kaggle Tabular Playground Series.
This is the second article of a four-part series in which I cover the Kaggle Playground Series datasets: describing the data, processing it, deriving insights, and making predictions with various Python libraries.
In the first article of this series, we got a glimpse of what the dataset looks like and introduced some new features that will help with the analysis we perform in this article. The entire analysis performed can be found here.
The EDA is organized around a set of questions about the features of the dataset. For each one, we find the answer and note down the observations made. So, let’s get started!
Question 1: What is the spread of the target feature?
- The congestion feature is slightly right-skewed, but close enough to symmetric that we can treat it as approximately normal.
- There are some outliers present as well.
- We won’t be changing any properties or values of this feature.
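As a rough way to put numbers on "slightly skewed" and "some outliers", we can compute the sample skewness and apply the 1.5 × IQR rule. This is only a sketch on a small synthetic series; the values below are made up for illustration and are not taken from the competition data.

```python
import pandas as pd

# Toy stand-in for the congestion column (synthetic values, not competition data).
congestion = pd.Series([20, 35, 40, 42, 45, 47, 50, 52, 55, 60, 98], name="congestion")

# Sample skewness: values above 0 indicate a longer right tail.
skewness = congestion.skew()

# Tukey's rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = congestion.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = congestion[(congestion < q1 - 1.5 * iqr) | (congestion > q3 + 1.5 * iqr)]

print(f"skewness: {skewness:.2f}")
print("outliers:", outliers.tolist())
```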
Question 2: What is the relationship between x and congestion?
- In the middle chart, we observe that the x = 1 location is the busiest, while x = 0 has the least traffic.
Question 3: What is the relationship between y and congestion?
- Finally, in the chart on the right, we see that the y = 0 and y = 2 locations are the busiest.
- The gap between these two and the rest of the y values is noticeable.
Question 4: What is the relationship between the direction of the traffic and congestion?
- The directions with the highest congestion are South Bound and North Bound.
- Mean congestion from NE, NW, SE, SW directions is quite low.
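The comparisons in Questions 2 to 4 all boil down to a group-wise mean of congestion. Here is a minimal sketch of that idea, assuming the columns are named x, y, direction, and congestion. The helper name mean_congestion_by, the direction codes, and the rows are hypothetical and only serve to make the example runnable.

```python
import pandas as pd

# Synthetic toy rows; the direction codes ("NB", "SB", "NE") are assumed labels.
df = pd.DataFrame({
    "x": [0, 0, 1, 1, 2, 2],
    "y": [0, 1, 0, 1, 0, 1],
    "direction": ["NB", "SB", "NB", "NE", "SB", "NE"],
    "congestion": [30, 40, 60, 20, 50, 10],
})

def mean_congestion_by(frame: pd.DataFrame, column: str) -> pd.Series:
    """Mean congestion per category of `column`, busiest first."""
    return frame.groupby(column)["congestion"].mean().sort_values(ascending=False)

for col in ["x", "y", "direction"]:
    print(mean_congestion_by(df, col), end="\n\n")
```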
Question 5: How does the average congestion look over time?
- We observe a strong seasonality with respect to the congestion rate per week.
- Moreover, the trend remains almost constant, rising only marginally over time.
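One simple way to check for weekly seasonality and a flat trend is to resample to daily means, smooth with a 7-day rolling mean, and look at the autocorrelation at a lag of 7 days. The sketch below assumes a timestamp column called time and uses a synthetic series (weekdays at one level, weekends at another) rather than the real data.

```python
import numpy as np
import pandas as pd

# Synthetic series at 20-minute steps: four weeks starting on a Monday,
# with weekdays at 50 and weekends at 30 (made-up values).
idx = pd.date_range("2021-04-05", periods=4 * 7 * 72, freq="20min")
values = np.where(idx.dayofweek < 5, 50.0, 30.0)
df = pd.DataFrame({"time": idx, "congestion": values})

# Daily mean congestion exposes the weekly cycle.
daily = df.set_index("time")["congestion"].resample("D").mean()

# A centered 7-day rolling mean averages out one full week, leaving the trend.
trend = daily.rolling(window=7, center=True).mean()

# An autocorrelation close to 1 at lag 7 signals a strong weekly pattern.
weekly_autocorr = daily.autocorr(lag=7)

print(trend.dropna().head())
print(f"lag-7 autocorrelation: {weekly_autocorr:.3f}")
```

On this toy series the rolling mean is flat because every 7-day window contains the same mix of weekdays and weekend days. On real data, any slope left in the smoothed series would be the trend.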
Question 6: What does the congestion look like throughout the week?
- As we can see, working days of the week have a similar congestion rate.
- Likewise, we can see that the weekend days are the ones with the least traffic, with Sunday being the quietest day.
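A per-weekday comparison like this can be sketched by grouping on the day name derived from the timestamp. The time column name and the toy readings are assumptions for illustration only.

```python
import pandas as pd

# Synthetic toy data: one reading per day for two weeks, starting on a Monday.
df = pd.DataFrame({
    "time": pd.date_range("2021-04-05", periods=14, freq="D"),
    "congestion": [52, 50, 50, 50, 52, 40, 35] * 2,
})

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Mean congestion per weekday, kept in calendar order rather than alphabetical.
by_weekday = (
    df.groupby(df["time"].dt.day_name())["congestion"]
    .mean()
    .reindex(day_order)
)
print(by_weekday)
```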
Question 7: How does the congestion differ on Mondays as compared to the rest of the week?
- Mondays do have slightly higher traffic as compared to the rest of the days of the week.
Question 8: How does the congestion differ on Fridays as compared to the rest of the week?
- Similar to what we observed for Mondays, Fridays too have slightly higher traffic as compared to the rest of the days of the week.
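To compare a single day against the rest of the week, as in Questions 7 and 8, a boolean mask is enough. The helper day_vs_rest below is a hypothetical name, and the data is synthetic.

```python
import pandas as pd

# Synthetic toy data: one reading per day for two weeks, starting on a Monday.
df = pd.DataFrame({
    "time": pd.date_range("2021-04-05", periods=14, freq="D"),
    "congestion": [52, 50, 50, 50, 52, 40, 35] * 2,
})
df["weekday"] = df["time"].dt.dayofweek  # Monday=0 ... Sunday=6

def day_vs_rest(frame: pd.DataFrame, weekday: int) -> tuple:
    """Return (mean on the given weekday, mean on all other days)."""
    is_day = frame["weekday"] == weekday
    return (
        frame.loc[is_day, "congestion"].mean(),
        frame.loc[~is_day, "congestion"].mean(),
    )

monday_mean, rest_of_week_mean = day_vs_rest(df, 0)
friday_mean, rest_excluding_friday = day_vs_rest(df, 4)
print(monday_mean, round(rest_of_week_mean, 2))
print(friday_mean, round(rest_excluding_friday, 2))
```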
Question 9: What does the congestion look like throughout the day?
- We can observe a sudden increase in traffic after the early morning, followed by a dip as we move toward the night.
- We can also see a sharp drop from night congestion to late-night (after midnight) congestion.
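A part-of-day view like this can be built by binning the hour of the day. The bin edges and labels below are hypothetical (the exact boundaries used for the plots are not stated here), and the readings are synthetic.

```python
import pandas as pd

# Synthetic hourly readings for a single day (made-up values).
df = pd.DataFrame({"hour": range(24)})
df["congestion"] = [20] * 6 + [45] * 6 + [55] * 6 + [40] * 4 + [30] * 2

# Hypothetical part-of-day boundaries; each bin includes its left edge only.
bins = [0, 6, 12, 18, 24]
labels = ["late night", "morning", "afternoon", "night"]
df["part_of_day"] = pd.cut(df["hour"], bins=bins, labels=labels, right=False)

# Mean congestion for each part of the day.
by_part = df.groupby("part_of_day", observed=True)["congestion"].mean()
print(by_part)
```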
Question 10: What does the congestion look like on an hourly basis?
- We observe that there is an increase in traffic at the beginning of the day.
- Traffic peaks at around 8 AM before dipping down till noon.
- The busiest hours are between 13h and 17h (1 PM to 5 PM), after which the congestion rate decreases as night falls.
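The hourly profile can be sketched as a group-wise mean over the hour of the day, from which the busiest hour and the morning peak are easy to read off. The column name time and the daily shape below are assumptions; the numbers are synthetic and merely shaped to resemble the pattern described above.

```python
import pandas as pd

# Synthetic daily shape (one value per hour), repeated for two days.
daily_shape = (
    [20] * 6
    + [35, 45, 55, 50, 48, 47]   # hours 6 to 11
    + [50, 58, 60, 62, 59, 57]   # hours 12 to 17
    + [45, 40, 35, 30, 25, 22]   # hours 18 to 23
)
df = pd.DataFrame({
    "time": pd.date_range("2021-04-05", periods=48, freq="60min"),
    "congestion": daily_shape * 2,
})

# Mean congestion for each hour of the day (0 to 23).
by_hour = df.groupby(df["time"].dt.hour)["congestion"].mean()

busiest_hour = by_hour.idxmax()
morning_peak = by_hour.loc[6:11].idxmax()  # label-based slice, both ends included
print(busiest_hour, morning_peak)
```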
Question 11: What does the congestion look like on an hourly basis throughout the week?
- We can see that on working days the hourly congestion pattern is quite similar from one day to the next.
- However, this changes when we get into the weekend.
- This is likely because most people don’t have to work on weekends, which causes the congestion rates to go down.
- Moreover, the weekend congestion trend does not have as many ups and downs as the working days have.
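An hour-by-weekday table makes this comparison concrete: each column is one day's hourly profile, and the spread within a column shows how many "ups and downs" that day has. This is a sketch on synthetic data, assuming a time column; the weekday bump and flat weekend are made up to mirror the observation.

```python
import numpy as np
import pandas as pd

# Synthetic week of hourly readings starting on a Monday:
# weekdays get a daytime bump, weekends stay flat (made-up values).
times = pd.date_range("2021-04-05", periods=7 * 24, freq="60min")
is_weekend = times.dayofweek >= 5
daytime = (times.hour >= 8) & (times.hour < 18)
congestion = np.where(is_weekend, 30.0, np.where(daytime, 55.0, 35.0))

df = pd.DataFrame({"time": times, "congestion": congestion})
df["hour"] = df["time"].dt.hour
df["weekday"] = df["time"].dt.dayofweek  # Monday=0 ... Sunday=6

# Rows are hours of the day, columns are weekdays.
hour_by_day = df.pivot_table(
    index="hour", columns="weekday", values="congestion", aggfunc="mean"
)

# Max minus min of each day's hourly profile: small values mean a flat day.
daily_range = hour_by_day.max() - hour_by_day.min()
print(daily_range)
```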
Question 12: What does the congestion look like across the various directions with respect to midpoint coordinates?
- Congestion is consistently heaviest for East Bound traffic, and is sometimes also noticeable in the southwest direction.
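To see which direction dominates at each location, we can average congestion per (x, y, direction) combination and then pick the top direction for each coordinate pair. The sketch assumes the coordinates are the x and y columns; the direction codes and values are synthetic.

```python
import pandas as pd

# Synthetic toy rows; direction codes ("EB", "WB", "SW") are assumed labels.
df = pd.DataFrame({
    "x": [0, 0, 0, 1, 1, 1],
    "y": [0, 0, 0, 2, 2, 2],
    "direction": ["EB", "WB", "SW", "EB", "WB", "SW"],
    "congestion": [60, 40, 30, 55, 35, 50],
})

# Mean congestion for every (x, y, direction) combination.
road_means = df.groupby(["x", "y", "direction"])["congestion"].mean()

# For each (x, y) location, idxmax returns the full (x, y, direction) key
# of the heaviest entry; keep only the direction part.
heaviest = road_means.groupby(level=["x", "y"]).idxmax().map(lambda key: key[2])
print(heaviest)
```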
Conclusion
- With this, we conclude our EDA.
- Some of the notable observations include the rate of congestion at different times of the day or week, followed by some spatial observations concerning the direction of the congestion throughout the span of 6 months.
- You can find the notebook here with the code and plots.
- In the next article, we will create some additional features and perform feature selection to determine which combination of features gives the best result for predicting congestion.
Final Thoughts and Closing Comments
There are some vital points many people fail to understand while they pursue their Data Science or AI journey. If you are one of them and are looking for a way to close these gaps, check out the certification programs provided by INSAID on their website. If you liked this story, I recommend the Global Certificate in Data Science, because it covers the foundations as well as machine learning algorithms (basic to advanced).