How my team overcame its “Rookie” Hiccups in our first ever “Data Science Hackathon”!
Hackathon — The term was among our “Most Desirable Tech Events to attend before you die” list.
Data Hackathon — This term qualified among our “It sounds cool, let’s try it” list.
So Team Analytica (that was us) chose to attend the 24-hour SLAC Honeywell Hackathon 2019, held at Amrita School of Engineering, Bangalore, sponsored by Honeywell and organized by FACE (the CSE departmental forum at the school).
Pre Hackathon
A month before hack day, we were dozing off in our classes, making plans to game and binge-watch during the hack, and “pondering” over whether to actually make a submission.
However, a week before the hack, we realized that what we had been seeing as “Just Another Hackathon” was in fact an opportunity in disguise: an opportunity to gain real experience in a field widely touted as an angelic contribution of human thought to the universe. We had in front of us a door to a new world, one that has grown enormously over the past decade thanks to the multifaceted implications of its science. On this enlightening realization, we began our preparation for the hack. But within a couple of hours, the several other assignments on our minds supplanted any preparation.
A day before the hack, we prepared. Probably the only significant prep we did came around 10 hours before the hack commenced. However, it is worth mentioning that this “prep” was nothing more than creating Azure ML Studio accounts and watching tutorials on Keras (because we HAD TO USE DEEP LEARNING BECAUSE IT SOUNDED COOL).
Taking on the Hackathon
The Honeywell team came in a group of 11, bringing the most novel problem statements and highly experienced mentors as guides for each of them. There were four problem statements centered around the following main ideas:
- Classification
- Regression
- Ideation
- Image Processing
We narrowed in on problem statements 1 and 2, and later decided to work on problem statement 2 (regression) because we felt that “regression was easy” (we are changed human beings now).
Choosing our Methodology
We were well aware that our team was a bunch of rookies with very little experience of real-world data science. The only data science we knew was limited to the long paragraphs we write in examinations to convince the teacher of our ability to memorize several pages of highly mathematical approaches to the discipline. It was therefore very important for us as a team to at least stick to a good, channelized approach to tackle the problem at hand. We decided to follow the OSEMN framework, a methodology I had used earlier for a college project. OSEMN consists of five distinct, pipelined processes:
- Obtain : Obtaining the data needed for the analysis
- Scrub : Preprocessing and cleaning the data so that useful predictions can be made
- Explore : Visualizing data to identify relations within the data, creating stories from the data
- Model : Creating ML/DL models to solve the task at hand
- Interpret : View the results, understand the deeper meanings implied by the data, and re-iterate the whole process from any intermediate step, depending on how much further you want to optimize your work
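For what it is worth, the five stages above can be sketched as a minimal Python pipeline. Everything here is illustrative: the function names are my own, and a tiny synthetic dataset stands in for any real data source.

```python
# A minimal, illustrative sketch of the OSEMN stages as a pipeline.
# All names are hypothetical; synthetic data stands in for a real source.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def obtain():
    # Obtain: fabricate a tiny dataset in place of a real data source.
    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 2 * x + rng.normal(scale=0.1, size=100)
    return pd.DataFrame({"feature": x, "target": y})

def scrub(df):
    # Scrub: drop rows that contain missing values.
    return df.dropna()

def explore(df):
    # Explore: summary statistics stand in for visual EDA here.
    return df.describe()

def model(df):
    # Model: fit a simple regressor on the cleaned data.
    reg = LinearRegression()
    reg.fit(df[["feature"]], df["target"])
    return reg

def interpret(reg, df):
    # Interpret: check R^2; a poor score sends us back up the pipeline.
    return reg.score(df[["feature"]], df["target"])

data = scrub(obtain())
summary = explore(data)
score = interpret(model(data), data)
print(f"R^2 = {score:.2f}")
```

The point of the skeleton is the pipelining: each stage consumes the previous stage’s output, so any stage can be swapped or repeated without touching the rest.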
This blog by Dr. Cher Han Lau was what introduced me to the OSEMN life cycle, and it is probably the best delineation of the topic I have seen.
Inspecting our data
The data we were provided was a much tidier dataset than we had expected. It had 24 columns (called features in ML terms) and over 45,000 rows (called instances), of which only 19 had missing values. We decided to drop these 19 rows, since doing so would barely affect our analysis: we would still retain over 99.95% of the data.
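In pandas, this row-dropping step is a one-liner. Here is a hedged sketch on a made-up 1,000-row frame standing in for the real competition dataset:

```python
# Illustrative sketch of dropping rows with missing values; the toy
# 1,000-row frame stands in for the real 45,000-row competition dataset.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.default_rng(1).normal(size=(1000, 4)),
                  columns=["f1", "f2", "f3", "target"])
df.iloc[[3, 57, 420], 0] = np.nan      # inject 3 missing values

clean = df.dropna()                    # drop rows containing any NaN
retained = len(clean) / len(df) * 100
print(f"kept {retained:.2f}% of rows")  # → kept 99.70% of rows
```

Checking the retention percentage before dropping is the important habit: had a third of the rows contained gaps, imputation would have been the saner choice.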
The next thing we did was inspect the type of each feature. All but one were either int or float. The only feature of Object type was later removed, as it held faulty data (as our mentors informed us). Finally, we plotted correlation plots of the features in our dataset to identify which ones had the strongest correlations (positive or negative) with our target feature.
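A rough sketch of those inspection steps, again on synthetic data with made-up column names: check the dtypes, keep only the numeric columns, and rank the remaining features by their correlation with the target.

```python
# Illustrative sketch: inspect dtypes, drop the non-numeric column, and
# rank features by correlation with the target. Data and names are made up.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "f1": rng.normal(size=200),
    "f2": rng.normal(size=200),
    "bad": ["x"] * 200,  # stand-in for the faulty non-numeric column
})
df["target"] = 3 * df["f1"] + rng.normal(scale=0.5, size=200)

print(df.dtypes)  # reveals the lone non-numeric column

# Keep only numeric columns (drops "bad").
df = df[df.select_dtypes(include="number").columns]

# Correlation of each feature with the target, strongest first.
corr = df.corr()["target"].drop("target").sort_values(key=abs, ascending=False)
print(corr)  # "f1" should dominate, by construction
```

Sorting by absolute value matters: a strong negative correlation is just as informative as a strong positive one.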
EDA to the rescue
One of our first visualizations of the data looked like this. Personally, I love it, as it tells an extremely important story about an important event.
Exploratory Data Analysis (EDA) is a very common term in the data science world. It describes the use of visualization to identify and unearth the stories behind data. The EDA we did could have been a notch better (something we realized only at the end of the hack). However, we did use a few different types of plots to understand our data, and at the time it seemed like a pretty decent job. We also used EDA at a later stage to analyze and evaluate the performance of our models. We employed the following types of visualizations for our EDA:
- Density Plots
- Heatmaps
- Histograms
- Scatter Plots
- Line Graphs
For the EDA part, we used the Seaborn library, for which there are plenty of good guides online.
Doing the fun part — Modelling
And so we reach modelling! But there is something you need to know before reading ahead.
The tough part of machine learning is not modelling; it is pre-processing your data to make it fit to be modeled. Even the greatest ML/DL algorithms cannot save your project if your EDA is insufficient.
— Our mentor at the hack
For a team attempting a real-world data challenge for the first time in our lives, we dove headfirst into a large number of ML algorithms to work out our best solution.
It would be an exaggeration to say that we were completely unaware of what we were doing, but it would be equally overambitious to claim that we were in control.
Our approach was far from that of experienced ML practitioners or seasoned data science competitors, yet it had a tinge of stability: we were simply following the much-discussed “No Free Lunch” (NFL) theorem.
For the layperson, the theorem can be stated simply:
“No single algorithm works best for every problem. You have to try out every model you can think of in order to settle on the best choice.”
For anybody who desires a deeper read on NFL, the blog by Leon Fedden on the topic is a perfect read.
Since the scope of this story is only to recount my team’s experience, I will not go into the details of any of the algorithms we used and shall simply list them here. There are plenty of brilliant blogs on the internet about each of these, and interested readers can explore them in much more depth.
- Linear Regression (Multivariate)
- Support Vector Regression
- XGBoost for Regression
- Random Forest for Regression
- A simple feed forward Neural Network architecture
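The “try everything” loop implied by No Free Lunch can be sketched as below. This is a hedged illustration on synthetic data, not our actual code: scikit-learn’s GradientBoostingRegressor stands in for XGBoost and MLPRegressor for our Keras network, since those libraries may not be installed.

```python
# Hypothetical sketch of the NFL-style model bake-off on synthetic data.
# GradientBoostingRegressor stands in for XGBoost, MLPRegressor for Keras.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 5))
y = X @ np.array([1.5, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.2, size=400)

models = {
    "linear":   LinearRegression(),
    "svr":      SVR(),
    "boosting": GradientBoostingRegressor(random_state=0),
    "forest":   RandomForestRegressor(n_estimators=50, random_state=0),
    "mlp":      MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000,
                             random_state=0),
}

# Score every candidate with 5-fold cross-validated R^2 and keep the best.
scores = {name: cross_val_score(m, X, y, cv=5, scoring="r2").mean()
          for name, m in models.items()}
best = max(scores, key=scores.get)
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:8s} R^2 = {s:.3f}")
```

On this deliberately linear toy problem, the simplest model tends to win the bake-off; on the real competition data, the ranking could look completely different, which is the whole point of NFL.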
In spite of having tried all of these models, we received “good” results only from linear regression and the neural network. This does not discredit the other models; it only means they did not help with this particular problem. They remain very strong contenders for plenty of other ML problems in the world.
Getting Mentored
This part was extremely important for us. It is where we learnt about our pitfalls and were constantly evaluated by our mentors on the problem statement. They were extremely knowledgeable and gave us all sorts of hints to help us converge on the best solution to the problem.
Therefore, our advice to any newbie at a data hack would be:
Do not shy away from talking to your mentors. Walk up to them and discuss your data problems, because they have surely seen far more data science than you have.
Surfing the web for tutorials and help
There is no hackathon without surfing the internet. Every competitor, irrespective of their experience or depth in the subject, needs good sources of help to tackle their objectives. Without prolonging this section, here are the websites that helped us in our learning process:
- Medium
- Analytics Vidhya
- Kaggle Kernels
- Real Python — A personal favorite for anything related to Python
The reason these links matter is that they are gateways to some of the best learning repositories out there. These websites provide immense help to all kinds of data scientists, irrespective of whether you are only one project old or have published a zillion research papers on the topic.
Dealing with “Times of Stagnation”
A data hack can be exhausting even for the most experienced competitors, largely because data throws up a horde of surprises every time you observe it from a different dimension. Newer observations cast serious doubts on older ones, and things can get messy. The key is to stay calm and take a small break. Keeping a clear head sorts things out far better than getting skittish or losing interest altogether.
Surprisingly, none of us panicked during the hack. This is probably because we never expected to put up a good show. All we wanted was to learn a bit about a field that had fascinated us ever since we heard the incredible “Beer and Diaper” story of Walmart.
We had our breaks (lots of them), where we munched on fries and chips and washed it all down with several cups of coffee. We took walks around the venue and even played a few games of Taboo to unwind from the stress of the 24-hour hack.
We had our team bonding sessions, where “Trivial Prattle” percolated through the walls of “Intellectual Reasoning”.
But we did snap out of these light sessions of stagnation and quickly got back to rectifying our models and devising newer strategies for our data.
The “Mother” of Surprises
The next morning brought the final round of evaluation for all 50 teams participating in the hack. Our mentors came up to us and noted down the value of our evaluation metric, and then all the teams broke for breakfast.
We were in the school cafeteria when one of the hack organizers called us back to the venue for the announcement of the top 10 finalists, and guess what?
WE WERE AMONG THE TOP 10!!!
This was perhaps the greatest moment ever for each one of us on Team Analytica. We had qualified among the top 20% of the competing teams despite it being our first hack. This significantly boosted our confidence.
We were obviously not the greatest team at the hack: we were not the winners, we did not create the most efficient model out there, and we were not Kaggle Grandmasters.
But we were much better data scientists than we had been 24 hours earlier, and that is all that mattered to us.
The long-forgotten “David and Goliath” story came back to life in our case (or rather, that is how we like to see it).
Our Takeaways from the Hack
The following are the most important things we learnt during this hack, and I would love to share them here to help future data hack participants.
- Don’t let academic theory beguile you
Linear regression, something most academics dismiss as easy, managed to consume 24 hours of our team’s time.
- Have a data science methodology in place
When you are new to a field, there are many factors you can’t control. However, sticking to a unified process can do wonders.
- Pre-process and explore your data well
A main reason we could not win a prize was that our pre-processing fell short and we did not conduct sufficient EDA. As a result, we missed several important relations within our data.
- Improve your summarizing skills
The final presentation we created left out several important visualizations we had produced, owing to our poor summarization. So watch that aspect for a better presentation of results.
- Network with your mentors and fellow participants
Nothing beats collective learning.
- Love the data journey, not the destination
Don’t let the end results bog you down or put you on a pedestal.
I hope our journey acts as a catalyst for you to take part in a data hack and become part of one of the fastest-growing fields in the world today.
Acknowledgements
Firstly, I thank Honeywell for conducting the hack, giving us real-world problems to work on, and assigning erudite professionals as our mentors. The SLAC Honeywell Data Hack 2019 significantly shortened our learning curve in data science.
Secondly, I thank Team FACE 2019–20 for organizing a large-scale hack and making it look all too simple to an attendee. You guys did a really awesome job! I would also like to extend my team’s gratitude to the CSE Department at ASE-B for being ever so supportive of all the endeavors taken up by FACE. It’s great to be part of a department that always manages to watch our back 😃.
Lastly, I want to thank my team, Team Analytica. It was a really good experience working with them, and together we learnt so much.
Hope you had a good read!
Do encourage the blogger in me by applauding generously if the story manages to interest you. 😄 😄
You can find me on Twitter here where I intend to start releasing frequent posts on Data Science Concepts that I come across via my projects.