How my team overcame its “Rookie” Hiccups in our first ever “Data Science Hackathon”!
Hackathon — The term was among our “Most Desirable Tech Events to attend before you die” list.
Data Hackathon — This term qualified among our “It sounds cool, let’s try it” list.
So Team Analytica (that was us) chose to attend the 24-hour SLAC Honeywell Hackathon 2019, held at Amrita School of Engineering, Bangalore, sponsored by Honeywell and organized by FACE (the CSE departmental forum at the school).
Pre Hackathon
A month before hack day, we were dozing off in our classes, making plans to game and binge-watch during the hack, and “pondering” over whether to actually make a submission.
However, a week before the hack, we realized that what we had been seeing as “Just Another Hackathon” was in fact an opportunity in disguise: an opportunity to gain real experience in a field widely touted as an angelic contribution of human thought to the universe. We had in front of us a door to a new world, one that has grown enormously over the past decade thanks to the multifaceted implications of its science. On this enlightening realization, we began our preparation for the hack. But within a couple of hours, the several other assignments on our minds supplanted any preparation.
A day before the hack, we prepared. Probably the only significant prep we did came around 10 hours before the hack commenced. However, it is worth mentioning that this “prep” was nothing more than creating Azure ML Studio accounts and watching tutorials on Keras (because we HAD TO USE DEEP LEARNING BECAUSE IT SOUNDED COOL).
Taking on the Hackathon
The Honeywell team came in a group of 11, bringing the most novel problem statements and highly experienced mentors as guides for each of them. There were four problem statements centered around the following main ideas:
- Classification
- Regression
- Ideation
- Image Processing
We narrowed in on problem statements 1 and 2, and later decided to work on problem statement 2 (regression) because we felt that “regression was easy” (we are changed human beings now).
Choosing our Methodology
We were well aware that our team was a bunch of rookies with very little experience of real-world data science. The only data science we knew was limited to the long paragraphs we write in examinations to convince the teacher of our ability to memorize several pages of highly mathematical approaches to the discipline. It was therefore very important for us as a team to at least stick to a good, channelized approach to tackle the problem at hand. We decided to follow the OSEMN framework, a methodology I had used earlier for a college project. OSEMN consists of five distinct, pipelined processes:
- Obtain : Obtaining the data needed for the analysis
- Scrub : Preprocessing and cleaning the data so that useful predictions can be made
- Explore : Visualizing data to identify relations within the data, creating stories from the data
- Model : Creating ML/DL models to solve the task at hand
- Interpret : View the results, understand the deeper meanings implied by the data, and re-iterate the whole process from any intermediate step, depending on how much further you want to optimize your work
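For what it is worth, the five stages above can be sketched as a minimal Python pipeline. Everything here is illustrative: the function names are my own, and a tiny synthetic dataset stands in for any real data source.

```python
# A minimal, illustrative sketch of the OSEMN stages as a pipeline.
# All names are hypothetical; synthetic data stands in for a real source.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def obtain():
    # Obtain: fabricate a tiny dataset in place of a real data source.
    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 2 * x + rng.normal(scale=0.1, size=100)
    return pd.DataFrame({"feature": x, "target": y})

def scrub(df):
    # Scrub: drop rows that contain missing values.
    return df.dropna()

def explore(df):
    # Explore: summary statistics stand in for visual EDA here.
    return df.describe()

def model(df):
    # Model: fit a simple regressor on the cleaned data.
    reg = LinearRegression()
    reg.fit(df[["feature"]], df["target"])
    return reg

def interpret(reg, df):
    # Interpret: check R^2; a poor score sends us back up the pipeline.
    return reg.score(df[["feature"]], df["target"])

data = scrub(obtain())
summary = explore(data)
score = interpret(model(data), data)
print(f"R^2 = {score:.2f}")
```

The point of the skeleton is the pipelining: each stage consumes the previous stage’s output, so any stage can be swapped or repeated without touching the rest.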
This blog by Dr. Cher Han Lau was what introduced me to the OSEMN life cycle, and it is probably the best delineation of the topic I have seen.
Inspecting our data
The data we were provided was a much tidier dataset than we had expected. It had 24 columns (called features in ML terms) and over 45,000 rows (called instances), of which only 19 had missing values. We decided to drop these 19 rows, since doing so would barely affect our analysis: we would still retain over 99.95% of the data.
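In pandas, this row-dropping step is a one-liner. Here is a hedged sketch on a made-up 1,000-row frame standing in for the real competition dataset:

```python
# Illustrative sketch of dropping rows with missing values; the toy
# 1,000-row frame stands in for the real 45,000-row competition dataset.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.default_rng(1).normal(size=(1000, 4)),
                  columns=["f1", "f2", "f3", "target"])
df.iloc[[3, 57, 420], 0] = np.nan      # inject 3 missing values

clean = df.dropna()                    # drop rows containing any NaN
retained = len(clean) / len(df) * 100
print(f"kept {retained:.2f}% of rows")  # → kept 99.70% of rows
```

Checking the retention percentage before dropping is the important habit: had a third of the rows contained gaps, imputation would have been the saner choice.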
The next thing we did was inspect the type of each feature. All but one were either int or float. The only feature of Object type was later removed, as it held faulty data (as our mentors informed us). Finally, we plotted correlation plots of the features in our dataset to identify which ones had the strongest correlations (positive or negative) with our target feature.
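A rough sketch of those inspection steps, again on synthetic data with made-up column names: check the dtypes, keep only the numeric columns, and rank the remaining features by their correlation with the target.

```python
# Illustrative sketch: inspect dtypes, drop the non-numeric column, and
# rank features by correlation with the target. Data and names are made up.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "f1": rng.normal(size=200),
    "f2": rng.normal(size=200),
    "bad": ["x"] * 200,  # stand-in for the faulty non-numeric column
})
df["target"] = 3 * df["f1"] + rng.normal(scale=0.5, size=200)

print(df.dtypes)  # reveals the lone non-numeric column

# Keep only numeric columns (drops "bad").
df = df[df.select_dtypes(include="number").columns]

# Correlation of each feature with the target, strongest first.
corr = df.corr()["target"].drop("target").sort_values(key=abs, ascending=False)
print(corr)  # "f1" should dominate, by construction
```

Sorting by absolute value matters: a strong negative correlation is just as informative as a strong positive one.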
EDA to the rescue
One of our first visualizations of the data looked like this. Personally, I love it, as it tells an extremely important story about an important event.
Exploratory Data Analysis (EDA) is a very common term in the data science world. It describes the use of visualization to identify and unearth the stories behind data. The EDA we did could have been a notch better (something we realized only at the end of the hack). However, we did use a few different types of plots to understand our data, and at the time it seemed like a pretty decent job. We also used EDA at a later stage to analyze and evaluate the performance of our models. We employed the following types of visualizations for our EDA:
- Density Plots
- Heatmaps
- Histograms
- Scatter Plots
- Line Graphs
For the EDA part, we used the Seaborn library, for which there are plenty of good guides online.
Doing the fun part — Modelling
And so we reach modelling! But there is something you need to know before reading ahead.
The tough part of machine learning is not modelling; it is pre-processing your data to make it fit to be modeled. Even the greatest ML/DL algorithms cannot save your project if your EDA is insufficient.
— Our mentor at the hack
For a team attempting a real-world data challenge for the first time in our lives, we dove headfirst into a large number of ML algorithms to work out our best solution.
It would be an exaggeration to say that we were completely unaware of what we were doing, but it would be equally overambitious to claim that we were in control.
Our approach was far from that of experienced ML practitioners or seasoned data science competitors, yet it had a tinge of stability: we were simply following the much-discussed “No Free Lunch” (NFL) theorem.
For the layperson, the theorem can be stated simply:
“No single algorithm works best for every problem. You have to try out every model you can think of in order to settle on the best choice.”
For anybody who desires a deeper read on NFL, the blog by Leon Fedden on the topic is a perfect read.
Since the scope of this story is only to recount my team’s experience, I will not go into the details of any of the algorithms we used and shall simply list them here. There are plenty of brilliant blogs on the internet about each of these, and interested readers can explore them in much more depth.
- Linear Regression (Multivariate)
- Support Vector Regression
- XGBoost for Regression
- Random Forest for Regression
- A simple feed forward Neural Network architecture
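The “try everything” loop implied by No Free Lunch can be sketched as below. This is a hedged illustration on synthetic data, not our actual code: scikit-learn’s GradientBoostingRegressor stands in for XGBoost and MLPRegressor for our Keras network, since those libraries may not be installed.

```python
# Hypothetical sketch of the NFL-style model bake-off on synthetic data.
# GradientBoostingRegressor stands in for XGBoost, MLPRegressor for Keras.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 5))
y = X @ np.array([1.5, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.2, size=400)

models = {
    "linear":   LinearRegression(),
    "svr":      SVR(),
    "boosting": GradientBoostingRegressor(random_state=0),
    "forest":   RandomForestRegressor(n_estimators=50, random_state=0),
    "mlp":      MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000,
                             random_state=0),
}

# Score every candidate with 5-fold cross-validated R^2 and keep the best.
scores = {name: cross_val_score(m, X, y, cv=5, scoring="r2").mean()
          for name, m in models.items()}
best = max(scores, key=scores.get)
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:8s} R^2 = {s:.3f}")
```

On this deliberately linear toy problem, the simplest model tends to win the bake-off; on the real competition data, the ranking could look completely different, which is the whole point of NFL.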
In spite of having tried all of these models, we received “good” results only from linear regression and the neural network. This does not discredit the other models; it only means they did not help with this particular problem. They remain very strong contenders for plenty of other ML problems in the world.
Getting Mentored
This part was extremely important for us. It is where we learnt about our pitfalls and were constantly evaluated by our mentors on the problem statement. They were extremely knowledgeable and gave us all sorts of hints to help us converge on the best solution to the problem.
Therefore, our advice to any newbie at a data hack would be:
Do not shy away from talking to your mentors. Walk up to them and discuss your data problems, because they have surely seen far more data science than you have.
Surfing the web for tutorials and help
There is no hackathon without surfing the internet. Every competitor, irrespective of their experience or depth in the subject, needs good sources of help to tackle their objectives. Without prolonging this section, here are the websites that helped us in our learning process:
- Medium
- Analytics Vidhya
- Kaggle Kernels
- Real Python — A personal favorite for anything related to Python
The reason these links matter is that they are gateways to some of the best learning repositories out there. These websites provide immense help to all kinds of data scientists, irrespective of whether you are only one project old or have published a zillion research papers on the topic.
Dealing with “Times of Stagnation”
A data hack can be exhausting even for the most experienced competitors, largely because data throws up a horde of surprises every time you observe it from a different dimension. Newer observations cast serious doubts on older ones, and things can get messy. The key is to stay calm and take a small break. Keeping a clear head sorts things out far better than getting skittish or losing interest altogether.
Surprisingly, none of us panicked during the hack. This is probably because we never expected to put up a good show. All we wanted was to learn a bit about a field that had fascinated us ever since we heard the incredible “Beer and Diaper” story of Walmart.
We had our breaks (lots of them), where we munched on fries and chips and washed it all down with several cups of coffee. We took walks around the venue and even played a few games of Taboo to unwind from the stress of the 24-hour hack.
We had our team bonding sessions, where “Trivial Prattle” percolated through the walls of “Intellectual Reasoning”.
But we did snap out of these light sessions of stagnation and quickly got back to rectifying our models and devising newer strategies for our data.
The “Mother” of Surprises
The next morning brought the final round of evaluation for all 50 teams participating in the hack. Our mentors came up to us and noted down the value of our evaluation metric, and then all the teams broke for breakfast.
We were in the school cafeteria when one of the hack organizers called us back to the venue for the announcement of the top 10 finalists, and guess what?
WE WERE AMONG THE TOP 10!!!
This was perhaps the greatest moment ever for each one of us on Team Analytica. We had qualified among the top 20% of the competing teams despite it being our first hack. This significantly boosted our confidence.
We were obviously not the greatest team at the hack: we were not the winners, we did not create the most efficient model out there, and we were not Kaggle Grandmasters.
But we were much better data scientists than we had been 24 hours earlier, and that is all that mattered to us.
The long-forgotten “David and Goliath” story came back to life in our case (or rather, that is how we like to see it).
Our Takeaways from the Hack
The following are the most important things we learnt during this hack, and I would love to share them here to help future data hack participants.
- Don’t let academic theory beguile you
Linear regression, something most academics dismiss as easy, managed to consume 24 hours of our team’s time.
- Have a data science methodology in place
When you are new to a field, there are many factors you can’t control. However, sticking to a unified process can do wonders.
- Pre-process and explore your data well
A main reason we could not win a prize was that our pre-processing fell short and we did not conduct sufficient EDA. As a result, we missed several important relations within our data.
- Improve your summarizing skills
The final presentation we created left out several important visualizations we had produced, owing to our poor summarization. So watch that aspect for a better presentation of results.
- Network with your mentors and fellow participants
Nothing beats collective learning.
- Love the data journey, not the destination
Don’t let the end results bog you down or put you on a pedestal.
I hope our journey acts as a catalyst for you to take part in a data hack and become part of one of the fastest-growing fields in the world today.
Acknowledgements
Firstly, I thank Honeywell for conducting the hack, giving us real-world problems to work on, and assigning erudite professionals as our mentors. The SLAC Honeywell Data Hack 2019 significantly shortened our learning curve in data science.
Secondly, I thank Team FACE 2019–20 for organizing a large-scale hack and making it look all too simple to an attendee. You guys did a really awesome job! I would also like to extend my team’s gratitude to the CSE Department at ASE-B for being ever so supportive of all the endeavors taken up by FACE. It’s great to be part of a department that always manages to watch our back 😃.
Lastly, I want to thank my team, Team Analytica. It was a really good experience working with them, and together we learnt so much.
Hope you had a good read!
Do encourage the blogger in me by applauding generously if the story manages to interest you. 😄 😄
You can find me on Twitter here where I intend to start releasing frequent posts on Data Science Concepts that I come across via my projects.