The importance of a well-scoped problem in Data Science

Photo by Karen Lau on Unsplash

When you start a project, it is so easy to get caught up in the excitement. There’s something cool and intriguing that you want to try! You want to go! Yet investing the time to scope a well-defined problem is often most of the solution.

This post is about my first week at Metis, a 12-week data science bootcamp, and how our first project taught me to define a problem well before zooming in too far or moving too fast.

The Project

For our project, we were tasked with using MTA subway data to do some exploratory data analysis. We came up with a scenario in which we were a team entering a competition, run by an environmental NGO, to see who could fundraise the most during a specific week.

By the end, we had a schedule for our team to follow that sent us to specific busy subway stations. Each day we targeted the top 4 stations (since there were 4 of us) and the busiest 4-hour window when subway riders exited the train (we only looked at exits because we wanted to sell food products). For those curious, we used Python with the Pandas and Seaborn libraries, plus the Dark Sky weather API.
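For readers who want to see what that selection step might look like, here is a minimal sketch in Pandas. It is not our actual project code. It assumes the turnstile data has already been cleaned into one row per station, date and 4-hour window, and the column names (`date`, `station`, `window_start`, `exits`) and the function name `busiest_stations` are hypothetical. It also assumes that "top stations" means ranking each station by its single busiest window of the day, which is only one of several reasonable definitions.

```python
import pandas as pd


def busiest_stations(exits: pd.DataFrame, n_stations: int = 4) -> pd.DataFrame:
    """For each date, return the n_stations stations whose busiest window
    has the most exits, together with that window.

    Expects hypothetical columns: date, station, window_start, exits.
    """
    # Row label of the busiest window for every (date, station) pair.
    best_rows = exits.groupby(["date", "station"])["exits"].idxmax()
    best = exits.loc[best_rows]

    # Within each date, keep the n_stations rows with the highest exits.
    return (
        best.sort_values(["date", "exits"], ascending=[True, False])
        .groupby("date")
        .head(n_stations)
        .reset_index(drop=True)
    )


# Toy data for illustration only (not real MTA numbers).
toy = pd.DataFrame(
    {
        "date": ["2018-07-02"] * 6,
        "station": ["A", "A", "B", "B", "C", "C"],
        "window_start": ["08:00", "16:00"] * 3,
        "exits": [100, 300, 250, 50, 10, 20],
    }
)
schedule = busiest_stations(toy, n_stations=2)
print(schedule)
```

With four team members you would call it with `n_stations=4` and assign one person to each returned row.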

Below is our graph of one day of hotspots:

MTA Subway exit traffic in July 2018

If you’re interested in the whole presentation, see here.

The Lesson: Problem Scoping

The biggest lesson I learned is how much time you can save by scoping your problem to very specific parameters. We started the project by each examining the data, and we had a fun discussion about the scenario we wanted to run. This was a good start.

Where we fell short was in not specifying how many people would be on the fundraising “street team” and exactly how we’d define a busy station. Is it by day? By hour? Will our fundraising team work 8-hour shifts when deployed on the streets?

As a result, we wasted quite a bit of time and duplicated code to group our data by various time scales. That code took a while to write and wasn’t generalizable.
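In hindsight, one way to avoid duplicating grouping code is to make the time scale a parameter. The sketch below is an illustration rather than what we wrote at the time. It assumes a cleaned table with hypothetical columns `station`, `timestamp` (a datetime) and `exits`, and the function name `exits_by_period` is made up for this example. It relies on `pd.Grouper`, which accepts a pandas offset alias such as `"D"` for daily or `"4h"` for 4-hour bins.

```python
import pandas as pd


def exits_by_period(df: pd.DataFrame, freq: str) -> pd.DataFrame:
    """Sum exits per station at any time scale.

    freq is a pandas offset alias, e.g. "D" (daily) or "4h" (4-hour bins).
    Expects hypothetical columns: station, timestamp (datetime64), exits.
    """
    return (
        df.groupby(["station", pd.Grouper(key="timestamp", freq=freq)])["exits"]
        .sum()
        .reset_index()
    )


# Toy data for illustration only (not real MTA numbers).
toy = pd.DataFrame(
    {
        "station": ["A", "A", "A", "B"],
        "timestamp": pd.to_datetime(
            [
                "2018-07-02 01:00",
                "2018-07-02 02:00",
                "2018-07-02 05:00",
                "2018-07-02 01:00",
            ]
        ),
        "exits": [10, 5, 20, 7],
    }
)

daily = exits_by_period(toy, "D")       # one row per station per day
four_hour = exits_by_period(toy, "4h")  # one row per station per 4-hour bin
print(daily)
print(four_hour)
```

Switching from daily to hourly or 4-hour analysis then becomes a one-argument change instead of a new copy of the grouping code.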

My advice is to get specific: discuss as many concrete time frames and parameters with your team up front as you can think of. You’ll never catch them all, but hopefully this process will save you some of the time we lost.
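One lightweight way to make those decisions explicit is to write them down as a small, immutable config object before any analysis code exists. This is only a sketch of the idea: the class name `ProjectScope` and its fields are hypothetical, and the values are filled in from our scenario (4 people, 4-hour windows, exits only).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectScope:
    """Scoping decisions agreed with the team up front (hypothetical fields)."""

    team_size: int         # people on the street team
    shift_hours: int       # length of one fundraising shift, in hours
    time_bucket: str       # pandas offset alias that defines "busy", e.g. "4h"
    metric: str = "exits"  # which turnstile count we rank stations by

    def person_hours_per_day(self) -> int:
        # Total fundraising effort available each day.
        return self.team_size * self.shift_hours


# Values taken from the scenario described in this post.
scope = ProjectScope(team_size=4, shift_hours=4, time_bucket="4h")
print(scope)
print(scope.person_hours_per_day())
```

Because the object is frozen, changing a parameter mid-project means creating a new scope on purpose, which is a useful prompt to talk it over with the team first.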

Conclusion

Getting to manipulate a large dataset with new tools was an amazing experience. Our team had a lot of fun and learned a lot along the way. In the end, the most valuable lesson for me this week was that Data Science isn’t only about the hard skills; it’s about the soft skills too. A well-defined, well-scoped problem could’ve saved us a lot of time, and I plan to use this approach in my next project at Metis.

Appendix: For Aspiring Data Scientists

Curious to learn more about how we approached the problem? Take a look at our GitHub repo here to see the data cleaning you’ll need to do on the MTA dataset, and more.