How My Team of Undergrads Won a Data Science Competition for Graduate and Ph.D. Students
While competing in the Texas A&M Institute of Data Science’s competition, I uncovered what I think are a few keys to success in data science. I’ll outline my experience and explain how these keys helped my team of undergrads beat several teams of master’s and Ph.D. students.
The Competition Structure
The teams were given data from the Los Angeles Bike Share. The data details every trip between LA’s bike stations, giving the start and stop geolocations, the date and time, and the user’s payment plan (yearly subscriber, monthly subscriber, or one trip only). We were asked to turn in a 20-page report one month after the competition’s launch date. Seven groups would be selected from the report submissions to present to a panel of academic and industry judges, who would select the winners.
The Challenge Statement
“The central question is how have LA Bike commutes changed over from 2016 when the Metro Bike Share program began until today? In particular, contestants are asked to consider how revenue and trips for a typical day have changed over both location and time. Is the number of tickets and passports increasing in all three Regions? Has the mileage among one-way commuters increased, and how? Is the number of trips increasing? Team projects will be evaluated on elements of their analysis including methods used, level of depth, and correctness, as well as creativity and presentation skills.”
My Team
- Josiah Coad (Computer Science/Math)
- Chinmay Phulse (Computer Science)
- Sheelabhadra Dey (Computer Science)
- me (Statistics/Economics)
What the Other Teams Delivered
The other teams’ submissions had the following in common:
- They gave a report and only a report.
- The main focus of that report was a forecasting model for bike usage based on past bike usage.
- They used no outside data.
- In their presentations, they mostly talked about data cleaning, model selection, and hyperparameter optimization.
What My Team Delivered
We handed in a PDF (since one was required), but that PDF linked to an interactive web app with information and visualizations for all the models we developed. Our report focused on four things:
- A forecasting model for bike usage. Unlike other groups, ours recognized that the LA Bike Share was new and would be adding lots of stations. A sudden increase in the number of stations would greatly affect usage, so past usage alone could not predict future growth. While other groups were only using auto-regressive models, our models took into account the number of stations that would be open in the future.
- We modeled the number of bikes that would end up at each docking station at the end of the day. From there, we discussed how payment structure could be altered to incentivize users to bike from high-density stations to low-density stations.
- We scraped the LA Bike Share’s website for useful suggestions on where to put new stations. From this, we created a density map of suggestions and a word cloud of what the users wrote about. This revealed locations and types of locations that were popular.
- We created an algorithm that would predict the success of a new station, depending on where it was located. This was built on socio-economic data, the success of nearby stations, walkability, and distance to the LA Metro.
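To make the second item above more concrete, here is a minimal sketch of the bookkeeping behind an end-of-day station balance. It is not our actual model, only an illustration of the idea: the function name `end_of_day_net_flow`, the station labels, and the sample trips are all made up for this example, and it assumes each trip can be reduced to a (start station, end station) pair.

```python
from collections import Counter


def end_of_day_net_flow(trips):
    """Return {station: arrivals - departures} for one day's trips.

    `trips` is an iterable of (start_station, end_station) pairs.
    This is a hypothetical helper for illustration only.
    """
    net = Counter()
    for start, end in trips:
        net[start] -= 1  # a bike left the start station
        net[end] += 1    # a bike arrived at the end station
    return dict(net)


# Made-up trips for illustration; this is not the competition data.
sample_trips = [
    ("A", "B"),
    ("A", "B"),
    ("B", "C"),
    ("C", "A"),
    ("A", "C"),
]

net_flow = end_of_day_net_flow(sample_trips)

# Stations that gain bikes over the day vs. stations that lose them.
surplus = sorted(s for s, n in net_flow.items() if n > 0)
deficit = sorted(s for s, n in net_flow.items() if n < 0)

print(net_flow)
print("surplus:", surplus, "deficit:", deficit)
```

Stations with a positive balance pile up bikes and stations with a negative balance run short, so a pricing incentive of the kind we discussed would target trips that start at the former and end at the latter.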
I think data scientists frequently forget the point of data science, which is to create value from data, and instead get too focused on the process. We not only offered many models, but also talked about how they affect the underlying business and what actions can or should be taken because of them. This is what someone who hires a data scientist really wants to hear about.
You’ve probably seen some version of the data science Venn diagram, with its three circles of coding, statistics, and domain knowledge. By no means am I saying that I am a unicorn with incredible abilities, but I do think that collectively, our team sat at the intersection of all three circles. By focusing on machine learning without a use case, many groups found themselves at the coding-stats intersection. For all I know, the other teams had much more sophisticated models than we did, but when it came to finding use cases and delivering results that would work in practice, we shone. I assume that many groups also fit well into the coding-domain or stats-domain intersection, but they likely struggled to put a report together without knowledge of the tools in the remaining circle. When you do data science work, make sure you have a component of each of these circles to maximize the influence and adoption of your work.
Finally, I’ll say that our project was much more fun to look at. It doesn’t matter whether someone is in primary school or has a Ph.D.: when they have an interactive map in front of them, they play with it. Not everyone is going to understand the math behind your model. Even if they do understand (and care), they usually don’t have the time to really dig into the “why” of the model. Everyone, though, does care about the business use case. Producing a compelling, interactive visualization can make a world of difference in getting someone to adopt your data science work. I’ve said this many times: “In data science, you are sometimes judged not by the sophistication of your work, but by how much the person you were presenting to understood.”