This article is about in-person data science competitions that are judged holistically rather than graded on an accuracy metric (as on Kaggle). I have participated in two data science competitions (placing 4th and 1st), and I mentored at Texas A&M’s first-ever datathon (3 of the 4 teams I worked with won prizes). Consequently, I have seen many projects, and I have noticed some trends that separate winners from losers. Here are my suggestions for improving your chances of winning.
Currently, I see a lot of data science projects that are just Jupyter notebooks, frequently filled with code cells and no markdown in between. It’s worth investing time in writing quality notes about your project. It should take less than 5 minutes to learn the basics of markdown (here’s a link to prove it). Whether your judge views your submission online or in person, they don’t have the time or patience to figure out what your code does. If a judge can’t understand your code, they have to guess whether it’s good or bad, and when forced to guess, they will default to assuming it’s bad. Essentially, you get graded only on what the judge understands, so your explanations are everything, especially if the judge is not data-science-technical.

This includes reporting accuracy metrics. If the competition has clearly defined metrics for comparing models, report them; if not, use mean absolute percent error, percent accuracy, or something else that is easily interpretable.

Even if you are following these steps, you are likely not taking full advantage of Jupyter’s features. Not many people know that you can turn a notebook into a slide-show-style presentation, where each cell is a slide. It helps immensely if you want to present your work, whether online or in person.
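As a sketch of what I mean by an easily interpretable metric, here is a minimal mean absolute percent error (MAPE) function. The data below is made up purely for illustration:

```python
def mape(actual, predicted):
    """Mean absolute percent error: average of |actual - predicted| / |actual|, as a percent."""
    return 100 * sum(
        abs(a - p) / abs(a) for a, p in zip(actual, predicted)
    ) / len(actual)

# Hypothetical ground truth vs. model predictions
actual = [100, 200, 400]
predicted = [110, 180, 400]

print(f"MAPE: {mape(actual, predicted):.1f}%")
```

A line like “our model is off by about 7% on average” lands with any judge, whereas a raw squared-error number usually does not.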
Besides making good comments in your notebooks, if you’re submitting your work via Devpost, fill out the project description there as well. Even some in-person data science competitions are graded only on online submissions. Devpost can also become a portfolio for your work — a large and quality portfolio is a great asset to have.
Putting your notebook on GitHub or nbviewer is also a great addition to your resume. Besides giving you a clickable link that a recruiter can view (they love that), having a GitHub full of projects at the top of your resume makes you look very professional. In my opinion, a large GitHub can more than make up for a short resume. If you have already put a few notebooks on GitHub, I’d recommend checking out R Markdown and learning how to publish something on RPubs. Again, this gives you the credibility of a viewable project. While I think Python is the industry standard for data science, R has its uses, and diversifying your skill set is always a great move.
My last note about notebooks: try to learn something about plotting. It really adds something that other groups don’t have. Start with Matplotlib or ggplot, but if you want to go beyond that and work with Plotly, Seaborn, Highcharter, Bokeh, or something else, that’s great too. If you’re working with Matplotlib, I’d recommend adding a theme to the plots.
I lost (got 4th place in) my first-ever data science competition; when I watched the 1st-place team present, it was clear to me I was going to lose. My team only had some fairly basic plots, which got us into the presentation stage of the competition. The winning group had a gif of a map of Chicago showing the actual paths of the taxi trips we were modeling. This had nothing to do with machine learning and everything to do with visualization. Truly, a data science competition is more about communication than anything else. To this day, their map is the thing I remember best about their (or anyone else’s) presentation. If you want to stick with notebooks, better visualizations can still be made with Matplotlib, and I would recommend checking out this article as a start.
So if you have implemented all the steps above and want to go beyond a basic notebook, I’d recommend starting with Shiny. Fairly quickly, you can make a working front end for your project. A fully interactive project is a great thing to present because it has a tangible feel to it. It doesn’t matter whether the person you are presenting to is a middle schooler or someone with a Ph.D.: if you show them an interactive visualization, they will play with it. One team I mentored at a datathon made a Shiny app, didn’t use any machine learning in their project (though they did use plenty of data science skills), and ended up winning Goldman Sachs’s prize. Shiny is pretty easy to learn, but if you’d prefer to stick with Python, Flask and Django are nice tools for making something similar. If someone can interact with your work, they’re more likely to be interested, to remember it, and to consider it a cut above the competition.