I was speaking with some students from Berkeley about my advice in my last post to build a portfolio of data projects. They asked for more details on how to do so, so here’s some further advice to keep in mind when you’re working on making a data portfolio project.
Use Real Data
Try to do something with real data rather than Kaggle or other pre-cleaned data. Data cleaning, prep, and transformation are a real part of any data job. Plus, it’s hard to get noticed if you’re using data everyone else has.
Scrape Your Own Data
Web scraping is surprisingly easy, and a great way to get interesting data. I got a lot of mileage out of the readHTMLTable function in the R package XML. You can simply point it at a Wikipedia page or a sports statistics page, anything with an HTML table on it, and boom, you get the data. Similarly, BeautifulSoup and Scrapy in Python are absurdly easy to use. If you can see it on the web, you can get it!
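As a rough sketch of what table scraping involves, the snippet below pulls the rows out of an HTML table using only Python's standard library. The sample HTML string and the TableExtractor class are made up for illustration. In practice you would download the page first (for example with urllib.request.urlopen), and BeautifulSoup would likely make this shorter.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built
        self._cell = None   # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample page standing in for a real statistics table.
SAMPLE_HTML = """
<table>
  <tr><th>Player</th><th>Home runs</th></tr>
  <tr><td>Player A</td><td>31</td></tr>
  <tr><td>Player B</td><td>27</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(SAMPLE_HTML)
header, *records = parser.rows
print(header)
print(records)
```

Everything comes back as strings, so the numeric columns still need converting, which is part of the data cleaning mentioned earlier.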
Ask for Data
When you’re a student, people will sometimes give you data that would be harder to get at other times. I asked for data from external sources a bunch of times when I was at Dartmouth, and I usually got more than I expected. You won’t get it unless you ask!
Use Publicly Accessible APIs
The other great way to get data is by using APIs. This seems like an intimidating method until you do it once, then it’s simple. Look for tutorials to copy and paste from.
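To show how little is involved, here is a minimal sketch of the usual pattern: build a URL with query parameters, download the JSON, and flatten it into rows. The endpoint (api.example.com), the parameter names, and the "results" response shape are all hypothetical; a real API documents its own. The fetch_json function is defined but not called, since it needs network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_url(base_url, params):
    """Attach query parameters to an endpoint URL."""
    return f"{base_url}?{urlencode(params)}"


def fetch_json(url, timeout=10):
    """Download a URL and decode the JSON body (needs network access)."""
    request = Request(url, headers={"User-Agent": "portfolio-project-sketch"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_rows(payload, fields):
    """Flatten a list of JSON records into rows with only the chosen fields."""
    return [[item.get(field) for field in fields] for item in payload["results"]]


# Hypothetical endpoint and response shape, used only for illustration.
url = build_url("https://api.example.com/v1/posts", {"topic": "baseball", "limit": 2})
sample_payload = json.loads(
    '{"results": [{"title": "First", "score": 10, "author": "a"},'
    ' {"title": "Second", "score": 7, "author": "b"}]}'
)
rows = extract_rows(sample_payload, ["title", "score"])
print(url)
print(rows)
```

With a real endpoint you would replace sample_payload with fetch_json(url), and most of the remaining work is reading the API's documentation for rate limits and authentication.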
Pick Interesting Data Over All Else
I find that the best portfolio projects are less about doing fancy modeling and more about working with interesting data. A lot of people do things with financial information or Twitter data; those can work, but the data isn’t inherently that interesting, so you’re working uphill. I remember a class project in MIDS where a team used rap song lyrics. It was awesome source material, and they didn’t have to do anything crazy with it to make the project fascinating. Your job is just to let the data shine, rather than to add value yourself by complicating things too much.
Pick Something You’re Curious About, Not Something You Hope Will Be Impressive
A way to make sure your data is interesting is to pick something you can get visibly excited talking about. If it’s something that interests you, it’ll be more fun and you’re more likely to find interesting angles with it. You also want the topic to be accessible to other people, and a good first check is whether it’s accessible to you. Plus, when you’re trying to write a regex to strip out commas from some stubborn column, you’ll be less frustrated if you’re actually looking forward to seeing the answers hiding in your data.
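For what it's worth, the comma-stripping chore usually looks something like the sketch below. The clean_number helper and the sample values are made up for illustration, and the pattern assumes the column holds plain numbers with optional thousands separators or currency signs.

```python
import re


def clean_number(raw):
    """Turn a string like '$1,234,567' into a float, or None if blank."""
    # Drop everything except digits, the decimal point, and the minus sign.
    stripped = re.sub(r"[^0-9.\-]", "", raw)
    return float(stripped) if stripped else None


# Made-up values standing in for a messy scraped column.
column = ["1,234", "$98,765.40", " 12 ", ""]
cleaned = [clean_number(value) for value in column]
print(cleaned)
```

Columns with footnote markers, ranges, or European-style decimals would need a more careful pattern.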
Pick an Analysis That is Interesting Regardless of What You Find
Related to having interesting data, try to pick something that will be interesting no matter what the answer is. I did a project in college looking for the impact of fraud on neighboring nonprofits. It was only going to be interesting if there was an impact, but it turned out that there wasn’t.
Perfect the Visuals
Visuals matter a lot in portfolio projects (and the rest of life). It’s worth spending the time making the visuals beautiful, whether it’s with custom themes or simple interactivity. Don’t be afraid of simple line or bar charts — they’re easiest to understand, and people have to understand and be interested in your work before they’ll be impressed. If you want to do something fancier, build to it.
Keep the Text Short
Keep the text short. If you want to go into detail, do it in an appendix. People have short attention spans. On the same note: prepare a tl;dr version of your project, explaining it in one or two sentences.
Put the Code on GitHub
Put the code up on GitHub. Comment it and organize it well. Try to make the whole exercise, from downloading data to the visualizations and text, reproducible. R Markdown and IPython (now Jupyter) notebooks are your friends here.
Productionize Your Analysis If Possible
You get a lot of bonus credit for productionizing any model or data product. People thought our baseball analysis was cool, but they were really impressed by our Twitterbot that made predictions in real time. The Twitterbot took no time at all to write! Similarly, Slackbots, Facebook chat bots, and Reddit bots are stupid easy to write and are very impressive.
Make Your Data Interactive
Related to the above point, showing people data is cool, but letting them interact with it in almost any way gets you major bonus points. If you can turn your analysis into an interactive visualization, a quiz, a tool, or a customizable ranking, people will love it. Shiny in R or Flask in Python are super helpful here, and both surprisingly easy to use.
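The logic behind something like a customizable ranking can be very small. The sketch below re-sorts items by whatever weights the user picks; the rank_items function, the stat names, and the teams are all made up for illustration. A Shiny or Flask front end would mostly just collect the weights and display the result.

```python
def rank_items(items, weights):
    """Sort items by a weighted sum of their stats, highest score first."""

    def score(item):
        # Stats without a user-supplied weight contribute nothing.
        return sum(weights.get(stat, 0) * value for stat, value in item["stats"].items())

    return sorted(items, key=score, reverse=True)


# Made-up data standing in for the output of a real analysis.
teams = [
    {"name": "Team A", "stats": {"offense": 8, "defense": 3}},
    {"name": "Team B", "stats": {"offense": 4, "defense": 9}},
]

# Two users with different priorities get different rankings.
offense_first = [t["name"] for t in rank_items(teams, {"offense": 2, "defense": 1})]
defense_first = [t["name"] for t in rank_items(teams, {"offense": 1, "defense": 2})]
print(offense_first)
print(defense_first)
```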
The project isn’t done when you post it publicly. Actively solicit feedback on it from friends who are and aren’t data people. Specifically ask what they found interesting and where they got lost. Don’t be afraid to keep adding on to or editing your projects after they’re published!
Notes on Specific Data Sources
A few great sources for data:
- Reddit (great API, interesting content, lots of data available)
- Tumblr (great API)
- News sites (especially ones that publish their views, a potential dependent variable!)
- Nonprofits (if you are willing to call and help them)
- City governments
- University websites (Berkeley has lots of data, of course! Or maybe scrape the course guide?)
Difficult sources of data, mostly because of restrictive APIs / anti-scraping policies:
Here are four links that I would use as inspiration for great data portfolio projects:
A Specific Project Idea
Finally, I’ll pitch a project I always wanted to do but never managed to: divorce rates among actors and actresses on Wikipedia. The biography pages have marital statuses and years for a lot of people (example). How often do major actors and actresses get divorced? How does that compare to politicians or musicians? You can cut this a thousand interesting ways: gender, film genres, age, having major award nominations or not, etc. When do they get married? How long do those marriages last? How big are the age gaps? How many result in children? Has it changed over time? I’ve found that people guess the divorce rate is 50% or higher, but I bet the real rate is much, much lower than that, so it’s a project that’s likely to deliver a surprising answer, which is always good for generating interest! You could even make a simple model to predict divorces among actors and actresses, and allow people to search for their favorite actor or actress to see what the model thinks of their chances.