A Few Sources for Data Sets and How to Choose One to Practice with

Diane Kierce
2 min read · Apr 24, 2017

As a newcomer to data science, I’ve wondered where to get a good data set to use to practice the tools I’m learning. Some recommendations I’ve heard include

  • government sources (Data.Seattle.gov has a lot, especially if you’re interested in crime data)
  • research universities (see, for example, the database on wrongful convictions that is now hosted at the University of California, Irvine)
  • so-called “fact tanks” like the Pew Research Center, which collects, analyzes, and shares data on a variety of social and political topics
  • fivethirtyeight.com, Nate Silver’s site
  • The Upshot, The New York Times’s site for all things data-related

This barely scratches the surface. Many organizations collect and share data, so there are a lot of data sets out there. Some are even fairly well organized and nicely cleaned. (Some, but certainly not all or even most!) You can also collect your own data, but I’ll save that topic for another post.

Since there is no shortage of data available for skills practice, what is a good way to choose a data set for my next project? Personally, I think it makes sense to work within a subject area I know at least a little bit about so that I can focus on the data science aspects of the analysis rather than spending more of my time and energy on figuring out the subject matter. I also want to explore a topic that I find interesting and have lots of questions about so that my curiosity can motivate me to extend my skills as I try to answer my questions. For me, education, the entertainment industry, genetics, health care reform in the United States, and criminal justice reform are among the many topics I’d like to explore. So, it makes sense for me to look for data sets in these areas.

Narrowing it down to one topic from there is difficult, but it’s important to start somewhere. That’s my next challenge.
