20 Weird & Wonderful Datasets for Machine Learning

Findings from my hunt for amazing datasets

5GB of toy figurines!

They say great data is 95% of the problem in machine learning. We saw firsthand at Udacity that this is the case: the machine learning community gave an amazing reception when we open-sourced over 250GB of driving data. But finding interesting data is really hard, and that actively holds the industry back. In trying to learn more about this problem, I searched far and wide and cataloged just a sliver of the datasets I found.

In the hope that others might find this catalog useful, here are 20 weird and wonderful datasets you could (perhaps) use in machine learning.

Caveat: I haven’t validated that all of these datasets are actually useful for machine learning (in terms of size or accuracy). Use your own judgement when playing with them (and check licenses)!
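If you want a quick first pass at that judgement call, here's a minimal sketch in Python (standard library only) that counts the rows in a CSV and reports what share of each column is empty. The function name `summarize_csv` and the sample rows are my own made-up illustrations, not part of any dataset in this catalog, and a real dataset will likely need more checks than this.

```python
import csv
import io
from collections import Counter


def summarize_csv(text_stream):
    """Return the row count and per-column share of empty cells for a CSV stream."""
    reader = csv.DictReader(text_stream)
    rows = 0
    empty = Counter()
    for row in reader:
        rows += 1
        for column in reader.fieldnames:
            value = row.get(column)
            # Short rows are padded with None by DictReader; treat those as empty too.
            if value is None or value.strip() == "":
                empty[column] += 1
    columns = list(reader.fieldnames or [])
    missing_share = {c: (empty[c] / rows if rows else 0.0) for c in columns}
    return {"rows": rows, "columns": columns, "missing_share": missing_share}


# Made-up sample standing in for a real download (hypothetical values).
sample = io.StringIO(
    "date,city,shape\n"
    "2016-01-01,Springfield,disk\n"
    "2016-01-02,,light\n"
    "2016-01-03,Shelbyville,\n"
    "2016-01-04,Ogdenville,triangle\n"
)
summary = summarize_csv(sample)
print(summary["rows"], summary["missing_share"])
```

For a real file, you could pass `open(path, newline="", encoding="utf-8")` instead of the in-memory sample. A tiny row count or a column that is mostly empty is a decent early signal that a dataset may not be ready for training.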

My favorite? The 80,000+ UFO reports dataset.

I’ve also been fascinated by the Militarized Interstate Disputes dataset, which covers 200 years of international threats and conflicts. It records the action taken, level of hostility, fatalities, and outcomes.

If you have any thoughts, questions, or datasets you’d like to share, I’d love to hear from you in Tweet-form. You can follow and message me at @olivercameron.

Co-Founder & CEO at Voyage. We’re delivering on the promise of self-driving cars. Y Combinator alum.
