20 Weird & Wonderful Datasets for Machine Learning

Findings from my hunt for amazing datasets

Oliver Cameron
2 min read · Nov 8, 2016
5GB of toy figurines!

They say great data is 95% of the problem in machine learning. We saw firsthand at Udacity that this is the case, with the amazing reception from the machine learning community when we open sourced over 250GB of driving data. But finding interesting data is really hard, and that actively holds the industry back. In trying to learn more about this problem, I searched far and wide and cataloged just a sliver of the datasets I found.

In the hope that others might find this catalog useful, here are 20 weird and wonderful datasets you could (perhaps) use in machine learning.

Caveat: I haven’t validated that all of these datasets are actually useful for machine learning (in terms of size or accuracy). Use your own judgement when playing with them (and check licenses)!

My favorite? The 80,000+ UFO reports dataset.

I’ve also been fascinated by the Militarized Interstate Disputes dataset, which covers 200 years of international threats and conflicts. It includes the action taken, level of hostility, fatalities, and outcomes.

If you have any thoughts, questions, or datasets you’d like to share, I’d love to hear from you in Tweet-form. You can follow and message me at @olivercameron.

Want to learn more about machine learning? Sign up to my weekly newsletter on deep learning and self-driving cars: Transmission!

Oliver Cameron

Obsessed with AI. Built self-driving cars at Cruise and Voyage. Board member at Skyways. Y Combinator alum. Angel investor in 50+ AI startups.