25 Excellent Machine Learning Open Datasets

ODSC - Open Data Science
May 13

Your machine learning model is only as good as the data you train it on. Datasets determine much of the quality of your results, but you may not always have access to data locked behind paywalls, or the budget to purchase (or rent) the key.

Don’t despair. Plenty of datasets are available for training machine learning models for free. Here are our top 25 picks for open machine learning datasets. Each one offers clean data with neat columns and rows so that your training runs more smoothly. Let’s take a look.

25 Machine Learning Open Datasets To Get You Started

Each of these datasets can help answer an interesting question in your primary field. They’re already cleaned and simple enough to work with right away, without omitting so much information that they stop being useful.

Natural Language Processing

  • Amazon Reviews: A collection of over 35 million reviews from the last 18 years. It includes things like ratings, reviews in plain text, and user information. It also contains complete product information for reference.

Sentiment Analysis

  • Stanford Sentiment Treebank: A dataset containing sentiment annotations for over 10,000 pieces of data from Rotten Tomatoes reviews, rendered in HTML.


Public Government Data

  • Data USA: A comprehensive overview of various sets of US public data in fun visualizations. It includes things like population, health, and jobs.

Finance and Economics

  • World Bank Open Data: Data concerning population demographics and key indicators for development.

Facial Recognition

  • Labeled Faces In The Wild: A common dataset for facial recognition training. It includes 13,000 cropped faces, with a subset of people who appear in two different pictures within the dataset.

Image Datasets

  • ImageNet: A dataset containing over 14 million images available for download in different formats. It also includes API integration and is organized according to the WordNet hierarchy.

Health

  • Healthdata.gov: A resource from the US federal government providing data to improve health outcomes for the US population.

Media

  • FiveThirtyEight Journalism: The numbers behind some of this journalism hub’s stories. Useful for visualizations and data stories.

Transportation

  • US National Travel and Tourism Office: Provides trustworthy datasets that give a big-picture view of the tourism industry, including inbound and outbound travel and international visitor data.

Speech

  • Flickr Audio Caption Corpus: 40,000 spoken captions for 8,000 images, in a manageable size. It was initially designed for unsupervised speech pattern discovery.

Sound

  • FSD (Freesound): A collection of everyday sounds gathered from user contributions under an open source license.

Dataset Aggregators

  • OpenDataSoft: 2,600 data portals that you can browse on an interactive map or by country list. If you’re looking for it, chances are it’s here.


Getting Started With Machine Learning

This is by no means an exhaustive list of datasets. When you’re beginning your next data project, having a starting point based on the subject matter can cut down your initial setup time. These datasets offer excellent information and are freely available for you to play with. So whether you have a project for your organization or you’re experimenting with something on your own, there’s a dataset to get you started.
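Before training on any downloaded dataset, it can help to confirm that it really is as clean as advertised. The following is a minimal sketch, not tied to any specific dataset above: the sample data, its column names, and the `summarize_csv` helper are all hypothetical and exist only for illustration. It uses Python’s standard library to count rows and empty cells per column.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded CSV file; the column
# names and values are made up for illustration only.
SAMPLE_CSV = """rating,review_text,product_id
5,Great product,B001
3,,B002
4,Works as described,B003
"""


def summarize_csv(handle):
    """Return the row count, column names, and empty-cell count per column."""
    reader = csv.DictReader(handle)
    columns = list(reader.fieldnames or [])
    missing = {name: 0 for name in columns}
    rows = 0
    for row in reader:
        rows += 1
        for name in columns:
            # Short rows yield None for absent fields; treat those as empty too.
            if not (row.get(name) or "").strip():
                missing[name] += 1
    return {"rows": rows, "columns": columns, "missing": missing}


summary = summarize_csv(io.StringIO(SAMPLE_CSV))
print(summary)
```

To run the same check on a real file, pass an open file handle instead of the in-memory sample, for example one obtained with `open(path, newline="")`, where `path` points to a CSV you have downloaded.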



