Famous Machine Learning Datasets You Need to Know

Published in Data Science Bootcamp

7 min read · Feb 18, 2019

Getting started with Machine Learning and Deep Learning as a beginner? Here are the datasets, and the details about them, that you need to know so you don’t sound like a noob. As usual, our tutorial is beginner friendly. Please give us a like / clap!

What is the optimal dataset size?

A small data could have two-thousand images. The dividing line between a large data set and small data set is somewhat subjective. Overfitting is a concern when using transfer learning with a small data set. — Udacity Deep Learning

Another rule of thumb people use, admittedly an arbitrary one, is that 10,000 records is a reasonable minimum dataset size.

What should you know about the MNIST dataset?

MNIST TL;DR:
- A famous introductory computer vision dataset, used as a beginner, starter, or benchmark dataset
- When the data is stored as a table, each row is an image
- The label column ranges from 0 to 9, so there are ten classes
- Each image is 28x28 pixels
- The dimensions are sometimes written as 1x28x28 or 28x28x1; the extra dimension is the number of color channels. One means grayscale; three means the usual RGB color channels.
- Each pixel value ranges from 0 to 255
- It’s common to flatten the image into a vector of length 784 (28 * 28)
- In grayscale, a black pixel is encoded as 0 and a white pixel as 255
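
To make the shapes in the list above concrete, here is a minimal NumPy sketch. It uses a random array as a stand-in for one MNIST-style image (an assumption for illustration, not real MNIST data), and the variable names are our own.

```python
import numpy as np

# Stand-in for one MNIST-style image: 28x28 grayscale, integer pixels in 0..255.
# Random values are used only to illustrate shapes; this is not real MNIST data.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)

# Flatten the 2-D image into a vector of length 28 * 28 = 784.
flat = image.reshape(-1)

# Add an explicit channel dimension of size 1 (grayscale).
channels_first = image[np.newaxis, :, :]  # shape (1, 28, 28)
channels_last = image[:, :, np.newaxis]   # shape (28, 28, 1)

print(flat.shape, channels_first.shape, channels_last.shape)
```

Whether a library expects the channel dimension first or last varies, so it is worth checking the shape before feeding data to a model.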

The TensorFlow tutorial illustrates these key dimensions and related details.

— — Work in progress — — — —

MNIST

Pronounced EM-NIST, MNIST is perhaps the most well-known, beginner-friendly “Hello World” dataset in machine learning. It is one of the most frequently used basic benchmark datasets in machine learning and deep learning. It is literally EVERYWHERE! Think of it as the to-do list or hello-world app you build when learning the ropes of a new programming language: MNIST plays that role in classification and computer vision tutorials. MNIST was originally useful for recognizing handwritten digits, for example in US zip codes.

There are 70,000 images. Each MNIST image is a handwritten digit between 0 and 9 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), a total of 10 digits (ten options, ten classes), written by public employees (Census Bureau workers) and high school students. Each image comes with one of ten labels, zero through nine. Each image is 28 pixels wide and 28 pixels high, and it flattens to a vector of length 28 * 28 = 784. You will see these numbers a lot in ML tutorials.

The MNIST dataset is considered “solved”. Most modern ML and DL architectures can easily achieve above 95% accuracy with minimal training. In the past there were some challenges: in handwriting, 1 looks like 7, 3 looks like 8, and 4 looks like 9.

One potential shortfall of MNIST is that the data is relatively clean: digits are preprocessed to be nicely centered. Because many modern models achieve very good results on it, problems with a model, such as overfitting, won’t be obvious on this dataset.

Fashion MNIST TL;DR:
- A famous introductory computer vision dataset, used as a beginner, starter, or benchmark dataset. More intermediate than MNIST, and non-trivial.

Fashion MNIST

A more sophisticated alternative to the MNIST handwritten digits: an image dataset consisting of images of clothing such as pants, t-shirts, etc. See all class labels below.

Contains 70,000 grayscale images in 10 categories, 28x28 pixels each (just like MNIST). In other words, Fashion MNIST is a great Hello World dataset for trying out Convolutional Neural Networks (CNNs).

The original data is 28x28 pixel grayscale images; in the CSV file, each image has been flattened into 784 distinct columns. The file also contains a column holding the class index, 0 through 9, of the fashion item.
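
As a rough illustration of going from one CSV row back to an image, here is a small sketch. The row is synthetic, and the column order (label first, then 784 pixel values) is an assumption; check the header of the file you actually download.

```python
import numpy as np

# Hypothetical CSV row: the label followed by 784 pixel values.
# The column order is an assumption; the pixel values here are synthetic.
row = np.concatenate(([9], np.arange(784) % 256))

# Split the row into its label and its pixel vector.
label = int(row[0])
pixels = row[1:]

# Undo the flattening: 784 values become a 28x28 image, row by row.
image = pixels.reshape(28, 28)

print(label, image.shape)
```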

Fashion MNIST was introduced in this paper.

What are the ten classes (labels) in Fashion MNIST? Each training and test example is assigned to one of the following labels:

0 T-shirt/top
1 Trouser
2 Pullover
3 Dress
4 Coat
5 Sandal
6 Shirt
7 Sneaker
8 Bag
9 Ankle boot

In TensorFlow you can use this list to look up the class names.

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat','Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
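
Here is a minimal sketch of how such a list is typically used: an integer label is simply an index into it. The example labels below are made up for illustration.

```python
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

# Hypothetical integer labels, as a model or a label column might provide them.
labels = [9, 0, 3]

# Each label is a position in class_names.
names = [class_names[i] for i in labels]

print(names)  # ['Ankle boot', 'T-shirt/top', 'Dress']
```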

The pixel values range from 0 (black) to 255 (brightest). So we can divide each value by 255 to normalize the values to between 0 and 1, which makes them easier to compute with. Another way to normalize the pixels is to the range -1 to 1:

train_images = (train_images - 127.5) / 127.5
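
Both normalizations can be sketched with NumPy on a tiny array. The three pixel values are chosen by us to show the end points, and the explicit cast to float32 is our own choice so that the arithmetic is clearly done in floating point.

```python
import numpy as np

# Three example pixel values: darkest, mid-range, brightest.
pixels = np.array([0, 127, 255], dtype=np.uint8)

# Cast to float first, then scale to the range [0, 1].
scaled_01 = pixels.astype(np.float32) / 255.0

# Alternative: shift and scale to the range [-1, 1].
scaled_pm1 = (pixels.astype(np.float32) - 127.5) / 127.5

print(scaled_01.min(), scaled_01.max())    # 0.0 1.0
print(scaled_pm1.min(), scaled_pm1.max())  # -1.0 1.0
```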

Fashion-MNIST samples by Zalando

The generous volume of 70,000 images allows us to use 60,000 images for training and 10,000 for evaluation.

CIFAR

CIFAR is pronounced C-FAR (see-far). In CIFAR10 and CIFAR100, the number suffix refers to the number of classes/categories.

“Dataset of 50,000 32x32 color training images, labeled over 10 categories, and 10,000 test images.” “Dataset of 50,000 32x32 color training images, labeled over 100 categories, and 10,000 test images.” — Keras Dataset
“For this tutorial, we will use the CIFAR10 dataset. It has the classes: ‘airplane’, ‘automobile’, ‘bird’, ‘cat’, ‘deer’, ‘dog’, ‘frog’, ‘horse’, ‘ship’, ‘truck’. The images in CIFAR-10 are of size 3x32x32, i.e. 3-channel color images of 32x32 pixels in size.” — Pytorch tutorial

Each CIFAR image is 32x32 pixels. There are 50,000 training images and 10,000 test images. Each CIFAR10 image maps to one of ten labeled categories. Each CIFAR100 image maps to one of 100 categories. ImageNet, in comparison, maps to 1000 categories.

$ mkdir keras
$ cd keras
$ conda create -n keras_env
$ conda activate keras_env
$ conda install keras
Proceed ([y]/n)? y
$ python3
>>> from keras.datasets import cifar10
Using TensorFlow backend.
>>> (x_train, y_train), (x_test, y_test) = cifar10.load_data()
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
>>> import numpy as np
>>> np.unique(y_train)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=uint8)

The above code snippet uses Anaconda for data science package management. Read our Anaconda cheatsheet article. $ precedes a command-line command; >>> precedes a command in the Python interactive shell.

y_train, y_test: arrays of integer labels (0 to 9)

Listed in label order, 0 through 9, the 10 classes are: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck.
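
The sketch below shows two things you will probably do right after loading CIFAR10: turning integer labels into class names, and switching between the 32x32x3 layout and the 3x32x32 layout the PyTorch tutorial describes. The arrays are small zero-filled stand-ins (not real CIFAR data), and the assumption that labels arrive with shape (N, 1) and images with shape (N, 32, 32, 3) should be checked against your own loaded data.

```python
import numpy as np

# Class names in label order, 0 through 9.
cifar10_classes = ['airplane', 'automobile', 'bird', 'cat', 'deer',
                   'dog', 'frog', 'horse', 'ship', 'truck']

# Hypothetical stand-ins shaped like a tiny batch of loaded data.
x = np.zeros((4, 32, 32, 3), dtype=np.uint8)        # assumed channels-last layout
y = np.array([[6], [9], [9], [4]], dtype=np.uint8)  # assumed label shape (N, 1)

# Flatten the (N, 1) label array and look up each class name.
names = [cifar10_classes[int(i)] for i in y.ravel()]

# Move the channel axis to the front: (N, 32, 32, 3) -> (N, 3, 32, 32).
x_chw = np.transpose(x, (0, 3, 1, 2))

print(names, x_chw.shape)
```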

Titanic Survivor Dataset

The perfect beginner-friendly playground dataset for getting started with competing on Kaggle. It is a binary classification task: predict 1 or 0 for whether a passenger survived or not.

Street View House Number (SVHN) Dataset

Contains house number images from Google Street View. SVHN is a more sophisticated, non-trivial, color alternative to MNIST.

Boston Housing Dataset

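Predict the median housing price in Boston; great for regression models.

Load the Boston Housing Dataset in Sklearn (note that load_boston was deprecated and then removed in scikit-learn 1.2, so this requires an older version):

from sklearn.datasets import load_boston
boston = load_boston()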

IMDB Movie Review Dataset

For sentiment analysis (positive versus negative reviews) and natural language processing. Link to source.

IMDB movie dataset for classification and NLP (Dataset source here)
50,000 reviews, each labeled by sentiment as positive or negative. Each data sample is a pair (review, label).

“Large Movie Review Dataset
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more details.”

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

Paper

Great for training sentiment analysis models.

ImageNet

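One of the largest image datasets, and the basis of a competition in which the top researchers compete for state-of-the-art results. A commonly reported metric is the top-5 error rate: the fraction of images whose true label is not among the model’s five highest-scoring predictions.

The full ImageNet contains more than 20,000 classes, but the most frequently referenced ImageNet competition uses 1000 classes.

Which 1000 classes? See the number and label of each of the 1000 ImageNet classes here.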
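
To show what a top-5 error rate measures, here is a minimal NumPy sketch. The function name top_k_error and the toy scores are our own, for illustration only; a prediction counts as correct when the true label is among the k highest-scoring classes.

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    """Fraction of samples whose true label is NOT among the k highest scores."""
    # Indices of the k largest scores per row (order inside the top k is irrelevant).
    top_k = np.argpartition(scores, -k, axis=1)[:, -k:]
    # A sample is a hit if its true label appears anywhere in its top-k indices.
    hits = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hits.mean()

# Toy example: 3 samples, 10 classes. Each row scores class c with value c,
# so the five highest-scoring classes are always 5, 6, 7, 8, 9.
scores = np.tile(np.arange(10, dtype=float), (3, 1))
labels = np.array([9, 5, 0])

print(top_k_error(scores, labels, k=5))  # one of three labels is outside the top 5
print(top_k_error(scores, labels, k=1))  # only the first label is the top-1 class
```

Top-5 error is always less than or equal to top-1 error, which is one reason it is popular on datasets with many similar-looking classes.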

COCO Dataset

You can search and view the COCO dataset here. http://cocodataset.org

“COCO is a large-scale object detection, segmentation, and captioning dataset. COCO has several features:
- Object segmentation
- Recognition in context
- Superpixel stuff segmentation
- 330K images (>200K labeled)
- 1.5 million object instances
- 80 object categories
- 91 stuff categories
- 5 captions per image
- 250,000 people with keypoints”

You can also find a research paper that describes the COCO dataset.

More than 200K of the images are labeled.

Wikipedia word embedding for NLP

Available in many formats, for example for spaCy, PyTorch, and TensorFlow. But Wikipedia-trained embeddings lack first-person storytelling examples. For example, Wikipedia’s wording and style are quite different from Tweets, which are filled with first-person conversations, hashtags, and current events, and which are often biased. Using Wikipedia-trained embeddings to make predictions on tweets may result in inaccuracy.
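
One rough way to see this mismatch is to measure how many tokens of a text have no entry in the embedding vocabulary (the out-of-vocabulary rate). The sketch below uses a tiny made-up vocabulary and a made-up tweet; oov_rate is a hypothetical helper, not part of any library.

```python
def oov_rate(tokens, vocabulary):
    """Share of tokens that have no entry in the embedding vocabulary."""
    if not tokens:
        return 0.0
    missing = [t for t in tokens if t.lower() not in vocabulary]
    return len(missing) / len(tokens)

# Hypothetical vocabulary standing in for a Wikipedia-trained embedding.
wiki_vocab = {"the", "city", "is", "located", "in", "i", "my", "new"}

# Hypothetical tweet with slang and a hashtag, split on whitespace.
tweet = "omg i luv my new city #blessed".split()

# "omg", "luv" and "#blessed" are missing: 3 of 7 tokens.
print(oov_rate(tweet, wiki_vocab))
```

A high rate on your target text is a hint that the embedding was trained on a rather different style of language.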

University of Oxford Flower 102

As the name implies, the dataset is available via the University of Oxford, and there are 102 class labels. It is a color image dataset, with the images organized in folders. Great as a Convolutional Neural Network starter, and excellent for transfer learning with VGG.

Additional Resources

See how models perform on the MNIST dataset
Keras Datasets
UC Irvine Machine Learning Dataset Collection.

Source

http://cocodataset.org