Skip the Data Preprocessing! Accessing 12 Ready-to-Go Datasets

CIFAR, IMDB, Reuters, MNIST, & More

Andre Ye
Analytics Vidhya
5 min read · Mar 4, 2020


It’s handy when datasets can be loaded directly from a library without having to download and prepare them first. Datasets taken straight from the source often need to be converted, cleaned, and preprocessed. For NLP datasets, the words also need to be converted to numbers, which can take a significant amount of time when the dataset is large. In this article, I’ll outline how to load 12 datasets with Keras and Scikit-Learn that are already preprocessed and ready to be analyzed or fed into a machine learning model.

Note: make sure the internet is turned ‘on’ if you are working in an environment like Kaggle. Most of these loaders download their data the first time they are called (only the small Scikit-Learn datasets loaded with load_* functions ship with the library), so without a connection they will throw an error.

CIFAR10 & CIFAR100

The CIFAR-10 dataset, named after the Canadian Institute For Advanced Research, contains 60,000 32 by 32 color images in 10 classes: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. There are 6,000 images of each class. The CIFAR-100 dataset has the same format but 100 classes, with 600 images each.

The CIFAR-10 and CIFAR-100 datasets are routinely used for evaluating deep learning methods for image recognition. The CIFAR website offers both for download, but the files require annoying un-pickling and data conversion. Keras makes both datasets easy to load through its datasets module.
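The original code embed did not survive, so here is a minimal sketch of the loading step, assuming the Keras bundled with TensorFlow (`tensorflow.keras.datasets`). The helper names `load_cifar` and `to_one_hot` are hypothetical and only for illustration.

```python
import numpy as np


def load_cifar(num_classes=10):
    """Return ((x_train, y_train), (x_test, y_test)) for CIFAR-10 or CIFAR-100.

    Keras is imported inside the function, so nothing is downloaded until
    the function is actually called.
    """
    if num_classes == 10:
        from tensorflow.keras.datasets import cifar10
        return cifar10.load_data()
    if num_classes == 100:
        from tensorflow.keras.datasets import cifar100
        # "fine" returns the 100 class labels; "coarse" returns 20 superclasses.
        return cifar100.load_data(label_mode="fine")
    raise ValueError("num_classes must be 10 or 100")


def to_one_hot(labels, num_classes):
    """Turn integer labels of shape (n, 1) or (n,) into an (n, num_classes) matrix."""
    labels = np.asarray(labels).ravel()
    return np.eye(num_classes, dtype="float32")[labels]
```

Calling `(x_train, y_train), (x_test, y_test) = load_cifar(10)` should give 50,000 training and 10,000 test images as arrays of shape (n, 32, 32, 3). The labels come back with shape (n, 1), which is why `to_one_hot` flattens them first.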

IMDB Movie Reviews Sentiment Dataset

The IMDB Movie Reviews Sentiment Dataset consists of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative); the Keras loader also returns a test set of the same size. Reviews have been preprocessed, and each review is encoded as a sequence of word indexes. Words are indexed by overall frequency in the dataset, so that, for instance, the integer “3” encodes the 3rd most frequent word in the data. This means there is no messy NLP preprocessing to do.

This dataset is often used to test natural language processing techniques because it is large, has a simple binary target, and stays within a single domain.

The dataset ships with the Keras library.
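As a rough sketch of how the encoded reviews can be loaded and read back, assuming `tensorflow.keras.datasets.imdb`: the helper names `load_imdb` and `decode_review` are hypothetical. Note that with the loader's default arguments the stored integers are shifted by 3, because 0, 1 and 2 are reserved for padding, start of review, and out-of-vocabulary words.

```python
def load_imdb(num_words=10000):
    """Return ((x_train, y_train), (x_test, y_test)).

    Only the `num_words` most frequent words are kept; rarer words are
    replaced by the out-of-vocabulary index.
    """
    from tensorflow.keras.datasets import imdb
    return imdb.load_data(num_words=num_words)


def decode_review(sequence, word_index, index_from=3):
    """Map a sequence of integers back to words.

    `word_index` is the word -> frequency rank dictionary returned by
    imdb.get_word_index(). Every rank is shifted by `index_from` to leave
    room for the reserved indexes, so unknown or reserved values become "?".
    """
    reverse_index = {rank + index_from: word for word, rank in word_index.items()}
    return " ".join(reverse_index.get(i, "?") for i in sequence)
```

A typical use would be `(x_train, y_train), _ = load_imdb()` followed by `decode_review(x_train[0], imdb.get_word_index())` to check that a review reads sensibly before training on it.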

Reuters News Topic Classification

The Reuters News Topic Classification Dataset consists of 11,228 newswires from Reuters, labeled over 46 topics. As with the IMDB dataset, each wire is encoded as a sequence of word indexes.
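A minimal sketch of loading the newswires, assuming `tensorflow.keras.datasets.reuters`. The names `load_reuters` and `multi_hot` are hypothetical; the multi-hot encoding is one common way to feed variable-length index sequences to a dense network, not something the dataset requires.

```python
import numpy as np


def load_reuters(num_words=10000, test_split=0.2):
    """Return ((x_train, y_train), (x_test, y_test)) of index sequences and topic ids."""
    from tensorflow.keras.datasets import reuters
    return reuters.load_data(num_words=num_words, test_split=test_split)


def multi_hot(sequences, dimension):
    """Encode each sequence of word indexes as a 0/1 vector of length `dimension`."""
    encoded = np.zeros((len(sequences), dimension), dtype="float32")
    for row, sequence in enumerate(sequences):
        # Mark every word index that occurs in this newswire.
        encoded[row, sequence] = 1.0
    return encoded
```

If `num_words=10000` is used when loading, `multi_hot(x_train, 10000)` turns the training wires into a fixed-width matrix.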

MNIST Handwritten Digits

The MNIST database consists of 60,000 training images and 10,000 test images of handwritten digits from 0 to 9, each 28 by 28 pixels.


The MNIST database is the standard benchmark for testing image recognition. The original MNIST files require some preprocessing, but Keras provides the data in a format that is easy to load.
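Here is a small sketch of the loading step, assuming `tensorflow.keras.datasets.mnist`. The helpers `load_mnist` and `flatten_and_scale` are hypothetical names; flattening and rescaling is just one typical preparation for a dense network.

```python
import numpy as np


def load_mnist():
    """Return ((x_train, y_train), (x_test, y_test)) with uint8 images of shape (n, 28, 28)."""
    from tensorflow.keras.datasets import mnist
    return mnist.load_data()


def flatten_and_scale(images):
    """Reshape (n, 28, 28) images to (n, 784) floats in the range [0, 1]."""
    images = np.asarray(images, dtype="float32")
    return images.reshape(len(images), -1) / 255.0
```

After `(x_train, y_train), (x_test, y_test) = load_mnist()`, passing `x_train` through `flatten_and_scale` gives a (60000, 784) matrix.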

Fashion MNIST

The Fashion MNIST dataset consists of 60,000 28 by 28 grayscale images of 10 fashion categories, along with a test set of 10,000 images.


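This dataset can be used as a drop-in replacement for MNIST. The class labels are the integers 0 to 9: 0 T-shirt/top, 1 Trouser, 2 Pullover, 3 Dress, 4 Coat, 5 Sandal, 6 Shirt, 7 Sneaker, 8 Bag, 9 Ankle boot.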

Keras provides the Fashion MNIST dataset in the same format as MNIST.
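A minimal sketch of loading it, assuming `tensorflow.keras.datasets.fashion_mnist`. The names `load_fashion_mnist` and `add_channel_axis` are hypothetical; adding a channel axis is only needed if the images go into convolutional layers.

```python
import numpy as np


def load_fashion_mnist():
    """Return ((x_train, y_train), (x_test, y_test)) with images of shape (n, 28, 28)."""
    from tensorflow.keras.datasets import fashion_mnist
    return fashion_mnist.load_data()


def add_channel_axis(images):
    """Reshape (n, 28, 28) grayscale images to (n, 28, 28, 1)."""
    return np.expand_dims(np.asarray(images), axis=-1)
```

Because the arrays have the same shapes as MNIST, swapping `load_fashion_mnist()` into an existing MNIST script should need no other changes.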

Boston Housing Prices Regression

The Boston Housing Prices Dataset is taken from the StatLib library, which is maintained at Carnegie Mellon University. Samples contain 13 attributes of houses at different locations around the Boston suburbs in the late 1970s, and targets are the median values of the houses at a location.

The Boston Housing dataset is considered a benchmark dataset for regression algorithms.
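As a sketch, the data can be loaded through `tensorflow.keras.datasets.boston_housing`; this assumes the Keras version of the dataset, since scikit-learn removed its own `load_boston` function in version 1.2. The helper names `load_boston_housing` and `standardize` are hypothetical.

```python
import numpy as np


def load_boston_housing(test_split=0.2, seed=113):
    """Return ((x_train, y_train), (x_test, y_test)); each row has 13 features."""
    from tensorflow.keras.datasets import boston_housing
    return boston_housing.load_data(test_split=test_split, seed=seed)


def standardize(train, test):
    """Scale features to zero mean and unit variance using training statistics only."""
    train = np.asarray(train, dtype="float64")
    test = np.asarray(test, dtype="float64")
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # The test set reuses the training mean/std so no test information leaks in.
    return (train - mean) / std, (test - mean) / std
```

The 13 attributes are on very different scales, so a step like `standardize(x_train, x_test)` is usually worth doing before fitting a regression model.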

Iris Plants Dataset

The famous Iris Plants Dataset consists of four measurements of each plant and a 3-class target giving the species of iris. It is maintained by the University of California Irvine Machine Learning Repository.
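A short sketch of loading it with scikit-learn's `load_iris`, which bundles the data with the library so no download is needed:

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target   # feature matrix and integer species labels

print(X.shape)              # (150, 4)
print(iris.feature_names)   # names of the four sepal/petal measurements
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']
```

If only the arrays are needed, `load_iris(return_X_y=True)` returns `X` and `y` directly.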

Diabetes Dataset

The Diabetes Dataset consists of ten baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements. These were obtained for each of 442 diabetes patients. The target is a quantitative measure of disease progression one year after baseline.
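A minimal sketch using scikit-learn's `load_diabetes`, followed by a plain linear regression as a baseline. The train/test split and the choice of model are assumptions for illustration only. In scikit-learn's copy of the data, the ten features are already mean-centered and scaled.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)   # X has shape (442, 10)

# Hold out 20% of the patients to evaluate the baseline.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)   # coefficient of determination on held-out data
print(f"R^2 on the test split: {r2:.2f}")
```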

Wine Recognition Dataset

The UCI Wine Recognition Dataset consists of 13 quantitative measures of wine and a 3-class target value representing the type of wine. This famous dataset is another benchmark for multi-class classification algorithms.
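Here is a sketch that loads the data with scikit-learn's `load_wine` and runs a quick multi-class baseline. The pipeline (scaling plus logistic regression with 5-fold cross-validation) is an assumption chosen for illustration, not part of the dataset.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X, y = wine.data, wine.target   # X has shape (178, 13); y holds classes 0, 1, 2

# Scale the 13 measurements, then fit a multi-class logistic regression.
classifier = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(classifier, X, y, cv=5)
print("Mean cross-validated accuracy:", scores.mean())
```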

Wisconsin Breast Cancer Diagnostic Dataset

The famous Wisconsin Breast Cancer Diagnostic Dataset consists of 30 numerical features describing the cell nuclei in an image of a breast mass sample, with a binary target diagnosis of malignant or benign. It is a common benchmark for classification on relatively high-dimensional data and for showing how PCA can assist classification.
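The PCA use case mentioned above can be sketched with scikit-learn's `load_breast_cancer`. Projecting onto two components is an arbitrary choice for illustration, and the features are standardized first because PCA is sensitive to scale.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target   # in scikit-learn, 0 = malignant and 1 = benign

# Standardize the 30 features, then project them onto 2 principal components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

explained = pca.explained_variance_ratio_.sum()
print(f"Variance kept by 2 components: {explained:.0%}")
```

The reduced `X_2d` can be plotted directly or passed to a classifier in place of the full feature matrix.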

Olivetti Faces Dataset

The Olivetti Faces Dataset, collected by AT&T Laboratories Cambridge, is a set of 400 64 by 64 pixel images of 40 different people, ten images per person. The task is to identify the person in each image. This dataset is especially helpful for evaluating image recognition algorithms when there are many classes and little training data per class.
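A minimal sketch using scikit-learn's `fetch_olivetti_faces`, which downloads the images on first use. The wrapper `load_olivetti` and the helper `images_per_person` are hypothetical names added for illustration.

```python
from collections import Counter


def load_olivetti(shuffle=True, random_state=0):
    """Return (images, targets): images of shape (400, 64, 64) and person ids 0 to 39."""
    from sklearn.datasets import fetch_olivetti_faces
    faces = fetch_olivetti_faces(shuffle=shuffle, random_state=random_state)
    return faces.images, faces.target


def images_per_person(targets):
    """Count how many images each person id has."""
    return Counter(int(t) for t in targets)
```

Running `images_per_person` on the targets returned by `load_olivetti()` is a quick way to confirm how few examples each class has before choosing a train/test split.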

I hope you enjoyed this article! If you did, feel free to check out some of my other work.
