Demystifying 7 Public Datasets in Machine Learning

Dagang Wei
6 min read · Jan 28, 2024


Image generated by the author with DALL-E

This article is part of the series Demystifying Machine Learning.

Introduction

In the dynamic world of machine learning, datasets serve as the building blocks for developing and testing algorithms. They span a variety of domains and task types, from image recognition and natural language processing (NLP) to regression and classification, providing diverse challenges and opportunities for researchers and practitioners. Here, we explore 7 well-known public datasets, each covering a different kind of task, along with usage examples in Python.

Datasets

1. Iris — Multiclass Classification

The Iris dataset is a classic and widely used dataset in statistics and machine learning. It was introduced by the British statistician and biologist Ronald Fisher in 1936. The dataset is often used for teaching purposes in the fields of data analysis and machine learning, particularly for classification tasks and pattern recognition.

Key features of the Iris dataset include:

  • Sepal Length: the length of the sepal (in cm).
  • Sepal Width: the width of the sepal (in cm).
  • Petal Length: the length of the petal (in cm).
  • Petal Width: the width of the petal (in cm).

These measurements are taken for 150 iris flowers from three different species of iris: Setosa, Versicolor, and Virginica. Each species has 50 samples in the dataset.

The objective in most analyses using this dataset is to correctly classify the species of the iris plant based on the measurements of its sepals and petals. The simplicity and small size of the dataset make it ideal for explaining and experimenting with basic classification methods in data science and machine learning.

Example:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a model
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Predict the species of a single iris from the test set
species = model.predict(X_test[:1])
print(f"Predicted Species: {iris.target_names[species[0]]}")

2. California Housing — Regression

The California Housing dataset is a popular dataset used in machine learning and data analysis. It contains data from the 1990 Census and includes details such as population, median income, median housing price, and more for each block group in California. Block groups are the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

Key features of the dataset include:

  • MedInc: median income in a block group
  • HouseAge: median house age in a block group
  • AveRooms: average number of rooms per household
  • AveBedrms: average number of bedrooms per household
  • Population: block group population
  • AveOccup: average number of household members
  • Latitude: block group latitude
  • Longitude: block group longitude
  • MedHouseVal: median house value for households within a block group, in units of $100,000 (target variable)

This dataset is often used for regression tasks, where the goal is to predict the median house value (MedHouseVal) based on other variables.

Here’s an example code snippet using scikit-learn in Python to load the California Housing dataset and train a linear regression model with it:

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np

# Fetch the dataset
california_housing = fetch_california_housing()

# Create a DataFrame
data = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
data['MedHouseVal'] = california_housing.target

# Split the data into training and testing sets
X = data.drop('MedHouseVal', axis=1)
y = data['MedHouseVal']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate the Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

3. MNIST (Handwritten Digits) — Image Recognition

The Modified National Institute of Standards and Technology (MNIST) dataset is a classic in the field of image processing. It consists of 70,000 small square 28×28 pixel grayscale images of handwritten single digits between 0 and 9, split into 60,000 training images and 10,000 test images.

Example:

import numpy as np
import matplotlib.pyplot as plt
from keras import layers, models
from keras.datasets import mnist

# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Normalize the input images
x_train, x_test = x_train / 255.0, x_test / 255.0

# Build the neural network model
model = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, epochs=5)

# Evaluate the model
model.evaluate(x_test, y_test, verbose=2)

# Predict the test set
predictions = model.predict(x_test)

# Pick an image from the test set
image_index = 1 # Change this to see different test images
plt.imshow(x_test[image_index], cmap=plt.cm.binary)
plt.title(f'Test image\nPredicted: {np.argmax(predictions[image_index])}')
plt.show()
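
Beyond overall accuracy, it is often useful to see which digits the model confuses with one another. Here is a minimal sketch of a confusion matrix, reusing predictions and y_test from above and assuming scikit-learn is available:

from sklearn.metrics import confusion_matrix

# Rows are true digits, columns are predicted digits
predicted_labels = np.argmax(predictions, axis=1)
print(confusion_matrix(y_test, predicted_labels))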

4. IMDb Movie Reviews — NLP

The IMDb Movie Review dataset is popular for binary sentiment classification tasks. It consists of 50,000 reviews from the Internet Movie Database (IMDb), labeled as positive or negative and split evenly into 25,000 training and 25,000 test reviews.

Example:

from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM
from keras.preprocessing.sequence import pad_sequences

# Load the IMDb dataset
num_words = 10000 # Consider the top 10,000 words in the dataset
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=num_words)

# Prepare the data
maxlen = 500 # Cut reviews after 500 words
X_train = pad_sequences(X_train, maxlen=maxlen)
X_test = pad_sequences(X_test, maxlen=maxlen)

# Build the model
model = Sequential()
model.add(Embedding(num_words, 32, input_length=maxlen))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

# Train the model
history = model.fit(X_train, y_train, epochs=10, batch_size=128, validation_split=0.2)

# Evaluate the model
evaluation = model.evaluate(X_test, y_test)
print(f'Test Loss: {evaluation[0]} / Test Accuracy: {evaluation[1]}')
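
Since load_data returns each review as a sequence of integer word indices, it can be instructive to decode one back into text. The sketch below relies on the Keras convention that indices 0-2 are reserved for padding, start-of-sequence, and out-of-vocabulary tokens, so real word indices are offset by 3:

# Build a reverse index that maps integer indices back to words
word_index = imdb.get_word_index()
reverse_word_index = {index + 3: word for word, index in word_index.items()}

# Decode the first training review, skipping the reserved indices (0, 1, 2)
decoded_review = ' '.join(reverse_word_index.get(i, '?') for i in X_train[0] if i > 2)
print(decoded_review[:500])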

5. CIFAR-10 — Object Recognition in Images

The CIFAR-10 dataset contains 60,000 32x32 color images in 10 different classes (6,000 images per class), split into 50,000 training and 10,000 test images. The classes include objects like dogs, cats, airplanes, and trucks.

Example:

from keras.datasets import cifar10
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.utils import to_categorical
import matplotlib.pyplot as plt
import numpy as np

# Load the dataset
(X_train, y_train), (X_test, y_test) = cifar10.load_data()

# Normalize the data
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

# Convert class vectors to binary class matrices
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

# Create the model
model = Sequential()
model.add(Flatten(input_shape=(32, 32, 3)))
model.add(Dense(512, activation='relu'))
model.add(Dense(10, activation='softmax'))

# Compile and train the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=32, epochs=10, validation_data=(X_test, y_test))

# Evaluate the model
scores = model.evaluate(X_test, y_test)
print(f"Accuracy: {scores[1]*100}%")

# Select a test image
test_image_index = 0 # You can change this index to test different images
test_image = X_test[test_image_index]
test_label = y_test[test_image_index]

# Predict the class of the test image
predicted_probs = model.predict(np.expand_dims(test_image, axis=0))
predicted_class = np.argmax(predicted_probs)

# Visualize the test image
plt.imshow(test_image)
plt.title(f"Predicted: {predicted_class}, True: {np.argmax(test_label)}")
plt.show()
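
The prediction above is just a class index. CIFAR-10's ten classes follow a fixed order, so you can map the index to a human-readable label with a small follow-up to the example:

# CIFAR-10 class labels, in index order
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']
print(f"Predicted: {class_names[predicted_class]}, True: {class_names[np.argmax(test_label)]}")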


6. UCI Adult Income — Classification

The Adult Income dataset, often simply referred to as the “Adult” dataset, is sourced from the UCI Machine Learning Repository. It’s used for a classification task to predict whether a person makes over $50K a year based on census data. It includes attributes like age, education, occupation, and hours per week.

Example:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load the dataset (values are separated by ", ", so skip the leading spaces)
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
           'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
           'hours-per-week', 'native-country', 'income']
data = pd.read_csv(url, names=columns, skipinitialspace=True)

# Separate the target before one-hot encoding so the income column
# does not leak into the features
y = (data['income'] == '>50K').astype(int)
X = pd.get_dummies(data.drop('income', axis=1))

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a model
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Predict income for one test example
predicted_income = model.predict(X_test.iloc[[0]])
print("Income > $50K:", bool(predicted_income[0]))

7. 20 Newsgroups — Text Classification

The 20 Newsgroups dataset is a collection of approximately 20,000 newsgroup documents, partitioned across 20 different newsgroups. It’s widely used for experiments in text applications of machine learning techniques, such as text classification and text clustering.

Example:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Load the dataset
data = fetch_20newsgroups()
categories = data.target_names

# Create a pipeline
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Train the model
model.fit(data.data, data.target)

# Predict the category of a new post
text = "Graphics cards in computers are getting faster."
predicted_category = model.predict([text])[0]
print("Predicted Category:", categories[predicted_category])

Conclusion

Working with these datasets can help practitioners and researchers not only to hone their technical skills but also to understand the subtleties and complexities of real-world data. Whether you’re a beginner looking to take your first steps in machine learning or a seasoned professional seeking to test new algorithms, these datasets offer a wealth of resources for learning, experimentation, and innovation in this ever-evolving field.
