Clothing Dataset

5,000+ Images of Clothes in Public Domain

Alexey Grigorev

Published in Data Science Insider · 5 min read · Oct 21, 2020

Two months ago, I asked the community to help me collect a dataset of clothing images. And now it’s ready!

You can download it here:

Full size: https://www.kaggle.com/agrigorev/clothing-dataset-full
Resized: https://github.com/alexeygrigorev/clothing-dataset
A subset with 10 most popular classes, resized: https://github.com/alexeygrigorev/clothing-dataset-small

In total, we collected more than 5,000 images of 20 types of clothes. The dataset is released under a public domain license (CC0). This means that anyone can use this data for any purpose, including commercial ones.

For example:

Training a model for self-education
Creating a tutorial or a course (free or paid)
Writing a book

In this post, we’ll tell you more about the dataset. In particular, we’ll cover:

Different ways of collecting data: networking, Yandex.Toloka, and Tagias
Labeling the data
Accessing and using the dataset

Let’s start!

Dataset collection

We used three different ways to collect the dataset:

Toloka — a crowdsourcing platform
A networking crowdsourcing initiative on social media
Tagias — a company specializing in data collection

Toloka

1,400 images (28% of the entire dataset) were collected using Toloka — a crowdsourcing platform. The task was simple: take a picture of some clothes and upload it.

The instructions for the task (translated from Russian)

We also experimented with Amazon Mechanical Turk, but creating tasks with image uploads is a lot more difficult on Amazon. In Toloka, setting up an image collection form requires no programming at all.

The results from Toloka contained:

Valid images that were actually taken by the workers
Images of clothes downloaded from the Internet
Images of random objects, not clothes
A lot of duplicates

To check if an image was downloaded from the Internet, we used Google.

We used the “Search Google for image” feature from Google Chrome

Collecting data from Toloka required careful analysis of the results, and in the end we discarded more than 40% of the submitted images. That’s why we decided to try other options.
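The post doesn’t say how the duplicates were found. As one possible starting point, here is a minimal sketch that groups byte-identical files by their SHA-256 hash. The function names and the folder layout are hypothetical, and this only catches exact copies: resized or re-encoded versions of the same picture would need a different approach.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_exact_duplicates(image_dir: str) -> list[list[Path]]:
    """Group files under image_dir (hypothetical folder) with identical contents."""
    groups = defaultdict(list)
    for path in sorted(Path(image_dir).rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    # Keep only the digests shared by two or more files.
    return [paths for paths in groups.values() if len(paths) > 1]
```

Each returned group lists files with the same content, so all but one file per group can be dropped.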

Network crowdsourcing

One option was to use the power of social media. We asked our network on LinkedIn and Twitter to contribute to the dataset. To do so, we wrote a call-to-action article and shared it.

To collect the data, we set up two simple Airtable forms. We explicitly asked the participants to confirm that the images are their own and that they are willing to share them under CC0.

A form in Airtable for uploading images.

As an extra incentive to upload images, we gave away three copies of Machine Learning Bookcamp to the top three contributors. In the end, 32 people submitted 600 images in total.

Tagias

The campaign on social media got a lot of attention, and we were contacted by multiple companies. One of them, Tagias, agreed to contribute to the project and helped tremendously with data collection.

Workers at Tagias take pictures themselves and don’t scrape them from the Internet. There’s an internal validation process, so we didn’t need to verify that the images are genuine. Only a couple of the provided images weren’t suitable, and our feedback was taken into account.

In total, Tagias contributed 3,000 images, which is 60% of the entire dataset.

Labeling

After collecting enough images, we needed to label them. We did it manually, without crowdsourcing, using a simple annotation tool based on IPython widgets.

While labeling, we sometimes made mistakes. To correct these mistakes, we used a simple idea based on training a neural network:

First, train a model with a high learning rate for a couple of epochs on all the data.
Next, apply the model to the training data and check where it makes mistakes.
If the network is correct and the label in the data is wrong, correct the label.
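The second and third steps can be sketched as a small helper that lists the images where the model confidently disagrees with the given label, so a person can review them. This is an illustration, not the code we used: the function name, the confidence threshold, and the toy numbers are all made up.

```python
def find_suspect_labels(labels, probabilities, class_names, min_confidence=0.9):
    """Return (index, given_label, predicted_label, confidence) tuples
    for examples where a confident prediction disagrees with the label."""
    suspects = []
    for i, (label, probs) in enumerate(zip(labels, probabilities)):
        best = max(range(len(probs)), key=probs.__getitem__)
        predicted, confidence = class_names[best], probs[best]
        if predicted != label and confidence >= min_confidence:
            suspects.append((i, label, predicted, confidence))
    # Most confident disagreements first: these are the best review candidates.
    suspects.sort(key=lambda s: s[3], reverse=True)
    return suspects


# Toy example with made-up predictions for three images.
class_names = ["pants", "shorts"]
labels = ["pants", "pants", "shorts"]
probabilities = [[0.97, 0.03], [0.02, 0.98], [0.40, 0.60]]

for suspect in find_suspect_labels(labels, probabilities, class_names):
    print(suspect)  # prints (1, 'pants', 'shorts', 0.98)
```

The helper only proposes candidates. Whether the model or the label is right still has to be decided by looking at the image.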

Accessing the data

Full dataset

There are two ways to download the dataset:

Original images: https://www.kaggle.com/agrigorev/clothing-dataset-full
Resized images: https://github.com/alexeygrigorev/clothing-dataset

The summary statistics about the dataset from Kaggle

The dataset contains 20 classes. The 17 listed here are clothing types:

T-Shirt (1011 items)
Long Sleeve (699 items)
Pants (692 items)
Shoes (431 items)
Shirt (378 items)
Dress (357 items)
Outwear (312 items)
Shorts (308 items)
Hat (171 items)
Skirt (155 items)
Polo (120 items)
Undershirt (118 items)
Blazer (109 items)
Hoodie (100 items)
Body (69 items)
Top (43 items)
Blouse (23 items)

We marked images of children’s clothes with a special flag “kids”:

True (476 items)
False (4927 items)
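If you want to recompute these statistics yourself, here is a minimal sketch. It assumes the labels come as a CSV file with one row per image and columns named label and kids. The file name and the column names are assumptions, so check the repository for the actual ones.

```python
import csv
from collections import Counter


def count_column(csv_path, column):
    """Count how often each value appears in one column of a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return Counter(row[column] for row in csv.DictReader(f))
```

For example, count_column("images.csv", "label").most_common() would list the classes from the most to the least frequent, and count_column("images.csv", "kids") would give the counts for the flag.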

Some of the items are still labeled “Not sure”, “Others”, or “Skip”, and there could be labeling errors. Corrections are welcome. The best way to submit one is to create a pull request on GitHub: https://github.com/alexeygrigorev/clothing-dataset

Top-10 subset

Images of some classes don’t appear very often in the dataset. This means that training a model on this data is difficult — to train a meaningful model, we need at least 100–200 images of each class.

That’s why, for educational purposes, we created a subset of the full dataset that covers only the top-10 classes:

T-Shirt (928 items)
Long Sleeve (576 items)
Pants (559 items)
Shirt (345 items)
Shoes (297 items)
Dress (288 items)
Shorts (257 items)
Outwear (246 items)
Hat (149 items)
Skirt (136 items)

For this dataset, we excluded images of clothes for children.
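As a quick check, the ten classes can be recovered from the full-dataset counts listed earlier. The sketch below ranks the classes by size and also lists the ones below the lower end of the 100–200 range. The variable names are only for illustration, and the counts are the full-dataset ones, so they are higher than in the subset, where children’s clothes are excluded.

```python
# Per-class counts of the full dataset, as listed in this post.
full_counts = {
    "T-Shirt": 1011, "Long Sleeve": 699, "Pants": 692, "Shoes": 431,
    "Shirt": 378, "Dress": 357, "Outwear": 312, "Shorts": 308,
    "Hat": 171, "Skirt": 155, "Polo": 120, "Undershirt": 118,
    "Blazer": 109, "Hoodie": 100, "Body": 69, "Top": 43, "Blouse": 23,
}

MIN_IMAGES = 100  # lower end of the 100-200 images per class mentioned above

# Ten largest classes, biggest first.
top10 = sorted(full_counts, key=full_counts.get, reverse=True)[:10]

# Classes that fall below the minimum.
rare = [name for name, n in full_counts.items() if n < MIN_IMAGES]

print(top10)
# ['T-Shirt', 'Long Sleeve', 'Pants', 'Shoes', 'Shirt',
#  'Dress', 'Outwear', 'Shorts', 'Hat', 'Skirt']
print(rare)
# ['Body', 'Top', 'Blouse']
```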

To make it simpler, we already resized the images, so they load faster when training a model. The dataset is also already split into train, validation, and test sets.

You can download it from https://github.com/alexeygrigorev/clothing-dataset-small.
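After downloading, you may want to check how many images each split contains. The sketch below assumes a layout with one folder per split and, inside it, one folder per class (for example train/pants/). The layout, the function name, and the list of file extensions are assumptions, so adjust them to what you see in the repository.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images(root):
    """Return {split: {class_name: image_count}} for a split/class/image layout."""
    counts = {}
    for split_dir in sorted(Path(root).iterdir()):
        # Skip plain files and hidden folders such as .git
        if not split_dir.is_dir() or split_dir.name.startswith("."):
            continue
        counts[split_dir.name] = {
            class_dir.name: sum(
                1 for p in class_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
            )
            for class_dir in sorted(split_dir.iterdir())
            if class_dir.is_dir()
        }
    return counts
```

Calling count_images("clothing-dataset-small") on a local copy would return a nested dictionary with the number of images per class in each split.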

This dataset was used in Chapter 7 of Machine Learning Bookcamp. You can find a code example in the book’s GitHub repository.

Now it’s your turn! Tell us how you’d like to use it!

Summary

We collected over 5,000 images from three sources: Tagias, Toloka, and networking.
Submissions from Toloka required a lot of extra analysis.
Labeling was done manually using IPython widgets, and we corrected labeling mistakes using a simple neural network.
The dataset is available in two parts: a full dataset and a subset (top-10 classes).

Acknowledgments

We’d like to thank:

Kenes Shangereyev and Tagias.com for helping with 3,000 images
All the 32 people who contributed their images to the dataset via the forms
Everyone who supported the initiative by engaging with the announcements on social media