Build a Deep Learning dataset (Part 2)

Jonathan Hui
Mar 1, 2018 · 5 min read

The success of a Deep Learning project depends on the quality of your dataset. In this second part of the series, we explore the core issues in building a good training dataset.

The 6-part series for “How to start a Deep Learning project?” consists of:

· Part 1: Start a Deep Learning project.
· Part 2: Build a Deep Learning dataset.
· Part 3: Deep Learning designs.
· Part 4: Visualize Deep Network models and metrics.
· Part 5: Debug a Deep Learning Network.
· Part 6: Improve Deep Learning Models performance & network tuning.

Dataset

Garbage in and garbage out. Good data trains good models.

Public and academic datasets

For research projects, search for established public datasets. Those datasets have cleaner samples and published model results that you can use as a baseline. If you have more than one option, select the one with the highest-quality samples relevant to your problem.
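
As a quick illustration (not from the original project), assuming PyTorch and torchvision are available, an established dataset such as CIFAR-10 can be loaded in a few lines, and its published results give you a baseline to compare against:

    # A minimal sketch, assuming torchvision is installed; CIFAR-10 stands in
    # for whatever public dataset fits your problem.
    import torchvision
    import torchvision.transforms as transforms

    transform = transforms.ToTensor()

    # Download an established public dataset with well-known published baselines.
    train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                             download=True, transform=transform)
    test_set = torchvision.datasets.CIFAR10(root="./data", train=False,
                                            download=True, transform=transform)

    print(len(train_set), len(test_set))  # 50000 10000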

Custom datasets

For real-life problems, we need samples originating from the problem domain. Try to locate public datasets first. The effort to build a high-quality custom dataset is rarely discussed properly. If no public dataset is available, search for places where you can crawl the data. There are usually plenty of suggestions, but the data quality is often low and requires a lot of cleanup. Spend quality time evaluating all your options and select the most relevant sources before crawling samples.

A high-quality dataset should contain:

  • a balanced taxonomy.
  • a sufficient amount of data.
  • high-quality information in the data and labels.
  • minimal errors in the data and labels.
  • relevance to your problem.
  • diversity.

Do not crawl all your data at once. We often crawl website samples by tags and categories relevant to our problem. Train and test the samples in your model, and refine the crawled taxonomy from the lessons learned. Then clean up the crawled data thoroughly; otherwise, even the best model designs will fall short of human-level performance.

Danbooru and Safebooru are two very popular sources of Anime characters, but some deep learning applications prefer Getchu for its better-quality drawings. We can download drawings from Safebooru using a set of tags, examine the samples visually, and run tests to analyze the errors (the samples that perform badly). Both the model training and the visual evaluation provide further information to refine our tag selections. With continued iterations, we learn more and build up our samples gradually.

Download files to different folders according to tags or categories so that we can merge them later based on our experience. Clean up the samples, and use a classifier to further filter out samples not relevant to your problem; for example, remove drawings in which the characters are too small. Some Anime projects use Illustration2Vec, which extracts vector features to estimate tags, for further fine-tuned filtering. Smaller projects rarely collect as many samples as academic datasets, so apply transfer learning if appropriate.
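
Here is a minimal sketch of such a per-tag crawl, assuming a hypothetical fetch_image_urls(tag) helper that returns image URLs for a tag (the real call depends on the site you crawl, its API, and its terms of use):

    # A per-tag crawl sketch; fetch_image_urls(tag) is a hypothetical helper,
    # not a real Safebooru/Getchu API call.
    import os
    import requests

    def download_by_tags(tags, out_root="data/raw"):
        for tag in tags:
            # One folder per tag/category so folders can be merged later
            # based on what training and error analysis teach us.
            out_dir = os.path.join(out_root, tag)
            os.makedirs(out_dir, exist_ok=True)
            for i, url in enumerate(fetch_image_urls(tag)):  # hypothetical helper
                resp = requests.get(url, timeout=10)
                if resp.status_code != 200:
                    continue  # skip broken links instead of aborting the crawl
                with open(os.path.join(out_dir, f"{i:06d}.jpg"), "wb") as f:
                    f.write(resp.content)

    download_by_tags(["long_hair", "short_hair"])  # illustrative tags; refine the list iteratively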

I revisited the progress on Manga colorization while writing this article. We did not spend much time on PaintsChainer when we started the project, but I am glad to pay them a visit now.

The line art on the left is provided by PaintsChainer and the drawing on the right is colored by the machine. This is definitely product-ready quality.

We decided to test it with some of our own training samples. The results are less impressive: fewer colors are applied and the style is not right.

Since we had trained our model for a while, we knew which drawings would perform badly. As expected, it has a hard time with drawings that contain entangled structures.

This illustrates a very important point: choose your samples well. As a product offering, PaintsChainer makes a smart move by focusing on the type of sketches it excels at. To prove that, I used a clean line art picked from the internet. The result is impressive again.

There are a few lessons learned here. There is no bad data, only data that does not address your needs. Focus on what your product wants to offer. As the taxonomy of samples grows, it becomes much harder to train the model and to maintain output quality.

Early in development, we realized that some drawings have too many entangled structures. Without significantly increasing the model capacity, those drawings provide little value in training and are better left out. They just make the training inefficient.

Trim out irrelevant data. You will get a better model.
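
As a rough sketch of that kind of trimming, following the earlier example of dropping drawings whose characters are too small, and assuming OpenCV plus an anime face cascade such as nagadomi's lbpcascade_animeface.xml are available (the 5% face-area threshold is an illustrative choice, not a value from our project):

    # Drop drawings whose characters (faces) are too small to be useful.
    import os
    import cv2

    cascade = cv2.CascadeClassifier("lbpcascade_animeface.xml")  # assumed to be downloaded

    def keep_drawing(path, min_face_ratio=0.05):
        img = cv2.imread(path)
        if img is None:
            return False  # unreadable file: drop it
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        h, w = gray.shape
        # Keep the drawing only if at least one detected face covers enough area.
        return any(fw * fh >= min_face_ratio * w * h for (_, _, fw, fh) in faces)

    kept = [p for p in os.listdir("data/raw/long_hair")
            if keep_drawing(os.path.join("data/raw/long_hair", p))]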

To recap,

  • Use public datasets if possible.
  • Find the best site(s) for high-quality, diverse samples.
  • Categorize samples into folders. Merge data based on the lessons learned.
  • Analyze errors and filter out samples irrelevant to your real-life problem.
  • Build your samples iteratively.
  • Balance the number of samples in each class (see the sketch after this list).
  • Shuffle your samples before training.
  • Collect sufficient samples. If you cannot, apply transfer learning.
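
Below is a minimal sketch of the balancing and shuffling steps, assuming samples are kept as (path, label) pairs; undersampling every class to the size of the smallest one is just one simple option (oversampling rare classes or weighting the loss are common alternatives):

    # Balance classes by undersampling, then shuffle before training.
    import random
    from collections import defaultdict

    def balance_and_shuffle(samples, seed=0):
        by_class = defaultdict(list)
        for path, label in samples:
            by_class[label].append((path, label))
        n = min(len(v) for v in by_class.values())  # size of the smallest class
        rng = random.Random(seed)
        balanced = [s for v in by_class.values() for s in rng.sample(v, n)]
        rng.shuffle(balanced)  # shuffle so batches are not ordered by class
        return balanced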

Part 3

People spend a lot of time building and tuning models. This article serves as a counterbalance by stressing the importance of a clean dataset. In Part 3: Deep Learning designs, we finally look at some key design decisions in a DL project.
