Published in Nerd For Tech

Easily Split Your Directory Into Train, Validation, and Testing Format

Photo by Zac Wolff on Unsplash

Splitting a dataset into training, validation, and testing sets is a common way to deal with overfitting or underfitting before a model is deployed to production. In this article, Mr Samarth Agrawal explained very well how to split datasets stored as a data frame in CSV files. I realized that the approach from that article couldn’t be applied to a dataset built from several directories.

Before we get into the problem, note that Mr Samarth Agrawal cited Jason Brownlee’s article for the definitions of the training, validation, and testing datasets. Because the definitions are easy to understand, I will not change anything in them.

  • Training Dataset: The sample of data used to fit the model.
  • Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.
  • Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.

Now that we know why we should divide our dataset into three parts, let’s get into the problem.

The Problem

In this article, I would like to share how to split a dataset that is built from several directories, in a format like this.

Class1/
· File1
· File2
· …
Class2/
· File1
· File2
· …
Class3/
· File1
· File2
· …

Since this dataset is not in CSV format, unlike the one in that article, we can’t use the same method. But the concept is the same. We want to split it into three directories like this.

Train/
Class1/
· File1
· File2
· …
Class2/
· File1
· File2
· …
Class3/
· File1
· File2
· …
Test/
Class1/
· File1
· File2
· …
Class2/
· File1
· File2
· …
Class3/
· File1
· File2
· …
Val/
Class1/
· File1
· File2
· …
Class2/
· File1
· File2
· …
Class3/
· File1
· File2
· …```
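
To make the target layout concrete, here is a minimal standard-library sketch of what such a split involves: shuffle the files of each class, cut the list by ratio, and copy each part into its own folder. The function name `split_class_folders`, the 80/10/10 ratio, and the seed are my own illustrative assumptions, and this is not how any particular package implements it.

```python
import random
import shutil
from pathlib import Path


def split_class_folders(src, dst, ratio=(0.8, 0.1, 0.1), seed=1337):
    """Copy src/<class>/<file> into dst/{Train,Val,Test}/<class>/<file>."""
    src, dst = Path(src), Path(dst)
    rng = random.Random(seed)  # fixed seed so the split is reproducible

    for class_dir in sorted(p for p in src.iterdir() if p.is_dir()):
        files = sorted(f for f in class_dir.iterdir() if f.is_file())
        rng.shuffle(files)

        n_train = int(len(files) * ratio[0])
        n_val = int(len(files) * ratio[1])
        parts = {
            "Train": files[:n_train],
            "Val": files[n_train:n_train + n_val],
            # Test takes the remainder, so no file is dropped by rounding.
            "Test": files[n_train + n_val:],
        }

        for split_name, split_files in parts.items():
            target = dst / split_name / class_dir.name
            target.mkdir(parents=True, exist_ok=True)
            for f in split_files:
                shutil.copy2(f, target / f.name)  # copy, keep the original
```

The files are copied rather than moved, so the original class folders stay untouched and the total number of files is preserved across the three splits.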

The Answer

Without changing the number of files, we can divide the dataset into three directories like the ones above. The magic comes from `split_folders`, which can easily be installed using pip.

pip install split-folders

After that, you just need to run this code.
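
A minimal sketch of that call, assuming a recent release of the package, where the module is imported as `splitfolders` (older releases used `split_folders`). The wrapper name `split_dataset`, the seed, and the 80/10/10 ratio are illustrative assumptions, so adjust them to your own dataset.

```python
def split_dataset(input_dir, output_dir, ratio=(0.8, 0.1, 0.1), seed=1337):
    """Split input_dir/<class>/<file> into train, val, and test folders."""
    # Third-party package: pip install split-folders
    import splitfolders

    # ratio is (train, val, test); the seed keeps the shuffle reproducible.
    splitfolders.ratio(input_dir, output=output_dir, seed=seed, ratio=ratio)
```

Calling `split_dataset("dataset", "output")`, where both folder names are placeholders, should leave `train`, `val`, and `test` folders inside `output`, each containing the same class subfolders as the input.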

Real Case

The answer section stays theory until we apply it to a real dataset. For this case, I created a simple dinosaur dataset and posted it on Kaggle. Click this link to go to the dataset.

There are five classes of dinosaurs in the dataset that we need to classify. Let’s apply the code from the answer section to it. The directory format is the same as in the problem section. Let’s divide it to get the targeted form: train, test, and val. This GIF explains it well.

I encourage you to create a model on your own using this dataset. There is no reward for creating a new model, but you can practice building one on a new type of dataset, not just cats and dogs or another common dataset.
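
After the split, it is worth checking that every class ended up in every split. Here is a small sketch that counts the files per split and per class; the helper name `count_files` is my own, and it assumes the output folder has the split/class/file layout shown in the problem section.

```python
from pathlib import Path


def count_files(output_dir):
    """Return {split: {class: number_of_files}} for a split output folder."""
    counts = {}
    for split_dir in sorted(p for p in Path(output_dir).iterdir() if p.is_dir()):
        counts[split_dir.name] = {
            class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
            for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir())
        }
    return counts
```

For example, `count_files("output")`, with `output` as a placeholder folder name, returns one dictionary per split, so a missing class or an empty folder is easy to spot.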

Conclusion

Congratulations, you now know how to split directories into training, testing, and validation sets. There is no secret; we only need the split-folders package, and splitting directories will no longer be a problem. If you have any questions, ask them in the comment section or by email. Thanks for reading.
