An elegant way to split a dataset into train, validation, and test sets using shell scripting
Researchers and engineers alike often use train_test_split
from the scikit-learn
library, or rely on datasets that ship with predefined splits.
What if I told you the same can be done in a few lines of code using shell scripting?
Or batch scripting (as done in the link below).
Firstly, let’s look at the case where we are dealing with a simple task like segmenting pets from images, or lungs from an X-ray scan.
In this case, we can assume that all the images are in a single folder, and we need to divide them into three folders: ‘train’, ‘val’, and ‘test’.
Let’s see how we can do that.
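A minimal sketch of such a script, assuming the images sit in a folder named images (a placeholder name) and a 70/15/15 split ratio (also an assumption to adjust):

```shell
#!/bin/bash
# Hypothetical folder name and split ratios -- adjust to your dataset.
image_folder_path="images"
mkdir -p train val test

# Count the files and work out how many go to train and val
# (test gets whatever is left over).
total_files=$(ls -1 "$image_folder_path" | wc -l)
train_count=$((total_files * 70 / 100))
val_count=$((total_files * 15 / 100))

# Shuffle the file list and move a random slice into each split.
ls -1 "$image_folder_path" | shuf | head -n "$train_count" | while read -r f; do
  mv "$image_folder_path/$f" train/
done
ls -1 "$image_folder_path" | shuf | head -n "$val_count" | while read -r f; do
  mv "$image_folder_path/$f" val/
done
mv "$image_folder_path"/* test/ 2>/dev/null  # everything remaining
```

shuf comes from GNU coreutils, so this works out of the box on Linux and Google Colab; on macOS you may need to install coreutils first. The counts are integer percentages, so any rounding remainder simply lands in ‘test’.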
The above code randomly chooses the required number of files and moves them to the ‘train’ folder, repeats the process for the ‘val’ folder, and finally moves all the remaining files to the ‘test’ folder.
Now, as we are dealing with a segmentation task here, we also need to move the segmentation masks, which share filenames with the images, into the corresponding folders.
Let’s see how to do that too.
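One way to sketch this for the ‘train’ split, assuming the masks live in a folder named masks (a placeholder) and share the exact filename, extension included, with their images:

```shell
#!/bin/bash
# Hypothetical folder name -- masks are assumed to share the exact
# filename (including extension) with their images.
mask_folder_path="masks"
mkdir -p train_masks

# For every image already moved into 'train', move the matching mask.
for img in train/*; do
  [ -e "$img" ] || continue  # skip if the split folder is empty
  fname=$(basename "$img")
  mv "$mask_folder_path/$fname" train_masks/
done
```

If your masks use a different extension (say .png masks for .jpg images), swap the extension on fname before the move.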
We need to do the above for all three splits.
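The same mask-moving step, wrapped in a loop over the three splits (again assuming a masks folder with matching filenames):

```shell
#!/bin/bash
# Hypothetical folder name; masks share filenames with their images.
mask_folder_path="masks"

for split in train val test; do
  mkdir -p "${split}_masks"
  for img in "$split"/*; do
    [ -e "$img" ] || continue  # skip if the split folder is empty
    fname=$(basename "$img")
    mv "$mask_folder_path/$fname" "${split}_masks/"
  done
done
```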
Now, what if the image folder contains class subfolders?
As is generally the case in datasets like CIFAR-10, CIFAR-100, etc.
We can do that by recursively iterating over the files in each sub-folder of the image directory and moving them to either of ‘train’, ‘val’, or ‘test’ folders.
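A sketch of that approach, assuming a layout like images/<class_name>/<file> and the same 70/15/15 ratios: filenames are collected with find, shuffled once, and moved while keeping each file’s class subfolder.

```shell
#!/bin/bash
# Hypothetical layout: images/<class_name>/<file>, as in CIFAR-style folders.
image_folder_path="images"

# List every file under the class subfolders, shuffled once.
find "$image_folder_path" -mindepth 2 -type f | shuf > all_files.txt
total_files=$(wc -l < all_files.txt)
train_count=$((total_files * 70 / 100))
val_count=$((total_files * 15 / 100))

# Move one file into <split>/<class>/, preserving its class subfolder.
move_to_split() {
  while read -r f; do
    class=$(basename "$(dirname "$f")")
    mkdir -p "$1/$class"
    mv "$f" "$1/$class/"
  done
}

head -n "$train_count" all_files.txt | move_to_split train
sed -n "$((train_count + 1)),$((train_count + val_count))p" all_files.txt | move_to_split val
tail -n +"$((train_count + val_count + 1))" all_files.txt | move_to_split test
rm all_files.txt
```

Note that this shuffles the pooled file list, so the per-class ratios are only approximately 70/15/15.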
We can also follow the same procedure as in the single image folder case for each class subfolder and achieve the same result.
The above examples worked fine in Google Colab when splitting the Oxford-IIIT Pet dataset. However, an issue may occur due to the usage of ls
inside the shell script, where you may see an error like
ls: /Users/…/*.py: No such file or directory
A solution to this issue can be found here.
Alternatively, simply change the following line
total_files=$(ls -1 "$image_folder_path" | wc -l)
to
total_files=$(ls -1 "$image_folder_path"/* | wc -l)
Clap and share if you like it, and follow for more!