An elegant way to split a dataset into train, validation, and test sets using shell scripting
Researchers and engineers alike often use train_test_split
from the scikit-learn
library, or rely on datasets that ship with predefined splits.
What if I told you the same can be done in a few lines of code using shell scripting?
Or batch scripting (as done in the link below).
Firstly, let’s look at the case where we are dealing with a simple task like segmenting pets from images, or lungs from an X-ray scan.
In this case, we can assume that all the images are in a single folder, and we need to divide them into three folders: ‘train’, ‘val’, and ‘test’.
Let’s see how we can do that.
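A minimal sketch of such a script, assuming the images sit in a folder named images (a placeholder name) and a 70/15/15 split ratio (also an assumption to adjust):

```shell
#!/bin/bash
# Hypothetical folder name and split ratios -- adjust to your dataset.
image_folder_path="images"
mkdir -p train val test

# Count the files and work out how many go to train and val
# (test gets whatever is left over).
total_files=$(ls -1 "$image_folder_path" | wc -l)
train_count=$((total_files * 70 / 100))
val_count=$((total_files * 15 / 100))

# Shuffle the file list and move a random slice into each split.
ls -1 "$image_folder_path" | shuf | head -n "$train_count" | while read -r f; do
  mv "$image_folder_path/$f" train/
done
ls -1 "$image_folder_path" | shuf | head -n "$val_count" | while read -r f; do
  mv "$image_folder_path/$f" val/
done
mv "$image_folder_path"/* test/ 2>/dev/null  # everything remaining
```

shuf comes from GNU coreutils, so this works out of the box on Linux and Google Colab; on macOS you may need to install coreutils first. The counts are integer percentages, so any rounding remainder simply lands in ‘test’.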
The above code randomly chooses the required number of files and moves them to the ‘train’ folder, repeats the process for the ‘val’ folder, and finally moves all the remaining files to the ‘test’ folder.
Now, as we are dealing with a segmentation task here, we also need to move the segmentation masks, which share filenames with the images, into the corresponding folders.
Let’s see how to do that too.
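One way to sketch this for the ‘train’ split, assuming the masks live in a folder named masks (a placeholder) and share the exact filename, extension included, with their images:

```shell
#!/bin/bash
# Hypothetical folder name -- masks are assumed to share the exact
# filename (including extension) with their images.
mask_folder_path="masks"
mkdir -p train_masks

# For every image already moved into 'train', move the matching mask.
for img in train/*; do
  [ -e "$img" ] || continue  # skip if the split folder is empty
  fname=$(basename "$img")
  mv "$mask_folder_path/$fname" train_masks/
done
```

If your masks use a different extension (say .png masks for .jpg images), swap the extension on fname before the move.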
We need to do the above for all three splits.
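The same mask-moving step, wrapped in a loop over the three splits (again assuming a masks folder with matching filenames):

```shell
#!/bin/bash
# Hypothetical folder name; masks share filenames with their images.
mask_folder_path="masks"

for split in train val test; do
  mkdir -p "${split}_masks"
  for img in "$split"/*; do
    [ -e "$img" ] || continue  # skip if the split folder is empty
    fname=$(basename "$img")
    mv "$mask_folder_path/$fname" "${split}_masks/"
  done
done
```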
Now, what if the image folder contains class subfolders?
As is generally the case in datasets like CIFAR-10, CIFAR-100, etc.
We can do that by recursively iterating over the files in each sub-folder of the image directory and moving them to either of ‘train’, ‘val’, or ‘test’ folders.
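A sketch of that approach, assuming a layout like images/<class_name>/<file> and the same 70/15/15 ratios: filenames are collected with find, shuffled once, and moved while keeping each file’s class subfolder.

```shell
#!/bin/bash
# Hypothetical layout: images/<class_name>/<file>, as in CIFAR-style folders.
image_folder_path="images"

# List every file under the class subfolders, shuffled once.
find "$image_folder_path" -mindepth 2 -type f | shuf > all_files.txt
total_files=$(wc -l < all_files.txt)
train_count=$((total_files * 70 / 100))
val_count=$((total_files * 15 / 100))

# Move one file into <split>/<class>/, preserving its class subfolder.
move_to_split() {
  while read -r f; do
    class=$(basename "$(dirname "$f")")
    mkdir -p "$1/$class"
    mv "$f" "$1/$class/"
  done
}

head -n "$train_count" all_files.txt | move_to_split train
sed -n "$((train_count + 1)),$((train_count + val_count))p" all_files.txt | move_to_split val
tail -n +"$((train_count + val_count + 1))" all_files.txt | move_to_split test
rm all_files.txt
```

Note that this shuffles the pooled file list, so the per-class ratios are only approximately 70/15/15.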
We can also follow the same procedure as in the single image folder case for each class subfolder and achieve the same result.
The above examples worked fine in Google Colab when splitting the Oxford-IIIT Pet dataset. However, an issue may occur due to the usage of ls
inside the shell script, where you may see an error like
ls: /Users/…/*.py: No such file or directory
A solution to this issue can be found here.
Alternatively, simply change the following line
total_files=$(ls -1 "$image_folder_path" | wc -l)
to
total_files=$(ls -1 "$image_folder_path"/* | wc -l)
Clap and share if you like it, and follow for more!