Create a .lst file for machine learning image classification on AWS Sagemaker

David O'Dell
4 min read · Apr 5, 2020


Photo by National Cancer Institute on Unsplash

I was taking a course on machine learning, and the challenge was to build an image classifier to predict whether an uploaded image was X or Y. If you’re into this kind of thing (computer vision and ML are a guilty pleasure of mine), then you’ve of course heard of the cats vs. dogs dataset. Our course gave us a well-formatted .lst file that we could run straight through AWS Sagemaker.

Well, I finished that one and decided to move on to the Blood Cell dataset, which didn’t come with a nicely formatted .lst file that Sagemaker could ingest. As you can imagine, most of the larger datasets are highly customized and required a lot of architecture to perform well (for their creators). We end users are grateful, but we often need to go the extra step and make our own labeling schema to fit whichever model-building platform we use. The course I’m in now is all about learning the Sagemaker ML stack, which consists of:

  • s3
  • Sagemaker
  • Lambda
  • API endpoint

The blood cell dataset is located here: https://www.kaggle.com/paultimothymooney/blood-cells

It comes with a .csv label guide, but what I need is the LST format. The format is super simple, but in this case I had to write a tiny script to export it. I figure I’m going to run into this often enough, so I wanted to share it.

  • It’s just an index number you make up (#1 file, #2 file, #3 file…), separated by a tab,
  • then an “identifier-digit”, this would be 1 or 0 depending on how you set your training job up. For example, 0 would be cats and 1 would be dogs. Or, in my case, 0 would be eosinophil blood cells and 1 would be lymphocytes.
  • then the s3 file path containing the name of the image, for example, my file structure in s3 for my sagemaker job is like this:
images/
---bloodcells/
------training_file.lst
------test_file.lst
------eosinophil/
----------image383829.jpeg
----------image382828.jpeg
------lymphocyte/
----------image3224.jpeg
----------image84848.jpeg

Let’s say we have 1000 images, but the images aren’t named sequentially; they’re sort of randomly named. That’s OK. The most important thing is that the index AND the identifier are present, since this is what ML programs will use in their training job.

95 0 image_XEE_332.jpg
96 0 image_WRW78.jpg

This reads “image number 95 of 1000, identifier 0, image is named image_XEE_332.jpg”
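
To make the three tab-separated columns concrete, here is a minimal sketch that builds one line with printf. The helper name lst_line is something I made up for illustration; it isn't part of Sagemaker or any tool.

```shell
#!/bin/bash
# Hypothetical helper: print one .lst line as index<TAB>identifier<TAB>file name.
lst_line() {
    printf '%s\t%s\t%s\n' "$1" "$2" "$3"
}

# Image number 95, identifier 0, file named image_XEE_332.jpg.
lst_line 95 0 "image_XEE_332.jpg"
```

Using printf with an explicit \t avoids any doubt about whether your editor wrote a tab or a run of spaces.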

The problem you’ll encounter is that you need to build a file that conforms to this format from the files you have in a folder. In Linux bash this is super easy: just list the file names, echo a counter and write the output to a tab-delimited .lst file.

Simply copy and paste this script into your images dataset folder and change the:

  • path name of your two categories of images
  • final merged training file name you want to create
  • identifier (use 0 or 1 for this folder of similar images)
cd YOUR DATA SET IMAGE FOLDER FILE PATH
vi convert_to_lst.sh
PASTE THIS CONTENT:
########################
#!/bin/bash
A_file_path="images/TRAIN/eosinophil"
A_identifier="0"
B_file_path="images/TRAIN/lymphocyte"
B_identifier="1"
output_file_name="blood_cells_train.lst"
## output identifier and file name then append to same file, make the index last
## note: file names must not contain spaces, the loops below split on whitespace
declare -a array
## keep only the last two path parts (folder/file) of every file found
A_file_list=$(find "$A_file_path" -type f | rev | cut -d'/' -f-2 | rev | sort -n)
for A_file_name in $A_file_list;
do
array+=("$A_identifier $A_file_name")
done
B_file_list=$(find "$B_file_path" -type f | rev | cut -d'/' -f-2 | rev | sort -n)
for B_file_name in $B_file_list;
do
array+=("$B_identifier $B_file_name")
done
# prints array one element per line
#printf '%s\n' "${array[@]}"
## add an index key at the beginning of every element of output and write to new file
len=${#array[@]}
i=0
while [ $i -lt $len ]; do
echo "$i ${array[$i]}"
let i++
done | tr '[:blank:]' '\t' > "$output_file_name"

##########################
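
Before handing the file to a training job, it can be worth a quick sanity check. This is a minimal sketch, assuming the two-category layout described above (identifiers 0 and 1, index starting at 0); check_lst is a hypothetical helper name, not something Sagemaker provides.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if every line has exactly three
# tab-separated fields, a 0-based sequential index, and an identifier of 0 or 1.
check_lst() {
    awk -F'\t' '
        NF != 3                { print "line " NR ": expected 3 fields, got " NF; bad = 1 }
        $1 != NR - 1           { print "line " NR ": index " $1 " is out of sequence"; bad = 1 }
        $2 != "0" && $2 != "1" { print "line " NR ": unexpected identifier " $2; bad = 1 }
        END                    { exit bad }
    ' "$1"
}

# Demo on a two-line sample written to a temporary file.
sample=$(mktemp)
printf '0\t0\teosinophil/_6_6851.jpeg\n1\t1\tlymphocyte/_25_5536.jpeg\n' > "$sample"
check_lst "$sample" && echo "sample looks valid"
```

On the real output you would run check_lst blood_cells_train.lst. A file separated by spaces instead of tabs fails the first rule, which is the mistake I'd expect to hit most often.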

Save the file, change permissions and run it:

chmod 775 convert_to_lst.sh
./convert_to_lst.sh

This will output in this format:

get the top 10 lines:
head -n 10 blood_cells_train.lst
0 0 eosinophil/_6_6851.jpeg
1 0 eosinophil/_35_6075.jpeg
2 0 eosinophil/_67_9460.jpeg
3 0 eosinophil/_56_8186.jpeg
4 0 eosinophil/_52_4363.jpeg
5 0 eosinophil/_4_9077.jpeg
6 0 eosinophil/_11_5174.jpeg
7 0 eosinophil/_29_1794.jpeg
8 0 eosinophil/_56_1906.jpeg
9 0 eosinophil/_73_5052.jpeg

Further down the file you’ll see the other category, with its own identifier and the index carrying on from where the first category stopped:

tail -n 10 blood_cells_train.lst 
4970 1 lymphocyte/_25_5536.jpeg
4971 1 lymphocyte/_23_1463.jpeg
4972 1 lymphocyte/_21_5322.jpeg
4973 1 lymphocyte/_16_3612.jpeg
4974 1 lymphocyte/_6_2867.jpeg
4975 1 lymphocyte/_12_6000.jpeg
4976 1 lymphocyte/_24_9949.jpeg
4977 1 lymphocyte/_13_9863.jpeg
4978 1 lymphocyte/_4_7596.jpeg
4979 1 lymphocyte/_7_7646.jpeg
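
If you want to confirm how many references each category ended up with, counting the second column is enough. A minimal sketch, assuming the file is tab-separated as above; count_labels is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: print "identifier count" for each value in column 2.
count_labels() {
    cut -f2 "$1" | sort | uniq -c | awk '{ print $2, $1 }'
}

# Demo on a three-line sample; on the real file, pass blood_cells_train.lst.
sample=$(mktemp)
printf '0\t0\teosinophil/a.jpeg\n1\t0\teosinophil/b.jpeg\n2\t1\tlymphocyte/c.jpeg\n' > "$sample"
count_labels "$sample"
```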

As long as you get your image files into a Linux directory and run this script, it will merge two categories of image file references into one training .lst file, extremely fast. This was around 5000 image references and it didn’t blink.

Say you only want a small training file: just head or tail what you want and append it to a new file. Here are a couple of commands to make a training file with only 50 elements from each category:

head -n 50 blood_cells_train.lst > blood_cells_train_100.lst
## then append the tail to the file you just made:
tail -n 50 blood_cells_train.lst >> blood_cells_train_100.lst
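
Head and tail work here because each category sits at one end of the file. If that ever stops being true, filtering on the identifier column is a safer way to get the same result. This is a sketch under that assumption, with a hypothetical helper name first_n_per_label:

```shell
#!/bin/bash
# Hypothetical helper: keep the first N lines of each identifier (column 2).
# Usage: first_n_per_label FILE N
first_n_per_label() {
    awk -F'\t' -v n="$2" '++seen[$2] <= n' "$1"
}

# Demo on a four-line sample: keep one line per identifier.
sample=$(mktemp)
printf '0\t0\ta.jpeg\n1\t0\tb.jpeg\n2\t1\tc.jpeg\n3\t1\td.jpeg\n' > "$sample"
first_n_per_label "$sample" 1
```

On the real file that would be first_n_per_label blood_cells_train.lst 50 > blood_cells_train_100.lst. Like the head and tail commands, it keeps the original index values rather than renumbering them.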

If you want to pick out a range of items from your file, try getting, say, the 20 items from line 400 to line 419 and using them as the test elements:

sed -n '400,419p' blood_cells_train.lst > blood_cells_test_20.lst
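
One thing to watch: pulling a range out with sed -n leaves those same lines in the training file. If you'd rather the test items not also appear in training, the same range can be deleted from a copy. A minimal sketch; split_range and the output file names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: write lines START..END of a .lst file to a test file
# and every other line to a train file, so the two never overlap.
# Usage: split_range FILE START END TEST_OUT TRAIN_OUT
split_range() {
    sed -n "${2},${3}p" "$1" > "$4"
    sed "${2},${3}d" "$1" > "$5"
}

# Demo on a five-line sample: lines 2-3 become the test set.
sample=$(mktemp)
dir=$(mktemp -d)
printf '0\t0\ta.jpeg\n1\t0\tb.jpeg\n2\t0\tc.jpeg\n3\t1\td.jpeg\n4\t1\te.jpeg\n' > "$sample"
split_range "$sample" 2 3 "$dir/test.lst" "$dir/train.lst"
cat "$dir/test.lst"
```

The original file is left untouched, so you can try different ranges without regenerating it.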

Voila, hope you get some use out of this script!
