Parsing File Names Using Regular Expressions

I just went through the 1st lecture of fast.ai v3 (2018 Part 1) taught by Jeremy Howard at the Data Institute at USF. Halfway through the lecture, he used a technique that at first sight appeared complex and incomprehensible. Upon further inspection, I realised that it was a simple yet incredibly elegant way to get the job done. Here, I would like to share what I just learned.

The problem at hand

The lecture described an image classification problem where we wanted to classify images that belonged to 37 separate dog and cat breeds. The problem is that our image files have not been separated for us: every single one of our training images is in one folder. The only clue is that the prefix of each file name indicates which class the image belongs to.

data/oxford-iiit-pet/images/american_bulldog_146.jpg
data/oxford-iiit-pet/images/german_shorthaired_137.jpg
data/oxford-iiit-pet/images/japanese_chin_139.jpg
data/oxford-iiit-pet/images/great_pyrenees_121.jpg
data/oxford-iiit-pet/images/Bombay_151.jpg

The above is a snippet of what the file paths look like.

The human way of doing things

If we were working with a human colleague, we’d ask for help to separate the images by giving some elaborate instructions. First off, we would tell him that we are only interested in the words after the last forward slash (/) in each path. Next, we would probably tell him to check that the file name ends with ‘jpg’. Then, we would say that before the ‘.jpg’ file suffix, there would be a numerical ID that we are not interested in. Finally, we would say that whatever remains on the left (not inclusive of the underscore that separates the label and the ID) is the label for a particular image.
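
To make those instructions concrete, here is a rough sketch of what following them literally might look like in Python, using only plain string methods and no RegEx. The helper name label_from_path is my own invention for illustration, not something from the lecture.

```python
def label_from_path(path):  # hypothetical helper, not from the lecture
    # Step 1: keep only the text after the last forward slash.
    file_name = path.rsplit('/', 1)[-1]
    # Step 2: check that the file name ends with '.jpg'.
    if not file_name.endswith('.jpg'):
        return None
    # Step 3: drop the suffix, then split off the numerical ID
    # at the last underscore.
    stem = file_name[:-len('.jpg')]
    label, _, image_id = stem.rpartition('_')
    if not label or not image_id.isdigit():
        return None
    # Step 4: whatever remains on the left is the label.
    return label

print(label_from_path('data/oxford-iiit-pet/images/american_bulldog_146.jpg'))
# prints: american_bulldog
```

It works, but every instruction needs its own lines of code, which is what makes the one-line alternative so appealing.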

The computer way of doing things

But just how would we give a computer the same instructions? It turns out that regular expressions (RegEx) provide us with exactly what we need. This technique is so nifty that all we need to tell the computer is /([^/]+)_\d+.jpg$ . This line alone tells the computer everything we would have needed to say to a human colleague.

At first sight, it does look a little intimidating, but I will break it down bit by bit, explaining what each portion means, starting from the very end of that RegEx! Each step below repeats the full RegEx and then explains one portion of it.

/([^/]+)_\d+.jpg$
The dollar sign ($) at the very end anchors the match to the end of the text we are interpreting: whatever comes before it in the RegEx must sit right at the end of the text.

/([^/]+)_\d+.jpg$
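Just before the end of the text, we want to check if there exists an exact set of characters, i.e. ‘jpg’. This checks if our files are of the right format. Note that the dot (.) in front of ‘jpg’ is unescaped, so it matches any single character rather than only a literal full stop; a stricter pattern would write \.jpg instead. For our file names, both give the same result.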
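
Here is a quick check of how the dot and the dollar sign behave, comparing the ending as written in the RegEx with a variant that escapes the dot. The names loose and strict and the two made-up file names are mine, purely for illustration.

```python
import re

loose = re.compile(r'.jpg$')    # '.' matches any single character
strict = re.compile(r'\.jpg$')  # '\.' matches only a literal dot

for name in ['Bombay_151.jpg', 'Bombay_151xjpg', 'Bombay_151.jpg.bak']:
    print(name, bool(loose.search(name)), bool(strict.search(name)))
# prints:
# Bombay_151.jpg True True
# Bombay_151xjpg True False
# Bombay_151.jpg.bak False False
```

The last file name fails both checks because the dollar sign insists that ‘jpg’ sits at the very end.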

/([^/]+)_\d+.jpg$
The RegEx \d refers to numerical digits, and the plus sign (+) that comes after it means that there must be at least one digit, but there may be arbitrarily many. This looks for the numerical ID of the images (that we are not interested in).
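
As a small sketch of \d+ on its own (the second file name is made up to show a case with no digits):

```python
import re

# \d+ picks out each run of one or more digits.
digits = re.findall(r'\d+', 'american_bulldog_146.jpg')
print(digits)  # prints: ['146']

# With no digits at all, there is nothing to match.
print(re.search(r'\d+', 'Bombay.jpg'))  # prints: None
```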

/([^/]+)_\d+.jpg$
The underscore (_) here matches a literal underscore, so we expect to see an underscore right before the set of digits.

/([^/]+)_\d+.jpg$
The round brackets here mean that we are now defining a group of characters, which we can retrieve on its own later. Within the group, we define a set of characters with the square brackets notation []. This set accepts every possible character except the forward slash (/); this is defined by ^/ , where ^ has a ‘negation’ effect. Like before, the plus sign (+) means that there must be one or more such characters (that are not forward slashes). To summarise the above, ([^/]+) is looking for a group of characters that does not contain forward slashes.
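
To see the group and the negated set in isolation, here is a minimal sketch that runs ([^/]+) by itself over one of our paths. On its own it simply grabs each slash-free chunk; the rest of the RegEx is what narrows it down to the label.

```python
import re

path = 'data/oxford-iiit-pet/images/Bombay_151.jpg'

# search() stops at the first match: the first run of non-slash characters.
first = re.search(r'([^/]+)', path)
print(first.group(1))  # prints: data

# findall() returns every slash-free chunk in the path.
parts = re.findall(r'[^/]+', path)
print(parts)
# prints: ['data', 'oxford-iiit-pet', 'images', 'Bombay_151.jpg']
```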

/([^/]+)_\d+.jpg$
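The forward slash here matches a literal forward slash. Since the group that follows it cannot contain any forward slashes, the match has to begin at the last forward slash in the path. In other words, our search ends the moment we hit a forward slash (assuming we start from the very end).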

Now that we know all that, we can work out that by using ‘/([^/]+)_\d+.jpg$’ to search ‘data/oxford-iiit-pet/images/american_bulldog_146.jpg’, we would get a match of ‘/american_bulldog_146.jpg’, the file name without its folders, which is what we want!
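
This can be checked with Python's re module. The sketch below prints the whole match, and also shows that a path which does not fit the RegEx gives no match at all (the notes.txt path is made up for the example).

```python
import re

pat = r'/([^/]+)_\d+.jpg$'

match = re.search(pat, 'data/oxford-iiit-pet/images/american_bulldog_146.jpg')
print(match.group(0))  # prints: /american_bulldog_146.jpg

# A file that lacks the '_<digits>.jpg' ending does not match.
no_match = re.search(pat, 'data/oxford-iiit-pet/images/notes.txt')
print(no_match)  # prints: None
```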

We can also identify the label of a particular image, since the group of characters that we previously defined (the one that doesn’t contain forward slashes) is essentially the name of the label of an image. All of the above can be done in Python with the following code.

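import re

# One of the example paths from the dataset
path = 'data/oxford-iiit-pet/images/american_bulldog_146.jpg'
# The same RegEx as above; group 1 is the label in round brackets
pat = re.compile(r'/([^/]+)_\d+.jpg$')
# search() scans the whole path for a match and group(1) returns the label
print(pat.search(path).group(1))
# prints: american_bulldog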
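
Going one step further, here is a sketch that applies the same RegEx to all five example paths from the start of this post. The function name extract_label is my own, and returning None for a path that does not match is just one possible choice.

```python
import re

PAT = re.compile(r'/([^/]+)_\d+.jpg$')

def extract_label(path):  # hypothetical helper name
    """Return the label in the file name, or None if the path does not match."""
    match = PAT.search(path)
    return match.group(1) if match else None

paths = [
    'data/oxford-iiit-pet/images/american_bulldog_146.jpg',
    'data/oxford-iiit-pet/images/german_shorthaired_137.jpg',
    'data/oxford-iiit-pet/images/japanese_chin_139.jpg',
    'data/oxford-iiit-pet/images/great_pyrenees_121.jpg',
    'data/oxford-iiit-pet/images/Bombay_151.jpg',
]

labels = [extract_label(p) for p in paths]
print(labels)
# prints: ['american_bulldog', 'german_shorthaired', 'japanese_chin', 'great_pyrenees', 'Bombay']
```

Checking for a missing match matters because search() returns None when nothing matches, and calling group(1) on None would raise an error.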

Conclusion

Regular expressions are one of those things that will no doubt make life easier for Data Scientists and Machine Learning practitioners. Although they initially seemed scary, I’m glad I put in the effort to learn what they actually mean. Thanks for reading!