So you don’t want to use regex…

For one of my General Assembly projects, a web scraper I wrote about earlier, I had to extract a salary number from a string. This sounds like a simple enough Python task, especially if you know regular expressions, but at the time I was very reluctant to learn them.

So instead I created a complicated but effective function.
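The original snippet isn't reproduced here, so what follows is only a sketch of what such a function might look like, pieced together from the walkthrough below. The name `extract_salary` and the `None` fallback for strings with no numbers are my own assumptions; `t` and `l` match the names used later in this post.

```python
def extract_salary(text):
    """Pull a salary figure out of a string like '$75,000 - $80,000 per year'."""
    l = []  # collects every piece of the string that parses as a number
    for t in text.split():
        # Drop the dollar sign and the thousands separator.
        t = t.replace("$", "").replace(",", "")
        try:
            l.append(float(t))
        except ValueError:
            # Words like 'per' and 'year' can't become floats, so skip them.
            pass
    if len(l) == 2:
        # A range: return the midpoint.
        return (l[0] + l[1]) / 2
    elif len(l) == 1:
        # A single figure: return it as-is.
        return l[0]
    # Assumption: anything else (no numbers, or more than two) gives None.
    return None


print(extract_salary("$75,000 - $80,000 per year"))  # 77500.0
print(extract_salary("$75,000 per year"))            # 75000.0
```

On a pandas column this could be applied with something like `df["salary"].apply(extract_salary)`, where the column name `salary` is hypothetical.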

Let’s break this down line by line.

The first step takes a string value from a pandas DataFrame and splits it into a tuple of the individual words in the string, separated by spaces.

‘$75,000 - $80,000 per year’ → (‘$75,000’, ‘-’, ‘$80,000’, ‘per’, ‘year’)
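Here is a small sketch of that split on its own. `str.split()` actually returns a list, so the `tuple(...)` call is an assumption on my part to match the description; the variable names `raw` and `words` are made up for the example.

```python
raw = "$75,000 - $80,000 per year"

# split() with no arguments breaks the string on any run of whitespace.
words = tuple(raw.split())

print(words)  # ('$75,000', '-', '$80,000', 'per', 'year')
```

Note that the dash between the two figures comes out as an item of its own.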

Now we loop over each item in the tuple.

Inside the loop, we remove the ‘$’ and ‘,’ from the item by replacing them with an empty string.
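A minimal sketch of the loop and the clean-up step, assuming the words have already been split out. The `cleaned` list is only there so the result can be printed; it is not part of the original function.

```python
words = ("$75,000", "-", "$80,000", "per", "year")

cleaned = []
for t in words:
    # Chain two replace() calls: one for the dollar sign, one for the comma.
    t = t.replace("$", "").replace(",", "")
    cleaned.append(t)

print(cleaned)  # ['75000', '-', '80000', 'per', 'year']
```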

The next part of the function is the one that actually does the magic.

First, we try to append the item (t) that resulted from the split to a list (l), but as a float value. Since only numeric strings (‘75000’, ‘80000’, etc.) can be converted to floats, we’ll get a ValueError when it tries to append ‘per’ or ‘year’ as a float. The except and pass statements tell Python to skip over that item if it raises a ValueError. This lets our function run without interruption and produce a list of just the numbers.
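As a rough illustration, here is the try/except step by itself, run on items that have already had their ‘$’ and ‘,’ removed. The input list `cleaned` is invented for the example.

```python
cleaned = ["75000", "-", "80000", "per", "year"]

l = []
for t in cleaned:
    try:
        # float() raises ValueError for anything that isn't a numeric string.
        l.append(float(t))
    except ValueError:
        # Not a number, so move on to the next item.
        pass

print(l)  # [75000.0, 80000.0]
```

Catching only `ValueError`, rather than using a bare `except`, means any other kind of error would still show up instead of being silently swallowed.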

The salary data I was extracting sometimes had a single number, like $75,000, and sometimes had a range. This last section checks whether the list holds two numbers or one, then returns the mean of the two if it’s a range, or just the first item if there’s only one.
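That final check might look something like the sketch below. I've pulled it out into a helper called `salary_from_numbers` purely for illustration, and the `None` return for any other list length is an assumption, since the post only covers the one-number and two-number cases.

```python
def salary_from_numbers(l):
    """Collapse a list of one or two salary figures into a single number."""
    if len(l) == 2:
        # A range: take the mean of the low and high ends.
        return (l[0] + l[1]) / 2
    elif len(l) == 1:
        # A single figure: return it unchanged.
        return l[0]
    # Assumption: empty or longer lists are treated as "no usable salary".
    return None


print(salary_from_numbers([75000.0, 80000.0]))  # 77500.0
print(salary_from_numbers([75000.0]))           # 75000.0
```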

It’s not nearly as elegant as using regex, but it still works, and isn’t that what matters in the end?

(Sometimes, I guess?)