Have messy text data? Clean it with simple lambda functions.

One of the most exciting things about data science is its versatility. You can do projects about anything that interests you. My latest project is a really fun one: I’m training a model to predict the era in which a poem was written based on its style and word usage.

To build the model, I had to get poems. It was surprisingly difficult to collect a suitable corpus of poetry, so in the end, I built a web scraper to collect them from a poetry website. It worked well, but gave me very messy text. Normally, I’d remove the HTML tags in the web scraper, but in this situation, that would have smushed the lines of my poems together and caused me to lose some words. So I decided to clean it myself rather than suffer that loss.

The process ended up being much less difficult than I expected. Using a combination of the convenient pandas .apply() method, which applies a function to every value in a column of your DataFrame, and some lambda functions, I was able to clean all the text quickly and efficiently.

A lambda function lets you quickly define an anonymous function to perform some task. This can be very helpful for on-the-fly operations that you only need to do once. By calling .apply() on the column and passing it a lambda function, you can quickly transform every row in the column.
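To make the idea concrete, here is a minimal sketch showing that a lambda passed to .apply() does the same job as a named function. The toy DataFrame and the count_words helper are made up for illustration; only the ‘poem’ column name comes from this post.

```python
import pandas as pd

# Hypothetical toy data; the column name "poem" mirrors the one used in this post.
df = pd.DataFrame({"poem": ["Shall I compare thee", "To a summer's day?"]})

# A named function that counts the words in one poem.
def count_words(text):
    return len(text.split())

# Passing the named function and passing an equivalent lambda give the same result.
named_result = df.poem.apply(count_words)
lambda_result = df.poem.apply(lambda x: len(x.split()))

print(lambda_result.tolist())
```

The lambda version saves you from naming a function you will only ever use in that one line.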

This is what my DataFrame looked like after a bunch of operations I did to remove all the HTML tags:

Astronomers are nomads.

Each row contains the whole text of the poem, but when the DataFrame prints, it only shows the beginning. You can see that some words in the poems are capitalized, and there are some commas. When analyzing text, you most often want to make everything lowercase, because to a computer, the same word with different capitalization is a different word. So we’ll make everything lowercase and remove the punctuation and the numbers.
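Here is a small sketch of why capitalization matters when you count words. The sample line is invented for illustration and is not from my scraped poems.

```python
from collections import Counter

# Hypothetical sample line, not taken from the scraped poems.
line = "The rose is red, the violet blue. The end"

# Count whitespace-separated tokens before and after lowercasing.
raw_counts = Counter(line.split())
lower_counts = Counter(line.lower().split())

# Before lowercasing, "The" and "the" are tallied as two different words.
print(raw_counts["The"], raw_counts["the"])
# After lowercasing, they are merged into a single entry.
print(lower_counts["the"])
```

The same thing happens with punctuation: in this sample, the token "red," (with its comma) would be counted separately from a plain "red".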

First, to make everything lowercase, I used this:

df.poem = df.poem.apply(lambda x: x.lower())

The apply() method performs the specified operation on the entire poem column. The lambda function says: for every value ‘x’ in the column, call x.lower(), which returns a lowercase copy of the string. Here’s the result:

Winter comes!

Great! But we still have the punctuation and numbers. For these, we’ll have to do something a bit different. Python’s built-in ‘str’ type has a .translate() method, which can replace or delete characters based on a translation table you feed it. In this situation, I’ll use it to delete all the punctuation and then the numbers. To delete the punctuation in Python 3, use this line of code (after ‘import string’):

df.poem = df.poem.apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))

string.punctuation is a string containing all the ASCII punctuation characters. str.maketrans() builds the translation table: the two empty strings mean that no characters are swapped for others, and the third argument lists the characters to delete. (In Python 2, the equivalent call was x.translate(None, string.punctuation), where ‘None’ meant that no translation table was supplied.) This line deletes any punctuation characters from each poem in the entire ‘poem’ column. Here is what it looks like after that step:

Fareweel ye bughts!
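Before running a deletion like this on a whole column, it can help to try it on a single string. The sketch below uses the Python 3 form, where str.translate() takes one table built by str.maketrans(). The sample strings are invented for illustration.

```python
import string

# str.maketrans(x, y, z): every character in z is mapped to None, i.e. deleted.
delete_punct = str.maketrans("", "", string.punctuation)

# Hypothetical sample text, not taken from the scraped poems.
sample = "o captain! my captain! our fearful trip is done,"
cleaned = sample.translate(delete_punct)
print(cleaned)

# string.punctuation only covers ASCII, so a curly apostrophe survives.
curly = "it’s".translate(delete_punct)
print(curly)
```

Scraped web text often contains curly quotes and other non-ASCII punctuation, so if those matter for your data, you would need to add them to the deletion string yourself.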

Excellent! Now we do the same process to remove numbers, this time deleting the characters in string.digits:

df.poem = df.poem.apply(lambda x: x.translate(str.maketrans('', '', string.digits)))

And here it is:

This is the final datum.

With three simple lines of code, we transform the poems from their regular form, with capital letters, punctuation, and numbers, to pure lowercase text with no other characters. Here are all the lines together:

import string  # provides string.punctuation and string.digits

df.poem = df.poem.apply(lambda x: x.lower())
df.poem = df.poem.apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
df.poem = df.poem.apply(lambda x: x.translate(str.maketrans('', '', string.digits)))
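If you would rather make one pass over the column, the three steps can be folded into a single function. This is only a sketch in Python 3 syntax: the clean_text name and the toy DataFrame are my own inventions for illustration, and the table is built once up front instead of once per row.

```python
import string

import pandas as pd

# Hypothetical toy data standing in for the scraped poems.
df = pd.DataFrame({"poem": ["Winter comes, 1 and all!", "2 Roads diverged..."]})

# One translation table that deletes ASCII punctuation and digits together.
delete_chars = str.maketrans("", "", string.punctuation + string.digits)

def clean_text(text):
    """Lowercase the text, then drop ASCII punctuation and digits."""
    return text.lower().translate(delete_chars)

df["poem"] = df["poem"].apply(clean_text)
print(df.poem.tolist())
```

Note that deleting characters can leave doubled or leading spaces behind. If that matters for your analysis, " ".join(text.split()) is one way to normalize the whitespace afterwards.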

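As a final note, this approach is fast. .apply() still calls the lambda once per row, but str.lower() and str.translate() are built-in string methods (implemented in C in CPython), so even large data sets clean up quickly. Happy munging!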