Photo by Bench Accounting on Unsplash

Python — Cleaning Messy Text Data With Lambda Functions

Chaim Gluck
4 min read · Sep 5, 2017


One of the most exciting things about data science is its versatility. You can do projects about anything that interests you, and my latest project is a really fun one: I’m training a model to predict the era in which a poem was written based on its style and word usage.

To build the model, I had to get poems. It was surprisingly difficult to collect a suitable corpus of poetry, so in the end I built a web scraper to collect them from a poetry website. It worked well, but it gave me very messy text. Normally I’d remove the HTML tags in the scraper itself, but in this situation that would have smushed the lines of my poems together, fusing the last word of one line with the first word of the next and costing me some words. So I decided to clean the text myself rather than suffer that loss.
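To make the smushing problem concrete, here is a minimal sketch using Python's built-in `re` module. The HTML snippet and the tag-matching regular expression are illustrative assumptions, not the actual markup from the poetry site.

```python
import re

# Hypothetical scraped snippet: poem lines separated only by <br> tags.
raw = "Shall I compare thee<br>to a summer's day?"

# Deleting tags outright fuses the last word of one line
# with the first word of the next.
smushed = re.sub(r"<[^>]+>", "", raw)

# Replacing each tag with a space keeps the words separate.
spaced = re.sub(r"<[^>]+>", " ", raw)

print(smushed)  # Shall I compare theeto a summer's day?
print(spaced)   # Shall I compare thee to a summer's day?
```

In the first result, "thee" and "to" merge into "theeto", which a word-based model would treat as a single unknown token.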

The process ended up being much less difficult than I expected. Using a combination of the convenient pandas .apply() method, which applies a function to every value in a column of your DataFrame, and some lambda functions, I was able to clean all the text quickly and efficiently.

A lambda function lets you define a short anonymous function inline to perform some task. This can be very helpful for on-the-fly operations that you only need to do once. By calling .apply() on a column and passing it a lambda function, you can transform every row of that column in a single line. Note that .apply() returns a new column rather than changing the DataFrame in place, so you assign the result back to the column.
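As a rough sketch of the pattern, the snippet below cleans a small made-up DataFrame. The column name `poem`, the sample rows, and the regular expression are all assumptions for illustration; the real scraped text would likely need additional steps.

```python
import re
import pandas as pd

# Hypothetical DataFrame standing in for the scraped poems.
df = pd.DataFrame({
    "poem": [
        "<p>Roses are red,<br>violets are blue</p>",
        "<div>The  fog comes<br/>on little cat feet.</div>",
    ]
})

# Replace each HTML tag with a space so adjacent words stay separate.
df["poem"] = df["poem"].apply(lambda text: re.sub(r"<[^>]+>", " ", text))

# Collapse runs of whitespace into single spaces and trim the ends.
df["poem"] = df["poem"].apply(lambda text: " ".join(text.split()))

print(df["poem"].tolist())
```

Each .apply() call runs its lambda once per row and returns a new column, which is assigned back to `df["poem"]`. Keeping one small lambda per cleaning step makes it easy to inspect the text between steps.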


Chaim Gluck

Freelance Data Scientist. Published by Towards Data Science. Say hello at www.linkedin.com/in/chaimgluck