NLP as Data Cleansing for Data Science

Or How NLP Became My New Mop


When I was younger, my mom used to force me to clean. I cleaned my room, the bathrooms, the kitchen. You name it, I cleaned it. I remember thinking, “I can’t wait to have my own place, so I can take a break every now and again from all the cleaning.”

Fast forward 20 years and there I was, sitting at my desk spending countless hours cleaning, though the nature of my cleaning had changed rather significantly. Instead of scrubbing toilets and washing dishes, I found myself cleaning data. Lots and lots of data cleaning.

It is no surprise to those who work in the field when we hear statistics like “60%+ of a data scientist’s time is spent cleaning, wrangling, and engineering data, not modeling it.” The sheer volume of data hasn’t meant that the data somehow got cleaner. No, quite the opposite, actually. As data volumes have increased, so too have the new and interesting problems we find with the quality of that data.

In short, cleaning data is a lifestyle. And no area has experienced this more than Natural Language Processing. Natural language is messy, context specific, prone to human error, and full of hidden meaning and ambiguity. But that challenge hasn’t stopped data scientists from trying to leverage this vast resource to solve a myriad of problems.

Indeed, there are thousands of applications just waiting for the right data scientist to grab the data, CLEAN it, and sanitize it so it becomes more valuable to the end user. Take the internet, for example: it is probably the single largest source of natural language data ever collected. Its depth means there is hidden value just waiting to be unlocked with the right set of NLP tools and given to the right users.

Over the years of working with NLP toolkits, I have learned a few tricks for more quickly extracting meaning from natural language data with some useful cleaning tools. Here is a list of my most valuable tricks, each illustrated with a short code sketch after the list:

1. Basic cleaning with NLTK:

a. Whitespace stripping

b. Stopword removal

c. Punctuation and special character removal

d. Lemmatization

2. Extracting useful information

a. Identifying useful bigrams with PMI collocation measures (see the PMI snippet after this list)

b. Extracting nouns for topic analysis

3. More advanced

a. Topic modeling with LDA, NMF, and Graph Community Detection

b. BERTopic Modeling

c. Word embeddings

d. Entity Detection
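
To make step 1 concrete, here is a minimal sketch of the kind of NLTK cleaning pipeline described above. The sample text, the regex choices, and the use of the WordNet lemmatizer are my own illustrative assumptions, not a one-size-fits-all recipe.

```python
# A minimal NLTK cleaning sketch: whitespace stripping, character removal,
# stopword removal, and lemmatization. The sample text is illustrative.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

text = "  Cleaning data is a lifestyle!!  NLP became my new mop...  "

# Whitespace stripping: collapse runs of whitespace and trim the ends.
text = re.sub(r"\s+", " ", text).strip()

# Character removal: keep only letters and spaces (an illustrative choice).
text = re.sub(r"[^a-z ]", "", text.lower())

# Stopword removal: drop common English words that carry little meaning.
stop_words = set(stopwords.words("english"))
tokens = [t for t in nltk.word_tokenize(text) if t not in stop_words]

# Lemmatization: reduce each word to its dictionary form.
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(t) for t in tokens]

print(tokens)  # e.g. ['cleaning', 'data', 'lifestyle', 'nlp', 'became', 'new', 'mop']
```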
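Here is the PMI collocation snippet promised in 2a. It uses NLTK’s built-in collocation tools; the toy corpus and the frequency filter threshold are illustrative.

```python
# Finding useful bigrams with PMI (pointwise mutual information) using
# NLTK's collocation tools.
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download("punkt")

corpus = (
    "natural language processing helps clean natural language data "
    "natural language processing turns messy text into useful features"
)
tokens = nltk.word_tokenize(corpus)

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)

# Ignore rare pairs that only happen to co-occur once.
finder.apply_freq_filter(2)

# Rank candidate bigrams by PMI and keep the top few.
print(finder.nbest(bigram_measures.pmi, 5))
```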
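For 2b, a short sketch of noun extraction with NLTK’s part-of-speech tagger; the example sentence and the decision to keep every NN* tag are assumptions on my part.

```python
# Extracting nouns for topic analysis with NLTK part-of-speech tagging.
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

text = "Data cleaning is a lifestyle and NLP became my new mop"
tokens = nltk.word_tokenize(text)

# Tag each token, then keep tags that start with 'NN' (all noun variants).
tagged = nltk.pos_tag(tokens)
nouns = [word for word, tag in tagged if tag.startswith("NN")]
print(nouns)  # e.g. ['Data', 'cleaning', 'lifestyle', 'NLP', 'mop']
```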
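For 3a, here is one way to sketch LDA topic modeling with scikit-learn. NMF is a near drop-in swap (substitute NMF and a TF-IDF vectorizer), while graph community detection needs a different toolkit entirely. The tiny corpus and parameter choices are illustrative.

```python
# A sketch of LDA topic modeling with scikit-learn.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "cleaning data takes most of a data scientist's time",
    "nlp tools help clean messy natural language data",
    "topic models summarize large text collections",
    "lda and nmf are classic topic modeling methods",
]

# Turn the documents into a document-term count matrix.
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)

# Fit a two-topic LDA model.
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(dtm)

# Show the top words for each topic.
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {idx}: {top}")
```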
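For 3b, a minimal BERTopic sketch (requires `pip install bertopic`). BERTopic handles embedding, clustering, and keyword extraction in one call; the repeated toy documents are just there to give the clustering something to work with.

```python
# A minimal BERTopic sketch: embed documents, cluster them, extract topics.
from bertopic import BERTopic

docs = [
    "cleaning data takes most of a data scientist's time",
    "nlp tools help clean messy natural language data",
    "topic models summarize large text collections",
    "transformer embeddings capture the meaning of sentences",
] * 25  # a toy corpus; BERTopic needs a reasonable number of documents

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Inspect the discovered topics and their top keywords.
print(topic_model.get_topic_info())
```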
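For 3c, a quick gensim Word2Vec sketch. The toy sentences and hyperparameters are illustrative; useful embeddings need far more text.

```python
# Training word embeddings with gensim's Word2Vec on a toy corpus.
from gensim.models import Word2Vec

sentences = [
    ["cleaning", "data", "is", "a", "lifestyle"],
    ["nlp", "tools", "help", "clean", "messy", "data"],
    ["messy", "data", "needs", "cleaning", "tools"],
]

# vector_size is the embedding dimension; min_count=1 keeps every word.
model = Word2Vec(sentences=sentences, vector_size=50, window=3, min_count=1)

# Look up the words most similar to "data" in the learned embedding space.
print(model.wv.most_similar("data", topn=3))
```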
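And for 3d, entity detection with spaCy’s small English model (requires `pip install spacy` and `python -m spacy download en_core_web_sm`); the example sentence is my own.

```python
# Named entity detection with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Google released BERT in 2018, and researchers in California adopted it quickly.")

# Each detected entity carries its text span and a label like ORG or DATE.
for ent in doc.ents:
    print(ent.text, ent.label_)
```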

Enjoy reading about data science, career growth, life, or poor business decisions? Sign up for my newsletter here and get a link to my free ebook.
