1 Line of Code is all you need for Text Preprocessing

Kaleem Ullah Qasim
3 min read · Dec 26, 2021

Let’s be honest: we all love to see how our NLP model performs on text :)

Text preprocessing is critical for the accuracy of the results and for overall model performance. Unprocessed, noisy text data tends to lower the precision of machine learning models.

I recently worked on a text dataset of user reviews from 11 countries. I needed to preprocess the data to improve the model's predictions when classifying the text into different label categories.

I wasn’t planning to spend much time preprocessing data and manually writing Python functions to remove stopwords, HTML tags, brackets, and so on. After digging into PyPI, I found an incredible Python package that does all of these tasks in a few lines of code.

Texthero

Texthero is a Python package for rapidly working with text data. It offers NLP developers a tool that allows them to quickly analyze any text-based dataset, as well as a robust pipeline for cleaning and representing text data.

Below is a simple use case on COVID-19 data from Kaggle.

Install and import Texthero, then read the CSV
Dataframe with noisy text

The One-Liner

Preprocess with one line
Output

Pipelines Of Texthero

Texthero is designed from the ground up to work with Pandas DataFrames and Series; that is one of its central principles.

The following is the clean method’s default pipeline:

  1. fillna(s) Replace missing (unassigned) values with empty strings.
  2. lowercase(s) Lowercase all text.
  3. remove_digits() Remove all digits.
  4. remove_punctuation() Remove all string punctuation (!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~).
  5. remove_diacritics() Remove all accents from the text.
  6. remove_stopwords() Remove all stop words.
  7. remove_whitespace() Remove extra white space between words.
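To make the seven steps concrete, here is a rough plain-Python sketch of the same sequence applied to a single string. It is not Texthero's implementation: the `clean_text` function and the tiny `STOPWORDS` set are made up for illustration, and each step is simplified (for example, it strips every digit and uses only a handful of stop words).

```python
import re
import string
import unicodedata

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOPWORDS = {"the", "is", "a", "of", "and", "in", "to"}

# Maps every punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text):
    """Apply the seven default steps, in order, to one string (simplified)."""
    # 1. fillna: treat a missing value as an empty string.
    if text is None:
        text = ""
    # 2. lowercase
    text = text.lower()
    # 3. remove_digits: replace runs of digits with a space.
    text = re.sub(r"\d+", " ", text)
    # 4. remove_punctuation: replace each punctuation character with a space.
    text = text.translate(PUNCT_TO_SPACE)
    # 5. remove_diacritics: decompose accented characters, drop the accent marks.
    text = "".join(
        ch for ch in unicodedata.normalize("NFKD", text)
        if not unicodedata.combining(ch)
    )
    # 6. remove_stopwords: drop words found in the stop word set.
    words = [word for word in text.split() if word not in STOPWORDS]
    # 7. remove_whitespace: joining on single spaces collapses extra white space.
    return " ".join(words)


print(clean_text("The COVID-19 café is OPEN, in 2021!!"))  # covid cafe open
```

The order matters: lowercasing first lets the stop word check match "The", and white space is collapsed last because the earlier steps leave gaps behind.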

We can also pass a custom pipeline as an argument to clean:

Custom Pipeline
Custom Pipeline Output
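The idea behind a custom pipeline is simply an ordered list of functions, each applied to the output of the previous one. The sketch below shows that idea in plain Python, without Texthero. The names `lowercase`, `collapse_whitespace` and `run_pipeline` are hypothetical helpers written for this example. In Texthero itself you would pass a list of its own preprocessing functions to clean instead; check the documentation for the exact argument name.

```python
from functools import reduce


def lowercase(text):
    """Step: convert the text to lower case."""
    return text.lower()


def collapse_whitespace(text):
    """Step: trim the ends and squeeze runs of white space into one space."""
    return " ".join(text.split())


def run_pipeline(texts, pipeline):
    """Apply each step in `pipeline`, in order, to every string in `texts`."""
    return [reduce(lambda t, step: step(t), pipeline, text) for text in texts]


# Only the steps we care about, in the order we want them applied.
custom_pipeline = [lowercase, collapse_whitespace]

cleaned = run_pipeline(["  Hello   WORLD ", "Text\tHERO"], custom_pipeline)
print(cleaned)  # ['hello world', 'text hero']
```

Because the pipeline is just a list, adding, removing, or reordering steps is a one-line change.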

You can tailor Texthero to your requirements, and it will do exactly what you ask of it.

Other Features

Besides preprocessing text data, Texthero also has other useful features.
You can do data visualisation, keyphrase and keyword extraction, named entity recognition, TF-IDF, term frequency, custom word embeddings, clustering (K-means, Meanshift, DBSCAN and Hierarchical), topic modelling and interpretation. The package aims to let you do the heavy lifting in a few lines of code. In this story, I only discuss the preprocessing part, as the other features are beyond its scope.
You can find the Texthero documentation at https://pypi.org/project/texthero/.

Topic Modeling with Texthero https://pypi.org/project/texthero/

Limitation

Texthero has some limitations: it only works well on English data, although the developers of the package are working on support for other languages.
If you speak another language natively, you can contribute to the development of the package by opening an issue on GitHub.

Thanks For Reading, Follow Me For More

Kaleem Ullah Qasim

NLP Researcher at SWJTU Chengdu China | Love Python Automation | Working on Business Intelligence Systems