Introducing NLPretext, a unified framework to facilitate text preprocessing

Amale Elhamri
Artefact Engineering and Data Science
3 min read · Feb 22, 2021

Working on NLP projects? Tired of always looking for the same silly preprocessing functions on the web, such as removing accents from French posts? Tired of spending hours on regex to efficiently extract email addresses from a corpus? NLPretext has got you covered!

NLPretext overview

NLPretext is composed of four modules: basic, social, token and augmentation. Each one provides functions to handle the most important text preprocessing tasks.

Basic preprocessing

The basic module is a catalogue of general-purpose functions that can be used in any use case. They allow you to handle:

  • stray whitespace and end-of-line characters
  • encoding issues
  • special characters such as currency symbols, numbers, punctuation marks, Latin and non-Latin characters
  • emails and phone numbers
from nlpretext.basic.preprocess import replace_emails
example = "I have forwarded this email to obama@whitehouse.gov"
example = replace_emails(example, replace_with="*EMAIL*")
print(example)
# "I have forwarded this email to *EMAIL*"
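
The accent-stripping case from the introduction is handled by the same module. Here is a minimal sketch, assuming the remove_accents helper of the basic module (check the documentation for the exact signature):
from nlpretext.basic.preprocess import remove_accents
example = "Ce café est très sympa, j'y retournerai à Noël"
# accented characters are replaced with their unaccented equivalents
example = remove_accents(example)
print(example)
# "Ce cafe est tres sympa, j'y retournerai a Noel"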

Social preprocessing

The social module is a catalogue of handy functions for processing social media data, such as:

  • hashtags extraction/removal
  • emojis extraction/removal
  • mentions extraction/removal
  • HTML tag cleaning
from nlpretext.social.preprocess import extract_emojis
example = "I take care of my skin 😀"
example = extract_emojis(example)
print(example)
# [':grinning_face:']
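
Extraction works the same way for hashtags. A short sketch, assuming an extract_hashtags counterpart to the emoji helper above (the exact return format may differ):
from nlpretext.social.preprocess import extract_hashtags
example = "Best brunch ever #food #paris"
example = extract_hashtags(example)
print(example)
# the hashtags found in the text, e.g. ['#food', '#paris']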

Token preprocessing

The token module helps you clean your text at the token level. First, load a tokenizer to split your sentences into tokens. Then you can:

  • remove stopwords
  • remove small words
  • remove tokens with special characters
from nlpretext.token.preprocess import remove_stopwords
example = ["I", "like", "when", "you", "move", "your", "body"]
example = remove_stopwords(example, lang="en")
print(example)
# ['I', 'move', 'body']
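
The other token-level cleanups follow the same pattern. Here is a hedged sketch for small-word removal, assuming a remove_smallwords helper that takes a minimum-length threshold (the parameter name is an assumption, refer to the documentation):
from nlpretext.token.preprocess import remove_smallwords
example = ["I", "like", "when", "you", "move", "your", "body"]
# drop tokens shorter than the given threshold (name assumed here)
example = remove_smallwords(example, smallwords_threshold=2)
print(example)
# tokens of fewer than 2 characters, such as "I", are removed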

Text augmentation

The augmentation module helps you generate new texts from your existing examples by modifying some of the words while keeping any associated entities unchanged, which is useful for NER tasks. If you want words other than entities to remain unchanged, you can list them in the stopwords argument. The modifications depend on the chosen method; the methods currently supported by the module are substitutions with synonyms, using either WordNet or BERT from the nlpaug library.

from nlpretext.augmentation.text_augmentation import augment_text
example = "I want to buy a small black handbag please."
entities = [{'entity': 'Color', 'word': 'black', 'startCharIndex': 22, 'endCharIndex': 27}]
example = augment_text(example, method="wordnet_synonym", entities=entities)
print(example)
# "I need to buy a small black pocketbook please."
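
To keep additional words untouched, pass them through the stopwords argument mentioned above. A sketch under the assumption that stopwords accepts a plain list of words to exclude from substitution:
from nlpretext.augmentation.text_augmentation import augment_text
example = "I want to buy a small black handbag please."
entities = [{'entity': 'Color', 'word': 'black', 'startCharIndex': 22, 'endCharIndex': 27}]
# "handbag" should not be replaced by a synonym (stopwords usage assumed)
example = augment_text(example, method="wordnet_synonym", stopwords=["handbag"], entities=entities)
print(example)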

Create your end-to-end pipeline

Default pipeline

Our library provides a Preprocessor object to efficiently pipe all preprocessing operations.

If you need to keep all the elements of your text and perform only minimal cleaning, use the default pipeline. It normalizes whitespace, removes newline characters, fixes Unicode problems and removes recurrent artifacts from social data such as mentions, hashtags and HTML tags.

from nlpretext import Preprocessor
text = "I just got the best dinner in my life @latourdargent !!! I  recommend 😀 #food #paris \n"
preprocessor = Preprocessor()
text = preprocessor.run(text)
print(text)
# "I just got the best dinner in my life !!! I recommend"

Custom pipeline

If you already know which preprocessing functions you want to chain, you can add them to your own Preprocessor.

from nlpretext import Preprocessor
from nlpretext.basic.preprocess import (normalize_whitespace, remove_punct, remove_eol_characters, remove_stopwords, lower_text)
from nlpretext.social.preprocess import remove_mentions, remove_hashtag, remove_emoji
text = "I just got the best dinner in my life @latourdargent !!! I recommend 😀 #food #paris \n"preprocessor = Preprocessor()
preprocessor.pipe(lower_text)
preprocessor.pipe(remove_mentions)
preprocessor.pipe(remove_hashtag)
preprocessor.pipe(remove_emoji)
preprocessor.pipe(remove_eol_characters)
preprocessor.pipe(remove_stopwords, args={'lang': 'en'})
preprocessor.pipe(remove_punct)
preprocessor.pipe(normalize_whitespace)
text = preprocessor.run(text)
print(text)
# "dinner life recommend"

NLPretext installation

To install the library, run:

pip install nlpretext

You can find the GitHub repository here and the library documentation here.

Co-authors:

Rafaëlle Aygalenq, Bruce Delattre, Amale El Hamri, Hugo Vasselin
