Speeding up text pre-processing using Dask
Text preprocessing is one of the most important and time-consuming steps in NLP. It is also often one of the harder parts of a project.
The performance of your models also depends on the quality of your text pre-processing. So, it becomes important to speed up the pre-processing so that you can iterate quickly!
By the end of this post, you will have sped up your text pre-processing by 2x!
Enter Dask
Dask is a Python library that, among other things, helps you perform operations on DataFrames and lists in parallel. How?
Dask takes your DataFrame or list, splits it into multiple partitions, performs the same operation on each partition in parallel, and then combines the results back together.
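As a rough sketch of that idea, here is the split-apply-combine pattern written out by hand with plain pandas (serially; the example data and `double` function are made up for illustration). Dask's value is that it runs the per-partition calls in parallel and handles the bookkeeping for you:

```python
import pandas as pd

# A serial sketch of what Dask does: split a DataFrame into
# partitions, apply the same function to each, combine the results.
df = pd.DataFrame({"x": range(10)})

def double(part):
    # the per-partition work; Dask would run these calls in parallel
    return part.assign(x=part["x"] * 2)

n = 3
size = -(-len(df) // n)  # ceiling division: rows per partition
partitions = [df.iloc[i * size:(i + 1) * size] for i in range(n)]
combined = pd.concat(double(p) for p in partitions)
print(combined["x"].sum())  # prints 90
```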
Install Dask using the instructions here. For the purpose of this post, you only need `dask[dataframe]`.
Code and See
So, now let us dive into the details of pre-processing with Dask, and compare it with the vanilla implementation.
The code is available here as a Jupyter Notebook.
For the purpose of this post, I will be using 20 Newsgroups Dataset. The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
Plan of Attack
So, how do we approach this? Here’s a plan of attack:
- Load the data using `sklearn`
- Write functions for the pre-processing steps
- Process the text without Dask
- Process the text using Dask
The pre-processing functions we write will be shared between the implementations with and without Dask.
Loading data
`sklearn` provides a method to download and load the dataset.
```python
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups(subset="train")
print("Number of text samples {}".format(len(data.data)))
```

>>> Number of text samples 11314
Let us have a look at some text samples
```python
>>> data.data[0]
"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"
```
A quick look at the text gives us some hints as to what pre-processing we need to do:
- Removing Email Addresses: Since email addresses do not add any usable information, we will do away with them.
- Removing Newline Characters: Newline characters like `\n` and `\t` are not required, since they are just noise.
- Tokenization: We need to tokenize our text into words, because a word is our unit of information. Read more about tokenization here.
- Removing Stop Words: Stop words are very frequently used words that we can do without. Read more about stopwords here.
- Lemmatization: Lemmatization is the process of reducing inflected words to their lemmas. In essence, this reduces a word to its base form, so the phrase `horsing around` will be reduced to `horse around`.
Loading the data into a DataFrame
Now we will load the data into a DataFrame, so that we can visualize the text and perform operations easily.
```python
import pandas as pd

df = pd.DataFrame()
df = df.assign(text=data["data"]).assign(target=data["target"])
df.head()
```
Now that we have loaded the data into a dataframe, let us write the functions we need for the pre-processing.
Removing Email Addresses
We define a function `remove_emails` that takes in a string and removes any email addresses from the text.
```python
import re

def remove_emails(text):
    '''
    Remove any email addresses in the text.
    The `regex` can catch any number of email addresses in the text.
    It can be tried here: https://regex101.com/r/ZjgyLc/2
    '''
    regex = r'\S*@\S*\s?'
    return re.sub(regex, '', text)

test_text = remove_emails(data.data[0])
print(test_text)
```
>>> From: (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n
Removing Newline characters
```python
def remove_newlinechars(text):
    '''
    Substitute any newline characters with a whitespace.
    The `regex` can be tried at: https://regex101.com/r/2fImPz/1/
    '''
    regex = r'\s+'
    return re.sub(regex, ' ', text)

test_text = remove_newlinechars(test_text)
print(test_text)
```

>>> From: (where's my thing) Subject: WHAT car is this!? Nntp-Posting-Host: rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: 15 I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail. Thanks, - IL ---- brought to you by your neighborhood Lerxst ----
Tokenization
We break the sentence down into word tokens, and remove any token that is not alphanumeric (e.g. `-`, `(`).
```python
import nltk

def tokenize(text):
    '''
    Tokenize text into words.
    '''
    tokens = nltk.word_tokenize(text)
    return list(
        filter(lambda word: word.isalnum(), tokens)
    )

test_text = tokenize(test_text)
print(test_text)
```

>>> ['From', 'where', 'my', 'thing', 'Subject', 'WHAT', 'car', 'is', 'this', 'Organization', 'University', 'of', 'Maryland', 'College', 'Park', 'Lines', '15', 'I', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'I', 'saw', 'the', 'other', 'day', 'It', 'was', 'a', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', '70s', 'It', 'was', 'called', 'a', 'Bricklin', 'The', 'doors', 'were', 'really', 'small', 'In', 'addition', 'the', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'This', 'is', 'all', 'I', 'know', 'If', 'anyone', 'can', 'tellme', 'a', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'Thanks', 'IL', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'Lerxst']
Removing Stopwords
We extend `nltk`'s stopwords list with common words that we see in the text: `from`, `subject`, `summary`, `keywords`, `article`.
```python
from nltk.corpus import stopwords

stop_words = stopwords.words("english")
## Add some common words from the text
stop_words.extend(["from", "subject", "summary", "keywords", "article"])

def remove_stopwords(words):
    '''
    Remove stop words from the list of words.
    '''
    filtered = filter(lambda word: word not in stop_words, words)
    return list(filtered)

test_text = remove_stopwords(test_text)
print(test_text)
```

>>> ['thing', 'car', 'organization', 'university', 'maryland', 'college', 'park', 'lines', '15', 'wondering', 'anyone', 'could', 'enlighten', 'car', 'saw', 'day', 'sports', 'car', 'looked', 'late', 'early', '70s', 'called', 'bricklin', 'doors', 'really', 'small', 'addition', 'front', 'bumper', 'separate', 'rest', 'body', 'know', 'anyone', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'production', 'car', 'made', 'history', 'whatever', 'info', 'funky', 'looking', 'car', 'please', 'thanks', 'il', 'brought', 'neighborhood', 'lerxst']
Lemmatization
For lemmatization, I will be using `spaCy`.
```python
import spacy

nlp = spacy.load("en_core_web_sm")

def lemmatize(text, nlp=nlp):
    doc = nlp(" ".join(text))
    lemmatized = [token.lemma_ for token in doc]
    return lemmatized

test_text = lemmatize(test_text, nlp)
print(test_text)
```

>>> ['th', 'car', 'organization', 'university', 'maryland', 'college', 'park', 'line', '15', 'wonder', 'anyone', 'could', 'enlighten', 'car', 'saw', 'day', 'sport', 'car', 'look', 'late', 'early', '70', 'call', 'bricklin', 'door', 'really', 'small', 'addition', 'front', 'bumper', 'separate', 'rest', 'body', 'know', 'anyone', 'tellme', 'model', 'name', 'engine', 'spec', 'year', 'production', 'car', 'make', 'history', 'whatev', 'info', 'funky', 'look', 'car', 'please', 'thank', 'il', 'bring', 'neighborhood', 'lerxst']
Processing the text without Dask
Since we wrote our functions to accept text, we can simply chain `map` calls on the text column of the dataframe. While we are at it, let us also measure the time taken to process the text.
```python
import time

t0 = time.time()

def clean_text(df):
    '''
    Take in a dataframe and process its text column.
    '''
    df["cleaned_text"] = (
        df.text.map(lambda text: text.lower())
        .map(remove_emails)
        .map(remove_newlinechars)
        .map(tokenize)
        .map(remove_stopwords)
        .map(lemmatize)
    )
    return df

df = clean_text(df)
t1 = time.time()
print("Time to process without Dask {}".format(t1 - t0))
```

>>> Time to process without Dask 258.16533493995667
This way we are not using parallelization, so there is scope for improvement in processing time. That's where Dask comes in.
Processing text with Dask
So, now is the time to bring in the big guns!
But before we can execute the processing with Dask, we need to do some setup:
- Creating a Dask DataFrame and partitioning it (details in a while)
- Reusing our processing functions with Dask
Creating a Dask DataFrame
Okay, so we have a dataframe, but before Dask can work with it, we need to define a Dask DataFrame: a container for our dataframe that is capable of breaking it down into multiple parts and processing them in parallel.
```python
import dask.dataframe as ddf

dask_dataframe = ddf.from_pandas(df, npartitions=6)
```
We can make a Dask dataframe from an existing pandas dataframe using the `from_pandas` function. We can also define how many partitions/slices of the dataframe we want using the `npartitions` argument.

The `npartitions` parameter determines the number of parts of the dataframe that will be processed in parallel. Each partition will be a pandas dataframe.
Mapping Partitions
On a pandas dataframe, we can use the `map` function directly on the columns. In Dask, we have partitions/slices of the dataframe to work with, so Dask provides a function called `map_partitions`.

`map_partitions` accepts a function and applies it to each partition. There is a slight catch here: the function should accept a dataframe. Unlike `map` in pandas, which is applied to a column, `map_partitions` is applied to a dataframe slice.
We have already implemented the `clean_text` function, and we can use it to map the partitions.
```python
t0 = time.time()
result = dask_dataframe.map_partitions(clean_text, meta=df)
df = result.compute()
t1 = time.time()
print("Time to process with Dask {}".format(t1 - t0))
```

>>> Time to process with Dask 136.15019989013672
That's almost half the time of the plain pandas version: a 2x speedup!
For `npartitions=2`, my laptop took 166 seconds.
It is advisable to set `npartitions` to the CPU count of your machine; instructions for finding it can be found in this SO thread.
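Assuming you want one partition per core (a reasonable default), the standard library can report the core count. Note that `cpu_count()` returns logical cores, which may exceed the number of physical cores:

```python
import multiprocessing

# Number of logical CPU cores; a sensible default for npartitions
n_cores = multiprocessing.cpu_count()
print(n_cores)
```

You could then pass this value as `ddf.from_pandas(df, npartitions=n_cores)`.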
In essence, with a couple of extra lines of code, you can speed up your text pre-processing by a factor of two. This has helped me run experiments and iterate very quickly!
Conclusion
Dask is a very powerful tool for parallelizing processing and other tasks. It can even handle dataframes that do not fit in memory, and it can distribute processing across a cluster of machines!
Comment with the use cases Dask solves for you.
Thanks for reading. So long, and thanks for all the fish!
Follow me on Github, Twitter, LinkedIn.