Speeding up text pre-processing using Dask
Text preprocessing is one of the most important and time-consuming steps in NLP. It is also often one of the harder parts of a project.
The performance of your models also depends on the quality of your text pre-processing. So, it becomes important to speed up the pre-processing so that you can iterate quickly!
By the end of this post, you will have sped up your text pre-processing by 2x!
Enter Dask
Dask is a Python library that, among other things, helps you perform operations on DataFrames and lists in parallel. How?
Dask takes your DataFrame or list, splits it into multiple partitions, performs the same operation on each partition in parallel, and then combines the results back together.
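As a rough sketch of that idea, here is the split-apply-combine pattern written out by hand with plain pandas (serially; the example data and `double` function are made up for illustration). Dask's value is that it runs the per-partition calls in parallel and handles the bookkeeping for you:

```python
import pandas as pd

# A serial sketch of what Dask does: split a DataFrame into
# partitions, apply the same function to each, combine the results.
df = pd.DataFrame({"x": range(10)})

def double(part):
    # the per-partition work; Dask would run these calls in parallel
    return part.assign(x=part["x"] * 2)

n = 3
size = -(-len(df) // n)  # ceiling division: rows per partition
partitions = [df.iloc[i * size:(i + 1) * size] for i in range(n)]
combined = pd.concat(double(p) for p in partitions)
print(combined["x"].sum())  # prints 90
```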
Install Dask using the instructions here. For the purpose of this post, you only need `dask[dataframe]`.
Code and See
So, now let us dive into the details of pre-processing with Dask, and compare it with the vanilla implementation.
The code is available here as a Jupyter Notebook.
For the purpose of this post, I will be using 20 Newsgroups Dataset. The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
Plan of Attack
So, how do we approach this? Here’s a plan of attack:
- Load the data using `sklearn`
- Write functions for the pre-processing steps
- Process the text without Dask
- Process the text using Dask
The pre-processing functions we write will be shared between the implementations with and without Dask.
Loading data
`sklearn` provides a method to download and load the dataset.
```python
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups(subset="train")
print("Number of text samples {}".format(len(data.data)))
```

>>> Number of text samples 11314
Let us have a look at some text samples
```python
>>> data.data[0]
"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"
```
A quick look at the text gives us some hints as to what pre-processing we need to do:
- Removing Email Addresses: Since email addresses do not add any usable information, we will do away with them.
- Removing Newline Characters: Newline characters like `\n` and `\t` are not required, since they are just noise.
- Tokenization: We need to tokenize our text into words, because a word is our unit of information. Read more about tokenization here.
- Removing Stop Words: Stop words are very frequently used words that we can do without. Read more about stopwords here.
- Lemmatization: Lemmatization is the process of reducing inflected words to their lemmas. In essence, this reduces a word to its base form, so the phrase `horsing around` will be reduced to `horse around`.
Loading the data into a DataFrame
Now we will load the data into a DataFrame, so that we can visualize the text and perform operations easily.
```python
import pandas as pd

df = pd.DataFrame()
df = df.assign(text=data["data"]).assign(target=data["target"])
df.head()
```
Now that we have loaded the data into a dataframe, let us write the functions we need for the pre-processing.
Removing Email Addresses
We define a function `remove_emails` that takes in a string and removes any email addresses from the text.
```python
import re

def remove_emails(text):
    '''
    Remove any email addresses in the text.
    The `regex` can catch any number of email addresses in the text.
    It can be tried here: https://regex101.com/r/ZjgyLc/2
    '''
    regex = r'\S*@\S*\s?'
    return re.sub(regex, '', text)

test_text = remove_emails(data.data[0])
print(test_text)
```
>>> From: (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n
Removing Newline characters
```python
def remove_newlinechars(text):
    '''
    Substitute any newline characters with a whitespace.
    The `regex` can be tried at: https://regex101.com/r/2fImPz/1/
    '''
    regex = r'\s+'
    return re.sub(regex, ' ', text)

test_text = remove_newlinechars(test_text)
print(test_text)
```

>>> From: (where's my thing) Subject: WHAT car is this!? Nntp-Posting-Host: rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: 15 I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail. Thanks, - IL ---- brought to you by your neighborhood Lerxst ----
Tokenization
We break the sentence down into word tokens, and remove any token that is not alphanumeric (e.g. `-`, `(`).
```python
import nltk

def tokenize(text):
    '''
    Tokenize text into words.
    '''
    tokens = nltk.word_tokenize(text)
    return list(
        filter(lambda word: word.isalnum(), tokens)
    )

test_text = tokenize(test_text)
print(test_text)
```

>>> ['From', 'where', 'my', 'thing', 'Subject', 'WHAT', 'car', 'is', 'this', 'Organization', 'University', 'of', 'Maryland', 'College', 'Park', 'Lines', '15', 'I', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'I', 'saw', 'the', 'other', 'day', 'It', 'was', 'a', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', '70s', 'It', 'was', 'called', 'a', 'Bricklin', 'The', 'doors', 'were', 'really', 'small', 'In', 'addition', 'the', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'This', 'is', 'all', 'I', 'know', 'If', 'anyone', 'can', 'tellme', 'a', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'Thanks', 'IL', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'Lerxst']
Removing Stopwords
We extend `nltk`'s stopwords list with common words that we see in the text: `from`, `subject`, `summary`, `keywords`, `article`.
```python
from nltk.corpus import stopwords

stop_words = stopwords.words("english")
## Add some common words from the text
stop_words.extend(["from", "subject", "summary", "keywords", "article"])

def remove_stopwords(words):
    '''
    Remove stop words from the list of words.
    '''
    filtered = filter(lambda word: word not in stop_words, words)
    return list(filtered)

test_text = remove_stopwords(test_text)
print(test_text)
```

>>> ['thing', 'car', 'organization', 'university', 'maryland', 'college', 'park', 'lines', '15', 'wondering', 'anyone', 'could', 'enlighten', 'car', 'saw', 'day', 'sports', 'car', 'looked', 'late', 'early', '70s', 'called', 'bricklin', 'doors', 'really', 'small', 'addition', 'front', 'bumper', 'separate', 'rest', 'body', 'know', 'anyone', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'production', 'car', 'made', 'history', 'whatever', 'info', 'funky', 'looking', 'car', 'please', 'thanks', 'il', 'brought', 'neighborhood', 'lerxst']
Lemmatization
For lemmatization, I will be using `spaCy`.
```python
import spacy

nlp = spacy.load("en_core_web_sm")

def lemmatize(text, nlp=nlp):
    doc = nlp(" ".join(text))
    lemmatized = [token.lemma_ for token in doc]
    return lemmatized

test_text = lemmatize(test_text, nlp)
print(test_text)
```

>>> ['th', 'car', 'organization', 'university', 'maryland', 'college', 'park', 'line', '15', 'wonder', 'anyone', 'could', 'enlighten', 'car', 'saw', 'day', 'sport', 'car', 'look', 'late', 'early', '70', 'call', 'bricklin', 'door', 'really', 'small', 'addition', 'front', 'bumper', 'separate', 'rest', 'body', 'know', 'anyone', 'tellme', 'model', 'name', 'engine', 'spec', 'year', 'production', 'car', 'make', 'history', 'whatev', 'info', 'funky', 'look', 'car', 'please', 'thank', 'il', 'bring', 'neighborhood', 'lerxst']
Processing the text without Dask
Since we wrote our functions to accept text, we can simply chain `map` calls on the text column of the dataframe. While we are at it, let us also measure the time taken to process the text.
```python
import time

t0 = time.time()

def clean_text(df):
    '''
    Take in a dataframe and process its text column.
    '''
    df["cleaned_text"] = (
        df.text.map(lambda text: text.lower())
        .map(remove_emails)
        .map(remove_newlinechars)
        .map(tokenize)
        .map(remove_stopwords)
        .map(lemmatize)
    )
    return df

df = clean_text(df)
t1 = time.time()
print("Time to process without Dask {}".format(t1 - t0))
```

>>> Time to process without Dask 258.16533493995667
This way we are not using parallelization, so there is scope for improvement in processing time. That's where Dask comes in.
Processing text with Dask
So, now is the time to bring in the big guns!
But before we can execute the processing with Dask, we need to do some setup:
- Creating a Dask DataFrame and partitioning it (details in a while)
- Reusing our processing functions with Dask
Creating a Dask DataFrame
Okay, so we have a dataframe, but before Dask can work with it, we need to define a Dask DataFrame: a container for our dataframe that is capable of breaking it down into multiple parts and processing them in parallel.
```python
import dask.dataframe as ddf

dask_dataframe = ddf.from_pandas(df, npartitions=6)
```
We can make a Dask dataframe from an existing pandas dataframe using the `from_pandas` function. We can also define how many partitions/slices of the dataframe we want using the `npartitions` argument.

The `npartitions` parameter determines the number of parts of the dataframe that will be processed in parallel. Each partition will be a pandas dataframe.
Mapping Partitions
On a pandas dataframe, we can use the `map` function directly on the columns. In Dask, we have partitions/slices of the dataframe to work with, so Dask provides a function called `map_partitions`.

`map_partitions` accepts a function and applies it to each partition. There is a slight catch here: the function should accept a dataframe. Unlike `map` in pandas, which is applied to a column, `map_partitions` is applied to a dataframe slice.
We have already implemented the `clean_text` function, and we can use it to map the partitions.
```python
t0 = time.time()
result = dask_dataframe.map_partitions(clean_text, meta=df)
df = result.compute()
t1 = time.time()
print("Time to process with Dask {}".format(t1 - t0))
```

>>> Time to process with Dask 136.15019989013672
That's almost half the time of the plain pandas version: a 2x speedup!
For `npartitions=2`, my laptop took 166 seconds.
It is advisable to set `npartitions` to the CPU count of your machine; instructions for finding it can be found in this SO thread.
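Assuming you want one partition per core (a reasonable default), the standard library can report the core count. Note that `cpu_count()` returns logical cores, which may exceed the number of physical cores:

```python
import multiprocessing

# Number of logical CPU cores; a sensible default for npartitions
n_cores = multiprocessing.cpu_count()
print(n_cores)
```

You could then pass this value as `ddf.from_pandas(df, npartitions=n_cores)`.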
In essence, with a couple of extra lines of code, you can speed up your text pre-processing by a factor of two. This has helped me run experiments and iterate very quickly!
Conclusion
Dask is a very powerful tool for parallelizing processing and other tasks. It can even handle dataframes that do not fit in memory, and it can distribute processing across a cluster of machines!
Comment with the use cases Dask solves for you.
Thanks for reading. So long, and thanks for all the fish!
Follow me on Github, Twitter, LinkedIn.