Natural Language Processing: Text Preprocessing and Vectorizing at Rocking Speed with RAPIDS cuML

Simon Andersen
Published in RAPIDS AI · Jul 14, 2020

Text preprocessing on GPUs is coming to RAPIDS cuML! This is very exciting, as efficient string operations have long been a difficult problem on GPUs. Building on the work by the RAPIDS cuDF team, which enables string manipulation on GPUs, we kickstarted a series of natural language processing (NLP) transformers with our GPU version of scikit-learn’s CountVectorizer and TfidfVectorizer.

TfidfVectorizer is the base building block of many NLP pipelines. It is a simple technique to vectorize text documents — i.e. transform sentences into arrays of numbers — and use them in subsequent tasks.

In this blog post, we discuss how to use the feature_extraction.text package in RAPIDS cuML to perform fast text vectorizing on GPUs. The cuML implementation of TfidfVectorizer on GPUs can provide a speedup of up to 5x relative to scikit-learn’s TfidfVectorizer on CPUs. And this is only going to get faster!

Although the current speedup is not as large as in some of our recent blogs (600x k-nearest neighbors, 500x SVM), the real gain comes from the fact that you can now run an entire NLP pipeline on GPUs. There’s no need to start on the CPU for preprocessing and then move to the GPU for training. Indeed, shuttling your data back and forth between CPU and GPU is a massive performance loss. In the accompanying notebook, we show a pipeline that runs end-to-end in 34.44 seconds on GPU, while on CPU it takes 10 minutes and 40 seconds, roughly an 18x speedup. With a GPU-accelerated TfidfVectorizer, large-scale NLP pipelines become more practical, as you can now feed your data to extremely fast estimators directly from GPUs.

We will start by describing a basic NLP pipeline and briefly explain the math behind Term Frequency — Inverse Document Frequency (TF-IDF), followed by an explanation of why sparsity matters. Finally, we will demonstrate how to use the TfidfVectorizer on a real-world dataset with an end-to-end NLP pipeline.

Basic NLP Pipeline

In a typical NLP pipeline, you often want to preprocess your text data, vectorize it, train an estimator, and evaluate the results. The TfidfVectorizer estimator is a common starting point to preprocess and vectorize text data.

Preprocess

It’s during the preprocessing step that we will:

  • normalize our data (for instance, convert all characters to lowercase),
  • remove noise (stop words, punctuation, too rare or too common terms),
  • and tokenize input documents (a token is a sequence of characters separated by a delimiter; if the delimiter is a space, tokens are simply words).

Vectorize

Only then can we vectorize our documents, which means turning tokens into meaningful numbers. There are many ways to represent tokens in documents. Let’s briefly discuss three commonly used transformers, focusing on approaches available in scikit-learn: CountVectorizer, HashingVectorizer, and TfidfVectorizer.

  • The most naive approach is to count the number of occurrences of each token within a document — this is what the CountVectorizer does.
  • Another common approach is to replace each token with a hashed value (HashingVectorizer). This has the advantage of using very little memory and therefore scaling well to big datasets, but it doesn’t provide an interpretable mapping from features back to tokens and can suffer from collisions.
  • We can also compute the TF-IDF score (using TfidfVectorizer), which we will detail in the next section.

Train and Evaluate

The final steps of a typical NLP pipeline are to train an estimator on the vectorized documents for a particular task and then evaluate the results.

TF-IDF Overview

TF-IDF was introduced to address the shortcomings of the more naive approach of counting occurrences of each term (CountVectorizer), where very common terms end up with very large values and uncommon terms (which are usually a good discriminant between documents) get lost in the noise. TF-IDF is a technique that weights terms according to their importance within the document and across the corpus. Words that are frequent in a document but not across the corpus tend to have a high TF-IDF score. In practice, the TF-IDF score is the product of the term frequency and the inverse document frequency.

Term Frequency

This is the number of times each term appears in a document divided by the total number of words in the document.

The term frequency of word j in document i is the number of times word j appears in document i (n_ij) divided by the total number of words in document i.
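
Written as a formula (a reconstruction using the n_ij notation above):

$$\mathrm{tf}_{ij} = \frac{n_{ij}}{\sum_{k} n_{ik}}$$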

Inverse Document Frequency

The log of the number of documents divided by the number of documents that contain the word.
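
As a formula (again a reconstruction, where N is the total number of documents and df_j is the number of documents containing word j; note that scikit-learn and cuML use a smoothed variant of this by default):

$$\mathrm{idf}_{j} = \log\frac{N}{\mathrm{df}_{j}}$$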

This increases the weight of rare words across all documents in the corpus. Note that when we compute the TF-IDF for every word in every document of a corpus, it forms a matrix of shape (number of documents × vocabulary size). Here is an example of applying TF-IDF on a corpus of two documents:
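
Below is a sketch of the two-document pizza-pineapple corpus that we come back to later in this post (the exact cuML API may differ slightly between versions):

```python
import cudf
from cuml.feature_extraction.text import TfidfVectorizer

docs = cudf.Series(["Pizza celery pizza", "Celery pineapple"])

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # sparse matrix of shape (2 documents, 3 vocabulary terms)

print(vectorizer.get_feature_names())    # ['celery', 'pineapple', 'pizza']
print(tfidf.toarray())
# Approximately:
# [[0.33, 0.  , 0.94],
#  [0.57, 0.81, 0.  ]]
```

The rare word “pineapple” gets the largest weight in the second document, while “celery”, which appears in both documents, is down-weighted.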

From an implementation perspective, the CountVectorizer estimator is used to preprocess the data and count the number of times each term appears in each document, while the TfidfTransformer computes the TF-IDF weights for each document. Combining the two gives the TfidfVectorizer.
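
In scikit-learn terms (shown here with the CPU classes, since the split is the same; the cuML TfidfVectorizer is built along the same lines), the equivalence looks roughly like this:

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = ["Pizza celery pizza", "Celery pineapple"]

# Two-step version: count term occurrences, then apply TF-IDF weighting
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# One-step version: TfidfVectorizer combines both estimators
tfidf_one_step = TfidfVectorizer().fit_transform(docs)
```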

Why Does Sparsity Matter?

Typically, a text dataset composed of real data will grow in vocabulary at a rate of roughly 0.1 * total number of words (see Heaps’ law). This means that a corpus composed of 5M words will have about 500K unique tokens. For reference, a typical dataset of 300K tweets will have roughly 8M words and therefore a vocabulary size of about 800K.

As you can see, vocabulary size can get pretty big as the dataset grows. As we saw earlier, the shape of the TF-IDF matrix is (documents_number * vocabulary_size), meaning that for each document we store a TF-IDF score for every term of the vocabulary. For our example dataset of 300K tweets, this would mean a matrix of shape (300K * 800K), which is enormous considering the mean number of words per tweet is only 26. A matrix this big will not fit in a typical GPU’s memory. You might have noticed, though, that the overwhelming majority of elements in the matrix are 0s: each row represents a document, and the vast majority of documents only use a small subset of the vocabulary; the rest are zeros. Thankfully, that’s exactly what sparse matrices are built for. The sparse format lets you store huge matrices in which most values are 0 by representing the data in a compressed way. This is why it is very important to handle sparsity in our TfidfVectorizer.
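
A back-of-the-envelope calculation makes the point, assuming float32 values, int32 indices, and at most 26 unique terms per tweet:

```python
n_docs, vocab_size, terms_per_doc = 300_000, 800_000, 26

dense_bytes = n_docs * vocab_size * 4                # every cell stored as a float32
sparse_bytes = (n_docs * terms_per_doc * (4 + 4)     # CSR: values + column indices
                + (n_docs + 1) * 4)                  # CSR: row pointers

print(f"dense:  {dense_bytes / 1e9:,.0f} GB")        # ~960 GB
print(f"sparse: {sparse_bytes / 1e6:,.0f} MB")       # ~64 MB
```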

Let’s Put This into Practice

As always with cuML, you can typically swap your scikit-learn imports for their cuML equivalents and your existing NLP workflow will now run on the GPU. (If you’ve never used cuML before and would like to start with the basics, we suggest starting with some of the introductory notebooks from the documentation.) Let’s try our new GPU implementation of TfidfVectorizer on a real dataset and compare performance with scikit-learn on CPUs.
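
For the vectorizer used in this post, that swap is a one-line change:

```python
# CPU: from sklearn.feature_extraction.text import TfidfVectorizer
# GPU: same class name and largely the same parameters
from cuml.feature_extraction.text import TfidfVectorizer
```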

Taking Advantage of Sparsity

Our first dataset is composed of 8M tweets related to COVID-19, about 3 GB of text data (credit goes to Shane Smith for gathering the data from Twitter and publishing it on Kaggle). We’ll do some simple keyword searches using cosine similarity (a simple measure of the distance between two document vectors).

Let’s first load the dataset into a dataframe:
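
Something along these lines works (a sketch; the file name and the `lang`/`text` column names are assumptions based on the Kaggle dataset):

```python
import cudf

# Hypothetical file name; the Kaggle dataset ships as CSV files
df = cudf.read_csv("covid19_tweets.csv")

# Keep only tweets flagged as English
df = df[df["lang"] == "en"]
print(len(df))   # close to 5M tweets
```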

Note that we are filtering to keep only tweets in the English language for simplicity. This leaves us with almost 5M tweets.

We can then preprocess and vectorize the data using TfidfVectorizer:
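
A minimal version of that step (assuming the tweet text lives in a `text` column):

```python
from cuml.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df["text"])

print(tfidf_matrix.shape)   # (4827372, 5435706) in our run
```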

Which takes 25.9 seconds on an NVIDIA V100–32GB GPU. In comparison, scikit-learn’s TfidfVectorizer takes 2 minutes and 54.6 seconds on CPUs. Notice the shape of the matrix: (4827372, 5435706). In other words, 4.8M by 5.4M floats. This is huge! It would certainly not fit in GPU memory unless we stored it as a sparse matrix. This demonstrates once more the value of sparse matrices for real-world datasets: they can be represented with far less memory than their dense equivalents.

Now that we have our TF-IDF matrix, we want to query it with some keywords to find the most relevant documents. To do this, we rely on the fact that our TfidfVectorizer can vectorize any document according to the vocabulary learned from the input corpus. We therefore vectorize our query of keywords to get a vector in the same space as our TF-IDF matrix. Then, all we have to do is compute the distance between our query vector and each document of the TF-IDF matrix. The document with the smallest distance (or highest similarity) to our query is the most relevant one for our keywords. We chose cosine similarity. When both vectors are L2-normalized, the cosine similarity is simply their dot product, which corresponds to the cosine of the angle between the two vectors.

Let’s try it out on our previous pizza-pineapple example. We have two documents (“Pizza celery pizza” and “Celery pineapple”) and two corresponding TF-IDF vectors: [0.33, 0., 0.94] and [0.57, 0.81, 0.]. The dot product of those vectors is 0.1881, meaning that they are very different. By contrast, the maximum value for cosine similarity is 1, meaning that the vectors are identical. Here is an efficient way to compute the cosine similarity between a vector and each row of a matrix: first normalize both the vector and the matrix with the L2 norm, then take the dot product between the matrix and the vector:
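
A sketch of that computation follows (the import path and exact signature of `csr_row_normalize_l2` may differ between cuML versions):

```python
# Import path is an assumption; check where csr_row_normalize_l2 lives in your cuML version
from cuml.common.sparsefuncs import csr_row_normalize_l2


def efficient_csr_cosine_similarity(query, tfidf_matrix, matrix_normalized=False):
    # L2-normalize the query vector (and the matrix, if not already normalized);
    # the cosine similarity then reduces to a sparse matrix-vector dot product
    query = csr_row_normalize_l2(query, inplace=False)
    if not matrix_normalized:
        tfidf_matrix = csr_row_normalize_l2(tfidf_matrix, inplace=False)
    return tfidf_matrix.dot(query.T)
```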

Here we make use of the specialized cuML function `csr_row_normalize_l2`, which takes advantage of the GPU to efficiently normalize a sparse matrix. Because our TF-IDF matrix is sparse, the cosine similarity is blazing fast!

Let’s make our life easier and create a helper function to put all of this together, allowing us to search the TF-IDF matrix for any text query:
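
One possible version of such a helper (a sketch; cupy handles the ranking, and the `text` column name is an assumption):

```python
import cudf
import cupy as cp


def document_search(text_df, query, vectorizer, tfidf_matrix, top_n=3):
    # Vectorize the query with the vocabulary learned on the corpus
    query_vec = vectorizer.transform(cudf.Series([query]))
    # The TF-IDF matrix is already L2-normalized by TfidfVectorizer, so skip re-normalizing it
    similarities = efficient_csr_cosine_similarity(query_vec, tfidf_matrix, matrix_normalized=True)
    # Rank documents by similarity and return the top matches
    scores = similarities.toarray().ravel()
    top_idx = cp.argsort(scores)[-top_n:][::-1]
    return text_df["text"].iloc[cp.asnumpy(top_idx)]
```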

Note that by default, the resulting matrix from the TfidfVectorizer is already L2-normalized so there is no need to do this a second time.

Let’s try our simple keyword search with a few queries:
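
For example (the queries below are illustrative, not necessarily the ones used in the notebook):

```python
for query in ["vaccine trial results", "wash your hands", "work from home"]:
    print(query)
    print(document_search(df, query, vectorizer, tfidf_matrix))
```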

Try It Out

If you want to go further, make sure to check out the Jupyter notebook containing all the code. You’ll notice an additional section at the end with clustering workflows (k-means and t-SNE), where we try to find clusters in our tweets to see if we can discover general topics related to COVID-19. (The notebook ends with a teaser visualization of the clustering workflow.)

Wrapping Up

Text preprocessing and vectorizing are crucial to many traditional NLP pipelines. With cuML’s TfidfVectorizer it is now possible to perform those operations directly on the GPU, which enables a major performance boost for existing scikit-learn NLP pipelines. With minimal effort, we were able to run the entire pipeline on GPU 18 times faster than on CPU (see the accompanying notebook). Other vectorizers are in the works, starting with HashingVectorizer, which will help a great deal with distributed pipelines.
