Show Me The Word Count

Vibhu Jawa · Published in RAPIDS AI · Jul 30, 2019

By: Vibhu Jawa, Nick Becker, David Wendt, and Randy Gelhausen

Since the dawn of Big Data, analytics platforms have competed for the fastest and most succinct word-count code. SlideShare presentations for both batch and streaming engines often proudly pasted their implementations on the first few slides. They showcased the simplicity and demonstrated the capabilities of the tools.

String support in RAPIDS is pretty new: it first arrived a little more than six months ago. Since then, we open-sourced the implementation and began the first phases of cuDF integration. With the nvtext module, we're now tackling larger, more complex text processing building blocks that enable GPU-accelerated NLP work. While we've been able to do basic word counts for a while, useful analysis of real text data involves preprocessing that we wanted to get right before publishing examples.

Text processing is a foundational technology of modern data science. In this post, we demonstrate faster end-to-end text analytics with GPU acceleration. We use the Project Gutenberg data set to demonstrate high performance tokenization, word counting, and vectorization, and we show how to extend this to interesting clustering analyses.

By the end of this post, you'll be able to clean a 19-million-line text dataset in 2.3 seconds on a single NVIDIA GV100 and use it to find similar authors based on their writing.

Accelerated Text Processing with GPUs

In most workflows involving textual data, you need to clean and filter the text before running much analysis. That typically involves operations like:

  1. Removing punctuation
  2. Lower or upper casing
  3. Removing stop words
  4. Various other tasks, depending on the dataset and application

In our example, the data is a collection of 3,036 English books by 142 authors, a cleaner subset of the Project Gutenberg corpus. Each book is a text file on disk. Reading 3,000 relatively small files is a non-ideal data-loading case. Making matters worse, the file content doesn't contain book title or author metadata. So we read the content into a list on the host, prepending the filename to each line, and then transfer it to the GPU as a cuDF DataFrame. While Dask DataFrame's read_csv supports including such metadata as a column, cuDF doesn't support that yet.
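A minimal sketch of that loading step (the directory path and column names here are illustrative, not the exact notebook code):

```python
import os

import cudf

DATA_DIR = "gutenberg/"  # hypothetical path: one .txt file per book

labels, lines = [], []
for fname in os.listdir(DATA_DIR):
    with open(os.path.join(DATA_DIR, fname), encoding="utf-8") as f:
        for line in f:
            labels.append(fname)            # the filename carries author/title
            lines.append(line.rstrip("\n"))

# One host-to-device transfer for the whole corpus
df = cudf.DataFrame({"label": labels, "text": lines})
```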

With our data in memory, we're ready to start preprocessing. Depending on the size of the dataset, these tasks can take a long time, even on powerful machines with many-core CPUs. Say you wanted to speed things up by porting your work to a GPU. Before RAPIDS, you would have had to write a whole lot of CUDA C++ to handle those tasks. With cuStrings and its integration into cuDF, RAPIDS provides a straightforward, high-level Python API familiar to data engineers, data scientists, and NLP practitioners. We'll demonstrate it by example.

Removing Punctuation and Lower Casing

Consider the strings “Hey,” “hey!”, and “hey”. They’re all basically the word “hey”. Regardless of casing and attached punctuation marks, you’d usually want to count them as the same word.

We'll first remove punctuation marks. nvtext provides a multi-string replace function to make this operation fast and straightforward, and we can chain it together with the lower function.
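A sketch of what that looks like. Recent cuDF accepts a list of patterns directly in str.replace (older releases spelled the multi-replace through nvstrings), and the helper name clean_text is ours:

```python
import string

# Treat every punctuation character as a target; replace each with a space
FILTERS = list(string.punctuation)

def clean_text(col: cudf.Series) -> cudf.Series:
    col = col.str.replace(FILTERS, [" "] * len(FILTERS), regex=False)
    return col.str.lower()
```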

Removing Stop Words

In this workflow, we are looking for the distinctive words used by various authors. If we keep commonly occurring words (stop words), our count vectors will skew toward them and the more distinguishing features will be lost in the noise. We'll use nltk's standard list of stop words and remove them with nvtext.replace_tokens.
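A sketch using the modern spelling, where replace_tokens is exposed through the Series.str accessor in today's cuDF:

```python
from nltk.corpus import stopwords  # run nltk.download("stopwords") once first

STOPWORDS = stopwords.words("english")

def remove_stopwords(col: cudf.Series) -> cudf.Series:
    # replace_tokens matches whole tokens rather than substrings, so
    # removing "the" leaves words like "then" untouched
    return col.str.replace_tokens(STOPWORDS, " ")
```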

Other Tasks: Removing Variable Length Whitespace

Sometimes, after removing words, we can end up with consecutive whitespace characters that we want to standardize to avoid impacting our analysis. Removing spaces is still slightly more involved than it would ideally be. We’ll soon make it better with nvtext.normalize_spaces, but for now, we can use a regular expression.
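For example, something along these lines (the helper name is ours):

```python
def normalize_whitespace(col: cudf.Series) -> cudf.Series:
    # Collapse any run of whitespace to a single space and trim the ends
    return col.str.replace(r"\s+", " ", regex=True).str.strip()
```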

Putting It All Together

Whew, we're done with the data cleaning steps. There was lots of code above, but how fast does all that logic run when you combine it? Let's compare a GV100 with two 16-core CPUs (64 virtual cores).
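Chaining the helpers sketched above into one pipeline, timed here with simple wall-clock timing (your numbers will depend on hardware):

```python
import time

def preprocess(col: cudf.Series) -> cudf.Series:
    col = clean_text(col)             # strip punctuation, lowercase
    col = remove_stopwords(col)       # drop nltk stop words
    return normalize_whitespace(col)  # collapse repeated spaces

start = time.time()
df["text"] = preprocess(df["text"])
print(f"preprocessing took {time.time() - start:.2f}s")
```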

Benchmark timings: single GV100 run vs. Dask-CPU with 64 virtual cores.

That concludes the preprocessing steps: we cleaned the whole 19-million-line dataset in 2.31 seconds on a single NVIDIA GV100, 6x faster than the dual 16-core CPU (64 virtual core) implementation. Want to try it yourself? Click here to experiment.

Seriously, Show me the Word Count!

Before counting the words, we need to split the lines into individual tokens. Since some lines are long (i.e., have many tokens), naive splitting can run into out-of-memory situations on many datasets. To handle this efficiently, we created the tokenize method in nvtext. It generates a single column of all tokens used in the input strings, reducing memory usage significantly. It's easy to use, and works like this:
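(Sketched with the modern cuDF spelling, where nvtext's tokenizer is exposed as Series.str.tokenize.)

```python
tokens = df["text"].str.tokenize()  # one flat column, one row per token
print(len(tokens), "tokens in total")
print(tokens.head())
```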

Once we’ve tokenized the text, we just need to count the words. We can use cuDF’s groupby functionality to do this. We’ll wrap both the tokenizing and groupby word counting into a function for convenience.
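A sketch of that convenience function (the name word_count is ours):

```python
def word_count(col: cudf.Series) -> cudf.DataFrame:
    """Return a token/count table sorted by descending frequency."""
    tokens = col.str.tokenize().reset_index(drop=True)
    tdf = cudf.DataFrame({"token": tokens})
    tdf["count"] = 1
    return (tdf.groupby("token").count()
               .sort_values("count", ascending=False)
               .reset_index())
```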

Now, let’s calculate and compare the word counts for Albert Einstein and Charles Dickens, to see if we find any interesting patterns.
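Roughly like this, assuming an author column has been split out of the filename labels created at load time (the exact derivation depends on the corpus's filename scheme, so it's elided here):

```python
einstein_wc = word_count(df[df["author"] == "Albert Einstein"]["text"])
dickens_wc = word_count(df[df["author"] == "Charles Dickens"]["text"])

print(einstein_wc.head(10))
print(dickens_wc.head(10))
```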

Einstein talks about relativity, theory, and body. Dickens is relatively more interested in telling readers a good story: once, upon, time, and one.

Surprising to no one.

Finding Similar Authors with UMAP and KNN

Ok, GPUs can pre-process text and count words fast. So what?

As a motivating example, in high school, I loved reading Raymond Chandler’s crime noir detective stories. A few years ago, I happened upon his Wikipedia entry and learned of his contemporary and friend Dashiell Hammett. I quickly read every Hammett story I could get my hands on.

Wouldn’t it be nice if I didn’t need to rely on an author’s life story to find other authors I’d enjoy? Good news! In this section, we’ll use our word vectors, dimensionality reduction, and K-Nearest-Neighbors to do just that.

This approach is generally called bag-of-words, and it actually works pretty well. The intuition is straightforward: documents that share many words in common are more likely to be similar than documents that have very few words in common.

To build the bag-of-words representation, we first need to transform our count vectors into a standardized, aligned format. Once that's done, we'll be ready for downstream analytics.

Creating Encoded Word Vectors

We need to convert the word-count Series directly into count vectors. Eventually, nvtext will support this natively, but for now we show how to implement it manually using nvcategory and a GPU-accelerated Numba kernel. We believe this will be useful to NLP practitioners experimenting with custom preprocessing logic who don't want to spend time Cythonizing or otherwise writing performance-oriented native Python extensions.

We do this in three steps. First, we calculate word counts for all of the authors.
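Roughly, reusing the word_count helper from earlier:

```python
# Per-author word counts, keyed by author name
author_wcs = {}
for author in df["author"].unique().to_pandas():
    author_wcs[author] = word_count(df[df["author"] == author]["text"])

authors = list(author_wcs)  # fixed row order for the count matrix below
```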

Next, we encode the counts Series using the top 20,000 most frequent words in the dataset, which we calculated earlier. We use nvcategory to do this encoding.
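The original notebook used nvcategory for this mapping; an equivalent sketch in today's cuDF is a join against a vocabulary table (encode_counts is our name):

```python
import cupy as cp

# Vocabulary: the 20,000 most frequent tokens across the whole corpus
vocab = word_count(df["text"]).head(20_000).reset_index(drop=True)
vocab["word_id"] = cp.arange(len(vocab), dtype=cp.int32)

def encode_counts(wc: cudf.DataFrame) -> cudf.DataFrame:
    # Inner join keeps only in-vocabulary tokens and attaches their word_id
    return wc.merge(vocab[["token", "word_id"]], on="token", how="inner")
```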

Finally, we align the word count vectors to a standard form and order. Each column corresponds to a word_id from the top 20k words, and each row holds the count vector for one author.

We then normalize the array by each row's sum of counts, so that prolific authors don't dominate the distance computations.
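A sketch of the scatter-and-normalize step with a Numba CUDA kernel (the kernel shape here is ours; the original gist may differ in detail):

```python
from numba import cuda

N_WORDS = 20_000
count_matrix = cp.zeros((len(authors), N_WORDS), dtype=cp.float32)

@cuda.jit
def scatter_counts(word_ids, counts, row, out):
    # One thread per (word_id, count) pair; writes into the author's row
    i = cuda.grid(1)
    if i < word_ids.size:
        out[row, word_ids[i]] = counts[i]

for row, author in enumerate(authors):
    enc = encode_counts(author_wcs[author])
    word_ids = enc["word_id"].values                # CuPy view of the column
    counts = enc["count"].astype("float32").values
    threads = 256
    blocks = max(1, (len(enc) + threads - 1) // threads)
    scatter_counts[blocks, threads](word_ids, counts, row, count_matrix)

# Normalize each row by its total count so corpus size doesn't dominate
count_matrix /= cp.maximum(count_matrix.sum(axis=1, keepdims=True), 1e-9)
```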

Clustering Authors using K-Nearest Neighbors

One of the simplest approaches to clustering is K-Nearest Neighbors, and it works well with the count vectors we just created. Let's fit a k-nearest neighbors model on the count encoding array and find some similar authors.
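Using cuML's NearestNeighbors (a sketch; n_neighbors=5 is an arbitrary choice here):

```python
from cuml.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=5)
nn.fit(count_matrix)
```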

Let's find the nearest neighbors to Albert Einstein and to Charles Dickens.
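A small lookup helper (ours, for illustration, assuming array-style outputs from kneighbors):

```python
def neighbors_of(author, model, matrix):
    row = authors.index(author)
    _, idx = model.kneighbors(matrix[row:row + 1])
    # Skip position 0: every author's closest neighbor is itself
    return [authors[int(i)] for i in idx[0][1:]]

print(neighbors_of("Albert Einstein", nn, count_matrix))
print(neighbors_of("Charles Dickens", nn, count_matrix))
```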

The nearest neighbor to Albert Einstein is Thomas Carlyle, who was a mathematician, so that seems reasonable. But the nearest neighbor to Charles Dickens is Winston Churchill, which doesn't seem right. Let's try to fix that.

Reducing Dimensionality with UMAP

We are in a high-dimensional space with 20,000 dimensions, and Euclidean-distance-based nearest neighbors are not the best way forward in high-dimensional spaces. Let's try to fix this by reducing dimensionality with Uniform Manifold Approximation and Projection (UMAP), which will map our count vectors down to 3 dimensions using cuml.umap.
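A sketch with cuML's UMAP:

```python
from cuml import UMAP

umap = UMAP(n_components=3)
embedding = umap.fit_transform(count_matrix)  # shape: (n_authors, 3)
```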

Let's fit a KNN model on the dimension-reduced dataset.
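Same as before, just on the 3-dimensional embedding:

```python
nn_umap = NearestNeighbors(n_neighbors=5)
nn_umap.fit(embedding)
```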

Let’s now look at the neighbors on this manifold.
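Reusing the hypothetical lookup helper from above:

```python
print(neighbors_of("Albert Einstein", nn_umap, embedding))
print(neighbors_of("Charles Dickens", nn_umap, embedding))
```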

Albert Einstein Neighbors:

Charles Dickens Neighbors:

So, in this latent space, ‘Charles Dickens’ and ‘Agatha Christie’ are neighbors, both of whom loved writing stories.

Conclusion

Our first blog post on strings was titled, “Real Data Has Strings, Now So Do GPUs”. That was true then, and it's still true now. What's changed is that today we have deeper string support integrated into cuDF, meaning you can mix and match string munging operations with traditional tabular grouped aggregations. We also now have the nvtext module, which enables more optimized NLP-style string preprocessing on GPUs, and we support passing cleaned data directly into cuML's GPU-accelerated algorithms.

The takeaway is that you can get large end-to-end speedups on a single GPU vs powerful multi-core CPU machines.

On this dataset, we saw text preprocessing speedups of 24x vs. a 12-virtual-core CPU and 6x vs. a 64-virtual-core CPU.

Speedups like these give data scientists and engineers more time to iterate on feature engineering and to find the best parameters and hyperparameters for their machine learning models. NLP is a wide world. Come check out cuStrings on GitHub and let us know which parts of it you'd like to see run faster!

Want to get started with RAPIDS? Check out cuDF on GitHub and let us know what you think! You can download pre-built Docker containers for our 0.8 release from NGC or Docker Hub to get started, or install it yourself via Conda. Need something even easier? You can quickly get started with RAPIDS in Google Colab and try out all the new things we've added with just a single push of a button.

Don't want to wait for the next release to use upcoming features? You can download our nightly containers from Docker Hub or install via Conda to stay at the tip of our development branch.

Notebook Link
