NLP and Text Processing with RAPIDS: Now Simpler and Faster

Published in RAPIDS AI · Aug 11, 2020

By: Vibhu Jawa and Randy Gelhausen

TL;DR: Google famously noted that “speed isn’t just a feature, it’s the feature.” This is true not only for search engines but for all of RAPIDS. In this blog, we showcase performance improvements for string processing across cuDF and cuML, which accelerate a diverse set of text processing workflows.

Introduction:

In our previous post, we showed basic text pre-processing with RAPIDS. Since then, we have come a long way in speed improvements, memory reductions, and API simplification.

Here’s what we’ll cover in this blog:

  1. Built-in, Simplified String and Categorical Support
  2. GPU TextVectorizers: Leaner and Meaner
  3. Accelerating Diverse String Workflows

Built-in Support for Strings and Categoricals

Goodbye, cuStrings, nvStrings, and nvCategory! We hardly knew ye. Our first couple of posts about string manipulation on GPUs involved separate, specialized libraries for working with string data on the device, and they required significant expertise to integrate with other RAPIDS libraries like cuDF and cuML. Since then, we open-sourced, re-architected, and migrated those string and text-related features into more user-friendly DataFrame APIs as part of cuDF. Along the way, we also adopted the Apache Arrow format for cuDF’s string representation, resulting in substantial memory savings and speedups.

Top-Level Simplified Categorical Support

With nvCategory retired and categorical dtype support built into cuDF, categorization is now straightforward. Check out the comparison between the previous and updated code below.

Old Categorization using `nvcategory`


Updated Categorization using the built-in categorical `dtype`

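For reference, here is a minimal sketch of the updated flow (the column name and values are made up for illustration):

```python
import cudf

# Build a string column and cast it to the built-in categorical dtype;
# no separate nvcategory objects are needed anymore.
gdf = cudf.DataFrame({"author": ["Austen", "Dickens", "Austen", "Twain"]})
gdf["author"] = gdf["author"].astype("category")

print(gdf["author"].cat.categories)  # unique category labels
print(gdf["author"].cat.codes)       # integer code for each row
```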

Example Workflow:

As a concrete, non-toy example of these improvements, consider our recently updated Gutenberg corpus analysis notebook. Previously we had to (slowly) jump through a few hoops, but no longer!

With our improved Pandas-like string API coverage, we not only have simpler code, but we also get double the performance: preprocessing that previously took 2.31 s now takes 1.05 s, pushing our overall speedup over Pandas to 151x.

Check out the comparison between the previous vs. updated notebooks below.

Previous:

Pre-Processing using nvtext+nvstrings

Updated:

Updated Pre-Processing with the latest API
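As a rough sketch of what the updated preprocessing looks like (the two sample lines below are hypothetical stand-ins for the Gutenberg texts):

```python
import cudf

lines = cudf.Series([
    "The Project Gutenberg EBook of Pride and Prejudice!",
    "It is a truth universally acknowledged...",
])

# Pandas-like string API: lowercase, strip non-letters, then tokenize.
cleaned = lines.str.lower().str.replace("[^a-z ]", "", regex=True)
tokens = cleaned.str.tokenize()  # one whitespace token per output row
print(tokens.value_counts().head())
```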

GPU TextVectorizers: Leaner and Meaner

We recently launched the feature.text subpackage in cuML by adding Count and TF-IDF vectorizers, kickstarting a series of natural language processing (NLP) transformers on GPUs.

Since then, we have added a hashing vectorizer (20x faster than scikit-learn’s) and improved our existing Count/TF-IDF vectorizers, cutting run time by 3.3x and memory usage by 2x.

Hashing Vectorizer Speed Up vs Sklearn
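Here is a minimal sketch of both vectorizers on a toy corpus (the two documents are made up; real workloads would pass a much larger cudf.Series):

```python
import cudf
from cuml.feature_extraction.text import HashingVectorizer, TfidfVectorizer

docs = cudf.Series([
    "gpu accelerated text processing with rapids",
    "hashing and tfidf vectorizers on the gpu",
])

# Drop-in replacements for the scikit-learn vectorizers; the returned
# document-term matrices are sparse and stay on the GPU.
hashed = HashingVectorizer(n_features=2**15).fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)
print(hashed.shape, tfidf.shape)
```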

In our recent NLP blog, we analyzed 5 million COVID-related tweets by first vectorizing them with TF-IDF and then clustering and searching in the vector space. With our recent improvements (GitHub 2554, 2575, 5666), the TF-IDF vectorization step of that workflow has improved on both the memory and run-time fronts:

  • Peak memory usage decreased from 19 GB to 8 GB.
  • Run time improved from 26 s to 8 s, pushing our overall speedup over scikit-learn to 21x.

All the above improvements mean that your TF-IDF work can scale much further.

Ongoing Work to Further Scale-Out TF-IDF Across Multiple Machines:

We are currently working on adding support for a distributed, multi-GPU TF-IDF transformer that will give you a distributed vectorized matrix. This will enable end-to-end acceleration of distributed text processing pipelines, as that matrix can be fed into distributed machine learning models like cuml.dask.naive_bayes.

Accelerating Diverse String Workflows

We are adding more string functionality like character_tokenize, character_ngrams, ngram_tokenize, filter_tokens, and filter_alphanum, as well as higher-level text-processing APIs like a GPU-accelerated BERT tokenizer and text vectorizers, helping enable the more complex string and text manipulation logic found in real-world NLP applications.
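As a small taste, here is a sketch of a few of these helpers exercised on a toy series (names follow cuDF’s string accessor; exact availability depends on your cuDF version):

```python
import cudf

s = cudf.Series(["rapids makes nlp fast", "gpu string processing"])

# nvtext-style helpers exposed on the pandas-like string accessor.
print(s.str.tokenize())                   # whitespace tokens, one per row
print(s.str.character_tokenize().head())  # one character per output row
print(s.str.ngrams(n=2, separator="_"))   # bigrams of whole strings
```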

Stay tuned for a follow-up post where we put all these features through their paces in a specialized NLP benchmark. In the meantime, try RAPIDS in your NLP work on Google Colab or BlazingSQL Notebooks, see our documentation page, and if you see something missing, we welcome feature requests on GitHub!
