Accelerating TF-IDF for Natural Language Processing with Dask and RAPIDS

Vibhu Jawa · Published in RAPIDS AI · 3 min read · Sep 16, 2021

By: Anirban Das and Vibhu Jawa

Term frequency-inverse document frequency (TF-IDF) is a scoring measure widely used in information retrieval (IR) to reflect how relevant a term is to a given document. It is a core Natural Language Processing (NLP) technique in document retrieval, summarization, text classification, and document clustering. (For more details, check out this blog post.)
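To make the scoring concrete, here is a minimal pure-Python sketch (not the benchmark code from this post) that computes TF-IDF for a toy corpus. It uses the common smoothed variant of the formula, idf(t) = ln((1 + N) / (1 + df(t))) + 1, which matches scikit-learn's default; other libraries may use slightly different smoothing.

```python
import math

def tf_idf(corpus):
    """Compute smoothed TF-IDF scores for a list of tokenized documents."""
    n_docs = len(corpus)
    # Document frequency: number of documents containing each term.
    df = {}
    for doc in corpus:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    scores = []
    for doc in corpus:
        # Term frequency: share of the document taken up by each term.
        tf = {t: doc.count(t) / len(doc) for t in set(doc)}
        scores.append({
            t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t in tf
        })
    return scores

docs = [["gpu", "fast", "gpu"], ["cpu", "slow"], ["gpu", "cpu"]]
scores = tf_idf(docs)
# "gpu" appears in 2 of 3 documents while "fast" appears in only 1,
# so "fast" receives the higher idf weight.
```

In practice one would use a library vectorizer rather than this loop, but the arithmetic is the same scoring the GPU pipeline computes at scale.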

In this blog post, we will show how to use NVIDIA GPUs to accelerate TF-IDF end-to-end pipelines using RAPIDS and Dask. We also compare the RAPIDS and Dask GPU-based approach with a multi-CPU Apache Spark approach and a single-CPU scikit-learn approach.

Apache Spark is commonly used to do TF-IDF at scale; for reference, check out Oracle’s A-team blog, where they benchmarked a multi-CPU Apache Spark-based pipeline against scikit-learn. This blog post extends that comparison to include a GPU-based solution.

Workflow Details

The workflow across all three pipelines looks like the following:

Code
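The full pipeline code lives in the benchmarking repository linked below. As a hedged illustration of the same read → clean → vectorize workflow, here is a minimal single-CPU sketch using scikit-learn; the GPU pipeline swaps in dask_cudf for reading and cuML's Dask-enabled vectorizers. The toy data, column name, and cleaning step are assumptions for illustration, not the benchmark's actual code.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the Amazon reviews data; the real pipeline reads
# ~21 million reviews from Parquet (pd.read_parquet / dask_cudf.read_parquet).
df = pd.DataFrame({"review_body": [
    "Great laptop, very fast!",
    "The keyboard stopped working...",
    "Fast shipping, great product",
]})

# Clean: lowercase and strip punctuation (a simplified stand-in).
clean = df["review_body"].str.lower().str.replace(r"[^\w\s]", " ", regex=True)

# Vectorize: build the sparse TF-IDF document-term matrix.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(clean)
print(tfidf.shape)  # (number of documents, number of unique terms)
```

Conceptually, the RAPIDS version keeps this shape but distributes each step across GPUs with Dask, which is where the speedups below come from.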

Dataset

For benchmarking, we used the PC products subset of the Amazon Customer Reviews Dataset. The compressed Parquet files on disk are ~11 GB and contain about 21 million reviews.

Results

It took RAPIDS+Dask only 25 seconds to read, clean up, and vectorize the entire dataset, while Spark took 482 seconds, and scikit-learn took 2,115 seconds.

Thus, end to end, RAPIDS cuML+Dask on 6 NVIDIA V100 GPUs delivers a 19.25x speedup over Apache Spark on 40 physical cores (80 virtual cores of an Intel Xeon 2698) and an 84x speedup over scikit-learn.
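The speedup factors follow directly from the timings above; any small difference from the quoted figures presumably comes from unrounded measurements. A quick check:

```python
# End-to-end timings in seconds, from the benchmark above.
rapids_s, spark_s, sklearn_s = 25.0, 482.0, 2115.0

print(round(spark_s / rapids_s, 2))   # ~19.28x vs. Spark
print(round(sklearn_s / rapids_s, 1))  # ~84.6x vs. scikit-learn
```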

Check out the below table for details on how much each component was accelerated:

The benchmarking code, setup, and details can be found at the link.

Wrap Up

This workflow is just one example of leveraging GPUs for accelerating end-to-end natural language processing. Accelerated TF-IDF can be used to make NLP pipelines like document indexing, summarization, and text clustering much faster. Do let us know how you are going to use RAPIDS to accelerate your workflows!

Our team at RAPIDS has been hard at work constantly adding features and improving performance. See our documentation docs page, and if you see something missing, we welcome feature requests on GitHub!
