Real Data has Strings. Now, so do GPUs.

Published in RAPIDS AI, Dec 7, 2018

By Nick Becker and Randy Gelhausen

This content is now out of date. For the latest documentation on RAPIDS strings support, please refer to the docs. 

In an ideal world, data would show up for analysts and data scientists in neat rows and columns, compressible and separated into distinct fields that are all addressable by field names in the exact structure needed for their analysis.

Unfortunately, that’s a dream world that few people have the pleasure of working in. For everyone else, there are CSVs, JSON, XML, and worse formats, embedding a mix of numerical data and strings, which often encapsulate multiple distinct bits of information between delimiters.

Image from @kdnuggets, created by Jon Carter

Furthermore, many high-value datasets are complex enough to include strings of various types, shapes, and sizes. Standard file formats and parsers aren’t enough to get at the finest-grained details inside the data fields themselves. Data preparation and cleaning are now essential stages of any data science workflow, and they frequently call for functions that split strings, tokenize them, replace characters, and more.

Hello, cuStrings & nvStrings

NVIDIA recently released a beta version of a new library called cuStrings. Built on CUDA, cuStrings provides a GPU-accelerated columnar interface for manipulating and performing parallel operations on arrays of strings. Given that string data is so commonly embedded alongside numeric values, cuStrings functionality is a natural fit for the data science challenges RAPIDS is designed to address.

To strengthen the RAPIDS ecosystem, NVIDIA is releasing nvStrings, open source Python bindings for the cuStrings library. nvStrings is available as a conda package and on GitHub.

Now, you can use nvStrings and cuDF together to read CSV files of mixed types, then perform common string operations such as splitting, replacing substrings, changing character case, and padding. When your strings are sufficiently cleaned and prepped, they can be hashed into a numeric representation and added as columns in a DataFrame. Going forward, the RAPIDS team is working to integrate nvStrings-typed columns natively into cuDF.
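To make the "clean, then hash" idea concrete, here is a minimal CPU-only sketch in plain Python. It is not the nvStrings API: the helper names `clean_column` and `hash_column`, the sample values, and the choice of CRC32 as the hash are all assumptions made for illustration. The point is only that each operation is applied uniformly to every element of a column, which is the pattern a GPU string library parallelizes.

```python
import zlib


def clean_column(values):
    """Apply the same cleaning steps to every string in a column (hypothetical helper)."""
    cleaned = []
    for value in values:
        value = value.strip().lower()    # trim whitespace, normalize case
        value = value.replace("_", " ")  # replace a substring
        cleaned.append(value)
    return cleaned


def hash_column(values):
    """Map each string to an integer so it can sit in a numeric column.

    CRC32 is an assumption chosen for illustration, not the hash nvStrings uses.
    """
    return [zlib.crc32(value.encode("utf-8")) for value in values]


# Made-up sample values, not taken from any dataset.
styles = ["  American_IPA", "american ipa ", "Japanese_Rice_Lager"]

cleaned = clean_column(styles)
codes = hash_column(cleaned)
padded = [value.ljust(20, ".") for value in cleaned]  # pad to a fixed width
```

After cleaning, the first two entries become identical strings and therefore receive the same numeric code, which is exactly why the cleaning has to happen before the hashing.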

Text Data is Complicated

Take the Beers, Breweries, and Beer Reviews dataset on Kaggle. The CSV file with metadata about each type of beer includes your typical numerical IDs, location codes, and entity names.

After a quick review, several potential problems jump out immediately:

  1. The state and country codes are abbreviations; they should probably be replaced with full state and country names.
  2. The style field is a classic example of an overloaded field. If you want to group beer types and compare average review scores, you’re out of luck unless you’re happy to group by styles as specific as Japanese Rice Lager. This field is begging to be split into sub-fields.
  3. 99% of the content of “notes” is unhelpful (“No notes at this time.”). A simple N/A or empty string would suffice.
  4. The retired field uses “t” and “f”, but most ML algorithms need Boolean values represented as 0 or 1.

Any given dataset with text fields likely needs some level of string preprocessing to handle these kinds of issues before you can run an algorithm on the data to generate insights.
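The four fixes listed above can be sketched in a few lines of plain Python. This is a CPU-only illustration under stated assumptions, not nvStrings code: the `clean_beer` helper, the two partial lookup tables, the sample row, and the rule of treating the last word of style as the broad family are all hypothetical.

```python
# Partial lookup tables, for illustration only.
STATE_NAMES = {"CA": "California", "NY": "New York"}
COUNTRY_NAMES = {"US": "United States", "JP": "Japan"}

PLACEHOLDER_NOTE = "No notes at this time."


def clean_beer(row):
    """Return a cleaned copy of one beer metadata record (hypothetical helper)."""
    out = dict(row)

    # 1. Expand abbreviations, leaving unknown codes untouched.
    out["state"] = STATE_NAMES.get(row["state"], row["state"])
    out["country"] = COUNTRY_NAMES.get(row["country"], row["country"])

    # 2. Split the overloaded style field. Assumption: the last word is the
    #    broad family and everything before it is a qualifier.
    words = row["style"].split()
    out["style_family"] = words[-1]
    out["style_qualifier"] = " ".join(words[:-1])

    # 3. Replace the placeholder note with an empty string.
    if row["notes"].strip() == PLACEHOLDER_NOTE:
        out["notes"] = ""

    # 4. Encode the "t"/"f" flag as 1/0.
    out["retired"] = 1 if row["retired"] == "t" else 0
    return out


# A made-up record shaped like the fields discussed above.
sample = {
    "state": "CA",
    "country": "US",
    "style": "Japanese Rice Lager",
    "notes": "No notes at this time.",
    "retired": "f",
}
cleaned_row = clean_beer(sample)
```

A last-word split is crude (it would mislabel a style whose family is not the final word), so a real pipeline would more likely use a curated mapping from style to family.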

nvStrings Walkthrough

The Beers dataset also provides millions of user-submitted reviews. These reviews of different beers provide rich information about people’s preferences. Analyzing these reviews can reveal nuanced patterns to power recommendation systems, marketing materials, and sales forecasts. One basic use case is to compare the most common words used in reviews associated with high ratings and those in reviews associated with low ratings.
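As a rough sketch of that word-frequency comparison, the snippet below tokenizes a handful of reviews and counts words separately for high and low scores. It uses only the Python standard library rather than nvStrings or cuDF, and the reviews, the stopword list, the `top_words` helper, and the score cutoffs (4.0 and 2.0) are all invented for illustration.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list.
STOPWORDS = frozenset({"the", "a", "and", "is", "it", "this", "of", "with"})


def top_words(texts, n=3):
    """Return the n most frequent non-stopword tokens across texts (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())  # lowercase, keep letters only
        counts.update(token for token in tokens if token not in STOPWORDS)
    return counts.most_common(n)


# Made-up (score, text) pairs, not taken from the Kaggle dataset.
reviews = [
    (4.5, "Crisp and hoppy, a great citrus finish."),
    (4.0, "Great hoppy aroma with a smooth finish."),
    (1.5, "Flat and watery. It is bland."),
    (2.0, "Watery, bland, and flat aftertaste."),
]

# Assumed cutoffs for "high" and "low" ratings.
high_texts = [text for score, text in reviews if score >= 4.0]
low_texts = [text for score, text in reviews if score <= 2.0]

high_words = top_words(high_texts)
low_words = top_words(low_texts)
```

On millions of reviews, the lowercasing, tokenizing, and counting steps are the ones worth moving to the GPU, since each review can be processed independently.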

Below, we walk through an example Jupyter Notebook illustrating how nvStrings and cuDF can be used together to perform this kind of analysis and highlight the usage of common data preparation and cleaning functions. To get started working with nvStrings and cuDF, head over to the RAPIDS Getting Started page.

Conclusion

While this is a fairly basic analysis, all of these insights could still serve as launchpads for deeper analysis to understand how consumers respond to different drinks. nvStrings and RAPIDS cuDF are the foundation of GPU-accelerated workflows involving text in structured data, and we’re excited to be releasing this to the open source community.

Try the RAPIDS container, which ships with nvStrings, today (on NVIDIA GPU Cloud or Docker Hub), or install nvStrings from conda. Let us know on GitHub if you run into issues.
