RAPIDS release blog 22.08

Sophie Watson
Published in RAPIDS AI
7 min read · Sep 8, 2022

Improved performance, new features, extended functionality and greater stability

The latest release of RAPIDS is out. RAPIDS 22.08 is packed full of updates that will enable you to do more, more quickly when running your data science workloads on NVIDIA GPUs.

Key updates in RAPIDS 22.08 include:

  • expanded support for reading and writing data in cuDF and libcudf
  • a new cuGraph Dataset API which allows you to create and inspect Graph objects more easily

This RAPIDS release also includes code updates that make every library more stable and more performant across all platforms.

NVIDIA GTC, the Global AI Conference
Before we dive into the new features and functionality, we want to let you know about NVIDIA GTC, which is taking place online from September 19–22, 2022. This free event is a great forum to hear directly from both RAPIDS users and developers. Be sure to check out these sessions to be the first to hear about our plans, and to find out how RAPIDS is being used to drive impact across industries.

Latest blogs
The RAPIDS community has also published a couple of fantastic blogs during this release:

  • Find out how to run large scale graph analytics with Memgraph and NVIDIA in this blog post written by our friends at Memgraph
  • Our cuML team wrote an awesome introduction to Naive Bayes in this blog, showing you how to implement each of the Naive Bayes variants, and explaining when you should use them

New Features

Expanded support for reading and writing data in cuDF and libcudf
Many users have asked for the ability to write strings and lists of int8 values as binary, when working with Parquet. As a result, in this release we made it possible for incoming data that is either a string or a LIST of INT8/UINT8 to be encoded as Parquet binary if the set_output_as_binary flag is set.
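
To see why both input types can share one binary encoding, here is a minimal plain-Python sketch. It does not use cuDF or Parquet, and the helper names are hypothetical. It only shows that a string and a list of single-byte integers reduce to the same kind of payload, a sequence of bytes:

```python
# Illustrative only: plain Python, no cuDF or Parquet involved.
# string_to_binary and int8_list_to_binary are hypothetical helper names.

def string_to_binary(value: str) -> bytes:
    """A string's binary payload is its UTF-8 encoding."""
    return value.encode("utf-8")


def int8_list_to_binary(values: list[int]) -> bytes:
    """Pack signed (INT8) or unsigned (UINT8) values into one byte each."""
    for v in values:
        if not -128 <= v <= 255:
            raise ValueError(f"{v} does not fit in a single byte")
    # Masking maps negative INT8 values to their two's-complement byte.
    return bytes(v & 0xFF for v in values)


as_string = string_to_binary("GPU")
as_list = int8_list_to_binary([71, 80, 85])      # code points of "G", "P", "U"
negative = int8_list_to_binary([-1, -128, 127])  # signed values still fit

print(as_string == as_list)  # both are the same three bytes
```

The sketch says nothing about how libcudf lays the data out internally. It is only meant to make the "string or list of bytes" equivalence concrete.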

For RAPIDS users working in the Hadoop ecosystem, ORC is a commonly used file format. The cuDF ORC writer previously supported writing DataFrames to ORC files with snappy compression. In the 22.08 release, the team has added experimental support for zlib compression when writing to ORC, enabled by passing compression='ZLIB' to the to_orc function:

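import os
# Opt in to all nvCOMP-backed codecs, including experimental ones
os.environ['LIBCUDF_NVCOMP_POLICY'] = 'ALWAYS'
import cudf, cupy
# Build a single-column DataFrame of 100,000 random integers in [0, 100)
df = cudf.DataFrame({'a': cupy.random.randint(0, 100, 100_000)})
# Write the DataFrame to ORC using zlib compression
df.to_orc('test.orc', compression='ZLIB')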

Create Graphs from sample datasets in one line with the new cuGraph Dataset API
The cuGraph 22.08 release introduces the new (still experimental) datasets API. Much like similar APIs in scikit-learn, PyTorch and NetworkX, the cuGraph datasets API allows users to quickly access popular, predefined datasets for use in demos, experiments, tests, examples, or any other application that calls for graph data that isn’t user-specific.

Consider the very popular Zachary’s Karate Club dataset. Prior to the new datasets API, a user who wanted to experiment with cuGraph using this dataset first had to download it to a location accessible to their machine. They then had to read the downloaded file into a cuDF DataFrame manually, assigning names to the individual columns, which also required knowing the CSV file’s delimiter and data types. Only then could they create a cuGraph Graph object using the assigned column names, before finally running an analysis algorithm such as PageRank.

The datasets API greatly simplifies this process. One import statement and one line of code are all that is needed before running the analysis algorithm. The user does not even need to know details such as the delimiter or data types, or manually name and specify columns, in order to achieve this:

Previous

>>> import cugraph
>>> import cudf
>>>
>>> # Assume karate.csv has been downloaded to the path specified
>>> datafile = "/path/to/karate.csv"
>>> gdf = cudf.read_csv(datafile,
...                     delimiter=' ',
...                     names=['src', 'dst', 'weight'],
...                     dtype=['int32', 'int32', 'float32'])
>>> G = cugraph.Graph()
>>> G.from_cudf_edgelist(gdf, source='src', destination='dst')
>>>
>>> # Find the top three "important" vertices
>>> cugraph.pagerank(G).sort_values('pagerank', ascending=False)[0:3]
pagerank vertex
16 0.100917 33
17 0.096999 0
18 0.071692 32

New

>>> import cugraph
>>> from cugraph.experimental import datasets
>>>
>>> G = datasets.karate.get_graph(fetch=True)
>>>
>>> # Find the top three "important" vertices
>>> cugraph.pagerank(G).sort_values('pagerank', ascending=False)[0:3]
pagerank vertex
16 0.100917 33
17 0.096999 0
18 0.071692 32

In addition to the Karate Club dataset, the datasets API includes ‘dolphins’, a small 62-vertex graph representing a community of bottlenose dolphins, and ‘polbooks’, a 105-vertex graph on books about US politics.
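
For readers who want to see what a PageRank score measures, here is a minimal pure-Python sketch of the power-iteration method on a tiny, made-up undirected graph. It is not cuGraph's implementation. The function name, the toy edge list and the 0.85 damping factor are assumptions chosen for illustration:

```python
def pagerank(edges, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on an undirected edge list (illustrative)."""
    # Build an adjacency map; each undirected edge is stored in both directions.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    n = len(neighbors)
    rank = {v: 1.0 / n for v in neighbors}  # start from a uniform distribution
    for _ in range(max_iter):
        new_rank = {}
        for v in neighbors:
            # Each neighbor u shares its score equally among its own neighbors.
            incoming = sum(rank[u] / len(neighbors[u]) for u in neighbors[v])
            new_rank[v] = (1.0 - damping) / n + damping * incoming
        delta = sum(abs(new_rank[v] - rank[v]) for v in neighbors)
        rank = new_rank
        if delta < tol:  # stop once the scores no longer change
            break
    return rank


# Hypothetical toy graph: vertex 0 is a hub linked to 1..4, plus one edge 1-2.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2)]
scores = pagerank(edges)
top = sorted(scores, key=scores.get, reverse=True)
print(top[:3])
```

The scores sum to one, and the best-connected vertex ranks first. The cuGraph examples above read the same kind of ordering from the sorted 'pagerank' column.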

The datasets API is also useful for users who would like to make their own datasets easy for others to use. This is done by providing a YAML file with the required metadata. See the .yaml files in the datasets/metadata directory in the cugraph repo for examples, and create a Dataset instance as shown in the datasets package.

All of the cuGraph example notebooks have been updated to use this sleek new API, so be sure to check them out to see examples of how to speed up your graph processing in Python.

Updates to RAFT: Reusable Accelerated Functions and Tools
The RAFT library contains widely used algorithms and primitives for data science and machine learning. Historically, RAFT has provided behind-the-scenes support for cuML and cuGraph, but with this 22.08 release the RAFT project has begun to ship its own independent, end-user-focused deliverables. The first is a C++ implementation of our new IVF-FLAT Approximate Nearest Neighbors (ANN) algorithm. This is an important new feature because it is faster than FAISS, which is effectively the standard library for high-performance ANN on GPUs. Stay tuned, because the team is in the process of integrating it into FAISS!
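
For readers new to the technique, the idea behind an IVF-FLAT index (an inverted file over uncompressed vectors) can be sketched in a few lines of pure Python. This is a CPU toy, not RAFT's C++ implementation. Every function name, the crude k-means step and the parameter values are assumptions made for illustration:

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))


def build_ivf_flat(vectors, n_lists, n_iters=10, seed=0):
    """Cluster vectors with a few k-means steps, then bucket ids by cluster."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, n_lists)]
    for _ in range(n_iters):
        members = [[] for _ in range(n_lists)]
        for v in vectors:
            members[nearest(v, centers)].append(v)
        for i, group in enumerate(members):
            if group:  # keep the old center if a cluster ends up empty
                centers[i] = [sum(col) / len(group) for col in zip(*group)]
    # The "inverted file": one list of vector ids per cluster center.
    lists = [[] for _ in range(n_lists)]
    for idx, v in enumerate(vectors):
        lists[nearest(v, centers)].append(idx)
    return centers, lists


def search(query, vectors, index, k, n_probes):
    """Scan only the n_probes lists whose centers are closest to the query."""
    centers, lists = index
    order = sorted(range(len(centers)), key=lambda i: sq_dist(query, centers[i]))
    candidates = [idx for i in order[:n_probes] for idx in lists[i]]
    return sorted(candidates, key=lambda idx: sq_dist(query, vectors[idx]))[:k]


rng = random.Random(42)
vectors = [[rng.random() for _ in range(8)] for _ in range(500)]
query = [rng.random() for _ in range(8)]
index = build_ivf_flat(vectors, n_lists=16)

exact = sorted(range(len(vectors)), key=lambda i: sq_dist(query, vectors[i]))[:5]
approx = search(query, vectors, index, k=5, n_probes=4)       # scans 4 of 16 lists
exhaustive = search(query, vectors, index, k=5, n_probes=16)  # scans every list
print(len(set(approx) & set(exact)), "of 5 exact neighbors found with 4 probes")
```

The n_probes parameter trades speed for accuracy. Probing fewer lists means fewer distance computations but may miss true neighbors, while probing every list reduces to an exact brute-force search.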

Speed up your 2D and 3D Euclidean distance transforms in cuCIM
cuCIM 22.08 introduces GPU-accelerated Euclidean distance transforms for 2D images and 3D volumes. For any image with a set of seed points, you can now use distance_transform_edt to compute the Euclidean distance from each background point to the nearest seed point. You can also get back coordinate images that give the x, y (and z in 3D) coordinates of the nearest seed point for each point in the image:

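import math
import cupy as cp
from cucim.core.operations.morphology import distance_transform_edt
shape = (200, 200)
size = math.prod(shape)
# mark roughly 0.1% of the pixels as seed points
ntrue = .001 * size
p_true = ntrue / size
p_false = 1 - p_true
# generate a sparse set of 2D seed points (value 1) on a background of 0
cp.random.seed(123)
image = cp.random.choice([0, 1], size=shape, p=(p_false, p_true))
# the transform measures distance from nonzero input points to the nearest
# zero, so pass the background mask to get distances to the nearest seed
distances, coords = distance_transform_edt(
    image == 0, return_distances=True, return_indices=True)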

The colored Voronoi diagram in Figure 1 was generated by post processing the output to illustrate the “regions” closest to each individual seed point.

Figure 1. Coordinate images based on post processing regions closest to the seeds shown in the first pane.
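
To make the two outputs concrete, here is a brute-force CPU sketch of a Euclidean distance transform on a tiny grid, written in pure Python. It is not how cuCIM computes the transform. The function name and the (row, column) coordinate convention are assumptions for illustration:

```python
import math


def brute_force_edt(seeds, shape):
    """Return (distances, nearest) grids for a 2D image of the given shape.

    distances[r][c] is the Euclidean distance from pixel (r, c) to its
    closest seed, and nearest[r][c] is that seed's (row, col) coordinate.
    """
    rows, cols = shape
    distances = [[0.0] * cols for _ in range(rows)]
    nearest = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Compare against every seed: simple, but O(pixels * seeds).
            best = min(seeds, key=lambda s: (s[0] - r) ** 2 + (s[1] - c) ** 2)
            distances[r][c] = math.hypot(best[0] - r, best[1] - c)
            nearest[r][c] = best
    return distances, nearest


seeds = [(0, 0), (3, 4)]  # hypothetical seed pixels
distances, nearest = brute_force_edt(seeds, shape=(4, 5))

for row in nearest:
    # Label each pixel by the index of its closest seed.
    print(" ".join(str(seeds.index(s)) for s in row))
```

Grouping pixels by their nearest seed, as the printed labels do, gives the Voronoi-style regions described above. The brute-force cost grows with the number of pixels times the number of seeds, so it is practical only for small inputs.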

Performance improvements across the board

This release includes a set of new updates to speed up your code.

cuDF now allows specification of default bit width
Previously, cuDF used a conservative, fixed default bit width of 64. With the 22.08 release, users can control the default bit width for integer and floating-point types. Each option accepts one of three values: None, 32 or 64 (when set to None, the result dtype is aligned with what pandas constructs).

This default is used whenever type inference is needed, including in the CSV and JSON readers when dtypes are not specified, in cuDF constructors, and when materializing a range index.

These updates create a user interface that is now consistent with pandas.

In the example below, read_csv would typically use int64 and float64 as the default dtypes for the data. However, by setting default_integer_bitwidth and default_float_bitwidth to 32, the values are read as int32 and float32 instead:

>>> import cudf
>>> cudf.set_option("default_integer_bitwidth", 32)
>>> cudf.set_option("default_float_bitwidth", 32)
>>> df = cudf.read_csv("test.csv")
>>> df
a b
0 1 2.0
1 2 3.0
2 3 4.0
>>> df.dtypes
a int32
b float32
dtype: object
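
One practical consequence of a narrower default is memory footprint. The following standard-library sketch involves no cuDF, and the helper names and column length are assumptions. It compares the storage needed for fixed-width 32-bit and 64-bit values, and shows the trade-off that comes with the smaller integer type:

```python
import struct

N_VALUES = 1_000_000  # hypothetical column length


def column_bytes(fmt: str, n: int = N_VALUES) -> int:
    """Bytes needed to store n fixed-width values of a struct format."""
    return struct.calcsize(fmt) * n


# "<" selects standard sizes: i = int32, q = int64, f = float32, d = float64.
sizes = {
    "int32": column_bytes("<i"),
    "int64": column_bytes("<q"),
    "float32": column_bytes("<f"),
    "float64": column_bytes("<d"),
}

INT32_MAX = 2**31 - 1


def fits_int32(value: int) -> bool:
    """True if value can be stored in a signed 32-bit integer."""
    try:
        struct.pack("<i", value)
        return True
    except struct.error:
        return False


for name, nbytes in sizes.items():
    print(f"{name}: {nbytes / 1e6:.0f} MB per million values")
print(fits_int32(INT32_MAX), fits_int32(INT32_MAX + 1))
```

Halving the width halves the bytes per column, but values beyond the 32-bit range no longer fit, so the 32-bit defaults suit data whose range is known to be small.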

cuGraph leveraging pylibcugraph for better performance
cuGraph release 22.08 also improves performance in the higher-level cuGraph Python library. Updates to Uniform Neighbor Sampling, BFS, Core Number and Pagerank enable these algorithms to leverage the pylibcugraph API to reuse internal graph representations between calls when possible, rather than reconstruct them each time. As well as improving performance, these changes also remove large portions of legacy code to decrease both maintenance and build time.
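
The general pattern of building an internal representation once and reusing it across algorithm calls can be sketched in pure Python. CachedGraph and the two helper algorithms below are hypothetical and are not the cuGraph or pylibcugraph implementation. They only show where the saving comes from:

```python
from collections import deque


class CachedGraph:
    """Hypothetical wrapper that builds its adjacency structure only once."""

    def __init__(self, edges):
        self._edges = list(edges)
        self._adjacency = None
        self.build_count = 0  # how many times the structure was constructed

    def adjacency(self):
        # Construct on first use, then hand back the cached structure.
        if self._adjacency is None:
            self.build_count += 1
            adj = {}
            for u, v in self._edges:
                adj.setdefault(u, []).append(v)
                adj.setdefault(v, []).append(u)
            self._adjacency = adj
        return self._adjacency


def bfs_distances(graph, start):
    """Hop count from start to every reachable vertex."""
    adj = graph.adjacency()
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def degrees(graph):
    """Number of neighbors of each vertex."""
    return {v: len(nbrs) for v, nbrs in graph.adjacency().items()}


g = CachedGraph([(0, 1), (1, 2), (2, 3), (0, 4)])
hops = bfs_distances(g, 0)  # first call builds the adjacency structure
deg = degrees(g)            # second call reuses it
print(g.build_count)
```

Two algorithm calls trigger a single build. For large graphs, the construction step can take a meaningful share of the total runtime, which is why skipping the rebuild helps.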

cuSpatial accelerated GeoPandas DataFrames
If you’re using GeoPandas DataFrames, you should see much faster results, particularly when loading data. This is due to a significant refactoring and improved support for the GeoArrow DataFrame specification.

Conclusion

The RAPIDS 22.08 release gives users improved performance, new features, extended functionality and greater stability. To learn more about how you can benefit, don’t miss the opportunity to sign up for the RAPIDS talks at GTC. Whether you’re just starting out with RAPIDS, or you’re a seasoned community member, you’re not going to want to miss John Zedlewski’s Deep Dive into RAPIDS for accelerated Data Science and Data Engineering.

As always, we’re looking forward to hearing how you are using the new capabilities in RAPIDS. Reach us on GitHub, follow us on Twitter, and check out our documentation and getting started resources.

Need enterprise support? NVIDIA global support is available for RAPIDS with the NVIDIA AI Enterprise software suite.
